Ph.D. defense, December 21st, 2015
Contributions of hybrid architectures to depth imaging: a CPU, APU and GPU comparative study
Issam SAID
Energy supply and demand
• 40% more energy is needed by 2035
• No choice but Oil, Gas and Coal
• Sophisticated seismic methods
I. Said Ph.D. defense 12/21/2015 1/50
Seismic methods for Oil & Gas exploration
Acquisition Processing Interpretation
Shot = source activation + data collection (receivers)
Seismic survey
• Air-gun array
• Hydrophones
Shot record
I. Said Ph.D. defense 12/21/2015 2/50
Seismic methods for Oil & Gas exploration
Acquisition Processing
Noise attenuation → Demultiple → Interpolation → Imaging
Interpretation
⇒ Subsurface image
I. Said Ph.D. defense 12/21/2015 2/50
Seismic methods for Oil & Gas exploration
Acquisition Processing Interpretation
Calculate seismic attributes
• Dip
• Azimuth
• Coherence (courtesy of Total)
I. Said Ph.D. defense 12/21/2015 2/50
Reverse Time Migration (RTM)
• The reference computer-based imaging algorithm in the industry
• Repositions seismic events into their true location in the subsurface
• Sub-salt and steep-dip imaging
• Accurate (full two-way wave equation)
• Requires massive resources (compute and storage)
I. Said Ph.D. defense 12/21/2015 3/50
RTM workflow
Forward modeling (FWD) Backward modeling (BWD) Imaging condition
I. Said Ph.D. defense 12/21/2015 4/50
The underlying theory of the RTM algorithm
The RTM operator
$$\mathrm{Img}(x) = \int_0^{H}\!\int_0^{T} S_h(x,t)\, R_h(x,\,T-t)\;\mathrm{d}t\,\mathrm{d}h$$
The Cauchy problem
$$\frac{1}{c^2}\,\frac{\partial^2 u(x,t)}{\partial t^2} - \Delta u(x,t) = s(t) \;\;\text{in } \Omega, \qquad u(x,0) = 0, \qquad \frac{\partial u(x,0)}{\partial t} = 0$$
Boundary condition
$$u = 0 \;\;\text{on } \partial\Omega$$
I. Said Ph.D. defense 12/21/2015 5/50
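The zero-lag cross-correlation in the RTM operator boils down to a time-summed pointwise product of the source and receiver wavefields. A minimal C sketch, assuming both wavefields are available for every time step (array layout and names are illustrative, not the thesis code):

```c
#include <stddef.h>

/* Illustrative RTM imaging condition for one shot:
 * img(x) += S(x, t) * R(x, T - t), accumulated over all time steps.
 * nxyz is the number of grid points, nt the number of time steps. */
void imaging_condition(float *img, const float *S, const float *R,
                       size_t nxyz, int nt)
{
    for (int t = 0; t < nt; t++) {
        const float *src = S + (size_t)t * nxyz;            /* S(., t)     */
        const float *rcv = R + (size_t)(nt - 1 - t) * nxyz; /* R(., T - t) */
        for (size_t i = 0; i < nxyz; i++)
            img[i] += src[i] * rcv[i];
    }
}
```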
Finite Difference Time Domain for RTM
• Finite Difference Time Domain (8th order in space, 2nd order in time)
• Regular grids
• Perfectly Matched Layers (PML) as an absorbing boundary condition
$$U^{n+1}_{i,j,k} = 2\,U^{n}_{i,j,k} - U^{n-1}_{i,j,k} + c^2_{i,j,k}\,\Delta t^2\,\Delta U^{n}_{i,j,k} + c^2_{i,j,k}\,\Delta t^2\, s^{n}$$
• Heavy computation (hours to days of processing time)
• Terabytes of temporary data
• Requires High Performance Computing
I. Said Ph.D. defense 12/21/2015 6/50
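As an illustration of the update formula above, a minimal C sketch of the 2nd-order-in-time, 8th-order-in-space scheme on a regular grid, without PML and with a single grid spacing; array names and the omission of the point-source injection are assumptions made for brevity:

```c
#include <stddef.h>

/* One time step of U^{n+1} = 2U^n - U^{n-1} + c^2 dt^2 (Lap(U^n) + s^n),
 * with an 8th-order centered stencil (p = 8) and dx = dy = dz for brevity.
 * Halos of width 4 are assumed filled; a[0..4] are the centered
 * finite-difference coefficients (a[0] for the center point); the point
 * source s^n is omitted here. */
void fdtd_step(float *un1, const float *un, const float *unm1,
               const float *c2, const float a[5], float dt2, float inv_dx2,
               int nx, int ny, int nz)
{
    #define IDX(i, j, k) ((size_t)(i) + (size_t)nx * ((size_t)(j) + (size_t)ny * (size_t)(k)))
    for (int k = 4; k < nz - 4; k++)
        for (int j = 4; j < ny - 4; j++)
            for (int i = 4; i < nx - 4; i++) {
                float lap = 3.0f * a[0] * un[IDX(i, j, k)];  /* center, once per axis */
                for (int l = 1; l <= 4; l++)
                    lap += a[l] * (un[IDX(i + l, j, k)] + un[IDX(i - l, j, k)]
                                 + un[IDX(i, j + l, k)] + un[IDX(i, j - l, k)]
                                 + un[IDX(i, j, k + l)] + un[IDX(i, j, k - l)]);
                size_t x = IDX(i, j, k);
                un1[x] = 2.0f * un[x] - unm1[x] + c2[x] * dt2 * (lap * inv_dx2);
            }
    #undef IDX
}
```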
HPC solutions for RTM
CPU clusters are the reference
• Process large data sets across interconnected multi-core CPUs
• Advanced optimization techniques (vectorization, cache blocking)
Hardware accelerators and co-processors
• RTM is massively parallel
• GPU, FPGA, Intel Xeon Phi
• Dominance of GPUs:
• Huge compute power (up to 5 TFlop/s)
• High memory bandwidth (up to 300 GB/s)
• Possible PCI overheads (sustained bandwidth up to 12 GB/s)
• Data snapshotting
• MPI communications with neighbors (multi-GPU)
• Limited memory capacities
• A high-end GPU has only 12 GB at most
• CPU based compute nodes have 128 GB
• High power consumption: about 400 W for CPU+GPU (the GPU is not standalone)
I. Said Ph.D. defense 12/21/2015 7/50
GPU based solutions for RTM
• Possible software techniques to overcome RTM limits on GPUs:
• Temporal blocking (PCI overhead)
• Overlapping CPU-GPU transfers with computations (PCI overhead)
• Out-of-core algorithms (memory limitation)
• Extensive efforts and investments
• Hardware solution with an acceptable performance/efforts trade-off?
I. Said Ph.D. defense 12/21/2015 8/50
Towards unifying CPUs and GPUs
[Block diagrams: a multi-core CPU connected to a discrete GPU (compute units with local memories, register files and L1/L2 caches) through the PCI Express bus, versus the Accelerated Processing Unit (APU), which integrates a quad-core CPU module and a GPU module on the same die and accesses the system memory through the GARLIC and ONION buses]
I. Said Ph.D. defense 12/21/2015 9/50
Towards unifying CPUs and GPUs
Strengths
• No PCI Express bus
• Integrated GPUs can address the entire memory
• Low-power processors (95 W TDP at most), versus:
• CPUs: up to 150 W TDP
• GPUs: up to 300 W
Weaknesses
• Low compute power as compared to GPUs:
• Kaveri APU: 730 GFlop/s (integrated GPU)
• Phenom CPU: 150 GFlop/s
• Tahiti GPU: 3700 GFlop/s
• An order of magnitude less memory bandwidth than GPUs:
• APU: up to 25 GB/s
• GPU: up to 300 GB/s
I. Said Ph.D. defense 12/21/2015 9/50
Overview of contributions
[Contribution map: architectures (CPU, GPU, APU), data placement strategies, applications (matrix multiplication, finite difference stencils, modeling, RTM, hybrid strategy), programming models (OpenCL, OpenACC), and evaluation (performance and power efficiency, one node and large scale, strong and weak scaling, successive generations)]
Is the APU a valuable HPC solution for depth imaging that:
• may be more efficient than CPU solutions?
• is likely to overcome the limitations of GPU based solutions?
I. Said Ph.D. defense 12/21/2015 10/50
Evaluation of the APU technology
[Contribution map repeated as a roadmap for this part]
I. Said Ph.D. defense 12/21/2015 11/50
The APU memory subsystem
• Onion: coherent bus (slow)
• Garlic: non coherent bus (full memory bandwidth)
I. Said Ph.D. defense 12/21/2015 12/50
The APU memory subsystem
• c: regular CPU memory (size depends on the RAM)
I. Said Ph.D. defense 12/21/2015 12/50
The APU memory subsystem
• g: fixed size (512 MB to 4 GB)
• cg: explicit copy from CPU memory to GPU memory
• gc: explicit copy from GPU memory to CPU memory
I. Said Ph.D. defense 12/21/2015 12/50
The APU memory subsystem
• u: zero-copy and non coherent (read-only accesses from GPU cores)
• Fixed and limited size (up to 1 GB)
I. Said Ph.D. defense 12/21/2015 12/50
The APU memory subsystem
• z: zero-copy and coherent memory
• Variable size (up to the maximum CPU memory size)
I. Said Ph.D. defense 12/21/2015 12/50
Data placement strategies on APU
• OpenCL data copy kernel
• From buffer A to buffer B
• Store buffers A and B in different memory locations
• Evaluate different combinations, for example cggc (explicit copy) and zz (zero-copy)
I. Said Ph.D. defense 12/21/2015 13/50
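A hedged host-side sketch of how the two extreme strategies could be expressed with the standard OpenCL API; the thesis relies on the AMD OpenCL runtime, which also exposes vendor-specific flags, so the exact flags used there may differ:

```c
#include <CL/cl.h>

/* cggc (explicit copy): a buffer resident in the GPU partition, fed with an
 * explicit c -> g copy and drained with a g -> c copy after the kernels. */
cl_mem make_explicit_buffer(cl_context ctx, cl_command_queue q,
                            const float *host_src, size_t bytes, cl_int *err)
{
    cl_mem dev = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, err);
    clEnqueueWriteBuffer(q, dev, CL_TRUE, 0, bytes, host_src, 0, NULL, NULL);
    return dev;  /* later: clEnqueueReadBuffer(...) performs the g -> c copy */
}

/* zz (zero-copy): a host-allocated buffer that the integrated GPU reads and
 * writes in place, with no explicit transfer. */
cl_mem make_zero_copy_buffer(cl_context ctx, size_t bytes, cl_int *err)
{
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                          bytes, NULL, err);
}
```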
Data placement benchmark results on APU
[Bar chart: copy time in ms (lower is better) for the cggc, zgc, ugc, zz and uz strategies, broken down into kernel (copy) time, GPU-to-CPU and CPU-to-GPU transfers; buffer size: 128 MB]
• Using zero-copy ⇒ about 60% of the maximum sustained bandwidth
• The most relevant strategies are selected: cggc, ugc and zz
I. Said Ph.D. defense 12/21/2015 14/50
Applicative benchmarks on APU
Matrix multiplication
• Compute bound algorithm
• Evaluate the sustained compute gap between GPUs and APUs
8th-order 3D finite difference stencil
• Memory bound algorithm
• Building block of the Reverse Time Migration
• Evaluate the APU memory performance
Impact of data placement strategies on the APU performance
I. Said Ph.D. defense 12/21/2015 15/50
Finite difference stencils
$$\Delta U^{n}_{i,j,k} = \frac{1}{\Delta x^2}\sum_{l=-p/2}^{p/2} a_l\, U^{n}_{i+l,j,k} + \frac{1}{\Delta y^2}\sum_{l=-p/2}^{p/2} a_l\, U^{n}_{i,j+l,k} + \frac{1}{\Delta z^2}\sum_{l=-p/2}^{p/2} a_l\, U^{n}_{i,j,k+l}, \qquad p = 8$$
• Compute complexity: O(N^3)
• Storage complexity: O(N^3)
• Data snapshotting (K ∈ [1, 10])
I. Said Ph.D. defense 12/21/2015 16/50
Stencils: implementation details
• 2D work-item grid on the 3D domain
• One column along the Z axis per work-item
• Register blocking when traversing the
Z dimension
• Implementations:
• scalar: global memory
• local scalar: local memory to exploit
memory access redundancies
• vectorized: global memory +
explicit vectorization
• local vectorized: local memory +
explicit vectorization
I. Said Ph.D. defense 12/21/2015 17/50
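A simplified OpenCL kernel sketch of this organization (a 2D work-item grid, each work-item sweeping its Z column and keeping the Z window in registers); the grid spacing is assumed cubic and the array names are illustrative, so this is not the tuned thesis kernel:

```c
// Scalar variant: global memory only, one (x, y) column per work-item,
// with the 9-point Z window kept in registers (register blocking).
// Illustrative assumptions: halo width 4, dx = dy = dz, a[0..4] coefficients.
__kernel void stencil3d_scalar(__global const float *u, __global float *v,
                               __constant float *a, const float inv_h2,
                               const int nx, const int ny, const int nz)
{
    const int i = get_global_id(0) + 4;
    const int j = get_global_id(1) + 4;
    if (i >= nx - 4 || j >= ny - 4) return;

    float w[9];                       // sliding Z window: planes k-4 .. k+4
    for (int l = 0; l < 9; l++)
        w[l] = u[i + nx * (j + ny * l)];

    for (int k = 4; k < nz - 4; k++) {
        float r = 3.0f * a[0] * w[4];
        for (int l = 1; l <= 4; l++) {
            r += a[l] * (w[4 + l] + w[4 - l]);                     // Z (registers)
            r += a[l] * (u[(i + l) + nx * (j + ny * k)]
                       + u[(i - l) + nx * (j + ny * k)]);          // X
            r += a[l] * (u[i + nx * ((j + l) + ny * k)]
                       + u[i + nx * ((j - l) + ny * k)]);          // Y
        }
        v[i + nx * (j + ny * k)] = r * inv_h2;

        for (int l = 0; l < 8; l++) w[l] = w[l + 1];               // slide window
        if (k + 5 < nz) w[8] = u[i + nx * (j + ny * (k + 5))];
    }
}
```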
Stencil computations on CPU
[Plot: GFlop/s (higher is better) for N×N×32 grids, N = 64 to 1024, comparing the scalar, vectorized, local vectorized and OpenMP implementations]
• Explicit vectorization helped to deliver the best performance (SSE)
• OpenCL ≥ OpenMP
I. Said Ph.D. defense 12/21/2015 18/50
Stencil computations on GPU
[Plot: GFlop/s (higher is better) for N×N×32 grids, N = 64 to 1024, comparing the scalar, local scalar, vectorized and local vectorized implementations]
• Scalar ≥ vectorized thanks to GCN (Graphics Core Next)
• Scalar code + OpenCL local memory offered the best performance
I. Said Ph.D. defense 12/21/2015 19/50
Stencil computations on APU
[Plot: GFlop/s (higher is better) for N×N×32 grids, N = 64 to 1024, comparing the scalar, local scalar, vectorized and local vectorized implementations]
• Local scalar gives the best performance numbers for N ≥ 128
• Vectorization is not needed thanks to GCN
I. Said Ph.D. defense 12/21/2015 20/50
Stencils: data placement strategies
• Fixed problem size (1024 × 1024 × 32)
• One snapshot every K computations (1 ≤ K ≤ 10)
• Select the best OpenCL implementations (scalar, local scalar)
• Combine them with data placement strategies: cggc, ugc, zz
[Plot: GFlop/s (higher is better) vs. K (computations per snapshot) for the 1024×1024×32 problem (128 MB), for scalar and local scalar combined with the cggc, ugc and zz strategies, plus the best configuration]
• Best: local scalar (zz) for 1 ≤ K ≤ 3 and (cggc) for 3 ≤ K ≤ 10
• Select explicit copy (cggc) and zero-copy (zz) for RTM
I. Said Ph.D. defense 12/21/2015 21/50
Stencils: performance comparison
[Plot: GFlop/s (higher is better) vs. K for the 1024×1024×32 problem (128 MB): CPU, GPU, APU, and a performance projection for an APU whose Onion bus is as fast as Garlic]
• APU > CPU ∀K
• GPU > APU, 2 ≤ K ≤ 10
• APU > GPU when performing one snapshot after each iteration
I. Said Ph.D. defense 12/21/2015 22/50
Stencils: conclusion
• APU can be an attractive solution:
• For a high rate of data snapshotting (finite difference)
• For medium sized problems (matrix multiplication)
• An order of magnitude of theoretical performance gap between GPU and APU:
• But only 3× to 4× in practice
• Considering performance only, the GPU remains the preferred solution
• Power is gaining interest in the HPC community (Green500)
• Power wall and Exascale
• What about power consumption?
I. Said Ph.D. defense 12/21/2015 23/50
Power measurement methodology
• Raritan PX (DPXR8A-16) PDU to monitor the power consumption
• Performance per Watt (PPW) metric
Methodology
• The power drawn by the system as a whole:
• Same functional hardware components for the 3 architectures
• CPU+GPU for GPU based solutions
• Importance of the electrical efficiency of the Power Supply Units (PSUs)
I. Said Ph.D. defense 12/21/2015 24/50
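Written out, the metric used in the power comparisons is simply:

```latex
% Performance per Watt (PPW), the whole system being measured at the PDU
\mathrm{PPW} = \frac{\text{sustained performance}\ [\mathrm{GFlop/s}]}
                    {\text{average power drawn by the whole system}\ [\mathrm{W}]}
\qquad [\mathrm{GFlop/s/W}]
```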
Stencils: power efficiency comparison
[Plot: GFlop/s/W (higher is better) vs. K for the 1024×1024×32 problem (128 MB) for CPU, GPU and APU; annotated power draws: up to 62 W, up to 222 W and up to 159 W]
• CPU offers a very low power efficiency (0.08 GFlop/s/W)
• The APU is 13% more power efficient than the GPU
• Higher gain for compute bound algorithm (matrix multiplication):
• Flops consume less power than memory accesses
I. Said Ph.D. defense 12/21/2015 25/50
RTM on one HPC node
[Contribution map repeated as a roadmap for this part]
I. Said Ph.D. defense 12/21/2015 26/50
One-node RTM GPU/APU implementations
• Multiple OpenCL kernels (PML):
• Reduce compute/memory divergence
• Stencils study conclusions:
• Stencil optimizations and auto-tuning
• scalar and local scalar
• Data placement strategies (APU)
• Imaging condition on CPU
• Evaluate:
• Kernels (kernels only)
• Full application (overall)
[Diagram: 3D physical domain (X, Y, Z) with a free surface on top and absorbing boundaries on the other faces]
Case study
• 3D SEG/EAGE Salt velocity model
• Compute grid that fits in one GPU compute node (less than 3 GB)
• Selective checkpointing frequency K=10
I. Said Ph.D. defense 12/21/2015 27/50
One-node RTM on GPU/APU
kernels only (GFlop/s) overall (GFlop/s) %loss
GPU 141.77 29.65 79%
APU(explicit copy) 32.42 15.93 50%
APU(zero-copy) 15.2 11.45 24%
GPU
• Best: scalar implementation
• Impact of PCI+IO (snapshotting) on performance
APU
• Best: scalar, using explicit data copies (cggc)
• Local memory is beneficial when using zero-copy memory objects
I. Said Ph.D. defense 12/21/2015 28/50
One-node RTM: performance comparison
[Bar chart: one-node RTM performance in GFlop/s (higher is better) per architecture (CPU, APU zero-copy, APU explicit copy, GPU), for kernels only and overall]
• Poor performance on the Phenom CPU (OpenCL)
• Gap between GPU and APU:
• 4.4× with kernels only
• Only 1.8× when considering the overall run time
I. Said Ph.D. defense 12/21/2015 29/50
One-node RTM: power efficiency comparison
[Bar chart: one-node power efficiency in GFlop/s/W (higher is better): CPU (137 W), APU zero-copy (62 W), APU explicit copy (62 W), GPU (198 W)]
• Performance numbers based on overall timings
• Poor power efficiency on the Phenom CPU (0.013 GFlop/s/W)
• APU can be more power efficient than the GPU:
• 1.80× (explicit copy)
• 1.23× (zero-copy)
I. Said Ph.D. defense 12/21/2015 30/50
One-node RTM: conclusion
• RTM(kernels only): huge gap between APU and GPU
• RTM(overall): the performance gap is reduced
• Performance + power: the APU is almost twice as efficient as the GPU
I. Said Ph.D. defense 12/21/2015 31/50
RTM on multi-node hybrid architectures
[Contribution map repeated as a roadmap for this part]
I. Said Ph.D. defense 12/21/2015 32/50
RTM on multi-node hybrid architectures
Motivations
• Real-world cases generate large amounts of data (on the order of 1 terabyte)
• Larger than one node memory capacities
• Impact of MPI communications on the PCI overhead (GPU)?
• Impact of zero-copy on MPI communications (APU)?
Clusters (located at Total), 16 nodes used per cluster:
• CPU cluster: 2× Intel Xeon E5-2670 per node
• APU cluster: 1× AMD A10-7850K (Kaveri) per node
• GPU cluster: 1× NVIDIA Tesla K40s + 1× Intel Xeon E5-2680 per node
Case study
• Same velocity model, K=10
• Compute grid size: about 25 GB
I. Said Ph.D. defense 12/21/2015 33/50
Multi-node RTM: implementation
• 3D domain decomposition
• One-node study conclusions
• Boundaries copied to contiguous
buffers:
• For GPUs, using OpenCL kernels
• For GPUs, PCI memory transfers:
• Communications with neighbors
• I/O operations for snapshotting
[Diagram: 3D domain decomposition (X, Y, Z) with the six neighbor faces: North, South, East, West, Front, Back]
I. Said Ph.D. defense 12/21/2015 34/50
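A minimal C/MPI sketch of how such a 3D decomposition and its six face neighbors can be set up; the thesis may compute the decomposition differently, this version simply uses the standard MPI Cartesian-topology routines:

```c
#include <mpi.h>

/* Build a 3D Cartesian communicator and find the six face neighbors
 * (West/East along X, South/North along Y, Back/Front along Z). */
void setup_decomposition(MPI_Comm comm, MPI_Comm *cart,
                         int neighbors[6] /* W, E, S, N, B, F */)
{
    int nprocs, dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
    MPI_Comm_size(comm, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);                  /* balanced 3D grid   */
    MPI_Cart_create(comm, 3, dims, periods, 1, cart);  /* reordering allowed */
    MPI_Cart_shift(*cart, 0, 1, &neighbors[0], &neighbors[1]);  /* X: W, E */
    MPI_Cart_shift(*cart, 1, 1, &neighbors[2], &neighbors[3]);  /* Y: S, N */
    MPI_Cart_shift(*cart, 2, 1, &neighbors[4], &neighbors[5]);  /* Z: B, F */
}
```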
Multi-node RTM: MPI overlapping
Problem: ineffective non-blocking communications (initial)
[Timeline: process P0 posts Isend(buf), does work that does not touch buf, then calls Wait() before using buf, while P1 posts Recv(buf); with no progress thread, the communication makes no progress during the work]
Solution: explicit overlap technique (overlap)
[Timeline: an auxiliary thread is activated to carry out the blocking MPI communications while the user thread updates the inner domain; after synchronization, the domain boundaries are updated]
I. Said Ph.D. defense 12/21/2015 35/50
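A hedged sketch of the explicit overlap technique with a POSIX auxiliary thread; it assumes MPI is initialized with MPI_THREAD_MULTIPLE so that the auxiliary thread may issue the blocking calls, and the halo data structure is illustrative:

```c
#include <mpi.h>
#include <pthread.h>

/* Illustrative halo-exchange arguments (not the thesis data structures). */
typedef struct {
    float *send, *recv;   /* packed boundary buffers     */
    int    count, peer;   /* halo size and neighbor rank */
} halo_t;

static void *exchange_halos(void *arg)          /* auxiliary thread */
{
    halo_t *h = (halo_t *)arg;
    /* Blocking exchange with the neighbor: progress is guaranteed because
     * this thread does nothing but communicate. */
    MPI_Sendrecv(h->send, h->count, MPI_FLOAT, h->peer, 0,
                 h->recv, h->count, MPI_FLOAT, h->peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return NULL;
}

void overlapped_step(halo_t *h, void (*update_inner)(void),
                     void (*update_boundaries)(void))
{
    pthread_t aux;
    pthread_create(&aux, NULL, exchange_halos, h);  /* activate aux thread  */
    update_inner();                                 /* user thread computes */
    pthread_join(aux, NULL);                        /* sync                 */
    update_boundaries();                            /* halos are now valid  */
}
```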
Multi-CPU RTM(FWD)
[Bar chart: time in s (lower is better) for the initial and overlap versions on 1 to 64 nodes, broken down into comm, out, in and max[in,comm], with a perfect-scaling reference; the MPI fraction grows from about 1% on 1 node to 76% on 64 nodes, and the overlap gain ranges from about -27% to +49%; an inset zooms in on 16 to 64 nodes]
• 1 CPU node = 2 sockets (8 cores 2-way SMT each)
• 16 MPI processes per node (32 threads with SMT)
• The overlap technique is beneficial when MPI fractions are high
I. Said Ph.D. defense 12/21/2015 36/50
Multi-GPU RTM(FWD)
[Bar chart: time in s (lower is better) for the initial and overlap versions on 8 and 16 nodes, broken down into comm, d-h-comm, unpack, pack, in+out, out and max[in,comm], with a perfect-scaling reference; MPI fractions: 15.76% (8 nodes) and 26.49% (16 nodes); performance gains: 12.21% and 14.24%]
• Only 2 test cases due to memory limitations
• Up to 14% of performance gain (CPU dedicated to communications)
• PCI overheads hinder achieving near to perfect scaling
I. Said Ph.D. defense 12/21/2015 37/50
Multi-APU(explicit copy) RTM(FWD)
[Bar chart: time in s (lower is better) for the initial and overlap versions on 8 and 16 nodes, with the same breakdown as above; MPI fractions: 14.76% and 19.03%; performance gains: 13.99% and 18.89%]
• Up to 18% of performance gain (CPU dedicated to communications)
• Lower overhead to copy the boundaries ⇒ near-perfect scaling
I. Said Ph.D. defense 12/21/2015 38/50
Multi-APU(zero-copy) RTM(FWD)
[Bar chart: time in s (lower is better) for the initial and overlap versions on 1 to 16 nodes, with the same breakdown; the MPI fraction grows from 0.10% (1 node) to 17.80% (16 nodes), and performance gains range from 0.34% to 14.68%; an inset zooms in on 8 and 16 nodes]
• Up to 14% of performance gain
• Zero-copy ⇒ no CPU-GPU copy overhead + near-perfect scaling
I. Said Ph.D. defense 12/21/2015 39/50
Multi-node RTM: asynchronous I/O
Synchronous data snapshotting (sync)
Proposed solution (async)
I. Said Ph.D. defense 12/21/2015 40/50
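A minimal sketch of the proposed asynchronous snapshotting with an auxiliary I/O thread; the staging copy and file layout are assumptions made for illustration, not the thesis implementation:

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const float *staging;   /* copy of the wavefield to be written */
    size_t       bytes;
    const char  *path;
} snapshot_t;

static void *write_snapshot(void *arg)          /* auxiliary I/O thread */
{
    snapshot_t *s = (snapshot_t *)arg;
    FILE *f = fopen(s->path, "wb");
    if (f) { fwrite(s->staging, 1, s->bytes, f); fclose(f); }
    return NULL;
}

/* Called every K time steps: copy the wavefield to a staging buffer, then
 * let the auxiliary thread write it to disk while computation resumes. */
pthread_t snapshot_async(const float *wavefield, float *staging, size_t bytes,
                         snapshot_t *s, const char *path)
{
    memcpy(staging, wavefield, bytes);   /* short pause instead of full I/O */
    s->staging = staging; s->bytes = bytes; s->path = path;
    pthread_t tid;
    pthread_create(&tid, NULL, write_snapshot, s);
    return tid;                          /* join before reusing `staging`   */
}
```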
Multi-CPU RTM(BWD)
[Bar chart: time in s (lower is better) for the sync and async versions on 1 to 64 nodes, broken down into io, img, out and max[in,comm], with a perfect-scaling reference; the I/O fraction decreases from about 24% on 1 node to 2% on 64 nodes, and the gain from asynchronous I/O ranges from about 23% down to -0.3%; an inset zooms in on the higher node counts]
• MPI processes pinning
• Background engine for asynchronous I/O (auxiliary thread)
• Asynchronous I/O is beneficial at low node counts only
• At high node counts, the compute nodes are oversubscribed
I. Said Ph.D. defense 12/21/2015 41/50
Multi-GPU RTM(BWD)
[Bar chart: time in s (lower is better) for the sync and async versions on 8 and 16 nodes, broken down into io, img, dtoh-io, d-h-comm, unpack, pack, out and max[in,comm], with a perfect-scaling reference; I/O fractions: 34.17% and 46.93%; performance gains: 33.73% and 40.23%]
• Kernel times ⇒ I/O fraction
• Up to 40% performance gain (CPU fully dedicated to MPI+I/O)
• PCI overhead for I/O and communications with neighbors
I. Said Ph.D. defense 12/21/2015 42/50
Multi-APU(explicit copy) RTM(BWD)
[Bar chart: time in s (lower is better) for the sync and async versions on 8 and 16 nodes, with the same breakdown; I/O fractions: 13.06% and 13.57%; performance gains: 11.66% and 13.33%]
• Asynchronous I/O offers up to 13% performance gain
• Lower CPU-GPU data copies overhead
I. Said Ph.D. defense 12/21/2015 43/50
Multi-APU(zero-copy) RTM(BWD)
[Bar chart: time in s (lower is better) for the sync and async versions on 1 to 16 nodes, with the same breakdown; I/O fractions stay between roughly 9% and 12%, and the gains from asynchronous I/O between roughly 4.5% and 9%; an inset zooms in on 8 and 16 nodes]
• Asynchronous I/O offers up to 9% performance gain
• Zero-copy memory = No CPU-GPU data copies prior to I/O
I. Said Ph.D. defense 12/21/2015 44/50
Multi-node RTM: performance comparison
[Bar chart: time in s (lower is better) on 16 nodes of each cluster: CPU, APU (zero-copy), APU (explicit copy), GPU]
• APU cluster (explicit copy) > CPU cluster
• 1.6× (node (2 CPU) to node (1 APU))
• 3.2× (socket (1 CPU) to socket (1 APU))
• GPU cluster > APU cluster (explicit copy) by 3.5×
• GPU cluster > APU cluster (zero-copy) by 8.3×
• APU cluster (explicit copy) > APU cluster (zero-copy) by 2.3×
I. Said Ph.D. defense 12/21/2015 45/50
Multi-node RTM: estimated power efficiency
[Bar chart: time in s (lower is better) under a fixed 1600 W power budget: CPU (8 nodes), APU zero-copy (16 nodes), APU explicit copy (16 nodes), GPU (4 nodes); cost annotations: $3200 and $12000]
• Power budget 1600 W (TDP and maximum power consumption)
• APU cluster (zero-copy) > CPU cluster
• APU cluster (explicit copy) = GPU cluster
I. Said Ph.D. defense 12/21/2015 46/50
Conclusions
• Evaluation of the APU technology:
• Performance standpoint: GPU > APU
• Performance + power: the APU becomes an attractive solution
• Importance of data placement strategies
• One-node RTM study:
• The same conclusions (APU evaluation) were confirmed
• Multi-node study of the RTM:
• GPU/APU: I/O and communications represent a high fraction of the run times
• GPU/APU: overlapping I/O and communications is mandatory
• Kaveri APU 3.2× speedup over Intel Xeon E5-2670
• Kaveri APU falls behind NVIDIA Tesla K40s GPU by 3.5×
• APU ≈ GPU in terms of power efficiency
I. Said Ph.D. defense 12/21/2015 47/50
Conclusions on programming models
• 3 OpenACC based solutions:
• OpenACC only
• OpenACC+HMPPcg (extension to HMPP provided by CAPS)
• OpenACC+code modification
APU (GFlop/s) GPU (GFlop/s) #LOC
OpenACC 17.61 77.55 34
OpenCL 32.42 141.77 779
• OpenACC+HMPPcg offers the best directive based performance
• OpenACC+HMPPcg provides only half the OpenCL performance
• But 26× fewer lines of code (LOC)
I. Said Ph.D. defense 12/21/2015 48/50
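For illustration, the kind of directive-based port being compared looks roughly like the following hedged sketch (a C/OpenACC version of the earlier wave-equation update; the actual directives and tuning in the thesis may differ):

```c
#include <stddef.h>

/* Directive-based version of the earlier fdtd_step sketch: the triple loop
 * is offloaded with OpenACC, data movement being handled by an enclosing
 * `#pragma acc data` region in the caller. */
void fdtd_step_acc(float *un1, const float *un, const float *unm1,
                   const float *c2, const float *a, float dt2, float inv_dx2,
                   int nx, int ny, int nz)
{
    #pragma acc parallel loop collapse(3) default(present)
    for (int k = 4; k < nz - 4; k++)
        for (int j = 4; j < ny - 4; j++)
            for (int i = 4; i < nx - 4; i++) {
                size_t x = (size_t)i + (size_t)nx * ((size_t)j + (size_t)ny * (size_t)k);
                float lap = 3.0f * a[0] * un[x];
                #pragma acc loop seq
                for (int l = 1; l <= 4; l++)
                    lap += a[l] * (un[x + l] + un[x - l]
                                 + un[x + (size_t)l * nx] + un[x - (size_t)l * nx]
                                 + un[x + (size_t)l * nx * ny]
                                 + un[x - (size_t)l * nx * ny]);
                un1[x] = 2.0f * un[x] - unm1[x] + c2[x] * dt2 * (lap * inv_dx2);
            }
}
```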
Perspectives
• Directive-based approach for multi-node RTM
• Upcoming APU roadmap
• Full memory unification (hardware level)
• HBM (High Bandwidth Memory) + compute units count increase
• OpenPower and NVLink
• More complex and realistic RTM algorithms:
• Adding anisotropy
• Elastic media
I. Said Ph.D. defense 12/21/2015 49/50
Thank you for your attention, questions?
List of publications
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said,
Assessing the relevance of APU for high performance scientific computing,
AMD Fusion Developer Summit (AFDS), 2012.
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said,
Evaluation of successive CPUs/APUs/GPUs based on an OpenCL finite difference stencil,
21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2013.
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said,
Forward seismic modeling on AMD Accelerated Processing Unit,
2013 Rice Oil & Gas HPC Workshop.
• P. Eberhart, I. Said, P. Fortin, H. Calandra,
Hybrid strategy for stencil computations on the APU,
The 1st International Workshop on High-Performance Stencil Computations, 2014.
• F. Jézéquel, J.-L. Lamotte, I. Said,
Estimation of numerical reproducibility on CPU and GPU,
Federated Conference on Computer Science and Information Systems, 2015.
• I. Said, P. Fortin, J.-L. Lamotte and H. Calandra,
Leveraging the Accelerated Processing Units for seismic imaging: a performance and power efficiency comparison against
CPUs and GPUs,
(submitted in October 2015 to an international journal).
• I. Said, P. Fortin, J.-L. Lamotte, H. Calandra,
Efficient Reverse Time Migration on APU clusters,
2016 Rice Oil & Gas HPC Workshop (submitted in November 2015).
I. Said Ph.D. defense 12/21/2015 50/50
APU generations: FD performance
[Plot: GFlop/s vs. K (computations per snapshot) for the three APU generations, Llano, Trinity and Kaveri, with and without data transfers (comp-only)]
Weak scaling: multi-CPU RTM
[Bar chart: weak scaling of the multi-CPU RTM, time in s on 1 to 64 nodes for fwd-sync, fwd-async, bwd-sync and bwd-async, broken down into io, out, max[in,comm] and img, with fwd and bwd perfect-scaling references; annotated percentages range from about -0.4% to 21%]
I. Said Ph.D. defense 12/21/2015 51/50
Weak scaling: multi-GPU RTM
[Bar chart: weak scaling of the multi-GPU RTM, time in s on 1 to 16 nodes for fwd-sync, fwd-async, bwd-sync and bwd-async, broken down into io, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm] and img, with fwd and bwd perfect-scaling references; annotated percentages range from about 28% to 53%]
I. Said Ph.D. defense 12/21/2015 52/50
Weak scaling: multi-APU RTM (explicit copy)
[Bar chart: weak scaling of the multi-APU RTM (explicit copy), time in s on 1 to 16 nodes for fwd-sync, fwd-async, bwd-sync and bwd-async, with the same breakdown; annotated percentages range from about 7% to 18%]
I. Said Ph.D. defense 12/21/2015 53/50
Weak scaling: multi-APU RTM (zero-copy)
[Bar chart: weak scaling of the multi-APU RTM (zero-copy), time in s on 1 to 16 nodes for fwd-sync, fwd-async, bwd-sync and bwd-async, with the same breakdown; annotated percentages range from about 8% to 13%]
I. Said Ph.D. defense 12/21/2015 54/50
Estimated production throughput
[Bar chart: time in s for 8 shots run in parallel on 8 nodes (1 shot per node: 1-cpu, 1-apu-zz) versus 8 shots run sequentially on 8 nodes (1 shot per 8 nodes: 8-apu-cggc, 8-gpu), for the forward (fwd) and backward (bwd) phases]
• Loss of parallel efficiency as the nodes count increases
• GPU cluster > APU cluster (zero-copy) by 7.6× (8.3×)
• APU cluster (explicit copy) > APU cluster (zero-copy) by 2× (2.3×)
I. Said Ph.D. defense 12/21/2015 55/50
More Related Content

Similar to slides

Cache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing UnitsCache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing Units
Vajira Thambawita
 
The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...
NECST Lab @ Politecnico di Milano
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
Sundance Multiprocessor Technology Ltd.
 
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
NECST Lab @ Politecnico di Milano
 
PACT_conference_2019_Tutorial_02_gpgpusim.pptx
PACT_conference_2019_Tutorial_02_gpgpusim.pptxPACT_conference_2019_Tutorial_02_gpgpusim.pptx
PACT_conference_2019_Tutorial_02_gpgpusim.pptx
ssuser30e7d2
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
Kohei KaiGai
 
OCP liquid direct to chip temperature guideline.pdf
OCP liquid direct to chip temperature guideline.pdfOCP liquid direct to chip temperature guideline.pdf
OCP liquid direct to chip temperature guideline.pdf
bui thequan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
Dilum Bandara
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
Kohei KaiGai
 
RCW@DEI - Real Needs And Limits
RCW@DEI - Real Needs And LimitsRCW@DEI - Real Needs And Limits
RCW@DEI - Real Needs And Limits
Marco Santambrogio
 
Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)
Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)
Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)
Maxime Cordy
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
NECST Lab @ Politecnico di Milano
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
inside-BigData.com
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
Larry Smarr
 
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Editor IJMTER
 
Task allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemTask allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed system
Deepak Shankar
 
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P..."Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
Edge AI and Vision Alliance
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Fisnik Kraja
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
Sergey Karayev
 

Similar to slides (20)

Cache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing UnitsCache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing Units
 
The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
 
PACT_conference_2019_Tutorial_02_gpgpusim.pptx
PACT_conference_2019_Tutorial_02_gpgpusim.pptxPACT_conference_2019_Tutorial_02_gpgpusim.pptx
PACT_conference_2019_Tutorial_02_gpgpusim.pptx
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
OCP liquid direct to chip temperature guideline.pdf
OCP liquid direct to chip temperature guideline.pdfOCP liquid direct to chip temperature guideline.pdf
OCP liquid direct to chip temperature guideline.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
RCW@DEI - Real Needs And Limits
RCW@DEI - Real Needs And LimitsRCW@DEI - Real Needs And Limits
RCW@DEI - Real Needs And Limits
 
Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)
Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)
Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
 
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
 
Task allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemTask allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed system
 
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P..."Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
 

slides

  • 1. Ph.D. defense December 21st 2015 Contributions of hybrid architectures to depth imaging: a CPU, APU and GPU comparative study Issam SAID
  • 2. Energy supply and demand • 40% more energy is needed by 2035 • No choice but Oil, Gas and Coal • Sophisticated seismic methods I. Said Ph.D. defense 12/21/2015 1/50
  • 3. Seismic methods for Oil & Gas exploration Acquisition Processing Interpretation Shot = source activation + data collection (receivers) Seismic survey • Air-gun array • Hydrophones Shot record I. Said Ph.D. defense 12/21/2015 2/50
  • 4. Seismic methods for Oil & Gas exploration Acquisition Processing Noise at- tenuation Demul- tiple Interpo- lation Imaging Interpretation {Subsurface image I. Said Ph.D. defense 12/21/2015 2/50
  • 5. Seismic methods for Oil & Gas exploration Acquisition Processing Interpretation Calculate seismic attributes • Dip • Azimuth • Coherence I. Said Ph.D. defense 12/21/2015 2/50
  • 6. Seismic methods for Oil & Gas exploration Acquisition Processing Interpretation Calculate seismic attributes • Dip • Azimuth • Coherence (courtesy of Total) I. Said Ph.D. defense 12/21/2015 2/50
  • 7. Reverse Time Migration (RTM) • The reference computer based imaging algorithm in the industry • Repositions seismic events into their true location in the subsurface I. Said Ph.D. defense 12/21/2015 3/50
  • 8. Reverse Time Migration (RTM) • The reference computer based imaging algorithm in the industry • Repositions seismic events into their true location in the subsurface • Sub-salt and steep dips imaging • Accurate (full wave equation (two-way)) • Requires massive compute resources (compute and storage) I. Said Ph.D. defense 12/21/2015 3/50
  • 9. RTM workflow Forward modeling (FWD) I. Said Ph.D. defense 12/21/2015 4/50
  • 10. RTM workflow Forward modeling (FWD) Backward modeling (BWD) I. Said Ph.D. defense 12/21/2015 4/50
  • 11. RTM workflow Forward modeling (FWD) Backward modeling (BWD) Imaging condition I. Said Ph.D. defense 12/21/2015 4/50
  • 12. RTM workflow Forward modeling (FWD) Backward modeling (BWD) Imaging condition I. Said Ph.D. defense 12/21/2015 4/50
  • 13. RTM workflow Forward modeling (FWD) Backward modeling (BWD) Imaging condition I. Said Ph.D. defense 12/21/2015 4/50
  • 14. The underlying theory of the RTM algorithm The RTM operator Img(x) = H 0 T 0 Sh(x, t) ∗ Rh(x, T − t)dt dh The Cauchy problem    1 c2 ∂2 u(x, t) ∂t2 − ∆u(x, t) = s(t), in Ω u(x, 0) = 0 ∂u(x, 0) ∂t = 0 Boundary condition u = 0 on ∂Ω I. Said Ph.D. defense 12/21/2015 5/50
  • 15. Finite Difference Time Domain for RTM • Finite Difference Time Domain (8th order in space, 2nd order in time) • Regular grids • Perfectly Matched Layers (PML) as an absorbing boundary condition Un+1 i,j,k = 2Un i,j,k − Un−1 i,j,k + c2 i,j,k∆t2 ∆Un i,j,k + c2 i,j,k∆t2 sn • Heavy computation (hours to days of processing time) • Terabytes of temporary data • Requires High Performance Computing I. Said Ph.D. defense 12/21/2015 6/50
  • 16. HPC solutions for RTM CPU clusters are the reference • Process large data sets across interconnected multi-core CPUs • Advanced optimization techniques (vectorization, cache blocking) Hardware accelerators and co-processors • RTM is massively parallel • GPU, FPGA, Intel Xeon Phi • Dominance of GPUs: • Huge compute power (up to 5 TFlop/s) • High memory bandwidth (up to 300 GB/s) • Possible PCI overheads (sustained bandwidth up to 12 GB/s) • Data snapshotting • MPI communications with neighbors (multi-GPU) • Limited memory capacities • A high-end GPU has only 12 GB at most • CPU based compute nodes have 128 GB • High power consumptions 400 W (CPU+GPU(not standalone)) I. Said Ph.D. defense 12/21/2015 7/50
  • 17. GPU based solutions for RTM • Possible software techniques to overcome RTM limits on GPUs: • Temporal blocking (PCI overhead) • Overlapping CPU-GPU transfers with computations (PCI overhead) • Out-of-core algorithms (memory limitation) • Extensive efforts and investments • Hardware solution with an acceptable performance/efforts trade-off? I. Said Ph.D. defense 12/21/2015 8/50
  • 18. Towards unifying CPUs and GPUs GPU main memory Dispatch units L2 CU0 L1Local memory Register file PE CU1 L1Local memory Register file PE CUN-1 L1Local memory Register file PE CPU System memory CPU0 L1 WC L2 CPUs-1 L1 WC L2 L3 FPUFPU PCI Express Bus CPU System memory CPU3 L1 WC L2 FPU CPU2 L1 WC L2 FPU CPU0 L1 WC FPU CPU1 L1 WC L2 FPU Quad-core CPU module Dispatch units CU0 TEX L1Local memory Register file PE CU1 TEX L1Local memory Register file PE CUN-1 TEX L1Local memory Register file PE Integrated GPU moduleUNB GPUmemorycontroller Memorycontroller GARLIC ONION Accelerated Processing Unit (APU) CPU+discrete GPU I. Said Ph.D. defense 12/21/2015 9/50
  • 19. Towards unifying CPUs and GPUs Strengths • No PCI Express bus • Integrated GPUs can address the entire memory • Low power processors ( 95 W TDP at most): • CPU 150 W TDP at most • GPU 300 W at most Weaknesses • Low compute power as compared to GPUs: • Kaveri APU 730 GFlop/s (integrated GPU) • Phenom CPU 150 GFlop/s • Tahiti GPU 3700 GFlop/s • An order of magnitude less memory bandwidth than GPUs: • APU up to 25 GB/s memory bandwidth • GPU 300 GB/s I. Said Ph.D. defense 12/21/2015 9/50
  • 20. Overview of contributions Architectures CPU GPU APU Data placement strategies Applications Matrix multiplication Finite difference stencils Hybrid strategy Modeling RTM Successive generations Evaluation Power efficiency Performance One nodeLarge scale Strong scaling Weak scaling Is the APU a valuable HPC solution for depth imaging that: • may be more efficient than CPU solutions? • is likely to overcome GPU based solutions limitations? Program- ming models OpenCLOpenACC I. Said Ph.D. defense 12/21/2015 10/50
  • 25. Evaluation of the APU technology Architectures CPU GPU APU Data placement strategies Applications Matrix multiplication Finite difference stencils Hybrid strategy Modeling RTM Successive generations Evaluation Power efficiency Performance One nodeLarge scale Strong scaling Weak scaling Program- ming models OpenCLOpenACC I. Said Ph.D. defense 12/21/2015 11/50
  • 26. The APU memory subsystem • Onion: coherent bus (slow) • Garlic: non coherent bus (full memory bandwidth) I. Said Ph.D. defense 12/21/2015 12/50
  • 27. The APU memory subsystem • c: regular CPU memory (size depends on the RAM) I. Said Ph.D. defense 12/21/2015 12/50
  • 28. The APU memory subsystem • g: fixed size (512 MB to 4 GB) • cg: explicit copy from CPU memory to GPU memory • gc: explicit copy from GPU memory to CPU memory I. Said Ph.D. defense 12/21/2015 12/50
  • 29. The APU memory subsystem • u: zero-copy and non coherent (read-only accesses from GPU cores) • Fixed and limited size (up to 1 GB) I. Said Ph.D. defense 12/21/2015 12/50
  • 30. The APU memory subsystem • z: zero-copy and coherent memory I. Said Ph.D. defense 12/21/2015 12/50
  • 31. The APU memory subsystem • z: zero-copy and coherent memory • Variable size (up to the maximum CPU memory size) I. Said Ph.D. defense 12/21/2015 12/50
  • 32. Data placement strategies on APU • OpenCL data copy kernel • From buffer A to buffer B • Store buffers A and B in different memory locations • Evaluate different combinations: • For example cggc (explicit copy): • zz (zero-copy): I. Said Ph.D. defense 12/21/2015 13/50
  • 33. Data placement benchmark results on APU 0 5 10 15 20 25 30 35 cggc zgc ugc zz uz time(ms)lowerisbetter Data placement strategies buffer size: 128 MB kernel(copy time) GPU-to-CPU CPU-to-GPU • Using zero-copy = 60% maximum sustained bandwidth • Select the most relevant strategies: cggc, ugc and zz I. Said Ph.D. defense 12/21/2015 14/50
  • 34. Applicative benchmarks on APU Matrix multiplication • Compute bound algorithm • Evaluate the sustained compute gap between GPUs and APUs 8th order 3D finite difference stencil • Memory bound algorithm • Building block of the Reverse Time Migration • Evaluate the APU memory performance Impact of data placement strategies on the APU performance I. Said Ph.D. defense 12/21/2015 15/50
  • 35. Finite difference stencils ∆Un i,j,k = 1 ∆x2 p/2 l=−p/2 alUn i+l,j,k+ 1 ∆y2 p/2 l=−p/2 alUn i,j+l,k+ 1 ∆z2 p/2 l=−p/2 alUn i,j,k+l, p = 8 • Compute complexity O(N3 ) • Storage complexity O(N3 ) • Data snapshotting (K ∈ [1 − 10]) I. Said Ph.D. defense 12/21/2015 16/50
  • 36. Stencils: implementation details • 2D work-item grid on the 3D domain • 1 column along the Z axis/work-item • Register blocking when traversing the Z dimension • Implementations: • scalar: global memory • local scalar: local memory to exploit memory access redundancies • vectorized: global memory + explicit vectorization • local vectorized: local memory + explicit vectorization I. Said Ph.D. defense 12/21/2015 17/50
  • 37. Stencil computations on CPU 2 4 6 8 10 12 14 16 18 64 128 256 512 1024 GFlop/shigherisbetter NxNx32 scalar vectorized local vectorized openmp • Explicit vectorization helped to deliver the best performance (SSE) • OpenCL ≥ OpenMP I. Said Ph.D. defense 12/21/2015 18/50
  • 38. Stencil computations on GPU 0 100 200 300 400 500 600 64 128 256 512 1024 GFlop/shigherisbetter NxNx32 scalar local scalar vectorized local vectorized • Scalar ≥ vectorized thanks to GCN (Graphics Core Next) • Scalar code + OpenCL local memory offered the best performance I. Said Ph.D. defense 12/21/2015 19/50
  • 39. Stencil computations on APU 10 20 30 40 50 60 70 80 90 64 128 256 512 1024 GFlop/shigherisbetter NxNx32 scalar local scalar vectorized local vectorized • Local scalar gives the best performance numbers for N ≥ 128 • Vectorization is not needed thanks to GCN I. Said Ph.D. defense 12/21/2015 20/50
  • 40. Stencils: data placement strategies • Fixed problem size (1024 × 1024 × 32) • One snapshot every K computation (1 ≤ K ≤ 10) • Select the best OpenCL implementations (scalar, local scalar) • Combine them with data placement strategies: cggc, ugc, zz 20 30 40 50 60 70 80 1 2 3 4 5 6 7 8 9 10 GFlop/shigherisbetter K computations + 1 snapshot problem size: 1024x1024x32 (128 MB) best scalar-cggc scalar-ugc scalar-zz local scalar-cggc local scalar-ugc local scalar-zz • Best: local scalar (zz) for 1 ≤ K ≤ 3 and (cggc) for 3 ≤ K ≤ 10 • Select explicit copy (cggc) and zero-copy (zz) for RTM I. Said Ph.D. defense 12/21/2015 21/50
  • 41. Stencils: performance comparison 8 16 32 64 128 256 512 1 2 3 4 5 6 7 8 9 10 GFlop/shigherisbetter K computations + 1 snapshot performance projection problem size: 1024x1024x32 (128 MB) CPU GPU APU APU(Onion=Garlic) • APU > CPU ∀K • GPU > APU, 2 ≤ K ≤ 10 • APU > GPU when performing one snapshot after each iteration I. Said Ph.D. defense 12/21/2015 22/50
  • 42. Stencils: conclusion • APU can be an attractive solution: • For a high rate of data snapshotting (finite difference) • For medium sized problems (matrix multiplication) • An order of magnitude of theoretical performance gap GPU/APU: • But only 3× to 4× only in practice • Performance only: the GPU remains the privileged solution • Power is gaining interest in the HPC community (Green500) • Power wall and Exascale • What about power consumption? I. Said Ph.D. defense 12/21/2015 23/50
  • 43. Power measurement methodology • Raritan PX (DPXR8A-16) PDU to monitor the power consumption • Performance per Watt (PPW) metric Methodology • The power drawn by the system as a whole: • Same functional hardware components for the 3 architectures • CPU+GPU for GPU based solutions • Importance of Power Supply Units (PSUs) electric efficiency I. Said Ph.D. defense 12/21/2015 24/50
  • 44. Stencils: power efficiency comparison 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1 2 3 4 5 6 7 8 9 10 GFlop/s/Whigherisbetter K computations + 1 snapshot up to 62 W up to 222 W up to 159 W problem size: 1024x1024x32 (128 MB) CPU GPU APU • CPU offers a very low power efficiency (0.08 GFlop/s/W) • APU is 13% more power efficient that the GPU • Higher gain for compute bound algorithm (matrix multiplication): • Flops consume less power than memory accesses I. Said Ph.D. defense 12/21/2015 25/50
  • 45. RTM on one HPC node Architectures CPU GPU APU Data placement strategies Applications Matrix multiplication Finite difference stencils Hybrid strategy Modeling RTM Successive generations Evaluation Power efficiency Performance One nodeLarge scale Strong scaling Weak scaling Program- ming models OpenCLOpenACC I. Said Ph.D. defense 12/21/2015 26/50
  • 46. One-node RTM GPU/APU implementations • Multiple OpenCL kernels (PML): • Reduce compute/memory divergence • Stencils study conclusions: • Stencil optimizations and auto-tuning • scalar and local scalar • Data placement strategies (APU) • Imaging condition on CPU • Evaluate: • Kernels (kernels only) • Full application (overall) Absorbing boundaries   Z   Y X Physical domain Free surface Absorbing boundaries Case study • 3D SEG/EAGE Salt velocity model • Compute grid that fits in one GPU compute node (less than 3 GB) • Selective checkpointing frequency K=10 I. Said Ph.D. defense 12/21/2015 27/50
  • 47. One-node RTM on GPU/APU kernels only (GFlop/s) overall (GFlop/s) %loss GPU 141.77 29.65 79% APU(explicit copy) 32.42 15.93 50% APU(zero-copy) 15.2 11.45 24% GPU • Best: scalar implementation • Impact of PCI+IO (snapshotting) on performance APU • Best: scalar with using explicit data copies (cggc) • Local memory is beneficial when using zero-copy memory objects I. Said Ph.D. defense 12/21/2015 28/50
  • 48. One-node RTM: performance comparison 0 20 40 60 80 100 120 140 160 GFlop/shigherisbetter one node/architecture CPU APU(zero-copy) APU(explicitcopy) GPU RTM(kernels only) RTM(overall) • Poor performance on the Phenom CPU (OpenCL) • Gap between GPU and APU: • 4.4× with kernels only • 1.8× only when considering overall I. Said Ph.D. defense 12/21/2015 29/50
  • 49. One-node RTM: power efficiency comparison 0 0.05 0.1 0.15 0.2 0.25 0.3 GFlop/s/Whigherisbetter CPU APU(zero-copy) APU(explicit copy) GPU 137 W 62 W 62 W 198 W • Performance numbers based on overall timings • Poor power efficiency on the Phenom CPU (0.013 GFlop/s/W) • APU can be more power efficient than the GPU: • 1.80× (explicit copy) • 1.23× (zero-copy) I. Said Ph.D. defense 12/21/2015 30/50
  • 50. One-node RTM: conclusion • RTM(kernels only): huge gap between APU and GPU • RTM(overall): the performance gap is reduced • Performance + power: APU is almost twice more efficient than GPU I. Said Ph.D. defense 12/21/2015 31/50
  • 51. RTM on multi-node hybrid architectures Architectures CPU GPU APU Data placement strategies Applications Matrix multiplication Finite difference stencils Hybrid strategy Modeling RTM Successive generations Evaluation Power efficiency Performance One nodeLarge scale Strong scaling Weak scaling Program- ming models OpenCLOpenACC I. Said Ph.D. defense 12/21/2015 32/50
  • 52. RTM on multi-node hybrid architectures Motivations • Real world cases generate large amounts of data ( 1 Terabyte) • Larger than one node memory capacities • Impact of MPI communications on the PCI overhead (GPU)? • Impact of zero-copy on MPI communications (APU)? Clusters (located at Total) CPU cluster APU cluster GPU cluster Number of used nodes 16 Processors/node 2×Intel Xeon CPU E5-2670 1×AMD A10-7850K 1×NVIDIA Tesla K40s (Kaveri) 1×Intel Xeon CPU E5-2680 Case study • Same velocity model, K=10 • Compute grids size 25 GB I. Said Ph.D. defense 12/21/2015 33/50
  • 53. Multi-node RTM: implementation • 3D domain decomposition • One-node study conclusions • Boundaries copied to contiguous buffers: • For GPUs, using OpenCL kernels • For GPUs, PCI memory transfers: • Communications with neighbors • I/O operations for snapshotting Z Y X   South   North   Back   East   West   Front   I. Said Ph.D. defense 12/21/2015 34/50
  • 54. Multi-node RTM: MPI overlapping Problem: ineffective non-blocking communications (initial) Isend(buf) do work(no buf) Wait() use bufProcess P0 Process P1 time Recv(buf) no progress thread Solution: explicit overlap technique (overlap) MPI communications (blocking) sync Update the domain boundaries Update the inner domain User thread Auxiliary thread time activate I. Said Ph.D. defense 12/21/2015 35/50
  • 55. Multi-CPU RTM(FWD) 0 1 2 3 4 5 initial overlap initial overlap initial overlap initial overlap initial overlap initial overlap initial overlap time(s)lowerisbetter #nodes 1 2 4 8 16 32 64 %: MPI fraction %: performance gain comm out in 1.09% 4.89% 5.69% 35.57% 43.04% 66.32% 75.93% max[in,comm] -26.96% -10.70% -16.10% 6.10% 18.20% 48.89% 44.89% perfect scaling 0 0.1 0.2 0.3 0.4 0.5 643216 43.04% 66.32% 75.93% 18.20% 48.89% 44.89% • 1 CPU node = 2 sockets (8 cores 2-way SMT each) • 16 MPI/node (32 threads (SMT)) • The overlap technique is beneficial when MPI fractions are high I. Said Ph.D. defense 12/21/2015 36/50
  • 56. Multi-GPU RTM(FWD) 0 0.02 0.04 0.06 0.08 0.1 0.12 initial overlap initial overlap time(s)lowerisbetter #nodes %: MPI fraction %: performance gain 8 16 comm d-h-comm unpack pack in+out out max[in,comm] perfect scaling 15.76% 26.49% 12.21% 14.24% • Only 2 test cases due to memory limitations • Up to 14% of performance gain (CPU dedicated to communications) • PCI overheads hinder achieving near to perfect scaling I. Said Ph.D. defense 12/21/2015 37/50
  • 57. Multi-APU(explicit copy) RTM(FWD) 0 0.1 0.2 0.3 0.4 0.5 0.6 initial overlap initial overlap time(s)lowerisbetter #nodes %: MPI fraction %: performance gain 8 16 comm d-h-comm unpack pack in+out out max[in,comm] perfect scaling 14.76% 19.03% 13.99% 18.89% • Up to 18% of performance gain (CPU dedicated to communications) • Lower overhead to copy the boundaries ⇒ near to perfect scaling I. Said Ph.D. defense 12/21/2015 38/50
  • 58. Multi-APU(zero-copy) RTM(FWD) 0 1 2 3 4 5 6 7 initial initial overlap initial overlap initial overlap initial overlap time(s)lowerisbetter #nodes %: MPI fraction %: performance gain 1 2 4 8 16 comm d-h-comm unpack pack in+out out max[in,comm] perfect scaling 0.10% 0.36% 13.42% 17.80% 0.34% 1.80% 8.35% 14.68% 0 0.2 0.4 0.6 0.8 1 168 13.42% 17.80% 8.35% 14.68% • Up to 14% of performance gain • Zero-copy ⇒ no CPU-GPU overhead + near to perfect scaling I. Said Ph.D. defense 12/21/2015 39/50
  • 59. Multi-node RTM: asynchronous I/O Synchronous data snapshotting (sync) Proposed solution (async) I. Said Ph.D. defense 12/21/2015 40/50
  • 60. Multi-CPU RTM(BWD) 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 sync async sync async sync async sync async sync async sync async sync async time(s)lowerisbetter #nodes 1 2 4 8 16 32 64 %: I/O fraction %: performance gain io img out max[in,comm] perfect scaling 24.40% 20.67% 17.83% 13.69% 4.07% 9.76% 1.74% 22.98% 18.95% 13.09% 10.29% 9.05% 11.51% -0.26% 0 0.05 0.1 0.15 0.2 0.25 0.3 4.07% 9.76% 1.74% 9.05% 11.51% -0.26% • MPI processes pinning • Background engine for asynchronous I/O (auxiliary thread) • Asynchronous I/O is beneficial for low nodes count only • High nodes count: compute nodes are overused I. Said Ph.D. defense 12/21/2015 41/50
• 61. Multi-GPU RTM(BWD)
(chart: time (s), lower is better, sync vs. async, for 8 and 16 nodes; I/O fractions of 34.17% and 46.93%, async gains of 33.73% and 40.23%; legend: io, img, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm], perfect scaling)
• Short kernel times ⇒ high I/O fraction
• Up to 40% performance gain (CPU fully dedicated to MPI+I/O)
• PCI overhead for I/O and communications with neighbors
I. Said Ph.D. defense 12/21/2015 42/50
• 62. Multi-APU(explicit copy) RTM(BWD)
(chart: time (s), lower is better, sync vs. async, for 8 and 16 nodes; I/O fractions of 13.06% and 13.57%, async gains of 11.66% and 13.33%; legend: io, img, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm], perfect scaling)
• Asynchronous I/O offers up to 13% performance gain
• Lower CPU-GPU data copy overhead
I. Said Ph.D. defense 12/21/2015 43/50
• 63. Multi-APU(zero-copy) RTM(BWD)
(chart: time (s), lower is better, sync vs. async, for 1 to 16 nodes; I/O fractions between 9.21% and 11.58%, async gains between 4.51% and 9.34%; legend: io, img, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm], perfect scaling)
• Asynchronous I/O offers up to 9% performance gain
• Zero-copy memory = no CPU-GPU data copies prior to I/O
I. Said Ph.D. defense 12/21/2015 44/50
• 64. Multi-node RTM: performance comparison
(chart: time (s), lower is better, at 16 nodes per cluster: CPU, APU(zero-copy), APU(explicit copy), GPU)
• APU cluster (explicit copy) > CPU cluster:
• 1.6× node to node (2 CPUs vs. 1 APU)
• 3.2× socket to socket (1 CPU vs. 1 APU)
• GPU cluster > APU cluster (explicit copy) by 3.5×
• GPU cluster > APU cluster (zero-copy) by 8.3×
• APU cluster (explicit copy) > APU cluster (zero-copy) by 2.3×
I. Said Ph.D. defense 12/21/2015 45/50
• 65. Multi-node RTM: estimated power efficiency
(chart: time (s), lower is better, within a 1600 W budget: CPU with 8 nodes, APU(zero-copy) with 16 nodes, APU(explicit copy) with 16 nodes, GPU with 4 nodes; hardware cost labels of $3200 and $12000)
• Power budget of 1600 W (based on TDP and maximum power consumption)
• APU cluster (zero-copy) > CPU cluster
• APU cluster (explicit copy) = GPU cluster
I. Said Ph.D. defense 12/21/2015 46/50
• 66. Conclusions
• Evaluation of the APU technology:
• Performance standpoint: GPU > APU
• Performance + power: the APU becomes an attractive solution
• Importance of data placement strategies
• One-node RTM study:
• The same conclusions (APU evaluation) were confirmed
• Multi-node RTM study:
• GPU/APU: I/O and communications account for a high fraction of run times
• GPU/APU: overlapping I/O and communications is mandatory
• Kaveri APU: 3.2× speedup over the Intel Xeon E5-2670
• Kaveri APU falls behind the NVIDIA Tesla K40s GPU by 3.5×
• APU = GPU in power efficiency
I. Said Ph.D. defense 12/21/2015 47/50
• 67. Conclusions on programming models
• 3 OpenACC-based solutions:
• OpenACC only
• OpenACC+HMPPcg (extension to HMPP provided by CAPS)
• OpenACC+code modification
            APU (GFlop/s)   GPU (GFlop/s)   #LOC
  OpenACC        17.61           77.55        34
  OpenCL         32.42          141.77       779
• OpenACC+HMPPcg offers the best directive-based performance
• OpenACC+HMPPcg provides only half the OpenCL performance
• But 26× fewer lines of code (LOC) (see the illustrative loop below)
I. Said Ph.D. defense 12/21/2015 48/50
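To make the line-count argument concrete, the fragment below is an illustrative directive-based loop, not one of the thesis kernels: a simple 2nd-order 3D Laplacian annotated with OpenACC, showing how a few pragmas replace the explicit buffer management and kernel code that the OpenCL version requires.

```c
/* Illustrative OpenACC example (generic sketch, hypothetical function name):
   a 2nd-order 3D Laplacian offloaded with a single directive. */
void laplacian_acc(const float *restrict u, float *restrict lap,
                   int nx, int ny, int nz)
{
    #pragma acc parallel loop collapse(3) \
                copyin(u[0:nx*ny*nz]) copyout(lap[0:nx*ny*nz])
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++) {
                size_t c = ((size_t)k * ny + j) * nx + i;     /* center point      */
                lap[c] = u[c - 1] + u[c + 1]                  /* X neighbors       */
                       + u[c - nx] + u[c + nx]                /* Y neighbors       */
                       + u[c - (size_t)nx * ny]               /* Z neighbors       */
                       + u[c + (size_t)nx * ny]
                       - 6.0f * u[c];
            }
}
```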
• 68. Perspectives
• Directive-based approach for multi-node RTM
• Upcoming APU roadmap:
• Full memory unification (at the hardware level)
• HBM (High Bandwidth Memory) + increased compute unit counts
• OpenPOWER and NVLink
• More complex and realistic RTM algorithms:
• Adding anisotropy
• Elastic media
I. Said Ph.D. defense 12/21/2015 49/50
• 69. Thank you for your attention, questions?
List of publications
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said, Assessing the relevance of APU for high performance scientific computing, AMD Fusion Developer Summit (AFDS), 2012.
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said, Evaluation of successive CPUs/APUs/GPUs based on an OpenCL finite difference stencil, 21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2013.
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said, Forward seismic modeling on AMD Accelerated Processing Unit, Rice Oil & Gas HPC Workshop, 2013.
• P. Eberhart, I. Said, P. Fortin, H. Calandra, Hybrid strategy for stencil computations on the APU, 1st International Workshop on High-Performance Stencil Computations, 2014.
• F. Jézéquel, J.-L. Lamotte, I. Said, Estimation of numerical reproducibility on CPU and GPU, Federated Conference on Computer Science and Information Systems, 2015.
• I. Said, P. Fortin, J.-L. Lamotte, H. Calandra, Leveraging the Accelerated Processing Units for seismic imaging: a performance and power efficiency comparison against CPUs and GPUs (submitted in October 2015 to an international journal).
• I. Said, P. Fortin, J.-L. Lamotte, H. Calandra, Efficient Reverse Time Migration on APU clusters, Rice Oil & Gas HPC Workshop, 2016 (submitted in November 2015).
I. Said Ph.D. defense 12/21/2015 50/50
• 70. APU generations: FD performance
(chart: GFlop/s as a function of K, where each run performs K computation steps plus 1 snapshot, for the Llano, Trinity and Kaveri APUs, each with a compute-only reference)
• 71. Weak scaling: multi-CPU RTM
(chart: time (s) for 1 to 64 nodes, with fwd-sync, fwd-async, bwd-sync and bwd-async runs and percentage labels per run; legend: io, out, max[in,comm], img, fwd perfect scaling, bwd perfect scaling)
I. Said Ph.D. defense 12/21/2015 51/50
• 72. Weak scaling: multi-GPU RTM
(chart: time (s) for 1 to 16 nodes, with fwd-sync, fwd-async, bwd-sync and bwd-async runs and percentage labels per run; legend: io, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm], img, fwd perfect scaling, bwd perfect scaling)
I. Said Ph.D. defense 12/21/2015 52/50
• 73. Weak scaling: multi-APU RTM (explicit copy)
(chart: time (s) for 1 to 16 nodes, with fwd-sync, fwd-async, bwd-sync and bwd-async runs and percentage labels per run; legend: io, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm], img, fwd perfect scaling, bwd perfect scaling)
I. Said Ph.D. defense 12/21/2015 53/50
• 74. Weak scaling: multi-APU RTM (zero-copy)
(chart: time (s) for 1 to 16 nodes, with fwd-sync, fwd-async, bwd-sync and bwd-async runs and percentage labels per run; legend: io, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm], img, fwd perfect scaling, bwd perfect scaling)
I. Said Ph.D. defense 12/21/2015 54/50
• 75. Estimated production throughput
(chart: time (s); bars 1-cpu-fwd/bwd, 1-apu-zz-fwd/bwd, 8-apu-cggc-fwd/bwd, 8-gpu-fwd/bwd; comparing 8 shots in parallel on 8 nodes against 8 shots in sequence on 8 nodes, 1 shot/8 nodes)
• Loss of parallel efficiency as the node count increases
• GPU cluster > APU cluster (zero-copy) by 7.6× (8.3×)
• APU cluster (explicit copy) > APU cluster (zero-copy) by 2× (2.3×)
I. Said Ph.D. defense 12/21/2015 55/50