8. An Ecosystem for Heterogeneous Parallel Computing
^
Applications
Sequence
Alignment
Molecular
Dynamics
Earthquake
Modeling
Neuroinformatics
CFD for
Mini-Drones
Software
Ecosystem
Heterogeneous Parallel Computing (HPC) Platform
… inter-node (in brief)
with application to BIG DATA (StreamMR)
synergy.cs.vt.edu
18. Traditional Approach to Optimizing Compiler
Architecture-Independent
Optimization
Architecture-Aware
Optimization
C.
del
Mundo,
W.
Feng.
“Towards
a
Performance-‐Portable
FFT
Library
for
Heterogeneous
Computing,”
in
IEEE
IPDPS
‘13.
Phoenix,
AZ,
USA,
May
2014.
(Under
review.)
C.
del
Mundo,
W.
Feng.
“Enabling
Efficient
Intra-‐Warp
Communication
for
Fourier
Transforms
in
a
Many-‐Core
Architecture”,
SC|13,
Denver,
CO,
Nov.
2013.
C.
del
Mundo,
V.
Adhinarayanan,
W.
Feng,
“Accelerating
FFT
for
Wideband
Channelization.”
in
IEEE
ICC
‘13.
Budapest,
Hungary,
June
2013.
synergy.cs.vt.edu
52. Work-share a Region Across the Whole System
#pragma omp acc_region …
OR
#pragma omp parallel …
Original/Master thread
Worker threads
Parallel region
Accelerated region
synergy.cs.vt.edu
53. Scheduling and Load-Balancing by Adaptation
• Measure computational suitability at runtime
eTSAR: Task-Size Adapting Runtime
• Compute new distribution of work through a linear
optimization approach
Program phase
Compute
• Re-distribute Scheduling before each pass
work
Original
A:7
With GPU back−off
1.00
I = total iterations available
0.50
0.25
min(
Adaptive
0.75
n
X
fj = fraction of iterations for compute unit j
0.75
t+ + tj )
j
(8)
fj = 1
j=1
pj = recent time/iteration for compute unit j f ⇤ p(1) f ⇤ p = t+
2
2
1
1
1
n = number of compute devices
f3 ⇤ p 3 f1 ⇤ p 1 = t +
2
+
tj (or tj ) = time over (or under) equal
.
.
.
2
3
4
1
2
3
4
n 1
X GPUs
Number of+
fn ⇤ p n f1 ⇤ p 1 = t +
n
min(
t1 + t1 · · · + t+ 1 + tn 1 )
(2)
n
t1
0.25
0.00
1
) Percentage of time spent on computation
j=1
nd scheduling
n
X
ij = I
(10)
tn
1
(11)
(b) Modified objective and constraints
(3)
Fig. 5: Linear program optimization and performance
j=0
+
1
(9)
t2
Split
0.50
(7)
j=1
ij = iterations for compute unit j
0.00
1.00
n 1
X
synergy.cs.vt.edu
60. Arguments:
Ratio=0.5
Div=4
Dynamic vs. Split Scheduler
Dynamic
500 it
900 it
1,000 it
250 it
250 it
100 it
Pass 1
Split
63 it
113 it
Pass 2
113 it
113 it
113 it
113 it 113 it 113 it
12 it
12 it
1,000 it
500 it
62 it
12 it
12 it
12 it
12 it
12 it
Reschedule Original/Master thread Worker threads Parallel region Accelerated region
88
synergy.cs.vt.edu
62. Arguments:
Ratio=0.5
Div=4
Dynamic vs. Quick Scheduler
900 it
Dynamic
500 it
1,000 it
250 it
250 it
100 it
Pass 1
Quick
63 it
Pass 2
338 it
900 it
1,000 it
500 it
62 it
37 it
100 it
Reschedule Original/Master thread Worker threads Parallel region Accelerated region
90
synergy.cs.vt.edu