Towards an Ecosystem for
Heterogeneous Parallel Computing

Wu FENG
Dept. of Computer Science
Dept. of Electrical & Computer Engineering
Virginia Bioinformatics Institute

© W. Feng, November 2013
wfeng@vt.edu, 540.231.1192
synergy.cs.vt.edu
Japanese ‘Computnik’ Earth Simulator
Shatters U.S. Supercomputer Hegemony

Importance of High-Performance Computing (HPC)
in the LARGE and in the Small

Competitive Risk From Not Having Access to HPC

[Pie chart: Could not exist as a business; Could not compete on quality & testing issues; Could not compete on time to market & cost; Could exist and compete (3%). Remaining shares: 47%, 34%, 16%.]

Data from Council on Competitiveness; sponsored survey conducted by IDC.

Only 3% of companies could exist and compete without HPC.
•  200+ participating companies, including many Fortune 500 (Procter & Gamble and biological and chemical companies)
Computnik 2.0?

•  The Second Coming of Computnik? Computnik 2.0?
   –  No … “only” 43% faster than the previous #1 supercomputer, but
      → $20M cheaper than the previous #1 supercomputer
      → 42% less power consumption
•  The Second Coming of the “Beowulf Cluster” for HPC
   –  The further commoditization of HPC
The First Coming of the “Beowulf Cluster”
•  Utilize commodity PCs (with commodity
CPUs) to build a supercomputer

The Second Coming of
the “Beowulf Cluster”
•  Utilize commodity PCs (with commodity
CPUs) to build a supercomputer
+
•  Utilize commodity graphics processing
units (GPUs) to build a supercomputer
Issue: Extracting performance with programming ease and portability → productivity
“Holy Grail” Vision
•  An ecosystem for heterogeneous parallel computing
   –  Enabling software that tunes parameters of hardware devices
      … with respect to performance, programmability, and portability
      … via a benchmark suite of dwarfs (i.e., motifs) and mini-apps

[Figure: highest-ranked commodity supercomputer in U.S. on the Green500 (11/11)]
Design of Composite Structures

Software Ecosystem

Heterogeneous Parallel Computing (HPC) Platform
An Ecosystem for Heterogeneous Parallel Computing

Applications: Sequence Alignment, Molecular Dynamics, Earthquake Modeling, Neuroinformatics, CFD for Mini-Drones

Software Ecosystem

Heterogeneous Parallel Computing (HPC) Platform

… inter-node (in brief), with application to BIG DATA (StreamMR)
  
Performance, Programmability, Portability

Roadmap

Programming GPUs
CUDA
•  NVIDIA’s proprietary framework

OpenCL
•  Open standard for heterogeneous parallel computing
(Khronos Group)
•  Vendor-neutral environment for CPUs, GPUs,
APUs, and even FPGAs

Prevalence

CUDA: first public release, February 2007
OpenCL: first public release, December 2008

Search Date: 2013-03-25
CU2CL:
CUDA-to-OpenCL Source-to-Source Translator†
•  Works as a Clang plug-in to leverage its production-quality
compiler framework.
•  Covers primary CUDA constructs found in CUDA C and
CUDA run-time API.
•  Performs as well as codes manually ported from CUDA to
OpenCL.
See APU’13 Talk (PT-4057) from Tuesday, November 12
•  Automated CUDA-to-OpenCL Translation with CU2CL: What’s Next?
†  “CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures,” 17th IEEE Int’l Conf. on Parallel & Distributed Systems (ICPADS), Dec. 2011.
OpenCL: Write Once, Run Anywhere

CUDA Program → CU2CL (“cuticle”) → OpenCL Program
•  The CUDA program runs only on NVIDIA GPUs.
•  The translated OpenCL program runs on OpenCL-supported CPUs, GPUs, and FPGAs.
  
Application                   CUDA Lines   Lines Manually Changed   % Auto-Translated
bandwidthTest                        891                        5                98.9
BlackScholes                         347                       14                96.0
matrixMul                            351                        9                97.4
vectorAdd                            147                        0               100.0
Back Propagation                     313                       24                92.3
Hotspot                              328                        2                99.4
Needleman-Wunsch                     430                        3                99.3
SRAD                                 541                        0               100.0
Fen Zi: Molecular Dynamics        17,768                    1,796                89.9
GEM: Molecular Modeling              524                       15                97.1
IZ PS: Neural Network              8,402                      166                98.0
Why CU2CL?
•  Much larger # of apps implemented in CUDA than in OpenCL
–  Idea
§  Leverage scientists’ investment in CUDA to drive OpenCL adoption

–  Issues (from the perspective of domain scientists)
§  Writing from Scratch: Learning Curve
(OpenCL is an even lower-level API than CUDA, which is itself low level.)
§  Porting from CUDA: Tedious and Error-Prone

•  Significant demand from major stakeholders (and Apple …)

Why Not CU2CL?
•  Just start with OpenCL?!
•  CU2CL only does source-to-source translation at present
–  Functional (code) portability? Yes.
–  Performance portability? Mostly no.
Performance, Programmability, Portability

Roadmap
IEEE ICPADS ’11
P2S2 ’12
Parallel Computing ‘13


What about
performance portability?

Performance, Programmability, Portability

Roadmap

  
Traditional Approach to Optimizing Compiler
Architecture-Independent
Optimization

Architecture-Aware
Optimization

C. del Mundo and W. Feng. “Towards a Performance-Portable FFT Library for Heterogeneous Computing,” IEEE IPDPS ’14, Phoenix, AZ, USA, May 2014. (Under review.)

C. del Mundo and W. Feng. “Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture,” SC|13, Denver, CO, Nov. 2013.

C. del Mundo, V. Adhinarayanan, and W. Feng. “Accelerating FFT for Wideband Channelization,” IEEE ICC ’13, Budapest, Hungary, June 2013.
Computational Units Not Created Equal
•  “AMD CPU ≠ Intel CPU” and “AMD GPU ≠ NVIDIA GPU”
•  Initial performance of a CUDA-optimized N-body dwarf

[Chart: “architecture-unaware” / “architecture-independent” performance, annotated −32%]
  
Basic & Architecture-Unaware Execution

[Chart: speedup over hand-tuned SSE for basic, architecture-unaware, and architecture-aware implementations on the NVIDIA GTX 280 vs. the AMD 5870; the architecture-aware AMD 5870 reaches 371× vs. 328× on the GTX 280, a 12% gain]
  
Optimization for N-body Molecular Modeling
•  Optimization techniques on AMD GPUs
   –  Removing conditions → kernel splitting
   –  Local staging
   –  Using vector types
   –  Using image memory
•  Speedup over basic OpenCL GPU implementation
   –  Isolated optimizations
   –  Combined optimizations

MT: Max Threads; KS: Kernel Splitting; RA: Register Accumulator; RP: Register Preloading; LM: Local Memory; IM: Image Memory; LU{2,4}: Loop Unrolling {2x,4x}; VASM{2,4}: Vectorized Access & Scalar Math {float2, float4}; VAVM{2,4}: Vectorized Access & Vector Math {float2, float4}
  
AMD-Optimized N-body Dwarf

[Chart annotated −32%]

371× speed-up
•  12% better than NVIDIA GTX 280
  
FFT: Fast Fourier Transform
•  A spectral method that is
a critical building block
across many disciplines

Peak SP Performance & Peak Memory Bandwidth

  
Summary of Optimizations

  
Optimizations
•  RP: Register Preloading
–  All data elements are first preloaded into the register file before use.
–  Computation facilitated solely on registers

•  LM-CM: Local Memory (Communication Only)
–  Data elements are loaded into local memory only for communication
–  Threads swap data elements solely in local memory

•  CGAP: Coalesced Global Access Pattern
–  Threads access memory contiguously

•  VASM{2|4}: Vector Access, Scalar Math, float{2|4}
–  Data elements are loaded as the listed vector type.
–  Arithmetic operations are scalar (float × float).

•  CM-K: Constant Memory for Kernel Arguments
–  The twiddle factors for the FFT’s twiddle-multiplication stage are precomputed on the CPU
and stored in GPU constant memory for fast look-up.
  
Architecture-Optimized FFT

  
Architecture-Optimized FFT

  
Performance, Programmability, Portability

Roadmap

GPU Computing Gems
J. Molecular Graphics & Modeling
IEEE HPCC ’12
IEEE HPDC ‘13
IEEE ICPADS ’11
IEEE ICC ‘13

  
Performance, Programmability, Portability

Roadmap
FUTURE WORK
  
Performance, Programmability, Portability

Roadmap

  
The Berkeley View†

•  Traditional Approach
   –  Applications that target existing hardware and programming models
•  Berkeley Approach
   –  Hardware design that keeps future applications in mind
   –  Basis for future applications? 13 computational dwarfs

A computational dwarf is a pattern of communication & computation that is common across a set of applications.

The 13 dwarfs: Dense Linear Algebra; Sparse Linear Algebra; Spectral Methods; N-Body Methods; Structured Grids; Unstructured Grids; Monte Carlo (→ MapReduce); Combinational Logic; Graph Traversal; Dynamic Programming; Backtrack & Branch+Bound; Graphical Models; Finite State Machine

†  Asanovic, K., et al. The Landscape of Parallel Computing Research: A View from Berkeley. Tech. Rep. UCB/EECS-2006-183, University of California, Berkeley, Dec. 2006.
  
Example of a Computational Dwarf: N-Body
•  Computational Dwarf: pattern of computation & communication that is common across a set of applications
•  N-Body problems are studied in
   –  Cosmology, particle physics, biology, and engineering
•  All have similar structures
•  An N-Body benchmark can provide meaningful insight to people in all these fields
•  Optimizations may be generally applicable as well

[Figures: RoadRunner Universe (astrophysics); GEM (molecular modeling)]
  
First Instantiation: OpenDwarfs
(formerly “OpenCL and the 13 Dwarfs”)

•  Goal
–  Provide common algorithmic methods, i.e., dwarfs, in a language that is
“write once, run anywhere” (CPU, GPU, or even FPGA), i.e., OpenCL

•  Part of a larger umbrella project (2008-2018), funded by the
NSF Center for High-Performance Reconfigurable Computing (CHREC)

  
Performance, Programmability, Portability

ACM ICPE ’12:
OpenCL & the 13 Dwarfs

Roadmap

IEEE Cluster ’11, FPGA ’11,
IEEE ICPADS ’11, SAAHPC ’11

  
Performance, Programmability, Portability

Roadmap

  
Performance & Power Modeling
•  Goals
   –  Robust framework
   –  Very high accuracy (Target: < 5% prediction error)
   –  Identification of portable predictors for performance and power
   –  Multi-dimensional characterization
      §  Performance → sequential, intra-node parallel, inter-node parallel
      §  Power → component level, node level, cluster level
  
Problem Formulation:

LP-Based Energy-Optimal DVFS Schedule
•  Definitions
   –  A DVFS system exports n settings { (fi, Pi) }.
   –  Ti: total execution time of a program running at setting i
•  Given a program with deadline D, find a DVFS schedule (t1*, …, tn*) such that
   –  If the program is executed for ti seconds at setting i, the total energy usage E is minimized, the deadline D is met, and the required work is completed.

  
Single-Coefficient β Performance Model	

•  Our Formulation
–  Define the relative performance slowdown δ as
T(f) / T(fMAX) – 1
–  Re-formulate two-coefficient model
as a single-coefficient model:

–  The coefficient β is computed at run-time using a regression method on the
past MIPS rates reported from the built-in PMU.
C. Hsu and W. Feng. “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005.

  
β – Adaptation on NAS Parallel Benchmarks

C. Hsu and W. Feng.
“A Power-Aware Run-Time System
for High-Performance Computing,”
SC|05, Nov. 2005.
  
Need for Better Performance & Power Modeling
Source: Virginia Tech
  
Search for Reliable Predictors
•  Re-visit performance counters used for prediction
–  Applicability of performance counters across generations of
architecture

•  Performance counters monitored
   –  NPB and PARSEC benchmarks
   –  Platform: Intel Xeon E5645 CPU (Westmere-EP)
   –  Counters: context switches (CS); instructions issued (IIS); instructions completed (TOTINS); L2 data cache misses (L2 DCM); L2 instruction cache misses (L2 ICM); L3 data cache misses (L3 DCM); outstanding bus request cycles (OutReq); instruction queue write cycles (InsQW)
Performance, Programmability, Portability

Roadmap
ACM/IEEE SC ’05
ICPP ’07
IEEE/ACM CCGrid ‘09
IEEE GreenCom ’10
ACM CF ’11
IEEE Cluster ‘11
ISC ’12
IEEE ICPE ‘13

  
Performance, Programmability, Portability

Roadmap

Performance Portability
  
What is Heterogeneous Task Scheduling?
•  Automatically spreading tasks across heterogeneous compute
resources
–  CPUs, GPUs, APUs, FPGAs, DSPs, and so on

•  Specify tasks at a higher level (currently OpenMP extensions)
•  Run them across available resources automatically

Goal
•  A run-time system that intelligently uses whatever resources are available and optimizes for performance portability
   –  Each user should not have to implement this for themselves!
  
Heterogeneous Task Scheduling

  
How to Heterogeneous Task Schedule (HTS)
•  Accelerated OpenMP offers heterogeneous task scheduling with
–  Programmability
–  Functional portability, given underlying compilers
–  Performance portability

•  How?
–  A simple extension to Accelerated OpenMP syntax for programmability
–  Automatically dividing parallel tasks across arbitrary heterogeneous
compute resources for functional portability
§  CPUs
§  GPUs
§  APUs

–  Intelligent runtime task scheduling for performance portability

  
Programmability: Why Accelerated OpenMP?

Traditional OpenMP:
#pragma omp parallel for shared(in1,in2,out,pow)
for (i = 0; i < end; i++) {
  out[i] = in1[i] * in2[i];
  pow[i] = pow[i] * pow[i];
}

OpenMP Accelerator Directives:
#pragma acc_region_loop acc_copyin(in1[0:end],in2[0:end]) acc_copyout(out[0:end]) acc_copy(pow[0:end])
for (i = 0; i < end; i++) {
  out[i] = in1[i] * in2[i];
  pow[i] = pow[i] * pow[i];
}

Our Proposed Extension:
#pragma acc_region_loop acc_copyin(in1[0:end],in2[0:end]) acc_copyout(out[0:end]) acc_copy(pow[0:end]) hetero(<cond>[,<scheduler>[,<ratio>[,<div>]]])
for (i = 0; i < end; i++) {
  out[i] = in1[i] * in2[i];
  pow[i] = pow[i] * pow[i];
}
  
OpenMP Accelerator Behavior

[Diagram: an accelerated region (#pragma omp acc_region …) launches kernels from the original/master thread with an implicit barrier, while a parallel region (#pragma omp parallel …) fans work out to worker threads]
  
DESIRED OpenMP Accelerator Behavior

[Diagram: an accelerated region nested inside a parallel region, e.g., #pragma omp parallel num_threads(2) with #pragma omp acc_region … in one thread and #pragma omp parallel … in the other]
  
Work-share a Region Across the Whole System

[Diagram: the same region, expressed as either #pragma omp acc_region … OR #pragma omp parallel …, runs across the original/master thread, the worker threads, and the accelerator]
  
Scheduling and Load-Balancing by Adaptation

eTSAR: Task-Size Adapting Runtime
•  Measure computational suitability at runtime
•  Compute a new distribution of work through a linear-optimization approach
•  Re-distribute work before each pass through the program phase

Linear-program variables:
   I  = total iterations available
   n  = number of compute devices
   ij = iterations for compute unit j
   fj = fraction of iterations for compute unit j
   pj = recent time/iteration for compute unit j
   tj+ (or tj-) = time over (or under) the equal-finish-time target

Objective and constraints:
   minimize    (t1+ + t1-) + … + (t(n-1)+ + t(n-1)-)
   subject to  f1 + f2 + … + fn = 1
               fj·pj - f1·p1 = tj+ - tj-   for j = 2, …, n
               i1 + i2 + … + in = I

[Fig. 5: Linear-program optimization and performance: percentage of time spent on computation vs. number of GPUs for Original, Adaptive, and Split scheduling, with and without GPU back-off]
  
Heterogeneous Scheduling: Issues and Solutions
•  Issue: Launching overhead is high on GPUs
–  Using a work-queue, many GPU kernels may need to be run

•  Solution: Schedule only at the beginning of a region
–  The overhead is only paid once or a small number of times

•  Issue: Without a work-queue, how do we balance load?
–  The performance of each device must be predicted

•  Solution: Allocate different amounts of work
–  For the first pass, predict a reasonable initial division of work. We use
a ratio between the number of CPU and GPU cores for this.
–  For subsequent passes, use the performance of previous passes to
predict following passes.

  
Dynamic Ratios
•  All dynamic schedulers share a common basis.
•  At the end of a pass, a version of the linear program is solved.
•  For the simplified two-device (CPU + GPU) case, we actually solve Equation 6, which can be computed in only a few instructions and produces a result within one iteration of the full linear program for the common case:

      ic' · pc = (it' - ic') · pg
      ic' = ((it' - ic') · pg) / pc
      ic' + (ic' · pg) / pc = (it' · pg) / pc
      ic' · (pg + pc) = it' · pg
      ic' = (it' · pg) / (pg + pc)        (6)

   where
   –  ic' is the number of iterations to assign to the CPU
   –  it' is the total iterations of the next pass
   –  p* is the recent time per iteration for the (c)pu / (g)pu on the last pass

•  Expressed in words, the linear program calculates iteration counts whose predicted times are as close to identical across all devices as possible. The constraints assume that iterations take the same amount of time, on average, from one pass to the next.
•  Testing indicated that neither the static nor the dynamic scheduler is entirely appropriate in certain cases, so we added two other schedulers: split and quick.
•  Dynamic scheduling requires at least two executions of the parallel region to compute a ratio, which works well for applications with multiple passes. However, some parallel regions run only a few times, or even just once. Split scheduling handles these regions: each pass begins a loop that iterates div times, with each iteration executing totaltasks/div tasks.
Dynamic Schedulers
•  Schedulers
–  Dynamic
–  Split
–  Quick

•  Arguments
–  Ratio
§  Initial split to use, i.e., ratio between CPU and GPU performance

–  Div
§  How far to subdivide the region

  
Static Scheduler (Ratio: 0.5)

•  Every pass of 1,000 iterations is split by the fixed ratio: 500 to the GPU and 500 to the CPU workers (250 each), identically in Pass 1 and Pass 2.

[Legend: Original/Master thread, Worker threads, Parallel region, Accelerated region]
  
Dynamic Scheduler (Ratio: 0.5)

•  Pass 1: 1,000 iterations split 500/500 between GPU and CPU workers (initial ratio 0.5).
•  Pass 2: using Pass 1 timings, the split adapts to 900 GPU / 100 CPU.

[Legend: Original/Master thread, Worker threads, Parallel region, Accelerated region]
  
Split Scheduler (Ratio: 0.5, Div: 4)

•  Each pass of 500 iterations is divided into div = 4 sub-passes of 125 iterations, with a reschedule after each sub-pass.

[Legend: Reschedule, Original/Master thread, Worker threads, Parallel region, Accelerated region]
  
Dynamic vs. Split Scheduler (Ratio = 0.5, Div = 4)

•  Dynamic: Pass 1 splits 1,000 iterations 500/500 between GPU and CPU; Pass 2 adapts to 900 GPU / 100 CPU.
•  Split: each pass is divided into div = 4 sub-passes with a reschedule after each, so the split adapts within Pass 1 rather than only between passes.

[Legend: Reschedule, Original/Master thread, Worker threads, Parallel region, Accelerated region]
Quick Scheduler (Ratio: 0.5, Div: 4)

•  Pass 1: a small calibration chunk (125 of 500 iterations) runs first; after one reschedule, the remaining 375 iterations run with the updated split.
•  Pass 2: scheduled using the ratio learned in Pass 1.

[Legend: Reschedule, Original/Master thread, Worker threads, Parallel region, Accelerated region]
  
Dynamic vs. Quick Scheduler (Ratio = 0.5, Div = 4)

•  Dynamic: adapts only between full passes (500/500 in Pass 1, then 900/100 in Pass 2).
•  Quick: runs a small calibration sub-pass at the start of Pass 1, reschedules once, and completes Pass 1 and subsequent passes with the adapted split, avoiding repeated rescheduling overhead.

[Legend: Reschedule, Original/Master thread, Worker threads, Parallel region, Accelerated region]
Benchmarks
•  NAS CG – Many passes (1,800 for C class)
–  The Conjugate Gradient benchmark from the NAS Parallel
Benchmarks

•  GEM – One pass
–  Molecular Modeling, computes the electrostatic potential along the
surface of a macromolecule

•  K-Means – Few passes
–  Iterative clustering of points

•  Helmholtz – Few passes, GPU Unsuitable
–  Jacobi iterative method implementing the Helmholtz equation

  
Experimental Setup
•  System
   –  12-core AMD Opteron 6174 CPU
   –  NVIDIA Tesla C2050 GPU
   –  Linux 2.6.27.19
   –  CCE compiler with OpenMP accelerator extensions
•  Procedures
   –  All parameters default, unless otherwise specified
   –  Results represent 5 or more runs
  
Results across Schedulers for all Benchmarks

[Chart: speedup over a 12-core CPU for cg, gem, helmholtz, and kmeans under six schedulers (CPU-only, GPU-only, Static, Dynamic, Split, Quick)]
  
Performance, Programmability, Portability

Roadmap

IEEE IPDPS ’12

  
An Ecosystem for Heterogeneous Parallel Computing

Roadmap: much work still to be done.
•  OpenCL and the 13 Dwarfs: beta release pending
•  Source-to-Source Translation: CU2CL only & no optimization
•  Architecture-Aware Optimization: only manual optimizations
•  Performance & Power Modeling: preliminary & pre-multicore
•  Affinity-Based Cost Modeling: empirical results; modeling in progress
•  Heterogeneous Task Scheduling: preliminary, with OpenMP
  
An Ecosystem for Heterogeneous Parallel Computing: Intra-Node

Applications: Sequence Alignment, Molecular Dynamics, Earthquake Modeling, Neuroinformatics, Avionic Composites

Software Ecosystem

Heterogeneous Parallel Computing (HPC) Platform

… from Intra-Node to Inter-Node
  
Programming CPU-GPU Clusters (e.g., MPI+CUDA)

[Diagram: Node 1 and Node 2, each with a GPU (device memory) and a CPU (main memory), connected by a network]

Rank = 0:
if (rank == 0)
{
  cudaMemcpy(host_buf, dev_buf, D2H)
  MPI_Send(host_buf, .. ..)
}

Rank = 1:
if (rank == 1)
{
  MPI_Recv(host_buf, .. ..)
  cudaMemcpy(dev_buf, host_buf, H2D)
}
  
Goal of Programming CPU-GPU Clusters (MPI + Any Acc)
IEEE HPCC ’12
ACM HPDC ‘13

[Diagram: two nodes, each with a GPU (device memory) and a CPU (main memory), connected by a network]

Rank = 0:
if (rank == 0)
{
  MPI_Send(any_buf, .. ..);
}

Rank = 1:
if (rank == 1)
{
  MPI_Recv(any_buf, .. ..);
}
  
“Virtualizing” GPUs …

•  Traditional Model: the application on a compute node calls the native OpenCL library directly and can use only its own node’s physical GPU.
•  Our Model (VOCL): the application links against the VOCL library, which forwards OpenCL API calls over MPI to VOCL proxies on other compute nodes; each proxy calls the native OpenCL library on its local physical GPU, which the application sees as a virtual GPU.
  
Conclusion
•  An ecosystem for heterogeneous parallel computing
   –  Enabling software that tunes parameters of hardware devices
      … with respect to performance, programmability, and portability
      … via a benchmark suite of dwarfs (i.e., motifs) and mini-apps

[Figure: highest-ranked commodity supercomputer in U.S. on the Green500 (11/11)]
  
People

  
Funding Acknowledgements

Microsoft featured our “Big Data in the Cloud”
grant in U.S. Congressional Testimony

  
Wu Feng, wfeng@vt.edu, 540-231-1192

http://synergy.cs.vt.edu/
http://www.chrec.org/
http://www.green500.org/
http://www.mpiblast.org/
http://sss.cs.vt.edu/

“Accelerators ‘R Us”!
http://accel.cs.vt.edu/
http://myvice.cs.vt.edu/
  

MM-4099, Adapting game content to the viewing environment, by Noman HashimMM-4099, Adapting game content to the viewing environment, by Noman Hashim
MM-4099, Adapting game content to the viewing environment, by Noman HashimAMD Developer Central
 

What's hot (20)

PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyPT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
IS-4082, Real-Time insight in Big Data – Even faster using HSA, by Norbert He...
IS-4082, Real-Time insight in Big Data – Even faster using HSA, by Norbert He...IS-4082, Real-Time insight in Big Data – Even faster using HSA, by Norbert He...
IS-4082, Real-Time insight in Big Data – Even faster using HSA, by Norbert He...
 
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
 
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin CoumansGS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
 
HC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasHC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu Das
 
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
 
WT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
WT-4073, ANGLE and cross-platform WebGL support, by Shannon WoodsWT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
WT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
 
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
 
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
 
PG-4119, 3D Geometry Compression on GPU, by Jacques Lefaucheux
PG-4119, 3D Geometry Compression on GPU, by Jacques LefaucheuxPG-4119, 3D Geometry Compression on GPU, by Jacques Lefaucheux
PG-4119, 3D Geometry Compression on GPU, by Jacques Lefaucheux
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben Gaster
 
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoMM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
 
MM-4085, Designing a game audio engine for HSA, by Laurent Betbeder
MM-4085, Designing a game audio engine for HSA, by Laurent BetbederMM-4085, Designing a game audio engine for HSA, by Laurent Betbeder
MM-4085, Designing a game audio engine for HSA, by Laurent Betbeder
 
Final lisa opening_keynote_draft_-_v12.1tb
Final lisa opening_keynote_draft_-_v12.1tbFinal lisa opening_keynote_draft_-_v12.1tb
Final lisa opening_keynote_draft_-_v12.1tb
 
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...
 
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosMM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
 
CE-4117, HSA Optimizations and Impact on end User Experiences for AfterShot P...
CE-4117, HSA Optimizations and Impact on end User Experiences for AfterShot P...CE-4117, HSA Optimizations and Impact on end User Experiences for AfterShot P...
CE-4117, HSA Optimizations and Impact on end User Experiences for AfterShot P...
 
MM-4099, Adapting game content to the viewing environment, by Noman Hashim
MM-4099, Adapting game content to the viewing environment, by Noman HashimMM-4099, Adapting game content to the viewing environment, by Noman Hashim
MM-4099, Adapting game content to the viewing environment, by Noman Hashim
 

Viewers also liked

Parallel Computing Application
Parallel Computing ApplicationParallel Computing Application
Parallel Computing Applicationhanis salwan
 
Casual mass parallel computing
Casual mass parallel computingCasual mass parallel computing
Casual mass parallel computingaragozin
 
Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...
Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...
Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...Yahoo Developer Network
 
Introduction to parallel processing
Introduction to parallel processingIntroduction to parallel processing
Introduction to parallel processingPage Maker
 
Applications of paralleL processing
Applications of paralleL processingApplications of paralleL processing
Applications of paralleL processingPage Maker
 
Parallel computing
Parallel computingParallel computing
Parallel computingVinay Gupta
 

Viewers also liked (6)

Parallel Computing Application
Parallel Computing ApplicationParallel Computing Application
Parallel Computing Application
 
Casual mass parallel computing
Casual mass parallel computingCasual mass parallel computing
Casual mass parallel computing
 
Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...
Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...
Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...
 
Introduction to parallel processing
Introduction to parallel processingIntroduction to parallel processing
Introduction to parallel processing
 
Applications of paralleL processing
Applications of paralleL processingApplications of paralleL processing
Applications of paralleL processing
 
Parallel computing
Parallel computingParallel computing
Parallel computing
 

Similar to HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...
Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...
Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...David Meyer
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Larry Smarr
 
Gc vit sttp cc december 2013
Gc vit sttp cc december 2013Gc vit sttp cc december 2013
Gc vit sttp cc december 2013Seema Shah
 
Software Defined Networking: The OpenDaylight Project
Software Defined Networking: The OpenDaylight ProjectSoftware Defined Networking: The OpenDaylight Project
Software Defined Networking: The OpenDaylight ProjectGreat Wide Open
 
OpenACC and Open Hackathons Monthly Highlights May 2023.pdf
OpenACC and Open Hackathons Monthly Highlights May  2023.pdfOpenACC and Open Hackathons Monthly Highlights May  2023.pdf
OpenACC and Open Hackathons Monthly Highlights May 2023.pdfOpenACC
 
Future of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native worldFuture of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native worldSrivatsan Srinivasan
 
FPGA-based soft-processors: 6G nodes and post-quantum security in space
 FPGA-based soft-processors: 6G nodes and post-quantum security in space FPGA-based soft-processors: 6G nodes and post-quantum security in space
FPGA-based soft-processors: 6G nodes and post-quantum security in spaceFacultad de Informática UCM
 
Indiana University's Advanced Science Gateway Support
Indiana University's Advanced Science Gateway SupportIndiana University's Advanced Science Gateway Support
Indiana University's Advanced Science Gateway Supportmarpierc
 
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsGokhan Boranalp
 
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...Rafael Ferreira da Silva
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Lecture_IIITD.pptx
Lecture_IIITD.pptxLecture_IIITD.pptx
Lecture_IIITD.pptxachakracu
 
OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC
 
Adaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog PotdarAdaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog PotdarSuyog Potdar
 
An optimized scientific workflow scheduling in cloud computing
An optimized scientific workflow scheduling in cloud computingAn optimized scientific workflow scheduling in cloud computing
An optimized scientific workflow scheduling in cloud computingDIGVIJAY SHINDE
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intromarpierc
 
OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC
 
Locationless data science on a modern secure edge
Locationless data science on a modern secure edgeLocationless data science on a modern secure edge
Locationless data science on a modern secure edgeJohn Archer
 

Similar to HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng (20)

Netsoft19 Keynote: Fluid Network Planes
Netsoft19 Keynote: Fluid Network PlanesNetsoft19 Keynote: Fluid Network Planes
Netsoft19 Keynote: Fluid Network Planes
 
Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...
Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...
Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
 
Gc vit sttp cc december 2013
Gc vit sttp cc december 2013Gc vit sttp cc december 2013
Gc vit sttp cc december 2013
 
Software Defined Networking: The OpenDaylight Project
Software Defined Networking: The OpenDaylight ProjectSoftware Defined Networking: The OpenDaylight Project
Software Defined Networking: The OpenDaylight Project
 
OpenACC and Open Hackathons Monthly Highlights May 2023.pdf
OpenACC and Open Hackathons Monthly Highlights May  2023.pdfOpenACC and Open Hackathons Monthly Highlights May  2023.pdf
OpenACC and Open Hackathons Monthly Highlights May 2023.pdf
 
Future of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native worldFuture of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native world
 
FPGA-based soft-processors: 6G nodes and post-quantum security in space
 FPGA-based soft-processors: 6G nodes and post-quantum security in space FPGA-based soft-processors: 6G nodes and post-quantum security in space
FPGA-based soft-processors: 6G nodes and post-quantum security in space
 
Indiana University's Advanced Science Gateway Support
Indiana University's Advanced Science Gateway SupportIndiana University's Advanced Science Gateway Support
Indiana University's Advanced Science Gateway Support
 
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed Systems
 
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Lecture_IIITD.pptx
Lecture_IIITD.pptxLecture_IIITD.pptx
Lecture_IIITD.pptx
 
OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020
 
Scaling hadoopapplications
Scaling hadoopapplicationsScaling hadoopapplications
Scaling hadoopapplications
 
Adaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog PotdarAdaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog Potdar
 
An optimized scientific workflow scheduling in cloud computing
An optimized scientific workflow scheduling in cloud computingAn optimized scientific workflow scheduling in cloud computing
An optimized scientific workflow scheduling in cloud computing
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
 
OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019
 
Locationless data science on a modern secure edge
Locationless data science on a modern secure edgeLocationless data science on a modern secure edge
Locationless data science on a modern secure edge
 

More from AMD Developer Central

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsAMD Developer Central
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesAMD Developer Central
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAMD Developer Central
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceAMD Developer Central
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozAMD Developer Central
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellAMD Developer Central
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonAMD Developer Central
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornAMD Developer Central
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevAMD Developer Central
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasAMD Developer Central
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...AMD Developer Central
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14AMD Developer Central
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14AMD Developer Central
 

More from AMD Developer Central (20)

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math Libraries
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.js
 
Media SDK Webinar 2014
Media SDK Webinar 2014Media SDK Webinar 2014
Media SDK Webinar 2014
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
 
DirectGMA on AMD’S FirePro™ GPUS
DirectGMA on AMD’S  FirePro™ GPUSDirectGMA on AMD’S  FirePro™ GPUS
DirectGMA on AMD’S FirePro™ GPUS
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop Intelligence
 
Inside XBox- One, by Martin Fuller
Inside XBox- One, by Martin FullerInside XBox- One, by Martin Fuller
Inside XBox- One, by Martin Fuller
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas Thibieroz
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodes
 
Inside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin FullerInside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin Fuller
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan Nevraev
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
 

Recently uploaded

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

  • 1. Towards an Ecosystem for Heterogeneous Parallel Computing. Wu Feng, Dept. of Computer Science, Dept. of Electrical & Computer Engineering, Virginia Bioinformatics Institute. © W. Feng, November 2013. wfeng@vt.edu, 540.231.1192. synergy.cs.vt.edu
  • 2. Japanese ‘Computnik’ Earth Simulator Shatters U.S. Supercomputer Hegemony
  • 3. Importance of High-Performance Computing (HPC) in the Large and in the Small. Competitive risk from not having access to HPC: could exist and compete, 3%; could not exist as a business, 34%; could not compete on quality & testing issues, 47%; could not compete on time to market & cost, 16%. Data from the Council on Competitiveness; sponsored survey conducted by IDC. Only 3% of companies could exist and compete without HPC. 200+ participating companies, including many Fortune 500 firms (Procter & Gamble, biological and chemical companies).
  • 4. Computnik 2.0?
    – The Second Coming of Computnik? Computnik 2.0? No … “only” 43% faster than the previous #1 supercomputer, but → $20M cheaper than the previous #1 supercomputer and → 42% less power consumption.
    – The Second Coming of the “Beowulf Cluster” for HPC: the further commoditization of HPC.
  • 5. The First Coming of the “Beowulf Cluster”: utilize commodity PCs (with commodity CPUs) to build a supercomputer. The Second Coming of the “Beowulf Cluster”: utilize commodity PCs (with commodity CPUs) plus commodity graphics processing units (GPUs) to build a supercomputer. Issue: extracting performance with programming ease and portability → productivity.
  • 6. “Holy Grail” Vision: an ecosystem for heterogeneous parallel computing, i.e., enabling software that tunes parameters of hardware devices with respect to performance, programmability, and portability, via a benchmark suite of dwarfs (i.e., motifs) and mini-apps. Highest-ranked commodity supercomputer in the U.S. on the Green500 (11/11).
  • 7. Design of Composite Structures. Software Ecosystem. Heterogeneous Parallel Computing (HPC) Platform.
  • 8. An Ecosystem for Heterogeneous Parallel Computing: Applications. Sequence Alignment, Molecular Dynamics, Earthquake Modeling, Neuroinformatics, CFD for Mini-Drones. Software Ecosystem. Heterogeneous Parallel Computing (HPC) Platform. … inter-node (in brief) with application to BIG DATA (StreamMR).
  • 9. Performance, Programmability, Portability: Roadmap.
  • 10. Programming GPUs. CUDA: NVIDIA’s proprietary framework. OpenCL: open standard for heterogeneous parallel computing (Khronos Group); a vendor-neutral environment for CPUs, GPUs, APUs, and even FPGAs.
  • 11. Prevalence. CUDA: first public release February 2007. OpenCL: first public release December 2008. Search date: 2013-03-25.
  • 12. CU2CL: CUDA-to-OpenCL Source-to-Source Translator†. Works as a Clang plug-in to leverage its production-quality compiler framework. Covers the primary CUDA constructs found in CUDA C and the CUDA run-time API. Performs as well as codes manually ported from CUDA to OpenCL. See the APU’13 talk (PT-4057) from Tuesday, November 12: “Automated CUDA-to-OpenCL Translation with CU2CL: What’s Next?” † “CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures,” 17th IEEE Int’l Conf. on Parallel & Distributed Systems (ICPADS), Dec. 2011.
  • 13. OpenCL: Write Once, Run Anywhere. A CUDA program passes through CU2CL (“cuticle”) to produce an OpenCL program, which runs on OpenCL-supported CPUs, GPUs, and FPGAs as well as NVIDIA GPUs.
  • 14. CU2CL translation results by application:
      Application                   CUDA Lines   Lines Manually Changed   % Auto-Translated
      bandwidthTest                       891                 5                 98.9
      BlackScholes                        347                14                 96.0
      matrixMul                           351                 9                 97.4
      vectorAdd                           147                 0                100.0
      Back Propagation                    313                24                 92.3
      Hotspot                             328                 2                 99.4
      Needleman-Wunsch                    430                 3                 99.3
      SRAD                                541                 0                100.0
      Fen Zi: Molecular Dynamics       17,768             1,796                 89.9
      GEM: Molecular Modeling             524                15                 97.1
      IZ PS: Neural Network             8,402               166                 98.0
  • 15. Why CU2CL? A much larger number of apps is implemented in CUDA than in OpenCL. Idea: leverage scientists’ investment in CUDA to drive OpenCL adoption. Issues (from the perspective of domain scientists): writing from scratch has a learning curve (OpenCL is too low-level an API compared to CUDA; CUDA is also low level), and porting from CUDA is tedious and error-prone. Significant demand from major stakeholders (and Apple …). Why not CU2CL? Just start with OpenCL?! CU2CL only does source-to-source translation at present: functional (code) portability? Yes. Performance portability? Mostly no.
  • 16. Performance, Programmability, Portability: Roadmap. IEEE ICPADS ’11; P2S2 ’12; Parallel Computing ’13. What about performance portability?
  • 17. Performance, Programmability, Portability: Roadmap.
  • 18. Traditional Approach to an Optimizing Compiler: architecture-independent optimization, then architecture-aware optimization. C. del Mundo, W. Feng. “Towards a Performance-Portable FFT Library for Heterogeneous Computing,” IEEE IPDPS ’14, Phoenix, AZ, USA, May 2014 (under review). C. del Mundo, W. Feng. “Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture,” SC|13, Denver, CO, Nov. 2013. C. del Mundo, V. Adhinarayanan, W. Feng. “Accelerating FFT for Wideband Channelization,” IEEE ICC ’13, Budapest, Hungary, June 2013.
  • 19. Computational Units Are Not Created Equal. “AMD CPU ≠ Intel CPU” and “AMD GPU ≠ NVIDIA GPU.” Initial performance of a CUDA-optimized N-body dwarf: -32%, i.e., “architecture-unaware” or “architecture-independent.”
  • 20. Basic & Architecture-Unaware Execution. Speedup over hand-tuned SSE rises from 88 and 163 (basic) to 192 and 224 (architecture-unaware) to 328 and 371 (architecture-aware), comparing the NVIDIA GTX 280 and the AMD 5870; in the architecture-aware case the AMD 5870 (371×) beats the GTX 280 (328×) by 12%.
  • 21. Optimization for N-Body Molecular Modeling.
    – Optimization techniques on AMD GPUs (isolated): removing conditionals via kernel splitting, local staging, using vector types, using image memory.
    – Speedup over the basic OpenCL GPU implementation: isolated optimizations and combined optimizations.
    Key: MT: Max Threads; KS: Kernel Splitting; RA: Register Accumulator; RP: Register Preloading; LM: Local Memory; IM: Image Memory; LU{2,4}: Loop Unrolling {2x, 4x}; VASM{2,4}: Vectorized Access & Scalar Math {float2, float4}; VAVM{2,4}: Vectorized Access & Vector Math {float2, float4}.
  • 22. AMD-Optimized N-Body Dwarf: from -32% initially to a 371× speedup, 12% better than the NVIDIA GTX 280.
  • 23. FFT: Fast Fourier Transform. A spectral method that is a critical building block across many disciplines.
  • 24. Peak SP Performance & Peak Memory Bandwidth.
  • 25. Summary of Optimizations.
  • 26. Optimizations.
    – RP: Register Preloading. All data elements are first preloaded into the register file before use; computation is performed solely on registers.
    – LM-CM: Local Memory (Communication Only). Data elements are loaded into local memory only for communication; threads swap data elements solely in local memory.
    – CGAP: Coalesced Global Access Pattern. Threads access memory contiguously.
    – VASM{2|4}: Vector Access, Scalar Math, float{2|4}. Data elements are loaded as the listed vector type; arithmetic operations are scalar (float × float).
    – CM-K: Constant Memory for Kernel Arguments. The twiddle-multiplication stage of the FFT is precomputed on the CPU and stored in GPU constant memory for fast look-up.
  • 27. Architecture-Optimized FFT.
  • 28. Architecture-Optimized FFT.
  • 29. Performance, Programmability, Portability: Roadmap. GPU Computing Gems; J. Molecular Graphics & Modeling; IEEE HPCC ’12; IEEE HPDC ’13; IEEE ICPADS ’11; IEEE ICC ’13.
  • 30. Performance, Programmability, Portability: Roadmap. FUTURE WORK.
  • 31. Performance, Programmability, Portability: Roadmap.
  • 32. The Berkeley View†.
    – Traditional approach: applications that target existing hardware and programming models.
    – Berkeley approach: hardware design that keeps future applications in mind. Basis for future applications? 13 computational dwarfs. A computational dwarf is a pattern of communication & computation that is common across a set of applications: Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods, N-Body Methods, Structured Grids, Unstructured Grids, Monte Carlo → MapReduce, Combinational Logic, Graph Traversal, Dynamic Programming, Backtrack & Branch+Bound, Graphical Models, Finite State Machine.
    † Asanovic, K., et al. The Landscape of Parallel Computing Research: A View from Berkeley. Tech. Rep. UCB/EECS-2006-183, University of California, Berkeley, Dec. 2006.
  • 33. Example of a Computational Dwarf: N-Body. A computational dwarf is a pattern of computation & communication that is common across a set of applications. N-Body problems are studied in cosmology, particle physics, biology, and engineering; all have similar structures. An N-Body benchmark can provide meaningful insight to people in all these fields, and optimizations may be generally applicable as well. Examples: RoadRunner Universe (astrophysics); GEM (molecular modeling).
  • 34. First Instantiation: OpenDwarfs (formerly “OpenCL and the 13 Dwarfs”). Goal: provide common algorithmic methods, i.e., dwarfs, in a language that is “write once, run anywhere” (CPU, GPU, or even FPGA), i.e., OpenCL. Part of a larger umbrella project (2008-2018), funded by the NSF Center for High-Performance Reconfigurable Computing (CHREC).
  • 35. Performance, Programmability, Portability: Roadmap. ACM ICPE ’12: OpenCL & the 13 Dwarfs; IEEE Cluster ’11; FPGA ’11; IEEE ICPADS ’11; SAAHPC ’11.
  • 36. Performance, Programmability, Portability: Roadmap.
  • 37. Performance & Power Modeling. Goals: a robust framework; very high accuracy (target: < 5% prediction error); identification of portable predictors for performance and power; multi-dimensional characterization (performance → sequential, intra-node parallel, inter-node parallel; power → component level, node level, cluster level).
  • 38. Problem Formulation: LP-Based Energy-Optimal DVFS Schedule. Definitions: a DVFS system exports n settings { (fi, Pi) }; Ti is the total execution time of a program running at setting i. Given a program with deadline D, find a DVFS schedule (t1*, …, tn*) such that, if the program is executed for ti seconds at setting i, the total energy usage E is minimized, the deadline D is met, and the required work is completed.
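For intuition, the two-setting case of this formulation has a closed form: run t1 seconds at the faster setting and t2 at the slower one so that t1 + t2 = D and the cycles completed equal the required work W. This is a minimal sketch under those assumptions (the names and the cycles-as-work convention are mine, not the paper's):

```c
#include <assert.h>
#include <math.h>

/* Two-setting DVFS schedule (a minimal sketch).  Settings (f1, P1)
 * and (f2, P2) with f1 > f2; work W in cycles, deadline D in seconds.
 * Solving  t1 + t2 = D  and  f1*t1 + f2*t2 = W  gives the unique
 * schedule that finishes exactly at the deadline, with energy
 * E = P1*t1 + P2*t2. */
typedef struct { double t1, t2, energy; } schedule_t;

schedule_t dvfs_schedule(double f1, double P1,
                         double f2, double P2,
                         double W, double D)
{
    schedule_t s;
    s.t1 = (W - f2 * D) / (f1 - f2);  /* time at the faster setting */
    s.t2 = D - s.t1;                  /* remainder at the slower one */
    s.energy = P1 * s.t1 + P2 * s.t2;
    return s;
}
```

With n settings the same balance becomes the linear program on the slide; the two-setting case just makes the deadline constraint and the work constraint solvable by hand.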
  • 39. Single-Coefficient β Performance Model. Our formulation: define the relative performance slowdown δ as T(f) / T(fMAX) - 1, and re-formulate the two-coefficient model as a single-coefficient model. The coefficient β is computed at run time using a regression method on the past MIPS rates reported by the built-in PMU. C. Hsu and W. Feng. “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005.
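A sketch of how such a single-coefficient model is used; the exact functional form below, δ(f) = β·(fMAX/f - 1), is my reading of the Hsu-Feng formulation and should be treated as an assumption:

```c
#include <assert.h>
#include <math.h>

/* Assumed single-coefficient beta model:
 *   delta(f) = T(f)/T(fmax) - 1 = beta * (fmax/f - 1)
 * beta near 1: CPU-bound code, slows proportionally with the clock;
 * beta near 0: e.g. memory-bound code, nearly insensitive to f. */
double predicted_slowdown(double beta, double fmax, double f)
{
    return beta * (fmax / f - 1.0);
}

/* Invert the model: lowest frequency whose predicted slowdown
 * stays within a target delta. */
double frequency_for_slowdown(double beta, double fmax, double delta)
{
    return fmax * beta / (beta + delta);
}
```

A run-time system can re-fit β from recent PMU readings and then call frequency_for_slowdown with the user's slowdown budget to pick the DVFS setting.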
  • 40. β-Adaptation on the NAS Parallel Benchmarks. C. Hsu and W. Feng. “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005.
  • 41. Need for Better Performance & Power Modeling. Source: Virginia Tech.
  • 43. Search for Reliable Predictors. Re-visit the performance counters used for prediction: how applicable are performance counters across generations of an architecture? Performance counters monitored on the NPB and PARSEC benchmarks; platform: Intel Xeon E5645 CPU (Westmere-EP). Counters: context switches (CS), instructions issued (IIS), L3 data cache misses (L3 DCM), L2 data cache misses (L2 DCM), L2 instruction cache misses (L2 ICM), instructions completed (TOTINS), outstanding bus request cycles (OutReq), instruction queue write cycles (InsQW).
  • 44. Performance, Programmability, Portability: Roadmap. ACM/IEEE SC ’05; ICPP ’07; IEEE/ACM CCGrid ’09; IEEE GreenCom ’10; ACM CF ’11; IEEE Cluster ’11; ISC ’12; IEEE ICPE ’13.
  • 45. Performance, Programmability, Portability: Roadmap. Performance Portability.
  • 46. What Is Heterogeneous Task Scheduling? Automatically spreading tasks across heterogeneous compute resources: CPUs, GPUs, APUs, FPGAs, DSPs, and so on. Specify tasks at a higher level (currently OpenMP extensions) and run them across the available resources automatically. Goal: a run-time system that intelligently uses whatever resources are available and optimizes for performance portability; each user should not have to implement this for themselves!
  • 47. Heterogeneous Task Scheduling.
  • 48. How to Heterogeneously Task Schedule (HTS). Accelerated OpenMP offers heterogeneous task scheduling with programmability, functional portability (given the underlying compilers), and performance portability. How? A simple extension to Accelerated OpenMP syntax for programmability; automatic division of parallel tasks across arbitrary heterogeneous compute resources (CPUs, GPUs, APUs) for functional portability; and intelligent runtime task scheduling for performance portability.
  • 49. Programmability: Why Accelerated OpenMP?
    – Traditional OpenMP:
      #pragma omp parallel for shared(in1,in2,out,pow)
      for (i = 0; i < end; i++) {
        out[i] = in1[i] * in2[i];
        pow[i] = pow[i] * pow[i];
      }
    – OpenMP accelerator directives:
      #pragma acc_region_loop acc_copyin(in1[0:end],in2[0:end]) acc_copyout(out[0:end]) acc_copy(pow[0:end])
      for (i = 0; i < end; i++) {
        out[i] = in1[i] * in2[i];
        pow[i] = pow[i] * pow[i];
      }
    – Our proposed extension:
      #pragma acc_region_loop acc_copyin(in1[0:end],in2[0:end]) acc_copyout(out[0:end]) acc_copy(pow[0:end]) hetero(<cond>[,<scheduler>[,<ratio>[,<div>]]])
      for (i = 0; i < end; i++) {
        out[i] = in1[i] * in2[i];
        pow[i] = pow[i] * pow[i];
      }
  • 50. OpenMP Accelerator Behavior. The original/master thread forks worker threads in a parallel region (#pragma omp parallel …) and launches kernels in an accelerated region (#pragma omp acc_region …), with an implicit barrier at the end of each.
  • 51. DESIRED OpenMP Accelerator Behavior. #pragma omp parallel num_threads(2), with one thread running the accelerated region (#pragma omp acc_region …) while the other runs a CPU parallel region (#pragma omp parallel …) concurrently.
  • 52. Work-Share a Region Across the Whole System: #pragma omp acc_region … OR #pragma omp parallel …, spanning the original/master thread, worker threads, the parallel region, and the accelerated region.
  • 53. Scheduling and Load Balancing by Adaptation (eTSAR: Task-Size Adapting Runtime). Measure computational suitability at runtime; compute a new distribution of work through a linear-optimization approach; re-distribute work before each pass through the program phase. Variables: n = number of compute devices; I = total iterations available; ij = iterations for compute unit j; fj = fraction of iterations for compute unit j; pj = recent time/iteration for compute unit j; tj+ (or tj-) = time over (or under) an equal finish. Linear program (as recoverable from the slide): minimize Σj (tj+ + tj-) subject to Σj fj = 1, Σj ij = I, and fj·pj - f1·p1 = tj+ - tj- for j = 2..n. (Fig. 5: linear-program optimization and performance; percentage of time spent on computation vs. number of GPUs for the original, adaptive, and split schedulers, with GPU back-off.)
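When the linear program is satisfied exactly (all tj+ and tj- are zero), every device is predicted to finish at the same moment, so fj·pj is equal across devices and the fractions are simply normalized inverse per-iteration times. A small sketch of that balanced fixed point (names are illustrative, not eTSAR's):

```c
#include <assert.h>
#include <math.h>

/* Balanced work fractions: f_j proportional to 1/p_j, where p_j is
 * the recent time per iteration on device j.  This is the fixed point
 * of the slide's linear program when all t+ and t- vanish, i.e. all
 * devices are predicted to finish simultaneously. */
void balanced_fractions(int n, const double p[], double f[])
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += 1.0 / p[j];
    for (int j = 0; j < n; j++)
        f[j] = (1.0 / p[j]) / sum;   /* fractions sum to 1 */
}
```

The full LP is still needed in practice because iteration counts are integers and per-pass timings are noisy, but this is the solution it converges toward.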
  • 54. Heterogeneous Scheduling: Issues and Solutions.
    – Issue: launch overhead is high on GPUs; with a work queue, many GPU kernels may need to be run. Solution: schedule only at the beginning of a region, so the overhead is paid once or a small number of times.
    – Issue: without a work queue, how do we balance load? The performance of each device must be predicted. Solution: allocate different amounts of work. For the first pass, predict a reasonable initial division of work (we use the ratio between the number of CPU and GPU cores); for subsequent passes, use the performance of previous passes to predict the following ones.
  • 55. Dynamic Ratios. All dynamic schedulers share a common basis: at the end of a pass, a version of the linear program is solved, assuming iterations take the same amount of time on average from one pass to the next. For the general case with an arbitrary number of compute units, we use the linear program of the previous slide; expressed in words, it calculates iteration counts whose predicted times are as close to identical across all devices as possible. For the common two-device case it simplifies to a closed form that can be solved in only a few instructions and produces a result within one iteration: balancing ic′·pc = (it - ic′)·pg gives ic′ = (it·pg) / (pg + pc) (Eq. 6), where ic′ is the number of iterations to assign to the CPU, it is the total iterations of the next pass, and pc (pg) is the recent time per iteration on the CPU (GPU). Testing indicated that neither dynamic scheduler is entirely appropriate in certain cases; thus the other schedulers, split and quick. Dynamic scheduling requires multiple passes through a parallel region to compute a ratio, which works well for applications with many passes; however, some parallel regions execute only a few times, or even just once. Split scheduling addresses these regions: each pass begins a loop that iterates div times, each executing total_tasks/div tasks.
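Equation 6's two-device split can be sketched directly; it assigns the CPU the share of the next pass's iterations that equalizes the two devices' predicted times (the function name is mine):

```c
#include <assert.h>
#include <math.h>

/* Two-device dynamic ratio (Eq. 6 on the slide): balancing
 *   i_c * p_c = (i_t - i_c) * p_g   =>   i_c = i_t * p_g / (p_g + p_c)
 * where p_c and p_g are the recent times per iteration on CPU and GPU
 * and i_t is the total iterations of the next pass. */
int cpu_iterations(int it_total, double pc, double pg)
{
    return (int)(it_total * pg / (pg + pc) + 0.5);  /* round to nearest */
}
```

For example, if the CPU has been taking 1 ms/iteration and the GPU 3 ms/iteration, the CPU gets three quarters of the next pass.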
  • 56. Dynamic Schedulers. Schedulers: dynamic, split, quick. Arguments: ratio (the initial split to use, i.e., the ratio between CPU and GPU performance) and div (how far to subdivide the region).
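A sketch of how the two arguments interact, with semantics pieced together from the diagrams on slides 59-60 and therefore to be read as an assumption: div fixes the chunk size at total/div, and ratio splits each chunk between GPU and CPU, with the ratio re-fitted from measured times between chunks:

```c
#include <assert.h>
#include <math.h>

/* Split-scheduler chunking (assumed semantics): a pass of `total`
 * iterations proceeds in `div` chunks of total/div iterations; each
 * chunk is divided by the current ratio (GPU share).  A real runtime
 * would update `ratio` between chunks from the observed timings. */
void split_chunk(int total, int div, double ratio,
                 int *gpu_iters, int *cpu_iters)
{
    int chunk = total / div;
    *gpu_iters = (int)(chunk * ratio + 0.5);  /* round to nearest */
    *cpu_iters = chunk - *gpu_iters;
}
```

With total = 500, div = 4, and ratio = 0.5, each chunk is 125 iterations split 63/62, matching the first chunk shown in the split-scheduler diagram.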
  • 57. Static Scheduler (ratio: 0.5). Pass 1: 500 iterations split 250/250. Pass 2: 1,000 iterations split 500/500.
  • 58. Dynamic Scheduler (ratio: 0.5). Pass 1: 500 iterations split 250/250. Pass 2: 1,000 iterations split 900/100 based on measured performance.
  • 59. Split Scheduler (ratio: 0.5, div: 4). A 500-iteration pass is divided into four 125-iteration chunks, with a reschedule between chunks.
  • 60. Dynamic vs. Split Scheduler (ratio = 0.5, div = 4). Dynamic: Pass 1, 500 iterations split 250/250; Pass 2, 1,000 iterations split 900/100. Split: the 500-iteration pass starts with a 63/62 chunk and then adapts to 113/12 chunks; the 1,000-iteration pass continues in 113/12 chunks, rescheduling between chunks.
  • 61. Quick Scheduler (ratio: 0.5, div: 4). Pass 1: an initial 125-iteration chunk, then the remaining 375 iterations in one chunk at the adapted ratio, with a single reschedule in between.
  • 62. Dynamic vs. Quick Scheduler (ratio = 0.5, div = 4). Dynamic: Pass 1, 500 iterations split 250/250; Pass 2, 1,000 iterations split 900/100. Quick: Pass 1, an initial 125-iteration chunk split 63/62, then the remaining 375 iterations split 338/37 after a reschedule; Pass 2, 1,000 iterations split 900/100.
  • 63. Benchmarks.
    – NAS CG: many passes (1,800 for the C class); the Conjugate Gradient benchmark from the NAS Parallel Benchmarks.
    – GEM: one pass; molecular modeling that computes the electrostatic potential along the surface of a macromolecule.
    – K-Means: few passes; iterative clustering of points.
    – Helmholtz: few passes, GPU-unsuitable; a Jacobi iterative method implementing the Helmholtz equation.
  • 64. Experimental Setup. System: 12-core AMD Opteron 6174 CPU; NVIDIA Tesla C2050 GPU; Linux 2.6.27.19; CCE compiler with OpenMP accelerator extensions. Procedures: all parameters default unless otherwise specified; results represent 5 or more runs.
  • 65. Results Across Schedulers for All Benchmarks. (Bar chart: speedup over the 12-core CPU for cg, gem, helmholtz, and kmeans under the CPU, GPU, static, dynamic, split, and quick schedulers.)
  • 66. Performance, Programmability, Portability: Roadmap. IEEE IPDPS ’12.
  • 67. An Ecosystem for Heterogeneous Parallel Computing: Roadmap. Much work still to be done.
    – OpenCL and the 13 Dwarfs: beta release pending.
    – Source-to-source translation: CU2CL only, and no optimization.
    – Architecture-aware optimization: only manual optimizations.
    – Performance & power modeling: preliminary and pre-multicore.
    – Affinity-based cost modeling: empirical results; modeling in progress.
    – Heterogeneous task scheduling: preliminary, with OpenMP.
  • 68. An Intra-Node Ecosystem for Heterogeneous Parallel Computing: Applications. Sequence Alignment, Molecular Dynamics, Earthquake Modeling, Neuroinformatics, Avionic Composites. Software Ecosystem. Heterogeneous Parallel Computing (HPC) Platform. … from intra-node to inter-node.
  • 69. Programming CPU-GPU Clusters (e.g., MPI+CUDA). Two nodes, each with a CPU (main memory) and a GPU (device memory), connected by a network; data must be staged through host memory:
      // Rank = 0
      if (rank == 0) {
        cudaMemcpy(host_buf, dev_buf, D2H);
        MPI_Send(host_buf, .. ..);
      }
      // Rank = 1
      if (rank == 1) {
        MPI_Recv(host_buf, .. ..);
        cudaMemcpy(dev_buf, host_buf, H2D);
      }
  • 70. Goal of Programming CPU-GPU Clusters (MPI + Any Accelerator). IEEE HPCC ’12; ACM HPDC ’13. Send directly from any buffer, host or device:
      // Rank = 0
      if (rank == 0) { MPI_Send(any_buf, .. ..); }
      // Rank = 1
      if (rank == 1) { MPI_Recv(any_buf, .. ..); }
  • 71. “Virtualizing” GPUs. Traditional model: on each compute node, the application calls the native OpenCL library through the OpenCL API to use the local physical GPU. Our model: the application links against the VOCL library, which forwards OpenCL API calls over MPI to VOCL proxies on other compute nodes; each proxy invokes its native OpenCL library on its physical GPU, presenting virtual GPUs to the application.
  • 72. Conclusion. An ecosystem for heterogeneous parallel computing: enabling software that tunes parameters of hardware devices with respect to performance, programmability, and portability, via a benchmark suite of dwarfs (i.e., motifs) and mini-apps. Highest-ranked commodity supercomputer in the U.S. on the Green500 (11/11).
  • 73. People.
  • 74. Funding Acknowledgements. Microsoft featured our “Big Data in the Cloud” grant in U.S. Congressional testimony.
  • 75. Wu Feng, wfeng@vt.edu, 540-231-1192. http://synergy.cs.vt.edu/ http://www.chrec.org/ http://www.green500.org/ http://www.mpiblast.org/ http://sss.cs.vt.edu/ “Accelerators ‘R Us”! http://accel.cs.vt.edu/ http://myvice.cs.vt.edu/