Towards an Ecosystem for
Heterogeneous Parallel Computing

Wu FENG
Dept. of Computer Science
Dept. of Electrical & Computer Engineering
Virginia Bioinformatics Institute

© W. Feng, November 2013
wfeng@vt.edu, 540.231.1192
synergy.cs.vt.edu
Japanese ‘Computnik’ Earth Simulator
Shatters U.S. Supercomputer Hegemony

Importance of High-Performance Computing (HPC)
in the LARGE and in the Small

Competitive Risk From Not Having Access to HPC

[Pie chart: Could not exist as a business; Could not compete on quality & testing issues; Could not compete on time to market & cost; Could exist and compete (3%). Remaining shares: 47%, 34%, 16%.]

Data from Council on Competitiveness; sponsored survey conducted by IDC.

Only 3% of companies could exist and compete without HPC.
•  200+ participating companies, including many Fortune 500 (Procter & Gamble and biological and chemical companies)
Computnik 2.0?

•  The Second Coming of Computnik? Computnik 2.0?
   –  No … “only” 43% faster than the previous #1 supercomputer, but
      → $20M cheaper than the previous #1 supercomputer
      → 42% less power consumption
•  The Second Coming of the “Beowulf Cluster” for HPC
   –  The further commoditization of HPC
The First Coming of the “Beowulf Cluster”
•  Utilize commodity PCs (with commodity
CPUs) to build a supercomputer

The Second Coming of
the “Beowulf Cluster”
•  Utilize commodity PCs (with commodity
CPUs) to build a supercomputer
+
•  Utilize commodity graphics processing
units (GPUs) to build a supercomputer
Issue: Extracting performance with programming ease and portability → productivity
“Holy Grail” Vision
•  An ecosystem for heterogeneous parallel computing
   –  Enabling software that tunes parameters of hardware devices
      … with respect to performance, programmability, and portability
      … via a benchmark suite of dwarfs (i.e., motifs) and mini-apps

[Figure: highest-ranked commodity supercomputer in U.S. on the Green500 (11/11)]
Design of Composite Structures

Software Ecosystem

Heterogeneous Parallel Computing (HPC) Platform
An Ecosystem for Heterogeneous Parallel Computing

Applications: Sequence Alignment, Molecular Dynamics, Earthquake Modeling, Neuroinformatics, CFD for Mini-Drones

Software Ecosystem

Heterogeneous Parallel Computing (HPC) Platform

… inter-node (in brief), with application to BIG DATA (StreamMR)
  
Performance, Programmability, Portability

Roadmap

Programming GPUs
CUDA
•  NVIDIA’s proprietary framework

OpenCL
•  Open standard for heterogeneous parallel computing
(Khronos Group)
•  Vendor-neutral environment for CPUs, GPUs,
APUs, and even FPGAs

Prevalence

CUDA: first public release, February 2007
OpenCL: first public release, December 2008

Search Date: 2013-03-25
CU2CL:
CUDA-to-OpenCL Source-to-Source Translator†
•  Works as a Clang plug-in to leverage its production-quality
compiler framework.
•  Covers primary CUDA constructs found in CUDA C and
CUDA run-time API.
•  Performs as well as codes manually ported from CUDA to
OpenCL.
See APU’13 Talk (PT-4057) from Tuesday, November 12
•  Automated CUDA-to-OpenCL Translation with CU2CL: What’s Next?
†  “CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures,” 17th IEEE Int’l Conf. on Parallel & Distributed Systems (ICPADS), Dec. 2011.
OpenCL: Write Once, Run Anywhere

CUDA Program → CU2CL (“cuticle”) → OpenCL Program
•  The CUDA program runs only on NVIDIA GPUs.
•  The translated OpenCL program runs on OpenCL-supported CPUs, GPUs, and FPGAs.
  
Application                   CUDA Lines   Lines Manually Changed   % Auto-Translated
bandwidthTest                        891                        5                98.9
BlackScholes                         347                       14                96.0
matrixMul                            351                        9                97.4
vectorAdd                            147                        0               100.0
Back Propagation                     313                       24                92.3
Hotspot                              328                        2                99.4
Needleman-Wunsch                     430                        3                99.3
SRAD                                 541                        0               100.0
Fen Zi: Molecular Dynamics        17,768                    1,796                89.9
GEM: Molecular Modeling              524                       15                97.1
IZ PS: Neural Network              8,402                      166                98.0
Why CU2CL?
•  Much larger # of apps implemented in CUDA than in OpenCL
–  Idea
§  Leverage scientists’ investment in CUDA to drive OpenCL adoption

–  Issues (from the perspective of domain scientists)
§  Writing from Scratch: Learning Curve
(OpenCL is an even lower-level API than CUDA, which is itself low level.)
§  Porting from CUDA: Tedious and Error-Prone

•  Significant demand from major stakeholders (and Apple …)

Why Not CU2CL?
•  Just start with OpenCL?!
•  CU2CL only does source-to-source translation at present
–  Functional (code) portability? Yes.
–  Performance portability? Mostly no.
Performance, Programmability, Portability

Roadmap
IEEE ICPADS ’11
P2S2 ’12
Parallel Computing ‘13


What about
performance portability?

Performance, Programmability, Portability

Roadmap

  
Traditional Approach to Optimizing Compiler
Architecture-Independent
Optimization

Architecture-Aware
Optimization

C. del Mundo and W. Feng. “Towards a Performance-Portable FFT Library for Heterogeneous Computing,” IEEE IPDPS ’14, Phoenix, AZ, USA, May 2014. (Under review.)

C. del Mundo and W. Feng. “Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture,” SC|13, Denver, CO, Nov. 2013.

C. del Mundo, V. Adhinarayanan, and W. Feng. “Accelerating FFT for Wideband Channelization,” IEEE ICC ’13, Budapest, Hungary, June 2013.
Computational Units Not Created Equal
•  “AMD CPU ≠ Intel CPU” and “AMD GPU ≠ NVIDIA GPU”
•  Initial performance of a CUDA-optimized N-body dwarf

[Chart: “architecture-unaware” / “architecture-independent” performance, annotated −32%]
  
Basic & Architecture-Unaware Execution

[Chart: speedup over hand-tuned SSE for basic, architecture-unaware, and architecture-aware implementations on the NVIDIA GTX 280 vs. the AMD 5870; the architecture-aware AMD 5870 reaches 371× vs. 328× on the GTX 280, a 12% gain]
  
Optimization for N-body Molecular Modeling
•  Optimization techniques on AMD GPUs
   –  Removing conditions → kernel splitting
   –  Local staging
   –  Using vector types
   –  Using image memory
•  Speedup over basic OpenCL GPU implementation
   –  Isolated optimizations
   –  Combined optimizations

MT: Max Threads; KS: Kernel Splitting; RA: Register Accumulator; RP: Register Preloading; LM: Local Memory; IM: Image Memory; LU{2,4}: Loop Unrolling {2x,4x}; VASM{2,4}: Vectorized Access & Scalar Math {float2, float4}; VAVM{2,4}: Vectorized Access & Vector Math {float2, float4}
  
AMD-Optimized N-body Dwarf

[Chart annotated −32%]

371× speed-up
•  12% better than NVIDIA GTX 280
  
FFT: Fast Fourier Transform
•  A spectral method that is
a critical building block
across many disciplines

Peak SP Performance & Peak Memory Bandwidth

  
Summary of Optimizations

  
Optimizations
•  RP: Register Preloading
–  All data elements are first preloaded into the register file before use.
–  Computation facilitated solely on registers

•  LM-CM: Local Memory (Communication Only)
–  Data elements are loaded into local memory only for communication
–  Threads swap data elements solely in local memory

•  CGAP: Coalesced Global Access Pattern
–  Threads access memory contiguously

•  VASM{2|4}: Vector Access, Scalar Math, float{2|4}
–  Data elements are loaded as the listed vector type.
–  Arithmetic operations are scalar (float × float).

•  CM-K: Constant Memory for Kernel Arguments
–  The twiddle factors for the FFT’s twiddle-multiplication stage are precomputed on the CPU
and stored in GPU constant memory for fast look-up.
  
Architecture-Optimized FFT

  
Architecture-Optimized FFT

  
Performance, Programmability, Portability

Roadmap

GPU Computing Gems
J. Molecular Graphics & Modeling
IEEE HPCC ’12
IEEE HPDC ‘13
IEEE ICPADS ’11
IEEE ICC ‘13

  
Performance, Programmability, Portability

Roadmap
FUTURE WORK
  
Performance, Programmability, Portability

Roadmap

  
The Berkeley View†

•  Traditional Approach
   –  Applications that target existing hardware and programming models
•  Berkeley Approach
   –  Hardware design that keeps future applications in mind
   –  Basis for future applications? 13 computational dwarfs

A computational dwarf is a pattern of communication & computation that is common across a set of applications.

The 13 dwarfs: Dense Linear Algebra; Sparse Linear Algebra; Spectral Methods; N-Body Methods; Structured Grids; Unstructured Grids; Monte Carlo (→ MapReduce); Combinational Logic; Graph Traversal; Dynamic Programming; Backtrack & Branch+Bound; Graphical Models; Finite State Machine

†  Asanovic, K., et al. The Landscape of Parallel Computing Research: A View from Berkeley. Tech. Rep. UCB/EECS-2006-183, University of California, Berkeley, Dec. 2006.
  
Example of a Computational Dwarf: N-Body
•  Computational Dwarf: pattern of computation & communication that is common across a set of applications
•  N-Body problems are studied in
   –  Cosmology, particle physics, biology, and engineering
•  All have similar structures
•  An N-Body benchmark can provide meaningful insight to people in all these fields
•  Optimizations may be generally applicable as well

[Figures: RoadRunner Universe (astrophysics); GEM (molecular modeling)]
  
First Instantiation: OpenDwarfs
(formerly “OpenCL and the 13 Dwarfs”)

•  Goal
–  Provide common algorithmic methods, i.e., dwarfs, in a language that is
“write once, run anywhere” (CPU, GPU, or even FPGA), i.e., OpenCL

•  Part of a larger umbrella project (2008-2018), funded by the
NSF Center for High-Performance Reconfigurable Computing (CHREC)

  
Performance, Programmability, Portability

ACM ICPE ’12:
OpenCL & the 13 Dwarfs

Roadmap

IEEE Cluster ’11, FPGA ’11,
IEEE ICPADS ’11, SAAHPC ’11

  
Performance, Programmability, Portability

Roadmap

  
Performance & Power Modeling
•  Goals
   –  Robust framework
   –  Very high accuracy (Target: < 5% prediction error)
   –  Identification of portable predictors for performance and power
   –  Multi-dimensional characterization
      §  Performance → sequential, intra-node parallel, inter-node parallel
      §  Power → component level, node level, cluster level
  
Problem Formulation:

LP-Based Energy-Optimal DVFS Schedule
•  Definitions
   –  A DVFS system exports n settings { (fi, Pi) }.
   –  Ti: total execution time of a program running at setting i
•  Given a program with deadline D, find a DVFS schedule (t1*, …, tn*) such that
   –  If the program is executed for ti seconds at setting i, the total energy usage E is minimized, the deadline D is met, and the required work is completed.

  
Single-Coefficient β Performance Model	

•  Our Formulation
–  Define the relative performance slowdown δ as
T(f) / T(fMAX) – 1
–  Re-formulate two-coefficient model
as a single-coefficient model:

–  The coefficient β is computed at run-time using a regression method on the
past MIPS rates reported from the built-in PMU.
C. Hsu and W. Feng. “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005.

  
β – Adaptation on NAS Parallel Benchmarks

C. Hsu and W. Feng.
“A Power-Aware Run-Time System
for High-Performance Computing,”
SC|05, Nov. 2005.
  
Need for Better Performance & Power Modeling
Source: Virginia Tech
  
Search for Reliable Predictors
•  Re-visit performance counters used for prediction
–  Applicability of performance counters across generations of
architecture

•  Performance counters monitored
   –  NPB and PARSEC benchmarks
   –  Platform: Intel Xeon E5645 CPU (Westmere-EP)
   –  Counters: context switches (CS); instructions issued (IIS); instructions completed (TOTINS); L2 data cache misses (L2 DCM); L2 instruction cache misses (L2 ICM); L3 data cache misses (L3 DCM); outstanding bus request cycles (OutReq); instruction queue write cycles (InsQW)
Performance, Programmability, Portability

Roadmap
ACM/IEEE SC ’05
ICPP ’07
IEEE/ACM CCGrid ‘09
IEEE GreenCom ’10
ACM CF ’11
IEEE Cluster ‘11
ISC ’12
IEEE ICPE ‘13

  
Performance, Programmability, Portability

Roadmap

Performance Portability
  
What is Heterogeneous Task Scheduling?
•  Automatically spreading tasks across heterogeneous compute
resources
–  CPUs, GPUs, APUs, FPGAs, DSPs, and so on

•  Specify tasks at a higher level (currently OpenMP extensions)
•  Run them across available resources automatically

Goal
•  A run-time system that intelligently uses whatever resources are available and optimizes for performance portability
   –  Each user should not have to implement this for themselves!
  
Heterogeneous Task Scheduling

  
How to Heterogeneous Task Schedule (HTS)
•  Accelerated OpenMP offers heterogeneous task scheduling with
–  Programmability
–  Functional portability, given underlying compilers
–  Performance portability

•  How?
–  A simple extension to Accelerated OpenMP syntax for programmability
–  Automatically dividing parallel tasks across arbitrary heterogeneous
compute resources for functional portability
§  CPUs
§  GPUs
§  APUs

–  Intelligent runtime task scheduling for performance portability

  
Programmability: Why Accelerated OpenMP?

Traditional OpenMP:
#pragma omp parallel for shared(in1,in2,out,pow)
for (i = 0; i < end; i++) {
  out[i] = in1[i] * in2[i];
  pow[i] = pow[i] * pow[i];
}

OpenMP Accelerator Directives:
#pragma acc_region_loop acc_copyin(in1[0:end],in2[0:end]) acc_copyout(out[0:end]) acc_copy(pow[0:end])
for (i = 0; i < end; i++) {
  out[i] = in1[i] * in2[i];
  pow[i] = pow[i] * pow[i];
}

Our Proposed Extension:
#pragma acc_region_loop acc_copyin(in1[0:end],in2[0:end]) acc_copyout(out[0:end]) acc_copy(pow[0:end]) hetero(<cond>[,<scheduler>[,<ratio>[,<div>]]])
for (i = 0; i < end; i++) {
  out[i] = in1[i] * in2[i];
  pow[i] = pow[i] * pow[i];
}
  
OpenMP Accelerator Behavior

[Diagram: an accelerated region (#pragma omp acc_region …) launches kernels from the original/master thread with an implicit barrier, while a parallel region (#pragma omp parallel …) fans work out to worker threads]
  
DESIRED OpenMP Accelerator Behavior

[Diagram: an accelerated region nested inside a parallel region, e.g., #pragma omp parallel num_threads(2) with #pragma omp acc_region … in one thread and #pragma omp parallel … in the other]
  
Work-share a Region Across the Whole System

[Diagram: the same region, expressed as either #pragma omp acc_region … OR #pragma omp parallel …, runs across the original/master thread, the worker threads, and the accelerator]
  
Scheduling and Load-Balancing by Adaptation

eTSAR: Task-Size Adapting Runtime
•  Measure computational suitability at runtime
•  Compute a new distribution of work through a linear-optimization approach
•  Re-distribute work before each pass through the program phase

Linear-program variables:
   I  = total iterations available
   n  = number of compute devices
   ij = iterations for compute unit j
   fj = fraction of iterations for compute unit j
   pj = recent time/iteration for compute unit j
   tj+ (or tj-) = time over (or under) the equal-finish-time target

Objective and constraints:
   minimize    (t1+ + t1-) + … + (t(n-1)+ + t(n-1)-)
   subject to  f1 + f2 + … + fn = 1
               fj·pj - f1·p1 = tj+ - tj-   for j = 2, …, n
               i1 + i2 + … + in = I

[Fig. 5: Linear-program optimization and performance: percentage of time spent on computation vs. number of GPUs for Original, Adaptive, and Split scheduling, with and without GPU back-off]
  
Heterogeneous Scheduling: Issues and Solutions
•  Issue: Launching overhead is high on GPUs
–  Using a work-queue, many GPU kernels may need to be run

•  Solution: Schedule only at the beginning of a region
–  The overhead is only paid once or a small number of times

•  Issue: Without a work-queue, how do we balance load?
–  The performance of each device must be predicted

•  Solution: Allocate different amounts of work
–  For the first pass, predict a reasonable initial division of work. We use
a ratio between the number of CPU and GPU cores for this.
–  For subsequent passes, use the performance of previous passes to
predict following passes.

  
Dynamic Ratios
•  All dynamic schedulers share a common basis.
•  At the end of a pass, a version of the linear program is solved.
•  For the simplified two-device (CPU + GPU) case, we actually solve Equation 6, which can be computed in only a few instructions and produces a result within one iteration of the full linear program for the common case:

      ic' · pc = (it' - ic') · pg
      ic' = ((it' - ic') · pg) / pc
      ic' + (ic' · pg) / pc = (it' · pg) / pc
      ic' · (pg + pc) = it' · pg
      ic' = (it' · pg) / (pg + pc)        (6)

   where
   –  ic' is the number of iterations to assign to the CPU
   –  it' is the total iterations of the next pass
   –  p* is the recent time per iteration for the (c)pu / (g)pu on the last pass

•  Expressed in words, the linear program calculates iteration counts whose predicted times are as close to identical across all devices as possible. The constraints assume that iterations take the same amount of time, on average, from one pass to the next.
•  Testing indicated that neither the static nor the dynamic scheduler is entirely appropriate in certain cases, so we added two other schedulers: split and quick.
•  Dynamic scheduling requires at least two executions of the parallel region to compute a ratio, which works well for applications with multiple passes. However, some parallel regions run only a few times, or even just once. Split scheduling handles these regions: each pass begins a loop that iterates div times, with each iteration executing totaltasks/div tasks.
Dynamic Schedulers
•  Schedulers
–  Dynamic
–  Split
–  Quick

•  Arguments
–  Ratio
§  Initial split to use, i.e., ratio between CPU and GPU performance

–  Div
§  How far to subdivide the region

  
Static Scheduler (Ratio: 0.5)

•  Every pass of 1,000 iterations is split by the fixed ratio: 500 to the GPU and 500 to the CPU workers (250 each), identically in Pass 1 and Pass 2.

[Legend: Original/Master thread, Worker threads, Parallel region, Accelerated region]
  
Dynamic Scheduler (Ratio: 0.5)

•  Pass 1: 1,000 iterations split 500/500 between GPU and CPU workers (initial ratio 0.5).
•  Pass 2: using Pass 1 timings, the split adapts to 900 GPU / 100 CPU.

[Legend: Original/Master thread, Worker threads, Parallel region, Accelerated region]
  
Split Scheduler (Ratio: 0.5, Div: 4)

•  Each pass of 500 iterations is divided into div = 4 sub-passes of 125 iterations, with a reschedule after each sub-pass.

[Legend: Reschedule, Original/Master thread, Worker threads, Parallel region, Accelerated region]
  
Dynamic vs. Split Scheduler (Ratio = 0.5, Div = 4)

•  Dynamic: Pass 1 splits 1,000 iterations 500/500 between GPU and CPU; Pass 2 adapts to 900 GPU / 100 CPU.
•  Split: each pass is divided into div = 4 sub-passes with a reschedule after each, so the split adapts within Pass 1 rather than only between passes.

[Legend: Reschedule, Original/Master thread, Worker threads, Parallel region, Accelerated region]
Quick Scheduler (Ratio: 0.5, Div: 4)

•  Pass 1: a small calibration chunk (125 of 500 iterations) runs first; after one reschedule, the remaining 375 iterations run with the updated split.
•  Pass 2: scheduled using the ratio learned in Pass 1.

[Legend: Reschedule, Original/Master thread, Worker threads, Parallel region, Accelerated region]
  
Dynamic vs. Quick Scheduler (Ratio = 0.5, Div = 4)

•  Dynamic: adapts only between full passes (500/500 in Pass 1, then 900/100 in Pass 2).
•  Quick: runs a small calibration sub-pass at the start of Pass 1, reschedules once, and completes Pass 1 and subsequent passes with the adapted split, avoiding repeated rescheduling overhead.

[Legend: Reschedule, Original/Master thread, Worker threads, Parallel region, Accelerated region]
Benchmarks
•  NAS CG – Many passes (1,800 for C class)
–  The Conjugate Gradient benchmark from the NAS Parallel
Benchmarks

•  GEM – One pass
–  Molecular Modeling, computes the electrostatic potential along the
surface of a macromolecule

•  K-Means – Few passes
–  Iterative clustering of points

•  Helmholtz – Few passes, GPU Unsuitable
–  Jacobi iterative method implementing the Helmholtz equation

  
Experimental Setup
•  System
   –  12-core AMD Opteron 6174 CPU
   –  NVIDIA Tesla C2050 GPU
   –  Linux 2.6.27.19
   –  CCE compiler with OpenMP accelerator extensions
•  Procedures
   –  All parameters default, unless otherwise specified
   –  Results represent 5 or more runs
  
Results across Schedulers for all Benchmarks

[Chart: speedup over a 12-core CPU for cg, gem, helmholtz, and kmeans under six schedulers (CPU-only, GPU-only, Static, Dynamic, Split, Quick)]
  
Performance, Programmability, Portability

Roadmap

IEEE IPDPS ’12

  
An Ecosystem for Heterogeneous Parallel Computing

Roadmap: much work still to be done.
•  OpenCL and the 13 Dwarfs: beta release pending
•  Source-to-Source Translation: CU2CL only & no optimization
•  Architecture-Aware Optimization: only manual optimizations
•  Performance & Power Modeling: preliminary & pre-multicore
•  Affinity-Based Cost Modeling: empirical results; modeling in progress
•  Heterogeneous Task Scheduling: preliminary, with OpenMP
  
An Ecosystem for Heterogeneous Parallel Computing: Intra-Node

Applications: Sequence Alignment, Molecular Dynamics, Earthquake Modeling, Neuroinformatics, Avionic Composites

Software Ecosystem

Heterogeneous Parallel Computing (HPC) Platform

… from Intra-Node to Inter-Node
  
Programming CPU-GPU Clusters (e.g., MPI+CUDA)

[Diagram: Node 1 and Node 2, each with a GPU (device memory) and a CPU (main memory), connected by a network]

Rank = 0:
if (rank == 0)
{
  cudaMemcpy(host_buf, dev_buf, D2H)
  MPI_Send(host_buf, .. ..)
}

Rank = 1:
if (rank == 1)
{
  MPI_Recv(host_buf, .. ..)
  cudaMemcpy(dev_buf, host_buf, H2D)
}
  
Goal of Programming CPU-GPU Clusters (MPI + Any Acc)
IEEE HPCC ’12
ACM HPDC ‘13

[Diagram: two nodes, each with a GPU (device memory) and a CPU (main memory), connected by a network]

Rank = 0:
if (rank == 0)
{
  MPI_Send(any_buf, .. ..);
}

Rank = 1:
if (rank == 1)
{
  MPI_Recv(any_buf, .. ..);
}
  
“Virtualizing” GPUs …

•  Traditional Model: the application on a compute node calls the native OpenCL library directly and can use only its own node’s physical GPU.
•  Our Model (VOCL): the application links against the VOCL library, which forwards OpenCL API calls over MPI to VOCL proxies on other compute nodes; each proxy calls the native OpenCL library on its local physical GPU, which the application sees as a virtual GPU.
  
Conclusion
•  An ecosystem for heterogeneous parallel computing
   –  Enabling software that tunes parameters of hardware devices
      … with respect to performance, programmability, and portability
      … via a benchmark suite of dwarfs (i.e., motifs) and mini-apps

[Figure: highest-ranked commodity supercomputer in U.S. on the Green500 (11/11)]
  
People

  
Funding Acknowledgements

Microsoft featured our “Big Data in the Cloud”
grant in U.S. Congressional Testimony

  
Wu Feng, wfeng@vt.edu, 540-231-1192

http://synergy.cs.vt.edu/
http://www.chrec.org/
http://www.green500.org/
http://www.mpiblast.org/
http://sss.cs.vt.edu/

“Accelerators ‘R Us”!
http://accel.cs.vt.edu/
http://myvice.cs.vt.edu/
  

MM-4099, Adapting game content to the viewing environment, by Noman HashimMM-4099, Adapting game content to the viewing environment, by Noman Hashim
MM-4099, Adapting game content to the viewing environment, by Noman HashimAMD Developer Central
 

What's hot (20)

PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyPT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
IS-4082, Real-Time insight in Big Data – Even faster using HSA, by Norbert He...
IS-4082, Real-Time insight in Big Data – Even faster using HSA, by Norbert He...IS-4082, Real-Time insight in Big Data – Even faster using HSA, by Norbert He...
IS-4082, Real-Time insight in Big Data – Even faster using HSA, by Norbert He...
 
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac...
 
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin CoumansGS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
 
HC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasHC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu Das
 
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
 
WT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
WT-4073, ANGLE and cross-platform WebGL support, by Shannon WoodsWT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
WT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
 
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
 
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
 
PG-4119, 3D Geometry Compression on GPU, by Jacques Lefaucheux
PG-4119, 3D Geometry Compression on GPU, by Jacques LefaucheuxPG-4119, 3D Geometry Compression on GPU, by Jacques Lefaucheux
PG-4119, 3D Geometry Compression on GPU, by Jacques Lefaucheux
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben Gaster
 
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoMM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
 
MM-4085, Designing a game audio engine for HSA, by Laurent Betbeder
MM-4085, Designing a game audio engine for HSA, by Laurent BetbederMM-4085, Designing a game audio engine for HSA, by Laurent Betbeder
MM-4085, Designing a game audio engine for HSA, by Laurent Betbeder
 
Final lisa opening_keynote_draft_-_v12.1tb
Final lisa opening_keynote_draft_-_v12.1tbFinal lisa opening_keynote_draft_-_v12.1tb
Final lisa opening_keynote_draft_-_v12.1tb
 
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...
 
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosMM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
 
CE-4117, HSA Optimizations and Impact on end User Experiences for AfterShot P...
CE-4117, HSA Optimizations and Impact on end User Experiences for AfterShot P...CE-4117, HSA Optimizations and Impact on end User Experiences for AfterShot P...
CE-4117, HSA Optimizations and Impact on end User Experiences for AfterShot P...
 
MM-4099, Adapting game content to the viewing environment, by Noman Hashim
MM-4099, Adapting game content to the viewing environment, by Noman HashimMM-4099, Adapting game content to the viewing environment, by Noman Hashim
MM-4099, Adapting game content to the viewing environment, by Noman Hashim
 

Viewers also liked

Parallel Computing Application
Parallel Computing ApplicationParallel Computing Application
Parallel Computing Applicationhanis salwan
 
Casual mass parallel computing
Casual mass parallel computingCasual mass parallel computing
Casual mass parallel computingaragozin
 
Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...
Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...
Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...Yahoo Developer Network
 
Introduction to parallel processing
Introduction to parallel processingIntroduction to parallel processing
Introduction to parallel processingPage Maker
 
Applications of paralleL processing
Applications of paralleL processingApplications of paralleL processing
Applications of paralleL processingPage Maker
 
Parallel computing
Parallel computingParallel computing
Parallel computingVinay Gupta
 

Viewers also liked (6)

Parallel Computing Application
Parallel Computing ApplicationParallel Computing Application
Parallel Computing Application
 
Casual mass parallel computing
Casual mass parallel computingCasual mass parallel computing
Casual mass parallel computing
 
Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...
Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...
Apache Hadoop India Summit 2011 talk "Adaptive Parallel Computing over Distri...
 
Introduction to parallel processing
Introduction to parallel processingIntroduction to parallel processing
Introduction to parallel processing
 
Applications of paralleL processing
Applications of paralleL processingApplications of paralleL processing
Applications of paralleL processing
 
Parallel computing
Parallel computingParallel computing
Parallel computing
 

Similar to HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...
Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...
Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...David Meyer
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Larry Smarr
 
Gc vit sttp cc december 2013
Gc vit sttp cc december 2013Gc vit sttp cc december 2013
Gc vit sttp cc december 2013Seema Shah
 
Software Defined Networking: The OpenDaylight Project
Software Defined Networking: The OpenDaylight ProjectSoftware Defined Networking: The OpenDaylight Project
Software Defined Networking: The OpenDaylight ProjectGreat Wide Open
 
OpenACC and Open Hackathons Monthly Highlights May 2023.pdf
OpenACC and Open Hackathons Monthly Highlights May  2023.pdfOpenACC and Open Hackathons Monthly Highlights May  2023.pdf
OpenACC and Open Hackathons Monthly Highlights May 2023.pdfOpenACC
 
Future of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native worldFuture of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native worldSrivatsan Srinivasan
 
FPGA-based soft-processors: 6G nodes and post-quantum security in space
 FPGA-based soft-processors: 6G nodes and post-quantum security in space FPGA-based soft-processors: 6G nodes and post-quantum security in space
FPGA-based soft-processors: 6G nodes and post-quantum security in spaceFacultad de Informática UCM
 
Indiana University's Advanced Science Gateway Support
Indiana University's Advanced Science Gateway SupportIndiana University's Advanced Science Gateway Support
Indiana University's Advanced Science Gateway Supportmarpierc
 
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsGokhan Boranalp
 
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...Rafael Ferreira da Silva
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Lecture_IIITD.pptx
Lecture_IIITD.pptxLecture_IIITD.pptx
Lecture_IIITD.pptxachakracu
 
OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC
 
Adaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog PotdarAdaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog PotdarSuyog Potdar
 
An optimized scientific workflow scheduling in cloud computing
An optimized scientific workflow scheduling in cloud computingAn optimized scientific workflow scheduling in cloud computing
An optimized scientific workflow scheduling in cloud computingDIGVIJAY SHINDE
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intromarpierc
 
OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC
 
Locationless data science on a modern secure edge
Locationless data science on a modern secure edgeLocationless data science on a modern secure edge
Locationless data science on a modern secure edgeJohn Archer
 

Similar to HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng (20)

Netsoft19 Keynote: Fluid Network Planes
Netsoft19 Keynote: Fluid Network PlanesNetsoft19 Keynote: Fluid Network Planes
Netsoft19 Keynote: Fluid Network Planes
 
Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...
Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...
Introduction to OpenDaylight and Hydrogen, Learnings from the Year, What's Ne...
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
 
Gc vit sttp cc december 2013
Gc vit sttp cc december 2013Gc vit sttp cc december 2013
Gc vit sttp cc december 2013
 
Software Defined Networking: The OpenDaylight Project
Software Defined Networking: The OpenDaylight ProjectSoftware Defined Networking: The OpenDaylight Project
Software Defined Networking: The OpenDaylight Project
 
OpenACC and Open Hackathons Monthly Highlights May 2023.pdf
OpenACC and Open Hackathons Monthly Highlights May  2023.pdfOpenACC and Open Hackathons Monthly Highlights May  2023.pdf
OpenACC and Open Hackathons Monthly Highlights May 2023.pdf
 
Future of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native worldFuture of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native world
 
FPGA-based soft-processors: 6G nodes and post-quantum security in space
 FPGA-based soft-processors: 6G nodes and post-quantum security in space FPGA-based soft-processors: 6G nodes and post-quantum security in space
FPGA-based soft-processors: 6G nodes and post-quantum security in space
 
Indiana University's Advanced Science Gateway Support
Indiana University's Advanced Science Gateway SupportIndiana University's Advanced Science Gateway Support
Indiana University's Advanced Science Gateway Support
 
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed Systems
 
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Lecture_IIITD.pptx
Lecture_IIITD.pptxLecture_IIITD.pptx
Lecture_IIITD.pptx
 
OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020
 
Scaling hadoopapplications
Scaling hadoopapplicationsScaling hadoopapplications
Scaling hadoopapplications
 
Adaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog PotdarAdaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog Potdar
 
An optimized scientific workflow scheduling in cloud computing
An optimized scientific workflow scheduling in cloud computingAn optimized scientific workflow scheduling in cloud computing
An optimized scientific workflow scheduling in cloud computing
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
 
OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019
 
Locationless data science on a modern secure edge
Locationless data science on a modern secure edgeLocationless data science on a modern secure edge
Locationless data science on a modern secure edge
 

More from AMD Developer Central

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsAMD Developer Central
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesAMD Developer Central
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAMD Developer Central
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceAMD Developer Central
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozAMD Developer Central
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellAMD Developer Central
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonAMD Developer Central
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornAMD Developer Central
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevAMD Developer Central
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasAMD Developer Central
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...AMD Developer Central
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14AMD Developer Central
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14AMD Developer Central
 

More from AMD Developer Central (20)

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math Libraries
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.js
 
Media SDK Webinar 2014
Media SDK Webinar 2014Media SDK Webinar 2014
Media SDK Webinar 2014
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
 
DirectGMA on AMD’S FirePro™ GPUS
DirectGMA on AMD’S  FirePro™ GPUSDirectGMA on AMD’S  FirePro™ GPUS
DirectGMA on AMD’S FirePro™ GPUS
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop Intelligence
 
Inside XBox- One, by Martin Fuller
Inside XBox- One, by Martin FullerInside XBox- One, by Martin Fuller
Inside XBox- One, by Martin Fuller
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas Thibieroz
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodes
 
Inside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin FullerInside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin Fuller
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan Nevraev
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
 

Recently uploaded

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

  • 1. Towards an Ecosystem for Heterogeneous Parallel Computing. Wu Feng, Dept. of Computer Science, Dept. of Electrical & Computer Engineering, Virginia Bioinformatics Institute. © W. Feng, November 2013. wfeng@vt.edu, 540.231.1192. synergy.cs.vt.edu
  • 2. Japanese ‘Computnik’ Earth Simulator Shatters U.S. Supercomputer Hegemony
  • 3. Importance of High-Performance Computing (HPC) in the Large and in the Small. Competitive risk from not having access to HPC: could exist and compete, 3%; could not exist as a business, 34%; could not compete on quality & testing issues, 47%; could not compete on time to market & cost, 16%. Data from the Council on Competitiveness; sponsored survey conducted by IDC. Only 3% of companies could exist and compete without HPC. 200+ participating companies, including many Fortune 500 firms (Procter & Gamble, biological and chemical companies).
  • 4. Computnik 2.0?
    – The Second Coming of Computnik? Computnik 2.0? No … “only” 43% faster than the previous #1 supercomputer, but → $20M cheaper than the previous #1 supercomputer and → 42% less power consumption.
    – The Second Coming of the “Beowulf Cluster” for HPC: the further commoditization of HPC.
  • 5. The First Coming of the “Beowulf Cluster”: utilize commodity PCs (with commodity CPUs) to build a supercomputer. The Second Coming of the “Beowulf Cluster”: utilize commodity PCs (with commodity CPUs) plus commodity graphics processing units (GPUs) to build a supercomputer. Issue: extracting performance with programming ease and portability → productivity.
  • 6. “Holy Grail” Vision: an ecosystem for heterogeneous parallel computing, i.e., enabling software that tunes parameters of hardware devices with respect to performance, programmability, and portability, via a benchmark suite of dwarfs (i.e., motifs) and mini-apps. Highest-ranked commodity supercomputer in the U.S. on the Green500 (11/11).
  • 7. Design of Composite Structures. Software Ecosystem. Heterogeneous Parallel Computing (HPC) Platform.
  • 8. An Ecosystem for Heterogeneous Parallel Computing: Applications. Sequence Alignment, Molecular Dynamics, Earthquake Modeling, Neuroinformatics, CFD for Mini-Drones. Software Ecosystem. Heterogeneous Parallel Computing (HPC) Platform. … inter-node (in brief) with application to BIG DATA (StreamMR).
  • 9. Performance, Programmability, Portability: Roadmap.
  • 10. Programming GPUs. CUDA: NVIDIA’s proprietary framework. OpenCL: open standard for heterogeneous parallel computing (Khronos Group); a vendor-neutral environment for CPUs, GPUs, APUs, and even FPGAs.
  • 11. Prevalence. CUDA: first public release February 2007. OpenCL: first public release December 2008. Search date: 2013-03-25.
  • 12. CU2CL: CUDA-to-OpenCL Source-to-Source Translator†. Works as a Clang plug-in to leverage its production-quality compiler framework. Covers the primary CUDA constructs found in CUDA C and the CUDA run-time API. Performs as well as codes manually ported from CUDA to OpenCL. See the APU’13 talk (PT-4057) from Tuesday, November 12: “Automated CUDA-to-OpenCL Translation with CU2CL: What’s Next?” † “CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures,” 17th IEEE Int’l Conf. on Parallel & Distributed Systems (ICPADS), Dec. 2011.
  • 13. OpenCL: Write Once, Run Anywhere. A CUDA program passes through CU2CL (“cuticle”) to produce an OpenCL program, which runs on OpenCL-supported CPUs, GPUs, and FPGAs as well as NVIDIA GPUs.
  • 14. CU2CL translation results by application:
      Application                   CUDA Lines   Lines Manually Changed   % Auto-Translated
      bandwidthTest                       891                 5                 98.9
      BlackScholes                        347                14                 96.0
      matrixMul                           351                 9                 97.4
      vectorAdd                           147                 0                100.0
      Back Propagation                    313                24                 92.3
      Hotspot                             328                 2                 99.4
      Needleman-Wunsch                    430                 3                 99.3
      SRAD                                541                 0                100.0
      Fen Zi: Molecular Dynamics       17,768             1,796                 89.9
      GEM: Molecular Modeling             524                15                 97.1
      IZ PS: Neural Network             8,402               166                 98.0
  • 15. Why CU2CL? A much larger number of apps is implemented in CUDA than in OpenCL. Idea: leverage scientists’ investment in CUDA to drive OpenCL adoption. Issues (from the perspective of domain scientists): writing from scratch has a learning curve (OpenCL is too low-level an API compared to CUDA; CUDA is also low level), and porting from CUDA is tedious and error-prone. Significant demand from major stakeholders (and Apple …). Why not CU2CL? Just start with OpenCL?! CU2CL only does source-to-source translation at present: functional (code) portability? Yes. Performance portability? Mostly no.
  • 16. Performance, Programmability, Portability: Roadmap. IEEE ICPADS ’11; P2S2 ’12; Parallel Computing ’13. What about performance portability?
  • 17. Performance, Programmability, Portability: Roadmap.
  • 18. Traditional Approach to an Optimizing Compiler: architecture-independent optimization, then architecture-aware optimization. C. del Mundo, W. Feng. “Towards a Performance-Portable FFT Library for Heterogeneous Computing,” IEEE IPDPS ’14, Phoenix, AZ, USA, May 2014 (under review). C. del Mundo, W. Feng. “Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture,” SC|13, Denver, CO, Nov. 2013. C. del Mundo, V. Adhinarayanan, W. Feng. “Accelerating FFT for Wideband Channelization,” IEEE ICC ’13, Budapest, Hungary, June 2013.
  • 19. Computational Units Are Not Created Equal. “AMD CPU ≠ Intel CPU” and “AMD GPU ≠ NVIDIA GPU.” Initial performance of a CUDA-optimized N-body dwarf: -32%, i.e., “architecture-unaware” or “architecture-independent.”
  • 20. Basic & Architecture-Unaware Execution. Speedup over hand-tuned SSE rises from 88 and 163 (basic) to 192 and 224 (architecture-unaware) to 328 and 371 (architecture-aware), comparing the NVIDIA GTX 280 and the AMD 5870; in the architecture-aware case the AMD 5870 (371×) beats the GTX 280 (328×) by 12%.
  • 21. Optimization for N-Body Molecular Modeling.
    – Optimization techniques on AMD GPUs (isolated): removing conditionals via kernel splitting, local staging, using vector types, using image memory.
    – Speedup over the basic OpenCL GPU implementation: isolated optimizations and combined optimizations.
    Key: MT: Max Threads; KS: Kernel Splitting; RA: Register Accumulator; RP: Register Preloading; LM: Local Memory; IM: Image Memory; LU{2,4}: Loop Unrolling {2x, 4x}; VASM{2,4}: Vectorized Access & Scalar Math {float2, float4}; VAVM{2,4}: Vectorized Access & Vector Math {float2, float4}.
  • 22. AMD-Optimized N-Body Dwarf: from -32% initially to a 371× speedup, 12% better than the NVIDIA GTX 280.
  • 23. FFT: Fast Fourier Transform. A spectral method that is a critical building block across many disciplines.
  • 24. Peak SP Performance & Peak Memory Bandwidth.
  • 25. Summary of Optimizations.
  • 26. Optimizations.
    – RP: Register Preloading. All data elements are first preloaded into the register file before use; computation is performed solely on registers.
    – LM-CM: Local Memory (Communication Only). Data elements are loaded into local memory only for communication; threads swap data elements solely in local memory.
    – CGAP: Coalesced Global Access Pattern. Threads access memory contiguously.
    – VASM{2|4}: Vector Access, Scalar Math, float{2|4}. Data elements are loaded as the listed vector type; arithmetic operations are scalar (float × float).
    – CM-K: Constant Memory for Kernel Arguments. The twiddle-multiplication stage of the FFT is precomputed on the CPU and stored in GPU constant memory for fast look-up.
  • 27. Architecture-Optimized FFT.
  • 28. Architecture-Optimized FFT.
  • 29. Performance, Programmability, Portability: Roadmap. GPU Computing Gems; J. Molecular Graphics & Modeling; IEEE HPCC ’12; IEEE HPDC ’13; IEEE ICPADS ’11; IEEE ICC ’13.
  • 30. Performance, Programmability, Portability: Roadmap. FUTURE WORK.
  • 31. Performance, Programmability, Portability: Roadmap.
  • 32. The Berkeley View†.
    – Traditional approach: applications that target existing hardware and programming models.
    – Berkeley approach: hardware design that keeps future applications in mind. Basis for future applications? 13 computational dwarfs. A computational dwarf is a pattern of communication & computation that is common across a set of applications: Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods, N-Body Methods, Structured Grids, Unstructured Grids, Monte Carlo → MapReduce, Combinational Logic, Graph Traversal, Dynamic Programming, Backtrack & Branch+Bound, Graphical Models, Finite State Machine.
    † Asanovic, K., et al. The Landscape of Parallel Computing Research: A View from Berkeley. Tech. Rep. UCB/EECS-2006-183, University of California, Berkeley, Dec. 2006.
  • 33. Example of a Computational Dwarf: N-Body. A computational dwarf is a pattern of computation & communication that is common across a set of applications. N-Body problems are studied in cosmology, particle physics, biology, and engineering; all have similar structures. An N-Body benchmark can provide meaningful insight to people in all these fields, and optimizations may be generally applicable as well. Examples: RoadRunner Universe (astrophysics); GEM (molecular modeling).
  • 34. First Instantiation: OpenDwarfs (formerly “OpenCL and the 13 Dwarfs”). Goal: provide common algorithmic methods, i.e., dwarfs, in a language that is “write once, run anywhere” (CPU, GPU, or even FPGA), i.e., OpenCL. Part of a larger umbrella project (2008-2018), funded by the NSF Center for High-Performance Reconfigurable Computing (CHREC).
  • 35. Performance, Programmability, Portability: Roadmap. ACM ICPE ’12: OpenCL & the 13 Dwarfs; IEEE Cluster ’11; FPGA ’11; IEEE ICPADS ’11; SAAHPC ’11.
  • 36. Performance, Programmability, Portability: Roadmap.
  • 37. Performance & Power Modeling. Goals: a robust framework; very high accuracy (target: < 5% prediction error); identification of portable predictors for performance and power; multi-dimensional characterization (performance → sequential, intra-node parallel, inter-node parallel; power → component level, node level, cluster level).
  • 38. Problem Formulation: LP-Based Energy-Optimal DVFS Schedule. Definitions: a DVFS system exports n settings { (fi, Pi) }; Ti is the total execution time of a program running at setting i. Given a program with deadline D, find a DVFS schedule (t1*, …, tn*) such that, if the program is executed for ti seconds at setting i, the total energy usage E is minimized, the deadline D is met, and the required work is completed.
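For intuition, the two-setting case of this formulation has a closed form: run t1 seconds at the faster setting and t2 at the slower one so that t1 + t2 = D and the cycles completed equal the required work W. This is a minimal sketch under those assumptions (the names and the cycles-as-work convention are mine, not the paper's):

```c
#include <assert.h>
#include <math.h>

/* Two-setting DVFS schedule (a minimal sketch).  Settings (f1, P1)
 * and (f2, P2) with f1 > f2; work W in cycles, deadline D in seconds.
 * Solving  t1 + t2 = D  and  f1*t1 + f2*t2 = W  gives the unique
 * schedule that finishes exactly at the deadline, with energy
 * E = P1*t1 + P2*t2. */
typedef struct { double t1, t2, energy; } schedule_t;

schedule_t dvfs_schedule(double f1, double P1,
                         double f2, double P2,
                         double W, double D)
{
    schedule_t s;
    s.t1 = (W - f2 * D) / (f1 - f2);  /* time at the faster setting */
    s.t2 = D - s.t1;                  /* remainder at the slower one */
    s.energy = P1 * s.t1 + P2 * s.t2;
    return s;
}
```

With n settings the same balance becomes the linear program on the slide; the two-setting case just makes the deadline constraint and the work constraint solvable by hand.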
  • 39. Single-Coefficient β Performance Model. Our formulation: define the relative performance slowdown δ as T(f) / T(fMAX) - 1, and re-formulate the two-coefficient model as a single-coefficient model. The coefficient β is computed at run time using a regression method on the past MIPS rates reported by the built-in PMU. C. Hsu and W. Feng. “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005.
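A sketch of how such a single-coefficient model is used; the exact functional form below, δ(f) = β·(fMAX/f - 1), is my reading of the Hsu-Feng formulation and should be treated as an assumption:

```c
#include <assert.h>
#include <math.h>

/* Assumed single-coefficient beta model:
 *   delta(f) = T(f)/T(fmax) - 1 = beta * (fmax/f - 1)
 * beta near 1: CPU-bound code, slows proportionally with the clock;
 * beta near 0: e.g. memory-bound code, nearly insensitive to f. */
double predicted_slowdown(double beta, double fmax, double f)
{
    return beta * (fmax / f - 1.0);
}

/* Invert the model: lowest frequency whose predicted slowdown
 * stays within a target delta. */
double frequency_for_slowdown(double beta, double fmax, double delta)
{
    return fmax * beta / (beta + delta);
}
```

A run-time system can re-fit β from recent PMU readings and then call frequency_for_slowdown with the user's slowdown budget to pick the DVFS setting.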
  • 40. β-Adaptation on the NAS Parallel Benchmarks. C. Hsu and W. Feng. “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005.
  • 41. Need for Better Performance & Power Modeling. Source: Virginia Tech.
  • 43. Search for Reliable Predictors. Re-visit the performance counters used for prediction: how applicable are performance counters across generations of an architecture? Performance counters monitored on the NPB and PARSEC benchmarks; platform: Intel Xeon E5645 CPU (Westmere-EP). Counters: context switches (CS), instructions issued (IIS), L3 data cache misses (L3 DCM), L2 data cache misses (L2 DCM), L2 instruction cache misses (L2 ICM), instructions completed (TOTINS), outstanding bus request cycles (OutReq), instruction queue write cycles (InsQW).
  • 44. Performance, Programmability, Portability: Roadmap. ACM/IEEE SC ’05; ICPP ’07; IEEE/ACM CCGrid ’09; IEEE GreenCom ’10; ACM CF ’11; IEEE Cluster ’11; ISC ’12; IEEE ICPE ’13.
  • 45. Performance, Programmability, Portability: Roadmap. Performance Portability.
  • 46. What Is Heterogeneous Task Scheduling? Automatically spreading tasks across heterogeneous compute resources: CPUs, GPUs, APUs, FPGAs, DSPs, and so on. Specify tasks at a higher level (currently OpenMP extensions) and run them across the available resources automatically. Goal: a run-time system that intelligently uses whatever resources are available and optimizes for performance portability; each user should not have to implement this for themselves!
  • 47. Heterogeneous Task Scheduling.
  • 48. How to Heterogeneously Task Schedule (HTS). Accelerated OpenMP offers heterogeneous task scheduling with programmability, functional portability (given the underlying compilers), and performance portability. How? A simple extension to Accelerated OpenMP syntax for programmability; automatic division of parallel tasks across arbitrary heterogeneous compute resources (CPUs, GPUs, APUs) for functional portability; and intelligent runtime task scheduling for performance portability.
  • 49. Programmability: Why Accelerated OpenMP?
    – Traditional OpenMP:
      #pragma omp parallel for shared(in1,in2,out,pow)
      for (i = 0; i < end; i++) {
        out[i] = in1[i] * in2[i];
        pow[i] = pow[i] * pow[i];
      }
    – OpenMP accelerator directives:
      #pragma acc_region_loop acc_copyin(in1[0:end],in2[0:end]) acc_copyout(out[0:end]) acc_copy(pow[0:end])
      for (i = 0; i < end; i++) {
        out[i] = in1[i] * in2[i];
        pow[i] = pow[i] * pow[i];
      }
    – Our proposed extension:
      #pragma acc_region_loop acc_copyin(in1[0:end],in2[0:end]) acc_copyout(out[0:end]) acc_copy(pow[0:end]) hetero(<cond>[,<scheduler>[,<ratio>[,<div>]]])
      for (i = 0; i < end; i++) {
        out[i] = in1[i] * in2[i];
        pow[i] = pow[i] * pow[i];
      }
  • 50. OpenMP Accelerator Behavior. The original/master thread forks worker threads in a parallel region (#pragma omp parallel …) and launches kernels in an accelerated region (#pragma omp acc_region …), with an implicit barrier at the end of each.
  • 51. DESIRED OpenMP Accelerator Behavior. #pragma omp parallel num_threads(2), with one thread running the accelerated region (#pragma omp acc_region …) while the other runs a CPU parallel region (#pragma omp parallel …) concurrently.
  • 52. Work-Share a Region Across the Whole System: #pragma omp acc_region … OR #pragma omp parallel …, spanning the original/master thread, worker threads, the parallel region, and the accelerated region.
  • 53. Scheduling and Load Balancing by Adaptation (eTSAR: Task-Size Adapting Runtime). Measure computational suitability at runtime; compute a new distribution of work through a linear-optimization approach; re-distribute work before each pass through the program phase. Variables: n = number of compute devices; I = total iterations available; ij = iterations for compute unit j; fj = fraction of iterations for compute unit j; pj = recent time/iteration for compute unit j; tj+ (or tj-) = time over (or under) an equal finish. Linear program (as recoverable from the slide): minimize Σj (tj+ + tj-) subject to Σj fj = 1, Σj ij = I, and fj·pj - f1·p1 = tj+ - tj- for j = 2..n. (Fig. 5: linear-program optimization and performance; percentage of time spent on computation vs. number of GPUs for the original, adaptive, and split schedulers, with GPU back-off.)
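When the linear program is satisfied exactly (all tj+ and tj- are zero), every device is predicted to finish at the same moment, so fj·pj is equal across devices and the fractions are simply normalized inverse per-iteration times. A small sketch of that balanced fixed point (names are illustrative, not eTSAR's):

```c
#include <assert.h>
#include <math.h>

/* Balanced work fractions: f_j proportional to 1/p_j, where p_j is
 * the recent time per iteration on device j.  This is the fixed point
 * of the slide's linear program when all t+ and t- vanish, i.e. all
 * devices are predicted to finish simultaneously. */
void balanced_fractions(int n, const double p[], double f[])
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += 1.0 / p[j];
    for (int j = 0; j < n; j++)
        f[j] = (1.0 / p[j]) / sum;   /* fractions sum to 1 */
}
```

The full LP is still needed in practice because iteration counts are integers and per-pass timings are noisy, but this is the solution it converges toward.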
  • 54. Heterogeneous Scheduling: Issues and Solutions.
    – Issue: launch overhead is high on GPUs; with a work queue, many GPU kernels may need to be run. Solution: schedule only at the beginning of a region, so the overhead is paid once or a small number of times.
    – Issue: without a work queue, how do we balance load? The performance of each device must be predicted. Solution: allocate different amounts of work. For the first pass, predict a reasonable initial division of work (we use the ratio between the number of CPU and GPU cores); for subsequent passes, use the performance of previous passes to predict the following ones.
  • 55. Dynamic Ratios. All dynamic schedulers share a common basis: at the end of a pass, a version of the linear program is solved, assuming iterations take the same amount of time on average from one pass to the next. For the general case with an arbitrary number of compute units, we use the linear program of the previous slide; expressed in words, it calculates iteration counts whose predicted times are as close to identical across all devices as possible. For the common two-device case it simplifies to a closed form that can be solved in only a few instructions and produces a result within one iteration: balancing ic′·pc = (it - ic′)·pg gives ic′ = (it·pg) / (pg + pc) (Eq. 6), where ic′ is the number of iterations to assign to the CPU, it is the total iterations of the next pass, and pc (pg) is the recent time per iteration on the CPU (GPU). Testing indicated that neither dynamic scheduler is entirely appropriate in certain cases; thus the other schedulers, split and quick. Dynamic scheduling requires multiple passes through a parallel region to compute a ratio, which works well for applications with many passes; however, some parallel regions execute only a few times, or even just once. Split scheduling addresses these regions: each pass begins a loop that iterates div times, each executing total_tasks/div tasks.
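Equation 6's two-device split can be sketched directly; it assigns the CPU the share of the next pass's iterations that equalizes the two devices' predicted times (the function name is mine):

```c
#include <assert.h>
#include <math.h>

/* Two-device dynamic ratio (Eq. 6 on the slide): balancing
 *   i_c * p_c = (i_t - i_c) * p_g   =>   i_c = i_t * p_g / (p_g + p_c)
 * where p_c and p_g are the recent times per iteration on CPU and GPU
 * and i_t is the total iterations of the next pass. */
int cpu_iterations(int it_total, double pc, double pg)
{
    return (int)(it_total * pg / (pg + pc) + 0.5);  /* round to nearest */
}
```

For example, if the CPU has been taking 1 ms/iteration and the GPU 3 ms/iteration, the CPU gets three quarters of the next pass.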
  • 56. Dynamic Schedulers. Schedulers: dynamic, split, quick. Arguments: ratio (the initial split to use, i.e., the ratio between CPU and GPU performance) and div (how far to subdivide the region).
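A sketch of how the two arguments interact, with semantics pieced together from the diagrams on slides 59-60 and therefore to be read as an assumption: div fixes the chunk size at total/div, and ratio splits each chunk between GPU and CPU, with the ratio re-fitted from measured times between chunks:

```c
#include <assert.h>
#include <math.h>

/* Split-scheduler chunking (assumed semantics): a pass of `total`
 * iterations proceeds in `div` chunks of total/div iterations; each
 * chunk is divided by the current ratio (GPU share).  A real runtime
 * would update `ratio` between chunks from the observed timings. */
void split_chunk(int total, int div, double ratio,
                 int *gpu_iters, int *cpu_iters)
{
    int chunk = total / div;
    *gpu_iters = (int)(chunk * ratio + 0.5);  /* round to nearest */
    *cpu_iters = chunk - *gpu_iters;
}
```

With total = 500, div = 4, and ratio = 0.5, each chunk is 125 iterations split 63/62, matching the first chunk shown in the split-scheduler diagram.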
  • 57. Static Scheduler (ratio: 0.5). Pass 1: 500 iterations split 250/250. Pass 2: 1,000 iterations split 500/500.
  • 58. Dynamic Scheduler (ratio: 0.5). Pass 1: 500 iterations split 250/250. Pass 2: 1,000 iterations split 900/100 based on measured performance.
  • 59. Split Scheduler (ratio: 0.5, div: 4). A 500-iteration pass is divided into four 125-iteration chunks, with a reschedule between chunks.
  • 60. Dynamic vs. Split Scheduler (ratio = 0.5, div = 4). Dynamic: Pass 1, 500 iterations split 250/250; Pass 2, 1,000 iterations split 900/100. Split: the 500-iteration pass starts with a 63/62 chunk and then adapts to 113/12 chunks; the 1,000-iteration pass continues in 113/12 chunks, rescheduling between chunks.
  • 61. Quick Scheduler (ratio: 0.5, div: 4). Pass 1: an initial 125-iteration chunk, then the remaining 375 iterations in one chunk at the adapted ratio, with a single reschedule in between.
  • 62. Dynamic vs. Quick Scheduler (ratio = 0.5, div = 4). Dynamic: Pass 1, 500 iterations split 250/250; Pass 2, 1,000 iterations split 900/100. Quick: Pass 1, an initial 125-iteration chunk split 63/62, then the remaining 375 iterations split 338/37 after a reschedule; Pass 2, 1,000 iterations split 900/100.
  • 63. Benchmarks.
    – NAS CG: many passes (1,800 for the C class); the Conjugate Gradient benchmark from the NAS Parallel Benchmarks.
    – GEM: one pass; molecular modeling that computes the electrostatic potential along the surface of a macromolecule.
    – K-Means: few passes; iterative clustering of points.
    – Helmholtz: few passes, GPU-unsuitable; a Jacobi iterative method implementing the Helmholtz equation.
  • 64. Experimental Setup. System: 12-core AMD Opteron 6174 CPU; NVIDIA Tesla C2050 GPU; Linux 2.6.27.19; CCE compiler with OpenMP accelerator extensions. Procedures: all parameters default unless otherwise specified; results represent 5 or more runs.
  • 65. Results Across Schedulers for All Benchmarks. (Bar chart: speedup over the 12-core CPU for cg, gem, helmholtz, and kmeans under the CPU, GPU, static, dynamic, split, and quick schedulers.)
  • 66. Performance, Programmability, Portability: Roadmap. IEEE IPDPS ’12.
  • 67. An Ecosystem for Heterogeneous Parallel Computing: Roadmap. Much work still to be done.
    – OpenCL and the 13 Dwarfs: beta release pending.
    – Source-to-source translation: CU2CL only, and no optimization.
    – Architecture-aware optimization: only manual optimizations.
    – Performance & power modeling: preliminary and pre-multicore.
    – Affinity-based cost modeling: empirical results; modeling in progress.
    – Heterogeneous task scheduling: preliminary, with OpenMP.
  • 68. An Intra-Node Ecosystem for Heterogeneous Parallel Computing: Applications. Sequence Alignment, Molecular Dynamics, Earthquake Modeling, Neuroinformatics, Avionic Composites. Software Ecosystem. Heterogeneous Parallel Computing (HPC) Platform. … from intra-node to inter-node.
  • 69. Programming CPU-GPU Clusters (e.g., MPI+CUDA). Two nodes, each with a CPU (main memory) and a GPU (device memory), connected by a network; data must be staged through host memory:
      // Rank = 0
      if (rank == 0) {
        cudaMemcpy(host_buf, dev_buf, D2H);
        MPI_Send(host_buf, .. ..);
      }
      // Rank = 1
      if (rank == 1) {
        MPI_Recv(host_buf, .. ..);
        cudaMemcpy(dev_buf, host_buf, H2D);
      }
  • 70. Goal of Programming CPU-GPU Clusters (MPI + Any Accelerator). IEEE HPCC ’12; ACM HPDC ’13. Send directly from any buffer, host or device:
      // Rank = 0
      if (rank == 0) { MPI_Send(any_buf, .. ..); }
      // Rank = 1
      if (rank == 1) { MPI_Recv(any_buf, .. ..); }
  • 71. “Virtualizing” GPUs. Traditional model: on each compute node, the application calls the native OpenCL library through the OpenCL API to use the local physical GPU. Our model: the application links against the VOCL library, which forwards OpenCL API calls over MPI to VOCL proxies on other compute nodes; each proxy invokes its native OpenCL library on its physical GPU, presenting virtual GPUs to the application.
  • 72. Conclusion. An ecosystem for heterogeneous parallel computing: enabling software that tunes parameters of hardware devices with respect to performance, programmability, and portability, via a benchmark suite of dwarfs (i.e., motifs) and mini-apps. Highest-ranked commodity supercomputer in the U.S. on the Green500 (11/11).
  • 73. People.
  • 74. Funding Acknowledgements. Microsoft featured our “Big Data in the Cloud” grant in U.S. Congressional testimony.
  • 75. Wu Feng, wfeng@vt.edu, 540-231-1192. http://synergy.cs.vt.edu/ http://www.chrec.org/ http://www.green500.org/ http://www.mpiblast.org/ http://sss.cs.vt.edu/ “Accelerators ‘R Us”! http://accel.cs.vt.edu/ http://myvice.cs.vt.edu/