Heterogeneous programming

High-level talk on programming models for parallel heterogeneous architectures at the second workshop organized by the NSF-funded Conceptualization of Software Institute for Abstractions and Methodologies for HPC Simulations Codes on Future Architectures, http://flash.uchicago.edu/site/NSF-SI2/

Speaker notes:
  • Inspired the punched cards used in Charles Babbage’s analytical engine (conceived in 1834)
  • Atanasoff-Berry Computer (ABC), 1937–42; Mark 2

    1. State of programming models and code transformations on heterogeneous platforms
       Boyana Norris (norris@mcs.anl.gov)
       - Computer Scientist, Mathematics and Computer Science Division, Argonne National Laboratory
       - Senior Fellow, Computation Institute, University of Chicago
    2. Before there were computers…
       Jacquard Loom, invented in 1801. Programming was:
       – Parallel
       – Pattern-based
       – Multithreaded
    3. (Possibly) the first heterogeneous computer(s)
    4. Outline, goals
       • Parallel programming for heterogeneous architectures
         – Challenges
         – Example approaches
       • Help set the stage for subsequent panel discussions w.r.t. issues related to programming heterogeneous architectures
         – Need your input, please do interrupt
    5. Heterogeneity
       • Hardware heterogeneity (different devices with different capabilities), e.g.:
         – Multicore x86 CPUs with GPUs
         – Multicore x86 CPUs with Intel Phi accelerators
         – big.LITTLE (coupling slower, low-power ARM cores with faster, power-hungry ARM cores)
         – A cluster with different types of nodes
         – x86 CPU with FPGAs (e.g., Convey)
         – …
       • Software heterogeneity (e.g., OS, languages)
         – Not part of this talk
    6. Similarities among heterogeneous platforms
       • Typically each processor has several, and sometimes many, execution units:
         – NVIDIA Fermi GPUs have 16 streaming multiprocessors (SMs);
         – AMD GPUs have 20 or more SIMD units;
         – Intel Phi has >50 x86 cores.
       • Each execution unit typically has SIMD or vector execution:
         – NVIDIA GPUs execute threads in SIMD-like groups of 32 (what NVIDIA calls warps);
         – AMD GPUs execute in wavefronts that are 64 threads wide;
         – Intel Phi has 512-bit-wide SIMD instructions (16 floats or 8 doubles).
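       This shared SIMD structure can be exploited portably from C. Below is a minimal, illustrative sketch (not from the talk) using the standard OpenMP simd directive; the 16-floats-per-instruction figure comes from the Phi bullet above, and the function name is an assumption:

           #include <stddef.h>

           /* Illustrative saxpy kernel: on a 512-bit SIMD unit (e.g., Intel Phi),
            * one vector instruction processes 16 floats, so the compiler can
            * execute this loop in chunks of 16 iterations. */
           void saxpy(size_t n, float a, const float *x, float *y) {
               #pragma omp simd
               for (size_t i = 0; i < n; i++)
                   y[i] = a * x[i] + y[i];
           }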
    7. Many scales
    8. Parallel programming models
       • Bulk synchronous parallelism (BSP)
       • Stream processing
       • Algorithmic skeletons (e.g., master-worker)
       • Workflow/dataflow
       • Remote method invocation
       • Distributed objects
       • Components
       • Functional
       • …
    9. Parallel programming models (cont.)
       • Parallel process interaction
         – Distributed data, exchanged through explicit messages (e.g., MPI)
         – Shared/global memory (e.g., PGAS)
       • Work parallelism
         – SPMD
         – Dataflow
         – Task-based
         – Streaming
         – …
       • Heterogeneous resources
         – Host-directed execution with selected kernels offloaded to co-processor, e.g., MPI + CUDA/OpenCL
         – “Symmetric”, e.g., MPI on x86/Phi systems
    10. Example: Host-directed MPI+X model
        [Image by Yili Zheng, LBL]
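        The figure is not reproduced here, but the host-directed pattern it depicts can be sketched in C. The following is a minimal, illustrative example (the choice of OpenACC as X, the variable names, and the problem size are assumptions, not from the talk): each MPI rank offloads its local kernel to an attached accelerator, while all inter-node communication stays on the host.

            #include <mpi.h>
            #include <stdlib.h>

            #define N 1000000  /* local problem size per rank (illustrative) */

            int main(int argc, char **argv) {
                MPI_Init(&argc, &argv);
                int rank;
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);

                double *x = malloc(N * sizeof *x);
                for (int i = 0; i < N; i++) x[i] = rank + i * 1e-6;

                /* Host-directed offload: the annotated kernel runs on the
                 * accelerator attached to this rank's node. */
                double local = 0.0;
                #pragma acc parallel loop reduction(+:local) copyin(x[0:N])
                for (int i = 0; i < N; i++)
                    local += x[i] * x[i];

                /* Communication stays on the host: combine partial results
                 * across ranks with MPI. */
                double global = 0.0;
                MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

                free(x);
                MPI_Finalize();
                return 0;
            }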
    11. Challenges
        • Managing data
          – Data distribution, movement, replication
          – Load balancing
        • Different processing capabilities (FPUs, clock rates, vector units)
        • Different instruction sets
    12. Software developer’s point of view
        • Important considerations, tradeoffs
          – Initial investment
            • Learning curve, reimplementation
          – Ongoing costs
            • Maintainability, portability
          – Performance
            • Real time, within power constraints, …
          – Life expectancy
            • Architectures, software dependencies
          – Suitability for particular goals
            • Embedded system vs. petaflop machine
          – Agility
            • Ability to exploit new architectures
          – …
    13. Programming model implementations
        • Established:
          – Parallelism expressed through message passing, thread-based shared memory, PGAS languages
          – High-level languages or libraries with APIs that can map to different models, e.g., MPI
          – General-purpose languages with compiler support for exploiting hybrid architectures
          – Small language extensions or annotations embedded in GPLs with compiler or source-transformation tool support, e.g., CUDA Fortran
          – Streaming, e.g., CUDA
        • More recent
        • Extinct, e.g., HPF
    14. Tradeoffs
        [Chart plotting development productivity against scalability, each axis running low to high. Labeled points: low-level languages or APIs with fully explicit parallelism control; sequential GPLs and high-level DSLs; libraries, frameworks; high-level parallel languages.]
    15. Source transformations
        • Typically multiple levels of abstraction and programming models are used simultaneously
        • Goal is to express algorithms at the highest level appropriate for the functionality being implemented
        • A single language or library is unlikely to be best for any given application on all possible hardware
        • One approach:
          – Define algorithms using high-level abstractions
          – Provide tools to translate these into lower-level, possibly architecture-specific implementations
        • Most programming on heterogeneous platforms involves source transformation
    16. Example: Annotation-based approaches
        • Pros: low effort, minimal changes
        • Cons: limited expressivity, performance
        • Examples:
          – MPI + OpenACC directives in a GPL (sketch below)
          – Some embedded DSLs (e.g., as supported by Orio)
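        To make the “low effort, minimal changes” point concrete, an illustrative sketch (function and names are assumptions, not from the slides): in the directive-based style, the only change to an existing C loop is the annotation line, and a compiler that does not recognize the pragma simply ignores it, so the original code remains portable.

            /* Annotation-based offload: the only change to the plain C loop
             * is the directive line; without OpenACC support the pragma is
             * ignored and the loop runs sequentially on the host. */
            void scale(int n, double a, double *restrict v) {
                #pragma acc parallel loop copy(v[0:n])
                for (int i = 0; i < n; i++)
                    v[i] *= a;
            }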
    17. Current limitations
        • Minimally intrusive approaches typically don’t result in the best performance possible, e.g., OpenACC annotations without code restructuring
        • A number of single-platform solutions are provided by vendors (e.g., Intel, NVIDIA); portability or performance on other platforms is not guaranteed
    18. General-purpose programming languages
        • GPLs for parallel, possibly heterogeneous architectures
          – UPC, CAF, Chapel, X10
        • Pros:
          – Robustness (e.g., type safety, memory consistency)
          – Tools (e.g., debugging, performance analysis)
        • Cons:
          – Manual reimplementation required in most cases
          – Hard to balance user control with resource management automation
          – Interoperability
    19. Recall host-directed MPI+X model
        [Image by Yili Zheng, LBL]
    20. PGAS model
        [Image by Yili Zheng, LBL]
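        The PGAS model in the figure can be sketched in UPC, one of the PGAS extensions of C named on slide 18. This is a minimal, illustrative example (array name and sizes are assumptions): data lives in one logically global array, but each thread owns a partition of it, and remote accesses become implicit communication.

            #include <upc.h>

            #define N 1024

            /* One logically global array, physically partitioned across all
             * threads; the [*] layout gives each thread one contiguous block. */
            shared [*] double a[N];

            int main(void) {
                int i;

                /* The affinity expression &a[i] assigns each iteration to the
                 * thread that owns that element, so these writes stay local. */
                upc_forall (i = 0; i < N; i++; &a[i])
                    a[i] = MYTHREAD + i * 0.001;

                upc_barrier;

                /* Any thread may read any element through its global index;
                 * accesses to remote partitions become implicit communication. */
                if (MYTHREAD == 0) {
                    double sum = 0.0;
                    for (i = 0; i < N; i++)
                        sum += a[i];
                }
                return 0;
            }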
    21. High-level frameworks and libraries
        • Domain-specific problem-solving environments and mathematical libraries can encapsulate the specifics of mapping to heterogeneous architectures (e.g., PETSc, Trilinos, Cactus; see the sketch below)
        • Advantages
          – Efficient implementations of common functionality
          – Different levels of APIs to hide or expose different levels of the implementation and runtime (unlike pure language approaches)
          – Relatively rapid support of new hardware
        • Disadvantages
          – Learning curves, deep software dependencies
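        For instance (an illustrative sketch, not from the talk), PETSc’s vector API hides the device mapping behind runtime options: on a GPU-enabled build, the same source can be run with -vec_type cuda to move the vectors, and the kernels that operate on them, onto the GPU with no source changes.

            #include <petscvec.h>

            int main(int argc, char **argv) {
                Vec       x, y;
                PetscReal nrm;

                PetscInitialize(&argc, &argv, NULL, NULL);

                /* VecSetFromOptions lets the storage backend be chosen at run
                 * time (e.g., -vec_type cuda on a CUDA-enabled build). */
                VecCreate(PETSC_COMM_WORLD, &x);
                VecSetSizes(x, PETSC_DECIDE, 1000000);
                VecSetFromOptions(x);
                VecDuplicate(x, &y);

                VecSet(x, 1.0);
                VecSet(y, 2.0);
                VecAXPY(y, 3.0, x);            /* y <- 3*x + y, on CPU or GPU */
                VecNorm(y, NORM_2, &nrm);
                PetscPrintf(PETSC_COMM_WORLD, "||y|| = %g\n", (double)nrm);

                VecDestroy(&x);
                VecDestroy(&y);
                PetscFinalize();
                return 0;
            }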
    22. Ongoing efforts attempting to balance scalability with productivity
        • DOE X-Stack program pursues fundamental advances in programming models, languages, compilers, runtime systems and tools to support the transition of applications to exascale platforms
          – DEGAS (Dynamic, Exascale Global Address Space): a PGAS approach
          – SLEEC (Semantics-rich Libraries for Effective Exascale Computation): annotations and cost models to compile into optimized low-level implementations
          – X-Tune: model-based code generation and optimization of algorithms written in GPLs
          – D-TEC: compilers for both new general-purpose languages and embedding DSLs into other languages
    23. Summary
        • Many traditional programming models can be used on heterogeneous architectures, with vendor support for compilers, libraries and runtimes
        • No clear multi-platform winner programming model/language/framework
        • Many new efforts on deepening the software stack to enable a better balance of programmability, performance, portability
