Heterogeneous programming


High-level talk on programming models for parallel heterogeneous architectures at the second workshop organized by the NSF-funded Conceptualization of Software Institute for Abstractions and Methodologies for HPC Simulations Codes on Future Architectures, http://flash.uchicago.edu/site/NSF-SI2/

  • The Jacquard loom inspired the punched cards used in Charles Babbage’s Analytical Engine (conceived in 1834)
  • Atanasoff-Berry Computer (ABC), 1937-42; Mark 2
  • sneetah
  • Heterogeneous programming

    1. 1. State of programming models and code transformations on heterogeneous platforms
          Boyana Norris, norris@mcs.anl.gov
          – Computer Scientist, Mathematics and Computer Science Division, Argonne National Laboratory
          – Senior Fellow, Computation Institute, University of Chicago
    2. 2. Before there were computers…
          • Jacquard Loom, invented in 1801
          • Programming was
            – Parallel
            – Pattern-based
            – Multithreaded
    3. 3. (Possibly) the first heterogeneous computer(s)
    4. 4. Outline, goals
          • Parallel programming for heterogeneous architectures
            – Challenges
            – Example approaches
          • Help set the stage for subsequent panel discussions on issues related to programming heterogeneous architectures
            – Need your input, please do interrupt
    5. 5. Heterogeneity
          • Hardware heterogeneity (different devices with different capabilities), e.g.:
            – Multicore x86 CPUs with GPUs
            – Multicore x86 CPUs with Intel Phi accelerators
            – big.LITTLE (coupling slower, low-power ARM cores with faster, power-hungry ARM cores)
            – A cluster with different types of nodes
            – x86 CPU with FPGAs (e.g., Convey)
            – …
          • Software heterogeneity (e.g., OS, languages)
            – Not part of this talk
    6. 6. Similarities among heterogeneous platforms
          • Typically each processor has several, and sometimes many, execution units
            – NVIDIA Fermi GPUs have up to 16 Streaming Multiprocessors (SMs);
            – AMD GPUs have 20 or more SIMD units;
            – Intel Phi has >50 x86 cores
          • Each execution unit typically has SIMD or vector execution.
            – NVIDIA GPUs execute threads in SIMD-like groups of 32 (what NVIDIA calls warps);
            – AMD GPUs execute in wavefronts that are 64 threads wide;
            – Intel Phi has 512-bit wide SIMD instructions (16 floats or 8 doubles).
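
The widths quoted on this slide follow from simple arithmetic, which the short sketch below spells out. The function names are made up for illustration and the thread count of 1000 is an arbitrary example, not a figure from the talk.

```python
import math


def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one SIMD register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


def groups_needed(num_threads: int, group_width: int) -> int:
    """Number of warps or wavefronts needed to cover num_threads.

    The last group may be only partially filled.
    """
    return math.ceil(num_threads / group_width)


# 512-bit SIMD registers, as on Intel Phi.
float_lanes = simd_lanes(512, 32)    # single precision: 16 lanes
double_lanes = simd_lanes(512, 64)   # double precision: 8 lanes

# An arbitrary example of 1000 threads mapped onto fixed-width groups.
warps = groups_needed(1000, 32)       # NVIDIA warps of 32 threads
wavefronts = groups_needed(1000, 64)  # AMD wavefronts of 64 threads
```

A thread count that is not a multiple of the group width leaves lanes idle in the last group, which is one reason problem sizes are often padded to a multiple of 32 or 64.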
    7. 7. Many scales
    8. 8. Parallel programming models
          • Bulk synchronous parallelism (BSP)
          • Stream processing
          • Algorithmic skeletons (e.g., master-worker)
          • Workflow/dataflow
          • Remote method invocation
          • Distributed objects
          • Components
          • Functional
          • …
    9. 9. Parallel programming models (cont.)
          • Parallel process interaction
            – Distributed data, exchanged through explicit messages (e.g., MPI)
            – Shared/global memory (e.g., PGAS)
          • Work parallelism
            – SPMD
            – Dataflow
            – Task-based
            – Streaming
            – …
          • Heterogeneous resources
            – Host-directed execution with selected kernels offloaded to co-processor, e.g., MPI + CUDA/OpenCL
            – “Symmetric”, e.g., MPI on x86/Phi systems
    10. 10. Example: Host-directed MPI+X model (image by Yili Zheng, LBL)
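
The host-directed pattern can be illustrated with a toy simulation. This is only a sketch: `FakeDevice` is a hypothetical stand-in for an accelerator with its own memory, not a real CUDA or OpenCL binding, and it runs everything on the CPU. What it tries to show is the control flow: the host copies inputs to the device, launches a kernel, and copies the result back.

```python
class FakeDevice:
    """Hypothetical stand-in for a co-processor with a separate memory space."""

    def __init__(self):
        self.memory = {}          # named device buffers
        self.elements_moved = 0   # elements copied in either direction

    def copy_in(self, name, host_data):
        """Explicit host-to-device transfer."""
        self.memory[name] = list(host_data)
        self.elements_moved += len(host_data)

    def launch(self, kernel, out_name, *in_names):
        """Apply kernel elementwise to device buffers; the result stays on the device."""
        inputs = [self.memory[n] for n in in_names]
        self.memory[out_name] = [kernel(*vals) for vals in zip(*inputs)]

    def copy_out(self, name):
        """Explicit device-to-host transfer."""
        data = list(self.memory[name])
        self.elements_moved += len(data)
        return data


def saxpy_kernel(x, y, a=2.0):
    """Per-element work offloaded to the device: a * x + y."""
    return a * x + y


def host_program(x, y):
    """The host drives everything: transfers, kernel launch, and result retrieval."""
    dev = FakeDevice()
    dev.copy_in("x", x)
    dev.copy_in("y", y)
    dev.launch(saxpy_kernel, "out", "x", "y")
    return dev.copy_out("out"), dev.elements_moved


result, moved = host_program([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

Even in this toy version, three elements of useful output required moving nine elements across the host-device boundary, which hints at why data movement features so prominently among the challenges.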
    11. 11. Challenges
            • Managing data
              – Data distribution, movement, replication
              – Load balancing
            • Different processing capabilities (FPUs, clock rates, vector units)
            • Different instruction sets
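
One simple way to picture load balancing across devices with different processing capabilities is to split the work in proportion to each device's throughput. The sketch below assumes throughput is known in advance and constant, which is rarely true in practice; the function names and the 1:4 speed ratio are illustrative assumptions.

```python
def partition_by_throughput(num_items, throughputs):
    """Split num_items across devices in proportion to their relative throughput.

    Uses largest-remainder rounding so the counts always sum to num_items.
    """
    if not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("throughputs must be a non-empty list of positive numbers")
    total = sum(throughputs)
    shares = [num_items * t / total for t in throughputs]
    counts = [int(s) for s in shares]  # round down first
    leftover = num_items - sum(counts)
    # Hand the leftover items to the devices with the largest fractional parts.
    by_fraction = sorted(range(len(shares)),
                         key=lambda i: shares[i] - counts[i],
                         reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


def makespan(counts, throughputs):
    """Time until the slowest device finishes, in items / (items per unit time)."""
    return max(c / t for c, t in zip(counts, throughputs))


# Hypothetical node: a CPU and an accelerator that is 4x faster on this kernel.
speeds = [1.0, 4.0]
balanced = partition_by_throughput(1000, speeds)
balanced_time = makespan(balanced, speeds)
naive_time = makespan([500, 500], speeds)
```

With these assumed speeds, an even split leaves the accelerator idle while the CPU finishes, whereas the proportional split lets both devices finish at about the same time.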
    12. 12. Software developer’s point of view
            • Important considerations, tradeoffs
              – Initial investment
                • Learning curve, reimplementation
              – Ongoing costs
                • Maintainability, portability
              – Performance
                • Real time, within power constraints, …
              – Life expectancy
                • Architectures, software dependencies
              – Suitability for particular goals
                • Embedded system vs. petaflop machine
              – Agility
                • Ability to exploit new architectures
              – …
    13. 13. Programming model implementations
            • Established:
              – Parallelism expressed through message passing, thread-based shared memory, PGAS languages
              – High-level languages or libraries with APIs that can map to different models, e.g., MPI
              – General-purpose languages with compiler support for exploiting hybrid architectures
              – Small language extensions or annotations embedded in GPLs with compiler or source transformation tool support, e.g., CUDA Fortran
              – Streaming, e.g., CUDA
            • More recent
            • Extinct, e.g., HPF
    14. 14. Tradeoffs
            [Figure: scalability versus development productivity, each ranging from low to high. Labeled regions: sequential GPLs and high-level DSLs; low-level languages or APIs with fully explicit parallelism control; libraries and frameworks; high-level parallel languages.]
    15. 15. Source transformations
            • Typically multiple levels of abstraction and programming models are used simultaneously
            • Goal is to express algorithms at the highest level appropriate for the functionality being implemented
            • A single language or library is unlikely to be best for any given application on all possible hardware
            • One approach:
              – Define algorithms using high-level abstractions
              – Provide tools to translate these into lower-level, possibly architecture-specific implementations
            • Most programming on heterogeneous platforms involves source transformation
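
The "define at a high level, translate to a lower level" approach can be sketched with a toy generator that turns one elementwise expression into C loop source for different targets. This is an illustrative assumption, not how any tool named in the talk works; `generate_c_loop` and its parameters are made up. The two pragmas it emits are standard OpenMP and OpenACC directives, but the generated OpenACC loop omits the data clauses that real code would often need.

```python
# Target-specific annotation placed in front of the generated loop.
PRAGMAS = {
    "serial": None,
    "openmp": "#pragma omp parallel for",
    "openacc": "#pragma acc parallel loop",
}


def generate_c_loop(out, expr, n="n", target="serial"):
    """Emit C source for `out[i] = expr` over i in [0, n) for the chosen target."""
    if target not in PRAGMAS:
        raise ValueError(f"unknown target: {target!r}")
    lines = []
    if PRAGMAS[target] is not None:
        lines.append(PRAGMAS[target])
    lines.append(f"for (int i = 0; i < {n}; ++i) {{")
    lines.append(f"    {out}[i] = {expr};")
    lines.append("}")
    return "\n".join(lines)


# One high-level description, three lower-level variants.
spec = {"out": "y", "expr": "a * x[i] + y[i]"}
variants = {t: generate_c_loop(spec["out"], spec["expr"], target=t)
            for t in PRAGMAS}
print(variants["openacc"])
```

The algorithm is stated once in `spec`; only the translation step knows about the target. Real source-transformation tools go much further, for example by restructuring loops, but the division of labor is the same.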
    16. 16. Example: Annotation-based approaches
            • Pros: low effort, minimal changes
            • Cons: limited expressivity, performance
            • Examples:
              – MPI + OpenACC directives in a GPL
              – Some embedded DSLs (e.g., as supported by Orio)
    17. 17. Current limitations
            • Minimally intrusive approaches typically don’t result in the best performance possible, e.g., OpenACC annotations without code restructuring
            • A number of single-platform solutions are provided by vendors (e.g., Intel, NVIDIA); portability or performance on other platforms is not guaranteed
    18. 18. General-purpose programming languages
            • GPLs for parallel, possibly heterogeneous architectures
              – UPC, CAF, Chapel, X10
            • Pros:
              – Robustness (e.g., type safety, memory consistency)
              – Tools (e.g., debugging, performance analysis)
            • Cons:
              – Manual reimplementation required in most cases
              – Hard to balance user control with resource management automation
              – Interoperability
    19. 19. Recall host-directed MPI+X model (image by Yili Zheng, LBL)
    20. 20. PGAS model (image by Yili Zheng, LBL)
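
The core idea of a partitioned global address space, a single index space in which every element has an owning place, can be sketched in a few lines. This is a single-process simulation under stated assumptions: it uses a plain block distribution (only one of the layouts PGAS languages offer), and `GlobalArray`, `block_owner`, and the remote-access counter are hypothetical names, not any language's API.

```python
def block_owner(global_index, n, num_places):
    """Map a global index to (owning place, local index) under a block distribution."""
    if not 0 <= global_index < n:
        raise IndexError(global_index)
    block = -(-n // num_places)  # ceiling of n / num_places
    return global_index // block, global_index % block


class GlobalArray:
    """Toy globally addressable array whose storage is partitioned across places."""

    def __init__(self, n, num_places):
        self.n = n
        self.num_places = num_places
        block = -(-n // num_places)
        # Each place holds one contiguous block; trailing places may hold fewer or none.
        self.local = [[0] * max(0, min(block, n - p * block))
                      for p in range(num_places)]
        self.remote_accesses = 0  # accesses that would need communication

    def _locate(self, i, from_place):
        owner, local_index = block_owner(i, self.n, self.num_places)
        if owner != from_place:
            self.remote_accesses += 1
        return owner, local_index

    def read(self, i, from_place):
        owner, local_index = self._locate(i, from_place)
        return self.local[owner][local_index]

    def write(self, i, value, from_place):
        owner, local_index = self._locate(i, from_place)
        self.local[owner][local_index] = value


ga = GlobalArray(10, 4)           # 10 elements over 4 places: blocks of 3, 3, 3, 1
ga.write(9, 42, from_place=0)     # element 9 lives on place 3: a remote write
value = ga.read(9, from_place=3)  # the same element read by its owner: local
```

Any place can name any element, but the cost differs depending on who owns it. Exposing that distinction to the programmer is what separates PGAS from a flat shared-memory view.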
    21. 21. High-level frameworks and libraries
            • Domain-specific problem-solving environments and mathematical libraries can encapsulate the specifics of mapping to heterogeneous architectures (e.g., PETSc, Trilinos, Cactus)
            • Advantages
              – Efficient implementations of common functionality
              – Different levels of APIs to hide or expose different levels of the implementation and runtime (unlike pure language approaches)
              – Relatively rapid support of new hardware
            • Disadvantages
              – Learning curves, deep software dependencies
    22. 22. Ongoing efforts attempting to balance scalability with productivity
            • DOE X-Stack program pursues fundamental advances in programming models, languages, compilers, runtime systems and tools to support the transition of applications to exascale platforms
              – DEGAS (Dynamic, Exascale Global Address Space): a PGAS approach
              – SLEEC (Semantics-rich Libraries for Effective Exascale Computation): annotations and cost models to compile into optimized low-level implementations
              – X-Tune: model-based code generation and optimization of algorithms written in GPLs
              – D-TEC: compilers for both new general-purpose languages and embedding DSLs into other languages
    23. 23. Summary
            • Many traditional programming models can be used on heterogeneous architectures, with vendor support for compilers, libraries and runtimes
            • No clear multi-platform winner among programming models, languages, and frameworks
            • Many new efforts on deepening the software stack to enable a better balance of programmability, performance, and portability