
Preparing OpenSHMEM for Exascale


In this video from the 2015 Stanford HPC Conference, Pavel Shamis from ORNL presents: Preparing OpenSHMEM for Exascale.

"OpenSHMEM is a partitioned global address space (PGAS) one-sided communications library that enables remote memory access (RMA) across processing elements (PEs). Its API allows data to be transferred from one PE memory space to another PE’s symmetric memory space; decoupling the data transfers from synchronizations. OpenSHMEM is useful for applications that are latency driven or that have irregular communication patterns, because its one-sided API can be mapped very efficiently to hardware (e.g. RDMA interconnects, etc), and its one-sided programming model helps the overlapping of communication with computation. Summit is Oak Ridge National Laboratory’s next high performance supercomputer system that will be based on a many core/GPU hybrid architecture. In order to prepare OpenSHMEM for future systems, it is important to enhance its programming model to enable efficient utilization of the new hardware capabilities (e.g. massive multithreaded systems, accesses different type memories, next generation of interconnects, etc). This session will present recent advances in the area of OpenSHMEM extensions, implementations, and tools.”

Watch the video: http://insidehpc.com/2015/02/video-preparing-openshmem-for-exascale/

See more talks in the Stanford HPC Conference Video Gallery: http://wp.me/P3RLHQ-dOO


Preparing OpenSHMEM for Exascale

  1. Oak Ridge National Laboratory, Computing and Computational Sciences. HPC Advisory Council Stanford Conference, California, Feb 2, 2015. Preparing OpenSHMEM for Exascale. Presented by: Pavel Shamis (Pasha)
  2. Outline • CORAL overview – Summit • What is OpenSHMEM? • Preparing OpenSHMEM for Exascale – recent advances
  3. CORAL • CORAL – collaboration of ORNL, ANL, LLNL • Objective – procure three leadership computers to be sited at Argonne, Oak Ridge, and Lawrence Livermore in 2017 – two of the contracts have been awarded, with the Argonne contract in process • Leadership computers – RFP requests >100 PF, 2 GB/core main memory, local NVRAM, and science performance 5x-10x Titan or Sequoia
  4. The Road to Exascale • Jaguar (2010): 2.3 PF, multi-core CPU, 7 MW • Titan (2012): 27 PF, hybrid GPU/CPU, 9 MW • Summit (2017, CORAL system): 5-10x Titan, hybrid GPU/CPU, 10 MW • OLCF5 (2022): 5-10x Summit, ~20 MW. Since clock-rate scaling ended in 2003, HPC performance has been achieved through increased parallelism. Jaguar scaled to 300,000 cores. Titan and beyond deliver hierarchical parallelism with very powerful nodes: MPI plus thread-level parallelism through OpenACC or OpenMP, plus vectors.
  5. System Summary – Summit: 5x-10x Titan, 10 MW • Interconnect: Mellanox dual-rail EDR InfiniBand • IBM POWER with NVLink; NVIDIA Volta with HBM and NVLink • Compute node: POWER architecture processor, NVIDIA Volta, NVMe-compatible PCIe 800 GB SSD, >512 GB HBM + DDR4 coherent shared memory • Compute rack: standard 19", warm-water cooling
  6. Summit vs. Titan – how does Summit compare to Titan?
     Feature | Summit | Titan
     Application performance | 5-10x Titan | Baseline
     Number of nodes | ~3,400 | 18,688
     Node performance | >40 TF | 1.4 TF
     Memory per node | >512 GB (HBM + DDR4) | 38 GB (GDDR5 + DDR3)
     NVRAM per node | 800 GB | 0
     Node interconnect | NVLink (5-12x PCIe 3) | PCIe 2
     System interconnect (node injection bandwidth) | Dual-rail EDR-IB (23 GB/s) | Gemini (6.4 GB/s)
     Interconnect topology | Non-blocking fat tree | 3D torus
     Processors | IBM POWER9 + NVIDIA Volta | AMD Opteron + NVIDIA Kepler
     File system | 120 PB, 1 TB/s, GPFS | 32 PB, 1 TB/s, Lustre
     Peak power consumption | 10 MW | 9 MW
     Source: "Present and Future Leadership Computers at OLCF", Buddy Bland, https://www.olcf.ornl.gov/wp-content/uploads/2014/12/OLCF-User-Group-Summit-12-3-2014.pdf
  7. Challenges for Programming Models • Very powerful compute nodes – hybrid architecture – multiple CPUs/GPUs – different types of memory • Must be fun to program ;-) – MPI + X
  8. What is OpenSHMEM? • Communication library and interface specification that implements a Partitioned Global Address Space (PGAS) programming model • Processing Element (PE): an OpenSHMEM process • Symmetric objects have the same address (or offset) on all PEs. (Diagram: on every PE, global/static variables and the symmetric heap are remotely accessible symmetric data objects, while local variables are private data objects; X = shmalloc(sizeof(long)) allocates the symmetric variable X on all PEs.)
  9. OpenSHMEM Operations • Remote memory Put and Get – void shmem_getmem(void *target, const void *source, size_t len, int pe); – void shmem_putmem(void *target, const void *source, size_t len, int pe); • Remote memory atomic operations – void shmem_int_add(int *target, int value, int pe); • Collectives – broadcast, reductions, etc. • Synchronization operations – point-to-point – global • Ordering operations • Distributed lock operations. (Same symmetric-memory diagram as slide 8.)
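     A minimal sketch of the one-sided put and point-to-point synchronization operations listed above (not from the slides; assumes OpenSHMEM 1.2 naming and at least two PEs):

        #include <shmem.h>
        #include <stdio.h>

        /* Symmetric (global) data objects: same address on every PE */
        static long data = 0;
        static int  flag = 0;

        int main(void) {
            shmem_init();
            int me = shmem_my_pe();
            if (shmem_n_pes() < 2) { shmem_finalize(); return 0; }

            if (me == 0) {
                long value = 42;
                shmem_long_put(&data, &value, 1, 1);   /* one-sided put into PE 1's 'data' */
                shmem_fence();                         /* order the put before the flag update */
                shmem_int_p(&flag, 1, 1);              /* point-to-point notification */
            } else if (me == 1) {
                shmem_int_wait_until(&flag, SHMEM_CMP_EQ, 1);  /* wait for the notification */
                printf("PE 1 received %ld\n", data);
            }

            shmem_barrier_all();                       /* global synchronization */
            shmem_finalize();
            return 0;
        }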
  10. OpenSHMEM Code Example (code shown on the slide as an image with numbered callouts 1-4; not captured in the transcript)
  11. OpenSHMEM Code Example • You just learned to program OpenSHMEM! – Library initialization – AMO/PUT/GET – Synchronization – Done (the callouts 1-4 refer to regions of the code image; a reconstructed sketch follows below)
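     The slide's code image did not survive transcription. The following is a hedged reconstruction, not the slide's actual code, of a comparably minimal program covering the four steps above (initialization, AMO/PUT/GET, synchronization, done), using OpenSHMEM 1.2 naming:

        #include <shmem.h>
        #include <stdio.h>

        int main(void) {
            shmem_init();                                  /* 1: library initialization */
            int me = shmem_my_pe();
            int npes = shmem_n_pes();

            long *counter = (long *)shmem_malloc(sizeof(long));  /* symmetric allocation */
            *counter = 0;
            shmem_barrier_all();

            shmem_long_add(counter, 1, 0);                 /* 2: AMO - each PE increments PE 0's counter */
            shmem_barrier_all();                           /* 3: synchronization */

            long result = shmem_long_g(counter, 0);        /* 2: GET - read the counter from PE 0 */
            printf("PE %d of %d sees counter = %ld\n", me, npes, result);

            shmem_free(counter);
            shmem_finalize();                              /* 4: done (finalize was added in OpenSHMEM 1.2) */
            return 0;
        }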
  12. OpenSHMEM • OpenSHMEM is a one-sided communications library – C and Fortran API – uses symmetric data objects to efficiently communicate across processes • Advantages: – good for irregular applications and latency-driven communication (random memory access patterns) – maps really well to hardware/interconnects:
      OpenSHMEM feature | InfiniBand (Mellanox) | Gemini/Aries (Cray)
      RMA PUT/GET | yes | yes
      Atomics | yes | yes
      Collectives | yes | yes
  13. OpenSHMEM Key Principles • Keep it simple – the specification is only ~80 pages • Keep it fast – as close as possible to the hardware
  14. Evolution of OpenSHMEM (timeline: 1990s, 2011, 2012, 2013, 2015) • SHMEM library introduced by Cray Research Inc. (T3D systems) • Adapted by SGI for products based on the NUMAlink architecture and included in the Message Passing Toolkit (MPT) • Vendor-specific SHMEM libraries emerge (Quadrics, HP, IBM, Mellanox, Intel, gpSHMEM, SiCortex, etc.) • OpenSHMEM is born – ORNL and UH come together to address the differences between various SHMEM implementations • OSSS signed the SHMEM trademark licensing agreement • OpenSHMEM 1.0 is finalized • OpenSHMEM 1.0 reference implementation, V&V, tools • OpenSHMEM 1.1 released mid 2014 • OpenSHMEM 1.2; 2015 onwards, the next OpenSHMEM specifications: faster, more predictable, more agile. OpenSHMEM is a living specification!
  15. OpenSHMEM – Roadmap • OpenSHMEM v1.1 (June 2014) – errata, bug fixes – ratified (100+ tickets resolved) • OpenSHMEM v1.2 (early 2015) – API naming convention – finalize(), global_exit() – consistent data type support – version information – clarifications: zero-length, wait – shmem_ptr() • OpenSHMEM v1.5 (late 2015) – non-blocking communication semantics (RMA, AMO) – teams, groups – thread safety • OpenSHMEM v1.6 – non-blocking collectives • OpenSHMEM v1.7 – thread safety update • OpenSHMEM Next Generation (2.0) – Let's go wild!!! (Exascale!) – active set + memory context – fault tolerance – exit codes – locality – I/O • White paper: OpenSHMEM Tools API
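     Non-blocking RMA later entered the specification as the shmem_*_nbi routines (OpenSHMEM 1.3). A minimal sketch, assuming that API rather than whatever the v1.5-era proposal finally specified: operations are issued without waiting and are completed in bulk by shmem_quiet().

        #include <shmem.h>
        #include <stdio.h>

        static long dst[128];          /* symmetric; assumes at most 128 PEs */

        int main(void) {
            shmem_init();
            int me = shmem_my_pe();
            int npes = shmem_n_pes();
            long my_value = me;

            /* Issue one non-blocking put per target PE without waiting in between */
            for (int pe = 0; pe < npes; pe++)
                shmem_putmem_nbi(&dst[me], &my_value, sizeof(long), pe);

            shmem_quiet();             /* completes all outstanding non-blocking operations */
            shmem_barrier_all();       /* now dst[i] == i on every PE */

            if (me == 0) printf("dst[%d] = %ld\n", npes - 1, dst[npes - 1]);
            shmem_finalize();
            return 0;
        }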
  16. OpenSHMEM Community Today • Academia • Vendors • Government (logo collage)
  17. OpenSHMEM Implementations • Proprietary – SGI SHMEM – Cray SHMEM – IBM SHMEM – HP SHMEM – Mellanox Scalable SHMEM • Legacy – Quadrics SHMEM • Open source – OpenSHMEM Reference Implementation (UH) – Portals SHMEM – OSHMPI / OpenSHMEM over MPI (under development) – OpenSHMEM with Open MPI – OpenSHMEM with MVAPICH (OSU) – TSHMEM (UFL) – GatorSHMEM (UFL)
  18. OpenSHMEM Eco-system – OpenSHMEM Reference Implementation, OpenSHMEM Analyzer, Vampir (logo diagram)
  19. OpenSHMEM Eco-system • OpenSHMEM Specification – http://www.openshmem.org/site/Downloads/Source • Vampir – https://www.vampir.eu • TAU – http://www.cs.uoregon.edu/research/tau/home.php • DDT – www.allinea.com/products/ddt • OpenSHMEM Analyzer – https://www.openshmem.org/OSA • UCCS – http://uccs.github.io/uccs/
  20. Upcoming Challenges for OpenSHMEM • Based on what we know about the upcoming architecture… • Hybrid architecture – communication across different components of the system – locality of resources • Multiple CPUs/GPUs – thread safety (without performance sacrifices) – thread locality – scalability • Different types of memory – address spaces
  21. Hybrid Architecture Challenges and Ideas • OpenSHMEM for accelerators • "TOC-Centric Communication: a case study with NVSHMEM", OUG/PGAS 2014, Shreeram Potluri – http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf – preliminary study, prototype concept
  22. NVSHMEM • The problem – communication across GPUs requires synchronization with the host • Software overheads, hardware overhead of launching kernels, etc. • Research idea/concept proposed by NVIDIA – GPU-initiated communication – NVSHMEM communication primitives: nvshmem_put(), nvshmem_get() to/from remote GPU memory – emulated using CUDA IPC (CUDA 4.2) • Traditional approach: Loop { Interior compute (kernel launch); Pack boundaries (kernel launch); Stream synchronize; Exchange (MPI/OpenSHMEM); Unpack boundaries (kernel launch); Boundary compute (kernel launch); Stream/device synchronize } – kernel launch overheads, CPU-based blocking synchronization. The slide is based on "TOC-Centric Communication: a case study with NVSHMEM", OUG/PGAS 2014, Shreeram Potluri, http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf
  23. NVSHMEM • Stencil update: u[i][j] = u[i][j] + (v[i+1][j] + v[i-1][j] + v[i][j+1] + v[i][j-1])/x • (Plot: preliminary results – time per step (usec) versus stencil size, 64 to 2K, comparing the traditional approach with a persistent kernel.) Evaluation results from: "TOC-Centric Communication: a case study with NVSHMEM", OUG/PGAS 2014, Shreeram Potluri, http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf
  24. Many-Core System Challenges • It is challenging to provide high-performance THREAD_MULTIPLE support – locks / atomic operations in the communication path • Even though the MPI IMB benchmarks benefit from full process memory separation, multi-threaded UCCS obtains comparable performance. Aurelien Bouteiller, Thomas Herault and George Bosilca, "A Multithreaded Communication Substrate for OpenSHMEM", OUG 2014
  25. Many-Core System Challenges – "Old" Ideas • SHMEM_PTR (or SHMEM_LOCAL_PTR on Cray). (Diagram: on PE 0, Y = shmem_ptr(&X, PE1) returns, via memory mapping, a pointer to PE 1's symmetric variable X.)
  26. Many-Core System Challenges – "Old" Ideas • Provides direct access to a "remote" PE's element with memory load and store operations • Supported on systems where SHMEM_PUT/GET are implemented with memory load and store operations – usually implemented using XPMEM (https://code.google.com/p/xpmem/) • Gabriele Jost, Ulf R. Hanebutte, James Dinan, "OpenSHMEM with Threads: A Bad Idea?" • http://www.csm.ornl.gov/OpenSHMEM2014/documents/talk6_jost_OUG14.pdf
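     A minimal sketch of the shmem_ptr() idea (not from the slides): where the target PE's symmetric memory is mapped into the caller's address space (e.g. via XPMEM), the returned pointer can be dereferenced with plain loads and stores; shmem_ptr() returns NULL when no such mapping exists.

        #include <shmem.h>
        #include <stdio.h>

        static long X = 0;   /* symmetric: exists at the same offset on every PE */

        int main(void) {
            shmem_init();
            int me = shmem_my_pe();

            if (me == 0 && shmem_n_pes() > 1) {
                /* Try to obtain a direct pointer to PE 1's copy of X */
                long *y = (long *)shmem_ptr(&X, 1);
                if (y != NULL) {
                    *y = 42;                    /* plain store instead of shmem_long_put() */
                } else {
                    long v = 42;                /* no shared-memory mapping: fall back to RMA */
                    shmem_long_put(&X, &v, 1, 1);
                }
            }

            shmem_barrier_all();
            if (me == 1) printf("PE 1: X = %ld\n", X);
            shmem_finalize();
            return 0;
        }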
  27. Many-Core System Challenges – New Ideas • OpenSHMEM Contexts by Intel – James Dinan and Mario Flajslik, "Contexts: A Mechanism for High Throughput Communication in OpenSHMEM", PGAS 2015 – explicit API for allocation and management of communication contexts. (Diagram: each application thread issues Put/Get operations on its own context, which the OpenSHMEM library maps to separate communication resources.)
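     The contexts proposal was later standardized in OpenSHMEM 1.4; the sketch below uses those 1.4 names (shmem_ctx_create, shmem_ctx_long_put, shmem_ctx_quiet), which may differ from the original proposal's API. Each OpenMP thread drives its own context, so puts issued by different threads can be ordered and completed independently.

        #include <shmem.h>
        #include <omp.h>

        static long results[128];   /* symmetric; assumes fewer than 128 OpenMP threads */

        int main(void) {
            int provided;
            shmem_init_thread(SHMEM_THREAD_MULTIPLE, &provided);   /* request full threading support */
            int npes = shmem_n_pes();

            #pragma omp parallel
            {
                shmem_ctx_t ctx;
                /* Each thread asks for a private context, i.e. its own ordering and
                 * completion environment, so threads do not serialize on shared state. */
                int have_ctx = (shmem_ctx_create(SHMEM_CTX_PRIVATE, &ctx) == 0);
                if (!have_ctx) ctx = SHMEM_CTX_DEFAULT;   /* fall back to the default context */

                int  tid   = omp_get_thread_num();
                long value = tid;
                shmem_ctx_long_put(ctx, &results[tid], &value, 1, (tid + 1) % npes);

                shmem_ctx_quiet(ctx);                     /* completes only this thread's puts */
                if (have_ctx) shmem_ctx_destroy(ctx);
            }

            shmem_barrier_all();
            shmem_finalize();
            return 0;
        }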
  28. Many-Core System Challenges – New Ideas • Cray's proposal of "hot" threads – Monika ten Bruggencate, Cray Inc., "Cray SHMEM Update", First OpenSHMEM Workshop: Experiences, Implementations and Tools • http://www.csm.ornl.gov/workshops/openshmem2013/documents/presentations_and_tutorials/tenBruggencate_Cray_SHMEM_Update.pdf – Idea: each thread is registered with the OpenSHMEM library, which allocates and automatically manages communication resources (a context) for the application – Compatible with the current API
  29. Address Space and Locality Challenges • The symmetric heap is not flexible enough – all PEs have to allocate the same amount of memory • No concept of locality • How do we manage different types of memory? • What is the right abstraction?
  30. Memory Spaces • Aaron Welch, Swaroop Pophale, Pavel Shamis, Oscar Hernandez, Stephen Poole, Barbara Chapman, "Extending the OpenSHMEM Memory Model to Support User-Defined Spaces", PGAS 2014 • Concept of teams – the original OpenSHMEM active-set (group of PEs) concept is outdated, BUT very lightweight (a local operation) • Memory space – a memory space is associated with a team – similar concepts can be found in MPI, Chapel, etc.
  31. Teams • Explicit method of grouping PEs • Fully local objects and operations – fast • New (sub)teams created from parent teams • Re-indexing of PE ids with respect to the team • Strided teams and axial splits – no need to maintain a "translation" array – all translations can be done with simple arithmetic (see the sketch below) • Ability to specify the team index for remote operations
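     The translation arithmetic for strided teams can be illustrated with a small helper; the names below are hypothetical and not part of any OpenSHMEM specification.

        #include <stdio.h>

        /* Hypothetical helper types/functions illustrating strided-team index
         * arithmetic; not an OpenSHMEM API. A team is (start, stride, size). */
        typedef struct { int start, stride, size; } strided_team_t;

        /* Team-relative PE id -> global PE id */
        int team_to_global(const strided_team_t *t, int team_pe) {
            return t->start + team_pe * t->stride;
        }

        /* Global PE id -> team-relative id, or -1 if the PE is not in the team */
        int global_to_team(const strided_team_t *t, int global_pe) {
            int offset = global_pe - t->start;
            if (offset < 0 || offset % t->stride != 0) return -1;
            int team_pe = offset / t->stride;
            return (team_pe < t->size) ? team_pe : -1;
        }

        int main(void) {
            strided_team_t even = { 0, 2, 4 };   /* PEs 0, 2, 4, 6 */
            printf("team PE 3 -> global PE %d\n", team_to_global(&even, 3));  /* 6 */
            printf("global PE 4 -> team PE %d\n", global_to_team(&even, 4));  /* 2 */
            return 0;
        }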
  32. Spaces (figure only; not captured in the transcript)
  33. Spaces • Creation of spaces and teams is decoupled • Faster memory allocation compared to shmalloc() • Future directions – different types of memory – locality – separate address spaces – asymmetric RMA access
  34. Fault Tolerance? • How do we run in the presence of faults? • What is the responsibility of the programming model and communication libraries? • Pengfei Hao, Pavel Shamis, Manjunath Gorentla Venkata, Swaroop Pophale, Aaron Welch, Stephen Poole, Barbara Chapman, "Fault Tolerance for OpenSHMEM", PGAS/OUG 2014 – http://nic.uoregon.edu/pgas14/oug_submissions/oug2014_submission_12.pdf
  35. Fault Tolerance • Basic idea – in-memory checkpoint of symmetric memory regions – symmetric recovery or only "memory recovery"
  36. Fault Tolerance • Code snippet (shown on the slide as an image; a hedged sketch of the idea follows below)
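     As a stand-in for the missing snippet, here is a hedged sketch, not the paper's API, of the in-memory checkpoint idea from the previous slide, using only standard OpenSHMEM calls: each PE pushes a copy of a symmetric region to a buddy PE and can later pull it back.

        #include <shmem.h>
        #include <string.h>

        /* Hedged sketch, not the Fault Tolerance paper's API: keep an in-memory
         * copy of a symmetric region on a buddy PE. */
        void checkpoint_to_buddy(void *region, void *backup, size_t len) {
            int buddy = (shmem_my_pe() + 1) % shmem_n_pes();   /* simple ring of buddies */
            shmem_putmem(backup, region, len, buddy);          /* push my data into buddy's backup */
            shmem_barrier_all();                               /* checkpoint is globally consistent */
        }

        void restore_from_buddy(void *region, void *backup, size_t len) {
            int buddy = (shmem_my_pe() + 1) % shmem_n_pes();
            shmem_getmem(region, backup, len, buddy);          /* pull my saved data back */
            shmem_barrier_all();
        }

        int main(void) {
            shmem_init();
            size_t len = 1024 * sizeof(long);
            long *data   = (long *)shmem_malloc(len);   /* application symmetric region */
            long *backup = (long *)shmem_malloc(len);   /* symmetric backup region */
            memset(data, 0, len);

            checkpoint_to_buddy(data, backup, len);
            /* ... computation; after a detected fault the surviving PEs would call ... */
            restore_from_buddy(data, backup, len);

            shmem_finalize();
            return 0;
        }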
  37. Fault Tolerance • Work in progress… • OpenSHMEM is just one piece of the puzzle – run-time, I/O, drivers, etc. – the system has to provide fault-tolerance infrastructure (error notification, coordination, etc.) • Leveraging existing work/research in the HPC community – MPI, Hadoop, etc.
  38. Summary • This is just a "snapshot" of some of the ideas – other active research & development topics: non-blocking operations, counting operations, signaled operations, asymmetric memory access, etc. • These challenges are relevant for many other HPC programming models • The key to success – co-design of hardware and software – generic solutions that target a broader community • The challenges are common across different fields: storage, analytics, big data, etc.
  39. How to get involved? • Join the mailing list – www.openshmem.org/Community/MailingList • Join the OpenSHMEM Redmine – www.openshmem.org/redmine • GitHub – https://github.com/orgs/openshmem-org • OpenSHMEM reference implementation, test suites, benchmarks, etc. • Participate in our upcoming events – workshops, user group meetings, and conference calls
  40. Upcoming Events… • Workshop 2015 – August 4th-6th, 2015
  41. Upcoming Events… • www.csm.ornl.gov/OpenSHMEM2015/ – co-located with PGAS 2015, the 9th International Conference on Partitioned Global Address Space Programming Models, Washington, DC
  42. Acknowledgements – This work was supported by the United States Department of Defense and used resources of the Extreme Scale Systems Center at Oak Ridge National Laboratory. "Empowering the Mission"
  43. Questions?
  44. Backup Slides
  45. NVSHMEM Code Example – Using NVSHMEM, device code (as shown on the slide; "-------" marks regions elided on the slide):

        __global__ void one_kernel (u, v, sync, …) {
            i = threadIdx.x;
            for (…) {
                if (i+1 > nx) { v[i+1] = nvshmem_float_g (v[1], rightpe) }
                if (i-1 < 1)  { v[i-1] = nvshmem_float_g (v[nx], leftpe) }
                -------
                u[i] = (u[i] + (v[i+1] + v[i-1] . . .

                /* peers array has left and right PE ids */
                if (i < 2) {
                    nvshmem_int_p (sync[i], 1, peers[i]);
                    nvshmem_quiet();
                    nvshmem_wait_until (sync[i], EQ, 1);
                } // intra-process sync
                -------
                // compute v from u and sync
            }
        }

     Source: "TOC-Centric Communication: a case study with NVSHMEM", OUG/PGAS 2014, Shreeram Potluri, http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf
