Oak Ridge National Laboratory
Computing and Computational Sciences
HPC Advisory Council Stanford Conference
California
Feb 2, 2015
Preparing OpenSHMEM for
Exascale
Presented by:
Pavel Shamis (Pasha)
2 Preparing OpenSHMEM for Exascale
Outline
• CORAL overview
– Summit
• What is OpenSHMEM ?
• Preparing OpenSHMEM for Exascale
– Recent advances
3 Preparing OpenSHMEM for Exascale
CORAL
• CORAL – Collaboration of ORNL, ANL, LLNL
• Objective – Procure 3 leadership computers to be sited at Argonne,
Oak Ridge and Lawrence Livermore in 2017
– Two of the contracts have been awarded with the Argonne contract in
process
• Leadership Computers
– RFP requests >100 PF, 2 GB/core main memory, local NVRAM, and
science performance 5x-10x Titan or Sequoia
4 Preparing OpenSHMEM for Exascale
The Road to Exascale
• 2010 – Jaguar: 2.3 PF, multi-core CPU, 7 MW
• 2012 – Titan: 27 PF, hybrid GPU/CPU, 9 MW
• 2017 – Summit (the CORAL system): 5-10x Titan, hybrid GPU/CPU, 10 MW
• 2022 – OLCF5: 5-10x Summit, ~20 MW
Since clock-rate scaling ended in 2003, HPC performance has been achieved through increased parallelism. Jaguar scaled to 300,000 cores.
Titan and beyond deliver hierarchical parallelism with very powerful nodes: MPI plus thread-level parallelism through OpenACC or OpenMP, plus vectors.
5 Preparing OpenSHMEM for Exascale
System Summary
Mellanox® Interconnect
• Dual-rail EDR InfiniBand®
IBM POWER
• NVLink™
NVIDIA Volta
• HBM
• NVLink
Compute Node
• POWER® Architecture Processor
• NVIDIA® Volta™
• NVMe-compatible PCIe 800 GB SSD
• > 512 GB HBM + DDR4
• Coherent shared memory
Compute Rack
• Standard 19”
• Warm water cooling
Compute System
• Summit: 5x-10x Titan
• 10 MW
6 Preparing OpenSHMEM for Exascale
Summit VS Titan
How does Summit compare to Titan
Feature | Summit | Titan
Application Performance | 5-10x Titan | Baseline
Number of Nodes | ~3,400 | 18,688
Node performance | > 40 TF | 1.4 TF
Memory per Node | >512 GB (HBM + DDR4) | 38 GB (GDDR5 + DDR3)
NVRAM per Node | 800 GB | 0
Node Interconnect | NVLink (5-12x PCIe 3) | PCIe 2
System Interconnect (node injection bandwidth) | Dual Rail EDR-IB (23 GB/s) | Gemini (6.4 GB/s)
Interconnect Topology | Non-blocking Fat Tree | 3D Torus
Processors | IBM POWER9, NVIDIA Volta™ | AMD Opteron™, NVIDIA Kepler™
File System | 120 PB, 1 TB/s, GPFS™ | 32 PB, 1 TB/s, Lustre®
Peak power consumption | 10 MW | 9 MW
Present and Future Leadership Computers at OLCF, Buddy Bland
https://www.olcf.ornl.gov/wp-content/uploads/2014/12/OLCF-User-Group-Summit-12-3-2014.pdf
7 Preparing OpenSHMEM for Exascale
Challenges for Programming Models
• Very powerful compute nodes
– Hybrid architecture
– Multiple CPU/GPU
– Different types of memory
• Must be fun to program ;-)
– MPI + X
8 Preparing OpenSHMEM for Exascale
What is OpenSHMEM ?
• Communication library and interface specification that implements a
Partitioned Global Address Space (PGAS) programming model
• Processing Element (PE): an OpenSHMEM process
• Symmetric objects have the same address (or offset) on all PEs
[Figure: the OpenSHMEM memory model. Every PE (PE 0, PE 1, …, PE N-1) exposes remotely accessible symmetric data objects, i.e. its global and static variables and its symmetric heap (e.g., variable X allocated with X = shmalloc(sizeof(long))), alongside private data objects such as local variables.]
9 Preparing OpenSHMEM for Exascale
OpenSHMEM Operations
• Remote memory Put and Get
– void shmem_getmem(void *target, const void *source, size_t len, int pe);
– void shmem_putmem(void *target, const void *source, size_t len, int pe);
• Remote memory Atomic operations (see the sketch at the end of this slide)
– void shmem_int_add(int *target, int value, int pe);
• Collective
– broadcast, reductions, etc.
• Synchronization operations
– Point-to-point
– Global
• Ordering operations
• Distributed lock operations
[Figure: the same OpenSHMEM memory model diagram as on the previous slide — symmetric data objects (global/static variables, symmetric heap) and private local variables on each PE.]
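To make the operations above concrete, here is a minimal sketch (assuming the pre-1.2 C API with start_pes()/_my_pe(); error handling omitted) in which every PE atomically increments a symmetric counter on PE 0, and PE 0 waits for all updates with a point-to-point wait:

    #include <shmem.h>

    long counter = 0;                  /* symmetric: global variable, same address on every PE */

    int main(void)
    {
        start_pes(0);                  /* initialize the library (pre-1.2 naming) */
        int me   = _my_pe();
        int npes = _num_pes();

        shmem_long_add(&counter, 1, 0);    /* remote atomic add on PE 0 */

        if (me == 0)
            /* point-to-point wait; SHMEM_CMP_EQ is spelled _SHMEM_CMP_EQ in some older headers */
            shmem_long_wait_until(&counter, SHMEM_CMP_EQ, npes);

        shmem_barrier_all();           /* global synchronization */
        return 0;
    }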
10 Preparing OpenSHMEM for Exascale
OpenSHMEM Code Example
[Code example shown as a figure with numbered callouts 1-4.]
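The original example appears only as a figure; as an illustrative stand-in (not the original listing), a program with the same four steps (library initialization, a PUT, synchronization, done) could look like this, again using the pre-1.2 API names:

    #include <stdio.h>
    #include <shmem.h>

    long value = 0;                            /* symmetric data object */

    int main(void)
    {
        start_pes(0);                          /* (1) library initialization */
        int me   = _my_pe();
        int npes = _num_pes();

        /* (2) one-sided PUT: each PE writes its id into 'value' on its right neighbor */
        long my_id = me;
        shmem_long_put(&value, &my_id, 1, (me + 1) % npes);

        shmem_barrier_all();                   /* (3) synchronization: all PUTs are now visible */

        printf("PE %d received %ld\n", me, value);
        return 0;                              /* (4) done */
    }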
11 Preparing OpenSHMEM for Exascale
OpenSHMEM Code Example
• You just learned to program OpenSHMEM!
– Library initialization
– AMO/PUT/GET
– Synchronization
– Done
12 Preparing OpenSHMEM for Exascale
OpenSHMEM
• OpenSHMEM is a one-sided communications library
– C and Fortran API
– Uses symmetric data objects to efficiently communicate across
processes
• Advantages:
– Good for irregular applications, latency-driven communication
• Random memory access patterns
– Maps really well to hardware/interconnects
OpenSHMEM operation | InfiniBand (Mellanox) | Gemini/Aries (Cray)
RMA PUT/GET | ✓ | ✓
Atomics | ✓ | ✓
Collectives | ✓ | ✓
13 Preparing OpenSHMEM for Exascale
OpenSHMEM Key Principles
• Keep it simple
– The specification is only ~ 80 pages
• Keep it fast
– As close as possible to hardware
14 Preparing OpenSHMEM for Exascale
Evolution of OpenSHMEM
Timeline: 1990s, 2011, 2012, 2013, 2015
• SHMEM library introduced by Cray Research Inc. (T3D systems)
• Adapted by SGI for products based on the NUMAlink architecture and included in the Message Passing Toolkit (MPT).
• Vendor-specific SHMEM libraries emerge (Quadrics, HP, IBM, Mellanox, Intel, gpSHMEM, SiCortex, etc.).
• OpenSHMEM is born.
• ORNL and UH come together to address the differences between
various SHMEM implementations.
• OSSS signed SHMEM trademark licensing agreement
• OpenSHMEM 1.0 is finalized
• OpenSHMEM 1.0 reference
implementation & V&V, Tools
• OpenSHMEM 1.1 released
mid 2014
• OpenSHMEM 1.2
2015 onwards, next OpenSHMEM
specifications: faster, more
predictable, more agile
OpenSHMEM is a living specification!
15 Preparing OpenSHMEM for Exascale
OpenSHMEM - Roadmap
• OpenSHMEM v1.1 (June 2014)
– Errata, bug fixes
– Ratified (100+ tickets resolved)
• OpenSHMEM v1.2 (Early 2015)
– API naming convention
– finalize(), global_exit()
– Consistent data type support
– Version information
– Clarifications: zero-length, wait
– shmem_ptr()
• OpenSHMEM v1.5 (Late 2015)
– Non-blocking communication semantics
(RMA, AMO)
– teams, groups
– Thread safety
• OpenSHMEM v1.6
– Non-blocking collectives
• OpenSHMEM v1.7
– Thread safety update
• OpenSHMEM Next Generation (2.0)
– Let’s go wild !!! (Exascale!)
– Active set + Memory context
– Fault Tolerance
– Exit codes
– Locality
– I/O
White paper:
OpenSHMEM Tools API
16 Preparing OpenSHMEM for Exascale
OpenSHMEM Community Today
Academia
Vendors
Government
17 Preparing OpenSHMEM for Exascale
OpenSHMEM Implementations
• Proprietary
– SGI SHMEM
– Cray SHMEM
– IBM SHMEM
– HP SHMEM
– Mellanox Scalable SHMEM
• Legacy
– Quadrics SHMEM
• Open Source
– OpenSHMEM Reference
Implementation (UH)
– Portals SHMEM
– OSHMPI / OpenSHMEM over MPI (under development)
– OpenSHMEM with Open MPI
– OpenSHMEM with MVAPICH (OSU)
– TSHMEM (UFL)
– GatorSHMEM (UFL)
18 Preparing OpenSHMEM for Exascale
OpenSHMEM Eco-system
[Figure: the OpenSHMEM eco-system — the OpenSHMEM reference implementation surrounded by tools such as the OpenSHMEM Analyzer and Vampir.]
19 Preparing OpenSHMEM for Exascale
OpenSHMEM Eco-system
• OpenSHMEM Specification
– http://www.openshmem.org/site/Downloads/Source
• Vampir
– https://www.vampir.eu
• TAU
– http://www.cs.uoregon.edu/research/tau/home.php
• DDT
– www.allinea.com/products/ddt
• OpenSHMEM Analyzer
– https://www.openshmem.org/OSA
• UCCS
– http://uccs.github.io/uccs/
20 Preparing OpenSHMEM for Exascale
Upcoming Challenges for OpenSHMEM
• Based on what we know about the upcoming architecture…
• Hybrid architecture
– Communication across different components of the system
– Locality of resources
• Multiple CPU/GPU
– Thread safety (without performance sacrifices)
– Thread locality
– Scalability
• Different types of memory
– Address spaces
21 Preparing OpenSHMEM for Exascale
Hybrid Architecture Challenges and Ideas
• OpenSHMEM for accelerators
• “TOC-Centric Communication: a case study with NVSHMEM”,
OUG/PGAS 2014, Shreeram Potluri
– http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf
– Preliminary study, prototype concept
22 Preparing OpenSHMEM for Exascale
NVSHMEM
• The problem
– Communication across GPUs requires synchronization with the host
• Software overheads, hardware overhead
of launching kernels, etc.
• Research idea/concept proposed by
Nvidia
– GPU-initiated communication
– NVSHMEM communication primitives:
nvshmem_put(), nvshmem_get()
to/from remote GPU memory
– Emulated using CUDA IPC (CUDA 4.2)
The slide is based on “TOC-Centric Communication: a case study with NVSHMEM”, OUG/PGAS 2014, Shreeram Potluri, http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf
Traditional approach:

    Loop {
        Interior Compute (kernel launch)
        Pack Boundaries (kernel launch)
        Stream Synchronize
        Exchange (MPI/OpenSHMEM)
        Unpack Boundaries (kernel launch)
        Boundary Compute (kernel launch)
        Stream/Device Synchronize
    }

– Kernel launch overheads
– CPU-based blocking synchronization
23 Preparing OpenSHMEM for Exascale
NVSHMEM
u[i][j] = u[i][j] + (v[i+1][j] + v[i-1][j] + v[i][j+1] + v[i][j-1])/x
Evaluation results from: “TOC-Centric Communication: a case study with NVSHMEM”, OUG/PGAS 2014, Shreeram Potluri, http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf
[Figure: preliminary results — time per step (usec) versus stencil size (64 to 2K) for the traditional approach and the persistent-kernel approach.]
24 Preparing OpenSHMEM for Exascale
Many-Core System Challenges
• It is challenging to provide high-performance THREAD_MULTIPLE support
– Locks / atomic operations in the communication path
• Even though the MPI IMB benchmarks benefit from full process memory separation, multi-threaded UCCS obtains comparable performance
Aurelien Bouteiller, Thomas Herault and George Bosilca, “A Multithreaded Communication Substrate for
OpenSHMEM”, OUG2014
25 Preparing OpenSHMEM for Exascale
Many-Core System Challenges – “Old” Ideas
• SHMEM_PTR (or SHMEM_LOCAL_PTR on Cray)
[Figure: memory mapping between two PEs. PE 0 obtains a local pointer Y to PE 1's copy of the symmetric variable X via Y = shmem_ptr(&X, PE1) and can then access it with ordinary loads and stores.]
26 Preparing OpenSHMEM for Exascale
Many-Core System Challenges – “Old” Ideas
• Provides direct access to a “remote” PE's data with memory load and store operations (a small usage sketch follows below)
• Supported on systems where SHMEM_PUT/GET are implemented with memory load and store operations
– Usually implemented using XPMEM (https://code.google.com/p/xpmem/)
• Gabriele Jost, Ulf R. Hanebutte, James Dinan, “OpenSHMEM with
Threads: A Bad Idea?”
• http://www.csm.ornl.gov/OpenSHMEM2014/documents/talk6_jost_OUG14.pdf
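A small usage sketch of shmem_ptr() (standard API; it returns NULL when the remote PE's memory cannot be mapped for direct load/store, so the code falls back to a regular get):

    #include <shmem.h>

    long X = 0;                                   /* symmetric variable */

    long read_right_neighbor(void)
    {
        int right = (_my_pe() + 1) % _num_pes();
        long *y = (long *) shmem_ptr(&X, right);  /* try to map the neighbor's copy of X */
        long val;

        if (y != NULL)
            val = *y;                             /* direct load, e.g. on-node via XPMEM */
        else
            shmem_long_get(&val, &X, 1, right);   /* fall back to a regular get */
        return val;
    }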
27 Preparing OpenSHMEM for Exascale
Many-Core System Challenges – New Ideas
• OpenSHMEM Context by Intel
– James Dinan and Mario Flajslik, “Contexts: A Mechanism for High
Throughput Communication in OpenSHMEM”, PGAS 2015
– Explicit API for allocation and management of communication contexts
[Figure: three application threads, each driving its own communication context (Context 0, 1, 2) inside the OpenSHMEM library and issuing independent Put/Get streams.]
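At the time of this talk the context API was only a proposal; the sketch below is a hypothetical illustration of the usage pattern in the spirit of the Dinan/Flajslik paper. The type shmem_ctx_t and the shmem_ctx_* calls are assumptions here, not a ratified OpenSHMEM interface:

    #include <stddef.h>
    #include <shmem.h>

    /* Hypothetical per-thread context usage; names are illustrative assumptions. */
    void thread_work(void *dst, const void *src, size_t len, int pe)
    {
        shmem_ctx_t ctx;                          /* assumed context handle type */

        shmem_ctx_create(0, &ctx);                /* private communication resources for this thread */
        shmem_ctx_putmem(ctx, dst, src, len, pe); /* traffic isolated from other threads */
        shmem_ctx_quiet(ctx);                     /* complete only this context's operations */
        shmem_ctx_destroy(ctx);
    }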
28 Preparing OpenSHMEM for Exascale
Many-Core System Challenges – New Ideas
• Cray’s proposal of “Hot” Threads
– Monika ten Bruggencate, Cray Inc., “Cray SHMEM Update”, First OpenSHMEM Workshop: Experiences, Implementations and Tools
• http://www.csm.ornl.gov/workshops/openshmem2013/documents/presentations_and_tutorials/tenBruggencate_Cray_SHMEM_Update.pdf
– Idea: each thread registers with the OpenSHMEM library, which allocates and automatically manages communication resources (a context) for the application
– Compatible with current API
29 Preparing OpenSHMEM for Exascale
Address Space and Locality Challenges
• The symmetric heap is not flexible enough
– All PEs have to allocate the same amount of memory
• No concept of locality
• How do we manage different types of memory?
• What is the right abstraction?
30 Preparing OpenSHMEM for Exascale
Memory Spaces
• Aaron Welch, Swaroop Pophale, Pavel Shamis, Oscar Hernandez, Stephen Poole, Barbara Chapman, “Extending the OpenSHMEM Memory Model to Support User-Defined Spaces”, PGAS 2014
• Concept of teams
– The original OpenSHMEM active-set (group of PEs) concept is outdated, BUT very lightweight (a local operation)
• Memory Space
– Memory space association with a team
– Similar concepts can be found in MPI, Chapel, etc.
31 Preparing OpenSHMEM for Exascale
Teams
• Explicit method of grouping PEs
• Fully local objects and operations - Fast
• New (sub)teams created from parent teams
• Re-indexing of PE ids with respect to the
team
• Strided teams and axial splits
– No need to maintain a “translation” array
– All translations can be done with simple arithmetic (see the sketch below)
• Ability to specify team index for remote
operations
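The “simple arithmetic” point can be made concrete. For a strided team described by a (start, stride, size) triple, team-relative ranks and global PE ids convert in O(1) without a translation table; the descriptor below is a hypothetical illustration, not a proposed data structure:

    /* Hypothetical strided-team descriptor, for illustration only. */
    typedef struct { int start, stride, size; } strided_team_t;

    /* team-relative rank -> global PE id */
    static inline int team_to_global(const strided_team_t *t, int rank)
    {
        return t->start + rank * t->stride;
    }

    /* global PE id -> team-relative rank, or -1 if the PE is not a member */
    static inline int global_to_team(const strided_team_t *t, int pe)
    {
        int off = pe - t->start;
        if (off < 0 || off % t->stride != 0 || off / t->stride >= t->size)
            return -1;
        return off / t->stride;
    }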
32 Preparing OpenSHMEM for Exascale
Spaces
33 Preparing OpenSHMEM for Exascale
Spaces
• Space and team creation is decoupled
• Faster memory allocation compared to “shmalloc”
• Future directions
– Different types of memory
– Locality
– Separate address spaces
– Asymmetric RMA access
34 Preparing OpenSHMEM for Exascale
Fault Tolerance ?
• How to run in the presence of faults?
• What is the responsibility of the programming model and communication libraries?
• Pengfei Hao, Pavel Shamis, Manjunath Gorentla Venkata, Swaroop
Pophale, Aaron Welch, Stephen Poole, Barbara Chapman, “Fault
Tolerance for OpenSHMEM”, PGAS/OUG14
– http://nic.uoregon.edu/pgas14/oug_submissions/oug2014_submission_12.pdf
35 Preparing OpenSHMEM for Exascale
Fault Tolerance
• Basic idea
– In-memory checkpointing of symmetric memory regions
– Symmetric recovery or “memory recovery” only
36 Preparing OpenSHMEM for Exascale
Fault Tolerance
• Code snippet
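The snippet on the original slide is a figure. As a stand-in, here is a minimal illustrative sketch of the in-memory checkpoint idea using only standard OpenSHMEM calls (mirroring a symmetric region onto a partner PE); it is not the API proposed in the paper:

    #include <shmem.h>

    #define N 1024

    long data[N];                                 /* symmetric region to protect */
    long ckpt[N];                                 /* symmetric buffer holding the partner's copy */

    /* checkpoint: push my 'data' into the partner's 'ckpt' buffer */
    void checkpoint_to_partner(void)
    {
        int partner = (_my_pe() + 1) % _num_pes();
        shmem_long_put(ckpt, data, N, partner);
        shmem_barrier_all();                      /* all checkpoints are in place */
    }

    /* memory recovery: push the copy I hold back into a recovered PE's 'data' */
    void restore_partner(int recovered_pe)
    {
        shmem_long_put(data, ckpt, N, recovered_pe);
        shmem_quiet();
    }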
37 Preparing OpenSHMEM for Exascale
Fault Tolerance
• Work in progress…
• OpenSHMEM is just one piece of the puzzle
– Run-time, I/O, drivers, etc.
– The system has to provide fault tolerance infrastructure
• Error notification, coordination, etc.
• Leveraging existing work/research in the HPC community
– MPI, Hadoop, etc.
38 Preparing OpenSHMEM for Exascale
Summary
• This is just a “snapshot” of some of the ideas
– Other active research & development topics: non-blocking operations, counting operations, signaled operations, asymmetric memory access, etc.
• These challenges are relevant for many other HPC programming
models
• The key to success
– Co-design of hardware and software
– Generic solutions that target the broader community
• The challenges are common across different fields: storage, analytics, big data, etc.
39 Preparing OpenSHMEM for Exascale
How to get involved?
• Join the mailing list
– www.openshmem.org/Community/MailingList
• Join OpenSHMEM redmine
– www.openshmem.org/redmine
• GitHub
– https://github.com/orgs/openshmem-org
• OpenSHMEM reference implementation, test suites, benchmarks, etc.
• Participate in our upcoming events
– Workshop, user group meetings, and conference calls
40 Preparing OpenSHMEM for Exascale
Upcoming Events…
OpenSHMEM Workshop 2015
August 4th-6th, 2015
41 Preparing OpenSHMEM for Exascale
Upcoming Events…
Co-located with PGAS 2015, the 9th International Conference on Partitioned Global Address Space Programming Models
Washington, DC
www.csm.ornl.gov/OpenSHMEM2015/
42 Preparing OpenSHMEM for Exascale
Acknowledgements
This work was supported by the United States Department of Defense &
used resources of the Extreme Scale Systems Center at Oak Ridge
National Laboratory.
Empowering the Mission
43 Preparing OpenSHMEM for Exascale
Questions ?
44 Preparing OpenSHMEM for Exascale
Backup Slides
45 Preparing OpenSHMEM for Exascale
NVSHMEM Code Example
USING NVSHMEM – Device Code

    __global__ void one_kernel (u, v, sync, …) {
        i = threadIdx.x;
        for (…) {
            if (i+1 > nx) {
                v[i+1] = nvshmem_float_g (v[1], rightpe);
            }
            if (i-1 < 1) {
                v[i-1] = nvshmem_float_g (v[nx], leftpe);
            }
            -------
            u[i] = (u[i] + (v[i+1] + v[i-1] . . .

            /* peers array has left and right PE ids */
            if (i < 2) {
                nvshmem_int_p (sync[i], 1, peers[i]);
                nvshmem_quiet();
                nvshmem_wait_until (sync[i], EQ, 1);
            }
            // intra-process sync
            ------- // compute v from u and sync
        }
    }

Evaluation results from: “TOC-Centric Communication: a case study with NVSHMEM”, OUG/PGAS 2014, Shreeram Potluri, http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf


Editor's Notes

  • #2 Thank the organizers. Pavel Shamis / Pasha: people struggle to pronounce Pavel, so I go by Pasha (the "Pavel Chekov" character in Star Trek). * I'm research staff at ORNL, CSR group. * These days I spend most of my time working on communication middleware, high-performance networks, programming models, and OpenSHMEM. * In the past I spent more than a few years working on MPI and developing InfiniBand codes, Verbs, etc. Prior to ORNL I worked at Mellanox and led the development of the HPC software stack. I will not be staying for the rest of the conference; I have to leave after the talk. Feel free to interrupt and ask questions. On the positive side (mostly for you), I have a strict deadline of ~1 hour or I will miss my flight, so I will not be able to bore you.
  • #3 Start with the CORAL effort and the Summit system. Introduce the new architecture and its challenges. Then switch context to OpenSHMEM and how we plan to address some of those challenges.
  • #5 Sierra and Summit: Scaling to New Heights with OpenPower by David E. Bernholdt, Terri Quinn
  • #6 The system is based on Open Power architecture
  • #7 Present and Future Leadership Computers at OLCF, Buddy Bland
  • #8 HW architectures, as usual, drop their problems on software developers
  • #9 Implements PGAS, but was introduced before PGAS was formally defined. We define it this way because people are familiar with PGAS.
  • #10 I could use a better diagram
  • #12 I simplified OpenSHMEM programming. There are many more put/get operations, but overall it is not that hard.
  • #13 What we have learned by now: OpenSHMEM can be implemented as a thin layer on top of drivers such as Verbs or GNI. AMO and RMA are mapped to ibv_post_send or GNI fma_post.
  • #14 * Swiss army knife of memory operations * Can be dangerous (a straight razor)
  • #15 History of OpenSHMEM in one slide… * Between the 1990s and 2011 SHMEM didn't change too much. A few vendors provided very similar SHMEM libraries: slightly different function signatures, but very similar conceptually. OpenSHMEM was born out of the desire to address the differences between implementations, enhance SHMEM, and create a tools eco-system.
  • #16 The SHMEM specification (now OpenSHMEM) had essentially not been updated since the early 90s. Specification 1.0 – essentially SGI SHMEM. Specification 1.1 – a bugfix release of 1.0. All 1.X versions are backward compatible with 1.0. The 1.2 specification is coming this year (release ~March).
  • #17 Multiple organizations are involved in the effort. Academia and labs: development and research. Vendors: provide the implementations. Very close collaboration with vendors; we have to ensure that whatever we define can be implemented efficiently in hardware. Very often we get pushback 
  • #19 We are trying to build an OpenSHMEM eco-system that helps software developers write, debug, and analyze their codes
  • #21 The challenges for OpenSHMEM. Back to the challenges: we know that we are going to have very powerful nodes. I'm sure there are more challenges…
  • #22 Presented at the OpenSHMEM user group meeting. An excellent presentation by NVIDIA
  • #24 Preliminary results, evaluated with a stencil kernel. In order to calculate border values you have to access a "remote" GPU. Traditional kernel: you have to go to the host for communication and synchronization. Persistent kernel: communication is initiated from the GPU.
  • #25 Probably you are familiar with MPI thread safety. The most interesting mode is THREAD_MULTIPLE, where every thread can send and receive data. Most MPI implementations struggle to provide a high-performance THREAD_MULTIPLE implementation.
  • #26 An old trick. We received a lot of requests from users to enable this feature
  • #27 The memory access can be optimized by the compiler (it sees it as access through pointers)
  • #28 Requires a new API
  • #31 Active set – defines a group of processes. * Defined on the fly. * A very lightweight local operation. * Current active sets support only power-of-2 strides, most likely based on n-cube topology.
  • #35 What does it mean? * Very large scale systems. * A lot of components. * Vendors increase the density of the silicon and decrease power consumption. *** Mean time between failures is not expected to improve. As usual, the HW people dropped the problem on software  Obviously everybody expects that we know how to solve it
  • #36 Remotely accessible memory is the most critical memory region: the memory is remotely accessible and visible to other processes
  • #37 Pseudo-code
  • #39 On the openshmem.org website you may find references to publications, presentations, and more. P2P synchronization operations proposed by Intel; RMA I/O operations for storage – Los Alamos
  • #41 Last year's workshop was in Annapolis (near DC): OpenSHMEM tutorials, an InfiniBand Verbs tutorial. This year: will be announced soon, most likely Baltimore. Similar format to the previous event: one day of tutorials, the second day presentations and…
  • #42 A bit different format: short papers or presentations, F2F discussions