Oak Ridge National Laboratory
Computing and Computational Sciences
HPC Advisory Council Stanford Conference
California
Feb 2, 2015
Preparing OpenSHMEM for
Exascale
Presented by:
Pavel Shamis (Pasha)
2 Preparing OpenSHMEM for Exascale
Outline
• CORAL overview
– Summit
• What is OpenSHMEM ?
• Preparing OpenSHMEM for Exascale
– Recent advances
3 Preparing OpenSHMEM for Exascale
CORAL
• CORAL – Collaboration of ORNL, ANL, LLNL
• Objective – Procure 3 leadership computers to be sited at Argonne,
Oak Ridge and Lawrence Livermore in 2017
– Two of the contracts have been awarded with the Argonne contract in
process
• Leadership Computers
– RFP requests >100 PF, 2 GB/core main memory, local NVRAM, and
science performance 5x-10x Titan or Sequoia
4 Preparing OpenSHMEM for Exascale
The Road to Exascale
• 2010 – Jaguar: 2.3 PF, multi-core CPU, 7 MW
• 2012 – Titan: 27 PF, hybrid GPU/CPU, 9 MW
• 2017 – Summit (the CORAL system): 5-10x Titan, hybrid GPU/CPU, 10 MW
• 2022 – OLCF5: 5-10x Summit, ~20 MW
Since clock-rate scaling ended in 2003, HPC performance has been achieved through increased parallelism. Jaguar scaled to 300,000 cores.
Titan and beyond deliver hierarchical parallelism with very powerful nodes: MPI plus thread-level parallelism through OpenACC or OpenMP, plus vectors.
5 Preparing OpenSHMEM for Exascale
System Summary
Mellanox® Interconnect
• Dual-rail EDR InfiniBand®
IBM POWER
• NVLink™
NVIDIA Volta
• HBM
• NVLink
Compute Node
• POWER® Architecture Processor
• NVIDIA® Volta™
• NVMe-compatible PCIe 800 GB SSD
• > 512 GB HBM + DDR4
• Coherent shared memory
Compute Rack
• Standard 19”
• Warm water cooling
Compute System
• Summit: 5x-10x Titan
• 10 MW
6 Preparing OpenSHMEM for Exascale
Summit VS Titan
How does Summit compare to Titan
Feature | Summit | Titan
Application Performance | 5-10x Titan | Baseline
Number of Nodes | ~3,400 | 18,688
Node performance | > 40 TF | 1.4 TF
Memory per Node | >512 GB (HBM + DDR4) | 38 GB (GDDR5 + DDR3)
NVRAM per Node | 800 GB | 0
Node Interconnect | NVLink (5-12x PCIe 3) | PCIe 2
System Interconnect (node injection bandwidth) | Dual Rail EDR-IB (23 GB/s) | Gemini (6.4 GB/s)
Interconnect Topology | Non-blocking Fat Tree | 3D Torus
Processors | IBM POWER9, NVIDIA Volta™ | AMD Opteron™, NVIDIA Kepler™
File System | 120 PB, 1 TB/s, GPFS™ | 32 PB, 1 TB/s, Lustre®
Peak power consumption | 10 MW | 9 MW
Present and Future Leadership Computers at OLCF, Buddy Bland
https://www.olcf.ornl.gov/wp-content/uploads/2014/12/OLCF-User-Group-Summit-12-3-2014.pdf
7 Preparing OpenSHMEM for Exascale
Challenges for Programming Models
• Very powerful compute nodes
– Hybrid architecture
– Multiple CPU/GPU
– Different types of memory
• Must be fun to program ;-)
– MPI + X
8 Preparing OpenSHMEM for Exascale
What is OpenSHMEM ?
• Communication library and interface specification that implements a
Partitioned Global Address Space (PGAS) programming model
• Processing Element (PE): an OpenSHMEM process
• Symmetric objects have the same address (or offset) on all PEs
[Figure: the OpenSHMEM memory model. Every PE (PE 0, PE 1, …, PE N-1) exposes remotely accessible symmetric data objects, i.e. its global and static variables and its symmetric heap (e.g., variable X allocated with X = shmalloc(sizeof(long))), alongside private data objects such as local variables.]
9 Preparing OpenSHMEM for Exascale
OpenSHMEM Operations
• Remote memory Put and Get
– void shmem_getmem(void *target, const void *source, size_t len, int pe);
– void shmem_putmem(void *target, const void *source, size_t len, int pe);
• Remote memory Atomic operations (see the sketch at the end of this slide)
– void shmem_int_add(int *target, int value, int pe);
• Collective
– broadcast, reductions, etc.
• Synchronization operations
– Point-to-point
– Global
• Ordering operations
• Distributed lock operations
[Figure: the same OpenSHMEM memory model diagram as on the previous slide — symmetric data objects (global/static variables, symmetric heap) and private local variables on each PE.]
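To make the operations above concrete, here is a minimal sketch (assuming the pre-1.2 C API with start_pes()/_my_pe(); error handling omitted) in which every PE atomically increments a symmetric counter on PE 0, and PE 0 waits for all updates with a point-to-point wait:

    #include <shmem.h>

    long counter = 0;                  /* symmetric: global variable, same address on every PE */

    int main(void)
    {
        start_pes(0);                  /* initialize the library (pre-1.2 naming) */
        int me   = _my_pe();
        int npes = _num_pes();

        shmem_long_add(&counter, 1, 0);    /* remote atomic add on PE 0 */

        if (me == 0)
            /* point-to-point wait; SHMEM_CMP_EQ is spelled _SHMEM_CMP_EQ in some older headers */
            shmem_long_wait_until(&counter, SHMEM_CMP_EQ, npes);

        shmem_barrier_all();           /* global synchronization */
        return 0;
    }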
10 Preparing OpenSHMEM for Exascale
OpenSHMEM Code Example
[Code example shown as a figure with numbered callouts 1-4.]
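The original example appears only as a figure; as an illustrative stand-in (not the original listing), a program with the same four steps (library initialization, a PUT, synchronization, done) could look like this, again using the pre-1.2 API names:

    #include <stdio.h>
    #include <shmem.h>

    long value = 0;                            /* symmetric data object */

    int main(void)
    {
        start_pes(0);                          /* (1) library initialization */
        int me   = _my_pe();
        int npes = _num_pes();

        /* (2) one-sided PUT: each PE writes its id into 'value' on its right neighbor */
        long my_id = me;
        shmem_long_put(&value, &my_id, 1, (me + 1) % npes);

        shmem_barrier_all();                   /* (3) synchronization: all PUTs are now visible */

        printf("PE %d received %ld\n", me, value);
        return 0;                              /* (4) done */
    }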
11 Preparing OpenSHMEM for Exascale
OpenSHMEM Code Example
• You just learned to program OpenSHMEM!
– Library initialization
– AMO/PUT/GET
– Synchronization
– Done
12 Preparing OpenSHMEM for Exascale
OpenSHMEM
• OpenSHMEM is a one-sided communications library
– C and Fortran API
– Uses symmetric data objects to efficiently communicate across
processes
• Advantages:
– Good for irregular applications, latency-driven communication
• Random memory access patterns
– Maps really well to hardware/interconnects
OpenSHMEM operation | InfiniBand (Mellanox) | Gemini/Aries (Cray)
RMA PUT/GET | ✓ | ✓
Atomics | ✓ | ✓
Collectives | ✓ | ✓
13 Preparing OpenSHMEM for Exascale
OpenSHMEM Key Principles
• Keep it simple
– The specification is only ~ 80 pages
• Keep it fast
– As close as possible to hardware
14 Preparing OpenSHMEM for Exascale
Evolution of OpenSHMEM
Timeline: 1990s, 2011, 2012, 2013, 2015
• SHMEM library introduced by Cray Research Inc. (T3D systems)
• Adapted by SGI for products based on the NUMAlink architecture and included in the Message Passing Toolkit (MPT).
• Vendor-specific SHMEM libraries emerge (Quadrics, HP, IBM, Mellanox, Intel, gpSHMEM, SiCortex, etc.).
• OpenSHMEM is born.
• ORNL and UH come together to address the differences between
various SHMEM implementations.
• OSSS signed SHMEM trademark licensing agreement
• OpenSHMEM 1.0 is finalized
• OpenSHMEM 1.0 reference
implementation & V&V, Tools
• OpenSHMEM 1.1 released
mid 2014
• OpenSHMEM 1.2
2015 onwards, next OpenSHMEM
specifications: faster, more
predictable, more agile
OpenSHMEM is a living specification!
15 Preparing OpenSHMEM for Exascale
OpenSHMEM - Roadmap
• OpenSHMEM v1.1 (June 2014)
– Errata, bug fixes
– Ratified (100+ tickets resolved)
• OpenSHMEM v1.2 (Early 2015)
– API naming convention
– finalize(), global_exit()
– Consistent data type support
– Version information
– Clarifications: zero-length, wait
– shmem_ptr()
• OpenSHMEM v1.5 (Late 2015)
– Non-blocking communication semantics
(RMA, AMO)
– teams, groups
– Thread safety
• OpenSHMEM v1.6
– Non-blocking collectives
• OpenSHMEM v1.7
– Thread safety update
• OpenSHMEM Next Generation (2.0)
– Let’s go wild !!! (Exascale!)
– Active set + Memory context
– Fault Tolerance
– Exit codes
– Locality
– I/O
White paper:
OpenSHMEM Tools API
16 Preparing OpenSHMEM for Exascale
OpenSHMEM Community Today
Academia
Vendors
Government
17 Preparing OpenSHMEM for Exascale
OpenSHMEM Implementations
• Proprietary
– SGI SHMEM
– Cray SHMEM
– IBM SHMEM
– HP SHMEM
– Mellanox Scalable SHMEM
• Legacy
– Quadrics SHMEM
• Open Source
– OpenSHMEM Reference
Implementation (UH)
– Portals SHMEM
– OSHMPI / OpenSHMEM over MPI (under development)
– OpenSHMEM with Open MPI
– OpenSHMEM with MVAPICH (OSU)
– TSHMEM (UFL)
– GatorSHMEM (UFL)
18 Preparing OpenSHMEM for Exascale
OpenSHMEM Eco-system
[Figure: the OpenSHMEM eco-system — the OpenSHMEM reference implementation surrounded by tools such as the OpenSHMEM Analyzer and Vampir.]
19 Preparing OpenSHMEM for Exascale
OpenSHMEM Eco-system
• OpenSHMEM Specification
– http://www.openshmem.org/site/Downloads/Source
• Vampir
– https://www.vampir.eu
• TAU
– http://www.cs.uoregon.edu/research/tau/home.php
• DDT
– www.allinea.com/products/ddt
• OpenSHMEM Analyzer
– https://www.openshmem.org/OSA
• UCCS
– http://uccs.github.io/uccs/
20 Preparing OpenSHMEM for Exascale
Upcoming Challenges for OpenSHMEM
• Based on what we know about the upcoming architecture…
• Hybrid architecture
– Communication across different components of the system
– Locality of resources
• Multiple CPU/GPU
– Thread safety (without performance sacrifices)
– Thread locality
– Scalability
• Different types of memory
– Address spaces
21 Preparing OpenSHMEM for Exascale
Hybrid Architecture Challenges and Ideas
• OpenSHMEM for accelerators
• “TOC-Centric Communication: a case study with NVSHMEM”,
OUG/PGAS 2014, Shreeram Potluri
– http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf
– Preliminary study, prototype concept
22 Preparing OpenSHMEM for Exascale
NVSHMEM
• The problem
– Communication across GPUs requires synchronization with the host
• Software overheads, hardware overhead
of launching kernels, etc.
• Research idea/concept proposed by
Nvidia
– GPU-initiated communication
– NVSHMEM communication primitives:
nvshmem_put(), nvshmem_get()
to/from remote GPU memory
– Emulated using CUDA IPC (CUDA 4.2)
The slide is based on “TOC-Centric Communication: a case study with NVSHMEM”, OUG/PGAS 2014, Shreeram Potluri, http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf
Traditional approach:

    Loop {
        Interior Compute (kernel launch)
        Pack Boundaries (kernel launch)
        Stream Synchronize
        Exchange (MPI/OpenSHMEM)
        Unpack Boundaries (kernel launch)
        Boundary Compute (kernel launch)
        Stream/Device Synchronize
    }

– Kernel launch overheads
– CPU-based blocking synchronization
23 Preparing OpenSHMEM for Exascale
NVSHMEM
u[i][j] = u[i][j] + (v[i+1][j] + v[i-1][j] + v[i][j+1] + v[i][j-1])/x
Evaluation results from: “TOC-Centric Communication: a case study with NVSHMEM”, OUG/PGAS 2014, Shreeram Potluri, http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf
[Figure: preliminary results — time per step (usec) versus stencil size (64 to 2K) for the traditional approach and the persistent-kernel approach.]
24 Preparing OpenSHMEM for Exascale
Many-Core System Challenges
• It is challenging to provide high-performance THREAD_MULTIPLE support
– Locks / atomic operations in the communication path
• Even though the MPI IMB benchmarks benefit from full process memory separation, multi-threaded UCCS obtains comparable performance
Aurelien Bouteiller, Thomas Herault and George Bosilca, “A Multithreaded Communication Substrate for
OpenSHMEM”, OUG2014
25 Preparing OpenSHMEM for Exascale
Many-Core System Challenges – “Old” Ideas
• SHMEM_PTR (or SHMEM_LOCAL_PTR on Cray)
[Figure: memory mapping between two PEs. PE 0 obtains a local pointer Y to PE 1's copy of the symmetric variable X via Y = shmem_ptr(&X, PE1) and can then access it with ordinary loads and stores.]
26 Preparing OpenSHMEM for Exascale
Many-Core System Challenges – “Old” Ideas
• Provides direct access to a “remote” PE's data with memory load and store operations (a small usage sketch follows below)
• Supported on systems where SHMEM_PUT/GET are implemented with memory load and store operations
– Usually implemented using XPMEM (https://code.google.com/p/xpmem/)
• Gabriele Jost, Ulf R. Hanebutte, James Dinan, “OpenSHMEM with
Threads: A Bad Idea?”
• http://www.csm.ornl.gov/OpenSHMEM2014/documents/talk6_jost_OUG14.pdf
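A small usage sketch of shmem_ptr() (standard API; it returns NULL when the remote PE's memory cannot be mapped for direct load/store, so the code falls back to a regular get):

    #include <shmem.h>

    long X = 0;                                   /* symmetric variable */

    long read_right_neighbor(void)
    {
        int right = (_my_pe() + 1) % _num_pes();
        long *y = (long *) shmem_ptr(&X, right);  /* try to map the neighbor's copy of X */
        long val;

        if (y != NULL)
            val = *y;                             /* direct load, e.g. on-node via XPMEM */
        else
            shmem_long_get(&val, &X, 1, right);   /* fall back to a regular get */
        return val;
    }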
27 Preparing OpenSHMEM for Exascale
Many-Core System Challenges – New Ideas
• OpenSHMEM Context by Intel
– James Dinan and Mario Flajslik, “Contexts: A Mechanism for High
Throughput Communication in OpenSHMEM”, PGAS 2015
– Explicit API for allocation and management of communication contexts
[Figure: three application threads, each driving its own communication context (Context 0, 1, 2) inside the OpenSHMEM library and issuing independent Put/Get streams.]
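At the time of this talk the context API was only a proposal; the sketch below is a hypothetical illustration of the usage pattern in the spirit of the Dinan/Flajslik paper. The type shmem_ctx_t and the shmem_ctx_* calls are assumptions here, not a ratified OpenSHMEM interface:

    #include <stddef.h>
    #include <shmem.h>

    /* Hypothetical per-thread context usage; names are illustrative assumptions. */
    void thread_work(void *dst, const void *src, size_t len, int pe)
    {
        shmem_ctx_t ctx;                          /* assumed context handle type */

        shmem_ctx_create(0, &ctx);                /* private communication resources for this thread */
        shmem_ctx_putmem(ctx, dst, src, len, pe); /* traffic isolated from other threads */
        shmem_ctx_quiet(ctx);                     /* complete only this context's operations */
        shmem_ctx_destroy(ctx);
    }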
28 Preparing OpenSHMEM for Exascale
Many-Core System Challenges – New Ideas
• Cray’s proposal of “Hot” Threads
– Monika ten Bruggencate, Cray Inc., “Cray SHMEM Update”, First OpenSHMEM Workshop: Experiences, Implementations and Tools
• http://www.csm.ornl.gov/workshops/openshmem2013/documents/presentations_and_tutorials/tenBruggencate_Cray_SHMEM_Update.pdf
– Idea: each thread registers with the OpenSHMEM library, which allocates and automatically manages communication resources (a context) for the application
– Compatible with current API
29 Preparing OpenSHMEM for Exascale
Address Space and Locality Challenges
• The symmetric heap is not flexible enough
– All PEs have to allocate the same amount of memory
• No concept of locality
• How do we manage different types of memory?
• What is the right abstraction?
30 Preparing OpenSHMEM for Exascale
Memory Spaces
• Aaron Welch, Swaroop Pophale, Pavel Shamis, Oscar Hernandez, Stephen Poole, Barbara Chapman, “Extending the OpenSHMEM Memory Model to Support User-Defined Spaces”, PGAS 2014
• Concept of teams
– The original OpenSHMEM active-set (group of PEs) concept is outdated, BUT very lightweight (a local operation)
• Memory Space
– Memory space association with a team
– Similar concepts can be found in MPI, Chapel, etc.
31 Preparing OpenSHMEM for Exascale
Teams
• Explicit method of grouping PEs
• Fully local objects and operations - Fast
• New (sub)teams created from parent teams
• Re-indexing of PE ids with respect to the
team
• Strided teams and axial splits
– No need to maintain a “translation” array
– All translations can be done with simple arithmetic (see the sketch below)
• Ability to specify team index for remote
operations
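The “simple arithmetic” point can be made concrete. For a strided team described by a (start, stride, size) triple, team-relative ranks and global PE ids convert in O(1) without a translation table; the descriptor below is a hypothetical illustration, not a proposed data structure:

    /* Hypothetical strided-team descriptor, for illustration only. */
    typedef struct { int start, stride, size; } strided_team_t;

    /* team-relative rank -> global PE id */
    static inline int team_to_global(const strided_team_t *t, int rank)
    {
        return t->start + rank * t->stride;
    }

    /* global PE id -> team-relative rank, or -1 if the PE is not a member */
    static inline int global_to_team(const strided_team_t *t, int pe)
    {
        int off = pe - t->start;
        if (off < 0 || off % t->stride != 0 || off / t->stride >= t->size)
            return -1;
        return off / t->stride;
    }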
32 Preparing OpenSHMEM for Exascale
Spaces
33 Preparing OpenSHMEM for Exascale
Spaces
• Space and team creation is decoupled
• Faster memory allocation compared to “shmalloc”
• Future directions
– Different types of memory
– Locality
– Separate address spaces
– Asymmetric RMA access
34 Preparing OpenSHMEM for Exascale
Fault Tolerance ?
• How to run in the presence of faults?
• What is the responsibility of the programming model and communication libraries?
• Pengfei Hao, Pavel Shamis, Manjunath Gorentla Venkata, Swaroop
Pophale, Aaron Welch, Stephen Poole, Barbara Chapman, “Fault
Tolerance for OpenSHMEM”, PGAS/OUG14
– http://nic.uoregon.edu/pgas14/oug_submissions/oug2014_submission_12.pdf
35 Preparing OpenSHMEM for Exascale
Fault Tolerance
• Basic idea
– In-memory checkpointing of symmetric memory regions
– Symmetric recovery or “memory recovery” only
36 Preparing OpenSHMEM for Exascale
Fault Tolerance
• Code snippet
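The snippet on the original slide is a figure. As a stand-in, here is a minimal illustrative sketch of the in-memory checkpoint idea using only standard OpenSHMEM calls (mirroring a symmetric region onto a partner PE); it is not the API proposed in the paper:

    #include <shmem.h>

    #define N 1024

    long data[N];                                 /* symmetric region to protect */
    long ckpt[N];                                 /* symmetric buffer holding the partner's copy */

    /* checkpoint: push my 'data' into the partner's 'ckpt' buffer */
    void checkpoint_to_partner(void)
    {
        int partner = (_my_pe() + 1) % _num_pes();
        shmem_long_put(ckpt, data, N, partner);
        shmem_barrier_all();                      /* all checkpoints are in place */
    }

    /* memory recovery: push the copy I hold back into a recovered PE's 'data' */
    void restore_partner(int recovered_pe)
    {
        shmem_long_put(data, ckpt, N, recovered_pe);
        shmem_quiet();
    }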
37 Preparing OpenSHMEM for Exascale
Fault Tolerance
• Work in progress…
• OpenSHMEM is just one piece of the puzzle
– Run-time, I/O, drivers, etc.
– The system has to provide fault tolerance infrastructure
• Error notification, coordination, etc.
• Leveraging existing work/research in the HPC community
– MPI, Hadoop, etc.
38 Preparing OpenSHMEM for Exascale
Summary
• This is just a “snapshot” of some of the ideas
– Other active research & development topics: non-blocking operations, counting operations, signaled operations, asymmetric memory access, etc.
• These challenges are relevant for many other HPC programming
models
• The key to success
– Co-design of hardware and software
– Generic solutions that target the broader community
• The challenges are common across different fields: storage, analytics, big data, etc.
39 Preparing OpenSHMEM for Exascale
How to get involved?
• Join the mailing list
– www.openshmem.org/Community/MailingList
• Join OpenSHMEM redmine
– www.openshmem.org/redmine
• GitHub
– https://github.com/orgs/openshmem-org
• OpenSHMEM reference implementation, test suites, benchmarks, etc.
• Participate in our upcoming events
– Workshop, user group meetings, and conference calls
40 Preparing OpenSHMEM for Exascale
Upcoming Events…
OpenSHMEM Workshop 2015
August 4th-6th, 2015
41 Preparing OpenSHMEM for Exascale
Upcoming Events…
Co-located with PGAS 2015, the 9th International Conference on Partitioned Global Address Space Programming Models
Washington, DC
www.csm.ornl.gov/OpenSHMEM2015/
42 Preparing OpenSHMEM for Exascale
Acknowledgements
This work was supported by the United States Department of Defense &
used resources of the Extreme Scale Systems Center at Oak Ridge
National Laboratory.
Empowering the Mission
43 Preparing OpenSHMEM for Exascale
Questions ?
44 Preparing OpenSHMEM for Exascale
Backup Slides
45 Preparing OpenSHMEM for Exascale
NVSHMEM Code Example
USING NVSHMEM – Device Code

    __global__ void one_kernel (u, v, sync, …) {
        i = threadIdx.x;
        for (…) {
            if (i+1 > nx) {
                v[i+1] = nvshmem_float_g (v[1], rightpe);
            }
            if (i-1 < 1) {
                v[i-1] = nvshmem_float_g (v[nx], leftpe);
            }
            -------
            u[i] = (u[i] + (v[i+1] + v[i-1] . . .

            /* peers array has left and right PE ids */
            if (i < 2) {
                nvshmem_int_p (sync[i], 1, peers[i]);
                nvshmem_quiet();
                nvshmem_wait_until (sync[i], EQ, 1);
            }
            // intra-process sync
            ------- // compute v from u and sync
        }
    }

Evaluation results from: “TOC-Centric Communication: a case study with NVSHMEM”, OUG/PGAS 2014, Shreeram Potluri, http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf


Editor's Notes

  • #2 Thank the organizers. Pavel Shamis / Pasha: people struggle to pronounce Pavel, so I go by Pasha (the "Pavel Chekov" character in Star Trek). * I'm research staff at ORNL, CSR group. * These days I spend most of my time working on communication middleware, high-performance networks, programming models, and OpenSHMEM. * In the past I spent more than a few years working on MPI and developing InfiniBand codes, Verbs, etc. Prior to ORNL I worked at Mellanox and led the development of the HPC software stack. I will not be staying for the rest of the conference; I have to leave after the talk. Feel free to interrupt and ask questions. On the positive side (mostly for you), I have a strict deadline of ~1 hour or I will miss my flight, so I will not be able to bore you.
  • #3 Start with the CORAL effort and the Summit system. Introduce the new architecture and its challenges. Then switch context to OpenSHMEM and how we plan to address some of those challenges.
  • #5 Sierra and Summit: Scaling to New Heights with OpenPower by David E. Bernholdt, Terri Quinn
  • #6 The system is based on Open Power architecture
  • #7 Present and Future Leadership Computers at OLCF, Buddy Bland
  • #8 HW architectures, as usual, drop their problems on software developers
  • #9 Implements PGAS, but was introduced before PGAS was formally defined. We define it this way because people are familiar with PGAS.
  • #10 I could use a better diagram
  • #12 I simplified OpenSHMEM programming. There are many more put/get operations, but overall it is not that hard.
  • #13 What we have learned by now: OpenSHMEM can be implemented as a thin layer on top of drivers such as Verbs or GNI. AMO and RMA are mapped to ibv_post_send or GNI fma_post.
  • #14 * Swiss army knife of memory operations * Can be dangerous (a straight razor)
  • #15 History of OpenSHMEM in one slide… * Between the 1990s and 2011 SHMEM didn't change too much. A few vendors provided very similar SHMEM libraries: slightly different function signatures, but very similar conceptually. OpenSHMEM was born out of the desire to address the differences between implementations, enhance SHMEM, and create a tools eco-system.
  • #16 The SHMEM specification (now OpenSHMEM) had essentially not been updated since the early 90s. Specification 1.0 – essentially SGI SHMEM. Specification 1.1 – a bugfix release of 1.0. All 1.X versions are backward compatible with 1.0. The 1.2 specification is coming this year (release ~March).
  • #17 Multiple organizations are involved in the effort. Academia and labs: development and research. Vendors: provide the implementations. Very close collaboration with vendors; we have to ensure that whatever we define can be implemented efficiently in hardware. Very often we get pushback 
  • #19 We are trying to build an OpenSHMEM eco-system that helps software developers write, debug, and analyze their codes
  • #21 The challenges for OpenSHMEM. Back to the challenges: we know that we are going to have very powerful nodes. I'm sure there are more challenges…
  • #22 Presented at the OpenSHMEM user group meeting. An excellent presentation by NVIDIA
  • #24 Preliminary results, evaluated with a stencil kernel. In order to calculate border values you have to access a "remote" GPU. Traditional kernel: you have to go to the host for communication and synchronization. Persistent kernel: communication is initiated from the GPU.
  • #25 Probably you are familiar with MPI thread safety. The most interesting mode is THREAD_MULTIPLE, where every thread can send and receive data. Most MPI implementations struggle to provide a high-performance THREAD_MULTIPLE implementation.
  • #26 An old trick. We received a lot of requests from users to enable this feature
  • #27 The memory access can be optimized by the compiler (it sees it as access through pointers)
  • #28 Requires a new API
  • #31 Active set – defines a group of processes. * Defined on the fly. * A very lightweight local operation. * Current active sets support only power-of-2 strides, most likely based on n-cube topology.
  • #35 What does it mean? * Very large scale systems. * A lot of components. * Vendors increase the density of the silicon and decrease power consumption. *** Mean time between failures is not expected to improve. As usual, the HW people dropped the problem on software  Obviously everybody expects that we know how to solve it
  • #36 Remotely accessible memory is the most critical memory region: the memory is remotely accessible and visible to other processes
  • #37 Pseudo-code
  • #39 On the openshmem.org website you may find references to publications, presentations, and more. P2P synchronization operations proposed by Intel; RMA I/O operations for storage – Los Alamos
  • #41 Last year's workshop was in Annapolis (near DC): OpenSHMEM tutorials, an InfiniBand Verbs tutorial. This year: will be announced soon, most likely Baltimore. Similar format to the previous event: one day of tutorials, the second day presentations and…
  • #42 A bit different format: short papers or presentations, F2F discussions