New Process/Thread Runtime
Process in Process
Techniques for Practical
Address-Space Sharing
Atsushi Hori (RIKEN)
Dec. 13, 2017
Arm HPC Workshop@Akihabara 2017
Background
• The rise of many-core architectures
• The current parallel execution models are
designed for multi-core architectures
• Shall we have a new parallel execution model?
2
Arm HPC Workshop@Akihabara 2017
What should be shared and what should not be shared?
• Isolated address spaces
• slow communication
• Shared variables
• contention on shared variables
3
                        Address Space
                        Isolated                Shared
Variables
  Privatized            Multi-Process (MPI)
  Shared                ??                      Multi-Thread (OpenMP)
Arm HPC Workshop@Akihabara 2017
What should be shared and what should not be shared?
• Isolated address spaces
• slow communication
• Shared variables
• contention on shared variables
4
                        Address Space
                        Isolated                Shared
Variables
  Privatized            Multi-Process (MPI)     3rd Exec. Model
  Shared                ??                      Multi-Thread (OpenMP)
Arm HPC Workshop@Akihabara 2017
Implementation of the 3rd Execution Model
• MPC (by CEA)
• Multi-thread approach
• Compiler converts all variables to thread-local
• a.out and b.out cannot run simultaneously
• PVAS (by RIKEN)
• Multi-process approach
• Patched Linux
• OS kernel allows processes to share address
space
• MPC, PVAS, and SMARTMAP are not portable
5
Arm HPC Workshop@Akihabara 2017
Why does portability matter?
• On large supercomputers (e.g., the K computer), users are not
allowed to install a modified OS kernel or kernel modules
• When I tried to port PVAS onto McKernel, the core developer
denied the modification
• DO NOT CONTAMINATE MY CODE !!
6
Arm HPC Workshop@Akihabara 2017
PiP is very PORTABLE
7
Machine                 CPU        OS
Xeon and Xeon Phi       x86_64     Linux
                        x86_64     McKernel
the K and FX10          SPARC64    XTCOS
ARM (Opteron A1170)     Aarch64    Linux
[Figure: Task Spawning Time — time [s] vs. # tasks (1–200) on Xeon, KNL, Aarch64, and the K; series: PiP:preload, PiP:thread, Fork&Exec, Vfork&Exec, PosixSpawn, Pthread]
Arm HPC Workshop@Akihabara 2017
Portability
• PiP can run on machines where the following are supported:
• pthread_create() (or the clone() system call)
• PIE
• dlmopen()
• PiP does not run on:
• BG/Q (PIE is not supported)
• Windows (PIE is not fully supported)
• Mac OS X (dlmopen() is not supported)
• FACT: All machines listed in the Top500 (Nov. 2017)
use a Linux-family OS !!
8
Arm HPC Workshop@Akihabara 2017
• User-level implementation of the 3rd exec. model
• Portable and practical
Process in Process (PiP)
9
555555554000-555555556000 r-xp ... /PIP/test/basic
555555755000-555555756000 r--p ... /PIP/test/basic
555555756000-555555757000 rw-p ... /PIP/test/basic
555555757000-555555778000 rw-p ... [heap]
7fffe8000000-7fffe8021000 rw-p ...
7fffe8021000-7fffec000000 ---p ...
7ffff0000000-7ffff0021000 rw-p ...
7ffff0021000-7ffff4000000 ---p ...
7ffff4b24000-7ffff4c24000 rw-p ...
7ffff4c24000-7ffff4c27000 r-xp ... /PIP/lib/libpip.so
7ffff4c27000-7ffff4e26000 ---p ... /PIP/lib/libpip.so
7ffff4e26000-7ffff4e27000 r--p ... /PIP/lib/libpip.so
7ffff4e27000-7ffff4e28000 rw-p ... /PIP/lib/libpip.so
7ffff4e28000-7ffff4e2a000 r-xp ... /PIP/test/basic
7ffff4e2a000-7ffff5029000 ---p ... /PIP/test/basic
7ffff5029000-7ffff502a000 r--p ... /PIP/test/basic
7ffff502a000-7ffff502b000 rw-p ... /PIP/test/basic
7ffff502b000-7ffff502e000 r-xp ... /PIP/lib/libpip.so
7ffff502e000-7ffff522d000 ---p ... /PIP/lib/libpip.so
7ffff522d000-7ffff522e000 r--p ... /PIP/lib/libpip.so
7ffff522e000-7ffff522f000 rw-p ... /PIP/lib/libpip.so
7ffff522f000-7ffff5231000 r-xp ... /PIP/test/basic
7ffff5231000-7ffff5430000 ---p ... /PIP/test/basic
7ffff5430000-7ffff5431000 r--p ... /PIP/test/basic
7ffff5431000-7ffff5432000 rw-p ... /PIP/test/basic
...
7ffff5a52000-7ffff5a56000 rw-p ...
...
7ffff5c6e000-7ffff5c72000 rw-p ...
7ffff5c72000-7ffff5e28000 r-xp ... /lib64/libc.so
7ffff5e28000-7ffff6028000 ---p ... /lib64/libc.so
7ffff6028000-7ffff602c000 r--p ... /lib64/libc.so
7ffff602c000-7ffff602e000 rw-p ... /lib64/libc.so
7ffff602e000-7ffff6033000 rw-p ...
7ffff6033000-7ffff61e9000 r-xp ... /lib64/libc.so
7ffff61e9000-7ffff63e9000 ---p ... /lib64/libc.so
7ffff63e9000-7ffff63ed000 r--p ... /lib64/libc.so
7ffff63ed000-7ffff63ef000 rw-p ... /lib64/libc.so
7ffff63ef000-7ffff63f4000 rw-p ...
7ffff63f4000-7ffff63f5000 ---p ...
7ffff63f5000-7ffff6bf5000 rw-p ... [stack:10641]
7ffff6bf5000-7ffff6bf6000 ---p ...
7ffff6bf6000-7ffff73f6000 rw-p ... [stack:10640]
7ffff73f6000-7ffff75ac000 r-xp ... /lib64/libc.so
7ffff75ac000-7ffff77ac000 ---p ... /lib64/libc.so
7ffff77ac000-7ffff77b0000 r--p ... /lib64/libc.so
7ffff77b0000-7ffff77b2000 rw-p ... /lib64/libc.so
7ffff77b2000-7ffff77b7000 rw-p ...
...
7ffff79cf000-7ffff79d3000 rw-p ...
7ffff79d3000-7ffff79d6000 r-xp ... /PIP/lib/libpip.so
7ffff79d6000-7ffff7bd5000 ---p ... /PIP/lib/libpip.so
7ffff7bd5000-7ffff7bd6000 r--p ... /PIP/lib/libpip.so
7ffff7bd6000-7ffff7bd7000 rw-p ... /PIP/lib/libpip.so
7ffff7ddb000-7ffff7dfc000 r-xp ... /lib64/ld.so
7ffff7edc000-7ffff7fe0000 rw-p ...
7ffff7ff7000-7ffff7ffa000 rw-p ...
7ffff7ffa000-7ffff7ffc000 r-xp ... [vdso]
7ffff7ffc000-7ffff7ffd000 r--p ... /lib64/ld.so
7ffff7ffd000-7ffff7ffe000 rw-p ... /lib64/ld.so
7ffff7ffe000-7ffff7fff000 rw-p ...
7ffffffde000-7ffffffff000 rw-p ... [stack]
ffffffffff600000-ffffffffff601000 r-xp ... [vsyscall]
[Diagram: one address space holding multiple copies of the program and Glibc; Task-0 ... Task-(n-1) each run a.out with a private "int x", and Task-(n) ... Task-(m-1) each run b.out with a private "int a"]
Arm HPC Workshop@Akihabara 2017
Why is address-space sharing better?
• Memory-mapping techniques in the multi-process model
• POSIX shmem (SYS-V, mmap, ...)
• XPMEM
• With address-space sharing, the same page table is shared by tasks
• no page-table coherency overhead
• saves memory for page tables
• pointers can be used as they are (see the sketch below)
10
Memory mapping must maintain page-table coherency -> OVERHEAD
(system calls, page faults, and page-table size)
[Diagram: Proc-0 and Proc-1 each hold their own page table mapping a shared region onto the same shared physical memory pages; the two page tables must be kept coherent]
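A minimal sketch (not from the slides) of the contrast drawn on this slide: under the multi-process model every process pays the shm_open()/ftruncate()/mmap() system calls (compare the cycle counts in Table 5 on the next slide) and fills in its own page-table entries for the mapping, whereas in a shared address space a pointer handed over from another task is already valid and can be dereferenced directly. SHM_NAME and the helper names are illustrative; link with -lrt on older Glibc.

/* Sketch only: contrasts POSIX shmem setup with direct pointer use
 * in a shared address space.  SHM_NAME is an illustrative name. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/demo_region"   /* hypothetical shmem object name */
#define SHM_SIZE (1 << 20)

/* Multi-process model: every process pays shm_open/ftruncate/mmap
 * and populates its own page-table entries for the region. */
static void *map_posix_shmem(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, SHM_SIZE) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Shared-address-space model: a pointer exported by another task is
 * already valid here; no extra mapping or page-table update is needed. */
static void use_shared_pointer(double *remote_buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        remote_buf[i] *= 2.0;     /* direct access, no copy */
}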
Arm HPC Workshop@Akihabara 2017
Memory Mapping vs. PiP
11
[Excerpt from the PPoPP 2018 paper "Process in Process: Techniques for Practical Address-Space Sharing" (February 24–28, 2018, Vienna, Austria); the clipped platform H/W and S/W tables are omitted]
Table 5. Overhead of XPMEM and POSIX shmem functions (Wallaby/Linux)

XPMEM                        Cycles
  xpmem_make()                1,585
  xpmem_get()                15,294
  xpmem_attach()              2,414
  xpmem_detach()             19,183
  xpmem_release()               693

POSIX shmem                  Cycles
  Sender    shm_open()       22,294
            ftruncate()       4,080
            mmap()            5,553
            close()           6,017
  Receiver  shm_open()       13,522
            mmap()           16,232
            close()          16,746
6.2 Page Fault Overhead
Figure 4 shows the time series of each access using the same
microbenchmark program used in the preceding subsection.
Element access was strided with 64 bytes so that each cache
block was accessed only once, to eliminate the cache block
effect. In the XPMEM case, the mmap()ed region was attached
by using the XPMEM functions. The upper-left graph in
this figure shows the time series using POSIX shmem and
XPMEM, and the lower-left graph shows the time series
using PiP. Both graphs on the left-hand side show spikes at
every 4 KiB. Because of space limitations, we do not show ...
(Xeon/Linux)
[Figure: access time in ticks (log scale) vs. array element byte offset (0–16,384); upper panels: POSIX shmem and XPMEM with 4 KiB and 2 MiB page sizes, lower panels: PiP:process and PiP:thread]
PiP takes less than 100 clocks !!
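A small sketch (assumptions, not the paper's benchmark code) of the kind of microbenchmark described above: touch a mapped region with a 64-byte stride and time each access with rdtsc, so that first-touch page faults show up as spikes at every 4 KiB page boundary.

/* Sketch only: strided-access timing loop; buf is assumed to point at
 * a freshly mapped shared region (e.g. from the shmem sketch above). */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static void time_strided_access(volatile char *buf, size_t len)
{
    for (size_t off = 0; off < len; off += 64) {
        uint64_t t0 = __rdtsc();
        buf[off]++;                        /* may fault on a new page */
        uint64_t t1 = __rdtsc();
        printf("%zu %llu\n", off, (unsigned long long)(t1 - t0));
    }
}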
Arm HPC Workshop@Akihabara 2017
Process in Process (PiP)
• dlmopen (not a typo of dlopen)
• loads a program into a new name space
• The same variable “foo” can have multiple
instances at different addresses
• Position Independent Executable (PIE)
• PIE programs can be loaded at any location
• Combine dlmopen and PIE
• load a PIE program with dlmopen
• We can privatize variables in the same
address space (see the sketch below)
12
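A minimal sketch (not the PiP library itself) of the dlmopen-plus-PIE trick described above: the same PIE is loaded twice, each time into a fresh link-map namespace, so the global variable "foo" gets a separate instance at a separate address in each copy. It assumes a.out was built as a PIE with its globals visible in the dynamic symbol table (e.g. gcc -fPIE -pie -rdynamic) and that the Glibc version in use lets dlmopen() load an executable; link with -ldl.

/* Sketch only: privatize a global by loading the same PIE into two
 * link-map namespaces.  "./a.out" and "foo" are illustrative names. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* LM_ID_NEWLM asks the dynamic linker for a fresh namespace. */
    void *h0 = dlmopen(LM_ID_NEWLM, "./a.out", RTLD_NOW | RTLD_LOCAL);
    void *h1 = dlmopen(LM_ID_NEWLM, "./a.out", RTLD_NOW | RTLD_LOCAL);
    if (!h0 || !h1) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    int *foo0 = (int *)dlsym(h0, "foo");
    int *foo1 = (int *)dlsym(h1, "foo");
    /* Two privatized instances of the same variable, one address space. */
    printf("foo@namespace0 = %p, foo@namespace1 = %p\n",
           (void *)foo0, (void *)foo1);
    return 0;
}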
Arm HPC Workshop@Akihabara 2017
Glibc Issue
• In the current Glibc, dlmopen() can create only up to 16
name spaces (see the sketch below)
• Each PiP task requires one name space to hold its
privatized variables
• Many-core architectures can run more than 16 PiP tasks,
up to the number of CPU cores
• A Glibc patch is also provided to allow a larger number of
name spaces, in case 16 is not enough
• by changing the size of the name space table
• Currently 260 PiP tasks can be created
• Some workaround code can be found in the PiP library
code
13
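A small sketch (hypothetical; libfoo.so stands in for any shared object) of the limit mentioned above: keep calling dlmopen() with LM_ID_NEWLM until it fails, and count how many namespaces an unpatched Glibc hands out (16, per the slide).

/* Sketch only: count available link-map namespaces. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    int n = 0;
    for (;;) {
        void *h = dlmopen(LM_ID_NEWLM, "libfoo.so", RTLD_NOW | RTLD_LOCAL);
        if (h == NULL) {
            /* Typically fails once no more namespaces are available. */
            fprintf(stderr, "dlmopen failed after %d namespaces: %s\n",
                    n, dlerror());
            break;
        }
        n++;
    }
    return 0;
}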
Arm HPC Workshop@Akihabara 2017
PiP Showcases
14
Arm HPC Workshop@Akihabara 2017
Showcase 1 : MPI pt2pt
• Current eager/rendezvous: 2 copies
• PiP rendezvous: 1 copy (see the sketch below)
15
(Xeon/Linux)
[Figure: bandwidth (MB/s, higher is better) vs. message size (bytes); series: eager-2copy, rndv-2copy, PiP (rndv-1copy)]
PiP is 3.5x faster @ 128KB
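A minimal sketch (assumptions, not the MPICH/PiP implementation) of the idea behind the 1-copy rendezvous: because sender and receiver share one address space, the receiver can copy straight out of the sender's user buffer, while a conventional rendezvous needs a sender-to-shared-buffer copy plus a shared-buffer-to-receiver copy. The rts structure and the control-message exchange are illustrative.

/* Sketch only: receiver side of a 1-copy rendezvous in a shared
 * address space. */
#include <stddef.h>
#include <string.h>

typedef struct {
    const void *src_addr;   /* sender's buffer address, valid in every task */
    size_t      len;
} rts_msg_t;                /* hypothetical "ready to send" control message */

static void recv_rendezvous_1copy(const rts_msg_t *rts, void *recv_buf)
{
    memcpy(recv_buf, rts->src_addr, rts->len);   /* the single copy */
    /* ... then notify the sender that its buffer may be reused ... */
}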
Arm HPC Workshop@Akihabara 2017
Showcase 2 : MPI DDT
• Derived Data Type (DDT) Communication
• Non-contiguous data transfer
• Current: pack - send - unpack (3 copies)
• PiP: non-contiguous send (1 copy; see the sketch below)
16
[Figure: normalized time (lower is better) for DDT transfers on Xeon/Linux; x-axis: count of double elements in the X, Y, Z dimensions, from (64K, 16, 128) down to (64, 16K, 128); series: eager-2copy (base), rndv-2copy, PiP with non-contiguous vectors]
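A minimal sketch (assumptions, not the actual MPI datatype engine) of the 1-copy non-contiguous transfer: since the sender's buffer is directly addressable, the receiver can gather the strided blocks itself instead of pack, send, unpack. The parameters mirror an MPI_Type_vector(count, blocklen, stride, MPI_DOUBLE) layout.

/* Sketch only: one-copy gather of a strided (vector) datatype from the
 * sender's buffer into the receiver's contiguous buffer. */
#include <stddef.h>
#include <string.h>

static void recv_vector_1copy(const double *src,   /* sender's buffer */
                              double *dst,         /* receiver's buffer */
                              size_t count, size_t blocklen, size_t stride)
{
    for (size_t i = 0; i < count; i++)
        memcpy(dst + i * blocklen,
               src + i * stride,
               blocklen * sizeof(double));
}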
Arm HPC Workshop@Akihabara 2017
Showcase 3 : MPI_Win_allocate_shared (1/2)
17
MPI Implementation

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  ...
  MPI_Win_allocate_shared(size, 1,
      MPI_INFO_NULL, comm, &mem, &win);
  ...
  MPI_Win_shared_query(win, north, &sz,
      &dsp_unit, &northptr);
  MPI_Win_shared_query(win, south, &sz,
      &dsp_unit, &southptr);
  MPI_Win_shared_query(win, east, &sz,
      &dsp_unit, &eastptr);
  MPI_Win_shared_query(win, west, &sz,
      &dsp_unit, &westptr);
  ...
  MPI_Win_lock_all(0, win);
  for (int iter = 0; iter < niters; ++iter) {
    MPI_Win_sync(win);
    MPI_Barrier(shmcomm);
    /* stencil computation */
  }
  MPI_Win_unlock_all(win);
  ...
}

PiP Implementation

int main(int argc, char **argv) {
  pip_init( &pipid, &p, NULL, 0 );
  ...
  mem = malloc( size );
  ...
  /* get the neighbors' instances of the privatized variable "mem" */
  pip_get_addr( north, mem, &northptr );
  pip_get_addr( south, mem, &southptr );
  pip_get_addr( east, mem, &eastptr );
  pip_get_addr( west, mem, &westptr );
  ...
  for (int iter = 0; iter < niters; ++iter) {
    pip_barrier( p );
    ...
    /* stencil computation */
  }
  ...
  pip_fin();
}
Arm HPC Workshop@Akihabara 2017
Showcase 3 : MPI_Win_allocate_shared (2/2)
18
[Figure: 5-point stencil (4K x 4K) on KNL; left panel: total number of page faults vs. # tasks (1–1,000); right panel: total page-table size [KiB] and page-table size as a percentage of the array size (MPI) vs. # tasks; series: PiP and MPI (lower is better)]
Arm HPC Workshop@Akihabara 2017
Showcase 4 : In Situ
19
[Diagram: original SHMEM-based in situ — the LAMMPS process gathers data chunks and copies them into a pre-allocated shared buffer (copy-in), and the in situ process copies them out (copy-out) for analysis and dump; PiP-based in situ — the in situ process gathers the data chunks directly from the LAMMPS process and copies them out once]
[Figure: LAMMPS in situ, POSIX shmem vs. PiP on Xeon/Linux; slowdown ratio (relative to runs without in situ, lower is better) for the 3d Lennard-Jones melt at problem sizes (4,4,4) through (12,12,12)]
• LAMMPS process ran with four OpenMP threads
• In situ process ran with a single thread
• O(N²) computation cost dominates the data transfer cost at (12,12,12)
Arm HPC Workshop@Akihabara 2017
Showcase 5 : SNAP
20
PiP vs. threads in hybrid MPI+X SNAP, strong scaling on OFP (1–16 nodes, flat mode); comparing (MPI + OpenMP) with (MPI + PiP). Solve time in seconds, lower is better.

Number of Cores   MPICH/Threads   MPICH/PiP   Speedup (PiP vs. Threads)
      16              683.3          430.5          1.6
      32              379.1          221.2          1.7
      64              207.9          123.0          1.7
     128              153.0           68.3          2.2
     256              106.4           42.0          2.5
     512               91.6           27.7          3.3
    1024               83.3           22.0          3.8
Arm HPC Workshop@Akihabara 2017
Showcase 5 : Using PiP as the “X” in Hybrid MPI + “X” (2)
21
• PiP-based parallelism
– Easy application data sharing across cores
– No multithreading safety overhead
– Naturally utilizing multiple network ports
[Diagram: network ports, MPI stack, APP data]
[Figure: multipair message rate (osu_mbw_mr), in K messages/s, vs. message size (1 B – 4 MB) for 1, 4, 16, and 64 pairs, between PiP tasks and between threads, measured between two OFP nodes (Xeon Phi + Linux, flat mode)]
Arm HPC Workshop@Akihabara 2017
Research Collaboration
• ANL (Dr. Pavan and Dr. Min) — DOE-MEXT
• MPICH
• UT/ICL (Prof. Bosilca)
• Open MPI
• CEA (Dr. Pérache) — CEA-RIKEN
• MPC
• UIUC (Prof. Kale) — JLESC
• AMPI
• Intel (Dr. Dayal)
• In Situ
22
Arm HPC Workshop@Akihabara 2017
Summary
• Process in Process (PiP)
• New implementation of the 3rd execution
model
• better than memory-mapping techniques
• PiP is portable and practical because of its
user-level implementation
• can run on the K and OFP
supercomputers
• Showcases demonstrate that PiP can improve
performance
23
Arm HPC Workshop@Akihabara 2017
Final words
• The Glibc issues will be reported to Red Hat
• We are seeking PiP applications not only in HPC
but also in enterprise computing
24