New Process/Thread Runtime
Process in Process
Techniques for Practical
Address-Space Sharing
Atsushi Hori (RIKEN)
Dec. 13, 2017
Arm HPC Workshop@Akihabara 2017
Background
• The rise of many-core architectures
• The current parallel execution models are
designed for multi-core architectures
• Shall we have a new parallel execution model?
2
Arm HPC Workshop@Akihabara 2017
What should be shared and what should not be shared?
• Isolated address spaces
• slow communication
• Shared variables
• contention on shared variables
3
                        Address Space
                        Isolated                Shared
Variables
  Privatized            Multi-Process (MPI)
  Shared                ??                      Multi-Thread (OpenMP)
Arm HPC Workshop@Akihabara 2017
What should be shared and what should not be shared?
• Isolated address spaces
• slow communication
• Shared variables
• contention on shared variables
4
                        Address Space
                        Isolated                Shared
Variables
  Privatized            Multi-Process (MPI)     3rd Exec. Model
  Shared                ??                      Multi-Thread (OpenMP)
Arm HPC Workshop@Akihabara 2017
Implementation of the 3rd Execution Model
• MPC (by CEA)
• Multi-thread approach
• Compiler converts all variables to thread-local
• a.out and b.out cannot run simultaneously
• PVAS (by RIKEN)
• Multi-process approach
• Patched Linux
• OS kernel allows processes to share address
space
• MPC, PVAS, and SMARTMAP are not portable
5
Arm HPC Workshop@Akihabara 2017
Why does portability matter?
• On large supercomputers (e.g., the K computer), users are not
allowed to install a modified OS kernel or kernel modules
• When I tried to port PVAS onto McKernel, the core developer
denied the modification
• DO NOT CONTAMINATE MY CODE !!
6
Arm HPC Workshop@Akihabara 2017
PiP is very PORTABLE
7
Machine                 CPU        OS
Xeon and Xeon Phi       x86_64     Linux
                        x86_64     McKernel
the K and FX10          SPARC64    XTCOS
ARM (Opteron A1170)     Aarch64    Linux
[Figure: Task Spawning Time — time [s] vs. # tasks (1–200) on Xeon, KNL, Aarch64, and the K; series: PiP:preload, PiP:thread, Fork&Exec, Vfork&Exec, PosixSpawn, Pthread]
Arm HPC Workshop@Akihabara 2017
Portability
• PiP can run on machines where the following are supported:
• pthread_create() (or the clone() system call)
• PIE
• dlmopen()
• PiP does not run on:
• BG/Q (PIE is not supported)
• Windows (PIE is not fully supported)
• Mac OS X (dlmopen() is not supported)
• FACT: All machines listed in the Top500 (Nov. 2017)
use a Linux-family OS !!
8
Arm HPC Workshop@Akihabara 2017
• User-level implementation of the 3rd exec. model
• Portable and practical
Process in Process (PiP)
9
555555554000-555555556000 r-xp ... /PIP/test/basic
555555755000-555555756000 r--p ... /PIP/test/basic
555555756000-555555757000 rw-p ... /PIP/test/basic
555555757000-555555778000 rw-p ... [heap]
7fffe8000000-7fffe8021000 rw-p ...
7fffe8021000-7fffec000000 ---p ...
7ffff0000000-7ffff0021000 rw-p ...
7ffff0021000-7ffff4000000 ---p ...
7ffff4b24000-7ffff4c24000 rw-p ...
7ffff4c24000-7ffff4c27000 r-xp ... /PIP/lib/libpip.so
7ffff4c27000-7ffff4e26000 ---p ... /PIP/lib/libpip.so
7ffff4e26000-7ffff4e27000 r--p ... /PIP/lib/libpip.so
7ffff4e27000-7ffff4e28000 rw-p ... /PIP/lib/libpip.so
7ffff4e28000-7ffff4e2a000 r-xp ... /PIP/test/basic
7ffff4e2a000-7ffff5029000 ---p ... /PIP/test/basic
7ffff5029000-7ffff502a000 r--p ... /PIP/test/basic
7ffff502a000-7ffff502b000 rw-p ... /PIP/test/basic
7ffff502b000-7ffff502e000 r-xp ... /PIP/lib/libpip.so
7ffff502e000-7ffff522d000 ---p ... /PIP/lib/libpip.so
7ffff522d000-7ffff522e000 r--p ... /PIP/lib/libpip.so
7ffff522e000-7ffff522f000 rw-p ... /PIP/lib/libpip.so
7ffff522f000-7ffff5231000 r-xp ... /PIP/test/basic
7ffff5231000-7ffff5430000 ---p ... /PIP/test/basic
7ffff5430000-7ffff5431000 r--p ... /PIP/test/basic
7ffff5431000-7ffff5432000 rw-p ... /PIP/test/basic
...
7ffff5a52000-7ffff5a56000 rw-p ...
...
7ffff5c6e000-7ffff5c72000 rw-p ...
7ffff5c72000-7ffff5e28000 r-xp ... /lib64/libc.so
7ffff5e28000-7ffff6028000 ---p ... /lib64/libc.so
7ffff6028000-7ffff602c000 r--p ... /lib64/libc.so
7ffff602c000-7ffff602e000 rw-p ... /lib64/libc.so
7ffff602e000-7ffff6033000 rw-p ...
7ffff6033000-7ffff61e9000 r-xp ... /lib64/libc.so
7ffff61e9000-7ffff63e9000 ---p ... /lib64/libc.so
7ffff63e9000-7ffff63ed000 r--p ... /lib64/libc.so
7ffff63ed000-7ffff63ef000 rw-p ... /lib64/libc.so
7ffff63ef000-7ffff63f4000 rw-p ...
7ffff63f4000-7ffff63f5000 ---p ...
7ffff63f5000-7ffff6bf5000 rw-p ... [stack:10641]
7ffff6bf5000-7ffff6bf6000 ---p ...
7ffff6bf6000-7ffff73f6000 rw-p ... [stack:10640]
7ffff73f6000-7ffff75ac000 r-xp ... /lib64/libc.so
7ffff75ac000-7ffff77ac000 ---p ... /lib64/libc.so
7ffff77ac000-7ffff77b0000 r--p ... /lib64/libc.so
7ffff77b0000-7ffff77b2000 rw-p ... /lib64/libc.so
7ffff77b2000-7ffff77b7000 rw-p ...
...
7ffff79cf000-7ffff79d3000 rw-p ...
7ffff79d3000-7ffff79d6000 r-xp ... /PIP/lib/libpip.so
7ffff79d6000-7ffff7bd5000 ---p ... /PIP/lib/libpip.so
7ffff7bd5000-7ffff7bd6000 r--p ... /PIP/lib/libpip.so
7ffff7bd6000-7ffff7bd7000 rw-p ... /PIP/lib/libpip.so
7ffff7ddb000-7ffff7dfc000 r-xp ... /lib64/ld.so
7ffff7edc000-7ffff7fe0000 rw-p ...
7ffff7ff7000-7ffff7ffa000 rw-p ...
7ffff7ffa000-7ffff7ffc000 r-xp ... [vdso]
7ffff7ffc000-7ffff7ffd000 r--p ... /lib64/ld.so
7ffff7ffd000-7ffff7ffe000 rw-p ... /lib64/ld.so
7ffff7ffe000-7ffff7fff000 rw-p ...
7ffffffde000-7ffffffff000 rw-p ... [stack]
ffffffffff600000-ffffffffff601000 r-xp ... [vsyscall]
[Diagram: one address space holding multiple copies of the program and Glibc; Task-0 ... Task-(n-1) each run a.out with a private "int x", and Task-(n) ... Task-(m-1) each run b.out with a private "int a"]
Arm HPC Workshop@Akihabara 2017
Why is address-space sharing better?
• Memory-mapping techniques in the multi-process model
• POSIX shmem (SYS-V, mmap, ...)
• XPMEM
• With address-space sharing, the same page table is shared by tasks
• no page-table coherency overhead
• saves memory for page tables
• pointers can be used as they are (see the sketch below)
10
Memory mapping must maintain page-table coherency -> OVERHEAD
(system calls, page faults, and page-table size)
[Diagram: Proc-0 and Proc-1 each hold their own page table mapping a shared region onto the same shared physical memory pages; the two page tables must be kept coherent]
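A minimal sketch (not from the slides) of the contrast drawn on this slide: under the multi-process model every process pays the shm_open()/ftruncate()/mmap() system calls (compare the cycle counts in Table 5 on the next slide) and fills in its own page-table entries for the mapping, whereas in a shared address space a pointer handed over from another task is already valid and can be dereferenced directly. SHM_NAME and the helper names are illustrative; link with -lrt on older Glibc.

/* Sketch only: contrasts POSIX shmem setup with direct pointer use
 * in a shared address space.  SHM_NAME is an illustrative name. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/demo_region"   /* hypothetical shmem object name */
#define SHM_SIZE (1 << 20)

/* Multi-process model: every process pays shm_open/ftruncate/mmap
 * and populates its own page-table entries for the region. */
static void *map_posix_shmem(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, SHM_SIZE) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Shared-address-space model: a pointer exported by another task is
 * already valid here; no extra mapping or page-table update is needed. */
static void use_shared_pointer(double *remote_buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        remote_buf[i] *= 2.0;     /* direct access, no copy */
}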
Arm HPC Workshop@Akihabara 2017
Memory Mapping vs. PiP
11
[Excerpt from the PPoPP 2018 paper "Process in Process: Techniques for Practical Address-Space Sharing" (February 24–28, 2018, Vienna, Austria); the clipped platform H/W and S/W tables are omitted]
Table 5. Overhead of XPMEM and POSIX shmem functions (Wallaby/Linux)

XPMEM                        Cycles
  xpmem_make()                1,585
  xpmem_get()                15,294
  xpmem_attach()              2,414
  xpmem_detach()             19,183
  xpmem_release()               693

POSIX shmem                  Cycles
  Sender    shm_open()       22,294
            ftruncate()       4,080
            mmap()            5,553
            close()           6,017
  Receiver  shm_open()       13,522
            mmap()           16,232
            close()          16,746
6.2 Page Fault Overhead
Figure 4 shows the time series of each access using the same
microbenchmark program used in the preceding subsection.
Element access was strided with 64 bytes so that each cache
block was accessed only once, to eliminate the cache block
effect. In the XPMEM case, the mmap()ed region was attached
by using the XPMEM functions. The upper-left graph in
this figure shows the time series using POSIX shmem and
XPMEM, and the lower-left graph shows the time series
using PiP. Both graphs on the left-hand side show spikes at
every 4 KiB. Because of space limitations, we do not show ...
(Xeon/Linux)
[Figure: access time in ticks (log scale) vs. array element byte offset (0–16,384); upper panels: POSIX shmem and XPMEM with 4 KiB and 2 MiB page sizes, lower panels: PiP:process and PiP:thread]
PiP takes less than 100 clocks !!
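A small sketch (assumptions, not the paper's benchmark code) of the kind of microbenchmark described above: touch a mapped region with a 64-byte stride and time each access with rdtsc, so that first-touch page faults show up as spikes at every 4 KiB page boundary.

/* Sketch only: strided-access timing loop; buf is assumed to point at
 * a freshly mapped shared region (e.g. from the shmem sketch above). */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static void time_strided_access(volatile char *buf, size_t len)
{
    for (size_t off = 0; off < len; off += 64) {
        uint64_t t0 = __rdtsc();
        buf[off]++;                        /* may fault on a new page */
        uint64_t t1 = __rdtsc();
        printf("%zu %llu\n", off, (unsigned long long)(t1 - t0));
    }
}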
Arm HPC Workshop@Akihabara 2017
Process in Process (PiP)
• dlmopen (not a typo of dlopen)
• loads a program into a new name space
• The same variable “foo” can have multiple
instances at different addresses
• Position Independent Executable (PIE)
• PIE programs can be loaded at any location
• Combine dlmopen and PIE
• load a PIE program with dlmopen
• We can privatize variables in the same
address space (see the sketch below)
12
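A minimal sketch (not the PiP library itself) of the dlmopen-plus-PIE trick described above: the same PIE is loaded twice, each time into a fresh link-map namespace, so the global variable "foo" gets a separate instance at a separate address in each copy. It assumes a.out was built as a PIE with its globals visible in the dynamic symbol table (e.g. gcc -fPIE -pie -rdynamic) and that the Glibc version in use lets dlmopen() load an executable; link with -ldl.

/* Sketch only: privatize a global by loading the same PIE into two
 * link-map namespaces.  "./a.out" and "foo" are illustrative names. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* LM_ID_NEWLM asks the dynamic linker for a fresh namespace. */
    void *h0 = dlmopen(LM_ID_NEWLM, "./a.out", RTLD_NOW | RTLD_LOCAL);
    void *h1 = dlmopen(LM_ID_NEWLM, "./a.out", RTLD_NOW | RTLD_LOCAL);
    if (!h0 || !h1) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    int *foo0 = (int *)dlsym(h0, "foo");
    int *foo1 = (int *)dlsym(h1, "foo");
    /* Two privatized instances of the same variable, one address space. */
    printf("foo@namespace0 = %p, foo@namespace1 = %p\n",
           (void *)foo0, (void *)foo1);
    return 0;
}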
Arm HPC Workshop@Akihabara 2017
Glibc Issue
• In the current Glibc, dlmopen() can create only up to 16
name spaces (see the sketch below)
• Each PiP task requires one name space to hold its
privatized variables
• Many-core architectures can run more than 16 PiP tasks,
up to the number of CPU cores
• A Glibc patch is also provided to allow a larger number of
name spaces, in case 16 is not enough
• by changing the size of the name space table
• Currently 260 PiP tasks can be created
• Some workaround code can be found in the PiP library
code
13
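A small sketch (hypothetical; libfoo.so stands in for any shared object) of the limit mentioned above: keep calling dlmopen() with LM_ID_NEWLM until it fails, and count how many namespaces an unpatched Glibc hands out (16, per the slide).

/* Sketch only: count available link-map namespaces. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    int n = 0;
    for (;;) {
        void *h = dlmopen(LM_ID_NEWLM, "libfoo.so", RTLD_NOW | RTLD_LOCAL);
        if (h == NULL) {
            /* Typically fails once no more namespaces are available. */
            fprintf(stderr, "dlmopen failed after %d namespaces: %s\n",
                    n, dlerror());
            break;
        }
        n++;
    }
    return 0;
}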
Arm HPC Workshop@Akihabara 2017
PiP Showcases
14
Arm HPC Workshop@Akihabara 2017
Showcase 1 : MPI pt2pt
• Current eager/rendezvous: 2 copies
• PiP rendezvous: 1 copy (see the sketch below)
15
(Xeon/Linux)
[Figure: bandwidth (MB/s, higher is better) vs. message size (bytes); series: eager-2copy, rndv-2copy, PiP (rndv-1copy)]
PiP is 3.5x faster @ 128KB
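A minimal sketch (assumptions, not the MPICH/PiP implementation) of the idea behind the 1-copy rendezvous: because sender and receiver share one address space, the receiver can copy straight out of the sender's user buffer, while a conventional rendezvous needs a sender-to-shared-buffer copy plus a shared-buffer-to-receiver copy. The rts structure and the control-message exchange are illustrative.

/* Sketch only: receiver side of a 1-copy rendezvous in a shared
 * address space. */
#include <stddef.h>
#include <string.h>

typedef struct {
    const void *src_addr;   /* sender's buffer address, valid in every task */
    size_t      len;
} rts_msg_t;                /* hypothetical "ready to send" control message */

static void recv_rendezvous_1copy(const rts_msg_t *rts, void *recv_buf)
{
    memcpy(recv_buf, rts->src_addr, rts->len);   /* the single copy */
    /* ... then notify the sender that its buffer may be reused ... */
}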
Arm HPC Workshop@Akihabara 2017
Showcase 2 : MPI DDT
• Derived Data Type (DDT) Communication
• Non-contiguous data transfer
• Current: pack - send - unpack (3 copies)
• PiP: non-contiguous send (1 copy; see the sketch below)
16
[Figure: normalized time (lower is better) for DDT transfers on Xeon/Linux; x-axis: count of double elements in the X, Y, Z dimensions, from (64K, 16, 128) down to (64, 16K, 128); series: eager-2copy (base), rndv-2copy, PiP with non-contiguous vectors]
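A minimal sketch (assumptions, not the actual MPI datatype engine) of the 1-copy non-contiguous transfer: since the sender's buffer is directly addressable, the receiver can gather the strided blocks itself instead of pack, send, unpack. The parameters mirror an MPI_Type_vector(count, blocklen, stride, MPI_DOUBLE) layout.

/* Sketch only: one-copy gather of a strided (vector) datatype from the
 * sender's buffer into the receiver's contiguous buffer. */
#include <stddef.h>
#include <string.h>

static void recv_vector_1copy(const double *src,   /* sender's buffer */
                              double *dst,         /* receiver's buffer */
                              size_t count, size_t blocklen, size_t stride)
{
    for (size_t i = 0; i < count; i++)
        memcpy(dst + i * blocklen,
               src + i * stride,
               blocklen * sizeof(double));
}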
Arm HPC Workshop@Akihabara 2017
Showcase 3 : MPI_Win_allocate_shared (1/2)
17
MPI Implementation

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  ...
  MPI_Win_allocate_shared(size, 1,
      MPI_INFO_NULL, comm, &mem, &win);
  ...
  MPI_Win_shared_query(win, north, &sz,
      &dsp_unit, &northptr);
  MPI_Win_shared_query(win, south, &sz,
      &dsp_unit, &southptr);
  MPI_Win_shared_query(win, east, &sz,
      &dsp_unit, &eastptr);
  MPI_Win_shared_query(win, west, &sz,
      &dsp_unit, &westptr);
  ...
  MPI_Win_lock_all(0, win);
  for (int iter = 0; iter < niters; ++iter) {
    MPI_Win_sync(win);
    MPI_Barrier(shmcomm);
    /* stencil computation */
  }
  MPI_Win_unlock_all(win);
  ...
}

PiP Implementation

int main(int argc, char **argv) {
  pip_init( &pipid, &p, NULL, 0 );
  ...
  mem = malloc( size );
  ...
  /* get the neighbors' instances of the privatized variable "mem" */
  pip_get_addr( north, mem, &northptr );
  pip_get_addr( south, mem, &southptr );
  pip_get_addr( east, mem, &eastptr );
  pip_get_addr( west, mem, &westptr );
  ...
  for (int iter = 0; iter < niters; ++iter) {
    pip_barrier( p );
    ...
    /* stencil computation */
  }
  ...
  pip_fin();
}
Arm HPC Workshop@Akihabara 2017
Showcase 3 : MPI_Win_allocate_shared (2/2)
18
[Figure: 5-point stencil (4K x 4K) on KNL; left panel: total number of page faults vs. # tasks (1–1,000); right panel: total page-table size [KiB] and page-table size as a percentage of the array size (MPI) vs. # tasks; series: PiP and MPI (lower is better)]
Arm HPC Workshop@Akihabara 2017
Showcase 4 : In Situ
19
[Diagram: original SHMEM-based in situ — the LAMMPS process gathers data chunks and copies them into a pre-allocated shared buffer (copy-in), and the in situ process copies them out (copy-out) for analysis and dump; PiP-based in situ — the in situ process gathers the data chunks directly from the LAMMPS process and copies them out once]
[Figure: LAMMPS in situ, POSIX shmem vs. PiP on Xeon/Linux; slowdown ratio (relative to runs without in situ, lower is better) for the 3d Lennard-Jones melt at problem sizes (4,4,4) through (12,12,12)]
• LAMMPS process ran with four OpenMP threads
• In situ process ran with a single thread
• O(N²) computation cost dominates the data transfer cost at (12,12,12)
Arm HPC Workshop@Akihabara 2017
Showcase 5 : SNAP
20
PiP vs. threads in hybrid MPI+X SNAP, strong scaling on OFP (1–16 nodes, flat mode); comparing (MPI + OpenMP) with (MPI + PiP). Solve time in seconds, lower is better.

Number of Cores   MPICH/Threads   MPICH/PiP   Speedup (PiP vs. Threads)
      16              683.3          430.5          1.6
      32              379.1          221.2          1.7
      64              207.9          123.0          1.7
     128              153.0           68.3          2.2
     256              106.4           42.0          2.5
     512               91.6           27.7          3.3
    1024               83.3           22.0          3.8
Arm HPC Workshop@Akihabara 2017
Showcase 5 : Using PiP as the “X” in Hybrid MPI + “X” (2)
21
• PiP-based parallelism
– Easy application data sharing across cores
– No multithreading safety overhead
– Naturally utilizing multiple network ports
[Diagram: network ports, MPI stack, APP data]
[Figure: multipair message rate (osu_mbw_mr), in K messages/s, vs. message size (1 B – 4 MB) for 1, 4, 16, and 64 pairs, between PiP tasks and between threads, measured between two OFP nodes (Xeon Phi + Linux, flat mode)]
Arm HPC Workshop@Akihabara 2017
Research Collaboration
• ANL (Dr. Pavan and Dr. Min) — DOE-MEXT
• MPICH
• UT/ICL (Prof. Bosilca)
• Open MPI
• CEA (Dr. Pérache) — CEA-RIKEN
• MPC
• UIUC (Prof. Kale) — JLESC
• AMPI
• Intel (Dr. Dayal)
• In Situ
22
Arm HPC Workshop@Akihabara 2017
Summary
• Process in Process (PiP)
• New implementation of the 3rd execution
model
• better than memory-mapping techniques
• PiP is portable and practical because of its
user-level implementation
• can run on the K and OFP
supercomputers
• Showcases demonstrate that PiP can improve
performance
23
Arm HPC Workshop@Akihabara 2017
Final words
• The Glibc issues will be reported to Red Hat
• We are seeking PiP applications not only in HPC
but also in enterprise computing
24