
New Process/Thread Runtime


Atsushi Hori
Researcher, System Software Development Team, RIKEN

A new portable and practical parallel execution model, Process in Process (PiP for short), will be presented. PiP tasks share the same virtual address space, as in the multi-thread model, while keeping privatized variables, as in the multi-process model. Because of this, PiP provides the best of both worlds: multi-process (MPI) and multi-thread (OpenMP).



  1. 1. New Process/Thread Runtime. Process in Process: Techniques for Practical Address-Space Sharing. Atsushi Hori (RIKEN), Dec. 13, 2017
  2. 2. Arm HPC Workshop@Akihabara 2017 Background • The rise of many-core architectures • Current parallel execution models were designed for multi-core architectures • Should we have a new parallel execution model?
  3. 3. Arm HPC Workshop@Akihabara 2017 What should be shared and what should not • Isolated address spaces: slow communication • Shared variables: contention on shared variables • Table (Variables vs. Address Space): Privatized + Isolated = Multi-Process (MPI); Shared + Shared = Multi-Thread (OpenMP); the remaining cells are left open (??)
  4. 4. Arm HPC Workshop@Akihabara 2017 What should be shared and what should not • Isolated address spaces: slow communication • Shared variables: contention on shared variables • Table (Variables vs. Address Space): Privatized + Isolated = Multi-Process (MPI); Shared + Shared = Multi-Thread (OpenMP); Privatized + Shared = the 3rd Exec. Model
  5. 5. Arm HPC Workshop@Akihabara 2017 Implementation of the 3rd Execution Model • MPC (by CEA) • Multi-thread approach • The compiler converts all variables to thread-local • a.out and b.out cannot run simultaneously • PVAS (by RIKEN) • Multi-process approach • Patched Linux kernel • The OS kernel allows processes to share an address space • MPC, PVAS, and SMARTMAP are not portable
  6. 6. Arm HPC Workshop@Akihabara 2017 Why portability matters • On large supercomputers (e.g., the K computer), users are not allowed to install a modified OS kernel or kernel modules • When I tried to port PVAS onto McKernel, the core developer refused the modification • "DO NOT CONTAMINATE MY CODE !!"
  7. 7. Arm HPC Workshop@Akihabara 2017 PiP is very PORTABLE • CPU / OS: Xeon and Xeon Phi (x86_64) / Linux; x86_64 / McKernel; the K and FX10 (SPARC64) / XTCOS; ARM (Opteron A1170, AArch64) / Linux • [Figure: task spawning time (seconds) vs. number of tasks (1 to 200) on Xeon, KNL, AArch64, and the K, comparing PiP:preload, PiP:thread, Fork&Exec, Vfork&Exec, PosixSpawn, and Pthread]
  8. 8. Arm HPC Workshop@Akihabara 2017 Portability • PiP can run on machines where pthread_create() (or the clone() system call), PIE, and dlmopen() are supported • PiP does not run on: BG/Q (PIE is not supported), Windows (PIE is not fully supported), Mac OS X (dlmopen() is not supported) • FACT: all machines listed in the Top500 (Nov. 2017) run a Linux-family OS !!
  9. 9. Arm HPC Workshop@Akihabara 2017 Process in Process (PiP) • User-level implementation of the 3rd execution model • Portable and practical • [Figure: /proc/<pid>/maps output of a PiP run, showing several instances of the test program (/PIP/test/basic), of libpip.so, and of glibc (/lib64/libc.so) mapped at different addresses inside one shared address space, together with the heap, per-task stacks, ld.so, vdso, and vsyscall regions; Task-0 through Task-(n-1) run a.out, each with its own private int x, and Task-n through Task-(m-1) run b.out, each with its own private int a]
  10. 10. Arm HPC Workshop@Akihabara 2017 Why address-space sharing is better • Memory-mapping techniques in the multi-process model (POSIX shmem (SYS-V, mmap, ...), XPMEM) must keep the page tables of all processes coherent: overhead from system calls, page faults, and page-table size • Under PiP the same page table is shared by all tasks: no page-table coherency overhead, memory for page tables is saved, and pointers can be used as they are
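     The mapping cost referred to above can be seen in the plain POSIX shmem sequence below, whose per-call overheads are quoted on the next slide. This is a minimal illustrative sketch only (the object name /pip_demo_buf and the buffer size are made up); it is not PiP or MPI code.

       /* Producer side of a POSIX shmem exchange: every process has to
        * repeat shm_open()/mmap(), and every mapping adds page-table
        * entries the kernel must keep coherent across processes.        */
       #include <fcntl.h>
       #include <stdio.h>
       #include <string.h>
       #include <sys/mman.h>
       #include <sys/stat.h>
       #include <unistd.h>

       #define BUF_SIZE (1 << 20)   /* 1 MiB shared buffer (arbitrary) */

       int main(void)
       {
           int fd = shm_open("/pip_demo_buf", O_CREAT | O_RDWR, 0600);
           if (fd < 0) { perror("shm_open"); return 1; }
           if (ftruncate(fd, BUF_SIZE) != 0) { perror("ftruncate"); return 1; }
           void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
           if (buf == MAP_FAILED) { perror("mmap"); return 1; }
           close(fd);

           strcpy(buf, "halo data");   /* first touch triggers page faults */

           /* A consumer process would repeat shm_open() + mmap() and get
            * its own page-table entries; under PiP both tasks live in one
            * address space, so passing the pointer itself is enough.      */
           munmap(buf, BUF_SIZE);
           shm_unlink("/pip_demo_buf");
           return 0;
       }

     Compile with something like: cc shmem_demo.c -o shmem_demo -lrt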
  11. 11. Arm HPC Workshop@Akihabara 2017 Memory Mapping vs. PiP (excerpt from the PPoPP 2018 paper)
      Overhead of XPMEM and POSIX shmem functions (Wallaby/Linux), in cycles:
        XPMEM: xpmem_make() 1,585 / xpmem_get() 15,294 / xpmem_attach() 2,414 / xpmem_detach() 19,183 / xpmem_release() 693
        POSIX shmem, sender: shm_open() 22,294 / ftruncate() 4,080 / mmap() 5,553 / close() 6,017
        POSIX shmem, receiver: shm_open() 13,522 / mmap() 16,232 / close() 16,746
      Page-fault overhead: element accesses were strided by 64 bytes so that each cache block is touched only once, eliminating cache-block effects; in the XPMEM case the mmap()ed region was attached with the XPMEM functions.
      [Figure: time series of access times on Xeon/Linux; POSIX shmem and XPMEM show page-fault spikes every 4 KiB (4 KiB pages) or 2 MiB (2 MiB pages), while PiP:process and PiP:thread stay flat]
      PiP takes less than 100 clocks !!
  12. 12. Arm HPC Workshop@Akihabara 2017 Process in Process (PiP) • dlmopen (not a typo of dlopen) loads a program into a new name space, so the same variable "foo" can have multiple instances at different addresses • Position Independent Executable (PIE) programs can be loaded at any location • Combining the two, i.e. loading a PIE program with dlmopen, lets us privatize variables within the same address space
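     A minimal sketch of the dlmopen + PIE combination described above, under these assumptions: ./foo.pie is a PIE built with something like gcc -fpie -pie -rdynamic -o foo.pie foo.c so that its symbols (including main) are visible to dlsym(), and the glibc in use still allows a PIE to be loaded this way (newer glibc versions may refuse). The real PiP library handles many more details; this only illustrates the idea of variable privatization through name spaces.

       #define _GNU_SOURCE
       #include <dlfcn.h>
       #include <stdio.h>

       typedef int (*main_func_t)(int, char **);

       int main(void)
       {
           /* Load the same PIE twice, each time into a fresh link-map
            * name space (LM_ID_NEWLM), so each copy gets its own private
            * instances of the program's global variables.                */
           for (int i = 0; i < 2; i++) {
               void *handle = dlmopen(LM_ID_NEWLM, "./foo.pie", RTLD_NOW);
               if (handle == NULL) {
                   fprintf(stderr, "dlmopen: %s\n", dlerror());
                   return 1;
               }
               main_func_t entry = (main_func_t)dlsym(handle, "main");
               if (entry == NULL) {
                   fprintf(stderr, "dlsym: %s\n", dlerror());
                   return 1;
               }
               char *args[] = { "foo.pie", NULL };
               entry(1, args);   /* both copies run in one address space */
           }
           return 0;
       }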
  13. 13. Arm HPC Workshop@Akihabara 2017 Glibc Issue • In the current Glibc, dlmopen() can create at most 16 name spaces • Each PiP task requires one name space for its privatized variables • A many-core processor can run more than 16 PiP tasks, up to the number of CPU cores • A Glibc patch is also provided to allow more name spaces in case 16 is not enough (it changes the size of the name-space table); currently up to 260 PiP tasks can be created • Some workaround code can be found in the PiP library
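     The 16-name-space ceiling is easy to observe with a loop like the sketch below, which keeps loading an innocuous shared library (libm.so.6 here) into a new name space until dlmopen() fails; the exact count reached depends on the glibc build.

       #define _GNU_SOURCE
       #include <dlfcn.h>
       #include <stdio.h>

       int main(void)
       {
           int n = 0;
           for (;;) {
               /* Each LM_ID_NEWLM load consumes one link-map name space. */
               void *h = dlmopen(LM_ID_NEWLM, "libm.so.6", RTLD_NOW);
               if (h == NULL) {
                   printf("dlmopen failed after %d name spaces: %s\n",
                          n, dlerror());
                   break;
               }
               n++;
           }
           return 0;
       }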
  14. 14. Arm HPC Workshop@Akihabara 2017 PiP Showcases
  15. 15. Arm HPC Workshop@Akihabara 2017 Showcase 1 : MPI pt2pt • Current Eager/Rndv. protocols: 2 copies • PiP Rndv.: 1 copy • [Figure: bandwidth (MB/s) vs. message size (Xeon/Linux) for eager-2copy, rndv-2copy, and PiP (rndv-1copy); PiP is 3.5x faster at 128 KB]
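     The saving comes from the receiver copying straight out of the sender's user buffer, which is possible only because both tasks see the same address space. The snippet below is a conceptual sketch of that one-copy rendezvous; the rndv_hdr structure and function names are invented for illustration, and the actual MPICH-over-PiP code path differs.

       #include <stddef.h>
       #include <string.h>

       /* Hypothetical rendezvous control message: under PiP the sender's
        * buffer address is directly usable by the receiver.              */
       struct rndv_hdr {
           const void *src_addr;   /* sender's user buffer */
           size_t      len;        /* message length in bytes */
       };

       /* Receiver side: one memcpy from the sender's buffer into the
        * user's receive buffer, instead of copy-in to a shared segment
        * followed by copy-out (the 2-copy eager/rendezvous path).        */
       static void rndv_recv_1copy(void *dst, const struct rndv_hdr *hdr)
       {
           memcpy(dst, hdr->src_addr, hdr->len);
       }

       int main(void)
       {
           char src[] = "halo row owned by the sender task";
           char dst[sizeof src];
           struct rndv_hdr hdr = { src, sizeof src };
           rndv_recv_1copy(dst, &hdr);
           return 0;
       }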
  16. 16. Arm HPC Workshop@Akihabara 2017 Showcase 2 : MPI DDT • Derived Data Type (DDT) communication, i.e. non-contiguous data transfer • Current: pack - send - unpack (3 copies) • PiP: non-contiguous send (1 copy) • [Figure: normalized time vs. count of double elements in the X, Y, Z dimensions (from 64K,16,128 through 64,16K,128), comparing eager-2copy (base), rndv-2copy, and PiP, on Xeon/Linux]
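     For reference, a non-contiguous layout like the one in this benchmark is usually described with an MPI derived datatype, as in the sketch below (the count, block length, and stride are placeholders, not the benchmark's actual configuration). Sending such a type normally goes through pack - send - unpack, whereas with PiP the target task can read the strided elements in place.

       #include <mpi.h>

       int main(int argc, char **argv)
       {
           MPI_Init(&argc, &argv);

           /* A strided "plane" of doubles: 128 blocks of 128 elements,
            * one block every 1024 elements (placeholder numbers).        */
           MPI_Datatype plane;
           MPI_Type_vector(128, 128, 1024, MPI_DOUBLE, &plane);
           MPI_Type_commit(&plane);

           /* ... MPI_Send(buf, 1, plane, dest, tag, MPI_COMM_WORLD); ... */

           MPI_Type_free(&plane);
           MPI_Finalize();
           return 0;
       }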
  17. 17. Arm HPC Workshop@Akihabara 2017 Showcase 3 : MPI_Win_allocate_shared (1/2)
      MPI implementation:
        int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          ...
          MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, comm, &mem, &win);
          ...
          MPI_Win_shared_query(win, north, &sz, &dsp_unit, &northptr);
          MPI_Win_shared_query(win, south, &sz, &dsp_unit, &southptr);
          MPI_Win_shared_query(win, east, &sz, &dsp_unit, &eastptr);
          MPI_Win_shared_query(win, west, &sz, &dsp_unit, &westptr);
          ...
          MPI_Win_lock_all(0, win);
          for (int iter = 0; iter < niters; ++iter) {
            MPI_Win_sync(win);
            MPI_Barrier(shmcomm);
            /* stencil computation */
          }
          MPI_Win_unlock_all(win);
          ...
        }
      PiP implementation:
        int main(int argc, char **argv) {
          pip_init(&pipid, &p, NULL, 0);
          ...
          mem = malloc(size);
          ...
          pip_get_addr(north, mem, &northptr);
          pip_get_addr(south, mem, &southptr);
          pip_get_addr(east, mem, &eastptr);
          pip_get_addr(west, mem, &westptr);
          ...
          for (int iter = 0; iter < niters; ++iter) {
            pip_barrier(p);
            ...
            /* stencil computation */
          }
          ...
          pip_fin();
        }
  18. 18. Arm HPC Workshop@Akihabara 2017 Showcase 3 : MPI_Win_allocate_shared (2/2) • 5-point stencil (4K x 4K) on KNL • [Figures: number of total page faults and total page-table size (KiB) vs. number of tasks (1 to 1,000), PiP vs. MPI, plus the page-table size as a percentage of the array size for MPI]
  19. 19. Arm HPC Workshop@Akihabara 2017 Showcase 4 : In Situ • Original SHMEM-based in situ: the LAMMPS process gathers data chunks and copies them into a pre-allocated shared buffer (copy-in), and the in situ process copies them out (copy-out) before analysis and dump • PiP-based in situ: the in situ process gathers the data chunks directly from the LAMMPS process (copy-out only) • [Figure: slowdown ratio relative to running without in situ, LAMMPS 3d Lennard-Jones melt with problem sizes (4,4,4) through (12,12,12), POSIX shmem vs. PiP, on Xeon/Linux; the LAMMPS process ran with four OpenMP threads and the in situ process with a single thread; the O(N^2) computation cost outweighs the data-transfer cost at (12,12,12)]
  20. 20. Arm HPC Workshop@Akihabara 2017 Showcase 5 : SNAP • PiP vs. threads in hybrid MPI + X: SNAP strong scaling on OFP, 1-16 nodes, flat mode, (MPI + OpenMP) vs. (MPI + PiP)
      Cores | MPICH/Threads solve time (s) | MPICH/PiP solve time (s) | Speedup (PiP vs. Threads)
        16  | 683.3 | 430.5 | 1.6
        32  | 379.1 | 221.2 | 1.7
        64  | 207.9 | 123.0 | 1.7
       128  | 153.0 |  68.3 | 2.2
       256  | 106.4 |  42.0 | 2.5
       512  |  91.6 |  27.7 | 3.3
      1024  |  83.3 |  22.0 | 3.8
  21. 21. Arm HPC Workshop@Akihabara 2017 Showcase 5 : Using PiP in Hybrid MPI + "X" as the "X" (2) • PiP-based parallelism: easy application data sharing across cores, no multithreading-safety overhead, and natural use of multiple network ports • [Figure: multipair message rate (osu_mbw_mr, K messages/s) vs. message size between two OFP nodes (Xeon Phi + Linux, flat mode), for 1, 4, 16, and 64 pairs, comparing PiP tasks with threads]
  22. 22. Arm HPC Workshop@Akihabara 2017 Research Collaboration • ANL (Dr. Pavan and Dr. Min), DOE-MEXT: MPICH • UT/ICL (Prof. Bosilca): Open MPI • CEA (Dr. Pérache), CEA-RIKEN: MPC • UIUC (Prof. Kale), JLESC: AMPI • Intel (Dr. Dayal): In Situ
  23. 23. Arm HPC Workshop@Akihabara 2017 Summary • Process in Process (PiP) is a new implementation of the 3rd execution model and performs better than memory-mapping techniques • PiP is portable and practical because of its user-level implementation; it can run on the K and OFP supercomputers • The showcases demonstrate that PiP can improve performance
  24. 24. Arm HPC Workshop@Akihabara 2017 Final words • The Glibc issues will be reported to Red Hat • We are seeking PiP applications, not only in HPC but also in enterprise
