
4th Systems Paper Survey Seminar

SC '17 report and an introduction to PapyrusKV


  1. Attendance report of SC ’17 (Ryo Matsumiya)
  2. Self-introduction • Ryo Matsumiya • Twitter: @mattn_ • https://sites.google.com/site/ryomatsumiya0101/ • Ph.D. student (D2) • Oyama lab. (UEC, B4-M2) • Endo lab. (Titech, D1-) • Major topic: distributed and parallel processing and its software architecture, with a focus on the memory (storage) hierarchy • Keywords: Memory Hierarchy, Memory-centric Computing, Data-intensive Computing, Big Data, Task Parallelism, Programming Systems, System Software, GPGPU, Storage Systems
  3. About SC (1/3) • The ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis • Do not confuse it with similar conferences: the International Conference on Supercomputing (ICS) and the International Supercomputing Conference (ISC) • The top conference in the field of HPC • About 13,000 attendees, including 3,500 international (non-US) attendees at SC ’17
  4. About SC (2/3) • Technical sessions • Doctoral forum • Poster sessions • Tutorial sessions • Panel sessions • Invited talks + keynote talks • Workshops • 38 official workshops • BoF sessions • The TOP500 list is announced • Exhibition • 250+ organizations
  5. About SC (3/3) • SC ’17 was held at the Colorado Convention Center in Denver • SC ’15: Austin, SC ’16: Salt Lake City • SC ’18: Dallas, SC ’19: Denver • Acceptance rate: 61/327 ≈ 19% • Best paper: Extreme Scale Multi-Physics Simulations of the Tsunamigenic 2004 Sumatra Megathrust Earthquake • Technical University of Munich + Ludwig-Maximilians-Universität München • Best poster: AI with Super-Computed Data for Monte Carlo Earthquake Hazard Classification • RIKEN + UT • Gordon Bell Prize: 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios
  6. PapyrusKV: A High-Performance Parallel Key-Value Store for Distributed NVM Architectures • A distributed KVS developed by ORNL • No system-level daemons or servers • A C++ library built on Papyrus • Design and Implementation of Papyrus: Parallel Aggregate Persistent Storage (IPDPS ’17) • Open source: https://code.ornl.gov/eck/papyrus • Designed around the memory hierarchy • Private SSDs + private DRAMs: TSUBAME (Titech), Stampede (TACC) • Shared SSDs (burst buffers) + private DRAMs: Oakforest-PACS (JCAHPC), Cori (LBNL)
  7. API functions
  8. Put operation overview (a hedged usage sketch follows below)
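Slides 7 and 8 appear to be table/figure slides in the original deck. To make the usage concrete, below is a minimal C++ sketch of an open/put/get sequence; the function signatures, the option argument, the flag values, and the repository path are assumptions based on the paper and the repository linked above, not something shown on the slides.

```cpp
// Hedged sketch of a PapyrusKV open/put/get sequence in an SPMD (MPI-style)
// program. Function names and signatures are assumptions; consult the
// Papyrus repository (https://code.ornl.gov/eck/papyrus) for the real API.
#include <cstddef>
#include <cstdio>
#include <string>

// Assumed declarations, normally provided by the PapyrusKV header.
extern "C" {
int papyruskv_init(int* argc, char*** argv, const char* repository);
int papyruskv_open(const char* name, int flags, void* opt, int* db);
int papyruskv_put(int db, const char* key, std::size_t keylen,
                  const char* val, std::size_t vallen);
int papyruskv_get(int db, const char* key, std::size_t keylen,
                  char** val, std::size_t* vallen);
int papyruskv_barrier(int db, int level);
int papyruskv_close(int db);
int papyruskv_finalize();
}

int main(int argc, char** argv) {
    int db = 0;
    papyruskv_init(&argc, &argv, "/nvm/papyruskv");        // repository path is illustrative
    papyruskv_open("mydb", /*flags=*/0, /*opt=*/nullptr, &db);

    // Put: the key is hashed to an owner rank (slide 10); the pair first
    // lands in the local memtable in DRAM and is flushed to the owner's
    // SSTable on NVM later (slides 9-11).
    std::string key = "particle:42", val = "x=1.0,y=2.0,z=3.0";
    papyruskv_put(db, key.c_str(), key.size() + 1, val.c_str(), val.size() + 1);

    papyruskv_barrier(db, /*level=*/0);                    // make puts globally visible

    // Get: served from a memtable cache when possible, otherwise fetched
    // from the owner process's SSTable.
    char* out = nullptr;
    std::size_t outlen = 0;
    papyruskv_get(db, key.c_str(), key.size() + 1, &out, &outlen);
    std::printf("%s -> %s\n", key.c_str(), out ? out : "(not found)");

    papyruskv_close(db);
    papyruskv_finalize();
    return 0;
}
```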
  9. Structure overview • Each process has four memtables and an SSTable • Memtables • Used as caches • Local memtable, remote memtable, local immutable memtable, remote immutable memtable • Stored in DRAM • SSTable • Sorted String Table • Stored in NVRAM
  10. Data placement • DBs are divided into files • Each process has its own file • In local-SSD architectures, the file is stored on the SSD of its process • In shared-SSD architectures, all files are stored in the burst buffer(s) • Each KV-pair is assigned to a process • The owner process is determined by hash(key) % (number of processes), as sketched below
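A minimal sketch of that owner-selection rule; the concrete hash function is an assumption (the slide only specifies hash(key) modulo the number of processes), so std::hash stands in for whatever PapyrusKV actually uses.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Owner rank for a key, mirroring slide 10: hash(key) % number of processes.
// std::hash is used only for illustration; PapyrusKV's real hash function
// may differ (and may be user-replaceable).
int owner_rank(const std::string& key, int nprocs) {
    std::size_t h = std::hash<std::string>{}(key);
    return static_cast<int>(h % static_cast<std::size_t>(nprocs));
}
```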
  11. Local cache policy • LRU + FIFO • A new entry is first pushed into the LRU queue, which backs the mutable memtable(s) • When an element is evicted from the LRU queue, it is pushed into the FIFO queue, which backs the immutable memtable(s) • Elements evicted from the FIFO queue are written back to the SSDs (sketched below) • (Figure: LRU and FIFO queues over the mutable and immutable memtables in DRAM, backed by the SSTable on SSD)
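A compact, self-contained illustration of that two-stage policy (not PapyrusKV's actual code): entries enter an LRU stage, LRU victims move to a FIFO stage, and FIFO victims are written back, which here is just a print statement standing in for the SSTable flush.

```cpp
#include <cstddef>
#include <cstdio>
#include <deque>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Toy LRU+FIFO cache mirroring slide 11: new entries go to the LRU stage
// (mutable memtable); LRU victims move to the FIFO stage (immutable
// memtable); FIFO victims are "written back" to the SSTable. Only put()
// is shown; a get() would refresh the LRU position in the same way.
class TwoStageCache {
public:
    TwoStageCache(std::size_t lru_cap, std::size_t fifo_cap)
        : lru_cap_(lru_cap), fifo_cap_(fifo_cap) {}

    void put(const std::string& key, const std::string& val) {
        // Insert (or refresh) the entry in the LRU stage.
        if (auto it = lru_pos_.find(key); it != lru_pos_.end())
            lru_.erase(it->second);
        lru_.push_front({key, val});
        lru_pos_[key] = lru_.begin();

        if (lru_.size() > lru_cap_) {            // evict LRU victim -> FIFO stage
            auto victim = lru_.back();
            lru_pos_.erase(victim.first);
            lru_.pop_back();
            fifo_.push_back(victim);
            if (fifo_.size() > fifo_cap_) {      // evict FIFO victim -> write-back
                write_back(fifo_.front());
                fifo_.pop_front();
            }
        }
    }

private:
    using KV = std::pair<std::string, std::string>;
    void write_back(const KV& kv) {              // stand-in for the SSTable flush
        std::printf("write back %s to SSTable\n", kv.first.c_str());
    }

    std::size_t lru_cap_, fifo_cap_;
    std::list<KV> lru_;                          // front = most recently used
    std::unordered_map<std::string, std::list<KV>::iterator> lru_pos_;
    std::deque<KV> fifo_;                        // insertion-ordered immutable stage
};
```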
  12. Data structure of tables • LSM-tree • Used by HBase, LevelDB, etc. • In PapyrusKV, the trees of the memtables are red-black trees • The trees on the SSDs are binary trees (a schematic sketch follows below) • O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree), Acta Informatica, Vol. 33, pp. 351-385
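To make the memtable/SSTable split concrete, here is a schematic sketch under the assumption that a red-black tree keeps the memtable sorted (std::map is a red-black tree in common standard-library implementations); the file format is purely illustrative and not PapyrusKV's.

```cpp
#include <fstream>
#include <map>
#include <string>

// Schematic memtable flush: std::map (typically a red-black tree) plays the
// role of the in-DRAM memtable, and the flush writes the pairs in key order,
// which is what makes the on-SSD file a Sorted String Table.
void flush_memtable(const std::map<std::string, std::string>& memtable,
                    const std::string& sstable_path) {
    std::ofstream out(sstable_path);
    for (const auto& [key, value] : memtable)   // iteration is in sorted key order
        out << key << '\t' << value << '\n';
}
```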
  13. Remote cache policy • Can be changed with papyruskv_consistency() • Two consistency modes: sequential consistency and relaxed consistency • papyruskv_protect() under relaxed consistency can make remote caches usable • With PAPYRUSKV_RDONLY, remote read caches are available • With PAPYRUSKV_WRONLY, asynchronous write-back is available • Consistency can be guaranteed by calling papyruskv_barrier() (a hedged usage sketch follows below)
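A hedged sketch of how these controls might be combined in a read-mostly phase; only the names papyruskv_consistency, papyruskv_protect, papyruskv_barrier, PAPYRUSKV_RDONLY, and PAPYRUSKV_WRONLY come from the slides, while the signatures and constant values below are assumptions.

```cpp
// Hedged sketch of the relaxed-consistency + protection pattern from
// slide 13. Signatures and the numeric values of the PAPYRUSKV_* constants
// are assumptions; the real header defines the actual API.
extern "C" {
int papyruskv_consistency(int db, int consistency);
int papyruskv_protect(int db, int prot);
int papyruskv_barrier(int db, int level);
}

// Illustrative constants only.
enum {
    PAPYRUSKV_RELAXED = 1,   // assumed name for the relaxed-consistency mode
    PAPYRUSKV_RDONLY  = 2,
    PAPYRUSKV_WRONLY  = 4
};

// During a read-mostly phase, switch to relaxed consistency and mark the
// database read-only so remote read caches may be used; the barrier at the
// end restores a consistent view before the next update phase.
void read_mostly_phase(int db) {
    papyruskv_consistency(db, PAPYRUSKV_RELAXED);
    papyruskv_protect(db, PAPYRUSKV_RDONLY);
    // ... many gets served from remote caches ...
    papyruskv_barrier(db, /*level=*/0);
}
```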
  14. Storage group (1/2) • An extra memory copy occurs when a process gets a KV-pair owned by another process on the same node • (Figure: the KV-pair of process A travels between the DRAM of process A and the DRAM of process B)
  15. Storage group (2/2) • Solution: copy directly when running under relaxed consistency • (Figure: the same two-process layout as the previous slide, with process B accessing the KV-pair of process A directly)
  16. Checkpoint/Restart (a hedged sketch follows below)
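Slide 16 is a figure slide in the original deck. The paper also exposes checkpoint/restart through the library API; the sketch below is a rough guess at that usage, and the function names and signatures (papyruskv_checkpoint, papyruskv_restart, papyruskv_wait) do not appear on the slides and should be treated as assumptions.

```cpp
// Hedged sketch of checkpoint/restart usage. The declarations below are
// assumed, not confirmed by the slides; the paths are illustrative.
extern "C" {
int papyruskv_checkpoint(int db, const char* path, int* event);
int papyruskv_restart(const char* path, const char* name, int flags,
                      void* opt, int* db, int* event);
int papyruskv_wait(int db, int event);
}

void checkpoint_then_restart(int db) {
    // Snapshot the distributed database to the parallel file system.
    int ckpt_event = 0;
    papyruskv_checkpoint(db, "/pfs/ckpt/mydb", &ckpt_event);
    papyruskv_wait(db, ckpt_event);            // block until the snapshot is durable

    // Later (e.g., after a failure), rebuild a database from the snapshot.
    int restored_db = 0, restart_event = 0;
    papyruskv_restart("/pfs/ckpt/mydb", "mydb", /*flags=*/0,
                      /*opt=*/nullptr, &restored_db, &restart_event);
    papyruskv_wait(restored_db, restart_event);
}
```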
  17. Performance evaluation • Single-node performance compared with Lustre • Put, get, barrier • Multi-node performance • Relaxed (+ barrier) • Sequential (+ barrier) • Combined reads/updates • Checkpoint/restart performance • Comparison with MDHIM • A real HPC application
  18. Evaluation setup
  19. Barrier operation
  20. Get operation
  21. Multi-node put/get performance
  22. Combined read/update
  23. Checkpoint/Restart performance
  24. Comparison with MDHIM
  25. Real HPC application: de novo genome assembly • Evangelos Georganas, Scalable Parallel Algorithms for Genome Analysis, Ph.D. Thesis, UC Berkeley
  26. Application benchmarking • Compared with a Unified Parallel C (UPC) implementation (which does not use SSDs) • Dataset: the human chr14 dataset • Executed on Cori
  27. Summary • PapyrusKV is a KVS for HPC clusters • Implemented as a C++ library • PapyrusKV supports both private- and shared-SSD architectures • SSDs are used as persistent memory • DRAMs are used as caches • LSM-tree-based cache mechanism • Users can specify consistency policies
  28. Other notable papers at SC ’17 • Why Is MPI So Slow? Analyzing the Fundamental Limits in Implementing MPI-3.1 • 28 authors! (including three Japanese researchers) • Analyzes the overheads imposed by the MPI standard • Gravel: Fine-Grain GPU-Initiated Network Messages • UW-Madison + AMD Research • A network interface for GPU kernels • Related: GPUnet [OSDI ’14], GPUrdma [ROSS ’16] • Reduces GPU overheads • Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments • Barcelona Supercomputing Center + IBM
  29. Call for jobs • Hire me! • Interested in large-scale parallel and/or distributed software • System software as well as applications • Not only research: development and business roles are also welcome • I have the best record of (LOC × number of parallel nodes ÷ number of developers) among active Japanese system-software students... maybe :-D
