
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory (NVM)


Advanced Big Data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe)-based SSDs, these designs, along with the default Big Data processing models, need to be re-assessed to discover the possibilities of further enhancing performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.



  1. Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory (NVM)
     DataWorks Summit 2019 | Washington, DC
     by Xiaoyi Lu, The Ohio State University, luxi@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~luxi
     Dhabaleswar K. (DK) Panda, The Ohio State University, panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda
     Dipti Shankar, The Ohio State University, shankard@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~shankar.50
  2. Big Data Management and Processing on Modern Clusters
     • Substantial impact on designing and utilizing data management and processing systems in multiple tiers
       – Front-end data accessing and serving (online): Memcached + DB (e.g., MySQL), HBase
       – Back-end data analytics (offline): HDFS, MapReduce, Spark
  3. Big Data Processing with Apache Big Data Analytics Stacks
     • Major components included:
       – MapReduce (batch)
       – Spark (iterative and interactive)
       – HBase (query)
       – HDFS (storage)
       – RPC (inter-process communication)
     • The underlying Hadoop Distributed File System (HDFS) is used by MapReduce, Spark, HBase, and many others
     • The model scales, but the large amount of communication and I/O can be further optimized!
     [Diagram: Apache Big Data Analytics Stacks – User Applications over MapReduce, Spark, and HBase, on Hadoop Common (RPC) and HDFS]
  4. Drivers of Modern HPC Cluster and Data Center Architecture
     • Multi-core/many-core technologies
     • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
       – Single Root I/O Virtualization (SR-IOV)
     • NVM and NVMe-SSD
     • Accelerators (NVIDIA GPGPUs and FPGAs)
     [Figure: high-performance interconnects – InfiniBand (with SR-IOV), <1 usec latency, 200 Gbps bandwidth; multi-/many-core processors; accelerators/coprocessors with high compute density and high performance/watt, >1 TFlop DP on a chip; SSD, NVMe-SSD, NVRAM; cloud systems such as SDSC Comet and TACC Stampede]
  5. The High-Performance Big Data (HiBD) Project
     • RDMA for Apache Spark
     • RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
     • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
       – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
     • RDMA for Apache Kafka
     • RDMA for Apache HBase
     • RDMA for Memcached (RDMA-Memcached)
     • RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
     • OSU HiBD-Benchmarks (OHB)
       – HDFS, Memcached, HBase, and Spark micro-benchmarks
     • http://hibd.cse.ohio-state.edu
     • User base: 305 organizations from 35 countries
     • More than 29,750 downloads from the project site (April '19)
     • Available for InfiniBand and RoCE; also runs on Ethernet; available for x86 and OpenPOWER
     • Significant performance improvement with 'RDMA+DRAM' compared to default sockets-based designs; how about RDMA+NVRAM?
  6. Non-Volatile Memory (NVM) and NVMe-SSD
     • Non-Volatile Memory (NVM) provides byte-addressability with persistence
     • The huge explosion of data in diverse fields requires fast analysis and storage
     • NVMs provide the opportunity to build high-throughput storage systems for data-intensive applications
     • Storage technology is moving rapidly towards NVM
     [Images: 3D XPoint from Intel & Micron; Samsung NVMe SSD; performance of PMC Flashtec NVRAM [*]]
     [*] http://www.enterprisetech.com/2014/08/06/flashtec-nvram-15-million-iops-sub-microsecond-latency/
  7. NVRAM Emulation based on DRAM
     • Popular methods employed by recent works to emulate the NVRAM performance model over DRAM
     • Two ways:
       – Emulate byte-addressable NVRAM over DRAM: the application uses a persistent memory library over DRAM with load/store access (mmap with DAX, e.g., pmem_memcpy_persist) plus clflush + delay
       – Emulate a block-based NVM device over DRAM: the application goes through the virtual file system (open/read/write/close) to a block device, e.g., a PCMDisk (RAM-disk + delay)
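To make the byte-addressable emulation concrete, here is a minimal C sketch of the clflush-plus-delay approach, assuming an x86 CPU with the clflushopt instruction (compile with -mclflushopt); the per-line delay constant is an illustrative placeholder, not a calibrated NVRAM model.

```c
/* Minimal sketch: emulate slower NVRAM writes on DRAM by flushing each
 * cache line (clflushopt) with an artificial per-line delay, then fencing.
 * EMULATED_WRITE_DELAY_NS is an illustrative knob, not a measured value. */
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define CACHE_LINE 64
#define EMULATED_WRITE_DELAY_NS 300   /* hypothetical per-line slowdown */

static void delay_ns(long ns) {
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000000000L +
             (now.tv_nsec - start.tv_nsec) < ns);
}

/* Copy into a DRAM region that stands in for NVRAM, persisting line by line. */
void emulated_nvram_memcpy_persist(void *dst, const void *src, size_t len) {
    memcpy(dst, src, len);
    for (uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHE_LINE - 1);
         p < (uintptr_t)dst + len; p += CACHE_LINE) {
        _mm_clflushopt((void *)p);          /* flush the cache line */
        delay_ns(EMULATED_WRITE_DELAY_NS);  /* model the NVRAM write slowdown */
    }
    _mm_sfence();                           /* order flushes before returning */
}
```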
  8. Presentation Outline
     • NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
     • NRCIO for Big Data Analytics
     • NVMe-SSD based Big Data Analytics
     • Conclusion and Q&A
  9. Design Scope (NVM for RDMA)
     • D-to-D over RDMA: communication buffers for client and server are allocated in DRAM (the common case)
     • D-to-N over RDMA: communication buffers for the client are allocated in DRAM; the server uses NVM
     • N-to-D over RDMA: communication buffers for the client are allocated in NVM; the server uses DRAM
     • N-to-N over RDMA: communication buffers for client and server are allocated in NVM
     [Diagram: HDFS-RDMA client (RDMADFSClient) and server (RDMADFSServer), each with CPU, PCIe, and NIC, with communication buffers placed in DRAM or NVM according to the scheme]
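As a rough illustration of how an NVM-resident communication buffer can be exposed to RDMA (the N-to-D and N-to-N cases above), the following sketch maps a file on a DAX-mounted persistent-memory device with libpmem and registers it with the HCA through libibverbs. The path, buffer size, and pre-created protection domain are assumptions; this is not the NRCIO code.

```c
/* Sketch: expose a persistent-memory region to RDMA by mapping a file on a
 * DAX-mounted pmem device and registering it with the HCA. Assumes a
 * protection domain (pd) was created earlier via ibv_alloc_pd(); the path
 * and buffer size are placeholders. */
#include <infiniband/verbs.h>
#include <libpmem.h>
#include <stdio.h>

#define BUF_SIZE (4 * 1024 * 1024)

struct ibv_mr *register_nvm_buffer(struct ibv_pd *pd, const char *pmem_path)
{
    size_t mapped_len;
    int is_pmem;

    /* Map (and create if needed) a file on the persistent-memory device. */
    void *buf = pmem_map_file(pmem_path, BUF_SIZE, PMEM_FILE_CREATE,
                              0666, &mapped_len, &is_pmem);
    if (buf == NULL) {
        perror("pmem_map_file");
        return NULL;
    }

    /* Register the mapped region so the NIC can DMA directly into NVM. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, mapped_len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr == NULL) {
        perror("ibv_reg_mr");
        pmem_unmap(buf, mapped_len);
    }
    return mr;
}
```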
  10. NVRAM-aware RDMA-based Communication in NRCIO
      [Figures: NRCIO RDMA Write over NVRAM; NRCIO RDMA Read over NVRAM]
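The protocols themselves are shown in the figures; as a generic point of reference, the client side of an RDMA write into a remote NVRAM-registered buffer can be posted with standard verbs as sketched below. The remote address and rkey are assumed to have been exchanged out of band, and durability still requires the receiver to flush the written region; this is only a sketch, not the NRCIO protocol implementation.

```c
/* Sketch (not the NRCIO implementation): client side of an RDMA WRITE into
 * a remote NVRAM-registered buffer. After the WRITE completes, the server
 * must still flush the data to the persistence domain before acknowledging
 * durability. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a completion */
    wr.wr.rdma.remote_addr = remote_addr;         /* exchanged out of band */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```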
  11. DRAM-to-NVRAM RDMA-aware Communication with NRCIO
      • Comparison of communication latency using the NRCIO RDMA read and write communication protocols over an InfiniBand EDR HCA, with DRAM as the source and NVRAM as the destination
      • {NxDRAM} NVRAM emulation mode = Nx NVRAM write slowdown vs. DRAM, using clflushopt (emulated) + sfence
      • Smaller impact of time-for-persistence on end-to-end latencies for small messages vs. large messages, since larger messages require a larger number of cache lines to be flushed
      [Charts: NRCIO-RW and NRCIO-RR latency (us) for 256 B–16 KB messages and latency (ms) for 256 KB–4 MB messages under the 1xDRAM, 2xDRAM, and 5xDRAM emulation modes]
  12. NVRAM-to-NVRAM RDMA-aware Communication with NRCIO
      • Comparison of communication latency using the NRCIO RDMA read and write communication protocols over an InfiniBand EDR HCA vs. DRAM
      • {Ax, Bx} NVRAM emulation mode = Ax NVRAM read slowdown and Bx NVRAM write slowdown vs. DRAM
      • Higher end-to-end latencies due to slower writes to non-volatile persistent memory
        – E.g., 3.9x for {1x, 2x} and 8x for {2x, 5x}
      [Charts: NRCIO-RW and NRCIO-RR latency (us) for 64 B–16 KB messages and latency (ms) for 256 KB–4 MB messages under No Persist (D2D), {1x,2x}, and {2x,5x} emulation modes]
  13. Presentation Outline
      • NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
      • NRCIO for Big Data Analytics
      • NVMe-SSD based Big Data Analytics
      • Conclusion and Q&A
  14. Opportunities of Using NVRAM+RDMA in HDFS
      • Files are divided into fixed-size blocks; blocks are divided into packets
      • NameNode: stores the file system namespace
      • DataNode: stores data blocks in local storage devices
      • Uses block replication for fault tolerance
        – Replication enhances data-locality and read throughput
      • Communication- and I/O-intensive; Java Sockets based communication
      • Data needs to be persistent, typically on SSD/HDD
      [Diagram: client interacting with the NameNode and DataNodes]
  15. Design Overview of NVM and RDMA-aware HDFS (NVFS)
      • Design features:
        – RDMA over NVM
        – HDFS I/O with NVM: block access and memory access
        – Hybrid design: NVM with SSD as hybrid storage for HDFS I/O
        – Co-design with Spark and HBase
        – Cost-effectiveness
        – Use-case applications and benchmarks
      [Architecture: Hadoop MapReduce, Spark, and HBase co-designed (cost-effectiveness, use-case) on top of the NVM- and RDMA-aware HDFS (NVFS) DataNode, which combines an RDMA sender/receiver/replicator and DFSClient with NVFS-BlkIO and NVFS-MemIO writers/readers over NVM and SSDs]
      N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, 24th International Conference on Supercomputing (ICS), June 2016
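A conceptual sketch of the hybrid NVM+SSD idea (not the NVFS source): write-ahead-log style blocks go to a byte-addressable pmem file via libpmem, while bulk data blocks take the conventional SSD path. The file paths and the is_wal flag are placeholders for whatever placement policy the file system applies.

```c
/* Conceptual sketch of hybrid placement: WAL/hot blocks to pmem, bulk
 * data blocks to the SSD. Paths and the is_wal flag are placeholders. */
#include <libpmem.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int write_block(const void *data, size_t len, int is_wal)
{
    if (is_wal) {
        size_t mapped_len;
        int is_pmem;
        void *dst = pmem_map_file("/mnt/pmem/wal.blk", len, PMEM_FILE_CREATE,
                                  0666, &mapped_len, &is_pmem);
        if (dst == NULL)
            return -1;
        if (is_pmem) {
            pmem_memcpy_persist(dst, data, len);   /* store + flush + fence */
        } else {
            memcpy(dst, data, len);
            pmem_msync(dst, len);                  /* fallback: msync path */
        }
        pmem_unmap(dst, mapped_len);
        return 0;
    }

    /* Bulk data goes through the regular block path on the SSD. */
    int fd = open("/mnt/ssd/data.blk", O_CREAT | O_WRONLY | O_TRUNC, 0666);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, data, len);
    fsync(fd);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}
```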
  16. Evaluation with Hadoop MapReduce
      • TestDFSIO on SDSC Comet (32 nodes: 80 GB, SATA-SSDs)
        – Write: NVFS-MemIO gains 4x over HDFS
        – Read: NVFS-MemIO gains 1.2x over HDFS
      • TestDFSIO on OSU Nowlab (4 nodes: 8 GB, NVMe-SSDs)
        – Write: NVFS-MemIO gains 4x over HDFS
        – Read: NVFS-MemIO gains 2x over HDFS
      [Charts: average TestDFSIO write and read throughput (MBps) for HDFS (56 Gbps), NVFS-BlkIO (56 Gbps), and NVFS-MemIO (56 Gbps) on both clusters]
  17. Evaluation with HBase
      • YCSB 100% insert on SDSC Comet (32 nodes)
        – NVFS-BlkIO gains 21% by storing only WALs in NVM
      • YCSB 50% read, 50% update on SDSC Comet (32 nodes)
        – NVFS-BlkIO gains 20% by storing only WALs in NVM
      [Charts: HBase throughput (ops/s) vs. cluster size : number of records (8:800K, 16:1600K, 32:3200K) for HDFS (56 Gbps) and NVFS (56 Gbps), for the 100% insert and 50% read / 50% update workloads]
  18. Opportunities to Use NVRAM+RDMA in MapReduce
      • Map and Reduce tasks carry out the total job execution
        – Map tasks read from HDFS, operate on the data, and write the intermediate data to local disk (persistent)
        – Reduce tasks get these data by shuffle from NodeManagers, operate on them, and write to HDFS (persistent)
      • Communication- and I/O-intensive: the shuffle phase uses HTTP over Java Sockets, and I/O operations typically take place on SSD/HDD
      [Diagram: disk operations and bulk data transfer between Map and Reduce tasks]
  19. Opportunities to Use NVRAM in the MapReduce-RDMA Design
      • In the RDMA-enhanced design, all operations are in-memory: Map tasks read input files, map, spill, and merge the intermediate data; Reduce tasks shuffle over RDMA, merge in memory, reduce, and write the output files
      • Opportunities exist to improve the performance further with NVRAM
  20. NVRAM-Assisted Map Spilling in MapReduce-RDMA
      • The Map spill output is placed in NVRAM instead of local disk, which minimizes the disk operations in the spill phase
      [Diagram: Map tasks (read, map, spill, merge) spilling to NVRAM; Reduce tasks (RDMA shuffle, in-memory merge, reduce) writing the output files]
      M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, Can Non-Volatile Memory Benefit MapReduce Applications on HPC Clusters?, PDSW-DISCS, held with SC 2016.
      M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, NVMD: Non-Volatile Memory Assisted Design for Accelerating MapReduce and DAG Execution Frameworks on HPC Systems, IEEE BigData 2017.
  21. Comparison with Sort and TeraSort
      • RMR-NVM achieves a 2.37x benefit for the Map phase compared to RMR and MR-IPoIB; overall benefit of 55% compared to MR-IPoIB and 28% compared to RMR
      • RMR-NVM achieves a 2.48x benefit for the Map phase compared to RMR and MR-IPoIB; overall benefit of 51% compared to MR-IPoIB and 31% compared to RMR
  22. Evaluation of Intel HiBench Workloads
      • We evaluate different HiBench workloads with Huge data sets on 8 nodes
      • Performance benefits for shuffle-intensive workloads compared to MR-IPoIB:
        – Sort: 42% (25 GB)
        – TeraSort: 39% (32 GB)
        – PageRank: 21% (5 million pages)
      • Other workloads:
        – WordCount: 18% (25 GB)
        – KMeans: 11% (100 million samples)
  23. Evaluation of PUMA Workloads
      • We evaluate different PUMA workloads on 8 nodes with a 30 GB data size
      • Performance benefits for shuffle-intensive workloads compared to MR-IPoIB:
        – AdjList: 39%
        – SelfJoin: 58%
        – RankedInvIndex: 39%
      • Other workloads:
        – SeqCount: 32%
        – InvIndex: 18%
  24. Presentation Outline
      • NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
      • NRCIO for Big Data Analytics
      • NVMe-SSD based Big Data Analytics
      • Conclusion and Q&A
  25. Overview of the NVMe Standard
      • NVMe is the standardized interface for PCIe SSDs
      • Built on 'RDMA' principles
        – Submission and completion I/O queues
        – Similar semantics to RDMA send/recv queues
        – Asynchronous command processing
      • Up to 64K I/O queues, with up to 64K commands per queue
      • Efficient small random I/O operations
      • MSI/MSI-X and interrupt aggregation
      [Figure: NVMe command processing; source: NVMExpress.org]
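A condensed SPDK sketch of this queue-pair model is shown below: an I/O is submitted asynchronously and its completion is harvested by polling the paired completion queue. It assumes SPDK has already been initialized and a controller and namespace attached via spdk_nvme_probe(); error handling is omitted for brevity.

```c
/* Condensed SPDK sketch of NVMe's asynchronous queue-pair model: submit a
 * write to an I/O submission queue, then poll the paired completion queue.
 * Assumes spdk_env_init() has run and a controller/namespace were attached
 * via spdk_nvme_probe(). */
#include <spdk/env.h>
#include <spdk/nvme.h>
#include <stdbool.h>

static bool g_done = false;

static void write_complete(void *arg, const struct spdk_nvme_cpl *cpl)
{
    g_done = true;   /* a completion entry arrived on the completion queue */
}

void write_one_block(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
    /* One I/O queue pair = one submission queue + one completion queue. */
    struct spdk_nvme_qpair *qpair =
        spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);

    uint32_t block = spdk_nvme_ns_get_sector_size(ns);
    void *buf = spdk_dma_zmalloc(block, block, NULL);   /* DMA-able buffer */

    /* Asynchronous submission: returns immediately, callback fires later. */
    spdk_nvme_ns_cmd_write(ns, qpair, buf, /*lba=*/0, /*lba_count=*/1,
                           write_complete, NULL, /*io_flags=*/0);

    /* Poll the completion queue instead of waiting for an interrupt. */
    while (!g_done)
        spdk_nvme_qpair_process_completions(qpair, 0);

    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qpair);
}
```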
  26. Overview of NVMe-over-Fabrics
      • Remote access to flash with NVMe over the network
      • The RDMA fabric is of most importance
        – Low latency makes remote access feasible
        – 1-to-1 mapping of NVMe I/O queues to RDMA send/recv queues
      • Low latency overhead compared to local I/O
      [Figure: NVMf architecture – I/O submission and completion queues mapped over the RDMA fabric (SQ/RQ) to the NVMe device]
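For reference, the sketch below shows roughly how a host connects to a remote NVMe subsystem over an RDMA fabric with SPDK. The target address, port, and subsystem NQN are placeholders; once connected, the controller is driven with the same spdk_nvme_ns_cmd_* calls used for a local PCIe device.

```c
/* Sketch: connect to a remote NVMe-oF subsystem over RDMA using SPDK.
 * The traddr, trsvcid, and subnqn values below are placeholders. */
#include <spdk/nvme.h>
#include <stdio.h>
#include <string.h>

struct spdk_nvme_ctrlr *connect_nvmf_target(void)
{
    struct spdk_nvme_transport_id trid;
    memset(&trid, 0, sizeof(trid));

    if (spdk_nvme_transport_id_parse(&trid,
            "trtype:RDMA adrfam:IPv4 traddr:192.168.1.10 trsvcid:4420 "
            "subnqn:nqn.2019-05.io.example:disaggregated-ssd") != 0) {
        fprintf(stderr, "failed to parse transport id\n");
        return NULL;
    }

    /* Establishes the NVMe-oF admin and I/O queue mappings over RDMA QPs. */
    struct spdk_nvme_ctrlr *ctrlr = spdk_nvme_connect(&trid, NULL, 0);
    if (ctrlr == NULL)
        fprintf(stderr, "failed to connect to the NVMf target\n");
    return ctrlr;
}
```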
  27. Design Challenges with NVMe-SSD
      • QoS
        – Hardware-assisted QoS
      • Persistence
        – Flushing buffered data
      • Performance
        – Consider flash-related design aspects
        – Read/write performance skew
        – Garbage collection
      • Virtualization
        – SR-IOV hardware support
        – Namespace isolation
      • New software systems (co-design)
        – Disaggregated storage with NVMf
        – Persistent caches
  28. Evaluation with RocksDB (Latency)
      • 20%, 33%, and 61% improvement for Insert, Write Sync, and Read Write
      • Overwrite: compaction and flushing happen in the background – low potential for improvement
      • Read: performance is much worse; additional tuning/optimization required
      [Charts: latency (us) for Insert, Overwrite, Random Read and for Write Sync, Read Write with POSIX vs. SPDK]
  29. Evaluation with RocksDB (Throughput)
      • 25%, 50%, and 160% improvement for Insert, Write Sync, and Read Write
      • Overwrite: compaction and flushing happen in the background – low potential for improvement
      • Read: performance is much worse; additional tuning/optimization required
      [Charts: throughput (ops/sec) for Insert, Overwrite, Random Read and for Write Sync, Read Write with POSIX vs. SPDK]
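To make the RocksDB workloads above concrete, the sketch below shows the two write modes through RocksDB's C API, assuming 'Insert' corresponds to default (non-synced) writes and 'Write Sync' to writes issued with sync enabled; the database path is a placeholder.

```c
/* Minimal sketch of the two write modes using RocksDB's C API: a plain
 * insert (WAL not fsync'ed per write) and a synced write (each write forced
 * to stable storage before returning). The DB path is a placeholder. */
#include <rocksdb/c.h>
#include <stdio.h>

int main(void)
{
    char *err = NULL;
    rocksdb_options_t *opts = rocksdb_options_create();
    rocksdb_options_set_create_if_missing(opts, 1);
    rocksdb_t *db = rocksdb_open(opts, "/tmp/rocksdb_nvme_test", &err);
    if (err != NULL) { fprintf(stderr, "open: %s\n", err); return 1; }

    /* Insert: default write options, no per-write fsync. */
    rocksdb_writeoptions_t *w_async = rocksdb_writeoptions_create();
    rocksdb_put(db, w_async, "key1", 4, "value1", 6, &err);

    /* Write Sync: force each write to be persisted before returning. */
    rocksdb_writeoptions_t *w_sync = rocksdb_writeoptions_create();
    rocksdb_writeoptions_set_sync(w_sync, 1);
    rocksdb_put(db, w_sync, "key2", 4, "value2", 6, &err);

    rocksdb_writeoptions_destroy(w_async);
    rocksdb_writeoptions_destroy(w_sync);
    rocksdb_options_destroy(opts);
    rocksdb_close(db);
    return 0;
}
```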
  30. QoS-aware SPDK Design
      • Synthetic application scenarios with different QoS requirements
        – Comparison against SPDK with Weighted Round Robin (WRR) NVMe arbitration
      • Near-desired job bandwidth ratios
      • Stable and consistent bandwidth
      [Charts: Scenario 1 bandwidth (MB/s) over time for high- and medium-priority jobs under WRR vs. the OSU design; job bandwidth ratios for scenarios 2–5 with SPDK-WRR, OSU-Design, and the desired ratio]
      S. Gugnani, X. Lu, and D. K. Panda, Analyzing, Modeling, and Provisioning QoS for NVMe SSDs, 11th IEEE/ACM International Conference on Utility and Cloud Computing (UCC), Dec 2018
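For context, the SPDK-WRR baseline referenced above can be set up with stock SPDK roughly as sketched below: weighted-round-robin arbitration is requested when the controller is attached, and each job's I/O queue pair is created with a different priority class. This reflects the standard SPDK mechanism, not the OSU QoS-aware design.

```c
/* Sketch of the WRR baseline (not the OSU QoS design): request weighted
 * round robin arbitration at attach time, then give each job's I/O queue
 * pair a different priority class. */
#include <spdk/nvme.h>
#include <stdbool.h>

/* Ask for WRR arbitration in the probe callback passed to spdk_nvme_probe(). */
static bool probe_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                     struct spdk_nvme_ctrlr_opts *opts)
{
    opts->arb_mechanism = SPDK_NVME_CC_AMS_WRR;
    return true;   /* attach to this controller */
}

/* Allocate a queue pair whose commands are arbitrated at the given priority. */
static struct spdk_nvme_qpair *
alloc_prio_qpair(struct spdk_nvme_ctrlr *ctrlr, enum spdk_nvme_qprio prio)
{
    struct spdk_nvme_io_qpair_opts qp_opts;

    spdk_nvme_ctrlr_get_default_io_qpair_opts(ctrlr, &qp_opts, sizeof(qp_opts));
    qp_opts.qprio = prio;   /* e.g., SPDK_NVME_QPRIO_HIGH vs. SPDK_NVME_QPRIO_MEDIUM */
    return spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, &qp_opts, sizeof(qp_opts));
}
```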
  31. Conclusion and Future Work
      • Big Data analytics needs high-performance NVM-aware RDMA-based communication and I/O schemes
        – Proposed a new library, NRCIO (work in progress)
        – Re-designed the HDFS storage architecture with NVRAM
        – Re-designed RDMA-MapReduce with NVRAM
        – Designed Big Data analytics stacks with NVMe and NVMf protocols
        – Results are promising
      • Future work:
        – Further optimizations in NRCIO
        – Co-design with more Big Data analytics frameworks: TensorFlow, object storage, databases, etc.
  32. Thank You!
      Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
      The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
      luxi@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~luxi
      shankard@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~shankar.50
