High-Performance Big Data Analytics with RDMA
over NVM and NVMe-SSD
Talk at OFA Workshop 2018
by
Xiaoyi Lu
The Ohio State University
E-mail: luxi@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~luxi
Big Data Management and Processing on Modern Clusters
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online)
• Memcached + DB (e.g., MySQL), HBase
– Back-end data analytics (Offline)
• HDFS, MapReduce, Spark
Big Data Processing with Apache Big Data Analytics Stacks
• Major components included:
– MapReduce (Batch)
– Spark (Iterative and Interactive)
– HBase (Query)
– HDFS (Storage)
– RPC (Inter-process communication)
• Underlying Hadoop Distributed File
System (HDFS) used by MapReduce,
Spark, HBase, and many others
• The model scales, but the large amount of communication and I/O can be further optimized!
[Figure: Apache Big Data Analytics Stack – User Applications on top of MapReduce, Spark, and HBase, over Hadoop Common (RPC) and HDFS]
Drivers of Modern HPC Cluster and Data Center Architecture
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
– Single Root I/O Virtualization (SR-IOV)
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Accelerators (NVIDIA GPGPUs and FPGAs)
[Figure: Drivers – high-performance interconnects such as InfiniBand with SR-IOV (<1 usec latency, 200 Gbps bandwidth); multi-/many-core processors; accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM; cloud systems such as SDSC Comet and TACC Stampede]
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 280 organizations from 34 countries
• More than 25,750 downloads from the project site
• Available for InfiniBand and RoCE
• Available for x86 and OpenPOWER
• Significant performance improvement with ‘RDMA+DRAM’ compared to default Sockets-based designs; how about RDMA+NVRAM?
Non-Volatile Memory (NVM) and NVMe-SSD
[Figures: 3D XPoint from Intel & Micron; Samsung NVMe SSD; performance of PMC Flashtec NVRAM [*]]
• Non-Volatile Memory (NVM) provides byte-addressability with persistence
• The huge explosion of data in diverse fields requires fast analysis and storage
• NVMs provide the opportunity to build high-throughput storage systems for data-intensive applications
• Storage technology is moving rapidly towards NVM
[*] http://www.enterprisetech.com/2014/08/06/flashtec-nvram-15-million-iops-sub-microsecond-latency/
NVRAM Emulation based on DRAM
• Popular methods employed by recent works to emulate an NVRAM performance model over DRAM
• Two ways:
– Emulate byte-addressable NVRAM over DRAM: application → persistent memory library (load/store, pmem_memcpy_persist) → DRAM, with Clflush + Delay modeling the slower NVM writes (a minimal sketch follows below)
– Emulate a block-based NVM device over DRAM: application → virtual file system (open/read/write/close, or mmap/memcpy/msync with DAX) → PCMDisk block device (RAM-Disk + Delay) → DRAM
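To make the "Clflush + Delay" path concrete, here is a minimal C sketch of the byte-addressable emulation, assuming a hand-rolled helper rather than any code from the talk; the function name emulated_pmem_memcpy_persist and the per-line delay constant are illustrative assumptions.

/* Minimal sketch of the "clflush + delay" byte-addressable NVRAM emulation.
 * NVM_WRITE_DELAY_NS is an assumed tuning knob, not a value from the talk. */
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

#define CACHE_LINE 64
#define NVM_WRITE_DELAY_NS 500   /* assumed per-line delay modeling slower NVM writes */

static void busy_wait_ns(uint64_t ns)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((uint64_t)(now.tv_sec - start.tv_sec) * 1000000000ULL
             + (uint64_t)(now.tv_nsec - start.tv_nsec) < ns);
}

/* Emulated persistent copy: write to DRAM, flush each cache line,
 * add an artificial delay per flushed line, then fence. */
void emulated_pmem_memcpy_persist(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    for (uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHE_LINE - 1);
         p < (uintptr_t)dst + len; p += CACHE_LINE) {
        _mm_clflush((void *)p);
        busy_wait_ns(NVM_WRITE_DELAY_NS);
    }
    _mm_sfence();   /* order the flushes before subsequent stores */
}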
Presentation Outline
• NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
• NRCIO for Big Data Analytics
• NVMe-SSD based Big Data Analytics
• Conclusion and Q&A
Design Scope (NVM for RDMA)
• D-to-D over RDMA: communication buffers for both client and server are allocated in DRAM (common)
• D-to-N over RDMA: communication buffers for the client are allocated in DRAM; the server uses NVM
• N-to-D over RDMA: communication buffers for the client are allocated in NVM; the server uses DRAM
• N-to-N over RDMA: communication buffers for both client and server are allocated in NVM (a buffer-registration sketch follows below)
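A minimal sketch of what the "N" side of these scopes requires on a host: map an NVM-backed region and register it with the RDMA NIC so it can serve as a communication buffer. The DAX file path and buffer size are assumptions, and queue-pair setup plus the out-of-band exchange of the rkey/address are omitted.

/* Sketch: exposing an NVM-backed buffer to RDMA. Assumes the NVM is exposed
 * as a file on a DAX-mounted file system; the path is illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <infiniband/verbs.h>

int main(void)
{
    const size_t len = 1UL << 20;                    /* 1 MB communication buffer */

    /* Map a file on a DAX-mounted persistent-memory file system. */
    int fd = open("/mnt/pmem/rdma_buf", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, len)) { perror("open/ftruncate"); return 1; }
    void *nvm_buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (nvm_buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Open the first RDMA device and register the NVM region with it. */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_mr *mr = ibv_reg_mr(pd, nvm_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("NVM buffer registered: lkey=0x%x rkey=0x%x\n",
           (unsigned)mr->lkey, (unsigned)mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    munmap(nvm_buf, len);
    close(fd);
    return 0;
}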
NVRAM-aware Communication in NRCIO
[Figures: NRCIO Send/Recv over NVRAM; NRCIO RDMA_Read over NVRAM]
NRCIO Send/Recv Emulation over PCM
• Comparison of communication latency using NRCIO send/receive semantics over InfiniBand QDR
network and PCM memory
• High communication latencies due to slower writes to non-volatile persistent memory
• NVRAM-to-Remote-NVRAM (NVRAM-NVRAM) => ~10x overhead vs. DRAM-DRAM
• DRAM-to-Remote-NVRAM (DRAM-NVRAM) => ~8x overhead vs. DRAM-DRAM
• NVRAM-to-Remote-DRAM (NVRAM-DRAM) => ~4x overhead vs. DRAM-DRAM
[Figure: IB Send/Recv emulation over PCM – latency (us) vs. message size (1 B to 2 MB); series: DRAM-DRAM, NVRAM-DRAM, DRAM-NVRAM, NVRAM-NVRAM, DRAM-SSD-Msync_Per_Line, DRAM-SSD-Msync_Per_8KPage, DRAM-SSD-Msync_Per_4KPage]
NRCIO RDMA-Read Emulation over PCM
[Figure: IB RDMA_READ emulation over PCM – latency (us) vs. message size (1 B to 2 MB); series: DRAM-DRAM, NVRAM-DRAM, DRAM-NVRAM, NVRAM-NVRAM, DRAM-SSD-Msync_Per_Line, DRAM-SSD-Msync_Per_8KPage, DRAM-SSD-Msync_Per_4KPage]
• Communication latency with NRCIO RDMA-Read over InfiniBand QDR + PCM memory
• Communication overheads for large messages due to slower writes into NVRAM from
remote memory; similar to Send/Receive
• RDMA-Read outperforms Send/Receive for large messages, as observed for DRAM-DRAM (a minimal RDMA_READ posting sketch follows below)
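For reference, the RDMA-Read path being compared boils down to posting a work request like the one below. This is standard libibverbs usage, not NRCIO-specific code; the connected queue pair, registered memory region, and remote address/rkey are assumed to already exist.

/* Sketch: issue an RDMA_READ that pulls data from a remote (possibly
 * NVM-backed) buffer into a local buffer. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
                   size_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
}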
Presentation Outline
• NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
• NRCIO for Big Data Analytics
• NVMe-SSD based Big Data Analytics
• Conclusion and Q&A
Opportunities of Using NVRAM+RDMA in HDFS
• Files are divided into fixed-size blocks
– Blocks are divided into packets
• NameNode: stores the file system namespace
• DataNode: stores data blocks in local storage devices
• Uses block replication for fault tolerance
– Replication enhances data locality and read throughput
• Communication and I/O intensive
• Java Sockets based communication
• Data needs to be persistent, typically on SSD/HDD (a minimal client-side write sketch follows below)
[Figure: HDFS architecture – Client, NameNode, and DataNodes]
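For context on the client write path described above, here is a minimal sketch using libhdfs, the C binding shipped with Apache Hadoop; the NameNode host/port, file path, replication factor, and block size are illustrative placeholders, and this is plain HDFS, not NVFS.

/* Sketch: HDFS client write path. The client asks the NameNode for block
 * locations and streams packets to DataNodes under the hood. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include "hdfs.h"

int main(void)
{
    hdfsFS fs = hdfsConnect("namenode.example.com", 9000);
    if (!fs) { fprintf(stderr, "failed to connect to HDFS\n"); return 1; }

    /* Create a file with 3-way replication and a 128 MB block size. */
    hdfsFile out = hdfsOpenFile(fs, "/tmp/nvfs_demo.txt", O_WRONLY | O_CREAT,
                                0 /* default buffer */, 3 /* replication */,
                                134217728 /* block size */);
    if (!out) { fprintf(stderr, "failed to open file\n"); return 1; }

    const char *msg = "hello from the HDFS client write path\n";
    hdfsWrite(fs, out, msg, (tSize)strlen(msg));  /* data is split into packets */
    hdfsFlush(fs, out);                            /* push buffered packets */
    hdfsCloseFile(fs, out);
    hdfsDisconnect(fs);
    return 0;
}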
Design Overview of NVM and RDMA-aware HDFS (NVFS)
• Design Features
– RDMA over NVM
– HDFS I/O with NVM: Block Access and Memory Access (contrasted in the sketch at the end of this slide)
– Hybrid design: NVM with SSD as a hybrid storage for HDFS I/O
– Co-Design with Spark and HBase: cost-effectiveness, use-case
[Figure: NVM and RDMA-aware HDFS (NVFS) architecture – applications and benchmarks (Hadoop MapReduce, Spark, HBase) co-designed (cost-effectiveness, use-case) with NVFS; the DFSClient uses an RDMA Sender, while the DataNode's RDMA Receiver and RDMA Replicator feed a Writer/Reader that accesses NVM through NVFS-BlkIO and NVFS-MemIO, backed by SSDs]
N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, 24th International Conference on Supercomputing (ICS), June 2016
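To contrast the two NVM access modes named above (Block Access vs. Memory Access), here is an illustrative-only C sketch using PMDK's libpmem for the memory path and the ordinary file API for the block path; the mount points and file names are assumptions, and this is not the NVFS code itself.

/* Memory access vs. block access to NVM, in miniature. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <libpmem.h>

int main(void)
{
    const char *data = "HDFS block payload";

    /* Memory access: map a persistent-memory file and persist with stores. */
    size_t mapped_len;
    int is_pmem;
    char *pmem = pmem_map_file("/mnt/pmem/nvfs_memio.blk", 4096,
                               PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
    if (!pmem) { perror("pmem_map_file"); return 1; }
    pmem_memcpy_persist(pmem, data, strlen(data));   /* byte-addressable path */
    pmem_unmap(pmem, mapped_len);

    /* Block access: go through the block/file interface instead. */
    int fd = open("/mnt/ssd/nvfs_blkio.blk", O_CREAT | O_WRONLY, 0600);
    if (fd < 0) { perror("open"); return 1; }
    write(fd, data, strlen(data));
    fsync(fd);                                       /* flush to the device */
    close(fd);
    return 0;
}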
Evaluation with Hadoop MapReduce
[Figure: TestDFSIO average throughput (MBps) for Write and Read with HDFS (56Gbps), NVFS-BlkIO (56Gbps), and NVFS-MemIO (56Gbps) on SDSC Comet (32 nodes) and OSU Nowlab (4 nodes)]
• TestDFSIO on SDSC Comet (32 nodes)
– Write: NVFS-MemIO gains by 4x over
HDFS
– Read: NVFS-MemIO gains by 1.2x over
HDFS
• TestDFSIO on OSU Nowlab (4 nodes)
– Write: NVFS-MemIO gains by 4x over
HDFS
– Read: NVFS-MemIO gains by 2x over
HDFS
Evaluation with HBase
[Figure: YCSB throughput (ops/s) vs. cluster size : number of records (8:800K, 16:1600K, 32:3200K) for HDFS (56Gbps) and NVFS (56Gbps) – HBase 100% insert and HBase 50% read, 50% update]
• YCSB 100% Insert on SDSC Comet (32 nodes)
– NVFS-BlkIO gains by 21% by storing only WALs to NVM
• YCSB 50% Read, 50% Update on SDSC Comet (32 nodes)
– NVFS-BlkIO gains by 20% by storing only WALs to NVM
Opportunities to Use NVRAM+RDMA in MapReduce
• Map and Reduce tasks carry out the total job execution
– Map tasks read input from HDFS, process it, and write the intermediate data to local disk (persistent)
– Reduce tasks fetch this data via shuffle from the NodeManagers, process it, and write the output to HDFS (persistent)
• Communication and I/O intensive; the Shuffle phase uses HTTP over Java Sockets, and I/O operations typically take place on SSD/HDD
[Figure: MapReduce dataflow, highlighting the disk operations and bulk data transfer]
Opportunities to Use NVRAM in MapReduce-RDMA Design
[Figure: MapReduce-RDMA dataflow – Map tasks (Read, Map, Spill, Merge) feed Reduce tasks (Shuffle, In-Memory Merge, Reduce) over RDMA, from input files through intermediate data to output files]
• All operations are in-memory
• Opportunities exist to improve the performance with NVRAM
NVRAM-Assisted Map Spilling in MapReduce-RDMA
[Figure: Same MapReduce-RDMA dataflow, with the Map-side Spill redirected to NVRAM]
• Minimizes the disk operations in the Spill phase
M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, Can Non-Volatile Memory Benefit MapReduce Applications on HPC Clusters?, PDSW-DISCS, held with SC 2016.
M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, NVMD: Non-Volatile Memory Assisted Design for Accelerating MapReduce and DAG Execution Frameworks on HPC Systems, IEEE BigData 2017.
Comparison with Sort and TeraSort
• RMR-NVM achieves 2.37x benefit for Map
phase compared to RMR and MR-IPoIB;
overall benefit 55% compared to MR-IPoIB,
28% compared to RMR
[Figure: Sort and TeraSort performance with MR-IPoIB, RMR, and RMR-NVM]
• RMR-NVM achieves 2.48x benefit for Map
phase compared to RMR and MR-IPoIB;
overall benefit 51% compared to MR-IPoIB,
31% compared to RMR
Evaluation of Intel HiBench Workloads
• We evaluate different HiBench
workloads with Huge data sets
on 8 nodes
• Performance benefits for
Shuffle-intensive workloads
compared to MR-IPoIB:
– Sort: 42% (25 GB)
– TeraSort: 39% (32 GB)
– PageRank: 21% (5 million pages)
• Other workloads:
– WordCount: 18% (25 GB)
– KMeans: 11% (100 million samples)
Evaluation of PUMA Workloads
• We evaluate different PUMA
workloads on 8 nodes with
30GB data size
• Performance benefits for
Shuffle-intensive workloads
compared to MR-IPoIB :
– AdjList: 39%
– SelfJoin: 58%
– RankedInvIndex: 39%
• Other workloads:
– SeqCount: 32%
– InvIndex: 18%
Presentation Outline
• NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
• NRCIO for Big Data Analytics
• NVMe-SSD based Big Data Analytics
• Conclusion and Q&A
Overview of NVMe Standard
• NVMe is the standardized interface
for PCIe SSDs
• Built on ‘RDMA’ principles
– Submission and completion I/O
queues
– Similar semantics as RDMA send/recv
queues
– Asynchronous command processing
• Up to 64K I/O queues, with up to 64K
commands per queue
• Efficient small random I/O operations (a queue-pair usage sketch follows below)
• MSI/MSI-X and interrupt aggregation
[Figure: NVMe command processing (Source: NVMExpress.org)]
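The queue-pair model above can be exercised directly from user space with SPDK. The sketch below is a trimmed, assumption-laden outline (attach to the first probed controller, use namespace 1, issue one write to LBA 0, minimal error handling); SPDK's hello_world example is the authoritative reference.

/* Sketch: one NVMe I/O submission/completion queue pair via SPDK. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *g_ctrlr;
static struct spdk_nvme_ns *g_ns;
static bool g_done;

static bool probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
                     struct spdk_nvme_ctrlr_opts *opts)
{
    return true;                         /* attach to every controller found */
}

static void attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
                      struct spdk_nvme_ctrlr *ctrlr,
                      const struct spdk_nvme_ctrlr_opts *opts)
{
    g_ctrlr = ctrlr;
    g_ns = spdk_nvme_ctrlr_get_ns(ctrlr, 1);   /* namespace 1 */
}

static void write_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    g_done = true;                       /* completion entry arrived */
}

int main(void)
{
    struct spdk_env_opts opts;
    spdk_env_opts_init(&opts);
    opts.name = "nvme_qpair_sketch";
    if (spdk_env_init(&opts) < 0) return 1;

    /* Enumerate local PCIe NVMe controllers and attach. */
    if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) || !g_ns) return 1;

    /* One I/O submission/completion queue pair (up to 64K are allowed). */
    struct spdk_nvme_qpair *qpair =
        spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);

    /* DMA-able buffer, one block written to LBA 0. */
    void *buf = spdk_dma_zmalloc(4096, 4096, NULL);
    strcpy((char *)buf, "hello nvme");
    spdk_nvme_ns_cmd_write(g_ns, qpair, buf, 0 /* LBA */, 1 /* blocks */,
                           write_done, NULL, 0);

    /* Asynchronous command processing: poll the completion queue. */
    while (!g_done)
        spdk_nvme_qpair_process_completions(qpair, 0);

    printf("write completed\n");
    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qpair);
    spdk_nvme_detach(g_ctrlr);
    return 0;
}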
Overview of NVMe-over-Fabric
• Remote access to flash with NVMe
over the network
• The RDMA fabric is the most important enabler
– Low latency makes remote access feasible
– 1-to-1 mapping of NVMe I/O queues to RDMA send/recv queues
[Figure: NVMf architecture – the host's I/O submission and completion queues are mapped over the RDMA fabric (SQ/RQ) to the remote NVMe device]
• Low latency overhead compared to local I/O
Design Challenges with NVMe-SSD
• QoS
– Hardware-assisted QoS
• Persistence
– Flushing buffered data
• Performance
– Consider flash related design aspects
– Read/Write performance skew
– Garbage collection
• Virtualization
– SR-IOV hardware support
– Namespace isolation
• New software systems
– Disaggregated Storage with NVMf
– Persistent Caches
Evaluation with RocksDB
[Figure: RocksDB average latency (us), POSIX vs. SPDK – Insert, Overwrite, Random Read (left); Write Sync, Read Write (right)]
• 20%, 33%, 61% improvement for Insert, Write Sync, and Read Write
• Overwrite: Compaction and flushing in background
– Low potential for improvement
• Random Read: performance is much worse; additional tuning/optimization required (a minimal RocksDB usage sketch follows below)
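For reference, the operations being measured map onto RocksDB calls roughly as in the sketch below, written against RocksDB's C API; the database path is a placeholder, and the SPDK vs. POSIX backend choice is configured outside this snippet.

/* Sketch: the RocksDB operations behind "Insert", "Write Sync", and
 * "Random Read". */
#include <stdio.h>
#include <stdlib.h>
#include "rocksdb/c.h"

int main(void)
{
    char *err = NULL;

    rocksdb_options_t *opts = rocksdb_options_create();
    rocksdb_options_set_create_if_missing(opts, 1);
    rocksdb_t *db = rocksdb_open(opts, "/tmp/rocksdb_demo", &err);
    if (err) { fprintf(stderr, "open: %s\n", err); return 1; }

    /* "Insert": buffered write through the WAL and memtable. */
    rocksdb_writeoptions_t *wopts = rocksdb_writeoptions_create();
    rocksdb_put(db, wopts, "key1", 4, "value1", 6, &err);

    /* "Write Sync": same write, but fsync the WAL before returning. */
    rocksdb_writeoptions_t *wopts_sync = rocksdb_writeoptions_create();
    rocksdb_writeoptions_set_sync(wopts_sync, 1);
    rocksdb_put(db, wopts_sync, "key2", 4, "value2", 6, &err);

    /* "Random Read": point lookup. */
    rocksdb_readoptions_t *ropts = rocksdb_readoptions_create();
    size_t vlen;
    char *val = rocksdb_get(db, ropts, "key1", 4, &vlen, &err);
    if (val) { printf("key1 -> %.*s\n", (int)vlen, val); free(val); }

    rocksdb_readoptions_destroy(ropts);
    rocksdb_writeoptions_destroy(wopts_sync);
    rocksdb_writeoptions_destroy(wopts);
    rocksdb_close(db);
    rocksdb_options_destroy(opts);
    return 0;
}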
Evaluation with RocksDB
[Figure: RocksDB throughput (ops/sec), POSIX vs. SPDK – Write Sync, Read Write (left); Insert, Overwrite, Random Read (right)]
• 25%, 50%, 160% improvement for Insert, Write Sync, and Read Write
• Overwrite: Compaction and flushing in background
– Low potential for improvement
• Random Read: performance is much worse; additional tuning/optimization required
QoS-aware SPDK Design
[Figures: Scenario 1 – bandwidth (MB/s) over time for the high- and medium-priority jobs under SPDK WRR vs. the OSU design; synthetic application scenarios 2-5 – job bandwidth ratio for SPDK-WRR and OSU-Design against the desired ratio]
• Synthetic application scenarios with different QoS requirements
– Comparison using SPDK with Weighted Round Robin (WRR) NVMe arbitration (a qpair-priority sketch follows below)
• Near desired job bandwidth ratios
• Stable and consistent bandwidth
S. Gugnani, X. Lu, and D. K. Panda, Analyzing, Modeling, and Provisioning QoS for NVMe SSDs, (Under Review)
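As a point of reference for the WRR baseline, the sketch below shows how generic SPDK code assigns a WRR priority class to an I/O queue pair; it assumes WRR arbitration was selected when the controller was attached, and it is not the OSU QoS-aware design itself.

/* Sketch: allocate an I/O queue pair in the HIGH WRR priority class. */
#include "spdk/nvme.h"

struct spdk_nvme_qpair *alloc_high_priority_qpair(struct spdk_nvme_ctrlr *ctrlr)
{
    struct spdk_nvme_io_qpair_opts opts;

    /* Start from the controller's defaults, then request a WRR priority
     * class for this submission queue (URGENT/HIGH/MEDIUM/LOW). */
    spdk_nvme_ctrlr_get_default_io_qpair_opts(ctrlr, &opts, sizeof(opts));
    opts.qprio = SPDK_NVME_QPRIO_HIGH;

    return spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, &opts, sizeof(opts));
}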
Conclusion and Future Work
• Exploring NVM-aware RDMA-based Communication and I/O Schemes
for Big Data Analytics
• Proposed a new library, NRCIO (work-in-progress)
• Re-design HDFS storage architecture with NVRAM
• Re-design RDMA-MapReduce with NVRAM
• Design Big Data analytics stacks with NVMe and NVMf protocols
• Results are promising
• Further optimizations in NRCIO
• Co-design with more Big Data analytics frameworks
The 4th International Workshop on
High-Performance Big Data Computing (HPBDC)
HPBDC 2018 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), Vancouver, British Columbia, Canada, May 2018
Workshop Date: May 21st, 2018
Keynote Talk: Prof. Geoffrey Fox, Twister2: A High-Performance Big Data Programming Environment
Six Regular Research Papers and Two Short Research Papers
Panel Topic: Which Framework is the Best for High-Performance Deep Learning:
Big Data Framework or HPC Framework?
http://web.cse.ohio-state.edu/~luxi/hpbdc2018
HPBDC 2017 was held in conjunction with IPDPS’17
http://web.cse.ohio-state.edu/~luxi/hpbdc2017
HPBDC 2016 was held in conjunction with IPDPS’16
http://web.cse.ohio-state.edu/~luxi/hpbdc2016
luxi@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~luxi
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/