MemzNet: Memory-Mapped Zero-copy Network Channel -- Streaming Exascale Data over 100Gbps Networks
Streaming Exa-scale Data over 100Gbps Networks
Mehmet Balman, Computational Research Division, Lawrence Berkeley National Laboratory
Collaborators: Eric Pouyoul, Yushu Yao, E. Wes Bethel, Burlen Loring, Prabhat, John Shalf, Alex Sim, Arie Shoshani, Dean N. Williams, Brian L. Tierney
Outline
• A recent 100Gbps demo by ESnet and Internet2 at SC11
• One of the applications:
  • Data movement of large datasets with many files (Scaling the Earth System Grid to 100Gbps Networks)
Climate Data Distribution
• ESG data nodes
• Data replication in the ESG Federation
• Local copies: data files are copied into temporary storage in HPC centers for post-processing and further climate analysis.
Climate Data over 100Gbps
• Data volume in climate applications is increasing exponentially.
• An important challenge in managing ever-increasing data sizes in climate science is the large variance in file sizes.
• Climate simulation data consists of a mix of relatively small and large files with an irregular file-size distribution in each dataset.
• Many small files
Keep the data channel full
[Figure: request/response exchanges for RPC (request data / send data) and FTP (request a file / send file)]
• Concurrent transfers
• Parallel streams (see the sketch below)
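To make the parallel-streams idea concrete, here is a minimal C sketch (not from the demo code) that keeps the channel busy by opening several TCP connections to the same destination and sending data from one thread per stream. The address, port, stream count, and data volume are illustrative placeholders.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_STREAMS 4            /* hypothetical parallelism level */
#define CHUNK (1 << 20)          /* 1 MB per send() call in this sketch */

static void *stream_worker(void *arg)
{
    (void)arg;
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(5001);                      /* illustrative port */
    inet_pton(AF_INET, "203.0.113.10", &dst.sin_addr); /* example address */

    if (connect(sock, (struct sockaddr *)&dst, sizeof dst) == 0) {
        static char chunk[CHUNK];                 /* data to push */
        for (int i = 0; i < 64; i++)              /* each stream sends 64 MB */
            send(sock, chunk, sizeof chunk, 0);
    }
    close(sock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; i++)         /* one thread per TCP stream */
        pthread_create(&tid[i], NULL, stream_worker, NULL);
    for (int i = 0; i < NUM_STREAMS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}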
lots-of-small-files problem! file-centric tools?
• Not necessarily high-speed (same distance)
  - Latency is still a problem
[Figure: request a dataset / send data over a 100Gbps pipe vs. a 10Gbps pipe]
Framework for the Memory-mapped Network Channel
Memory caches are logically mapped between client and server.
Advantages
• Decoupling of I/O and network operations
  • front-end (I/O processing)
  • back-end (networking layer)
• Not limited by the characteristics of the file sizes: an on-the-fly tar approach, bundling and sending many files together
• Dynamic data channel management: the parallelism level can be increased/decreased both in the network communication and in the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants). A minimal sketch of this decoupling follows below.
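The decoupling above can be pictured as front-end (I/O) threads and back-end (network) threads sharing a pool of fixed-size memory blocks. The following C sketch is a simplified illustration under that assumption, not MemzNet's actual code: a bounded ring buffer of blocks connects the two sides, so threads can be added or removed on either side without touching the data channel.

/* A minimal sketch, assuming a simplified in-memory block pool: front-end
 * (I/O) threads fill fixed-size blocks and back-end (network) threads drain
 * them. The queue, sizes, and stub threads below are illustrative, not
 * MemzNet's actual implementation. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE (4 << 20)   /* 4 MB blocks, matching the demo setup */
#define POOL_BLOCKS 16         /* illustrative cache depth */

typedef struct {
    char  *data;
    size_t len;
} block_t;

static block_t pool[POOL_BLOCKS];       /* ring buffer of ready blocks */
static int head, tail, count;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

/* Front-end side: hand a filled block to the networking layer. */
static void push_block(block_t b)
{
    pthread_mutex_lock(&lock);
    while (count == POOL_BLOCKS)
        pthread_cond_wait(&not_full, &lock);
    pool[tail] = b;
    tail = (tail + 1) % POOL_BLOCKS;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Back-end side: take the next ready block for transmission. */
static block_t pop_block(void)
{
    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &lock);
    block_t b = pool[head];
    head = (head + 1) % POOL_BLOCKS;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return b;
}

static void *frontend(void *arg)   /* stands in for file I/O */
{
    block_t b = { malloc(BLOCK_SIZE), BLOCK_SIZE };
    memset(b.data, 0, b.len);      /* placeholder for reading file data */
    push_block(b);
    return arg;
}

static void *backend(void *arg)    /* stands in for a TCP stream writer */
{
    block_t b = pop_block();
    free(b.data);                  /* placeholder for send() and release */
    return arg;
}

int main(void)
{
    pthread_t f, bk;                          /* one thread per side */
    pthread_create(&f,  NULL, frontend, NULL);
    pthread_create(&bk, NULL, backend,  NULL);
    pthread_join(f,  NULL);
    pthread_join(bk, NULL);
    return 0;
}

In a real transfer the front-end would fill blocks from data files and the back-end would write them to TCP streams; here both sides are stubs to keep the sketch self-contained.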
The SC11 100Gbps Demo
• CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size: 4MB
• Each block's data section was aligned according to the system page size.
• 1GB cache both at the client and the server
• At NERSC, 8 front-end threads on each host for reading data files in parallel.
• At ANL/ORNL, 4 front-end threads for processing received data blocks.
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.
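As an illustration of the block configuration above (4MB blocks with data sections aligned to the system page size), the short C sketch below allocates page-aligned buffers with posix_memalign; the helper name and surrounding structure are ours, not MemzNet's.

#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE (4UL << 20)   /* 4 MB, as in the SC11 configuration */

/* Returns a page-aligned buffer of BLOCK_SIZE bytes, or NULL on failure. */
static void *alloc_aligned_block(void)
{
    void *buf = NULL;
    long page = sysconf(_SC_PAGESIZE);           /* typically 4096 bytes */
    if (page <= 0 || posix_memalign(&buf, (size_t)page, BLOCK_SIZE) != 0)
        return NULL;
    return buf;
}

int main(void)
{
    void *block = alloc_aligned_block();
    /* ... fill the block from a data file and queue it for a back-end thread ... */
    free(block);
    return block ? 0 : 1;
}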
MemzNet: memory-mapped zero-copy network channel
[Figure: front-end threads on each side access memory blocks; blocks move memory-to-memory over the network; memory caches are logically mapped between client and server]
ANI Middleware Testbed
[Figure: testbed topology used for the SC11 100Gbps demo. NERSC hosts (nersc-diskpt-1/2/3, nersc-app) and ANL hosts (anl-mempt-1/2/3, anl-app) connect through site switches and routers to the ANI 100G routers and the ANI 100Gbps network; each disk/memory host has 4x10GE (MM) NICs (Myricom, Chelsio, Mellanox, HotLava). Note: ANI 100G routers and the 100G wave are available until summer 2012; testbed resources after that are subject to funding availability. Updated December 11, 2011.]
Many TCP Streams
(a) Total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, for a total of 10 10Gbps NIC pairs, were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, were started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., when the concurrency level is 4, there are 40 streams in total).
Effects of Many Streams
ANI testbed 100Gbps (10x10 NICs, three hosts): interrupts per CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals]; TCP buffer size is 50MB.
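The 50MB TCP buffer mentioned on this slide would typically be applied per socket; a minimal C sketch under that assumption follows. Note that the kernel must also permit buffers this large (e.g., net.core.rmem_max / wmem_max on Linux), which is a system-tuning detail outside this sketch.

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    int buf  = 50 * 1024 * 1024;                 /* 50 MB socket buffers */
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &buf, sizeof buf) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &buf, sizeof buf) < 0)
        perror("SO_RCVBUF");
    close(sock);
    return 0;
}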
MemzNet's Performance
[Figure: performance of GridFTP vs. MemzNet in the SC11 demo and on the ANI Testbed; TCP buffer size is set to 50MB]
Acknowledgements
Peter Nugent, Zarija Lukic, Patrick Dorn, Evangelos Chaniotakis, John Christman, Chin Guok, Chris Tracy, Lauren Rotman, Jason Lee, Shane Canon, Tina Declerck, Cary Whitney, Ed Holohan, Adam Scovel, Linda Winkler, Jason Hill, Doug Fuller, Susan Hicks, Hank Childs, Mark Howison, Aaron Thomas, John Dugan, Gopal Vaswani