1. Experiences with 100Gbps Network Applications
Analyzing Data Movements and Identifying Techniques for Next-generation High-bandwidth Networks
Mehmet Balman
Computational Research Division
Lawrence Berkeley National Laboratory
2. Outline
• Climate data as a typical science scenario
  • 100Gbps Climate100 demo
• MemzNet: Memory-mapped zero-copy Network Channel
• Advanced Networking Initiative (ANI) Testbed
  • 100Gbps experiments
• Future research plans
3. 100Gbps networking has finally arrived!
Applications' perspective: increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.
• 1Gbps to 10Gbps transition (10 years ago): applications did not run 10 times faster just because more bandwidth was available.
4. The need for 100Gbps
Modern science is data-driven and collaborative in nature.
• The largest collaborations are most likely to depend on distributed architectures.
• LHC (distributed architecture): data generation, distribution, and analysis.
• The volume of data produced by genomic sequencers is rising exponentially.
• In climate science, researchers must analyze observational and simulation data located at facilities around the world.
5. ANI 100Gbps Demo (late 2011 – early 2012)
• 100Gbps demo by ESnet and Internet2
• Application design issues and host tuning strategies to scale to 100Gbps rates
• Visualization of remotely located data (cosmology)
• Data movement of large datasets with many files (climate analysis)
6. ESG (Earth Systems Grid)
• Over 2,700 sites
• 25,000 users
• IPCC Fifth Assessment Report (AR5): 2PB
• IPCC Fourth Assessment Report (AR4): 35TB
7. Data distribution for climate science
How can scientific data movement and analysis between geographically disparate supercomputing facilities benefit from high-bandwidth networks?
• Local copies: data files are copied into temporary storage at HPC centers for post-processing and further climate analysis.
8. Climate Data Analysis
Petabytes of data
• Data is usually in a remote location.
• How can we access it efficiently?
9. Tuning Network Transfers
10. Adaptive Tuning
Worked well with large files!
11. Climate Data over 100Gbps
• Data volume in climate applications is increasing exponentially.
• An important challenge in managing ever-increasing data sizes in climate science is the large variance in file sizes.
  • Climate simulation data consists of a mix of relatively small and large files with irregular file size distribution in each dataset.
  • Many small files
12. The lots-of-small-files problem!
Are file-centric tools enough?
[Diagram: per-file request/response exchanges (RPC, FTP); each file requires its own "request a file" / "send file" round trip]
• Concurrent transfers
• Parallel streams
13. Moving climate files efficiently
14. MemzNet: memory-mapped zero-copy network channel
[Diagram: front-end threads on each side access memory blocks; network threads move blocks between the two memory caches]
Memory caches are logically mapped between client and server.
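A minimal C sketch of the structure the diagram describes: a fixed pool of memory blocks shared by front-end (I/O) threads and back-end (network) threads. All names and fields here (memz_block, block_pool, the state flags) are assumptions made for this illustration, not identifiers from the MemzNet implementation.

    /* Illustrative only: a fixed pool of memory blocks shared by front-end
     * (I/O) threads and back-end (network) threads. */
    #include <pthread.h>
    #include <stddef.h>

    #define BLOCK_SIZE  (4 * 1024 * 1024)   /* 4MB data section per block */
    #define POOL_BLOCKS 256                 /* 256 x 4MB = 1GB block cache */

    enum block_state { BLOCK_FREE, BLOCK_FILLED, BLOCK_IN_FLIGHT };

    struct memz_block {
        enum block_state state;    /* which side may touch the block now */
        size_t           length;   /* valid bytes in data[]              */
        char             data[BLOCK_SIZE];
    };

    struct block_pool {
        pthread_mutex_t   lock;       /* protects state transitions        */
        pthread_cond_t    not_empty;  /* back-end waits for FILLED blocks  */
        pthread_cond_t    not_full;   /* front-end waits for FREE blocks   */
        struct memz_block block[POOL_BLOCKS];
    };

    /* Front-end threads read file data into FREE blocks and mark them
     * FILLED; back-end threads stream FILLED blocks over the network and
     * mark them FREE again. The same pool layout exists on the remote side,
     * which is what "logically mapped memory caches" refers to. */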
15. Advantages
• Decoupling I/O and network operations
  • front-end (I/O processing)
  • back-end (networking layer)
• Not limited by the characteristics of the file sizes
• On-the-fly tar approach: bundling and sending many files together
• Dynamic data channel management: can increase/decrease the parallelism level both in the network communication and in the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).
16. Advantages
Synchronization of the memory cache is accomplished based on the tag header.
• Application processes interact with the memory blocks.
• Enables out-of-order and asynchronous send/receive.
MemzNet is not file-centric; bookkeeping information is embedded inside each block.
• Can increase/decrease the number of parallel streams without closing and reopening the data channel.
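One way to picture the tag header mentioned above is as a small fixed-size record in front of every block's payload. The field names below are hypothetical, chosen only to show why out-of-order, multi-stream delivery works; the real MemzNet header layout may differ.

    /* Hypothetical per-block tag header (illustration only). */
    #include <stdint.h>

    struct memz_tag {
        uint64_t file_id;      /* which file in the aggregated dataset   */
        uint64_t file_offset;  /* where this payload belongs in the file */
        uint32_t payload_len;  /* valid bytes following the header       */
        uint32_t flags;        /* e.g. a last-block-of-file marker       */
    };

    /* Because every block carries its own bookkeeping, a block may arrive
     * over any TCP stream and in any order: the receiver simply writes
     * payload_len bytes at (file_id, file_offset). Streams can therefore
     * be added or removed without disturbing the transfer. */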
17. 100Gbps demo environment
RTT: Seattle – NERSC 16ms; NERSC – ANL 50ms; NERSC – ORNL 64ms
18. 100Gbps Demo
• CMIP3 data (35TB) from the GPFS filesystem at NERSC
  • Block size: 4MB
  • Each block's data section was aligned to the system page size.
  • 1GB cache at both the client and the server
• At NERSC, 8 front-end threads on each host read data files in parallel.
• At ANL/ORNL, 4 front-end threads processed received data blocks.
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.
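A short sketch of how a 1GB cache of page-aligned 4MB blocks could be allocated, matching the demo parameters listed above; the exact allocation code in MemzNet may differ.

    /* Allocate a 1GB cache of 4MB blocks, each aligned to the system page
     * size, as in the demo configuration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE (4UL * 1024 * 1024)          /* 4MB per block   */
    #define CACHE_SIZE (1UL * 1024 * 1024 * 1024)   /* 1GB block cache */

    int main(void)
    {
        long   page    = sysconf(_SC_PAGESIZE);      /* system page size */
        size_t nblocks = CACHE_SIZE / BLOCK_SIZE;    /* 256 blocks       */
        void **blocks  = calloc(nblocks, sizeof(void *));

        if (blocks == NULL)
            return 1;
        for (size_t i = 0; i < nblocks; i++) {
            /* page-aligned data section for each block */
            if (posix_memalign(&blocks[i], (size_t)page, BLOCK_SIZE) != 0) {
                perror("posix_memalign");
                return 1;
            }
        }
        printf("allocated %zu blocks of %lu bytes (page size %ld)\n",
               nblocks, (unsigned long)BLOCK_SIZE, page);
        return 0;
    }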
19. 83Gbps throughput
20. Framework for the Memory-mapped Network Channel
Memory caches are logically mapped between client and server.
21. MemzNet's Features
• Data files are aggregated and divided into simple blocks. Blocks are tagged and streamed over the network. Each data block's tag includes information about the content inside.
• Decouples disk and network I/O operations, so read/write threads can work independently.
• Implements a memory cache management system that is accessed in blocks. These memory blocks are logically mapped to the memory cache that resides at the remote site.
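The aggregation step can be sketched as a front-end loop that carves a file into tagged blocks. The helpers pool_get_free() and pool_mark_filled() are placeholders standing in for MemzNet's actual cache interface, which is not shown in the slides; the tag fields are the same hypothetical ones used in the earlier sketch.

    /* Sketch of a front-end thread carving one file into tagged blocks.
     * pool_get_free() and pool_mark_filled() are hypothetical helpers. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    struct memz_tag {
        uint64_t file_id, file_offset;
        uint32_t payload_len, flags;
    };

    extern char *pool_get_free(void);            /* waits until a block is free     */
    extern void  pool_mark_filled(char *block);  /* hands the block to the back-end */

    void read_file_into_blocks(const char *path, uint64_t file_id, size_t block_size)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return;

        uint64_t off = 0;
        for (;;) {
            char *block = pool_get_free();
            struct memz_tag *tag = (struct memz_tag *)block;
            ssize_t n = read(fd, block + sizeof(*tag), block_size - sizeof(*tag));
            if (n <= 0)
                break;                            /* EOF or read error */
            tag->file_id     = file_id;
            tag->file_offset = off;
            tag->payload_len = (uint32_t)n;
            tag->flags       = 0;
            off += (uint64_t)n;
            pool_mark_filled(block);              /* back-end threads stream it out */
        }
        close(fd);
    }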
22. Performance at the SC11 demo
[Plots: GridFTP vs. MemzNet throughput; TCP buffer size set to 50MB]
23. Advanced Networking Initiative
ESnet's 100Gbps Testbed
24. Performance results in the 100Gbps ANI testbed
25. MemzNet's Performance
[Plots: GridFTP vs. MemzNet in the 100Gbps demo and in the ANI Testbed; TCP buffer size set to 50MB]
26. Challenge?
• High bandwidth brings new challenges!
  • We need a substantial amount of processing power and the involvement of multiple cores to fill a 40Gbps or 100Gbps network.
• Fine-tuning is needed, in both the network and application layers, to take advantage of the higher network capacity.
• Incremental improvement in current tools?
  • We cannot expect every application to be tuned and improved every time we change the link technology or speed.
27. MemzNet
• MemzNet: Memory-mapped Network Channel
  • High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer.
• The main goal is to define a network channel so applications can use it directly, without the burden of managing/tuning the network communication.
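To make the "new layer" idea concrete, here is one possible shape such a channel interface could take from an application's point of view. These declarations are invented for this sketch and are not MemzNet's actual API.

    /* A purely illustrative "network channel" interface: the application
     * hands over tagged blocks and never touches sockets or tuning knobs. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct memz_channel memz_channel;   /* opaque channel handle */

    /* Open a channel to a remote peer; the layer decides internally how
     * many streams to use and how to tune them. */
    memz_channel *channel_open(const char *host, uint16_t port);

    /* Hand a tagged block to the channel; it may be sent over any stream,
     * in any order, since the tag carries the bookkeeping. */
    int channel_put_block(memz_channel *ch, const void *block, size_t len);

    /* Grow or shrink the stream count without closing the channel. */
    int channel_set_parallelism(memz_channel *ch, int nstreams);

    void channel_close(memz_channel *ch);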
28. Framework for the Memory-mapped Network Channel
Memory caches are logically mapped between client and server.
29. Advanced Networking Initiative
ESnet's 100Gbps Testbed
30. Initial Tests
• Many TCP sockets oversubscribe the network and cause performance degradation.
• Host system performance could easily be the bottleneck.
  • TCP/UDP buffer tuning, jumbo frames, and interrupt coalescing.
  • Multi-core systems: IRQ binding is now essential for maximizing host performance.
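As a concrete example of the per-socket side of buffer tuning, the sketch below requests 50MB send and receive buffers on a TCP socket (the buffer size used throughout these tests). The kernel clamps such requests to the net.core.rmem_max / wmem_max limits shown later on the NIC/TCP tuning slide; this is a minimal illustration, not a tuning recipe.

    /* Request 50MB socket buffers on a TCP socket; the kernel clamps the
     * values to net.core.rmem_max / net.core.wmem_max. */
    #include <stdio.h>
    #include <sys/socket.h>

    static int tune_socket_buffers(int sock, int bytes)
    {
        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0 ||
            setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0) {
            perror("setsockopt");
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0) {
            perror("socket");
            return 1;
        }
        return tune_socket_buffers(sock, 50 * 1024 * 1024) == 0 ? 0 : 1;
    }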
31. Many Concurrent Streams
[Figure: (a) total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, and a total of 10 10Gbps NIC pairs were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, were started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., at concurrency level 4 there are 40 streams in total).]
32. Effects of many concurrent streams
[Figure: ANI testbed 100Gbps (10x10G NICs, three hosts): interrupts/CPU vs. the number of concurrent transfers (1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals), TCP buffer size 50MB]
33. Parallel Streams – 10Gbps
[Figure: ANI testbed 10Gbps connection: interface traffic vs. the number of parallel streams (1, 2, 4, 8, 16, 32, 64 streams, 5-minute intervals), TCP buffer size set to 50MB]
34. Parallel Streams – 40Gbps
[Figure: ANI testbed 40Gbps (4x10G NICs, single host): interface traffic vs. the number of parallel streams (1, 2, 4, 8, 16, 32, 64 streams, 5-minute intervals), TCP buffer size set to 50MB]
35. Parallel Streams – 40Gbps
[Figure: ANI testbed 40Gbps (4x10G NICs, single host): throughput vs. the number of parallel streams (1, 2, 4, 8, 16, 32, 64 streams, 5-minute intervals), TCP buffer size set to 50MB]
36. Host tuning
With proper tuning, we achieved 98Gbps using only 3 sending hosts, 3 receiving hosts, 10 10GE NICs, and 10 TCP flows.
Image source: "Experiences with 100Gbps Network Applications," in Proceedings of the Fifth International Workshop on Data-Intensive Distributed Computing, 2012
37. NIC/TCP Tuning
• We are using Myricom 10G NICs (100Gbps testbed)
  • Download the latest driver/firmware from the vendor site
    • The version of the driver in RHEL/CentOS is fairly old
  • Enable MSI-X
  • Increase txqueuelen:
    /sbin/ifconfig eth2 txqueuelen 10000
  • Increase interrupt coalescing:
    /usr/sbin/ethtool -C eth2 rx-usecs 100
• TCP tuning:
    net.core.rmem_max = 67108864
    net.core.wmem_max = 67108864
    net.core.netdev_max_backlog = 250000
From "Experiences with 100Gbps Network Applications," in Proceedings of the Fifth International Workshop on Data-Intensive Distributed Computing, 2012
38. 100Gbps = It's full of frames!
• Problem:
  • Interrupts are very expensive
  • Even with jumbo frames and driver optimization, there are still too many interrupts.
• Solution:
  • Turn off Linux irqbalance (chkconfig irqbalance off)
  • Use /proc/interrupts to get the list of interrupts
  • Dedicate an entire processor core to each 10G interface
  • Use /proc/irq/<irq-number>/smp_affinity to bind rx/tx queues to a specific core.
From "Experiences with 100Gbps Network Applications," in Proceedings of the Fifth International Workshop on Data-Intensive Distributed Computing, 2012
39. Host Tuning Results
[Bar chart: throughput (Gbps) with and without tuning, for interrupt coalescing (TCP/UDP) and IRQ binding (TCP/UDP)]
Image source: "Experiences with 100Gbps Network Applications," in Proceedings of the Fifth International Workshop on Data-Intensive Distributed Computing, 2012
40. Initial Tests
• Many TCP sockets oversubscribe the network and cause performance degradation.
• Host system performance could easily be the bottleneck.
  • TCP/UDP buffer tuning, jumbo frames, and interrupt coalescing.
  • Multi-core systems: IRQ binding is now essential for maximizing host performance.
41. End-to-end Data Movement
• Tuning parameters
• Host performance
• Multiple streams on the host systems
• Multiple NICs and multiple cores
• Effect of the application design
42. MemzNet's Architecture for data streaming
43. MemzNet
• MemzNet: Memory-mapped Network Channel
  • High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer.
• The main goal is to define a network channel so applications can use it directly, without the burden of managing/tuning the network communication.
44. Framework for the Memory-mapped Network Channel
Memory caches are logically mapped between client and server.
45. Related Work
• Luigi Rizzo's netmap
  • proposes a new API to send/receive data over the network
  • http://info.iet.unipi.it/~luigi/netmap/
• RDMA programming model
  • MemzNet can be used for RDMA-enabled transfers
• Memory-management middleware
• GridFTP popen + reordering?
46. Future Work
• Integrate MemzNet with scientific data formats for remote data analysis
• Provide a user library for filtering and subsetting of data
47. Testbed Results
• http://www.es.net/RandD/100g-testbed/
  • Proposal process
  • Testbed description, etc.
• http://www.es.net/RandD/100g-testbed/results/
  • RoCE (RDMA over Converged Ethernet) tests
  • RDMA implementation (RFPT100)
  • 100Gbps demo paper
  • Efficient data transfer protocols
  • MemzNet (Memory-mapped Zero-copy Network Channel)
48. Acknowledgements
Collaborators: Eric Pouyoul, Yushu Yao, E. Wes Bethel, Burlen Loring, Prabhat, John Shalf, Alex Sim, Arie Shoshani, and Brian L. Tierney
Special thanks: Peter Nugent, Zarija Lukic, Patrick Dorn, Evangelos Chaniotakis, John Christman, Chin Guok, Chris Tracy, Lauren Rotman, Jason Lee, Shane Canon, Tina Declerck, Cary Whitney, Ed Holohan, Adam Scovel, Linda Winkler, Jason Hill, Doug Fuller, Susan Hicks, Hank Childs, Mark Howison, Aaron Thomas, John Dugan, Gopal Vaswani
49. NDM 2013 Workshop (tentative)
The 3rd International Workshop on Network-aware Data Management, to be held in conjunction with the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC'13)
http://sdm.lbl.gov/ndm
Editing a book (CRC Press): NDM research frontiers (tentative)
