HPDC 2012 presentation - June 19, 2012 - Delft, The Netherlands


1. Experience with 100Gbps Network Applications
   Mehmet Balman, Eric Pouyoul, Yushu Yao, E. Wes Bethel, Burlen Loring, Prabhat, John Shalf, Alex Sim, and Brian L. Tierney
   DIDC – Delft, the Netherlands, June 19, 2012

2. Experience with 100Gbps Network Applications
   Mehmet Balman
   Computational Research Division, Lawrence Berkeley National Laboratory
   DIDC – Delft, the Netherlands, June 19, 2012

3. Outline
   •  A recent 100Gbps demo by ESnet and Internet2 at SC11
   •  Two applications:
      •  Visualization of remotely located data (cosmology)
      •  Data movement of large datasets with many files (climate analysis)
   Our experience in application design issues and host tuning strategies to scale to 100Gbps rates

4. The Need for 100Gbps Networks
   Modern science is data driven and collaborative in nature.
   •  The largest collaborations are most likely to depend on distributed architectures.
   •  LHC (distributed architecture): data generation, distribution, and analysis.
   •  The volume of data produced by genomic sequencers is rising exponentially.
   •  In climate science, researchers must analyze observational and simulation data located at facilities around the world.

5. ESG (Earth System Grid)
   •  Over 2,700 sites
   •  25,000 users
   •  IPCC Fifth Assessment Report (AR5): 2PB
   •  IPCC Fourth Assessment Report (AR4): 35TB

6. 100Gbps Networks Arrived
   •  Increasing network bandwidth is an important step toward tackling ever-growing scientific datasets.
   •  1Gbps to 10Gbps transition (10 years ago)
      •  Applications did not run 10 times faster just because there was more bandwidth available
   In order to take advantage of the higher network capacity, we need to pay close attention to application design and host tuning issues.

7. Applications' Perspective
   •  Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.
   •  Real-time streaming and visualization of cosmology data
      •  How does high network capability enable remotely located scientists to gain insights from large data volumes?
   •  Data distribution for climate science
      •  How can scientific data movement and analysis between geographically disparate supercomputing facilities benefit from high-bandwidth networks?

8. The SC11 100Gbps Demo
   •  100Gbps connection between ANL (Argonne), NERSC at LBNL, ORNL (Oak Ridge), and the SC11 booth (Seattle)

9. Demo Configuration
   RTT:  Seattle – NERSC   16ms
         NERSC – ANL       50ms
         NERSC – ORNL      64ms

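As an aside (not from the slides), these RTTs translate into sizable bandwidth-delay products, which is why the TCP buffer tuning discussed later matters; a minimal C sketch with illustrative path labels:

    /* Bandwidth-delay product for the demo paths: capacity (bits/s) * RTT (s),
     * i.e. how much data must be in flight to keep the path full. */
    #include <stdio.h>

    int main(void) {
        const char *path[] = {"Seattle-NERSC", "NERSC-ANL", "NERSC-ORNL"};
        double rtt_ms[]    = {16.0, 50.0, 64.0};
        double gbps        = 100.0;

        for (int i = 0; i < 3; i++) {
            double bdp_mbytes = gbps * 1e9 * (rtt_ms[i] / 1000.0) / 8.0 / 1e6;
            printf("%-14s BDP at 100Gbps: %.0f MB\n", path[i], bdp_mbytes);
        }
        return 0;
    }

(A single 10Gbps flow needs one tenth of these in-flight amounts.)
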
10. Visualizing the Universe at 100Gbps
    •  VisaPult for streaming data
    •  ParaView for rendering
    •  For visualization purposes, occasional packet loss is acceptable (using UDP)
    •  90Gbps of the bandwidth is used for the full dataset
       •  4x10Gbps NICs (4 hosts)
    •  10Gbps of the bandwidth is used for 1/8 of the same dataset
       •  10Gbps NIC (one host)

11. Demo Configuration
    [Diagram: 16 sender hosts at NERSC (flash-based GPFS cluster, InfiniBand cloud) feed the 100G pipe to the booth router in Seattle; receive/render hosts H1–H4 and a vis server drive the low- and high-bandwidth displays over 10GigE and 1GigE connections.]
    The 1Gbps connection is used for synchronization and communication of the rendering application, not for transfer of the raw data.

12. UDP Shuffling
    •  UDP packets include position (x, y, z) information (1024^3 matrix)
    •  MTU is 9000; the largest possible payload is 8972 bytes (MTU size minus IP and UDP headers)
    •  560 points in each packet, 8968 bytes
       •  3 integers (x, y, z) + a floating-point value per point
    UDP packet layout: batch#, n, then X1 Y1 Z1 D1, X2 Y2 Z2 D2, …, Xn Yn Zn Dn. In the final run n = 560 and the packet size is 8968 bytes.

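A minimal sketch of this packet layout; field names and exact integer widths are assumptions, chosen so that 8 header bytes plus 560 points of 16 bytes each give 8968 bytes:

    #include <stdint.h>
    #include <stdio.h>

    #define POINTS_PER_PACKET 560

    struct vis_point {
        int32_t x, y, z;   /* position in the 1024^3 grid */
        float   value;     /* data value at that position */
    };

    struct vis_packet {
        uint32_t batch;    /* batch (time step) number */
        uint32_t n;        /* number of points that follow (560 in the final run) */
        struct vis_point pts[POINTS_PER_PACKET];
    };

    int main(void) {
        /* 8 + 560*16 = 8968 bytes, fitting a 9000-byte MTU after the
         * 20-byte IP and 8-byte UDP headers */
        printf("payload size: %zu bytes\n", sizeof(struct vis_packet));
        return 0;
    }
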
13. Data Flow
    •  For each time step, the input data is split into 32 streams along the z-direction; each stream contains a contiguous slice of size 1024 * 1024 * 32.
    •  32 streams for the 90Gbps demo (high bandwidth)
    •  4 streams for the 10Gbps demo (low bandwidth, 1/8 of the data)
    [Diagram: flow of data — GPFS and flash feed a Stager and Shuffler (via /dev/shm) on the send server at NERSC; a Receiver and the Render SW (via /dev/shm) consume it on the receive server at the SC booth.]

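The z-direction split is straightforward to state in code; a sketch (names illustrative), assuming the 1024^3 volume divides evenly into contiguous slabs:

    #include <stdio.h>

    #define DIM      1024
    #define NSTREAMS 32          /* 4 for the 10Gbps, 1/8-data configuration */

    int main(void) {
        int slab = DIM / NSTREAMS;                  /* 32 z-planes per stream */
        for (int s = 0; s < NSTREAMS; s++) {
            int z0 = s * slab;                      /* stream s covers [z0, z0+slab) */
            long points = (long)DIM * DIM * slab;   /* 1024*1024*32 points per slice */
            printf("stream %2d: z in [%4d, %4d), %ld points\n", s, z0, z0 + slab, points);
        }
        return 0;
    }
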
14. Performance Optimization
    •  Each 10Gbps NIC in the system is bound to a specific core.
    •  Receiver processes are also bound to the same core.
    •  The renderer runs on the same NUMA node but a different core (accessing the same memory region).
    [Diagram: a NUMA node with its memory banks; Receiver 1 and Receiver 2 share the core serving the 10G port, while Render 1 and Render 2 run on another core of the same node.]

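A minimal sketch of the core pinning described above, using sched_setaffinity; the core number is illustrative, and in the demo it would match the core servicing the corresponding 10G NIC:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        /* pid 0 == calling process; a renderer would pass a different core
         * on the same NUMA node */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            exit(1);
        }
    }

    int main(void) {
        pin_to_core(2);   /* e.g., the core bound to this receiver's 10G NIC */
        /* ... receive loop would run here ... */
        return 0;
    }
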
15. Network Utilization
    •  2.3TB of data from NERSC to the SC11 booth in Seattle in ≈ 3.4 minutes
    •  For each timestep, corresponding to 16GB of data, it took ≈ 1.4 seconds to transfer and ≈ 2.5 seconds for rendering before the image was updated.
    •  Peak ≈ 99Gbps
    •  Average ≈ 85Gbps

16. Demo

17. Climate Data Distribution
    •  ESG data nodes
       •  Data replication in the ESG Federation
    •  Local copies
       •  Data files are copied into temporary storage at HPC centers for post-processing and further climate analysis.

18. Climate Data over 100Gbps
    •  Data volume in climate applications is increasing exponentially.
    •  An important challenge in managing ever-increasing data sizes in climate science is the large variance in file sizes.
       •  Climate simulation data consists of a mix of relatively small and large files with an irregular file size distribution in each dataset.
       •  Many small files

19. Keep the data channel full
    [Diagram: per-file request/response ladders — "request" / "send data" and "request a file" / "send file" — for RPC- and FTP-style transfers.]
    •  Concurrent transfers
    •  Parallel streams

20. Lots-of-small-files problem!
    File-centric tools?
    •  Not necessarily high-speed (same distance)
       •  Latency is still a problem
    [Diagram: a "request a dataset" / "send data" exchange over a 100Gbps pipe vs. a 10Gbps pipe.]

21. Block-based
    [Diagram: front-end threads on each side access memory blocks; the blocks move over the network channel between the two caches.]
    Memory caches are logically mapped between client and server.

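The slides do not give the block format, but the idea behind the block-based channel can be sketched as a self-describing block: each fixed-size block carries enough metadata (all field names below are assumptions) for the receiver to reassemble files of any size without per-file requests:

    #include <stdint.h>

    #define BLOCK_DATA_SIZE (4u * 1024u * 1024u)   /* 4MB blocks, as in the SC11 demo */

    struct block_header {
        uint64_t file_id;      /* which file this block belongs to (assumed field) */
        uint64_t file_offset;  /* offset of the data within that file (assumed)    */
        uint32_t length;       /* valid bytes in data[], may be < BLOCK_DATA_SIZE  */
        uint32_t flags;        /* e.g., last-block-of-file marker (assumed)        */
    };

    struct block {
        struct block_header hdr;
        unsigned char data[BLOCK_DATA_SIZE];
    };

Front-end threads fill such blocks from files (or drain them back to files), while back-end threads move whole blocks over the network, which is what makes the transfer insensitive to individual file sizes.
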
22. Moving climate files efficiently

23. Advantages
    •  Decoupling I/O and network operations
       •  front-end (I/O, processing)
       •  back-end (networking layer)
    •  Not limited by the characteristics of the file sizes
       On-the-fly tar approach, bundling and sending many files together
    •  Dynamic data channel management
       Can increase/decrease the parallelism level both in the network communication and in the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).

24. Demo Configuration
    Disk to memory / reading from GPFS (NERSC); max 120Gbps read performance

25. The SC11 100Gbps Demo
    •  CMIP3 data (35TB) from the GPFS filesystem at NERSC
       •  Block size 4MB
       •  Each block's data section was aligned according to the system page size.
       •  1GB cache both at the client and the server
    •  At NERSC, 8 front-end threads on each host for reading data files in parallel.
    •  At ANL/ORNL, 4 front-end threads for processing received data blocks.
    •  4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.

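A rough sketch of the buffer setup these numbers imply: page-aligned 4MB blocks forming a 1GB cache (256 blocks) on each side. Everything beyond "4MB, page-aligned, 1GB" is an assumption:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE (4UL * 1024 * 1024)          /* 4MB blocks         */
    #define CACHE_SIZE (1024UL * 1024 * 1024)       /* 1GB cache per side */
    #define NUM_BLOCKS (CACHE_SIZE / BLOCK_SIZE)    /* 256 blocks         */

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);
        void *blocks[NUM_BLOCKS];

        for (unsigned long i = 0; i < NUM_BLOCKS; i++) {
            /* each block's data section aligned to the system page size */
            if (posix_memalign(&blocks[i], (size_t)page, BLOCK_SIZE) != 0) {
                perror("posix_memalign");
                return 1;
            }
        }
        printf("cache: %lu blocks of %lu bytes (page size %ld)\n",
               NUM_BLOCKS, BLOCK_SIZE, page);
        return 0;
    }
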
26. 83Gbps throughput

27. MemzNet: memory-mapped zero-copy network channel
    [Diagram: front-end threads on each side access memory blocks; the blocks move over the network channel between the two caches.]
    Memory caches are logically mapped between client and server.

28. ANI 100Gbps testbed
    [Diagram: ANI Middleware Testbed topology — hosts at NERSC (nersc-diskpt-1/2/3, nersc-app) and ANL (anl-mempt-1/2/3, anl-app), each with multiple 10GE NICs (Myricom, Mellanox, Chelsio, HotLava), connected through ANI 100G routers over the ANI 100G network. ANI 100G routers and the 100G wave were available until summer 2012; testbed resources after that are subject to funding availability.]

29. ANI 100Gbps testbed
    [Same ANI Middleware Testbed diagram, with the SC11 100Gbps demo configuration indicated.]

30. Many TCP Streams
    ANI testbed 100Gbps (10x10G NICs, three hosts): throughput vs. the number of parallel streams [1, 2, 4, 8, 16, 32, 64 streams, 5-minute intervals]; TCP buffer size is 50MB.

31. ANI testbed 100Gbps (10x10G NICs, three hosts): interface traffic vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals]; TCP buffer size is 50MB.

32. Performance at SC11 demo
    [Graphs: GridFTP vs. MemzNet throughput.]
    TCP buffer size is set to 50MB.

33. Performance results in the 100Gbps ANI testbed

34. Host tuning
    With proper tuning, we achieved 98Gbps using only 3 sending hosts, 3 receiving hosts, 10 10GE NICs, and 10 TCP flows.

35. NIC/TCP Tuning
    •  We are using Myricom 10G NICs (100Gbps testbed)
       •  Download the latest driver/firmware from the vendor site (the driver version in RHEL/CentOS is fairly old)
       •  Enable MSI-X
       •  Increase txqueuelen:
          /sbin/ifconfig eth2 txqueuelen 10000
       •  Increase interrupt coalescence:
          /usr/sbin/ethtool -C eth2 rx-usecs 100
    •  TCP tuning:
       net.core.rmem_max = 67108864
       net.core.wmem_max = 67108864
       net.core.netdev_max_backlog = 250000

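The sysctl limits above only raise the ceiling; the application still has to request large socket buffers itself (e.g., the 50MB TCP buffers used in the testbed runs). A minimal sketch, not taken from the slides:

    #include <stdio.h>
    #include <sys/socket.h>

    int main(void) {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        int buf  = 50 * 1024 * 1024;   /* 50MB, matching the testbed runs */
        if (sock < 0) { perror("socket"); return 1; }

        /* requests are capped by the net.core.wmem_max / rmem_max limits above */
        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &buf, sizeof(buf)) < 0 ||
            setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &buf, sizeof(buf)) < 0)
            perror("setsockopt");

        socklen_t len = sizeof(buf);
        getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &buf, &len);
        printf("effective receive buffer: %d bytes\n", buf);   /* kernel may adjust */
        return 0;
    }
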
36. 100Gbps = It's full of frames!
    •  Problem:
       •  Interrupts are very expensive.
       •  Even with jumbo frames and driver optimization, there are still too many interrupts.
    •  Solution:
       •  Turn off Linux irqbalance (chkconfig irqbalance off)
       •  Use /proc/interrupts to get the list of interrupts
       •  Dedicate an entire processor core to each 10G interface
       •  Use /proc/irq/<irq-number>/smp_affinity to bind rx/tx queues to a specific core (see the sketch below)

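A sketch of the smp_affinity step referenced in the last bullet: write a hexadecimal CPU mask to /proc/irq/<irq>/smp_affinity so that the IRQ is serviced by one dedicated core. The IRQ number and core below are illustrative; the real ones come from /proc/interrupts for the 10G interface, and the write requires root:

    #include <stdio.h>

    static int bind_irq_to_core(int irq, int core) {
        char path[64];
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%x\n", 1u << core);   /* bitmask: bit N selects core N */
        fclose(f);
        return 0;
    }

    int main(void) {
        return bind_irq_to_core(77, 2) ? 1 : 0;   /* hypothetical IRQ 77 -> core 2 */
    }
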
37. Host Tuning Results
    [Bar chart: throughput (Gbps, 0–45) with and without tuning, for interrupt coalescing (TCP, UDP) and IRQ binding (TCP, UDP).]

38. Conclusion
    •  Host tuning & host performance
    •  Multiple NICs and multiple cores
    •  The effect of the application design
    •  TCP/UDP buffer tuning, using jumbo frames, and interrupt coalescing
    •  Multi-core systems: IRQ binding is now essential for maximizing host performance.

39. Acknowledgements
    Peter Nugent, Zarija Lukic, Patrick Dorn, Evangelos Chaniotakis, John Christman, Chin Guok, Chris Tracy, Lauren Rotman, Jason Lee, Shane Canon, Tina Declerck, Cary Whitney, Ed Holohan, Adam Scovel, Linda Winkler, Jason Hill, Doug Fuller, Susan Hicks, Hank Childs, Mark Howison, Aaron Thomas, John Dugan, Gopal Vaswani

40. Questions?
    Contact:
    •  Mehmet Balman  mbalman@lbl.gov
    •  Brian Tierney  bltierney@es.net
