1. Streaming Exa-scale Data over 100Gbps Networks
   Mehmet Balman
   Scientific Data Management Group, Computational Research Division
   Lawrence Berkeley National Laboratory
   CRD All Hands Meeting, July 15, 2012
2. Outline
   • A recent 100Gbps demo by ESnet and Internet2 at SC11
   • One of the applications:
     • Data movement of large datasets with many files (Scaling the Earth System Grid to 100Gbps Networks)
3. ESG (Earth System Grid)
   • Over 2,700 sites
   • 25,000 users
   • IPCC Fifth Assessment Report (AR5): 2PB
   • IPCC Fourth Assessment Report (AR4): 35TB
4. Applications' Perspective
   • Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.
   • Data distribution for climate science
     • How can scientific data movement and analysis between geographically disparate supercomputing facilities benefit from high-bandwidth networks?
5. Climate Data Distribution
   • ESG data nodes
   • Data replication in the ESG Federation
   • Local copies
     • Data files are copied into temporary storage in HPC centers for post-processing and further climate analysis.
6. Climate Data over 100Gbps
   • Data volume in climate applications is increasing exponentially.
   • An important challenge in managing ever-increasing data sizes in climate science is the large variance in file sizes.
   • Climate simulation data consists of a mix of relatively small and large files with an irregular file size distribution in each dataset.
   • Many small files
7. Keep the data channel full
   [Diagram: FTP's RPC-style exchange, repeated per file (request a file / send file), vs. a continuous request data / send data stream]
   • Concurrent transfers
   • Parallel streams (sketched below)
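A minimal sketch of the two ideas on the slide above, concurrent transfers and parallel streams, in Python. The receiver host, port, file paths, and chunk size are hypothetical placeholders, and a real receiver would also need the byte offsets to reassemble each file; this is not the tool used in the demo.

    # Sketch: keep the channel full with concurrent transfers (several files
    # in flight) and parallel streams (each file split over several sockets).
    # Host, port, and paths are hypothetical.
    import os
    import socket
    from concurrent.futures import ThreadPoolExecutor

    HOST, PORT = "receiver.example.org", 9000   # assumed receiver
    STREAMS_PER_FILE = 4                        # "parallel streams"
    CONCURRENT_FILES = 4                        # "concurrent transfers"
    CHUNK = 4 * 1024 * 1024                     # read in 4MB pieces

    def send_range(path, offset, length):
        # One byte range of one file over its own TCP connection.
        with socket.create_connection((HOST, PORT)) as sock, open(path, "rb") as f:
            f.seek(offset)
            remaining = length
            while remaining > 0:
                data = f.read(min(CHUNK, remaining))
                if not data:
                    break
                sock.sendall(data)
                remaining -= len(data)

    def send_file(path):
        # Split one file across STREAMS_PER_FILE parallel streams.
        size = os.path.getsize(path)
        part = (size + STREAMS_PER_FILE - 1) // STREAMS_PER_FILE
        with ThreadPoolExecutor(max_workers=STREAMS_PER_FILE) as pool:
            for i in range(STREAMS_PER_FILE):
                off = i * part
                pool.submit(send_range, path, off, max(0, min(part, size - off)))

    if __name__ == "__main__":
        files = ["/data/cmip3/file-%04d.nc" % i for i in range(16)]  # hypothetical
        with ThreadPoolExecutor(max_workers=CONCURRENT_FILES) as pool:
            list(pool.map(send_file, files))

Concurrency hides per-file latency and per-stream TCP limitations, but as the next slide notes, it does not by itself solve the lots-of-small-files problem.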
8. lots-of-small-files problem!
   File-centric tools?
   • Not necessarily high-speed (same distance)
     - Latency is still a problem (see the estimate below)
   [Diagram: "request a dataset / send data" over a 100Gbps pipe vs. a 10Gbps pipe]
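To see why latency still dominates for a file-centric tool, a back-of-the-envelope estimate with illustrative numbers (the RTT, file count, and file size below are assumptions, not measurements from the demo):

    # Rough cost of one request/response round trip per file vs. payload time.
    # All numbers are illustrative assumptions.
    rtt_s = 0.05                  # 50 ms round-trip time
    num_files = 1_000_000         # "lots of small files"
    avg_file_mb = 2               # small climate files
    link_gbps = 100

    payload_s = (num_files * avg_file_mb * 8) / (link_gbps * 1000)  # time at line rate
    request_overhead_s = num_files * rtt_s                          # one RTT per file, serialized

    print(f"payload time at line rate : {payload_s / 3600:6.2f} h")          # ~0.04 h
    print(f"per-file request overhead : {request_overhead_s / 3600:6.2f} h")  # ~13.9 h

Even with heavy concurrency hiding part of that control traffic, per-file round trips dwarf the payload time once files are small, which motivates the dataset-level request and the block/stream model on the following slides.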
9. Framework for the Memory-mapped Network Channel
   • Memory caches are logically mapped between client and server (a sketch of the block cache follows below).
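A minimal sketch, in Python, of the kind of memory cache the framework describes: a pool of fixed-size blocks backed by mmap, which front-end (I/O) threads fill and back-end (network) threads drain. The block and cache sizes follow the later demo slide (4MB, 1GB); the queue discipline and class names are assumptions for illustration, not MemzNet's implementation.

    # Sketch: an mmap-backed block cache that file readers fill and network
    # senders drain. Block/cache sizes follow the slides; the rest is assumed.
    import mmap
    import queue

    BLOCK_SIZE = 4 * 1024 * 1024          # 4MB blocks
    CACHE_BLOCKS = 256                     # 256 * 4MB = 1GB cache

    class BlockCache:
        def __init__(self):
            # One anonymous mmap region, carved into fixed-size blocks.
            self.region = mmap.mmap(-1, BLOCK_SIZE * CACHE_BLOCKS)
            self.free = queue.Queue()      # indices of empty blocks
            self.ready = queue.Queue()     # (index, nbytes) of filled blocks
            for i in range(CACHE_BLOCKS):
                self.free.put(i)

        def view(self, index):
            start = index * BLOCK_SIZE
            return memoryview(self.region)[start:start + BLOCK_SIZE]

        # Front-end side: acquire a free block and fill it from a file.
        def fill(self, fileobj):
            index = self.free.get()
            nbytes = fileobj.readinto(self.view(index))
            self.ready.put((index, nbytes))

        # Back-end side: take a filled block, hand its bytes to the network,
        # then recycle the block.
        def drain(self, sock):
            index, nbytes = self.ready.get()
            sock.sendall(self.view(index)[:nbytes])
            self.free.put(index)

Because the network side only ever sees blocks, file boundaries and file sizes stop mattering to the data channel, which is the point developed on the next slides.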
10. Moving climate files efficiently
11. The SC11 100Gbps demo environment
12. Advantages
   • Decoupling I/O and network operations
     • front-end (I/O processing)
     • back-end (networking layer)
   • Not limited by the characteristics of the file sizes
     • On-the-fly tar approach: bundling and sending many files together (sketched below)
   • Dynamic data channel management
     • Can increase/decrease the parallelism level both in the network communication and in the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).
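The "on-the-fly tar" bundling advantage can be illustrated with Python's tarfile writing straight into a socket, so many small files become one continuous stream and the data channel never idles between files. Host, port, and paths are hypothetical, and MemzNet itself streams memory blocks rather than a tar archive; this only shows the bundling idea.

    # Sketch: bundle many small files into one continuous stream ("on-the-fly
    # tar"), so file boundaries do not break up the data channel.
    import socket
    import tarfile

    def stream_dataset(paths, host="receiver.example.org", port=9001):
        with socket.create_connection((host, port)) as sock:
            # tarfile writes directly into the socket; mode "w|" means a
            # non-seekable output stream, which is exactly what a socket is.
            with sock.makefile("wb") as wire, tarfile.open(fileobj=wire, mode="w|") as tar:
                for path in paths:
                    tar.add(path)      # header + contents appended to the stream

    # Receiver side: unpack the stream as it arrives.
    def receive_dataset(conn, destdir="/tmp/incoming"):
        with conn.makefile("rb") as wire, tarfile.open(fileobj=wire, mode="r|") as tar:
            tar.extractall(destdir)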
13. The SC11 100Gbps Demo
   • CMIP3 data (35TB) from the GPFS filesystem at NERSC
   • Block size 4MB
   • Each block's data section was aligned according to the system pagesize (see the arithmetic below).
   • 1GB cache both at the client and the server
   • At NERSC, 8 front-end threads on each host for reading data files in parallel.
   • At ANL/ORNL, 4 front-end threads for processing received data blocks.
   • 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.
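A small sketch of the block-layout arithmetic behind those numbers: 4MB blocks are an exact multiple of a typical page, a 1GB cache holds 256 of them, and aligning the data section amounts to padding the per-block header out to a page boundary. The header size below is a made-up placeholder; only the 4MB block, 1GB cache, and page-alignment figures come from the slide.

    # Block-layout arithmetic for the demo configuration (header size assumed).
    import mmap

    PAGE = mmap.PAGESIZE                  # typically 4096 bytes
    BLOCK_SIZE = 4 * 1024 * 1024          # 4MB blocks
    CACHE_SIZE = 1 * 1024 * 1024 * 1024   # 1GB cache per side

    HEADER_SIZE = 64                      # hypothetical per-block metadata
    # Round the header up to the next page boundary so the data section
    # starts page-aligned (useful for page-granular I/O on the mapped cache).
    data_offset = (HEADER_SIZE + PAGE - 1) // PAGE * PAGE
    data_capacity = BLOCK_SIZE - data_offset

    print("blocks in cache      :", CACHE_SIZE // BLOCK_SIZE)   # 256
    print("data section offset  :", data_offset)                # 4096
    print("data bytes per block :", data_capacity)              # 4190208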
14. 83Gbps throughput
15. MemzNet: memory-mapped zero-copy network channel
   [Diagram: front-end threads on each side access memory blocks; the blocks move between the two caches over the network]
   • Memory caches are logically mapped between client and server (a framing sketch follows below).
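One way to read "logically mapped": each block travels with its index and length, so the receiver drops it into the same slot of its own mmap'd cache, and block payloads are passed as memoryview slices of the mapped region so no extra user-space copy is made. The 8-byte header below is an assumed wire format for illustration only, not MemzNet's protocol.

    # Sketch: per-block framing so sender and receiver caches stay logically
    # aligned slot-for-slot. Header format is an assumption.
    import struct

    BLOCK_SIZE = 4 * 1024 * 1024
    HEADER = struct.Struct("!II")          # (block index, payload length)

    def send_block(sock, index, payload):
        # 'payload' can be a memoryview slice of the mapped cache, so it is
        # handed to the socket without an intermediate user-space copy.
        sock.sendall(HEADER.pack(index, len(payload)))
        sock.sendall(payload)

    def recv_block(sock, region):
        # 'region' is the receiver's mmap'd cache; the block lands in the
        # same logical slot it occupied on the sender.
        index, nbytes = HEADER.unpack(_recv_exact(sock, HEADER.size))
        slot = memoryview(region)[index * BLOCK_SIZE:index * BLOCK_SIZE + nbytes]
        _recv_into(sock, slot)
        return index, nbytes

    def _recv_exact(sock, n):
        buf = bytearray()
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed connection")
            buf += chunk
        return bytes(buf)

    def _recv_into(sock, view):
        got = 0
        while got < len(view):
            r = sock.recv_into(view[got:])
            if r == 0:
                raise ConnectionError("peer closed connection")
            got += r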
16. ANI 100Gbps testbed
   [Diagram: ANI Middleware Testbed (updated December 11, 2011), NERSC and ANL joined over the ANI 100G network. NERSC hosts nersc-diskpt-1/2/3 and nersc-app, and ANL hosts anl-mempt-1/2/3 and anl-app, each attach to an ANI 100G router via 4x10GE (MM) links, with 1GE management paths through the nersc-C2940/anl-C2940 switches and nersc-asw1/anl-asw1 to ESnet. Host NICs mix 2x10G Myricom, 2x10G Chelsio, 2x10G Mellanox, and 4x/6x10G HotLava cards. Note: ANI 100G routers and the 100G wave were available until summer 2012; testbed resources after that are subject to funding availability. The slide also labels the SC11 100Gbps demo setup.]
17. Many TCP Streams
   Figure: (a) total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers.
   Three hosts, each with 4 available NICs, and a total of 10 10Gbps NIC pairs were used to saturate the 100Gbps pipe in the ANI testbed. 10 data movement jobs, each corresponding to a NIC pair, were started simultaneously at source and destination. Each peak represents a different test: 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., when the concurrency level is 4, there are 40 streams in total). A stand-in load generator for this sweep is sketched below.
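The concurrency sweep in that caption can be approximated with a simple memory-to-memory load generator: for each level, N sender threads push an in-memory buffer over their own TCP connection for a fixed interval and the aggregate rate is reported. The sink host and port are hypothetical, and the slides do not name the tool actually used on the testbed; this is only a stand-in.

    # Sketch: memory-to-memory concurrency sweep (1..64 streams, 5-minute runs).
    # The receiver just needs to read and discard; host/port are hypothetical.
    import socket
    import threading
    import time

    HOST, PORT = "anl-mempt-1.example.org", 5001   # assumed sink
    BUF = bytearray(4 * 1024 * 1024)               # 4MB in-memory payload
    INTERVAL_S = 300                               # 5-minute tests, as in the caption

    def sender(stop, counts):
        with socket.create_connection((HOST, PORT)) as sock:
            sent = 0
            while not stop.is_set():
                sock.sendall(BUF)
                sent += len(BUF)
            counts.append(sent)

    for level in (1, 2, 4, 8, 16, 32, 64):
        stop, counts, threads = threading.Event(), [], []
        for _ in range(level):
            t = threading.Thread(target=sender, args=(stop, counts))
            t.start()
            threads.append(t)
        time.sleep(INTERVAL_S)
        stop.set()
        for t in threads:
            t.join()
        gbps = sum(counts) * 8 / INTERVAL_S / 1e9
        print(f"{level:2d} streams: {gbps:6.2f} Gbps")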
18. Effects of many streams
   Figure: ANI testbed at 100Gbps (10x10G NICs, three hosts): interrupts per CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals]; TCP buffer size is 50M (see the tuning sketch below).
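The 50M buffer is the per-socket TCP window needed for a high bandwidth-delay-product path. A minimal sketch of requesting it from Python, assuming the Linux limits net.core.rmem_max and net.core.wmem_max have already been raised to allow it:

    # Sketch: request a 50MB per-socket TCP buffer (as in the tests above).
    # The kernel caps this at net.core.rmem_max / net.core.wmem_max on Linux,
    # so those sysctls must be raised first.
    import socket

    TCP_BUF = 50 * 1024 * 1024

    def make_tuned_socket():
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, TCP_BUF)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, TCP_BUF)
        # Linux reports roughly double the requested value here because it
        # accounts for internal bookkeeping overhead.
        print("effective sndbuf:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
        return sock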
19. MemzNet's Performance
   • TCP buffer size is set to 50MB.
   [Chart: MemzNet vs. GridFTP performance, SC11 demo and ANI Testbed]
20. MemzNet's Architecture for data streaming
21. Experience with 100Gbps Network Applications
   Mehmet Balman, Eric Pouyoul, Yushu Yao, E. Wes Bethel, Burlen Loring, Prabhat, John Shalf, Alex Sim, and Brian L. Tierney
   DIDC – Delft, the Netherlands, June 19, 2012
22. Acknowledgements
   Peter Nugent, Zarija Lukic, Patrick Dorn, Evangelos Chaniotakis, John Christman, Chin Guok, Chris Tracy, Lauren Rotman, Jason Lee, Shane Canon, Tina Declerck, Cary Whitney, Ed Holohan, Adam Scovel, Linda Winkler, Jason Hill, Doug Fuller, Susan Hicks, Hank Childs, Mark Howison, Aaron Thomas, John Dugan, Gopal Vaswani
23. The 2nd International Workshop on Network-aware Data Management
   To be held in conjunction with the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC'12)
   http://sdm.lbl.gov/ndm/2012
   Nov 11th, 2012
   Papers due by the end of August
   Last year's program: http://sdm.lbl.gov/ndm/2011
