Streaming Exa-scale Data over 100Gbps Networks

Mehmet Balman

Scientific Data Management Group
Computational Research Division
Lawrence Berkeley National Laboratory

CRD All Hands Meeting
July 15, 2012
  
Outline

•  A recent 100Gbps demo by ESnet and Internet2 at SC11

•  One of the applications:
•  Data movement of large datasets with many files (Scaling the Earth System Grid to 100Gbps Networks)
  
ESG (Earth System Grid)

•  Over 2,700 sites
•  25,000 users
•  IPCC Fifth Assessment Report (AR5): 2PB
•  IPCC Fourth Assessment Report (AR4): 35TB
  
Applications’ Perspective

•  Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications’ perspective.

•  Data distribution for climate science
•  How can scientific data movement and analysis between geographically disparate supercomputing facilities benefit from high-bandwidth networks?
  
Climate Data Distribution

•  ESG data nodes
•  Data replication in the ESG Federation
•  Local copies
•  Data files are copied into temporary storage in HPC centers for post-processing and further climate analysis.
  	
  
Climate Data over 100Gbps

•  Data volume in climate applications is increasing exponentially.
•  An important challenge in managing ever-increasing data sizes in climate science is the large variance in file sizes.
•  Climate simulation data consists of a mix of relatively small and large files with an irregular file-size distribution in each dataset.
•  Many small files
  
Keep the data channel full

[Diagram: file-transfer interaction patterns — FTP repeats “request a file” / “send file” for each file; RPC issues “request data” / “send data”.]

•  Concurrent transfers
•  Parallel streams (illustrated in the sketch below)
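To make the "keep the channel full" idea concrete, here is a minimal Python sketch (not the demo software) that runs several transfers concurrently, each over its own TCP stream. The endpoint, file names, and concurrency level are hypothetical placeholders.

```python
# Minimal sketch (not the demo software): keep the channel busy by running
# several transfers concurrently, each over its own TCP stream.
# HOST, PORT, FILES, and CONCURRENCY are hypothetical placeholder values.
import socket
from concurrent.futures import ThreadPoolExecutor

HOST, PORT = "receiver.example.org", 5001     # hypothetical endpoint
FILES = ["f1.nc", "f2.nc", "f3.nc", "f4.nc"]  # hypothetical dataset files
CONCURRENCY = 4                               # number of parallel streams

def send_file(path: str) -> int:
    """Open a dedicated TCP stream and send one file in 4MB chunks."""
    sent = 0
    with socket.create_connection((HOST, PORT)) as sock, open(path, "rb") as f:
        while chunk := f.read(4 * 1024 * 1024):
            sock.sendall(chunk)
            sent += len(chunk)
    return sent

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    total = sum(pool.map(send_file, FILES))
print(f"sent {total} bytes over {CONCURRENCY} concurrent streams")
```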
  
lots-of-small-files problem!

file-centric tools?
•  Not necessarily high-speed (same distance)
-  Latency is still a problem (see the back-of-the-envelope sketch below)

[Diagram: a 100Gbps pipe vs. a 10Gbps pipe — “request a dataset” / “send data”.]
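A back-of-the-envelope sketch, with assumed numbers, of why file-centric request/response cycles dominate when a dataset contains many small files, no matter how fast the pipe is:

```python
# Back-of-the-envelope sketch with hypothetical numbers: per-file round trips
# dominate transfer time for many small files, even on a 100Gbps link.
rtt_s = 0.05                  # assumed 50 ms round-trip time between sites
files = 100_000               # assumed number of files in the dataset
avg_size_bytes = 10 * 2**20   # assumed average file size: 10 MB
link_bps = 100e9              # 100 Gbps

serial_handshake_s = files * rtt_s                   # one request/response per file
wire_time_s = files * avg_size_bytes * 8 / link_bps  # ideal time to push the bytes
print(f"handshake overhead: {serial_handshake_s/3600:.1f} h, "
      f"wire time: {wire_time_s/60:.1f} min")
# ~1.4 h of serialized request latency vs. ~1.4 min of actual data on the wire.
```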
Framework for the Memory-mapped Network Channel

Memory caches are logically mapped between client and server.
  	
  
Moving climate files efficiently
  
The SC11 100Gbps demo environment
  
Advantages

•  Decoupling I/O and network operations
•  front-end (I/O processing)
•  back-end (networking layer)

•  Not limited by the characteristics of the file sizes
   On-the-fly tar approach: bundling and sending many files together (see the sketch after this list)

•  Dynamic data channel management
   Can increase/decrease the parallelism level both in the network communication and in the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).
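The on-the-fly bundling idea can be sketched as follows (an illustration of the concept, not MemzNet's actual code): stream a tar archive of many small files directly into the data channel, so no per-file negotiation is needed. The endpoint and directory name are hypothetical.

```python
# Sketch of on-the-fly bundling (illustrative, not the MemzNet code): tar many
# small files directly into the data channel instead of negotiating each file
# separately. HOST, PORT, and DATASET_DIR are hypothetical.
import socket
import tarfile
from pathlib import Path

HOST, PORT = "receiver.example.org", 5001  # hypothetical endpoint
DATASET_DIR = Path("cmip3_subset")         # hypothetical directory of climate files

with socket.create_connection((HOST, PORT)) as sock:
    wire = sock.makefile("wb")
    # "w|" = streaming write mode: archive blocks are emitted as they are
    # produced, so no intermediate archive file is created on disk.
    with tarfile.open(fileobj=wire, mode="w|") as bundle:
        for path in sorted(DATASET_DIR.rglob("*")):
            if path.is_file():
                bundle.add(path, arcname=str(path.relative_to(DATASET_DIR)))
    wire.close()  # flush any buffered bytes before the socket closes
```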
  	
  
The SC11 100Gbps Demo

•  CMIP3 data (35TB) from the GPFS filesystem at NERSC
•  Block size 4MB
•  Each block’s data section was aligned according to the system pagesize (see the allocation sketch below).
•  1GB cache both at the client and the server

•  At NERSC, 8 front-end threads on each host for reading data files in parallel.
•  At ANL/ORNL, 4 front-end threads for processing received data blocks.
•  4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.

83Gbps throughput
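As a sketch of the block layout described above (the 4MB block size and 1GB cache follow the demo configuration; the pool structure itself is illustrative), anonymous mmap regions are page-aligned by construction:

```python
# Sketch: allocate a pool of 4MB transfer blocks whose data sections start on
# a page boundary. Anonymous mmap regions are page-aligned by construction.
# The 1GB cache and 4MB block size follow the demo configuration; the pool
# structure is illustrative.
import mmap

PAGE = mmap.PAGESIZE              # typically 4096 bytes
BLOCK_SIZE = 4 * 1024 * 1024      # 4MB blocks, as in the SC11 demo
CACHE_SIZE = 1 * 1024**3          # 1GB cache per side

assert BLOCK_SIZE % PAGE == 0     # block boundaries coincide with page boundaries
blocks = [mmap.mmap(-1, BLOCK_SIZE) for _ in range(CACHE_SIZE // BLOCK_SIZE)]
print(f"{len(blocks)} page-aligned blocks of {BLOCK_SIZE // 2**20} MB each")
```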
  
MemzNet: memory-mapped zero-copy network channel

[Diagram: memory blocks on the client and server sides are connected over the network; front-end threads on each side access the memory blocks.]

Memory caches are logically mapped between client and server.
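A minimal sketch of the decoupling pattern in the diagram above: front-end (I/O) threads fill memory blocks and back-end (network) threads drain them, with thread-safe queues standing in for the logically mapped memory cache. This illustrates the pattern only; it is not the MemzNet implementation.

```python
# Sketch of the front-end / back-end decoupling shown above: front-end threads
# fill memory blocks from files, back-end threads drain filled blocks into TCP
# streams. Queues stand in for the logically mapped memory cache; this is an
# illustration of the pattern, not the MemzNet implementation.
import queue
import threading

BLOCK_SIZE = 4 * 1024 * 1024
free_blocks = queue.Queue()     # blocks available for the front-end to fill
filled_blocks = queue.Queue()   # blocks waiting for the back-end to send

for _ in range(256):            # 1GB worth of 4MB blocks
    free_blocks.put(bytearray(BLOCK_SIZE))

def front_end(paths):
    """I/O side: read files into free blocks, hand them to the network side."""
    for path in paths:
        with open(path, "rb") as f:
            while data := f.read(BLOCK_SIZE):
                block = free_blocks.get()
                block[:len(data)] = data
                filled_blocks.put((block, len(data)))
    filled_blocks.put(None)     # signal end of stream

def back_end(sock):
    """Network side: send filled blocks, then recycle them."""
    while (item := filled_blocks.get()) is not None:
        block, length = item
        sock.sendall(memoryview(block)[:length])
        free_blocks.put(block)  # block reused without reopening the data channel

# Usage sketch: threading.Thread(target=front_end, args=(file_list,)).start()
#               threading.Thread(target=back_end, args=(connected_socket,)).start()
```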
  	
  
ANI 100Gbps testbed

[Diagram: ANI Middleware Testbed (updated December 11, 2011). NERSC hosts nersc-diskpt-1/2/3 and nersc-app connect through the NERSC site router (nersc-mr2) to an ANI 100G router; ANL hosts anl-mempt-1/2/3 and anl-app connect through the ANL site router to a second ANI 100G router; the two routers are linked by the 100G ANI network, with 10G paths to ESnet. Each data host uses 4x10GE (MM) data NICs (mixes of 2x10G Myricom, Chelsio, and Mellanox cards plus 4x/6x10G HotLava cards) and a 1GE management interface. Note: ANI 100G routers and the 100G wave were available until summer 2012; testbed resources after that are subject to funding availability.]

SC11 100Gbps demo
  
Many TCP Streams

[Figure: (a) total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, and a total of 10 10Gbps NIC pairs were used to saturate the 100Gbps pipe in the ANI Testbed. Ten data movement jobs, each corresponding to a NIC pair, were started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., when the concurrency level is 4, there are 40 streams in total).]
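The sweep protocol in the caption can be sketched as a simple driver (hypothetical sink endpoint and payload size; the actual tests used the testbed's memory-to-memory transfer jobs):

```python
# Sketch of the concurrency sweep described in the caption: for each level,
# run that many memory-to-memory streams for a fixed interval. The endpoint
# and 64KB payload are illustrative, not the testbed's transfer tool.
import socket
import threading
import time

HOST, PORT = "receiver.example.org", 5001   # hypothetical memory-to-memory sink
INTERVAL_S = 300                            # 5-minute test intervals
LEVELS = [1, 2, 4, 8, 16, 32, 64]           # concurrent streams per job

def blast(stop: threading.Event):
    """Send a repeating in-memory buffer until the interval ends."""
    payload = b"\0" * 65536
    with socket.create_connection((HOST, PORT)) as sock:
        while not stop.is_set():
            sock.sendall(payload)

for level in LEVELS:
    stop = threading.Event()
    threads = [threading.Thread(target=blast, args=(stop,)) for _ in range(level)]
    for t in threads:
        t.start()
    time.sleep(INTERVAL_S)
    stop.set()
    for t in threads:
        t.join()
    print(f"finished interval with {level} concurrent streams")
```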
  
Effects of many streams

[Figure: ANI testbed, 100Gbps (10 x 10G NICs, three hosts): interrupts per CPU vs. the number of concurrent transfers (1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals); TCP buffer size is 50M.]
  
MemzNet’s Performance

TCP buffer size is set to 50MB.

[Figure: MemzNet vs. GridFTP throughput in the SC11 demo and on the ANI Testbed.]
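For reference, a sketch of requesting a 50MB TCP buffer per socket; note that the kernel caps the effective size via its rmem/wmem limits, so the granted value should be checked:

```python
# Sketch: request a 50MB TCP send/receive buffer on a socket, as in the tests
# above. The kernel may cap the effective size (net.core.rmem_max/wmem_max),
# so read back the value actually granted.
import socket

BUF = 50 * 1024 * 1024
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF)
print("granted send buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```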
MemzNet’s Architecture for data streaming
  
 
Experience with 100Gbps Network Applications

Mehmet Balman, Eric Pouyoul, Yushu Yao, E. Wes Bethel, Burlen Loring, Prabhat, John Shalf, Alex Sim, and Brian L. Tierney

DIDC – Delft, the Netherlands
June 19, 2012
  
Acknowledgements

Peter Nugent, Zarija Lukic, Patrick Dorn, Evangelos Chaniotakis, John Christman, Chin Guok, Chris Tracy, Lauren Rotman, Jason Lee, Shane Canon, Tina Declerck, Cary Whitney, Ed Holohan, Adam Scovel, Linda Winkler, Jason Hill, Doug Fuller, Susan Hicks, Hank Childs, Mark Howison, Aaron Thomas, John Dugan, Gopal Vaswani
  
The 2nd International Workshop on Network-aware Data Management

To be held in conjunction with the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC'12)

http://sdm.lbl.gov/ndm/2012

Nov 11th, 2012

Papers due by the end of August

Last year's program: http://sdm.lbl.gov/ndm/2011
  
