Network-aware Data Management for Large Scale Distributed Applications, IBM Research-Almaden, San Jose, CA – June 24, 2015


Published on: IBM Research – Talk – June 24, 2015

Title:

Network-aware Data Management for Large Scale Distributed Applications

Abstract:
As current technology enables faster storage devices and larger interconnect bandwidth, there is a substantial need for novel system design and middleware architecture to address increasing latency, scalability, and throughput requirements. In this talk, I will outline network-aware data management and present solutions based on my past experience in large-scale data migration between remote repositories.

I will first describe my experience in the initial evaluation of a 100Gbps network as part of the Advance Network Initiative project. We needed intense fine-tuning in the network, storage, and application layers to take advantage of the higher network capacity. End-system bottlenecks and system performance play an important role, especially in many-core platforms. I will introduce a special data movement prototype, successfully tested in one of the first 100Gbps demonstrations, in which applications map memory blocks for remote data, in contrast to send/receive semantics. This prototype was used to stream climate data over the wide area for in-memory application processing and visualization.

Within this scope, I will introduce a flexible network reservation algorithm for on-demand, bandwidth-guaranteed virtual circuit services. Flexible reservations find the best path in a time-dependent, dynamic network topology to support predictable application performance. I will then present a data-scheduling model with advance provisioning, in which data movement operations are defined with earliest start and latest completion times.

I will conclude my talk with a very brief overview of my other related projects on performance engineering, hyper-converged virtual storage, and optimization in control and data path for virtualized environments.



1. Network-aware Data Management for Large-scale Distributed Applications
June 24, 2015
Mehmet Balman
http://balman.info
Senior Performance Engineer at VMware Inc.
Guest/Affiliate at Berkeley Lab
2. About me:
Ø 2013: Performance, Central Engineering, VMware, Palo Alto, CA
Ø 2009: Computational Research Division (CRD) at Lawrence Berkeley National Laboratory (LBNL)
Ø 2005: Center for Computation & Technology (CCT), Baton Rouge, LA
v Computer Science, Louisiana State University (2010, 2008)
v Bogazici University, Istanbul, Turkey (2006, 2000)
Data Transfer Scheduling with Advance Reservation and Provisioning, Ph.D.
Failure-Awareness and Dynamic Adaptation in Data Scheduling, M.S.
Parallel Tetrahedral Mesh Refinement, M.S.
3. Why Network-aware?
Networking is one of the major components in many of the solutions today
• Distributed data and compute resources
• Collaboration: data to be shared between remote sites
• Data centers are complex network infrastructures
ü What further steps are necessary to take full advantage of future networking infrastructure?
ü How are we going to deal with performance problems?
ü How can we enhance data management services and make them network-aware?
New collaborations between data management and networking communities.
4. Two major players:
• Abstraction and Programmability
• Rapid Development, Intelligent services
• Orchestrating compute, storage, and network resources together
• Integration and deployment of complex workflows
• Virtualization (+containers)
• Distributed storage (storage wars)
• Open Source (if you can't fix it, you don't own it)
• Performance Gap:
• Limitation in current system software vs foreseen speed:
• Hardware is fast, Software is slow
• Latency vs throughput mismatch will lead to new innovations
5. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• PetaShare Distributed Storage + Stork Data Scheduler
  Adaptive Tuning + Advanced Buffers
• Data Streaming in High-bandwidth Networks
• Climate100: Advance Network Initiative and 100Gbps Demo
• MemzNet: Memory-Mapped Network Zero-copy Channels
• Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
• FlexRes: A Flexible Network Reservation Algorithm
• SchedSim: Online Scheduling with Advance Provisioning
6. VSAN: virtual SAN
VSAN image: blog.vmware.com
Distributed Object Storage
Hybrid (SSD+HDD)
7. VSAN performance work in a nutshell
Observer image: blog.vmware.com
• Every write operation needs to go over the network (and the network is not free)
• Each layer (cache, disk, object management, etc.) needs resources (CPU, memory)
• Resource limitations vs latency effect
• Needs to support thousands of VMs
Placement of Objects:
• Which host?
• Which disk/SSD in the host?
What if there are failures, migrations, and if we need to rebalance?
8. VVOL: virtual volumes
VVOL image: blog.vmware.com
Offloading control operations to the storage array
• powerOn
• powerOff
• Delete
• clone
9. VVOL performance work
• Effect of the latency in the control path
• linked clone vs VVOL clones
[diagram: vSphere host, storage array, VASA VP; data path vs control path]
• Optimize service latencies
• Batching (disklib)
• Use concurrent operations
10. PetaShare + Stork Data Scheduler
Aggregation in the Data Path:
Advance Buffer Cache in Petafs and Petashell clients by aggregating I/O requests to minimize the number of network messages
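As a rough illustration of the aggregation idea on the previous slide (not the actual Petafs/Petashell code), the sketch below buffers small contiguous client writes and flushes them as a single network message once a threshold is reached; names such as AggregatingWriter and flush_threshold are made up for the example.

    # Illustrative sketch only: client-side aggregation of small I/O requests
    # into fewer network messages. All names here are hypothetical.
    class AggregatingWriter:
        def __init__(self, remote_send, flush_threshold=1 << 20):
            self.remote_send = remote_send        # callable that ships one message
            self.flush_threshold = flush_threshold
            self.buffer = bytearray()
            self.offset = None                    # file offset where the buffer starts

        def write(self, offset, data):
            if self.offset is None:
                self.offset = offset
            elif offset != self.offset + len(self.buffer):
                self.flush()                      # non-contiguous request: flush first
                self.offset = offset
            self.buffer += data
            if len(self.buffer) >= self.flush_threshold:
                self.flush()

        def flush(self):
            if self.buffer:
                # One network message instead of many small ones.
                self.remote_send(self.offset, bytes(self.buffer))
                self.buffer = bytearray()
                self.offset = None

    # 512 writes of 4KB leave the client as two 1MB messages.
    sent = []
    w = AggregatingWriter(lambda off, blob: sent.append((off, len(blob))))
    for i in range(512):
        w.write(i * 4096, b"x" * 4096)
    w.flush()
    print(sent)   # [(0, 1048576), (1048576, 1048576)]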
11. Adaptive Tuning + Advanced Buffer
Adaptive Tuning for Bulk Transfer
Buffer Cache for Remote I/O
12. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• PetaShare Distributed Storage + Stork Data Scheduler
  Adaptive Tuning + Advanced Buffers
• Data Streaming in High-bandwidth Networks
• Climate100: Advance Network Initiative and 100Gbps Demo
• MemzNet: Memory-Mapped Network Zero-copy Channels
• Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
• FlexRes: A Flexible Network Reservation Algorithm
• SchedSim: Online Scheduling with Advance Provisioning
13. 100Gbps networking has finally arrived!
Applications' Perspective
Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.
1Gbps to 10Gbps transition (10 years ago): applications did not run 10 times faster just because more bandwidth was available
14. ANI 100Gbps Demo
• 100Gbps demo by ESnet and Internet2
• Application design issues and host tuning strategies to scale to 100Gbps rates
• Visualization of remotely located data (Cosmology)
• Data movement of large datasets with many files (Climate analysis)
15. Earth System Grid Federation (ESGF)
• Over 2,700 sites
• 25,000 users
• IPCC Fifth Assessment Report (AR5): 2PB
• IPCC Fourth Assessment Report (AR4): 35TB
• Remote Data Analysis
• Bulk Data Movement
16. Application's Perspective: Climate Data Analysis
17. lots-of-small-files problem!
file-centric tools?
[diagram: FTP and RPC request/response pattern — request a file, send file; request data, send data]
• Keep the network pipe full
• We want out-of-order and asynchronous send/receive
18. Many Concurrent Streams
(a) Total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, and a total of 10 10Gbps NIC pairs were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, at source and destination started simultaneously. Each peak represents a different test; 1, 2, 4, 8, 16, 32, 64 concurrent streams per job were initiated for 5min intervals (e.g. when the concurrency level is 4, there are 40 streams in total).
19. Effects of many concurrent streams
ANI testbed 100Gbps (10x10 NICs, three hosts): interrupts/CPU vs the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs - 5min intervals], TCP buffer size is 50M
20. Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al., NDM'13
Sandy Bridge Architecture
Receive process
21. Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al., NDM'14
22. 100Gbps demo environment
RTT: Seattle – NERSC 16ms; NERSC – ANL 50ms; NERSC – ORNL 64ms
23. Framework for the Memory-mapped Network Channel
+ Synchronization mechanism for RoCE
- Keep the pipe full for remote analysis
24. Moving climate files efficiently
25. Advantages
• Decoupling I/O and network operations
• front-end (I/O processing)
• back-end (networking layer)
• Not limited by the characteristics of the file sizes
• On-the-fly tar approach, bundling and sending many files together
• Dynamic data channel management
Can increase/decrease the parallelism level both in the network communication and in I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).
MemzNet is not file-centric. Bookkeeping information is embedded inside each block.
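To make the "bookkeeping embedded inside each block" point concrete, here is a minimal sketch of a fixed-size block whose header records which file and offset the payload belongs to, so blocks can be produced, sent, and consumed out of order. The field layout and names are assumptions for illustration, not MemzNet's actual wire format.

    # Hypothetical block layout: bookkeeping travels inside each block rather
    # than in per-file control messages.
    import struct

    HEADER = struct.Struct("!IQI")     # file_id, file_offset, payload_length
    BLOCK_SIZE = 4 * 1024 * 1024       # e.g. 4MB blocks, as in the 100Gbps demo

    def pack_block(file_id, offset, payload):
        assert HEADER.size + len(payload) <= BLOCK_SIZE
        return HEADER.pack(file_id, offset, len(payload)) + payload

    def unpack_block(block):
        file_id, offset, length = HEADER.unpack_from(block)
        return file_id, offset, block[HEADER.size:HEADER.size + length]

    # Blocks from different files can be interleaved and reassembled by the
    # receiver using only the header, independent of arrival order.
    b1 = pack_block(7, 0, b"A" * 1000)
    b2 = pack_block(3, 4096, b"B" * 500)
    for blk in (b2, b1):
        fid, off, data = unpack_block(blk)
        print(fid, off, len(data))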
26. MemzNet's Architecture for data streaming
27. 100Gbps Demo
• CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size 4MB
• Each block's data section was aligned according to the system pagesize.
• 1GB cache both at the client and the server
• At NERSC, 8 front-end threads on each host for reading data files in parallel.
• At ANL/ORNL, 4 front-end threads for processing received data blocks.
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.
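Aligning each block's data section to the system page size is the kind of detail this configuration calls out; a minimal sketch of what such alignment might look like (illustrative, not the demo's code):

    # Round the header size up to the next page boundary so the payload that
    # follows starts page-aligned (helps zero-copy / mmap-based I/O).
    import mmap

    PAGE = mmap.PAGESIZE
    HEADER_SIZE = 16

    def aligned_payload_offset(header_size=HEADER_SIZE, page=PAGE):
        return (header_size + page - 1) // page * page

    print(PAGE, aligned_payload_offset())   # e.g. 4096 4096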
28. MemzNet's Performance
TCP buffer size is set to 50MB
[chart: MemzNet vs GridFTP throughput, 100Gbps demo and ANI Testbed]
29. Challenge?
• High bandwidth brings new challenges!
• We need a substantial amount of processing power and the involvement of multiple cores to fill a 40Gbps or 100Gbps network
• Fine-tuning, both in network and application layers, to take advantage of the higher network capacity.
• Incremental improvement in current tools?
• We cannot expect every application to tune and improve every time we change the link technology or speed.
30. MemzNet
• MemzNet: Memory-mapped Network Channel
• High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer.
• The main goal is to define a network channel so applications can directly use it without the burden of managing/tuning the network communication.
Tech report: LBNL-6177E
31. MemzNet = New Execution Model
• Luigi Rizzo's netmap
• proposes a new API to send/receive data over the network
• RDMA programming model
• MemzNet as a memory-management component
• IX: Data Plane OS (Adam Belay et al. @ Stanford – similar to MemzNet's model)
• mTCP (event based / replaces send/receive in user level)
• Tanenbaum et al.: minimizing context switches, proposing to use MONITOR/MWAIT for synchronization
32. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• PetaShare Distributed Storage + Stork Data Scheduler
  Adaptive Tuning + Advanced Buffers
• Data Streaming in High-bandwidth Networks
• Climate100: Advance Network Initiative and 100Gbps Demo
• MemzNet: Memory-Mapped Network Zero-copy Channels
• Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
• FlexRes: A Flexible Network Reservation Algorithm
• SchedSim: Online Scheduling with Advance Provisioning
33. Problem Domain: ESnet's OSCARS
[map: ESnet topology — peerings with international R&E networks (ASIA-PACIFIC, AUSTRALIA, LATIN AMERICA, CANADA, RUSSIA AND CHINA, US R&E, CERN, LHCONE, EUROPE) and DOE sites (PNNL, SLAC, AMES, PPPL, BNL, ORNL, JLAB, FNAL, ANL, LBNL) across hubs including SEATTLE, SUNNYVALE, SACRAMENTO, BOISE, DENVER, ALBUQUERQUE, EL PASO, HOUSTON, KANSAS CITY, CHICAGO, NASHVILLE, ATLANTA, WASHINGTON DC, NEW YORK, BOSTON]
• Connecting experimental facilities and supercomputing centers
• On-Demand Secure Circuits and Advance Reservation System
• Guaranteed bandwidth between collaborating institutions by delivering network-as-a-service
• Co-allocation of storage and network resources (SRM: Storage Resource Manager)
OSCARS provides yes/no answers to a reservation request for (bandwidth, start_time, end_time)
End-to-end Reservation: Storage + Network
34. Reservation Request
• Between edge routers
Need to ensure availability of the requested bandwidth from source to destination for the requested time interval
v R = { n_source, n_destination, M_bandwidth, t_start, t_end }
v source/destination end-points
v requested bandwidth
v start/end times
Committed reservations between t_start and t_end are examined.
The shortest path from source to destination is calculated based on the engineering metric on each link, and a bandwidth-guaranteed path is set up to commit and eventually complete the reservation request for the given time period.
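A minimal sketch of the admission question described above: given committed reservations, does a candidate path still have the requested bandwidth free over the whole requested interval? The data model (a dict of link capacities and a list of committed tuples) is invented for illustration and is not OSCARS code; a real engine would evaluate availability per time step rather than simply summing every overlapping reservation.

    # Illustrative admission check: a committed reservation holds bandwidth on
    # every link of its path for [t_start, t_end).
    capacity = {("A", "B"): 900, ("B", "D"): 1000, ("A", "C"): 800,
                ("C", "D"): 500, ("B", "C"): 300}          # Mbps

    committed = [  # (path, bandwidth, t_start, t_end)
        ([("A", "B"), ("B", "D")], 900, 1, 3),
        ([("A", "C"), ("C", "D")], 400, 2, 3),
    ]

    def available(link, t_start, t_end):
        # Bandwidth left on one link during the whole requested interval.
        used = sum(bw for path, bw, s, e in committed
                   if link in path and s < t_end and t_start < e)   # overlap test
        return capacity[link] - used

    def admit(path, bandwidth, t_start, t_end):
        return all(available(l, t_start, t_end) >= bandwidth for l in path)

    print(admit([("A", "B"), ("B", "D")], 600, 1, 2))   # False: A-B is fully reserved
    print(admit([("A", "C"), ("C", "D")], 500, 1, 2))   # True: 500Mbps is still free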
35. Reservation
v Components (Graph):
v node (router), port, link (connecting two ports)
v engineering metric (~latency)
v maximum bandwidth (capacity)
v Reservation:
v source, destination, path, time
v (time t1, t3) A -> B -> D (900Mbps)
v (time t2, t3) A -> C -> D (400Mbps)
v (time t4, t5) A -> B -> D (800Mbps)
[topology: nodes A, B, C, D with link capacities 800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps; Reservations 1-3 over t1..t5]
36. Example
(time t1, t2):
A to D (600Mbps): NO
A to D (500Mbps): YES
[graph annotated with available / reserved (capacity) per link: 0 Mbps / 900Mbps (900Mbps), 100 Mbps / 900Mbps (1000Mbps), 800 Mbps / 0Mbps (800Mbps), 500 Mbps / 0Mbps (500Mbps), 300 Mbps / 0 Mbps (300Mbps)]
Active reservations:
reservation 1: (time t1, t3) A -> B -> D (900Mbps)
reservation 2: (time t1, t3) A -> C -> D (400Mbps)
reservation 3: (time t4, t5) A -> B -> D (800Mbps)
37. Example
[graph annotated with available / reserved (capacity) per link: 0 Mbps / 900Mbps (900Mbps), 100 Mbps / 900Mbps (1000Mbps), 400 Mbps / 400Mbps (800Mbps), 100 Mbps / 400Mbps (500Mbps), 300 Mbps / 0 Mbps (300Mbps)]
(time t1, t3):
A to D (500Mbps): NO
A to C (500Mbps): NO (not max-FLOW!)
Active reservations:
reservation 1: (time t1, t3) A -> B -> D (900Mbps)
reservation 2: (time t1, t3) A -> C -> D (400Mbps)
reservation 3: (time t4, t5) A -> B -> D (800Mbps)
38. Alternative Approach: Flexible Reservations
• IF the requested bandwidth cannot be guaranteed:
• trial and error until an available reservation is found
• the client is not given other possible options
• How can we enhance the OSCARS reservation system?
• Be Flexible:
• Submit constraints and the system suggests possible reservation options satisfying the given requirements
Rs' = { n_source, n_destination, M_MAXbandwidth, D_dataSize, t_EarliestStart, t_LatestEnd }
The reservation engine finds the reservation R = { n_source, n_destination, M_bandwidth, t_start, t_end } for the earliest completion or for the shortest duration, where M_bandwidth ≤ M_MAXbandwidth and t_EarliestStart ≤ t_start < t_end ≤ t_LatestEnd.
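The sketch below spells out the flexible request Rs' and the validity test it imposes on a candidate reservation R; the field names mirror the slide's notation, but the code is illustrative rather than the FlexRes implementation.

    # Illustrative only: the flexible request Rs' and the test a candidate
    # reservation R must pass.
    from dataclasses import dataclass

    @dataclass
    class FlexibleRequest:            # Rs'
        source: str
        destination: str
        max_bandwidth: float          # M_MAXbandwidth (Mbps)
        data_size: float              # D_dataSize (Mb)
        earliest_start: float         # t_EarliestStart
        latest_end: float             # t_LatestEnd

    @dataclass
    class Reservation:                # R
        source: str
        destination: str
        bandwidth: float
        t_start: float
        t_end: float

    def satisfies(req: FlexibleRequest, r: Reservation) -> bool:
        return (r.bandwidth <= req.max_bandwidth
                and req.earliest_start <= r.t_start < r.t_end <= req.latest_end
                and r.bandwidth * (r.t_end - r.t_start) >= req.data_size)

    req = FlexibleRequest("A", "D", 200, 800, 1, 13)           # 800 Mb within [t1, t13]
    print(satisfies(req, Reservation("A", "D", 100, 1, 9)))    # True: 100 Mbps x 8 slots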
39. Bandwidth Allocation (time-dependent)
Modified Dijkstra's algorithm (max available bandwidth):
• Bottleneck constraint (not additive)
• (the QoS constraint is additive in shortest-path search, etc.)
The maximum bandwidth available for allocation from a source node to a destination node
[time axis: t1 t2 t3 t4 t5 t6]
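The "modified Dijkstra" here is a max-bottleneck (widest-path) search: instead of summing additive costs, each node keeps the best bandwidth achievable along any path to it, and edges are relaxed with min(). Below is a generic sketch of that idea, not the OSCARS/FlexRes source.

    # Widest-path (max available bandwidth) variant of Dijkstra's algorithm.
    # graph[u] lists (v, available_bandwidth) edges for one time window.
    import heapq

    def max_bandwidth_path(graph, source, destination):
        best = {source: float("inf")}
        heap = [(-best[source], source)]          # max-heap via negated widths
        while heap:
            width, u = heapq.heappop(heap)
            width = -width
            if u == destination:
                return width                      # bottleneck bandwidth of the widest path
            if width < best.get(u, 0):
                continue                          # stale heap entry
            for v, avail in graph.get(u, []):
                w = min(width, avail)             # bottleneck constraint, not additive
                if w > best.get(v, 0):
                    best[v] = w
                    heapq.heappush(heap, (-w, v))
        return 0.0

    # Remaining bandwidth per link in one time window (cf. the earlier example).
    g = {"A": [("B", 0), ("C", 400)], "B": [("D", 100), ("C", 300)], "C": [("D", 100)]}
    print(max_bandwidth_path(g, "A", "D"))        # 100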
40. Analogous Example
• A vehicle travelling from city A to city B
• There are multiple cities between A and B connected with separate highways.
• Each highway has a specific speed limit (maximum bandwidth)
• But we need to reduce our speed if there is a high traffic load on the road
• We know the load on each highway for every time period (active reservations)
• The first question is which path the vehicle should follow in order to reach city B from city A as early as possible (earliest completion)
• Or, we can delay our journey and start later if the total travel time would be reduced. The second question is to find the route, along with the starting time, for the shortest travel duration (shortest duration)
Advance bandwidth reservation: we have to set the speed limit before starting and cannot change it during the journey
41. Time steps
• Time steps between t1 and t13
[timeline: Reservations 1-3 over t1..t13; boundaries t1, t4, t6, t7, t9, t12, t13 partition the timeline into time steps ts1..ts4 covering Res 1, Res 1+2, Res 2, and Res 3]
Max (2r+1) time steps, where r is the number of reservations
42. Static Graphs
[four static graphs G(ts1), G(ts2), G(ts3), G(ts4) over nodes A, B, C, D, one per time step (boundaries t1, t4, t6, t7, t9), with each link labeled by its available bandwidth, e.g. 0 Mbps, 100 Mbps, 400 Mbps, 500 Mbps, 800 Mbps, 900 Mbps, 1000 Mbps, 300 Mbps]
43. Time Windows
[graphs of two time windows over nodes A, B, C, D, each obtained by combining consecutive time steps, e.g. G(tw) = G(ts1) × G(ts2) for tw = ts1+ts2 and G(tw) = G(ts3) × G(ts4) for tw = ts3+ts4]
Bottleneck constraint
Max (s × (s + 1))/2 time windows, where s is the number of time steps
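A sketch of how a time window's graph can be derived from the static graphs of the time steps it spans: each link's availability over the window is the bottleneck (minimum) of its availability in the individual steps. The dict-of-links representation and the example numbers are assumptions for illustration.

    # Combine per-time-step availability into a time-window graph by taking the
    # minimum of each link across all steps in the window.
    def window_graph(step_graphs):
        links = set().union(*step_graphs)
        return {link: min(g.get(link, 0) for g in step_graphs) for link in links}

    # Two example step graphs (numbers illustrative, in Mbps).
    g_ts3 = {("A", "B"): 900, ("B", "D"): 1000, ("A", "C"): 400, ("C", "D"): 100}
    g_ts4 = {("A", "B"): 900, ("B", "D"): 1000, ("A", "C"): 800, ("C", "D"): 500}
    print(window_graph([g_ts3, g_ts4])[("A", "C")])   # 400: bottleneck over the window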
44. Time Window List (special data structures)
Time windows list: now ... infinite
new reservation: reservation 1, start t1, end t10
  now, t1, t10 (Res 1), infinite
new reservation: reservation 2, start t12, end t20
  now, t1, t10 (Res 1), t12, t20 (Res 2), infinite
Careful software design makes the implementation fast and efficient
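One way to realize the ordered list above is to keep the reservation start/end points as sorted, unique boundaries between "now" and "infinite"; consecutive boundaries then delimit the time steps the engine searches. This bisect-based version is a simplification of whatever specialized structure the deck alludes to.

    # Ordered boundary list for time windows. Simplified illustration only.
    import bisect, math

    boundaries = [0.0, math.inf]        # "now" .. "infinite"

    def add_reservation(t_start, t_end):
        for t in (t_start, t_end):
            i = bisect.bisect_left(boundaries, t)
            if boundaries[i] != t:      # keep boundaries unique and sorted
                boundaries.insert(i, t)

    add_reservation(1, 10)              # reservation 1: t1 .. t10
    add_reservation(12, 20)             # reservation 2: t12 .. t20
    print(boundaries)                   # [0.0, 1, 10, 12, 20, inf]
    # With r reservations there are at most 2r + 1 time steps between boundaries.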
45. Performance
max-bandwidth path ~ O(n^2), where n is the number of nodes in the topology graph
In the worst case, we may need to search all time windows, (s × (s + 1))/2, where s is the number of time steps.
If there are r committed reservations in the search period, there can be a maximum of 2r + 1 different time steps in the worst case.
Overall, the worst-case complexity is bounded by O(r^2 n^2)
Note: r is relatively very small compared to the number of nodes n
46. Example
Reservation 1: (time t1, t6) A -> B -> D (900Mbps)
Reservation 2: (time t4, t7) A -> C -> D (400Mbps)
Reservation 3: (time t9, t12) A -> B -> D (700Mbps)
[topology: A, B, C, D with link capacities 800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps; timeline t1..t13]
Request from A to D (earliest completion): max bandwidth = 200Mbps, volume = 200Mbps x 4 time slots, earliest start = t1, latest finish t13
47. Search Order - Time Windows
[timeline: Res 1, Res 1+2, Res 2, Res 3 over boundaries t1, t4, t6, t7, t9, t12, t13; candidate time windows t1–t6, t4–t6, t1–t4, t6–t7, t4–t7, t1–t7, t7–t9, t6–t9, t4–t9, t1–t9]
Max bandwidth from A to D:
1. 900Mbps (3)
2. 100Mbps (2)
3. 100Mbps (5)
4. 900Mbps (1)
5. 100Mbps (3)
6. 100Mbps (6)
7. 900Mbps (2)
8. 900Mbps (3)
9. 100Mbps (5)
10. 100Mbps (8)
Reservation: (A to D) (100Mbps) start=t1 end=t9
48. Search Order - Time Windows
Shortest duration?
[timeline: Res 1, Res 1+2, Res 2, Res 3 over boundaries t1, t4, t6, t7, t9, t12, t13; candidate time windows t9–t13, t12–t13, t9–t12]
Max bandwidth from A to D:
1. 200Mbps (3)
2. 900Mbps (1)
3. 200Mbps (4)
Reservation: (A to D) (200Mbps) start=t9 end=t13
Ø from A to D, max bandwidth = 200Mbps, volume = 175Mbps x 4 time slots, earliest start = t1, latest finish t13
earliest completion: (A to D) (100Mbps) start=t1 end=t8
shortest duration: (A to D) (200Mbps) start=t9 end=t12.5
49. Source > Network > Destination
[topology: nodes A, B, C, D with link capacities 800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps; end hosts n1 and n2]
Now we have multiple requests
50. With start/end times
• Each transfer request has start and end times
• n transfer requests are given (each request has a specific amount of profit)
• The objective is to maximize the profit
• If the profit is the same for each request, then the objective is to maximize the number of jobs in a given time period
• Unsplittable Flow Problem:
• an undirected graph,
• route demand from source(s) to destination(s) and maximize/minimize the total profit/cost
The online scheduling method here is inspired by the Gale-Shapley algorithm (also known as the stable marriage problem)
51. Methodology
• Displace other jobs to open space for the new request
• we can shift at most n jobs?
• Never accept a job if it causes other committed jobs to break their criteria
• Planning ahead (gives opportunity for co-allocation)
• Gives a polynomial approximation algorithm
• The preference converts the UFP problem into a Dijkstra path search
• Utilizes time windows/time steps for ranking (better than earliest deadline first)
• Earliest completion + shortest duration
• Minimize concurrency
• Even random ranking would work (relaxation in an NP-hard problem)
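As a rough sketch of the ranking step (illustrative, not the SchedSim code): candidate time windows, each with the max bandwidth found by the path search, are ordered by a preference such as earliest completion and then shortest duration, and the first window that can carry the requested volume wins.

    # Illustrative preference ordering over candidate (t_start, t_end, max_bw)
    # time windows for a request that must move `volume_mb` of data.
    def choose_window(candidates, volume_mb):
        feasible = [(s, e, bw) for s, e, bw in candidates if bw * (e - s) >= volume_mb]
        # Preference: earliest completion first, then shortest duration.
        feasible.sort(key=lambda c: (c[1], c[1] - c[0]))
        if not feasible:
            return None
        s, e, bw = feasible[0]
        return s, e, volume_mb / (e - s)       # smallest rate that still fits

    # 800 Mb to move; windows as found by the earlier max-bandwidth search.
    print(choose_window([(1, 9, 100), (9, 13, 200)], 800))   # (1, 9, 100.0)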
52.
53. Recall Time Windows
[timeline: Res 1, Res 1+2, Res 2, Res 3 over boundaries t1, t4, t6, t7, t9, t12, t13; candidate time windows t1–t6, t4–t6, t1–t4, t6–t7, t4–t7, t1–t7, t7–t9, t6–t9, t4–t9, t1–t9]
Max bandwidth from A to D:
1. 900Mbps (3)
2. 100Mbps (2)
3. 100Mbps (5)
4. 900Mbps (1)
5. 100Mbps (3)
6. 100Mbps (6)
7. 900Mbps (2)
8. 900Mbps (3)
9. 100Mbps (5)
10. 100Mbps (8)
Reservation: (A to D) (100Mbps) start=t1 end=t9
54. Test
In real life, the number of nodes and the number of reservations in a given search interval are limited
See the AINA'13 paper for results + comparison with different preference metrics
55. Autonomic Provisioning System
• Generate constraints automatically (without user input)
• Volume (elephant flow?)
• True deadline if applicable
• End-host resource availability
• Burst rate (fixed bandwidth, variable bandwidth)
• Update constraints according to feedback and monitoring
• Minimize operational cost
• Alternative to manual traffic engineering
What is the incentive to make correct reservations?
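A toy sketch of generating such constraints automatically from monitored characteristics instead of user input; the heuristics, numbers, and field names below are assumptions, not the actual provisioning system.

    # Illustrative constraint generation: derive a flexible request from the
    # observed data volume, an optional true deadline, and end-host limits.
    def build_request(observed_volume_mb, host_nic_mbps, deadline=None, now=0.0):
        max_bw = 0.8 * host_nic_mbps          # leave headroom on the end host
        if deadline is not None:
            latest_end = deadline             # honor a true deadline when given
        else:
            latest_end = now + 2 * observed_volume_mb / max_bw   # loose default window
        return {"max_bandwidth": max_bw, "data_size": observed_volume_mb,
                "earliest_start": now, "latest_end": latest_end}

    # An ~1TB (8,000,000 Mb) nightly transfer over a 10GbE host:
    print(build_request(observed_volume_mb=8_000_000, host_nic_mbps=10_000))
    # {'max_bandwidth': 8000.0, 'data_size': 8000000, 'earliest_start': 0.0,
    #  'latest_end': 2000.0}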
56. Data Center 1, Data Center 2, Data node B (web access), Experimental facility A
* (1) Experimental facility A generates 30T of data every day, and it needs to be stored in data center 2 before the next run, since local disk space is limited
* (2) There is a reservation made between data centers 1 and 2. It is used to replicate data files, 1P total size, when new data is available in data center 2
* (3) New results are published at data node B; we expect high traffic to download new simulation files for the next couple of months
Wide-area SDN
57. Example
• The experimental facility periodically transfers data (i.e. every night)
• Data replication happens occasionally, and it will take a week to move 1P of data. It could get delayed a couple of hours with no harm
• Wide-area download traffic will increase gradually; most of the traffic will be during the day.
• We can dynamically increase the preference for download traffic in the mornings, give high priority to transferring data from the facility at night, and use the rest of the bandwidth for data replication (and allocate some bandwidth to confirm that it would finish within a week as usual)
58. Virtual Circuit Reservation Engine
Autonomic provisioning system + monitoring
Reservation Engine:
– Select optimal path/time/bandwidth
– Maximize the number of admitted requests
– Increase overall system utilization and network efficiency
– Dynamically update the selected routing path for network efficiency
– Modify existing reservations dynamically to open space/time for new requests
59. THANK YOU
Any Question/Comment?
Mehmet Balman
mbalman@lbl.gov
http://balman.info
