Network-aware Data Management Middleware for High Throughput Flows --- AT&T Research, Bedminster, NJ – Talk – March 16, 2015


Title: "Network-aware Data Management Middleware for High Throughput Flows"
AT&T Research, Bedminster, NJ, March 16, 2015

Abstract:
Today's data-centric environment, both in industry and in scientific domains, depends on the underlying network infrastructure and its performance to deliver distributed application workloads. As current technology enables faster storage devices and larger interconnect bandwidth, there is a substantial need for novel system design and middleware architecture to address increasing latency, scalability, and throughput requirements. In this talk, I will outline network-aware data management and present solutions based on my past experience in large-scale data migration between remote repositories. I will first introduce a flexible network reservation algorithm for on-demand bandwidth-guaranteed virtual circuit services. Flexible reservations find the best path in a time-dependent dynamic network topology to support predictable application performance. I will then present a scheduling model with advance provisioning, in which data movement operations are defined with earliest start and latest completion times. The online scheduling of multiple requests relies on a simple but very efficient optimization technique to generate near-optimum results on the fly, while satisfying users' time and resource constraints.

Within this scope, I will describe my experience in the initial evaluation of a 100Gbps network as part of the Advanced Networking Initiative (ANI) project. We needed intense fine-tuning, in both the network and application layers, to take advantage of the higher network capacity. End-system bottlenecks and system performance play an important role, especially on many-core platforms. I will introduce a special data movement prototype, successfully tested in one of the first 100Gbps demonstrations, in which applications map memory blocks for remote data, in contrast to send/receive semantics. This prototype was used to stream climate data over the wide area for in-memory application processing and visualization. I will conclude my talk with a brief overview of my other related projects on runtime adaptive tuning and failure recovery of data transfer jobs, and on I/O aggregation and performance evaluation of distributed storage for collaborative science.

Bio:
Mehmet Balman, PhD, is a senior performance engineer at VMware Inc. working on hyper-converged storage, virtual volumes, and virtualized solutions. He previously worked as a research engineer in the Computational Research Division at Lawrence Berkeley National Laboratory (Berkeley Lab). His work focused on data-intensive distributed computing, performance problems in high-bandwidth networks, advance resource provisioning, and data scheduling for large-scale applications. He was also deeply involved in the initial phase of DOE's nationwide 100Gbps network.

1. Network-aware Data Management Middleware for High Throughput Flows
March 16, 2015
Mehmet Balman, http://balman.info
Performance Engineer at VMware Inc.; Guest Scientist at Berkeley Lab
2. About me:
• 2013: Performance, Central Engineering, VMware, Palo Alto, CA
• 2009: Computational Research Division (CRD) at Lawrence Berkeley National Laboratory (LBNL)
• 2005: Center for Computation & Technology (CCT), Baton Rouge, LA
• Computer Science, Louisiana State University (2010, 2008)
• Bogazici University, Istanbul, Turkey (2006, 2000)
Data Transfer Scheduling with Advance Reservation and Provisioning (Ph.D.)
Failure-Awareness and Dynamic Adaptation in Data Scheduling (M.S.)
Parallel Tetrahedral Mesh Refinement (M.S.)
3. Why Network-aware?
Networking is one of the major components in many of today's solutions:
• Distributed data and compute resources
• Collaboration: data to be shared between remote sites
• Data centers are complex network infrastructures
✓ What further steps are necessary to take full advantage of future networking infrastructure?
✓ How are we going to deal with performance problems?
✓ How can we enhance data management services and make them network-aware?
New collaborations between the data management and networking communities.
4. Two major players:
• Abstraction and Programmability
  • Rapid development, intelligent services
  • Orchestrating compute, storage, and network resources together
  • Integration and deployment of complex workflows
  • Virtualization (+ containers)
  • Distributed storage (storage wars)
  • Open source (if you can't fix it, you don't own it)
• Performance Gap:
  • The limitation is current system software versus foreseen speeds: hardware is fast, software is slow
  • The latency/throughput mismatch will lead to new innovations
5. Outline
• Data Streaming in High-bandwidth Networks
  • Climate100: Advanced Networking Initiative and 100Gbps Demo
  • MemzNet: Memory-Mapped Network Zero-copy Channels
  • Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling
  • FlexRes: A Flexible Network Reservation Algorithm
  • SchedSim: Online Scheduling with Advance Provisioning
• Performance Engineering and Virtualized Solutions
  • Software Defined Storage
6. 100Gbps networking has finally arrived! Applications' Perspective
Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.
The 1Gbps to 10Gbps transition (10 years ago): applications did not run 10 times faster just because more bandwidth was available.
7. ANI 100Gbps Demo
• 100Gbps demo by ESnet and Internet2
• Application design issues and host tuning strategies to scale to 100Gbps rates
• Visualization of remotely located data (cosmology)
• Data movement of large datasets with many files (climate analysis)
8. Earth System Grid Federation (ESGF)
• Over 2,700 sites
• 25,000 users
• IPCC Fifth Assessment Report (AR5): 2PB
• IPCC Fourth Assessment Report (AR4): 35TB
• Remote data analysis
• Bulk data movement
9. Application's Perspective: Climate Data Analysis
10. The lots-of-small-files problem! File-centric tools?
[Figure: per-file FTP/RPC exchanges (request a file / send file, repeated for each file) vs. a streaming channel (request data / send data)]
• Keep the network pipe full
• We want out-of-order and asynchronous send/receive
11. Many Concurrent Streams
(a) Total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, and a total of 10 10Gbps NIC pairs were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., when the concurrency level is 4, there are 40 streams in total).
12. Effects of many concurrent streams
ANI testbed, 100Gbps (10x10 NICs, three hosts): interrupts/CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals]; TCP buffer size is 50M.
13. Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al., NDM'13
Sandy Bridge architecture, receive process
14. Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al., NDM'14
15. 100Gbps demo environment
RTT: Seattle – NERSC 16ms; NERSC – ANL 50ms; NERSC – ORNL 64ms
16. Framework for the Memory-mapped Network Channel
• Synchronization mechanism for RoCE
• Keep the pipe full for remote analysis
17. Moving climate files efficiently
18. Advantages
• Decoupling I/O and network operations:
  • front-end (I/O processing)
  • back-end (networking layer)
• Not limited by the characteristics of the file sizes
• On-the-fly tar approach: bundling and sending many files together
• Dynamic data channel management
Can increase/decrease the parallelism level both in the network communication and in the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).
MemzNet is not file-centric. Bookkeeping information is embedded inside each block.
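Since the slide above says each block carries its own bookkeeping, here is a minimal sketch of what such a self-describing block could look like. This is illustrative only: the actual MemzNet wire format is not shown in the talk, and all field names here are invented.

```python
import struct

# Hypothetical per-block header: offset, block id, payload length, file name.
# Because each block is self-describing, blocks can be sent and received
# out of order and reassembled at the receiver without per-file state.
HDR = struct.Struct("!QQI64s")

def pack_block(name: bytes, offset: int, block_id: int, payload: bytes) -> bytes:
    return HDR.pack(offset, block_id, len(payload), name.ljust(64, b"\0")) + payload

def unpack_block(buf: bytes):
    offset, block_id, length, name = HDR.unpack_from(buf)
    payload = buf[HDR.size:HDR.size + length]
    return name.rstrip(b"\0"), offset, block_id, payload
```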
19. MemzNet's Architecture for data streaming
20. 100Gbps Demo
• CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size: 4MB
• Each block's data section was aligned according to the system page size
• 1GB cache both at the client and the server
• At NERSC, 8 front-end threads on each host for reading data files in parallel
• At ANL/ORNL, 4 front-end threads for processing received data blocks
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection
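To make the page-alignment point concrete, a small sketch of a 1GB cache of page-aligned 4MB block slots. The sizes are the ones quoted above; the slot-management code is my own illustration, not the demo implementation.

```python
import mmap

BLOCK_SIZE = 4 * 1024 * 1024            # 4MB blocks, as in the demo
CACHE_SIZE = 1024 * 1024 * 1024         # 1GB cache at client and server
NUM_SLOTS = CACHE_SIZE // BLOCK_SIZE    # 256 block slots

# An anonymous mmap is page-aligned, and BLOCK_SIZE is a multiple of the
# page size, so every slot below starts on a page boundary.
cache = mmap.mmap(-1, CACHE_SIZE)
free_slots = list(range(NUM_SLOTS))

def get_block():
    """Hand out (slot id, zero-copy view) of a free, page-aligned 4MB slot."""
    slot = free_slots.pop()
    return slot, memoryview(cache)[slot * BLOCK_SIZE:(slot + 1) * BLOCK_SIZE]

def put_block(slot):
    """Return a slot to the free pool once its block has been sent/consumed."""
    free_slots.append(slot)
```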
21. 83Gbps throughput
22. MemzNet's Performance
TCP buffer size is set to 50MB.
[Charts: MemzNet vs. GridFTP throughput in the 100Gbps demo and on the ANI Testbed]
23. Challenge?
• High bandwidth brings new challenges!
  • We need a substantial amount of processing power and the involvement of multiple cores to fill a 40Gbps or 100Gbps network
  • Fine-tuning, in both the network and application layers, to take advantage of the higher network capacity
• Incremental improvement in current tools?
  • We cannot expect every application to tune and improve every time we change the link technology or speed.
24. MemzNet
• MemzNet: Memory-mapped Network Channel
• High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer.
• The main goal is to define a network channel that applications can use directly, without the burden of managing/tuning the network communication.
Tech report: LBNL-6177E
25. MemzNet = New Execution Model
• Luigi Rizzo's netmap
  • proposes a new API to send/receive data over the network
• RDMA programming model
  • MemzNet as a memory-management component
• IX: Data Plane OS (Adam Belay et al. at Stanford; similar to MemzNet's model)
• mTCP (event-driven; replaces send/receive at user level)
• Tanenbaum et al.: minimizing context switches by proposing to use MONITOR/MWAIT for synchronization
26. Problem Domain: ESnet's OSCARS
[Figure: ESnet topology map showing DOE sites (PNNL, SLAC, AMES, PPPL, BNL, ORNL, JLAB, FNAL, ANL, LBNL), US R&E networks, and international peerings (Asia-Pacific, Australia, Latin America, Canada, Russia and China, Europe/GÉANT, CERN/LHCONE)]
• Connecting experimental facilities and supercomputing centers
• On-Demand Secure Circuits and Advance Reservation System
• Guarantees between collaborating institutions, delivered as network-as-a-service
• Co-allocation of storage and network resources (SRM: Storage Resource Manager)
OSCARS provides yes/no answers to a reservation request for (bandwidth, start_time, end_time).
End-to-end reservation: storage + network.
27. Reservation Request
• Between edge routers
We need to ensure availability of the requested bandwidth from source to destination for the requested time interval.
• R = {n_source, n_destination, M_bandwidth, t_start, t_end}
  • source/destination end-points
  • requested bandwidth
  • start/end times
Committed reservations between t_start and t_end are examined.
The shortest path from source to destination is calculated based on the engineering metric on each link, and a bandwidth-guaranteed path is set up to commit and eventually complete the reservation request for the given time period.
28. Reservation
• Components (graph):
  • node (router), port, link (connecting two ports)
  • engineering metric (~latency)
  • maximum bandwidth (capacity)
• Reservation: source, destination, path, time
  • reservation 1: (time t1, t3) A -> B -> D (900Mbps)
  • reservation 2: (time t2, t3) A -> C -> D (400Mbps)
  • reservation 3: (time t4, t5) A -> B -> D (800Mbps)
[Figure: topology with nodes A, B, C, D and link capacities 800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps; the three reservations shown on a timeline t1..t5]
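A minimal sketch of this graph-plus-reservations model in code (not OSCARS code; the capacity-to-link assignment below is inferred from the available/reserved figures on the next two slides):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    src: str             # node (router) on one port
    dst: str             # node on the other port
    capacity: float      # maximum bandwidth, Mbps
    metric: float = 1.0  # engineering metric (~latency)

@dataclass
class Reservation:
    path: list           # e.g. ["A", "B", "D"]
    bandwidth: float     # Mbps
    t_start: float
    t_end: float

links = [Link("A", "B", 900), Link("B", "D", 1000), Link("A", "C", 800),
         Link("C", "D", 500), Link("B", "C", 300)]
reservations = [Reservation(["A", "B", "D"], 900, 1, 3),
                Reservation(["A", "C", "D"], 400, 2, 3),
                Reservation(["A", "B", "D"], 800, 4, 5)]
```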
29. Example
(time t1, t2):
• A to D (600Mbps): NO
• A to D (500Mbps): YES
Active reservations:
• reservation 1: (time t1, t3) A -> B -> D (900Mbps)
• reservation 2: (time t1, t3) A -> C -> D (400Mbps)
• reservation 3: (time t4, t5) A -> B -> D (800Mbps)
[Figure: links annotated available/reserved (capacity): 0/900 (900), 100/900 (1000), 800/0 (800), 500/0 (500), 300/0 (300) Mbps]
30. Example
(time t1, t3):
• A to D (500Mbps): NO
• A to C (500Mbps): NO (not max-flow!)
Active reservations:
• reservation 1: (time t1, t3) A -> B -> D (900Mbps)
• reservation 2: (time t1, t3) A -> C -> D (400Mbps)
• reservation 3: (time t4, t5) A -> B -> D (800Mbps)
[Figure: links annotated available/reserved (capacity): 0/900 (900), 100/900 (1000), 400/400 (800), 100/400 (500), 300/0 (300) Mbps]
31. Alternative Approach: Flexible Reservations
• If the requested bandwidth cannot be guaranteed:
  • trial and error until an available reservation is found
  • the client is not given other possible options
• How can we enhance the OSCARS reservation system?
• Be flexible:
  • submit constraints, and the system suggests possible reservation options satisfying the given requirements
R_s' = {n_source, n_destination, M_MAXbandwidth, D_dataSize, t_EarliestStart, t_LatestEnd}
The reservation engine finds the reservation R = {n_source, n_destination, M_bandwidth, t_start, t_end} for the earliest completion or for the shortest duration, where M_bandwidth <= M_MAXbandwidth and t_EarliestStart <= t_start < t_end <= t_LatestEnd.
32. Bandwidth Allocation (time-dependent)
Modified Dijkstra's algorithm (maximum available bandwidth):
• Bottleneck constraint (not additive)
• (QoS constraints are additive in shortest-path search, etc.)
[Figure: the maximum bandwidth available for allocation from a source node to a destination node, over time steps t1..t6]
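A compact sketch of the widest-path ("modified Dijkstra") idea: a path's length is the minimum available bandwidth along it, and we maximize that minimum. The adjacency-dict representation and names are mine; the numbers reproduce the availability graph of slide 30.

```python
import heapq

def max_bandwidth_path(avail, src, dst):
    """Widest-path Dijkstra: maximize the bottleneck (minimum) available
    bandwidth along a path, instead of minimizing an additive length."""
    best = {src: float("inf")}
    heap = [(-best[src], src)]               # max-heap via negated widths
    while heap:
        width, u = heapq.heappop(heap)
        width = -width
        if u == dst:
            return width
        if width < best.get(u, 0):
            continue                         # stale heap entry
        for v, bw in avail.get(u, {}).items():
            w = min(width, bw)               # bottleneck along this path
            if w > best.get(v, 0):
                best[v] = w
                heapq.heappush(heap, (-w, v))
    return 0

# Availability at (time t1, t3) from slide 30, in Mbps:
avail = {"A": {"B": 0, "C": 400}, "B": {"C": 300, "D": 100},
         "C": {"B": 300, "D": 100}, "D": {}}
print(max_bandwidth_path(avail, "A", "D"))   # 100 -> a 500Mbps request: NO
```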
33. Analogous Example
• A vehicle travelling from city A to city B
• There are multiple cities between A and B connected with separate highways
• Each highway has a specific speed limit (maximum bandwidth)
• But we need to reduce our speed if there is a high traffic load on the road
• We know the load on each highway for every time period (active reservations)
• The first question is which path the vehicle should follow in order to reach city B from city A as early as possible (earliest completion)
• Or, we can delay our journey and start later if the total travel time would be reduced. The second question is to find the route, along with the starting time, for the shortest travel duration (shortest duration)
Advance bandwidth reservation: we have to set the speed limit before starting and cannot change it during the journey.
34. Time steps
• Time steps between t1 and t13
[Figure: timeline t1..t13 with Reservations 1-3; the reservation boundaries at t1, t4, t6, t7, t9, t12, t13 yield the time steps ts1..ts4]
There are at most (2r + 1) time steps, where r is the number of reservations.
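The (2r + 1) bound holds because each of the r reservations contributes at most two boundaries (its start and end), cutting the search period into at most 2r + 1 pieces. A tiny illustrative sketch, reusing the Reservation class from the earlier sketch:

```python
def time_steps(search_start, search_end, reservations):
    """Consecutive (t_lo, t_hi) steps: at most 2r + 2 boundaries,
    hence at most 2r + 1 steps."""
    bounds = {search_start, search_end}
    for r in reservations:
        for t in (r.t_start, r.t_end):
            if search_start < t < search_end:
                bounds.add(t)
    b = sorted(bounds)
    return list(zip(b, b[1:]))

# Reservations spanning (t1,t6), (t4,t7), (t9,t12) within [t1, t13]
# yield the steps t1-t4, t4-t6, t6-t7, t7-t9, t9-t12, t12-t13.
```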
35. Static Graphs
[Figure: one static availability graph per time step, G(ts1) for t1-t4, G(ts2) for t4-t6, G(ts3) for t6-t7, G(ts4) for t7-t9, each showing the available bandwidth on every link of the A/B/C/D topology under the reservations active in that step (Res 1; Res 1,2; Res 2)]
36. Time Windows
A time window spans consecutive time steps; its graph is the bottleneck composition of the static graphs it covers, e.g. G(tw) = G(ts1) x G(ts2) for tw = ts1 + ts2, and G(tw) = G(ts3) x G(ts4) for tw = ts3 + ts4 (bottleneck constraint: per-link minimum).
There are at most (s x (s + 1))/2 time windows, where s is the number of time steps.
[Figure: combined availability graphs for the windows over t1, t6, t9 under Res 1,2 and Res 2]
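The bottleneck composition of two static graphs is just a per-link minimum; a one-function sketch over the adjacency-dict representation used in the earlier Dijkstra sketch:

```python
def combine(g1, g2):
    """G(tw) = G(ts_i) x G(ts_j): a window covering both time steps can only
    promise the smaller of the two per-link availabilities."""
    return {u: {v: min(bw, g2[u][v]) for v, bw in nbrs.items()}
            for u, nbrs in g1.items()}

# e.g. G_tw = combine(G_ts3, G_ts4), then search it with max_bandwidth_path(G_tw, ...)
```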
37. Time Window List (special data structures)
Time windows list: now .. infinite
• new reservation: reservation 1, start t1, end t10 => now | t1 .. t10 (Res 1) | t10 .. infinite
• new reservation: reservation 2, start t12, end t20 => now | t1 .. t10 (Res 1) | t10 .. t12 | t12 .. t20 (Res 2) | t20 .. infinite
Careful software design makes the implementation fast and efficient.
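A plain-list sketch of that update (the slide's "special data structures" are faster; this only shows the splitting logic): inserting a reservation splits any window it overlaps at its start and end times, and tags the covered pieces.

```python
import math

# Each window: [t_lo, t_hi, set of reservation ids active in it]
windows = [[0.0, math.inf, set()]]          # "now .. infinite", initially empty

def insert(windows, res_id, t_start, t_end):
    out = []
    for lo, hi, ids in windows:
        pieces = ((lo, min(hi, t_start)),                 # before the reservation
                  (max(lo, t_start), min(hi, t_end)),     # covered by it
                  (max(lo, t_end), hi))                   # after it
        for a, b in pieces:
            if a < b:
                covered = ids | ({res_id} if t_start <= a and b <= t_end else set())
                out.append([a, b, covered])
    return out

windows = insert(windows, "Res 1", 1.0, 10.0)
windows = insert(windows, "Res 2", 12.0, 20.0)
# -> [now,t1,{}], [t1,t10,{Res 1}], [t10,t12,{}], [t12,t20,{Res 2}], [t20,inf,{}]
```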
38. Performance
A max-bandwidth path search is ~O(n^2), where n is the number of nodes in the topology graph.
In the worst case, we may need to search all time windows, (s x (s + 1))/2, where s is the number of time steps. If there are r committed reservations in the search period, there can be at most 2r + 1 distinct time steps in the worst case.
Overall, the worst-case complexity is bounded by O(r^2 n^2).
Note: r is relatively very small compared to the number of nodes n.
39. Example
Reservation 1: (time t1, t6) A -> B -> D (900Mbps)
Reservation 2: (time t4, t7) A -> C -> D (400Mbps)
Reservation 3: (time t9, t12) A -> B -> D (700Mbps)
[Figure: the A/B/C/D topology (800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps links) and the three reservations on the t1..t13 timeline]
Request from A to D (earliest completion): max bandwidth = 200Mbps, volume = 200Mbps x 4 time slots, earliest start = t1, latest finish = t13.
40. Search Order - Time Windows
Windows searched, over the steps t1, t4, t6, t7, t9 (active: Res 1; Res 1,2; Res 2): t1-t6, t4-t6, t1-t4, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9.
Max bandwidth from A to D per search step (window length in time slots in parentheses):
1. 900Mbps (3); 2. 100Mbps (2); 3. 100Mbps (5); 4. 900Mbps (1); 5. 100Mbps (3); 6. 100Mbps (6); 7. 900Mbps (2); 8. 900Mbps (3); 9. 100Mbps (5); 10. 100Mbps (8)
Resulting reservation: (A to D) (100Mbps), start = t1, end = t9.
41. Search Order - Time Windows
Shortest duration?
Windows searched (Res 3 active): t9-t13, t12-t13, t9-t12.
Max bandwidth from A to D per search step (window length in time slots in parentheses):
1. 200Mbps (3); 2. 900Mbps (1); 3. 200Mbps (4)
Resulting reservation: (A to D) (200Mbps), start = t9, end = t13.
• From A to D, max bandwidth = 200Mbps, volume = 175Mbps x 4 time slots, earliest start = t1, latest finish = t13:
  • earliest completion: (A to D) (100Mbps), start = t1, end = t8
  • shortest duration: (A to D) (200Mbps), start = t9, end = t12.5
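Putting the pieces together, a sketch of the flexible search loop over time windows, using max_bandwidth_path from the earlier sketch. The window ordering is supplied by the caller (per the search orders above) and all names are illustrative:

```python
from collections import namedtuple

FlexReq = namedtuple("FlexReq", "src dst max_bw volume t_earliest t_latest")

def flexible_reserve(windows, req):
    """windows: (t_lo, t_hi, graph) triples, pre-sorted by the desired
    criterion (e.g. earliest end time for earliest completion)."""
    for t_lo, t_hi, graph in windows:
        if t_hi <= req.t_earliest or t_lo >= req.t_latest:
            continue                               # outside the user's constraints
        bw = min(req.max_bw, max_bandwidth_path(graph, req.src, req.dst))
        if bw <= 0:
            continue
        start = max(t_lo, req.t_earliest)
        finish = start + req.volume / bw           # time slots needed at rate bw
        if finish <= min(t_hi, req.t_latest):
            return (bw, start, finish)             # first feasible hit wins
    return None                                    # no option satisfies R_s'
```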
42. Source > Network > Destination
[Figure: the A/B/C/D topology (800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps links) with two requests, n1 and n2]
Now we have multiple requests.
43. With start/end times
• Each transfer request has start and end times
• n transfer requests are given (each request has a specific amount of profit)
• The objective is to maximize the profit
  • If the profit is the same for each request, then the objective is to maximize the number of jobs in a given time period
• Unsplittable Flow Problem:
  • an undirected graph
  • route demand from source(s) to destination(s) and maximize/minimize the total profit/cost
The online scheduling method here is inspired by the Gale-Shapley algorithm (also known as the stable marriage problem).
44. Methodology
• Displace other jobs to open space for the new request
  • we can shift at most n jobs
• Never accept a job if it causes other committed jobs to break their criteria
• Planning ahead (gives an opportunity for co-allocation)
• Gives a polynomial approximation algorithm
• The preference converts the UFP problem into a Dijkstra path search
• Utilizes time windows/time steps for ranking (better than earliest deadline first)
  • earliest completion + shortest duration
  • minimize concurrency
• Even random ranking would work (relaxation in an NP-hard problem)
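A sketch of the admit-or-displace loop this slide describes: the new request may shift committed jobs to other feasible placements, but is rejected if any committed job would break its own criteria. The place callback (find a feasible placement given already-fixed jobs, e.g. via the flexible search above) and all names are illustrative:

```python
def try_admit(new_req, committed, place):
    """place(req, fixed) -> a feasible placement for req given the fixed
    jobs, or None. Returns a full plan admitting new_req, or None."""
    plan = []
    for req in [new_req] + committed:     # re-place each committed job at most once
        slot = place(req, plan)           # ranked: earliest completion, then
        if slot is None:                  # shortest duration, then concurrency
            return None                   # a committed job would break: reject
        plan.append((req, slot))
    return plan
```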
45. [Figure-only slide]
46. Recall Time Windows
Windows searched: t1-t6, t4-t6, t1-t4, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9.
Max bandwidth from A to D per search step (window length in time slots in parentheses):
1. 900Mbps (3); 2. 100Mbps (2); 3. 100Mbps (5); 4. 900Mbps (1); 5. 100Mbps (3); 6. 100Mbps (6); 7. 900Mbps (2); 8. 900Mbps (3); 9. 100Mbps (5); 10. 100Mbps (8)
Resulting reservation: (A to D) (100Mbps), start = t1, end = t9.
47. Test
In real life, the number of nodes and the number of reservations in a given search interval are limited.
See the AINA'13 paper for results and a comparison of different preference metrics.
48. Autonomic Provisioning System
• Generate constraints automatically (without user input):
  • volume (elephant flow?)
  • true deadline, if applicable
  • end-host resource availability
  • burst rate (fixed bandwidth, variable bandwidth)
• Update constraints according to feedback and monitoring
• Minimize operational cost
• An alternative to manual traffic engineering
What is the incentive to make correct reservations?
49. [Figure: wide-area SDN connecting Experimental facility A, Data Center 1, Data Center 2, and Data node B (web access)]
• (1) Experimental facility A generates 30T of data every day, and it needs to be stored in data center 2 before the next run, since local disk space is limited.
• (2) There is a reservation made between data centers 1 and 2. It is used to replicate data files, 1P total size, when new data is available in data center 2.
• (3) New results are published at data node B; we expect high traffic to download new simulation files for the next couple of months.
50. Example
• The experimental facility periodically transfers data (i.e., every night)
• Data replication happens occasionally, and it will take a week to move 1P of data. It could get delayed a couple of hours with no harm
• Wide-area download traffic will increase gradually; most of the traffic will be during the day
• We can dynamically increase the preference for download traffic in the mornings, give high priority to transferring data from the facility at night, and use the rest of the bandwidth for data replication (allocating some bandwidth to confirm that it finishes within a week, as usual)
51. Virtual Circuit Reservation Engine
Autonomic provisioning system + monitoring. The Reservation Engine:
• selects the optimal path/time/bandwidth
• maximizes the number of admitted requests
• increases overall system utilization and network efficiency
• dynamically updates the selected routing path for network efficiency
• modifies existing reservations dynamically to open space/time for new requests
52. Performance Engineer?
• Sample projects:
  • VSAN (Virtual SAN)
  • VVOL (Virtual Volumes)
• Important aspects of performance engineering:
  • Be a part of the initial development phase
  • Develop techniques to analyze performance problems
  • Make sure performance issues are addressed correctly
53. THANK YOU
Any questions/comments?
Mehmet Balman, mbalman@lbl.gov, http://balman.info
