Fault Tolerant Clustering (IEEE Services 2012)

fault tolerant clustering for workflows

  1. Fault Tolerant Clustering in Scientific Workflows
     Weiwei Chen, Ewa Deelman
     Information Sciences Institute, University of Southern California
  2. Outline
     • Introduction
     • Workflow and Failure Model
     • Fault Tolerant Clustering
     • Experiments
     • Task Specific Failures
     • Location Specific Failures
  3. Introduction
     • Task-based scientific workflows
       – Task
       – Job
     • Task clustering
       – Merges multiple small tasks into a job
       – Reduces scheduling and submit overhead
     • Fault tolerance in task clustering
       – Existing techniques underestimate or ignore the influence of failures
  4. Task Clustering
     • Task clustering
       – Horizontal clustering
       – Vertical clustering
       – Arbitrary clustering
     • Clustering factor (k): number of tasks in a job
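To make the clustering factor concrete, here is a minimal Python sketch of horizontal clustering. It assumes the tasks at one workflow level are independent and can simply be grouped in order; the function name `horizontal_cluster` is hypothetical and is not part of Pegasus.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def horizontal_cluster(tasks: Sequence[T], k: int) -> List[List[T]]:
    """Group same-level tasks into jobs holding at most k tasks each."""
    if k < 1:
        raise ValueError("clustering factor k must be at least 1")
    # Slice the task list into consecutive chunks of size k;
    # the last job may be smaller when len(tasks) is not a multiple of k.
    return [list(tasks[i:i + k]) for i in range(0, len(tasks), k)]


# Ten tasks with k = 4 give two full jobs and one job with the 2 leftovers.
jobs = horizontal_cluster(list(range(10)), k=4)
```

A larger k means fewer jobs and therefore fewer scheduling and submit delays, at the cost of more work lost when a job has to be retried.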
  5. System Overview
     [Figure: execution timelines without clustering and with clustering, showing the scheduling and submit delay and the resulting improvement]
  6. Task Failures and Job Failures
     • We focus only on transient failures and job retry.
     • We do not differentiate the causes of failures; we are concerned only with the average failure rate.
     • Assumption: a failure is a random event independent of workflow characteristics or execution environment.
     • Two categories
       – Task failure: a task fails; other tasks in the same job may not fail (e.g. application)
       – Job failure: a job fails, and all of its tasks fail (e.g. scheduling system)
  7. Influence of Failures on Clustering
     Symbol        Meaning
     $t_{total}$   Estimated overall runtime
     $n$           Number of tasks to run
     $t$           Runtime of a single task
     $r$           Number of available resources
     $d$           Time delay between jobs
     $N$           Expected retry times for a single task
     $k$           Number of tasks in a job
     $\beta$       Job failure rate
     $\alpha$      Task failure rate
     • Target function: $\min(t_{total})$, given $n$ tasks to run on $r$ resources, where either
       – the task failure rate ($\alpha$) is measurable (Task Failure Model), or
       – the job failure rate ($\beta$) is measurable (Job Failure Model)
     • Assumption: $n \gg r$, but $n/k \gg r$
  8. Job Failure Model (symbols as defined on slide 7)
     • Runtime for a single job: $t_{job} = kt + d$
     • Average retry times for a single job: $N_{job} = \dfrac{1}{1-\beta}$
     • Retry times for all jobs:
       $$N_{total} = \begin{cases} \dfrac{N_{job}\,n}{rk}, & \text{if } \frac{n}{k} \ge r \\ N_{job}, & \text{if } \frac{n}{k} < r \end{cases}$$
     • Overall runtime: $t_{total} = t_{job}\,N_{total}$
       $$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\beta)}, & \text{if } \frac{n}{k} \ge r \\ N_{job}(kt+d) = \dfrac{kt+d}{1-\beta}, & \text{if } \frac{n}{k} < r \end{cases}$$
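The job failure model above can be evaluated directly. The following is a minimal Python sketch of the two-branch formula; the function name `jfm_total_runtime` is an assumption made for illustration, and the parameters follow the symbols of the slide.

```python
def jfm_total_runtime(n: int, t: float, d: float, r: int, k: int, beta: float) -> float:
    """Estimated overall runtime under the job failure model."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("job failure rate beta must be in [0, 1)")
    t_job = k * t + d                 # runtime of one clustered job
    n_job = 1.0 / (1.0 - beta)        # average tries per job
    if n / k >= r:
        # More jobs than resources: jobs run in n / (r * k) rounds.
        n_total = n_job * n / (r * k)
    else:
        # Fewer jobs than resources: everything runs in one round.
        n_total = n_job
    return t_job * n_total


# n = 1000 tasks of 5 sec each, 5 sec delay, 20 resources, k = n / r = 50.
runtime = jfm_total_runtime(n=1000, t=5, d=5, r=20, k=50, beta=0.5)
```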
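  9. Job Failure Model
     $$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\beta)}, & \text{if } \frac{n}{k} \ge r \\ N_{job}(kt+d) = \dfrac{kt+d}{1-\beta}, & \text{if } \frac{n}{k} < r \end{cases}$$
     • $k^*$ is independent of $\beta$.
     • It is not necessary to adjust $k$; just set it to $k^* = \dfrac{n}{r}$, which gives $t^*_{total} = \dfrac{k^* t + d}{1-\beta}$.
     [Figure parameters: n = 1000, t = 5 sec, d = 5 sec, r = 20]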
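A short worked step shows why $k^* = n/r$ minimizes the job failure model and what it gives for the parameters quoted on this slide. The value $\beta = 0.2$ in the last line is an illustrative assumption, not a figure from the slides.

```latex
\begin{align*}
\frac{n}{k} \ge r:&\quad t_{total} = \frac{n}{r(1-\beta)}\left(t + \frac{d}{k}\right)
    \text{ decreases as } k \text{ grows} \\
\frac{n}{k} < r:&\quad t_{total} = \frac{kt + d}{1-\beta}
    \text{ increases as } k \text{ grows} \\
\Rightarrow\ & k^{*} = \frac{n}{r} = \frac{1000}{20} = 50, \qquad
    t^{*}_{total} = \frac{50 \cdot 5 + 5}{1-\beta} = \frac{255}{1-\beta}\ \text{sec} \\
\beta = 0.2:&\quad t^{*}_{total} = \frac{255}{0.8} = 318.75\ \text{sec}
\end{align*}
```

Because $\beta$ only scales both branches by the same factor $1/(1-\beta)$, it does not move the location of the minimum.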
  10. Task Failure Model (symbols as defined on slide 7)
      • Runtime for a single job: $t_{job} = kt + d$
      • Average retry times for a single job: $N_{job} = \dfrac{1}{(1-\alpha)^{k}}$
      • Retry times for all jobs:
        $$N_{total} = \begin{cases} \dfrac{N_{job}\,n}{rk}, & \text{if } \frac{n}{k} \ge r \\ N_{job}, & \text{if } \frac{n}{k} < r \end{cases}$$
      • Overall runtime: $t_{total} = t_{job}\,N_{total}$
        $$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^{k}}, & \text{if } \frac{n}{k} \ge r \\ N_{job}(kt+d) = \dfrac{kt+d}{(1-\alpha)^{k}}, & \text{if } \frac{n}{k} < r \end{cases}$$
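The retry term $N_{job} = 1/(1-\alpha)^k$ follows from treating each attempt as an independent trial that succeeds only if all $k$ tasks succeed. The sketch below checks that reading with a small Monte Carlo run; it assumes the whole job is retried whenever any task fails, and the function names are hypothetical.

```python
import random


def expected_attempts(alpha: float, k: int) -> float:
    """Closed form from the slide: N_job = 1 / (1 - alpha)^k."""
    return 1.0 / (1.0 - alpha) ** k


def simulated_attempts(alpha: float, k: int, trials: int = 20000, seed: int = 0) -> float:
    """Average number of tries until a job of k tasks finishes without a task failure."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        attempts = 1
        # The job fails if any of its k tasks fails; then the whole job is retried.
        while any(rng.random() < alpha for _ in range(k)):
            attempts += 1
        total += attempts
    return total / trials
```

For example, with `alpha=0.05` and `k=10` the simulated average should land close to `expected_attempts(0.05, 10)`, and both grow quickly as k increases.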
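  11. Task Failure Model
      $$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^{k}}, & \text{if } \frac{n}{k} \ge r \\ N_{job}(kt+d) = \dfrac{kt+d}{(1-\alpha)^{k}}, & \text{if } \frac{n}{k} < r \end{cases}$$
      • $k^*$ depends on $\alpha$, so it is necessary to adjust $k$ according to $\alpha$.
      $$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4d}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
      $$t^*_{total} = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$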
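The dependence of $k^*$ on $\alpha$ can be traced by minimizing the $n/k \ge r$ branch of $t_{total}$ over $k$. The steps below are a sketch worked from the expression on this slide; the shorthand $c = -\ln(1-\alpha)$ is introduced here only for convenience.

```latex
\begin{align*}
f(k) &= \frac{kt + d}{k\,(1-\alpha)^{k}}, \qquad c = -\ln(1-\alpha) > 0 \\
\frac{d}{dk}\ln f(k) &= \frac{t}{kt + d} - \frac{1}{k} + c = 0 \\
\Longrightarrow\quad & c\,t\,k^{2} + c\,d\,k - d = 0 \\
k^{*} &= \frac{-d + \sqrt{d^{2} + 4td/c}}{2t}
       = \frac{-d + \sqrt{d^{2} - \dfrac{4td}{\ln(1-\alpha)}}}{2t}
\end{align*}
```

Carried through this way, the term under the radical is $4td/\ln(1-\alpha)$, which coincides with the $4d/\ln(1-\alpha)$ shown on the slide when $t = 1$. As $\alpha \to 0$ the root grows without bound, so in that limit the clustering factor is capped by $n/r$ as in the job failure model.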
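  12. Comparing TFM and JFM
      1. Linear increase vs exponential increase
      2. Optimal clustering factor
      • JFM: $k^* = \dfrac{n}{r}$, $\quad t^*_{total} = \dfrac{k^* t + d}{1-\beta}$
      • TFM: $k^* = \dfrac{-d + \sqrt{d^2 - \dfrac{4d}{\ln(1-\alpha)}}}{2t}$, if $n \gg r$, $\quad t^*_{total} = \dfrac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$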
  13. Fault Tolerant Clustering
      • Job Failure Model: k = n/r
      • Selective Reclustering (SR)
        – Selects the failed tasks in a clustered job and clusters them into a new clustered job
        – Requires the identification of failed tasks
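Selective Reclustering can be sketched in a few lines of Python. This assumes the system reports a per-task success flag for each clustered job; the function name and the `succeeded` mapping are hypothetical, and a task with no reported status is treated as failed.

```python
from typing import Dict, List


def selective_recluster(job_tasks: List[str], succeeded: Dict[str, bool]) -> List[str]:
    """Return a new clustered job that contains only the failed tasks."""
    # Tasks that completed are dropped; only failed (or unreported) tasks are retried.
    return [task for task in job_tasks if not succeeded.get(task, False)]


retry_job = selective_recluster(
    ["t1", "t2", "t3", "t4"],
    {"t1": True, "t2": False, "t3": True, "t4": False},
)
```

Compared with retrying the whole job, only the failed tasks are executed again, which is why the method depends on being able to identify them.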
  14. Fault Tolerant Clustering
      • Dynamic Clustering (DC)
        – Adjusts the clustering factor dynamically according to the task failure rates
      $$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4d}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
      $$t^*_{total,DC} = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$
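One way to picture Dynamic Clustering is as a loop that re-estimates $\alpha$ from observed task outcomes and then picks the integer $k$ that minimizes the task failure model. The sketch below is an assumption-laden illustration, not the authors' implementation: it scans integer values of $k$ instead of using the closed form, and the function names are hypothetical.

```python
def tfm_total_runtime(n: int, t: float, d: float, r: int, k: int, alpha: float) -> float:
    """Estimated overall runtime under the task failure model."""
    n_job = 1.0 / (1.0 - alpha) ** k   # average tries per job of k tasks
    t_job = k * t + d
    if n / k >= r:
        return t_job * n_job * n / (r * k)
    return t_job * n_job


def dynamic_clustering_factor(n: int, t: float, d: float, r: int,
                              failed: int, attempted: int) -> int:
    """Pick the integer k that minimizes the estimated runtime for the observed failure rate."""
    alpha = failed / attempted if attempted else 0.0
    alpha = min(alpha, 0.999)          # keep 1 - alpha strictly positive
    # Beyond k = n / r the runtime only grows with k, so the scan stops there.
    k_max = max(1, n // r)
    return min(range(1, k_max + 1),
               key=lambda k: tfm_total_runtime(n, t, d, r, k, alpha))
```

With no observed failures the scan returns n/r, matching the job failure model; as the observed failure rate rises, the chosen k shrinks toward 1.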
  15. Fault Tolerant Clustering
      • Dynamic Reclustering (DR)
        – A combination of SR and DC
  16. Evaluation
      • Simulations are based on real traces of workflows run by the Pegasus group.
      • Each workflow was simulated 100 times so that the standard deviation is less than 10%.
      • Two workflows were used.
      • 20 worker nodes were used in each experiment.
  17. Workflows Used
      • Montage
        – An astronomy application used to construct large image mosaics of the sky
        – Montage has complex data dependencies between tasks
        – 10,422 tasks, 57 GB data
      Image from http://montage.ipac.caltech.edu/
  18. Workflows Used
      • Periodogram
        – Identifies periodic signals from light curves that arise from transiting planets
        – 216,600 tasks, 19 GB input data
        – Periodogram has only one level
      Image from http://pegasus.isi.edu/presentations/2011/sci709-voeckler-talk.ppt/
  19. Simulator
      • Extension to CloudSim
        – Workflow Engine
        – Clustering Engine
        – Scheduler
        – Failure Generator
        – Failure Monitor
  20. Performance
      • NOOP: no optimization (k = n/r)
      • DC (Dynamic Clustering)
      • SR (Selective Reclustering)
      • DR (Dynamic Reclustering)
      • Overall runtime in seconds
  21. Performance
      • Periodogram
  22. Performance
      • Montage
  23. Task Specific Failure Detection (TSFD)
      • Task failures are related to the type of task.
      • The Failure Monitor classifies failures based on task type.
      • The Clustering Engine merges tasks based on their different task failure rates.
      • In this Montage experiment, the task failure rate of mProjectPP and mDiffFit is set to 0.001, while that of mBackground ranges from 0.2 to 0.8.

      Optimization methods
      α1     DR       DR+TSFD   DC        DC+TSFD
      0.2    10415    10412     13804     13820
      0.4    11830    11839     22946     22923
      0.6    14704    14688     60429     60414
      0.8    23238    23229     436638    435297
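The classification step of the Failure Monitor can be pictured as a set of per-key counters. The class below is a minimal sketch under that assumption; the class and method names are hypothetical and do not reflect the simulator's actual interface. The key here is the task type, such as `mBackground`.

```python
from collections import defaultdict
from typing import DefaultDict


class FailureMonitor:
    """Track attempts and failures per key (for TSFD the key is the task type)."""

    def __init__(self) -> None:
        self._attempts: DefaultDict[str, int] = defaultdict(int)
        self._failures: DefaultDict[str, int] = defaultdict(int)

    def record(self, key: str, failed: bool) -> None:
        """Register one task execution and whether it failed."""
        self._attempts[key] += 1
        if failed:
            self._failures[key] += 1

    def failure_rate(self, key: str) -> float:
        """Observed failure rate for the key; 0.0 when nothing was recorded yet."""
        attempts = self._attempts.get(key, 0)
        return self._failures.get(key, 0) / attempts if attempts else 0.0


monitor = FailureMonitor()
for failed in (True, False, False, True):
    monitor.record("mBackground", failed)
monitor.record("mProjectPP", False)
```

A clustering engine could then look up a separate rate for each task type and choose a clustering factor per type rather than one global value.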
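  24. Task Failure Model
      $$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^{k}}, & \text{if } \frac{n}{k} \ge r \\ N_{job}(kt+d) = \dfrac{kt+d}{(1-\alpha)^{k}}, & \text{if } \frac{n}{k} < r \end{cases}$$
      • $t_{total}$ is not sensitive to $\alpha$.
      $$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4d}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
      $$t^*_{total} = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$
      • Simplification of failures is acceptable.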
  25. Location Specific Failure Detection (LSFD)
      • Task failures are related to the location of execution.
      • The Failure Monitor classifies failures based on resource id.
      • The Scheduler orders resources based on their reliability.
      • Two out of twenty nodes have a higher task failure rate (from 0.2 to 0.8), while the others still have a task failure rate of 0.001.
      • DC generates many small tasks if the task failure rate is high.
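Ordering resources by reliability amounts to sorting them by their observed failure rate. The following Python sketch assumes the per-resource rates are already available as a mapping from resource id to failure rate; the function name and the node ids are hypothetical.

```python
from typing import Dict, List


def order_by_reliability(failure_rates: Dict[str, float]) -> List[str]:
    """Return resource ids from most reliable (lowest failure rate) to least reliable."""
    # Ties are broken by resource id so that the ordering is deterministic.
    return sorted(failure_rates, key=lambda rid: (failure_rates[rid], rid))


ordered = order_by_reliability(
    {"node-03": 0.001, "node-01": 0.4, "node-02": 0.001, "node-04": 0.8}
)
```

A scheduler that fills resources in this order would place work on the two unreliable nodes last.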
  26. Conclusion
      • We present three basic methods to improve fault tolerance in task clustering.
      • If the system supports identification of failed tasks, dynamic reclustering performs best.
      • Otherwise, use dynamic clustering.
      • The improvement is significant even for a very basic method.
  27. Future Work
      • Vertical clustering and arbitrary clustering
      • Intelligent scheduler
      • More workflow examples
      • Distribution of failures
  28. Questions?
      • Thank you for coming!
      • For further info, please visit pegasus.isi.edu or email wchen@isi.edu
  29. Refinements
      • When $n \gg r$ does not hold at the end of execution:
        – Default: $k_{actual} = k^*$, $\quad n_{jobs} = \dfrac{n_{task}}{k^*} < r$
        – Replicative: $k_{actual} = k^*$, $\quad n_{jobs} = r$, replicate jobs by $\dfrac{r}{n_{task}/k^*}$
        – Even: $k_{actual} = \dfrac{n_{task}}{r}$, $\quad n_{jobs} = r$
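The three end-of-execution policies can be compared side by side with a small helper. This is a sketch of one reading of the slide, with several assumptions: the rounding choices (ceiling for job counts, floor for the replication factor), the function name `tail_refinement`, and the returned dictionary layout are all illustrative.

```python
import math
from typing import Dict


def tail_refinement(n_task: int, r: int, k_star: int, policy: str) -> Dict[str, int]:
    """Clustering factor, job count, and replication factor for the remaining tasks."""
    jobs_default = math.ceil(n_task / k_star)   # fewer jobs than resources in the tail
    if policy == "default":
        return {"k_actual": k_star, "n_jobs": jobs_default, "replicas": 1}
    if policy == "replicative":
        # Keep k*, and use idle resources to run copies of each job.
        replicas = max(1, r // jobs_default)
        return {"k_actual": k_star, "n_jobs": jobs_default * replicas, "replicas": replicas}
    if policy == "even":
        # Spread the remaining tasks evenly over all r resources.
        k_actual = math.ceil(n_task / r)
        return {"k_actual": k_actual, "n_jobs": math.ceil(n_task / k_actual), "replicas": 1}
    raise ValueError(f"unknown policy: {policy}")
```

For instance, with 100 remaining tasks, 20 resources, and k* = 50, the default policy keeps 2 jobs, the replicative policy runs 10 copies of each, and the even policy uses 20 jobs of 5 tasks.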
  30. Dynamic Performance
      • TFM and DC
