Fault Tolerant Clustering (IEEE Services 2012)

fault tolerant clustering for workflows

Published in: Technology

Transcript

  • 1. Fault Tolerant Clustering in Scientific Workflows
      Weiwei Chen, Ewa Deelman
      Information Sciences Institute, University of Southern California
  • 2. Outline
      • Introduction
      • Workflow and Failure Model
      • Fault Tolerant Clustering
      • Experiments
      • Task Specific Failures
      • Location Specific Failures
  • 3. Introduction
      • Task-based scientific workflows
          – Task
          – Job
      • Task clustering
          – Merges multiple small tasks into a job
          – Reduces scheduling and submit overhead
      • Fault tolerance in task clustering
          – Existing techniques underestimate or ignore the influence of failures
  • 4. Task Clustering
      • Task clustering
          – Horizontal clustering
          – Vertical clustering
          – Arbitrary clustering
      • Clustering factor (k): number of tasks in a job
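
As a minimal sketch of what the clustering factor means in practice, the following Python snippet merges a flat list of tasks into jobs of at most k tasks. It assumes that horizontal clustering groups tasks taken from the same workflow level, which the slides do not spell out; the function name `horizontal_cluster` and the task labels are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def horizontal_cluster(tasks: Sequence[T], k: int) -> List[List[T]]:
    """Group tasks (assumed to share one workflow level) into jobs of at most k tasks."""
    if k < 1:
        raise ValueError("clustering factor k must be >= 1")
    # Slice the task list into consecutive chunks of size k; the last job may be smaller.
    return [list(tasks[i:i + k]) for i in range(0, len(tasks), k)]


# Hypothetical example: 10 tasks with clustering factor k = 4.
jobs = horizontal_cluster([f"task{i}" for i in range(10)], k=4)
print([len(job) for job in jobs])
```

With k = 1 every task becomes its own job, which corresponds to running without clustering.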
  • 5. System Overview
      [Figure: timeline comparing execution without clustering and with clustering, marking the scheduling and submit delay and the resulting improvement]
  • 6. Task Failures and Job Failures
      • We focus only on transient failures and job retry.
      • We do not differentiate the causes of failures; we are concerned only with the average failure rate.
      • Assumption: a failure is a random event independent of workflow characteristics and execution environment.
      • Two categories
          – Task failure: a task fails; other tasks in the same job may not fail (e.g., application)
          – Job failure: a job fails, and all of its tasks fail (e.g., scheduling system)
  • 7. Influence of Failures on Clustering
      Symbol       Meaning
      $t_{total}$  Estimated overall runtime
      $n$          Number of tasks to run
      $t$          Runtime of a single task
      $r$          Number of available resources
      $d$          Time delay between jobs
      $N$          Expected retry times for a single task
      $k$          Number of tasks in a job
      $\beta$      Job failure rate
      $\alpha$     Task failure rate
      • Target function: $\min(t_{total})$, given $n$ tasks to run on $r$ resources, where either the task failure rate ($\alpha$) is measurable (Task Failure Model) or the job failure rate ($\beta$) is measurable (Job Failure Model)
      • Assumption: $n \gg r$, but $n/k \gg r$
  • 8. Job Failure Model
      • Runtime for a single job: $t_{job} = kt + d$
      • Average retry times for a single job: $N_{job} = \dfrac{1}{1-\beta}$
      • Retry times for all jobs:
        $$N_{total} = \begin{cases} N_{job}\,\dfrac{n}{rk}, & \text{if } \dfrac{n}{k} \ge r \\[2mm] N_{job}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
      • Overall runtime: $t_{total} = t_{job}\,N_{total}$, that is,
        $$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\beta)}, & \text{if } \dfrac{n}{k} \ge r \\[2mm] N_{job}(kt+d) = \dfrac{kt+d}{1-\beta}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
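
The Job Failure Model estimate can be evaluated directly. The sketch below transcribes the piecewise formula for $t_{total}$ into Python; the function name `jfm_total_runtime` and the sample values (taken from the parameters quoted on the next slide, plus an assumed β = 0.2 and k = 50) are illustrative choices, not part of the slides.

```python
def jfm_total_runtime(n: int, t: float, r: int, d: float, k: int, beta: float) -> float:
    """Estimated overall runtime t_total under the Job Failure Model."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("job failure rate beta must be in [0, 1)")
    if k < 1:
        raise ValueError("clustering factor k must be >= 1")
    t_job = k * t + d              # runtime of one clustered job, including the delay d
    n_job = 1.0 / (1.0 - beta)     # expected number of times one job is run
    if n / k >= r:
        n_total = n_job * n / (r * k)  # more jobs than resources: several rounds
    else:
        n_total = n_job                # all jobs fit on the resources at once
    return t_job * n_total


# Assumed example values: n = 1000, t = 5 s, d = 5 s, r = 20, k = 50, beta = 0.2.
estimate = jfm_total_runtime(n=1000, t=5.0, r=20, d=5.0, k=50, beta=0.2)
print(round(estimate, 2))
```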
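  • 9. Job Failure Model
        $$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\beta)}, & \text{if } \dfrac{n}{k} \ge r \\[2mm] N_{job}(kt+d) = \dfrac{kt+d}{1-\beta}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
      • $k^*$ is independent of $\beta$. It is not necessary to adjust $k$; just set it to $n/r$:
        $$k^* = \frac{n}{r}, \qquad t^*_{total} = \frac{k^* t + d}{1-\beta}$$
      • Parameters: n = 1000, t = 5 sec, d = 5 sec, r = 20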
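
The claim that $k^*$ does not depend on β can be checked numerically. The following sketch scans all integer clustering factors for the parameters quoted on the slide (n = 1000, t = 5, d = 5, r = 20) and several assumed values of β; the helper names are hypothetical.

```python
def jfm_total_runtime(n, t, r, d, k, beta):
    """Piecewise Job Failure Model estimate of the overall runtime."""
    rounds = n / (r * k) if n / k >= r else 1.0
    return (k * t + d) * rounds / (1.0 - beta)


def best_clustering_factor(n, t, r, d, beta):
    """Integer k in [1, n] that minimizes the estimated overall runtime."""
    return min(range(1, n + 1), key=lambda k: jfm_total_runtime(n, t, r, d, k, beta))


# The beta values are assumptions chosen for illustration.
best = {beta: best_clustering_factor(1000, 5.0, 20, 5.0, beta)
        for beta in (0.0, 0.2, 0.5, 0.8)}
print(best)
```

Because β enters only through the common factor 1/(1 − β), it rescales the whole curve without moving its minimum, which sits at k = n/r = 50 here.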
  • 10. Task Failure Model
      • Runtime for a single job: $t_{job} = kt + d$
      • Average retry times for a single job: $N_{job} = \dfrac{1}{(1-\alpha)^k}$
      • Retry times for all jobs:
        $$N_{total} = \begin{cases} N_{job}\,\dfrac{n}{rk}, & \text{if } \dfrac{n}{k} \ge r \\[2mm] N_{job}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
      • Overall runtime: $t_{total} = t_{job}\,N_{total}$, that is,
        $$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^k}, & \text{if } \dfrac{n}{k} \ge r \\[2mm] N_{job}(kt+d) = \dfrac{kt+d}{(1-\alpha)^k}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
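
The factor $1/(1-\alpha)^k$ follows from treating a job as successful only when all k of its tasks succeed in the same attempt, with each task failing independently with probability α. The Monte Carlo sketch below illustrates this under exactly those assumptions; the function name, the seed, and the sample values k = 5 and α = 0.1 are hypothetical.

```python
import random


def simulate_attempts(k: int, alpha: float, trials: int, seed: int = 0) -> float:
    """Average number of times a k-task job is run until every task succeeds at once."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        attempts = 1
        # The job is retried as a whole whenever at least one of its k tasks fails.
        while any(rng.random() < alpha for _ in range(k)):
            attempts += 1
        total += attempts
    return total / trials


k, alpha = 5, 0.1
simulated = simulate_attempts(k, alpha, trials=20000)
expected = 1.0 / (1.0 - alpha) ** k
print(simulated, expected)
```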
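  • 11. Task Failure Model
        $$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^k}, & \text{if } \dfrac{n}{k} \ge r \\[2mm] N_{job}(kt+d) = \dfrac{kt+d}{(1-\alpha)^k}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
      • $k^*$ depends on $\alpha$, so it is necessary to adjust $k$ according to $\alpha$.
      • Setting $\partial t_{total}/\partial k = 0$ in the $n/k \ge r$ branch gives $t k^2 + d k + \dfrac{d}{\ln(1-\alpha)} = 0$, whose positive root is
        $$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
        $$t^*_{total} = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$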
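
As a hedged cross-check of the optimal clustering factor, the sketch below compares the positive root of $t k^2 + d k + d/\ln(1-\alpha) = 0$ (the stationarity condition of the $n/k \ge r$ branch of $t_{total}$) with a brute-force scan over integer k. The function names, the scan bound `k_max`, and the sample values of α are assumptions made for illustration.

```python
import math


def tfm_cost(k: int, t: float, d: float, alpha: float) -> float:
    """(k t + d) / (k (1 - alpha)^k): t_total for n/k >= r, without the constant n/r."""
    return (k * t + d) / (k * (1.0 - alpha) ** k)


def optimal_k_closed_form(t: float, d: float, alpha: float) -> float:
    """Positive root of t k^2 + d k + d / ln(1 - alpha) = 0."""
    discriminant = d * d - 4.0 * t * d / math.log(1.0 - alpha)
    return (-d + math.sqrt(discriminant)) / (2.0 * t)


def optimal_k_by_scan(t: float, d: float, alpha: float, k_max: int = 1000) -> int:
    """Integer k in [1, k_max] with the smallest estimated runtime."""
    return min(range(1, k_max + 1), key=lambda k: tfm_cost(k, t, d, alpha))


for alpha in (0.001, 0.01, 0.1, 0.5):
    print(alpha, optimal_k_closed_form(5.0, 5.0, alpha), optimal_k_by_scan(5.0, 5.0, alpha))
```

The real-valued root is not an integer in general, so in practice it would be rounded (and clamped to at least 1) before being used as a clustering factor.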
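  • 12. Comparing TFM and JFM
      1. Linear increase vs. exponential increase
      2. Optimal clustering factor
      • JFM:
        $$k^* = \frac{n}{r}, \qquad t^*_{total} = \frac{k^* t + d}{1-\beta}$$
      • TFM:
        $$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
        $$t^*_{total} = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$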
  • 13. Fault Tolerant Clustering
      • Job Failure Model: $k = n/r$
      • Selective Reclustering (SR)
          – Selects the failed tasks in a clustered job and clusters them into a new clustered job
          – Requires the identification of failed tasks
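
A minimal sketch of Selective Reclustering, assuming the system can report per-task success: after each attempt, only the failed tasks are merged into the next clustered job. The function `run_with_selective_reclustering`, the `max_attempts` cap, and the `flaky_run` executor are hypothetical stand-ins, not part of the authors' simulator.

```python
from typing import Callable, List, Sequence, Tuple


def run_with_selective_reclustering(
    tasks: Sequence[str],
    run_task: Callable[[str], bool],
    max_attempts: int = 10,
) -> Tuple[List[List[str]], List[str]]:
    """Retry only failed tasks; return the jobs submitted and any tasks still failing."""
    submitted: List[List[str]] = []
    pending = list(tasks)
    while pending and len(submitted) < max_attempts:
        submitted.append(pending)
        # Keep only the tasks that failed; they form the next clustered job.
        pending = [task for task in pending if not run_task(task)]
    return submitted, pending


# Hypothetical executor: each task fails a fixed number of times, then succeeds.
failures_left = {"a": 0, "b": 1, "c": 2}


def flaky_run(task: str) -> bool:
    if failures_left[task] > 0:
        failures_left[task] -= 1
        return False
    return True


jobs, unfinished = run_with_selective_reclustering(["a", "b", "c"], flaky_run)
print(jobs, unfinished)
```

Each retried job is smaller than the previous one, so successful tasks are never rerun.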
  • 14. Fault Tolerant Clustering
      • Dynamic Clustering (DC)
          – Adjusts the clustering factor dynamically according to the task failure rates
        $$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
        $$t^*_{total,DC} = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$
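
One way Dynamic Clustering could be sketched: keep a running estimate of the task failure rate α from observed outcomes, and pick the integer k that minimizes the Task Failure Model cost $(kt+d)/(k(1-\alpha)^k)$. The class name, the upper bound `k_max`, and the use of a direct integer scan instead of a closed form are assumptions of this sketch.

```python
class DynamicClustering:
    """Chooses a clustering factor from the task failure rate observed so far."""

    def __init__(self, t: float, d: float, k_max: int) -> None:
        self.t = t          # runtime of a single task
        self.d = d          # delay between jobs
        self.k_max = k_max  # assumed upper bound on k, for example n / r
        self.runs = 0
        self.failures = 0

    def record(self, succeeded: bool) -> None:
        """Register the outcome of one task execution."""
        self.runs += 1
        if not succeeded:
            self.failures += 1

    def alpha(self) -> float:
        """Observed task failure rate; 0.0 before any task has run."""
        return self.failures / self.runs if self.runs else 0.0

    def clustering_factor(self) -> int:
        """Integer k in [1, k_max] minimizing (k t + d) / (k (1 - alpha)^k)."""
        a = self.alpha()
        if a >= 1.0:
            return 1  # every task fails: keep jobs as small as possible
        return min(
            range(1, self.k_max + 1),
            key=lambda k: (k * self.t + self.d) / (k * (1.0 - a) ** k),
        )


dc = DynamicClustering(t=5.0, d=5.0, k_max=50)
k_before = dc.clustering_factor()
for i in range(100):
    dc.record(succeeded=(i % 5 != 0))  # hypothetical trace: 20 failures in 100 runs
k_after = dc.clustering_factor()
print(k_before, k_after)
```

With no observed failures the sketch clusters as aggressively as allowed; once failures appear, it shrinks the jobs.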
  • 15. Fault Tolerant Clustering
      • Dynamic Reclustering (DR)
          – A combination of SR and DC
  • 16. Evaluation
      • Simulations are based on real traces of workflows run by the Pegasus group.
      • Each workflow was simulated 100 times so that the standard deviation is less than 10%.
      • Two workflows were used.
      • 20 worker nodes were used in each experiment.
  • 17. Workflows Used
      • Montage
          – An astronomy application used to construct large image mosaics of the sky
          – Montage has complex data dependencies between tasks
          – 10,422 tasks, 57 GB data
      Image from http://montage.ipac.caltech.edu/
  • 18. Workflows Used
      • Periodogram
          – Identifies periodic signals in light curves that arise from transiting planets
          – 216,600 tasks, 19 GB input data
          – Periodogram has only one level
      Image from http://pegasus.isi.edu/presentations/2011/sci709-voeckler-talk.ppt/
  • 19. Simulator
      • Extension to CloudSim
          – Workflow Engine
          – Clustering Engine
          – Scheduler
          – Failure Generator
          – Failure Monitor
  • 20. Performance
      • NOOP: no optimization ($k = n/r$)
      • DC (Dynamic Clustering)
      • SR (Selective Reclustering)
      • DR (Dynamic Reclustering)
      • Overall runtime in seconds
  • 21. Performance
      • Periodogram
  • 22. Performance
      • Montage
  • 23. Task Specific Failure Detection (TSFD)
      • Task failures are related to the type of task.
      • The Failure Monitor classifies failures based on the task type.
      • The Clustering Engine merges tasks based on their different task failure rates.
      • In this Montage experiment, the task failure rate of mProjectPP and mDiffFit is set to 0.001, while that of mBackground ranges from 0.2 to 0.8.

               Optimization Methods
      α1     DR       DR+TSFD   DC        DC+TSFD
      0.2    10415    10412     13804     13820
      0.4    11830    11839     22946     22923
      0.6    14704    14688     60429     60414
      0.8    23238    23229     436638    435297
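
A rough sketch of the bookkeeping that task-specific failure detection needs: a monitor that tracks failure rates separately for each task type, so that a clustering engine could later choose a different clustering factor per type. The class `TaskTypeFailureMonitor` and the recorded outcomes are hypothetical; only the task type names come from the slide.

```python
from collections import defaultdict
from typing import Dict


class TaskTypeFailureMonitor:
    """Tracks observed failure rates separately for each task type."""

    def __init__(self) -> None:
        self._runs: Dict[str, int] = defaultdict(int)
        self._failures: Dict[str, int] = defaultdict(int)

    def record(self, task_type: str, succeeded: bool) -> None:
        """Register the outcome of one execution of a task of the given type."""
        self._runs[task_type] += 1
        if not succeeded:
            self._failures[task_type] += 1

    def failure_rate(self, task_type: str) -> float:
        """Observed failure rate; 0.0 for a type that has not run yet."""
        runs = self._runs.get(task_type, 0)
        return self._failures.get(task_type, 0) / runs if runs else 0.0


monitor = TaskTypeFailureMonitor()
for i in range(1000):
    monitor.record("mProjectPP", succeeded=(i != 0))   # made-up trace: 1 failure in 1000
for i in range(10):
    monitor.record("mBackground", succeeded=(i >= 4))  # made-up trace: 4 failures in 10

print(monitor.failure_rate("mProjectPP"), monitor.failure_rate("mBackground"))
```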
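  • 24. Task Failure Model
        $$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^k}, & \text{if } \dfrac{n}{k} \ge r \\[2mm] N_{job}(kt+d) = \dfrac{kt+d}{(1-\alpha)^k}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
      • $t_{total}$ is not sensitive to $\alpha$
        $$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
        $$t^*_{total} = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$
      • Simplification of failures is acceptable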
  • 25. Location Specific Failure Detection (LSFD)
      • Task failures are related to the location of execution.
      • The Failure Monitor classifies failures based on resource id.
      • The Scheduler orders resources based on their reliability.
      • Two out of twenty nodes have a higher task failure rate (from 0.2 to 0.8), while the others still have a task failure rate of 0.001.
      • DC generates many small tasks if the task failure rate is high.
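
The scheduler's reliability ordering could look roughly like the sketch below, which sorts resource ids by their observed task failure rate. The function `order_by_reliability`, the `(failures, runs)` layout, the node names, and the choice to treat unobserved resources as fully reliable are all assumptions.

```python
from typing import Dict, List, Tuple


def order_by_reliability(stats: Dict[str, Tuple[int, int]]) -> List[str]:
    """Sort resource ids from most to least reliable.

    stats maps a resource id to (failures, runs) as counted by a failure monitor.
    Resources with no recorded runs are treated as having a failure rate of 0.0.
    """
    def failure_rate(resource_id: str) -> float:
        failures, runs = stats[resource_id]
        return failures / runs if runs else 0.0

    return sorted(stats, key=failure_rate)


# Hypothetical counts: node2 is one of the unreliable nodes.
observed = {"node1": (1, 1000), "node2": (40, 100), "node3": (0, 50)}
print(order_by_reliability(observed))
```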
  • 26. Conclusion
      • We present three basic methods to improve fault tolerance in task clustering.
      • If the system supports identification of failed tasks, dynamic reclustering performs best.
      • Otherwise, use dynamic clustering.
      • The improvement is significant even for a very basic method.
  • 27. Future Work
      • Vertical clustering and arbitrary clustering
      • Intelligent scheduler
      • More workflow examples
      • Distribution of failures
  • 28. Questions?
      • Thank you for coming!
      • For further information, please visit pegasus.isi.edu or email wchen@isi.edu
  • 29. Refinements
      • When $n \gg r$ does not hold at the end of execution:
      • Default: $k_{actual} = k^*$, $n_{jobs} = \dfrac{n_{task}}{k^*} < r$
      • Replicative: $k_{actual} = k^*$, $n_{jobs} = r$; replicate jobs by $\dfrac{r}{n_{task}/k^*}$
      • Even: $k_{actual} = \dfrac{n_{task}}{r}$, $n_{jobs} = r$
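
The three refinements can be read as different ways of filling r resources when fewer than r jobs remain. The sketch below is one possible interpretation: the function `refine`, its return layout, the rounding with `ceil` and integer division, and the sample numbers are assumptions rather than the authors' implementation.

```python
import math
from typing import Tuple


def refine(n_task: int, r: int, k: int, strategy: str) -> Tuple[int, int, int]:
    """Return (k_actual, distinct_jobs, copies_per_job) for the remaining tasks."""
    distinct_jobs = math.ceil(n_task / k)
    if distinct_jobs >= r or strategy == "default":
        # Enough jobs to fill the resources, or keep k and leave resources idle.
        return k, distinct_jobs, 1
    if strategy == "replicative":
        # Keep k and run several copies of each job to use the idle resources.
        return k, distinct_jobs, r // distinct_jobs
    if strategy == "even":
        # Shrink k so that the remaining tasks spread evenly over all r resources.
        k_actual = math.ceil(n_task / r)
        return k_actual, math.ceil(n_task / k_actual), 1
    raise ValueError(f"unknown strategy: {strategy}")


# Assumed example: 40 tasks left, 20 resources, clustering factor 10.
for name in ("default", "replicative", "even"):
    print(name, refine(n_task=40, r=20, k=10, strategy=name))
```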
  • 30. Dynamic Performance
      • TFM and DC
