3. Introduction
• Task-based Scientific Workflows
– Task
– Job
• Task Clustering
– Merges multiple small tasks into a job
– Reduces scheduling and submit overhead
• Fault Tolerance in Task Clustering
– Existing techniques underestimate or ignore the influence of failures
4. Task Clustering
• Task Clustering
– Horizontal Clustering
– Vertical Clustering
– Arbitrary Clustering
Clustering Factor (k): number of tasks in a job
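The clustering factor can be illustrated with a short Python sketch. It assumes horizontal clustering over a flat list of independent tasks; the function name `cluster_tasks` and the task labels are hypothetical, chosen only for illustration.

```python
from typing import List


def cluster_tasks(tasks: List[str], k: int) -> List[List[str]]:
    """Merge tasks into jobs of at most k tasks each (k = clustering factor)."""
    if k < 1:
        raise ValueError("clustering factor k must be at least 1")
    # Consecutive slices of length k; the last job may hold fewer than k tasks.
    return [tasks[i:i + k] for i in range(0, len(tasks), k)]


tasks = [f"task{i}" for i in range(7)]  # hypothetical task names
jobs = cluster_tasks(tasks, k=3)
print(len(jobs), jobs[-1])
```

With 7 tasks and k = 3 this yields three jobs, the last of which holds a single task.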
5. System Overview
[Figure: timelines without clustering and with clustering, marking the scheduling and submit delay and the improvement from clustering]
6. Task Failures and Job Failures
• We focus only on transient failures and job retry.
• We do not differentiate the causes of failures; we are concerned only with the average failure rate.
• Assumption: a failure is a random event independent of workflow characteristics or execution environment.
• Two categories
o Task Failure: a task fails; other tasks in the same job may not fail
§ E.g. Application
o Job Failure: a job fails; all of its tasks fail
§ E.g. Scheduling System
7. Influence of Failures on Clustering
Symbol         Meaning
$t_{total}$    Estimated overall runtime
$n$            Number of tasks to run
$t$            Runtime of a single task
$r$            Number of available resources
$d$            Time delay between jobs
$N$            Expected retry times for a single task
$k$            Number of tasks in a job
$\beta$        Job failure rate
$\alpha$       Task failure rate
Target function: $\min(t_{total})$, given
• $n$ tasks to run on $r$ resources
• task failure rate ($\alpha$) is measurable (Task Failure Model), or job failure rate ($\beta$) is measurable (Job Failure Model)
Assumption: $n \gg r$, but $n/k \gg r$
8. Job Failure Model
Runtime for a single job:
$$t_{job} = kt + d$$
Average retry time for a single job:
$$N_{job} = \frac{1}{1-\beta}$$
Retry time for all jobs:
$$N_{total} = \begin{cases} N_{job}\,\dfrac{n}{rk}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N_{job}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
Overall runtime:
$$t_{total} = t_{job}\,N_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\beta)}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N_{job}(kt+d) = \dfrac{kt+d}{1-\beta}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
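9. Job Failure Model
$$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\beta)}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N_{job}(kt+d) = \dfrac{kt+d}{1-\beta}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
k* is independent of β
It's not necessary to adjust k. Just set it to be
$$k^{*} = \frac{n}{r}$$
$$t_{total}^{*} = \frac{kt+d}{1-\beta}$$
n = 1000, t = 5 sec, d = 5 sec, r = 20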
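A minimal Python sketch of the Job Failure Model, using the parameter values quoted on the slide (n = 1000, t = 5 s, d = 5 s, r = 20). The function names and the brute-force search over integer k are illustrative assumptions, not part of the original work; the job failure rates tried here are arbitrary.

```python
def jfm_total_runtime(n: int, t: float, d: float, r: int, k: int, beta: float) -> float:
    """Estimated overall runtime t_total under the Job Failure Model."""
    t_job = k * t + d            # runtime of a single clustered job
    n_job = 1.0 / (1.0 - beta)   # average retry time for a single job
    if n / k >= r:
        n_total = n_job * n / (r * k)
    else:
        n_total = n_job
    return t_job * n_total


def jfm_best_k(n: int, t: float, d: float, r: int, beta: float) -> int:
    """Integer clustering factor that minimizes t_total (exhaustive search)."""
    return min(range(1, n + 1), key=lambda k: jfm_total_runtime(n, t, d, r, k, beta))


n, t, d, r = 1000, 5.0, 5.0, 20
for beta in (0.1, 0.5):  # arbitrary example job failure rates
    k_star = jfm_best_k(n, t, d, r, beta)
    print(beta, k_star, jfm_total_runtime(n, t, d, r, k_star, beta))
```

For both failure rates the search lands on k = n/r = 50, consistent with k* being independent of β; only the resulting runtime changes.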
10. Task Failure Model
Runtime for a single job:
$$t_{job} = kt + d$$
Average retry time for a single job:
$$N_{job} = \frac{1}{(1-\alpha)^{k}}$$
Retry time for all jobs:
$$N_{total} = \begin{cases} N_{job}\,\dfrac{n}{rk}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N_{job}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
Overall runtime:
$$t_{total} = t_{job}\,N_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^{k}}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N_{job}(kt+d) = \dfrac{kt+d}{(1-\alpha)^{k}}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
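The expression for the average retry time of a job under the Task Failure Model can be motivated as follows. This is a sketch under two assumptions that go slightly beyond what the slide states: a clustered job is retried as a whole until all of its k tasks succeed, and task failures are independent with rate α.

```latex
% Assumptions: whole-job retry; independent task failures with rate \alpha.
\begin{align*}
P(\text{job succeeds}) &= (1-\alpha)^{k} \\
N_{job} &= \sum_{i=1}^{\infty} i \,\bigl(1-(1-\alpha)^{k}\bigr)^{i-1} (1-\alpha)^{k}
         = \frac{1}{(1-\alpha)^{k}}
\end{align*}
```

The sum is the mean of a geometric distribution with success probability $(1-\alpha)^{k}$.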
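11. Task Failure Model
$$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^{k}}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N_{job}(kt+d) = \dfrac{kt+d}{(1-\alpha)^{k}}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
k* depends on α
It's necessary to adjust k according to α
$$k^{*} = \frac{-d + \sqrt{d^{2} - \dfrac{4d}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
$$t_{total}^{*} = \frac{n(k^{*}t+d)}{rk^{*}(1-\alpha)^{k^{*}}}$$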
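A minimal Python sketch of the Task Failure Model runtime. It uses a brute-force search over integer k rather than evaluating the closed-form expression, so the result is simply the integer minimizer of t_total. The function names and the example failure rates are assumptions for illustration; the other parameters reuse the values n = 1000, t = 5 s, d = 5 s, r = 20 from the Job Failure Model slide.

```python
import math


def tfm_total_runtime(n: int, t: float, d: float, r: int, k: int, alpha: float) -> float:
    """Estimated overall runtime t_total under the Task Failure Model."""
    success = (1.0 - alpha) ** k   # probability that all k tasks of a job succeed
    if success == 0.0:             # guard against underflow for large k
        return math.inf
    t_job = k * t + d
    n_job = 1.0 / success          # average retry time for a single job
    if n / k >= r:
        n_total = n_job * n / (r * k)
    else:
        n_total = n_job
    return t_job * n_total


def tfm_best_k(n: int, t: float, d: float, r: int, alpha: float) -> int:
    """Integer clustering factor that minimizes t_total (exhaustive search)."""
    return min(range(1, n + 1), key=lambda k: tfm_total_runtime(n, t, d, r, k, alpha))


n, t, d, r = 1000, 5.0, 5.0, 20
for alpha in (0.01, 0.2):  # arbitrary example task failure rates
    print(alpha, tfm_best_k(n, t, d, r, alpha))
```

In this sketch the best clustering factor shrinks as α grows, which is the behaviour the slide describes: k has to be adjusted according to α.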
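12. Comparing TFM and JFM
1. Linear increase vs exponential increase
2. Optimal clustering factor
JFM:
$$k^{*} = \frac{n}{r}, \qquad t_{total}^{*} = \frac{kt+d}{1-\beta}$$
TFM:
$$k^{*} = \frac{-d + \sqrt{d^{2} - \dfrac{4d}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r, \qquad t_{total}^{*} = \frac{n(k^{*}t+d)}{rk^{*}(1-\alpha)^{k^{*}}}$$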
13. Fault Tolerant Clustering
• Job Failure Model: k = n/r
• Selective Reclustering (SR)
– Selects the failed tasks in a clustered job and clusters them into a new clustered job
– It requires the identification of failed tasks.
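Selective Reclustering can be sketched in a few lines of Python. The function name, the status dictionary, and the choice to treat tasks with no recorded status as failed are all assumptions made for this illustration; the slides do not specify how task status is reported.

```python
from typing import Dict, List


def selective_recluster(job: List[str], task_failed: Dict[str, bool]) -> List[str]:
    """Build a new clustered job from the tasks of `job` that failed.

    Assumption: a task with no recorded status is treated as failed,
    so that it is retried rather than silently dropped.
    """
    return [task for task in job if task_failed.get(task, True)]


job = ["t1", "t2", "t3", "t4"]  # hypothetical task ids
status = {"t1": False, "t2": True, "t3": False, "t4": True}
retry_job = selective_recluster(job, status)
print(retry_job)
```

Only the failed tasks (here `t2` and `t4`) are resubmitted, so work already completed by the other tasks is not repeated.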
14. Fault Tolerant Clustering
• Dynamic Clustering (DC)
– Adjusts the clustering factor dynamically according to the task failure rates
$$k^{*} = \frac{-d + \sqrt{d^{2} - \dfrac{4d}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
$$t_{total,DC}^{*} = \frac{n(k^{*}t+d)}{rk^{*}(1-\alpha)^{k^{*}}}$$
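One way Dynamic Clustering could be sketched in Python: estimate α from the outcomes observed so far, then pick the clustering factor that minimizes the Task Failure Model runtime. The estimator (failed divided by attempted tasks), the function names, and the exhaustive search in place of the closed-form k* are assumptions for illustration only.

```python
import math


def estimate_alpha(failed_tasks: int, attempted_tasks: int) -> float:
    """Observed task failure rate; 0.0 when nothing has run yet (assumption)."""
    if attempted_tasks == 0:
        return 0.0
    return failed_tasks / attempted_tasks


def tfm_runtime(n: int, t: float, d: float, r: int, k: int, alpha: float) -> float:
    """t_total under the Task Failure Model for clustering factor k."""
    success = (1.0 - alpha) ** k
    if success == 0.0:
        return math.inf
    # n/(rk) when n/k >= r, otherwise 1 (a single round of jobs).
    rounds = max(n / (r * k), 1.0)
    return (k * t + d) * rounds / success


def dynamic_clustering_factor(n: int, t: float, d: float, r: int,
                              failed_tasks: int, attempted_tasks: int) -> int:
    """Re-derive k from the currently observed failure rate."""
    alpha = estimate_alpha(failed_tasks, attempted_tasks)
    return min(range(1, n + 1), key=lambda k: tfm_runtime(n, t, d, r, k, alpha))


print(dynamic_clustering_factor(1000, 5.0, 5.0, 20, failed_tasks=0, attempted_tasks=100))
print(dynamic_clustering_factor(1000, 5.0, 5.0, 20, failed_tasks=20, attempted_tasks=100))
```

With no observed failures the sketch falls back to k = n/r; as failures accumulate, the clustering factor is reduced.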
16. Evaluation
• Run simulations based on the real traces that were run by the Pegasus group.
• Each workflow was simulated 100 times so that the standard deviation is less than 10%
• Two workflows were used.
• 20 worker nodes were used in each experiment.
17. Workflows Used
• Montage
– An astronomy application used to construct large image mosaics of the sky.
– Montage has complex data dependencies between tasks
– 10,422 tasks, 57GB data.
Image from http://montage.ipac.caltech.edu/
18. Workflows Used
• Periodogram
– Identifies periodic signals from light curves that arise from transiting planets.
– 216,600 tasks, 19GB input data.
– Periodogram has only one level
Image from http://pegasus.isi.edu/presentations/2011/sci709-voeckler-talk.ppt/
23. Task Specific Failure Detection (TSFD)
• Task failures are related to the type of tasks
• Failure Monitor classifies failures based on the type
• Clustering Engine merges tasks based on different task failure rates
• In this experiment of Montage, we set the task failure rate of mProjectPP and mDiffFit to be 0.001 while mBackground ranges from 0.2 to 0.8.
Optimization Methods
α1     DR       DR+TSFD   DC        DC+TSFD
0.2    10415    10412     13804     13820
0.4    11830    11839     22946     22923
0.6    14704    14688     60429     60414
0.8    23238    23229     436638    435297
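A Failure Monitor that classifies failures by task type could look roughly like the Python sketch below. The class and its methods are hypothetical helpers written for illustration, not an API of Pegasus or of the simulator used in the experiments; the recorded outcomes are made up.

```python
from collections import defaultdict
from typing import Dict


class FailureMonitor:
    """Tracks task outcomes per task type (hypothetical helper)."""

    def __init__(self) -> None:
        self._attempts: Dict[str, int] = defaultdict(int)
        self._failures: Dict[str, int] = defaultdict(int)

    def record(self, task_type: str, failed: bool) -> None:
        """Register one finished task attempt of the given type."""
        self._attempts[task_type] += 1
        if failed:
            self._failures[task_type] += 1

    def failure_rate(self, task_type: str) -> float:
        """Observed failure rate for a task type; 0.0 if never seen."""
        attempts = self._attempts.get(task_type, 0)
        if attempts == 0:
            return 0.0
        return self._failures.get(task_type, 0) / attempts


monitor = FailureMonitor()
for i in range(10):
    monitor.record("mBackground", failed=(i < 4))  # made-up data: 4 of 10 fail
    monitor.record("mProjectPP", failed=False)

print(monitor.failure_rate("mBackground"), monitor.failure_rate("mProjectPP"))
```

A clustering engine could then use a separate failure rate, and hence a separate clustering factor, for each task type.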
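24. Task Failure Model
$$t_{total} = \begin{cases} \dfrac{N_{job}\,n(kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^{k}}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N_{job}(kt+d) = \dfrac{kt+d}{(1-\alpha)^{k}}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
$t_{total}$ is not sensitive to α
$$k^{*} = \frac{-d + \sqrt{d^{2} - \dfrac{4d}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
$$t_{total}^{*} = \frac{n(k^{*}t+d)}{rk^{*}(1-\alpha)^{k^{*}}}$$
Simplification of failures is acceptable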
25. Location Specific Failure Detection (LSFD)
• Task failures are related to the location of execution
• Failure Monitor classifies failures based on resource id
• Scheduler orders resources based on their reliability.
• Two out of twenty nodes have higher task failure rates (from 0.2 to 0.8) while the others still have a task failure rate of 0.001.
DC generates many small tasks if the task failure rate is high
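Ordering resources by reliability can be sketched as a sort on the observed per-resource failure rate. The function name, the statistics format, and the node ids below are assumptions for illustration; the slides do not describe the scheduler's actual data structures.

```python
from typing import Dict, List, Tuple


def order_by_reliability(stats: Dict[str, Tuple[int, int]]) -> List[str]:
    """Order resource ids from most to least reliable.

    `stats` maps a resource id to (failed_tasks, attempted_tasks).
    Resources with no attempts are treated as fully reliable (assumption).
    """
    def failure_rate(resource_id: str) -> float:
        failed, attempted = stats[resource_id]
        return failed / attempted if attempted else 0.0

    # sorted() is stable, so resources with equal rates keep their input order.
    return sorted(stats, key=failure_rate)


stats = {  # made-up observations
    "node03": (800, 1000),
    "node01": (1, 1000),
    "node02": (200, 1000),
}
print(order_by_reliability(stats))
```

A scheduler following this ordering would hand jobs to the most reliable nodes first.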
26. Conclusion
• We present three basic methods to improve fault tolerance in task clustering
• If the system supports identification of failed tasks, dynamic reclustering performs best
• Otherwise, use dynamic clustering
• Improvement is significant even for very basic methods
27. Future Work
• Vertical Clustering and Arbitrary Clustering
• Intelligent Scheduler
• More Workflow Examples
• Distribution of Failures
28. Questions?
• Thank you for coming!
• For further info, please visit: pegasus.isi.edu or email wchen@isi.edu
29. Refinements
• When $n \gg r$ does not hold at the end of execution
• Default: $k_{actual} = k^{*}$, $n_{jobs} = \dfrac{n_{task}}{k^{*}} < r$
• Replicative: $k_{actual} = k^{*}$, $n_{jobs} = \dfrac{n_{task}}{k^{*}}$, replicate jobs by $\dfrac{r}{n_{jobs}}$
• Even: $k_{actual} = \dfrac{n_{task}}{r}$, $n_{jobs} = r$