This document discusses the challenges of ensuring resilience in exascale computing. It notes that hardware will fail more frequently at exascale due to factors like more transistors, near-threshold logic, radiation effects, and manufacturing variations. Current systems rely on checkpointing and restarting after failures, but this approach will not work at exascale due to the anticipated increase in failure rates. The document recommends studying current failure rates systematically and investigating approaches like "RAID" techniques for faster in-memory checkpointing to enable resilience at exascale.
Resilience at exascale
1. Resilience at Exascale
Marc Snir
Director, Mathematics and Computer Science Division, Argonne National Laboratory
Professor, Dept. of Computer Science, UIUC
2. The Problem
• "Resilience is the black swan of the exascale program"
• DOE & DoD commissioned several reports
– Inter-Agency Workshop on HPC Resilience at Extreme Scale, http://institute.lanl.gov/resilience/docs/Inter-AgencyResilienceReport.pdf (Feb 2012)
– U.S. Department of Energy Fault Management Workshop, http://shadow.dyndns.info/publications/geist12department.pdf (June 2012)
– ICIS workshop on "addressing failures in exascale computing" (report forthcoming)
• Talk:
– Discuss the little we understand
– Discuss how we could learn more
– Seek your feedback on the report's content
4. Failures Today
• Latest large systematic study (Schroeder & Gibson) published in 2005 – quite obsolete
• Failures per node per year: 2-20
• Root cause of failures
– Hardware (30%-60%), SW (5%-25%), Unknown (20%-30%)
• Mean-Time to Repair: 3-24 hours
• Current anecdotal numbers:
– Application crash: > 1 day
– Global system crash: > 1 week
– Application checkpoint/restart: 15-20 minutes (checkpoint often "free")
– System restart: > 1 hour
– Soft HW errors suspected
5. Recommendation
• Study in a systematic manner current failure types and rates
– Using error logs at major DOE labs
– Pushing for a common ontology
• Spend time hunting for soft HW errors
• Study causes of errors (?)
– Most HW faults (>99%) have no effect (processors are hugely inefficient?)
– The common effect of soft HW bit flips is SW failure
– HW faults can (often?) be due to design bugs
– Vendors are cagey
• Time for a consortium effort?
6. Current Error-Handling
• Application: global checkpoint & global restart
• System:
– Repair persistent state (file system, databases)
– Clean slate restart for everything else
• Quick analysis:
– Assume failures have a Poisson distribution; checkpoint time C = 1; recover+restart time = R; MTBF = M
– Optimal checkpoint interval τ satisfies e^(−τ/M) = (M − τ + 1)/M
– System utilization U is U = (M − R − τ + 1)/M
8. Utilization as Function of MTBF and R
• E.g.
– C = 15 mins (= 1)
– R = 1 hour (= 4)
– MTBF = 25 hours (= 100)
• U ≈ 80% (a numerical check is sketched below)
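The numbers on this slide follow directly from the slide-6 formulas. The short Python sketch below (added for illustration; it is not part of the deck) solves e^(−τ/M) = (M − τ + 1)/M for τ by bisection and evaluates U = (M − R − τ + 1)/M with C = 1 (15 minutes), R = 4, and M = 100. It gives τ of roughly 14-15 checkpoint units (about 3.6 hours) and U in the low 80% range, in line with the slide's ≈80%.

```python
# Minimal sketch (added for illustration; not from the original slides):
# solve the slide-6 checkpoint model for the optimal interval tau and the
# resulting utilization U. One time unit = the checkpoint time C (15 min).
import math

def optimal_tau(M, iters=100):
    """Solve e^(-tau/M) = (M - tau + 1)/M for tau by bisection.

    f(tau) = exp(-tau/M) - (M - tau + 1)/M is increasing in tau, negative at
    tau = 0 and positive at tau = M, so bisection on [0, M] finds the root.
    """
    def f(tau):
        return math.exp(-tau / M) - (M - tau + 1) / M

    lo, hi = 0.0, M
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def utilization(M, R, tau):
    """U = (M - R - tau + 1)/M, as on slide 6."""
    return (M - R - tau + 1) / M

if __name__ == "__main__":
    M, R = 100.0, 4.0   # MTBF = 25 h, repair = 1 h, in units of C = 15 min
    tau = optimal_tau(M)
    print(f"optimal tau ~= {tau:.1f} units ({tau * 15:.0f} min)")
    print(f"utilization U ~= {utilization(M, R, tau):.2f}")  # roughly 0.8
```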
9. Projecting Ahead
• "Comfort zone:"
– Checkpoint time < 1% MTBF
– Repair time < 5% MTBF
• Assume MTBF = 1 hour
– Checkpoint time ≈ 0.5 minute
– Repair time ≈ 3 minutes
• Is this doable?
– Yes, if done in memory
– E.g., using "RAID" techniques
10. Exascale Design Point

Systems | 2012 (BG/Q Computer) | 2020-2024 | Difference (Today & 2019)
System peak | 20 Pflop/s | 1 Eflop/s | O(100)
Power | 8.6 MW | ~20 MW |
System memory | 1.6 PB (16*96*1024) | 32-64 PB | O(10)
Node performance | 205 GF/s (16*1.6GHz*8) | 1.2 or 15 TF/s | O(10) – O(100)
Node memory BW | 42.6 GB/s | 2-4 TB/s | O(1000)
Node concurrency | 64 Threads | O(1k) or 10k | O(100) – O(1000)
Total node interconnect BW | 20 GB/s | 200-400 GB/s | O(10)
System size (nodes) | 98,304 (96*1024) | O(100,000) or O(1M) | O(100) – O(1000)
Total concurrency | 5.97 M | O(billion) | O(1,000)
MTTI | 4 days | O(<1 day) | -O(10)

Both price and power envelopes may be too aggressive!
11. Time to checkpoint
• Assume "RAID 5" across memories, checkpoint size ~50% of memory size (a toy parity sketch follows this slide)
– Checkpoint time ≈ time to transfer 50% of memory to another node: few seconds!
– Memory overhead ~50%
– Energy overhead small with NVRAM
– (But need many write cycles)
– We are in the comfort zone (re checkpoint)
• How about recovery?
• Time to restart application same as time to checkpoint
• Problem: time to recover if the system failed
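The "RAID" idea above can be made concrete with a toy sketch (my illustration, not code from the deck): each node's checkpoint stays in memory within a small group of peers, and one XOR parity block per group lets the survivors rebuild the checkpoint of any single failed node. Group size and names here are arbitrary.

```python
# Illustrative sketch (not from the deck): RAID-5-style parity for in-memory
# checkpoints. A group of N peers each holds its own checkpoint; one parity
# block (the XOR of all checkpoints) lets the group rebuild the checkpoint of
# a single failed node from the N-1 survivors plus the parity.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda acc, blk: [a ^ b for a, b in zip(acc, blk)],
                        blocks, bytearray(len(blocks[0]))))

def make_parity(checkpoints):
    """Parity block protecting a group of node checkpoints."""
    return xor_blocks(checkpoints)

def rebuild(lost_index, checkpoints, parity):
    """Recover the checkpoint of the failed node from survivors + parity."""
    survivors = [c for i, c in enumerate(checkpoints) if i != lost_index]
    return xor_blocks(survivors + [parity])

if __name__ == "__main__":
    # Toy 4-node group with 8-byte "checkpoints".
    ckpts = [bytes([i] * 8) for i in range(4)]
    parity = make_parity(ckpts)
    assert rebuild(2, ckpts, parity) == ckpts[2]   # node 2 "failed"
    print("recovered node 2 checkpoint:", rebuild(2, ckpts, parity).hex())
```

With the slide-10 interconnect figures (200-400 GB/s per node), shipping roughly half of node memory to a partner or parity holder is what keeps the in-memory checkpoint time to a few seconds, as claimed above.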
12. Key Assumptions:
1. Errors are detected (before the erroneous checkpoint is committed)
2. System failures are rare (much less than one per day), or recovery from them is very fast (minutes)
13. Hardware Will Fail More Frequently
• More, smaller transistors
• Near-threshold logic
☛ More frequent bit upsets due to radiation
☛ More frequent multiple-bit upsets due to radiation
☛ Larger manufacturing variation
☛ Faster aging
14. Hardware Error Detection: Assumptions

Table 2: Summary of assumptions on the components of a 45nm node and estimates of scaling to 11nm

Component | 45nm | 11nm
Cores | 8 | 128
Scattered latches per core | 200,000 | 200,000
Scattered latches in uncore, relative to cores | 1.25/√ncores = 0.44 | 1.25/√ncores = 0.11
FIT per latch | 10^-1 | 10^-1
Arrays per core (MB) | 1 | 1
FIT per SRAM cell | 10^-4 | 10^-4
Logic FIT / latch FIT | 0.1 – 0.5 | 0.1 – 0.5
DRAM FIT (per node) | 50 | 50
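To show how the FIT entries in Table 2 combine (1 FIT = 1 failure per 10^9 device-hours), here is a naive aggregation sketch. It is my construction, not the deck's analysis: it simply sums raw per-node FIT contributions from core and uncore latches, combinational logic, SRAM arrays, and DRAM, ignoring ECC, parity, and derating, so it estimates a raw fault rate rather than the undetected-error rate discussed on the next slide.

```python
# Naive aggregation of the raw FIT numbers in Table 2 (1 FIT = 1 failure per
# 10^9 device-hours). Ignores ECC, parity, and derating, so this is a raw
# upset-rate estimate, NOT the deck's detected/undetected-error analysis.

MB_BITS = 8 * 2**20  # bits per MB

def node_fit(cores, latches_per_core, uncore_factor, fit_latch,
             sram_mb_per_core, fit_sram_cell, logic_vs_latch, fit_dram):
    latch_fit = cores * latches_per_core * fit_latch        # core latches
    uncore_fit = uncore_factor * latch_fit                  # uncore latches
    logic_fit = logic_vs_latch * latch_fit                  # combinational logic
    sram_fit = cores * sram_mb_per_core * MB_BITS * fit_sram_cell
    return latch_fit + uncore_fit + logic_fit + sram_fit + fit_dram

if __name__ == "__main__":
    # logic_vs_latch = 0.5 takes the pessimistic end of the 0.1-0.5 range.
    for label, cores, uncore in (("45nm", 8, 0.44), ("11nm", 128, 0.11)):
        fit = node_fit(cores, 200_000, uncore, 1e-1, 1, 1e-4, 0.5, 50)
        mtbf_hours = 1e9 / fit
        print(f"{label}: ~{fit:.2e} raw FIT per node "
              f"-> raw MTBF ~ {mtbf_hours:.0f} h per node")
```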
16. Summary of (Rough) Analysis
• If no new technology is deployed, can have up to one undetected error per hour
• With additional circuitry, could get down to one undetected error per 100-1,000 hours (week – months)
– The cost could be as high as 20% additional circuits and 25% additional power
– Main problems are in combinatorial logic and latches
– The price could be significantly higher (small market for high-availability servers)
• Need software error detection as an option!
17. Application-Level Data Error Detection
• Check the checkpoint for correctness before it is committed.
– Look for outliers (assume smooth fields) – handle high-order bit errors (a minimal check is sketched below)
– Use robust solvers – handle low-order bit errors
• May not work for discrete problems, particle simulations, discontinuous phenomena, etc.
• May not work if the outlier is rapidly smoothed (error propagation)
– Check for global invariants (e.g., energy preservation)
• Ignore error
• Duplicate computation of critical variables
• Can we reduce checkpoint size (or avoid checkpoints altogether)?
– Tradeoff: how much is saved, how often data is checked (big/small transactions)
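To make the outlier and invariant checks concrete, here is a minimal NumPy sketch (illustrative only, not code from the deck; the window size, the 8-sigma threshold, and the "energy" invariant are arbitrary choices): it flags grid points that deviate from their local neighborhood by many standard deviations, the typical signature of a high-order bit flip in a smooth field, and it compares a global invariant against the previous checkpoint.

```python
# Illustrative sketch (not from the deck): sanity-check a checkpointed field
# before committing it. Assumes a smooth 2-D field; thresholds are arbitrary.
import numpy as np

def has_outliers(field, window=1, nsigma=8.0):
    """Flag points that differ from their local neighborhood mean by many
    standard deviations -- typical signature of a high-order bit flip."""
    padded = np.pad(field, window, mode="edge")
    neigh = np.zeros_like(field)
    count = 0
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            if dy == 0 and dx == 0:
                continue
            neigh += padded[window + dy : window + dy + field.shape[0],
                            window + dx : window + dx + field.shape[1]]
            count += 1
    neigh /= count
    resid = field - neigh
    return bool(np.any(np.abs(resid) > nsigma * (resid.std() + 1e-30)))

def invariant_drift(field, prev_total, rel_tol=1e-6):
    """Check a global invariant (here: the field's 'total energy')."""
    return abs(field.sum() - prev_total) > rel_tol * abs(prev_total)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    field = np.sin(np.linspace(0, 1, 256))[:, None] * np.ones((256, 256))
    field += 1e-3 * rng.standard_normal(field.shape)
    total = field.sum()
    field_bad = field.copy()
    field_bad[100, 100] = 1e6                      # simulated high-order bit flip
    print(has_outliers(field), has_outliers(field_bad))   # expect: False, True
    print(invariant_drift(field_bad, total))               # expect: True
```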
18. How About Control Errors?
• Code is corrupted, jump to wrong address, etc.
– Rarer (more data state than control state)
– More likely to cause a fatal exception
– Easier to protect against
19. How About System Failures (Due to Hardware)?
• Kernel data corrupted
– Page tables, routing tables, file system metadata, etc.
• Hard to understand and diagnose
– Break abstraction layers
• Makes sense to avoid such failures
– Duplicate system computations that affect kernel state (price can be tolerated for HPC)
– Use transactional methods
– Harden kernel data
• Highly reliable DRAM
• Remotely accessible NVRAM
– Predict and avoid failures
20. Failure Prediction from Event Logs
Use a combination of signal analysis (to identify outliers) and data mining (to find correlations).
Gainaru, Cappello, Snir, Kramer (SC12)
[Fig. 2: Methodology overview of the hybrid approach]
21. Can Hardware Failure Be Predicted?

Prediction method | Precision | Recall | Seq used | Pred failures
ELSA hybrid | 91.2% | 45.8% | 62 (96.8%) | 603
ELSA signal | 88.1% | 40.5% | 117 (92.8%) | 534
Data mining | 91.9% | 15.7% | 39 (95.1%) | 207

The metrics used for evaluating prediction performance are precision and recall:
• Precision is the fraction of failure predictions that turn out to be correct.
• Recall is the fraction of failures that are predicted. (A small worked example follows below.)

Table: Percentage waste improvement in checkpointing strategies
C (checkpoint time) | Precision (%) | Recall (%) | MTTF for the whole system | Waste gain
1 min | 92 | 20 | one day | 9.13%
1 min | 92 | 36 | one day | 17.33%
10 s | 92 | 36 | one day | 12.09%
10 s | 92 | 45 | one day | 15.63%
1 min | 92 | 50 | 5 h | 21.74%
10 s | 92 | 65 | 5 h | 24.78%

Migrating processes when node failure is predicted can significantly improve utilization.
22. How About Software Bugs?
• Parallel code (transformational, mostly deterministic) vs. concurrent code (event-driven, very nondeterministic)
– Hard to understand concurrent code (cannot comprehend more than 10 concurrent actors)
– Hard to avoid performance bugs (e.g., overcommitted resources causing time-outs)
– Hard to test for performance bugs
• Concurrency performance bugs (e.g., in parallel file systems) are a major source of failures on current supercomputers
– Problem will worsen as we scale up – performance bugs become more frequent
• Need to become better at avoiding performance bugs (learn from control theory and real-time systems)
– Make system code more deterministic
23. System Recovery
• Local failures (e.g., node kernel crashed) are not a major problem -- can replace
• Global failures (e.g., global file system crashed) are the hardest problem -- need to avoid
• Issue: fault containment
– Local hardware failure corrupts global state
– Localized hardware error causes global performance bugs
24. Quis Custodiet Ipsos Custodes?
• Who watches the watchmen?
• Need robust (scalable, fault-tolerant) infrastructure for error reporting and recovery orchestration
– Current approach of out-of-band monitoring and control is too restricted
25. Summary
• Resilience is a major problem in the exascale era
• Major todos:
– Need to understand much better the failures in current systems and future trends
– Need to develop good application error detection methodology; error correction is desirable, but less critical
– Need to significantly reduce system errors and reduce system recovery time
– Need to develop robust infrastructure for error handling