
Silent error resilience in numerical time-stepping schemes

Talk on silent error resilience at the ICME Colloquium on January 26, 2015.

  1. Silent error resilience in numerical time-stepping schemes
     Austin Benson (arbenson@stanford.edu), Stanford University
     ICME Colloquium, Jan. 26, 2015
     Joint work with Sven Schmit (Stanford) and Rob Schreiber (HP Labs)
     Code + data: http://stanford.edu/~arbenson/silent.html
     Paper: Intl. J. of High Performance Computing Applications, 2014
  2. Computer systems are getting bigger and more complicated. Software systems are getting bigger and more complicated. Pushing energy limits. Things break.
  3. What breaks?
     • Hardware wears out
     • Bit flips from cosmic rays
     • Data races and other software bugs
     • Firmware bugs
     Silent errors are errors in application state that have escaped low-level error detection.
  4. What can we do?
     • Checkpoint/restart: occasionally save the state of the system; if an error is detected, restart. Does not scale. And how do we detect errors in the first place?
     • Other ABFT (algorithm-based fault tolerance): clever algorithms that address these issues for particular algorithms.
     • This work: error detection for iterative computation in general, and numerical time-stepping schemes in particular.
  5. Spot the error!
  6. At time step 120, a single entry in the right-hand side of the Crank-Nicolson and Backward Euler linear solves was multiplied by 0.995.
  7. General algorithm:
     • A “base method” generates the sequence B1, B2, …
     • An “auxiliary method” generates the sequence A1, A2, …
     • If Di = ||Bi - Ai|| is abnormal, an error may have occurred.
  8. Base method: a high-order numerical integration scheme (Runge-Kutta 5). Auxiliary method: a lower-order scheme (Runge-Kutta 4). Difference: Di = |Bi - Ai|. This re-purposes an old idea from step-size control [Fehlberg, 1969], [Dormand and Prince, 1980].
  9. Key idea: re-use data. RK 1/2 scheme for u' = f(t, u). The second-order scheme has local error O(h^3). The first-order scheme needs no extra function evaluations and provides an O(h^2) check.
  10. Key idea: re-use data. The base method is an implicit solve that is stable (e.g., Backward Euler); an explicit solve (e.g., Forward Euler) checks it. It is OK that the explicit solve may be unstable. (Why?)
  11. Lots of these checks for numerical time-stepping algorithms:
     • Backward/Forward Euler
     • Richardson/Crank-Nicolson
     • Runge-Kutta 1/2, 2/3, 4/5
     • Adams-Bashforth linear multistep method 2/3, 4/5
     • Explicit check on implicit scheme
     • Extrapolation
  12. Exercise in step detection (change point detection). Algorithmic details are in the paper. Main parameters:
     • Relative jump
     • Variance change
  13. Experimental setup:
     • Solve the heat equation for T time steps and artificially inject an error at one time step.
     • Do this many times with different types of errors.
     • True positive rate: #(real errors detected) / #(trials)
     • False positive rate: #(non-errors “detected”) / #(time steps)
  14. Are large errors easier to detect? Figure: local truncation error (LTE)-normalized error, showing the output when no fault is injected and the output when a fault is injected.
  15. Error injection: multiply a single entry of the RHS in the linear solves by z ~ N(1, 5e-5) at a single time step.
  16. Error injection: multiply q(x, t) at one discrete x by z ~ N(1, 0.1) at a single time step.
  17. Takeaways
     • We have a general framework for detecting silent errors.
     • Numerical integration is our central application.
     • We detect large errors more easily.
     • Not too many false positives.
  18. Resilience: what do we need to discuss?
     • How many silent errors are there? How worried should we be?
     • Do we need systems solutions or algorithmic solutions? Both?
     • “Defense in depth” is good. But how easy are ABFT methods to incorporate into existing solvers?
  19. Silent error resilience in numerical time-stepping schemes (closing title slide, same details as slide 1)
  20. Tardy error detection
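
The base/auxiliary idea from slides 7 and 9 can be sketched in a few lines. The slide's exact RK 1/2 formulas did not survive in this transcript, so the sketch below assumes one common pair: Heun's method (second order) as the base and forward Euler (first order) as the auxiliary, which re-uses the first stage k1. The function name, the test problem u' = -u, the step size, and the `fault_step`/`fault_factor` knobs are all illustrative assumptions, not taken from the talk or the paper.

```python
def heun_with_euler_check(f, u0, h, n_steps, fault_step=None, fault_factor=1.0):
    """Integrate u' = f(t, u) with Heun's method and return (u_final, checks).

    checks[i] = |B_i - A_i|, where B_i is the Heun (base) update and A_i is
    the forward Euler (auxiliary) update computed from the same stage k1.
    fault_step / fault_factor are hypothetical demo knobs: at that step the
    second stage k2 is multiplied by fault_factor to mimic a silent error.
    """
    t, u = 0.0, u0
    checks = []
    for i in range(n_steps):
        k1 = f(t, u)
        k2 = f(t + h, u + h * k1)
        if i == fault_step:
            k2 *= fault_factor  # simulated silent error in the base method
        base = u + 0.5 * h * (k1 + k2)  # B_i: second-order update
        aux = u + h * k1                # A_i: first-order update, no new f calls
        checks.append(abs(base - aux))  # D_i, O(h^2) when nothing goes wrong
        t, u = t + h, base
    return u, checks


def decay(t, u):
    """Assumed test problem: u' = -u."""
    return -u


_, clean = heun_with_euler_check(decay, 1.0, 0.01, 200)
_, faulty = heun_with_euler_check(decay, 1.0, 0.01, 200,
                                  fault_step=120, fault_factor=1.5)
```

In the clean run the sequence D_i varies smoothly from step to step, while in the faulty run D_120 stands well above its neighbours. That contrast is what a step detector would look for.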

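
The experiment described on slides 6, 10, and 13 can be illustrated with a small, self-contained sketch: Backward Euler as the base method for the 1D heat equation, a single Forward Euler step from the same state as the check, and one RHS entry multiplied by 0.995 at time step 120 (the values from slide 6). Everything else is an assumption made for illustration: the grid size, the ratio r = dt/dx^2, the sine initial condition, the zero Dirichlet boundaries, and especially the ratio-threshold detector, which is a simple stand-in and not the step-detection algorithm from the paper.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system with constant bands."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag - lower * c[i - 1]
        c[i] = upper / denom
        d[i] = (rhs[i] - lower * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def forward_euler_step(u, r):
    """One explicit step of u_t = u_xx with zero Dirichlet boundaries."""
    n = len(u)
    out = []
    for j in range(n):
        left = u[j - 1] if j > 0 else 0.0
        right = u[j + 1] if j < n - 1 else 0.0
        out.append(u[j] + r * (left - 2.0 * u[j] + right))
    return out


def run(n=50, r=0.4, n_steps=200, fault_step=None, fault_index=25,
        fault_factor=0.995):
    """Return D_i = max|B_i - A_i| for each time step (all defaults assumed)."""
    dx = 1.0 / (n + 1)
    u = [math.sin(math.pi * (j + 1) * dx) for j in range(n)]
    checks = []
    for i in range(n_steps):
        aux = forward_euler_step(u, r)  # A_i: explicit check from stored state
        rhs = list(u)
        if i == fault_step:
            rhs[fault_index] *= fault_factor  # injected silent error
        base = solve_tridiagonal(-r, 1.0 + 2.0 * r, -r, rhs)  # B_i
        checks.append(max(abs(b - a) for b, a in zip(base, aux)))
        u = base  # only the base method advances the solution
    return checks


def flag_relative_jumps(checks, ratio=2.0):
    """Stand-in detector: flag step i when D_i exceeds ratio * D_(i-1)."""
    return [i for i in range(1, len(checks)) if checks[i] > ratio * checks[i - 1]]


checks = run(fault_step=120)
flagged = flag_relative_jumps(checks)
```

Note that the explicit step is always restarted from the base method's current state and its result is never carried forward. This is one plausible reading of why an unstable explicit check is acceptable: its errors cannot accumulate across steps.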