# Silent error resilience in numerical time-stepping schemes

Talk on silent error resilience at the ICME Colloquium on January 26, 2015.

### Silent error resilience in numerical time-stepping schemes

1. Silent error resilience in numerical time-stepping schemes
   - Austin Benson (arbenson@stanford.edu), Stanford University
   - ICME Colloquium, Jan. 26, 2015
   - Joint work with Sven Schmit (Stanford) and Rob Schreiber (HP Labs)
   - Code and data: http://stanford.edu/~arbenson/silent.html
   - Paper: Intl. J. of High Performance Computing Applications, 2014
2. Computer systems are getting bigger and more complicated. Software systems are getting bigger and more complicated. We are pushing energy limits. Things break.
3. What breaks?
   - Hardware wears out
   - Bit flips from cosmic rays
   - Data races and other software bugs
   - Firmware bugs

   Silent errors are errors in application state that have escaped low-level error detection.
4. What can we do?
   - Checkpoint/restart: occasionally save the state of the system and, if an error is detected, restart. This does not scale, and it leaves open how to detect errors.
   - Other ABFT (algorithm-based fault tolerance): clever techniques that address these issues for particular algorithms.
   - This work: error detection for iterative computation in general, and numerical time-stepping schemes in particular.
5. Spot the error!
6. At time step 120, a single entry of the right-hand side in the Crank-Nicolson and Backward Euler linear solves was multiplied by 0.995.
7. General algorithm:
   - A “base method” generates a sequence $B_1, B_2, \ldots$
   - An “auxiliary method” generates a sequence $A_1, A_2, \ldots$
   - If $D_i = \|B_i - A_i\|$ is abnormal, there is a possible error.
8. Base method: a high-order numerical integration scheme (Runge-Kutta 5). Auxiliary method: a lower-order scheme (Runge-Kutta 4). Difference: $D_i = |B_i - A_i|$. This re-purposes an old idea for step-size control [Fehlberg, 1969], [Dormand and Prince, 1980].
9. Key idea: re-use data. RK 1/2 scheme for $u' = f(t, u)$. The second-order scheme has a per-step error of $O(h^3)$. No extra function evaluations are needed, and the pair provides an $O(h^2)$ check.
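As a rough illustration of the re-use idea on this slide, the sketch below assumes the RK 1/2 pair is forward Euler (first order) embedded in Heun's method (second order). The slide's own formulas did not survive in the transcript, so this pairing, the function name `heun_euler_step`, and the test problem $u' = -u$ are all assumptions. Both results are built from the same two evaluations of `f`, and their difference shrinks like $O(h^2)$.

```python
def heun_euler_step(f, t, u, h):
    """One step of an embedded Euler (order 1) / Heun (order 2) pair.

    Hypothetical helper for illustration. Returns (base, aux, diff):
    the second-order value, the first-order value, and |base - aux|.
    """
    k1 = f(t, u)                      # shared by both schemes
    k2 = f(t + h, u + h * k1)         # needed by Heun only
    aux = u + h * k1                  # forward Euler, first order
    base = u + 0.5 * h * (k1 + k2)    # Heun, second order
    return base, aux, abs(base - aux)


# Assumed test problem: u' = -u with u(0) = 1.
f = lambda t, u: -u
_, _, d_h = heun_euler_step(f, 0.0, 1.0, 0.1)
_, _, d_half = heun_euler_step(f, 0.0, 1.0, 0.05)
print(d_h / d_half)  # close to 4: halving h quarters the difference
```

The auxiliary value costs nothing beyond what the base method already computes, which is the point of the slide.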
10. Key idea: re-use data. An implicit solve that is stable (e.g., Backward Euler) is checked by an explicit solve (e.g., Forward Euler). It is OK that the explicit solve may be unstable. (Why?)
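A minimal sketch of an explicit check on an implicit scheme, under several assumptions not taken from the talk: the problem is the 1D heat equation $u_t = u_{xx}$ with zero boundary values, the grid size, the ratio `r = dt / dx^2`, and the names `thomas` and `run` are illustrative, and the fault mimics the earlier slide (one right-hand-side entry multiplied by 0.995 at step 120). In this sketch the Forward Euler step always restarts from the previous Backward Euler state and is never iterated, which is one plausible reading of why its instability would not matter.

```python
import math


def thomas(a, b, c, d):
    """Solve a tridiagonal system (sub a, diagonal b, super c) with right-hand side d."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def run(n=50, steps=200, r=0.4, fault_step=None, fault_index=25, factor=0.995):
    """Backward Euler (base) checked by Forward Euler (auxiliary).

    Returns the list of differences D_i = max |B_i - A_i| over the grid.
    """
    u = [math.sin(math.pi * (j + 1) / (n + 1)) for j in range(n)]
    a, b, c = [-r] * n, [1.0 + 2.0 * r] * n, [-r] * n
    diffs = []
    for step in range(steps):
        rhs = list(u)
        if step == fault_step:
            rhs[fault_index] *= factor          # injected silent error
        base = thomas(a, b, c, rhs)             # implicit solve
        aux = []
        for j in range(n):                      # explicit step from the same state
            left = u[j - 1] if j > 0 else 0.0
            right = u[j + 1] if j < n - 1 else 0.0
            aux.append(u[j] + r * (left - 2.0 * u[j] + right))
        diffs.append(max(abs(x - y) for x, y in zip(base, aux)))
        u = base                                # only the base method advances
    return diffs


clean = run()
faulty = run(fault_step=120)
print(clean[120], faulty[120])  # the faulty difference is far larger
```

Before the fault the two runs are identical, so any jump in the difference sequence at step 120 comes from the injected error alone.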
11. Lots of these checks for numerical time-stepping algorithms:
    - Backward/Forward Euler
    - Richardson/Crank-Nicolson
    - Runge-Kutta 1/2, 2/3, 4/5
    - Adams-Bashforth linear multistep method 2/3, 4/5
    - Explicit check on implicit scheme
    - Extrapolation
12. Exercise in step detection (change point detection). Algorithmic details are in the paper. Main parameters: relative jump and variance change.
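The paper's detection algorithm is not reproduced in this transcript, so the following is only a hypothetical stand-in showing what a relative-jump rule on the difference sequence could look like. The function name `flag_jumps`, the trailing-median baseline, and the defaults `window=10` and `rel_jump=5.0` are assumptions. The variance-change parameter mentioned on the slide is not modeled here.

```python
from statistics import median


def flag_jumps(diffs, window=10, rel_jump=5.0):
    """Flag indices whose value exceeds rel_jump times the median of the
    preceding `window` values. Illustrative rule, not the paper's method."""
    flagged = []
    for i in range(window, len(diffs)):
        baseline = median(diffs[i - window:i])
        if baseline > 0 and diffs[i] > rel_jump * baseline:
            flagged.append(i)
    return flagged


# Synthetic difference sequence with one spike at index 20.
diffs = [1.0e-6] * 30
diffs[20] = 2.0e-3
print(flag_jumps(diffs))  # → [20]
```

A median baseline is used in this sketch so that a single spike does not inflate the reference level for the steps that follow it.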
13. Experimental setup:
    - Solve the heat equation for $T$ time steps and artificially inject an error at one time step.
    - Do this many times with different types of errors.
    - True positive rate: #(real errors detected) / #(trials)
    - False positive rate: #(non-errors “detected”) / #(time steps)
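The two rates defined above can be tallied as in the sketch below. The record layout (one tuple per trial holding the fault step, the flagged steps, and the number of time steps) and the name `detection_rates` are assumptions, as is the choice to count a detection only when the flag lands on the exact fault step.

```python
def detection_rates(trials):
    """Compute (true positive rate, false positive rate).

    trials: list of (fault_step, flagged_steps, n_steps), one injected
    fault per trial. Hypothetical record layout for illustration.
    """
    detected = sum(1 for fault, flagged, _ in trials if fault in flagged)
    false_pos = sum(
        sum(1 for s in flagged if s != fault) for fault, flagged, _ in trials
    )
    total_steps = sum(n for _, _, n in trials)
    return detected / len(trials), false_pos / total_steps


# Four made-up trials of 200 steps each, fault injected at step 120.
trials = [
    (120, [120], 200),       # detected, no false alarm
    (120, [], 200),          # missed
    (120, [50, 120], 200),   # detected, one false alarm
    (120, [121], 200),       # flagged one step late: counted as a false alarm here
]
tpr, fpr = detection_rates(trials)
print(tpr, fpr)
```

Whether a late flag should count as a detection is a design choice; this sketch takes the strict view.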
14. Are large errors easier to detect? Figure labels: local truncation error (LTE)-normalized error; output when no fault is injected; output when fault is injected.
15. Error injection: multiply a single entry of the RHS in the linear solves by $z \sim N(1, 5 \times 10^{-5})$ at a single time step.
16. Error injection: multiply $q(x, t)$ at one discrete $x$ by $z \sim N(1, 0.1)$ at a single time step.
17. Takeaways:
    - We have a general framework for detecting silent errors.
    - Numerical integration is our central application.
    - We detect large errors more easily.
    - Not too many false positives.
18. Resilience: what do we need to discuss?
    - How many silent errors are there? How worried should we be?
    - Do we need systems solutions or algorithmic solutions? Both?
    - “Defense in depth” is good. But how easy are ABFT methods to incorporate into existing solvers?
19. Silent error resilience in numerical time-stepping schemes
    - Austin Benson (arbenson@stanford.edu), Stanford University
    - ICME Colloquium, Jan. 26, 2015
    - Joint work with Sven Schmit (Stanford) and Rob Schreiber (HP Labs)
    - Code and data: http://stanford.edu/~arbenson/silent.html
    - Paper: Intl. J. of High Performance Computing Applications, 2014
20. Tardy error detection