XS Oracle 2009 Error Detection

Detecting and Correcting Transient Errors via Xen


  1. Detecting and Correcting Transient Hardware Errors
     John Byrne (john.l.byrne@hp.com), Norman P. Jouppi, Laura Ramirez, Parthasarathy Ranganathan, Bruce J. Walker (HP Labs)
     Nidhi Aggarwal, Kewal K. Saluja, James E. Smith (University of Wisconsin – Madison)
     © 2009 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  2. Availability/Reliability Spectrum (24 February 2009)
     • High Availability: restart/failover; lose ongoing work
     • Fault Tolerance: no lost work; keep running, even with bad results
     • Fault Correction: keep running with correct results
  3. Availability/Reliability Spectrum
     • High Availability: restart VM; lose ongoing work
     • Fault Tolerance: no lost work; keep running, even with bad results
     • Fault Correction: keep running with correct results
     Ongoing FT work:
     • Remus: checkpoint and restart on complete node failure
     • Kemari: output comparison; pick one if they don’t match
     • Marathon, VMware: lockstep execution; backup takes over on complete node failure
  4. Availability/Reliability Spectrum
     • Theme of the ongoing FT work (Remus, Kemari, Marathon, VMware): deal with complete node failure
     • No one is detecting or correcting transient processor failures
  5. Transient Hardware Errors
     • The International Technology Roadmap for Semiconductors has predicted significant reliability problems
     • An Intel study in 2005 indicated a 100-fold increase in transient faults in scaling from 180nm to 16nm
     Errors can:
       a. Crash the OS;
       b. Corrupt data;
       c. Cause execution to take a different path.
     Goal: Detect and Correct Transient Errors
  6. Detect and Correct Transient Errors
     1. Lockstep VMs with ongoing checkpoints.
     2. Tee input to both VMs.
     3. Compare output from VMs and re-execute on miscompare.
     4. Compare checkpoints and re-execute on miscompare.
     5. Log input, interrupts and non-deterministic instructions to allow completely accurate re-execution.
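The five steps above can be sketched as a toy loop. This is only an illustration under stated assumptions, not the Xen implementation: the "guest" is a hypothetical deterministic function `step`, a transient error is simulated by an injected bit flip (`fault_at`), and all names are invented for the example.

```python
import copy

def step(state, value):
    """Deterministic toy 'guest': fold one input into the state, emit one output."""
    state["acc"] = (state["acc"] * 31 + value) % 1_000_003
    return state["acc"]

def run_epoch(state, inputs, fault_at=None):
    """Run one replica over the teed inputs; optionally inject a transient bit flip."""
    outputs = []
    for i, value in enumerate(inputs):
        out = step(state, value)
        if i == fault_at:
            state["acc"] ^= 1          # hypothetical transient error in this replica
            out = state["acc"]
        outputs.append(out)
    return outputs

def lockstep_epoch(checkpoint, inputs, fault_at=None, max_attempts=3):
    """Run two replicas from the same checkpoint; re-execute on any miscompare."""
    for attempt in range(1, max_attempts + 1):
        a = copy.deepcopy(checkpoint)
        b = copy.deepcopy(checkpoint)
        # The fault is transient, so it is only injected on the first attempt.
        out_a = run_epoch(a, inputs, fault_at if attempt == 1 else None)
        out_b = run_epoch(b, inputs)
        if out_a == out_b and a == b:   # compare outputs, then compare state
            return a, out_a, attempt
    raise RuntimeError("replicas still disagree; error is probably not transient")
```

Because the inputs are logged (here, simply the `inputs` list) and the guest is deterministic, the re-execution reproduces exactly the fault-free run.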
  7. Lockstep VMs
     • Create 2 identical images at VM start time;
     • Ensure response to each VMexit is identical;
       − Force them to be identical if necessary (rdtsc)
     • Deliver interrupt at the identical instruction
       − At each VMexit, deliver interrupts if they are pending;
       − Use the count-down PMU counter to force a synchronization point if no VMexits happen and deliver interrupts at that point.
     • Log VMexit return values as necessary (for replay);
     • Log when interrupts are delivered (for replay)
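One way to picture "force them to be identical" is a small log that hands both replicas the same value for a non-deterministic VMexit such as rdtsc. The sketch below is hypothetical: `ExitValueLog` and its methods are invented for illustration, and it assumes for simplicity that one replica always reaches each exit before the other.

```python
from collections import deque

class ExitValueLog:
    """Gives both replicas the same value for each non-deterministic VMexit."""

    def __init__(self):
        self._pending = deque()   # values the first replica saw, not yet consumed
        self.replay_log = []      # every value handed out, kept for later replay

    def first(self, read_value):
        """Called by the replica that reaches the exit first: read the real value."""
        value = read_value()
        self._pending.append(value)
        self.replay_log.append(value)
        return value

    def second(self):
        """Called by the other replica: reuse the value instead of reading again."""
        return self._pending.popleft()
```

A caller might pass `time.perf_counter_ns` as `read_value` to stand in for a timestamp read; the second replica then sees the identical number, and `replay_log` holds it for re-execution.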
  8. I/O
     • Input is sent to both VMs (network and disk);
       − Input is logged as being part of a specific checkpoint, so on replay inputs can come from the log;
     • Output is compared and, if equal, is sent out;
       − If not equal, re-execute from the last good checkpoint;
       − Blktap driver modified to allow disk output to be compared;
       − Network backend driver modified to allow network output to be compared;
       − Output is counted so on replay the correct number of re-created outputs can be discarded and not output twice.
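The output rules above (compare, release if equal, count per checkpoint, discard re-created outputs on replay) can be sketched as a small gate. This is a minimal illustration with invented names (`OutputGate`, `submit`, `start_replay`), not the modified blktap or network backend code.

```python
class OutputGate:
    """Releases an output only when both replicas produced the same bytes."""

    def __init__(self):
        self.released = []                # outputs actually sent to the outside world
        self.count_since_checkpoint = 0   # outputs released since the last checkpoint
        self._to_discard = 0              # re-created outputs still to be dropped

    def checkpoint(self):
        """At a checkpoint event, start a new output count."""
        self.count_since_checkpoint = 0

    def start_replay(self):
        """On replay, everything already released must not go out a second time."""
        self._to_discard = self.count_since_checkpoint

    def submit(self, out_a, out_b):
        if out_a != out_b:
            return "miscompare"           # caller re-executes from the last good checkpoint
        if self._to_discard > 0:
            self._to_discard -= 1
            return "discarded"
        self.released.append(out_a)
        self.count_since_checkpoint += 1
        return "released"
```

In this sketch a miscompare releases nothing, so after `start_replay()` the gate silently drops exactly the outputs that already left the machine and resumes releasing from the point of the error.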
  9. Checkpointing and Comparing
     • Incremental, periodic checkpoints of each VM
       − At exactly the same instruction;
       − Utilize COW or copy at checkpoint time;
       − Mark the checkpoint event in the input and output streams;
       − Immediately continue execution after the checkpoint is done;
     • After checkpoint X is done, compare the incremental checkpoints
       − If equal, delete one of the checkpoints;
       − If equal, delete checkpoint X-1 and any logs for X-1;
       − If not equal, then do a replay from checkpoint X-1;
     • At checkpoint event, tell input to start a new log;
     • At checkpoint event, tell output to record the output count and start a new count for the next checkpoint
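The comparison step can be illustrated as follows. The slides do not say how the incremental checkpoints are compared; this sketch assumes each one is a mapping from page number to fixed-size page contents and compares SHA-256 digests, which is a choice made here for illustration. The function names are hypothetical.

```python
import hashlib

def digest(dirty_pages):
    """Hash an incremental checkpoint given as {page_number: page_bytes}."""
    h = hashlib.sha256()
    for page_no in sorted(dirty_pages):          # fixed order so both sides agree
        h.update(page_no.to_bytes(8, "little"))
        h.update(dirty_pages[page_no])           # assumes fixed-size pages
    return h.hexdigest()

def settle(x, dirty_a, dirty_b):
    """Decide what to do once checkpoint number x exists on both replicas."""
    if digest(dirty_a) == digest(dirty_b):
        # Equal: one copy of checkpoint x suffices; x-1 and its logs can go.
        return ("commit", x)
    # Not equal: fall back to the last checkpoint known to be good.
    return ("replay", x - 1)
```

Comparing digests rather than raw pages keeps the amount of data exchanged small, at the cost of hashing every dirty page on both replicas.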
  10. Replay from Checkpoint X
     • Restore the registers and memory image
     • Tell input I/O to replay input from the log for checkpoint X
     • Tell output that we are replaying from checkpoint X; it then throws away as many I/Os as were done since checkpoint X.
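A minimal sketch of the restore-and-replay idea, assuming guest state can be modeled as a plain dictionary and that applying an input is deterministic. `ReplayLog` and `apply_input` are hypothetical names used only for this example.

```python
import copy

class ReplayLog:
    """Keeps one state snapshot and one input log per checkpoint."""

    def __init__(self, initial_state):
        self.current = 0
        self.snapshots = {0: copy.deepcopy(initial_state)}
        self.inputs = {0: []}

    def record_input(self, value):
        """Log an input as belonging to the current checkpoint."""
        self.inputs[self.current].append(value)

    def checkpoint(self, state):
        """Snapshot the state and start a new input log."""
        self.current += 1
        self.snapshots[self.current] = copy.deepcopy(state)
        self.inputs[self.current] = []

    def replay(self, checkpoint_id, apply_input):
        """Restore checkpoint_id, then feed back the inputs logged since it."""
        state = copy.deepcopy(self.snapshots[checkpoint_id])
        for value in self.inputs[checkpoint_id]:
            apply_input(state, value)
        return state
```

Replay works on a copy of the snapshot, so the stored checkpoint stays intact if a second replay is needed.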
  11. Initial Limitations
     • Uniprocessor VMs
     • HVM guests
     • Single Node implementation
  12. Performance Data to Date
     • Implementation to date includes lockstep VMs and disk I/O input funneling and output checking (network I/O is implemented but not yet tested)
     • Bonnie I/O benchmark didn’t show any degradation;
     • SpecCPU benchmark suite showed 2-5% degradation.
  13. Plans
     • Implement the checkpoint and replay code
     • Work on performance of lockstep and checkpoint
     • Investigate UP, HVM and single-site restrictions.
