2. Introduction
Hardware Errors in Theory
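[Figure: cross-section of a transistor showing the bulk substrate, source, drain, gate, and oxide layer, with negative charges at source and drain and positive charges at the gate]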
Björn Döbel, OS Resilience, 06.08.2013 (slide 31 / 58)
3. Introduction
Hardware Errors in Theory
Radiation-induced errors
  Cosmic radiation
  Alpha particles emitted by packaging
Thermal stress
Aging of circuitry
  Electromigration
  Hot Carrier Injection
  Negative-Bias Temperature Instability
5. Introduction
Hardware Errors in the Real World
Several studies investigated manifestation of hardware errors in software:
Saggese, 2005
  85% of hardware errors masked
  Error outcome depends on affected HW unit
Li, 2008, focus on permanent errors
  Permanent errors mainly lead to crashes / HW exceptions
  65% of errors corrupt OS state before crashing
Arlat, 2002, Chorus and LynxOS microkernels
  Significant amount (30%) of "no change" errors
  Some OS components are more error-prone than others
Wang, 2003, focus on branching errors
  Several cases (up to 40%) where taking different branch does not change program result
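To make the notion of a "masked" error concrete, here is a toy fault-injection sketch. It is not the methodology of any of the studies cited above: the workload `low_byte`, the input value, and the helper names are all hypothetical. It flips each bit of an input once and counts how often the output stays the same.

```python
def flip_bit(value: int, bit: int) -> int:
    """Model a single-bit upset by inverting one bit of an integer."""
    return value ^ (1 << bit)


def masked_fraction(func, value: int, width: int = 32) -> float:
    """Fraction of single-bit flips in `value` that leave func's output unchanged."""
    golden = func(value)  # fault-free reference result
    masked = sum(1 for bit in range(width)
                 if func(flip_bit(value, bit)) == golden)
    return masked / width


def low_byte(x: int) -> int:
    # Hypothetical workload: only the lowest 8 bits influence the result.
    return x & 0xFF


# Flips in bits 8..31 never reach the output, so 24 of 32 flips are masked.
fraction = masked_fraction(low_byte, 0x12345678)
```

Real masking rates depend on the hardware unit and the workload, which is why the studies above report such different numbers.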
9. Introduction
Challenges and Opportunities
Challenge: detect and correct hardware errors in software
Optimization Potential: don’t track harmless errors
Challenge: Binary applications
Optimization Potential: Hardware-Level Concurrency
15. Introduction
Fault Tolerance: State of the Union
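[Figure: 2×2 matrix of fault-tolerance approaches. Columns: non-COTS, COTS. Rows: hardware errors, software errors. Entries, in the order they are revealed: RAD-hard CPUs and Redundant Multithreading; HP NonStop and IBM z/OS; seL4, Minix3, and Carburizer; SWIFT and Encoded Processing; Romain.]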
41. Introduction
Shared Memory
Not in complete control of master
Standard technique: trap & emulate
  Execution overhead (100× to 1000×)
  Adds complexity to RCB
    Disassembler: 6,000 LoC
    Tiny emulator: 500 LoC
Our implementation: copy & execute
57. Introduction
Minimizing the RCB
What to minimize?
Lines of Code (as in TCB)?
Time spent executing RCB code?
More likely: runtime × vulnerability
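One possible reading of this metric, as a rough sketch only: weight each RCB component's share of execution time by an estimate of how likely a fault in it leads to a failure, then sum. The component names, the numbers, and the function `rcb_exposure` are hypothetical and are not taken from the talk.

```python
def rcb_exposure(components):
    """Sum of runtime share x vulnerability over all RCB components."""
    return sum(c["runtime_share"] * c["vulnerability"] for c in components)


# Hypothetical components and values, for illustration only.
components = [
    {"name": "kernel", "runtime_share": 0.02, "vulnerability": 0.5},
    {"name": "replica manager", "runtime_share": 0.05, "vulnerability": 0.25},
]

exposure = rcb_exposure(components)
```

Under this reading, a large but rarely executed component can matter less than a small one on the hot path, which is why lines of code alone may be the wrong thing to minimize.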
59. Introduction
Hardening the RCB
We need: dedicated mechanisms to protect the RCB (HW or SW)
We have: full control over software
RAD-hardened hardware? Too expensive
Embrace heterogeneity!
  IBM Cell
  ARM big.LITTLE
Our proposal: Split HW into ResCores and NonRes-Cores
[Figure: one ResCore next to ten NonRes cores]
61. Introduction
Signaling Performance
[Figure: Overhead by notification method. Y-axis: overhead in % (ticks from 10 to 60). Legend: Local Faults, Migration, Sync IPC, Shared Mem. Benchmarks: susan and CRC32, each in DMR and TMR configurations.]
Fast shared-memory message passing would be good!
  Intel SCC / Knights Corner
RCB/Non-RCB boundary is vulnerable
  Messaging / exceptions need to function
  Must not overwrite other data
63. Introduction
Is software-level protection feasible?
We have full source of the RCB.
Compiler support for fault tolerance (SWIFT, AN-Encoded Processing) may help.
Hasn’t been done for kernel code yet.
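For readers unfamiliar with AN encoding, the basic idea can be sketched as follows: every value x is stored as the code word A * x, and any word that is not a multiple of A is treated as corrupted. The constant below and the helper names are illustrative assumptions, not the scheme used by any particular compiler.

```python
A = 58659  # illustrative odd constant; real deployments choose A with care


def encode(x: int) -> int:
    """Map a plain integer to its AN code word A * x."""
    return A * x


def is_valid(code: int) -> bool:
    """A code word is valid only if it is a multiple of A."""
    return code % A == 0


def decode(code: int) -> int:
    """Check the code word and recover the plain value."""
    if not is_valid(code):
        raise ValueError("corrupted code word detected")
    return code // A


# Addition works directly on code words: A*x + A*y == A*(x + y).
total = encode(3) + encode(4)
assert decode(total) == 7

# A single flipped bit changes the word by a power of two, which is never
# a multiple of an odd A > 1, so the validity check fails.
corrupted = total ^ (1 << 5)
assert not is_valid(corrupted)
```

Every arithmetic operation then runs on the larger encoded values and needs extra checks, which gives some intuition for why encoded processing is costly.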
Gedankenexperiment:
We know how much RCB-related execution is added due to replication.
We know average overheads for SWIFT (9.5%) and AN encoding (390%)
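The arithmetic behind such a thought experiment might look like the sketch below. It assumes a simple model in which only RCB code is slowed down by the hardening technique and everything else runs unchanged. The RCB share of 10% is a hypothetical placeholder, not a measured value; the two overhead figures are the ones quoted above.

```python
SWIFT_OVERHEAD = 0.095  # 9.5%, as quoted above
AN_OVERHEAD = 3.90      # 390%, as quoted above


def added_runtime(rcb_share: float, hardening_overhead: float) -> float:
    """Extra runtime, as a fraction of the original total, when only the
    RCB share of execution is slowed down by `hardening_overhead`."""
    return rcb_share * hardening_overhead


RCB_SHARE = 0.10  # hypothetical: 10% of execution time spent in RCB code

swift_cost = added_runtime(RCB_SHARE, SWIFT_OVERHEAD)  # about 1% extra
an_cost = added_runtime(RCB_SHARE, AN_OVERHEAD)        # about 39% extra
```

The smaller the share of time spent in the RCB, the more affordable even an expensive protection scheme becomes.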
66. Introduction
Summary
OS-level techniques to tolerate SW and HW faults
Address-space isolation
Microreboots
Various ways of handling session state
Replication against hardware errors
Special care needed to protect Reliable Computing Base
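The core of replication-based protection is comparing replica outputs. A minimal sketch of a majority voter, assuming replica outputs are plain hashable values, is shown below. The function `vote` is hypothetical and is not Romain's implementation.

```python
from collections import Counter


def vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises RuntimeError when no strict majority exists, i.e. an error was
    detected but cannot be corrected by voting alone.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count > len(outputs):
        return value
    raise RuntimeError("replica outputs diverge without a majority")


# TMR: one faulty replica is outvoted.
tmr_result = vote([42, 42, 7])

# DMR: a mismatch is detected, but there is no majority to pick a winner.
try:
    vote([42, 7])
    dmr_detected = False
except RuntimeError:
    dmr_detected = True
```

The voter itself is part of the Reliable Computing Base: if it fails, the replicas cannot help.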
67. Introduction
Further Reading
Minix3: Jorrit Herder, Ben Gras, Philip Homburg, Andrew S. Tanenbaum: Fault Isolation for Device Drivers, DSN 2009
CuriOS: Francis M. David, Ellick M. Chan, Jeffrey C. Carlyle, Roy H. Campbell: CuriOS: Improving Reliability through Operating System Structure, OSDI 2008
L4ReAnimator: Dirk Vogt, Björn Döbel, Adam Lackorzynski: Stay strong, stay safe: Enhancing Reliability of a Secure Operating System, IIDS 2010
68. Introduction
Further Reading
Reliability Analysis:
  Saggese et al.: An Experimental Study of Soft Errors in Microprocessors, IEEE Micro 2005
  Li et al.: Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design, ASPLOS 2008
  Arlat et al.: Dependability of COTS Microkernel-Based Systems, IEEE ToCS 2002
  Wang et al.: Y-Branches: When you come to a Fork in the Road: Take it!, PACT 2003
PLR: Alex Shye, Tipp Moseley, Vijay Janapa Reddi, Joseph Blomstedt, Ramesh Peri: Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance, DSN 2007
Romain:
  Björn Döbel, Hermann Härtig, Michael Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012
  Björn Döbel, Hermann Härtig: Who watches the watchmen? – Protecting Operating System Reliability Mechanisms, HotDep 2012