Hardware Errors and the OS

Introduction
Björn Döbel, OS Resilience, 06.08.2013

Hardware Errors in Theory
[Figure: transistor cross-section labeled with bulk substrate, source, drain, gate, and oxide layer, with charges marked by plus and minus signs]
Radiation-induced errors
  Cosmic radiation
  Alpha particles emitted by packaging
Thermal stress
Aging of circuitry
  Electromigration
  Hot Carrier Injection
  Negative-Bias Temperature Instability

Hardware Errors in the Real World
Several studies investigated how hardware errors manifest in software:
Saggese, 2005
  85% of hardware errors are masked
  Error outcome depends on the affected HW unit
Li, 2008, focus on permanent errors
  Permanent errors mainly lead to crashes / HW exceptions
  65% of errors corrupt OS state before crashing
Arlat, 2002, Chorus and LynxOS microkernels
  Significant amount (30%) of "no change" errors
  Some OS components are more error-prone than others
Wang, 2003, focus on branching errors
  Several cases (up to 40%) where taking a different branch does not change the program result

Challenges and Opportunities
Challenge: detect and correct hardware errors in software
Optimization Potential: don't track harmless errors
Challenge: Binary applications
Optimization Potential: Hardware-Level Concurrency

Fault Tolerance: State of the Union
[Figure: matrix with columns non-COTS and COTS and rows hardware errors and software errors. Systems placed in it, in order of appearance: RAD-hard CPUs, Redundant Multithreading, HP NonStop, IBM z/OS, SeL4, Minix3, Carburizer, SWIFT, Encoded Processing, Romain]

CS 101
[Figure: an application computes outputs from its inputs]
Determinism property: the same inputs always produce the same outputs

Redundant execution
[Figure: the collected inputs are fed to the replicas App, App', and App''; their results are compared for equality (=)]
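The comparison step of redundant execution can be sketched in a few lines. This is a minimal illustration, not Romain's implementation: the names `vote`, `run_replicated`, `app`, and `faulty_app` are hypothetical, and the bit flip is simulated by hand. It relies on the determinism property, so any divergence between replicas is treated as an error.

```python
from collections import Counter


def vote(outputs):
    """Return the majority output, or raise if no strict majority exists."""
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError("replica outputs diverge without a majority")
    return winner


def run_replicated(replicas, inputs):
    """Feed the same collected inputs to every replica and compare results."""
    outputs = [replica(*inputs) for replica in replicas]
    all_equal = len(set(outputs)) == 1
    return vote(outputs), all_equal


def app(a, b):
    # Deterministic application: same inputs, same output.
    return a * b + 1


def faulty_app(a, b):
    # Hypothetical fault model: one bit of the result is flipped.
    return (a * b + 1) ^ 0x4


result, agreed = run_replicated([app, faulty_app, app], (6, 7))
print(result, agreed)
```

With three replicas (TMR) the faulty result is outvoted and the error is corrected. With two replicas (DMR) a mismatch can only be detected, which is why `vote` raises when there is no strict majority.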

Inputs and Outputs
Inputs                                Outputs
System Calls                          System Calls
Shared Memory                         Shared Memory
I/O Memory                            I/O Memory
Special Instructions (e.g., rdtsc)
Hardware Interrupts                   Hardware Exceptions (e.g., page faults)
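Inputs such as rdtsc differ on every read, so replicas that read them independently diverge even without a fault. A minimal sketch of the usual remedy follows: read the input once and hand the same value to every replica. The counter `fake_tsc` is a stand-in for a timestamp counter, and all function names are assumptions made for illustration.

```python
import itertools


def replica(timestamp):
    # Deterministic computation on an externally supplied input value.
    return timestamp % 7


def run_unsynchronized(source, n):
    # Each replica reads the nondeterministic source itself.
    return [replica(source()) for _ in range(n)]


def run_with_proxy(source, n):
    # The source is read exactly once; all replicas see the same value.
    value = source()
    return [replica(value) for _ in range(n)]


# Stand-in for rdtsc: every call returns a new, larger value.
fake_tsc = itertools.count(1000).__next__

diverged = run_unsynchronized(fake_tsc, 3)
consistent = run_with_proxy(fake_tsc, 3)
print(diverged, consistent)
```

Only the second variant lets a later comparison tell real errors from harmless input differences.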

Process-Level Redundancy [Shye 2007]
Binary recompilation
Complex, unprotected compiler
Architecture-dependent
Reuse OS mechanisms
System calls for replica synchronization
Additional synchronization events
Virtual memory fault isolation
Restricted to Linux user-level programs
Microkernel-based

Transparent Replication as OS Service
[Figure: system stack labeled L4/Fiasco.OC microkernel, L4 Runtime Environment, Romain, Replicated Application, Replicated Driver, and Unreplicated Application; the Reliable Computing Base is marked]

Romain: Structure
[Figure: a master and three replicas; the master contains a system call proxy, a resource manager, and a comparison step (=)]

Resource Management: Capabilities
[Figure: capability tables with slots 1 to 6 for Replica 1, Replica 2, and the master]
Partitioned Capability Tables
[Figure: the same three tables with a legend for slots that are marked used and slots that are master private]
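One possible reading of the partitioned tables is that slots reserved for the master are marked used in every replica table, so replica allocations never land there and resolve to the same index in all replicas. The sketch below illustrates only that idea. The table size, the 0-based indices, and the names `MARKED_USED`, `make_replica_table`, and `alloc` are assumptions, not Romain's actual data structures.

```python
MARKED_USED = object()  # sentinel for slots the replica must not allocate


def make_replica_table(size, master_private):
    """Create a table in which master-private slots appear as already used."""
    return [MARKED_USED if i in master_private else None for i in range(size)]


def alloc(table, capability):
    """First-fit allocation; deterministic, so replicas get equal indices."""
    for index, slot in enumerate(table):
        if slot is None:
            table[index] = capability
            return index
    raise MemoryError("no free capability slot")


master_private = {0, 1}
replica1 = make_replica_table(6, master_private)
replica2 = make_replica_table(6, master_private)

indices1 = [alloc(replica1, cap) for cap in ("file", "irq")]
indices2 = [alloc(replica2, cap) for cap in ("file", "irq")]
print(indices1, indices2)
```

Because both replicas start from the same partitioned layout and allocate in the same order, a capability index passed in a system call means the same thing for every replica.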

Replica Memory Management
[Figure: Replica 1 and Replica 2, each with three memory regions (rw, ro, ro), and the master]

Shared Memory
Not in complete control of master
Standard technique: trap & emulate
  Execution overhead (x100 to x1000)
  Adds complexity to RCB
    Disassembler 6,000 LoC
    Tiny emulator 500 LoC
Our implementation: copy & execute

Copy&Execute
[Figure: build-up showing master and replica]
1. The replica's shared-memory access, mov eax, [ebx], traps (X)
2. The master holds a code buffer: load replica state; NOP; NOP; ...; NOP; restore master state
3. The trapping instruction is copied into the NOP slot
4. The master executes the buffer with the copied mov eax, [ebx] in place
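The buffer-patching step can be sketched as plain byte manipulation. The sketch assumes a NOP slot sized for the longest possible x86 instruction (15 bytes) and uses the 32-bit encoding 8B 03 for mov eax, [ebx]. The function name `patch_slot` is hypothetical, and the code around the slot that loads replica state and restores master state is left out.

```python
NOP = 0x90          # single-byte x86 NOP
MAX_INSN_LEN = 15   # architectural upper bound for one x86 instruction


def patch_slot(insn: bytes) -> bytes:
    """Return a NOP slot whose first bytes hold the copied instruction."""
    if not 0 < len(insn) <= MAX_INSN_LEN:
        raise ValueError("instruction does not fit into the slot")
    # Bytes after the instruction stay NOPs, so execution falls through
    # to whatever follows the slot.
    return insn + bytes([NOP]) * (MAX_INSN_LEN - len(insn))


MOV_EAX_EBX = bytes([0x8B, 0x03])  # mov eax, [ebx] in 32-bit mode

slot = patch_slot(MOV_EAX_EBX)
print(slot.hex())
```

Since the instruction is executed rather than interpreted, no disassembler or emulator for its semantics is needed. Only its length has to be known to copy it.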

Runtime Overhead
[Figure: SPEC INT 2006 runtime normalized vs. native execution (y-axis 1 to 1.3) for Single, DMR, and TMR. Benchmarks: 400.perl, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnet++, 473.astar. Annotated values: 1.45 and 1.95]

Replica-Core Placement Matters
[Figure: runtime normalized vs. native execution (y-axis 1 to 1.3) for 429.mcf, 462.libquantum, and 471.omnet++, each shown next to its "adj" variant]

Romain Lines of Code
Component                             LoC
Base code (main, logging, locking)    325
Application loader                    375
Replica manager                       628
Redundancy                            153
Memory manager                        445
System call proxy                     311
Shared memory                         281
Total                                 2,518
Fault injector                        668
GDB server stub                       1,304

User land is covered!
[Figure: system stack with Replicated Driver, Unreplicated Application, Replicated Application, L4 Runtime Environment, and Romain; the Reliable Computing Base is marked]

Minimizing the RCB
What to minimize?
Lines of Code (as in TCB)?
Time spent executing RCB code?
More likely: runtime × vulnerability

Hardening the RCB
We need: Dedicated mechanisms to protect the RCB (HW or SW)
We have: Full control over software
RAD-hardened hardware?
  Too expensive
Embrace heterogeneity!
  IBM Cell
  ARM big.LITTLE
Our proposal: Split HW into ResCores and NonRes-Cores
[Figure: one ResCore next to ten NonRes Cores]

Signaling Performance
[Figure: overhead in % (y-axis 10 to 60) by notification method (Local Faults, Migration, Sync IPC, Shared Mem) for susan and CRC32 under DMR and TMR]
Fast shared-memory message passing would be good!
  Intel SCC / Knights Corner
RCB/Non-RCB boundary is vulnerable
  Messaging / Exceptions need to function
  Must not overwrite other data

Is software-level protection feasible?
We have full source of the RCB.
Compiler support for fault tolerance (SWIFT, AN-Encoded Processing) may help.
Hasn't been done for kernel code yet.
Gedankenexperiment:
  We know how much RCB-related execution is added due to replication.
  We know average overheads for SWIFT (9.5%) and AN encoding (390%).

Modeling software-level RCB protection
[Figure: execution time split into application code ($t_{app}$), kernel system calls ($t_{kern}$), Romain master code ($t_{master}$), additional kernel invocations ($t'_{kern}$), and hardware stalls, e.g., caching ($t_{hw}$), grouped into native execution time and replication overhead]
$T = t_{nat} + t_{rep} = t_{app} + t_{kern} + t_{master} + t'_{kern} + t_{hw}$
$T_{prot} = t_{app} + C \times (t_{kern} + t_{master} + t'_{kern} + t_{hw})$
With $t_{kern} = t'_{kern} = t_{hw} = 0$:
$T_{prot} = t_{app} + C \times t_{master}$
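The model is easy to evaluate numerically. The sketch below assumes that C is the slowdown factor of the protection mechanism, so an average overhead of x% becomes C = 1 + x/100. It takes the 9.5% (SWIFT) and 390% (AN encoding) overheads from the Gedankenexperiment. The time values for t_app and t_master are made-up inputs for illustration only, and the function names are hypothetical.

```python
def slowdown_factor(overhead_percent):
    """Convert an overhead in percent into a multiplicative factor C."""
    return 1.0 + overhead_percent / 100.0


def protected_runtime(t_app, t_master, c,
                      t_kern=0.0, t_kern_extra=0.0, t_hw=0.0):
    """T_prot = t_app + C * (t_kern + t_master + t'_kern + t_hw)."""
    return t_app + c * (t_kern + t_master + t_kern_extra + t_hw)


C_SWIFT = slowdown_factor(9.5)
C_AN = slowdown_factor(390.0)

# Hypothetical measurement: 100 time units of application code,
# 5 time units spent in Romain master code.
t_app, t_master = 100.0, 5.0

unprotected = protected_runtime(t_app, t_master, c=1.0)
with_swift = protected_runtime(t_app, t_master, C_SWIFT)
with_an = protected_runtime(t_app, t_master, C_AN)
print(unprotected, with_swift, with_an)
```

Because only the master's share of the runtime is multiplied by C, even a large per-instruction slowdown adds little to the total when the master runs rarely.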

Estimating RCB protection runtime
[Figure: estimated runtime normalized vs. native execution (y-axis 1 to 1.6) for Romain only, Romain+SWIFT, and Romain+ANBD. Benchmarks: 400.perl, 401.bzip2, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnet++, 473.astar]

Summary
OS-level techniques to tolerate SW and HW faults
Address-space isolation
Microreboots
Various ways of handling session state
Replication against hardware errors
Special care needed to protect Reliable Computing Base

Further Reading
Minix3: Jorrit Herder, Ben Gras, Philip Homburg, Andrew S. Tanenbaum: Fault Isolation for Device Drivers, DSN 2009
CuriOS: Francis M. David, Ellick M. Chan, Jeffrey C. Carlyle, Roy H. Campbell: CuriOS: Improving Reliability through Operating System Structure, OSDI 2008
L4ReAnimator: Dirk Vogt, Björn Döbel, Adam Lackorzynski: Stay strong, stay safe: Enhancing Reliability of a Secure Operating System, IIDS 2010
Reliability Analysis:
  Saggese et al.: An Experimental Study of Soft Errors in Microprocessors, IEEE Micro 2005
  Li et al.: Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design, ASPLOS 2008
  Arlat et al.: Dependability of COTS Microkernel-Based Systems, IEEE ToCS 2002
  Wang et al.: Y-Branches: When you come to a Fork in the Road: Take it!, PACT 2003
PLR: Alex Shye, Tipp Moseley, Vijay Janapa Reddi, Joseph Blomstedt, Ramesh Peri: Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance, DSN 2007
Romain:
  Björn Döbel, Hermann Härtig, Michael Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012
  Björn Döbel, Hermann Härtig: Who watches the watchmen? - Protecting Operating System Reliability Mechanisms, HotDep 2012
