SlideShare a Scribd company logo
1 of 46
Tolerating Hardware Device Failures in
Software
Asim Kadav, Matthew J. Renzelmann, Michael M. Swift
University ofWisconsin-Madison
Current state of OS-hardware interaction
• Many device drivers assume device perfection
» Common Linux network driver: 3c59x .c
4/14/2014 Tolerating Hardware Device Failures in Software
While (ioread16(ioaddr + Wn7_MasterStatus))
& 0x8000)
;
Hardware dependence bug: Device malfunction can crash the system
HANG!
void hptitop_iop_request_callback(...) {
arg= readl(...);
...
if (readl(&req->result) == IOP_SUCCESS) {
arg->result = HPT_IOCTL_OK;
}
}
Current state of OS-hardware interaction
• Hardware dependence bugs across driver classes
4/14/2014 Tolerating Hardware Device Failures in Software
*Code simplified for presentation purposes
Highpoint SCSI driver(hptiop.c)
How do the hardware bugs manifest?
• Drivers often trust hardware to always work correctly
» Drivers use device data in critical control and data paths
» Drivers do not report device malfunctions to system log
» Drivers do not detect or recover from device failures
4/14/2014 Tolerating Hardware Device Failures in Software
An example:Windows servers
• Transient hardware failures caused 8% of all crashes
and 9% of all unplanned reboots[1]
» Systems work fine after reboots
» Vendors report returned device was faultless
• Existing solution is hand-coded hardened driver:
» Crashes reduced from 8% to 3%
• Driver isolation systems not yet deployed
4/14/2014 Tolerating Hardware Device Failures in Software
[1] Fault resilient drivers for Longhorn server, May 2004. Microsoft Corp.
Carburizer
• Goal:Tolerate hardware device failures in software
through hardware failure detection and recovery
• Static analysis tool - analyze and insert code to:
» Detect and fix hardware dependence bugs
» Detect and generate missing error reporting information
• Runtime
» Handle interrupt failures
» Transparently recover from failures
4/14/2014 Tolerating Hardware Device Failures in Software
Outline
• Background
• Hardening drivers
• Reporting errors
• Runtime fault tolerance
• Cost of carburizing
• Conclusion
4/14/2014 Tolerating Hardware Device Failures in Software
Hardware unreliability
• Sources of hardware misbehavior:
» Device wear-out, insufficient burn-in
» Bridging faults
» Electromagnetic radiation
» Firmware bugs
• Result of misbehavior:
» Corrupted/stuck-at inputs
» Timing errors/unpredictable DMA
» Interrupt storms/missing interrupts
4/14/2014 Tolerating Hardware Device Failures in Software
Vendor recommendations for driver developers
4/14/2014 Tolerating Hardware Device Failures in Software
Recommendation Summary Recommended by
Intel Sun MS Linux
Validation Input validation   
Read once& CRC data   
DMA protection  
Timing Infinite polling   
Stuck interrupt 
Lost request 
Avoid excess delay in OS 
Unexpected events  
Reporting Report all failures   
Recovery Handle all failures  
Cleanup correctly  
Do not crash on failure   
Wrap I/O memory access    
Goal: Automatically implement as many recommendations as
possible in commodity drivers
Carburizer architecture
4/14/2014 Tolerating Hardware Device Failures in Software
OS Kernel
If (c==0) {
.
print
(“Driver
init”);
}
.
.
Driver
Carburizer
If (c==0) {
.
print (“Driver
init”);
}
.
.
Compile-time components Run-time components
Hardened
Driver Binary
Faulty
Hardware
Carburizer
Runtime
Kernel Interface
Compiler
Outline
• Background
• Hardening drivers
» Finding sensitive code
» Repairing code
• Reporting errors
• Runtime fault tolerance
• Cost of carburizing
• Conclusion
4/14/2014 Tolerating Hardware Device Failures in Software
Hardening drivers
• Goal: Remove hardware dependence bugs
» Find driver code that uses data from device
» Ensure driver performs validity checks
• Carburizer detects and fixes hardware bugs from
» Infinite polling
» Unsafe static/dynamic array reference
» Unsafe pointer dereferences
» System panic calls
4/14/2014 Tolerating Hardware Device Failures in Software
Hardening drivers
• Finding sensitive code
» First pass: Identify tainted variables
4/14/2014 Tolerating Hardware Device Failures in Software
Finding sensitive code
First pass: Identify tainted variables
4/14/2014 Tolerating Hardware Device Failures in Software
int test () {
a = readl();
b = inb();
c = b;
d = c + 2;
return d;
}
int set() {
e = test();
}
Tainted
Variables
a
b
c
d
test()
e
Detecting risky uses of tainted variables
• Finding sensitive code
» Second pass: Identify risky uses of tainted variables
• Example: Infinite polling
» Driver waiting for device to enter particular state
» Solution: Detect loops where all terminating
conditions depend on tainted variables
4/14/2014 Tolerating Hardware Device Failures in Software
Example: Infinite polling
Finding sensitive code
4/14/2014 Tolerating Hardware Device Failures in Software
static int amd8111e_read_phy(………)
{
...
reg_val = readl(mmio + PHY_ACCESS);
while (reg_val & PHY_CMD_ACTIVE)
reg_val = readl(mmio + PHY_ACCESS)
.
}
AMD 8111e network driver(amd8111e.c)
Not all bugs are obvious
4/14/2014 Tolerating Hardware Device Failures in Software
while (DAC960_PD_StatusAvailableP(ControllerBaseAddress))
{
DAC960_V1_CommandIdentifier_T CommandIdentifier= DAC960_PD_ReadStatusCommandIdentifier
(ControllerBaseAddress);
DAC960_Command_T *Command = Controller ->Commands [CommandIdentifier-1];
DAC960_V1_CommandMailbox_T *CommandMailbox = &Command->V1.CommandMailbox;
DAC960_V1_CommandOpcode_T CommandOpcode=CommandMailbox->Common.CommandOpcode;
Command->V1.CommandStatus =DAC960_PD_ReadStatusRegister(ControllerBaseAddress);
DAC960_PD_AcknowledgeInterrupt(ControllerBaseAddress);
DAC960_PD_AcknowledgeStatus(ControllerBaseAddress);
switch (CommandOpcode)
{
case DAC960_V1_Enquiry_Old:
DAC960_P_To_PD_TranslateReadWriteCommand(CommandMailbox);
…
}
DAC960 Raid Controller(DAC960.c)
Detecting risky uses of tainted variables
• Example II: Unsafe array accesses
» Tainted variables used as array index into static or
dynamic arrays
» Tainted variables used as pointers
4/14/2014 Tolerating Hardware Device Failures in Software
Example: Unsafe array accesses
Unsafe array accesses
4/14/2014 Tolerating Hardware Device Failures in Software
static void __init attach_pas_card(...)
{
if ((pas_model = pas_read(0xFF88)))
{
...
sprintf(temp, “%s rev %d”,
pas_model_names[(int) pas_model], pas_read(0x2789));
...
}
Pro Audio Sound driver (pas2_card.c)
Analysis results over the Linux kernel
• Analyzed drivers in 2.6.18.8 Linux kernel
» 6300 driver source files
» 2.8 million lines of code
» 37 minutes to analyze and compile code
• Additional analyses to detect existing validation
code
4/14/2014 Tolerating Hardware Device Failures in Software
Analysis results over the Linux kernel
• Found 992 bugs in driver code
• False positive rate: 7.4% (manual sampling of 190 bugs)
4/14/2014 Tolerating Hardware Device Failures in Software
Driver class Infinite
polling
Static array Dynamic
array
Panic calls
net 117 2 21 2
scsi 298 31 22 121
sound 64 1 0 2
video 174 0 22 22
other 381 9 57 32
Total 860 43 89 179
Many cases of poorly written drivers with hardware dependence bugs
Repairing drivers
• Hardware dependence bugs difficult to test
• Carburizer automatically generates repair code
» Inserts timeout code for infinite loops
» Inserts checks for unsafe array/pointer references
» Replaces calls to panic() with recovery service
» Triggers generic recovery service on device failure
4/14/2014 Tolerating Hardware Device Failures in Software
Carburizer automatically fixes infinite loops
4/14/2014 Tolerating Hardware Device Failures in Software
timeout = rdstcll(start) + (cpu/khz/HZ)*2;
reg_val = readl(mmio + PHY_ACCESS);
while (reg_val & PHY_CMD_ACTIVE) {
reg_val = readl(mmio + PHY_ACCESS);
if (_cur < timeout)
rdstcll(_cur);
else
__recover_driver();
}
*Code simplified for presentation purposes
Timeout code
added
AMD 8111e network driver(amd8111e.c)
Carburizer automatically adds bounds checks
4/14/2014 Tolerating Hardware Device Failures in Software
static void __init attach_pas_card(...)
{
if ((pas_model = pas_read(0xFF88)))
{
...
if ((pas_model< 0)) || (pas_model>= 5))
__recover_driver();
.
sprintf(temp, “%s rev %d”,
pas_model_names[(int) pas_model], pas_read(0x2789));
}
*Code simplified for presentation purposes
Array bounds
check added
Pro Audio Sound driver (pas2_card.c)
Runtime fault recovery
• Low cost transparent recovery
» Based on shadow drivers
» Records state of driver
» Transparent restart and state
replay on failure
• Independent of any isolation
mechanism (like Nooks)
4/14/2014 Tolerating Hardware Device Failures in Software
Shadow
Driver
Device
Driver
Device
Taps
Driver-Kernel
Interface
Device/Driver Original Driver Carburizer
Behavior Detection Behavior Detection Recovery
3COM 3C905 CRASH None RUNNING Yes Yes
DEC DC 21x4x CRASH None RUNNING Yes Yes
Experimental validation
4/14/2014 Tolerating Hardware Device Failures in Software
• Synthetic fault injection on network drivers
• Results
Carburizer failure detection and transparent recovery work
well for transient device failures
Outline
• Background
• Hardening drivers
• Reporting errors
• Runtime fault tolerance
• Cost of carburizing
• Conclusion
4/14/2014 Tolerating Hardware Device Failures in Software
Reporting errors
• Drivers often fail silently and fail to report device errors
» Drivers should proactively report device failures
» Fault management systems require these inputs
• Driver already detects failure but does not report them
• Carburizer analysis performs two functions
» Detect when there is a device failure
» Report unless the driver is already reporting the failure
4/14/2014 Tolerating Hardware Device Failures in Software
Detecting driver detected device failures
• Detect code that depends on tainted variables
» Perform unreported loop timeouts
» Returns negative error constants
» Jumps to common cleanup code
4/14/2014 Tolerating Hardware Device Failures in Software
while (ioread16 (regA) == 0x0f) {
if (timeout++ == 200) {
sys_report(“Device timed out %s.n”, mod_name);
return (-1);
}
}
Reporting code
added by
Carburizer
Detecting existing reporting code
Carburizer detects function calls with string arguments
4/14/2014 Tolerating Hardware Device Failures in Software
static u16 gm_phy_read(...)
{
...
if (__gm_phy_read(...))
printk(KERN_WARNING "%s: ...n”, ...);
Carburizer
detects existing
reporting code
SysKonnect network driver(skge.c)
Evaluation
• Manual analysis of drivers of different classes
• No false positives
• Fixed 1135 cases of unreported timeouts and 467 cases of
unreported device failures in Linux drivers
4/14/2014 Tolerating Hardware Device Failures in Software
Driver Class Driver detected
device failures
Carburizer
reported failures
bnx2 network 24 17
mptbase scsi 28 17
ens1371 sound 10 9
Carburizer automatically improves the fault diagnosis
capabilities of the system
Outline
• Background
• Hardening drivers
• Reporting errors
• Runtime fault tolerance
• Cost of carburizing
• Conclusion
4/14/2014 Tolerating Hardware Device Failures in Software
Runtime failure detection
• Static analysis cannot detect all device failures
» Missing interrupts: expected but never arrives
» Stuck interrupts (interrupts storm): interrupt cleared
by driver but continues to be asserted
4/14/2014 Tolerating Hardware Device Failures in Software
Tolerating missing interrupts
4/14/2014 Tolerating Hardware Device Failures in Software
Driver
Hardware
Device
Request
Interrupt
responses
• Detect when to expect interrupts
» Detect driver activity via referenced bits
» Invoke ISR when bits referenced but no
interrupt activity
• Detect how often to poll
» Dynamic polling based on previous
invocation result
Tolerating stuck interrupts
• Driver interrupt handler is called too many times
• Convert the device from interrupts to polling
4/14/2014 Tolerating Hardware Device Failures in Software
DriverType Driver Name Throughput reduction due to polling
Disk ide-core,ide-disk, ide-generic Reduced by 50%
Network e1000 Reduced from 750 Mb/s to 130 Mb/s
Sound ens1371 Sounds plays with distortion
Carburizer ensures system and device make forward progress
Outline
• Background
• Hardening drivers
• Reporting errors
• Runtime fault tolerance
• Cost of carburizing
• Conclusion
4/14/2014 Tolerating Hardware Device Failures in Software
Throughput overhead
4/14/2014 Tolerating Hardware Device Failures in Software
940
721
935
720
0
200
400
600
800
1000
nVIDIA MCP 55 Intel Pro 1000
ThroughputinMbps
NetworkCardType
Linux Kernel
Carburizer
Kernel
netperf on 2.2 GHz AMD machines
CPU overhead
4/14/2014 Tolerating Hardware Device Failures in Software
31
16
36
16
31
16
0
5
10
15
20
25
30
35
40
nVIDIA MCP 55 Intel Pro 1000
CPUUtilization(%)
Network CardType
Linux Kernel
Carburizer Kernel
with recovery
Carburizer Kernel
w/o recovery
Almost no overhead from hardened drivers and automatic recovery
netperf on 2.2 GHz AMD machines
Conclusion
4/14/2014 Tolerating Hardware Device Failures in Software
Recommendation Summary Recommended by
Intel Sun MS Linux
Validation Input validation   
Read once& CRC data   
DMA protection  
Timing Infinite polling   
Stuck interrupt 
Lost request 
Avoid excess delay in OS 
Unexpected events  
Reporting Report all failures   
Recovery Handle all failures  
Cleanup correctly  
Do not crash on failure   
Wrap I/O memory access    
Conclusion
4/14/2014 Tolerating Hardware Device Failures in Software
Recommendation Summary Recommended by Carburizer
EnsuresIntel Sun MS Linux
Validation Input validation    
Read once& CRC data   
DMA protection  
Timing Infinite polling    
Stuck interrupt  
Lost request  
Avoid excess delay in OS 
Unexpected events  
Reporting Report all failures    
Recovery Handle all failures   
Cleanup correctly   
Do not crash on failure    
Wrap I/O memory access    
Carburizer improves system reliability by automatically ensuring
that hardware failures are tolerated in software
ThankYou
• Contact
» kadav@cs.wisc.edu
• Visit our website for research on drivers
» http://cs.wisc.edu/~swift/drivers
4/14/2014 Tolerating Hardware Device Failures in Software
OS
Kernel
If
(c==0) {
.
print
(“Driver
init”);
}
.
.
Driver
Carburizer
If (c==0) {
.
print (“Driver
init”);
}
.
.
Compile-time components Run-time components
Hardened
Driver Binary
Faulty
Hardware
Carburizer
Runtime
Kernel Interface
Compiler
Backup slides
4/14/2014 Tolerating Hardware Device Failures in Software
Improving analysis accuracy
• Detect existing driver validation code
» Track variable taint history
» Detect existing timeout code
» Detect existing sanity checks
4/14/2014 Tolerating Hardware Device Failures in Software
while ((inb(nic_base + EN0_ISR) & ENISR_RDC) == 0)
if (jiffies - dma_start> 2) {
...
break;
}
ne2000 network driver (ne2k-pci.c)
Trend of hardware dependence bugs
• Many drivers either had one or two hardware bugs
» Developers were mostly careful but forgot in a few places
• Small number of drivers were badly written
»Developers did not account H/W dependence; many bugs
4/14/2014 Tolerating Hardware Device Failures in Software
Implementation efforts
• Carburizer static analysis tool
» 3230 LOC in OCaml
• Carburizer runtime (Interrupt Monitoring)
» 1030 lines in C
• Carburizer runtime (Shadow drivers)
»19000 LOC in C
»~70% wrappers – can be automatically generated by scripts
4/14/2014 Tolerating Hardware Device Failures in Software
Vendor recommendations for driver developers
4/14/2014 Tolerating Hardware Device Failures in Software
Recommendation Summary Vendors Carburizer
EnsuresIntel Sun MS Linux
Validation Input validation    
Read once& CRC data   
DMA protection  
Timing Infinite polling    
Stuck interrupt  
Lost request  
Avoid excess delay in OS 
Unexpected events  
Reporting Report all failures    
Recovery Handle all failures   
Cleanup correctly   
Do not crash on failure    
Wrap I/O memory access    

More Related Content

What's hot

"Man-in-the-SCADA": Anatomy of Data Integrity Attacks in Industrial Control S...
"Man-in-the-SCADA": Anatomy of Data Integrity Attacks in Industrial Control S..."Man-in-the-SCADA": Anatomy of Data Integrity Attacks in Industrial Control S...
"Man-in-the-SCADA": Anatomy of Data Integrity Attacks in Industrial Control S...Marina Krotofil
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersLinaro
 
Top schools in delhi ncr
Top schools in delhi ncrTop schools in delhi ncr
Top schools in delhi ncrEdhole.com
 
Controlling PC on ARM using Fault Injection
Controlling PC on ARM using Fault InjectionControlling PC on ARM using Fault Injection
Controlling PC on ARM using Fault InjectionRiscure
 
Get it right the first time lpddr4 validation and compliance test
Get it right the first time lpddr4 validation and compliance testGet it right the first time lpddr4 validation and compliance test
Get it right the first time lpddr4 validation and compliance testBarbara Aichinger
 
[CLASS 2014] Palestra Técnica - Jonathan Knudsen
[CLASS 2014] Palestra Técnica - Jonathan Knudsen[CLASS 2014] Palestra Técnica - Jonathan Knudsen
[CLASS 2014] Palestra Técnica - Jonathan KnudsenTI Safe
 
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...CODE BLUE
 
How to drive a malware analyst crazy
How to drive a malware analyst crazyHow to drive a malware analyst crazy
How to drive a malware analyst crazyMichael Boman
 
Some skills required to be a computer hardware engineer professional
Some skills required to be a computer hardware engineer professionalSome skills required to be a computer hardware engineer professional
Some skills required to be a computer hardware engineer professionalSayed Ahmed
 
Managing securityforautomotivesoc
Managing securityforautomotivesocManaging securityforautomotivesoc
Managing securityforautomotivesocPankaj Singh
 
IFDIS PHM Conference 2016
IFDIS   PHM Conference 2016IFDIS   PHM Conference 2016
IFDIS PHM Conference 2016Ken Anderson
 
Track B- Advanced ESL verification - Mentor
Track B- Advanced ESL verification - MentorTrack B- Advanced ESL verification - Mentor
Track B- Advanced ESL verification - Mentorchiportal
 
CODE BLUE 2014 : DeviceDisEnabler : A hypervisor which hides devices to prote...
CODE BLUE 2014 : DeviceDisEnabler : A hypervisor which hides devices to prote...CODE BLUE 2014 : DeviceDisEnabler : A hypervisor which hides devices to prote...
CODE BLUE 2014 : DeviceDisEnabler : A hypervisor which hides devices to prote...CODE BLUE
 
Practical Malware Analysis: Ch 10: Kernel Debugging with WinDbg
Practical Malware Analysis: Ch 10: Kernel Debugging with WinDbgPractical Malware Analysis: Ch 10: Kernel Debugging with WinDbg
Practical Malware Analysis: Ch 10: Kernel Debugging with WinDbgSam Bowne
 
Radio Frequency Test Stands for Remote Controllers
Radio Frequency Test Stands for Remote ControllersRadio Frequency Test Stands for Remote Controllers
Radio Frequency Test Stands for Remote ControllersUjjal Dutt, PMP
 

What's hot (17)

"Man-in-the-SCADA": Anatomy of Data Integrity Attacks in Industrial Control S...
"Man-in-the-SCADA": Anatomy of Data Integrity Attacks in Industrial Control S..."Man-in-the-SCADA": Anatomy of Data Integrity Attacks in Industrial Control S...
"Man-in-the-SCADA": Anatomy of Data Integrity Attacks in Industrial Control S...
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 Servers
 
S emb t5-arch_io
S emb t5-arch_ioS emb t5-arch_io
S emb t5-arch_io
 
Top schools in delhi ncr
Top schools in delhi ncrTop schools in delhi ncr
Top schools in delhi ncr
 
Controlling PC on ARM using Fault Injection
Controlling PC on ARM using Fault InjectionControlling PC on ARM using Fault Injection
Controlling PC on ARM using Fault Injection
 
Get it right the first time lpddr4 validation and compliance test
Get it right the first time lpddr4 validation and compliance testGet it right the first time lpddr4 validation and compliance test
Get it right the first time lpddr4 validation and compliance test
 
[CLASS 2014] Palestra Técnica - Jonathan Knudsen
[CLASS 2014] Palestra Técnica - Jonathan Knudsen[CLASS 2014] Palestra Técnica - Jonathan Knudsen
[CLASS 2014] Palestra Técnica - Jonathan Knudsen
 
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
 
How to drive a malware analyst crazy
How to drive a malware analyst crazyHow to drive a malware analyst crazy
How to drive a malware analyst crazy
 
Some skills required to be a computer hardware engineer professional
Some skills required to be a computer hardware engineer professionalSome skills required to be a computer hardware engineer professional
Some skills required to be a computer hardware engineer professional
 
Managing securityforautomotivesoc
Managing securityforautomotivesocManaging securityforautomotivesoc
Managing securityforautomotivesoc
 
IFDIS PHM Conference 2016
IFDIS   PHM Conference 2016IFDIS   PHM Conference 2016
IFDIS PHM Conference 2016
 
Embedded system - introduction to arm7
Embedded system -  introduction to arm7Embedded system -  introduction to arm7
Embedded system - introduction to arm7
 
Track B- Advanced ESL verification - Mentor
Track B- Advanced ESL verification - MentorTrack B- Advanced ESL verification - Mentor
Track B- Advanced ESL verification - Mentor
 
CODE BLUE 2014 : DeviceDisEnabler : A hypervisor which hides devices to prote...
CODE BLUE 2014 : DeviceDisEnabler : A hypervisor which hides devices to prote...CODE BLUE 2014 : DeviceDisEnabler : A hypervisor which hides devices to prote...
CODE BLUE 2014 : DeviceDisEnabler : A hypervisor which hides devices to prote...
 
Practical Malware Analysis: Ch 10: Kernel Debugging with WinDbg
Practical Malware Analysis: Ch 10: Kernel Debugging with WinDbgPractical Malware Analysis: Ch 10: Kernel Debugging with WinDbg
Practical Malware Analysis: Ch 10: Kernel Debugging with WinDbg
 
Radio Frequency Test Stands for Remote Controllers
Radio Frequency Test Stands for Remote ControllersRadio Frequency Test Stands for Remote Controllers
Radio Frequency Test Stands for Remote Controllers
 

Viewers also liked

Who are the INTERNET SERVICE PROVIDERS?
Who are the INTERNET SERVICE PROVIDERS?Who are the INTERNET SERVICE PROVIDERS?
Who are the INTERNET SERVICE PROVIDERS?Likan Patra
 
Productivity software presentation
Productivity software presentationProductivity software presentation
Productivity software presentationThomas Cormier
 
MIS - IT Infrastructure (Part I)
MIS  - IT Infrastructure (Part I)MIS  - IT Infrastructure (Part I)
MIS - IT Infrastructure (Part I)Soetam Rizky
 
Web Services PHP Tutorial
Web Services PHP TutorialWeb Services PHP Tutorial
Web Services PHP TutorialLorna Mitchell
 
Hardware and software ppt
Hardware and software pptHardware and software ppt
Hardware and software pptshamitamurali
 
Computer hardware presentation
Computer hardware presentationComputer hardware presentation
Computer hardware presentationJisu Dasgupta
 
Types and components of computer system
Types and components of computer systemTypes and components of computer system
Types and components of computer systemmkhisalg
 
It infrastructure hardware and software
It infrastructure hardware and softwareIt infrastructure hardware and software
It infrastructure hardware and softwareProf. Othman Alsalloum
 
Computer Basics 101 Slide Show Presentation
Computer Basics 101 Slide Show PresentationComputer Basics 101 Slide Show Presentation
Computer Basics 101 Slide Show Presentationsluget
 

Viewers also liked (10)

Who are the INTERNET SERVICE PROVIDERS?
Who are the INTERNET SERVICE PROVIDERS?Who are the INTERNET SERVICE PROVIDERS?
Who are the INTERNET SERVICE PROVIDERS?
 
Productivity software presentation
Productivity software presentationProductivity software presentation
Productivity software presentation
 
MIS - IT Infrastructure (Part I)
MIS  - IT Infrastructure (Part I)MIS  - IT Infrastructure (Part I)
MIS - IT Infrastructure (Part I)
 
Hardware & Software
Hardware & SoftwareHardware & Software
Hardware & Software
 
Web Services PHP Tutorial
Web Services PHP TutorialWeb Services PHP Tutorial
Web Services PHP Tutorial
 
Hardware and software ppt
Hardware and software pptHardware and software ppt
Hardware and software ppt
 
Computer hardware presentation
Computer hardware presentationComputer hardware presentation
Computer hardware presentation
 
Types and components of computer system
Types and components of computer systemTypes and components of computer system
Types and components of computer system
 
It infrastructure hardware and software
It infrastructure hardware and softwareIt infrastructure hardware and software
It infrastructure hardware and software
 
Computer Basics 101 Slide Show Presentation
Computer Basics 101 Slide Show PresentationComputer Basics 101 Slide Show Presentation
Computer Basics 101 Slide Show Presentation
 

Similar to Tolerating Hardware Device Failures in Software

Fine-grained fault tolerance using device checkpoints
Fine-grained fault tolerance using device checkpointsFine-grained fault tolerance using device checkpoints
Fine-grained fault tolerance using device checkpointsasimkadav
 
Getting started with RISC-V verification what's next after compliance testing
Getting started with RISC-V verification what's next after compliance testingGetting started with RISC-V verification what's next after compliance testing
Getting started with RISC-V verification what's next after compliance testingRISC-V International
 
Talos: Neutralizing Vulnerabilities with Security Workarounds for Rapid Respo...
Talos: Neutralizing Vulnerabilities with Security Workarounds for Rapid Respo...Talos: Neutralizing Vulnerabilities with Security Workarounds for Rapid Respo...
Talos: Neutralizing Vulnerabilities with Security Workarounds for Rapid Respo...Zhen Huang
 
HKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightHKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightLinaro
 
"Automated Malware Analysis" de Gabriel Negreira Barbosa, Malware Research an...
"Automated Malware Analysis" de Gabriel Negreira Barbosa, Malware Research an..."Automated Malware Analysis" de Gabriel Negreira Barbosa, Malware Research an...
"Automated Malware Analysis" de Gabriel Negreira Barbosa, Malware Research an...SegInfo
 
Performance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTracePerformance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTraceGraeme Jenkinson
 
DEF CON 27 - ALI ISLAM and DAN REGALADO WEAPONIZING HYPERVISORS
DEF CON 27 - ALI ISLAM and DAN REGALADO WEAPONIZING HYPERVISORSDEF CON 27 - ALI ISLAM and DAN REGALADO WEAPONIZING HYPERVISORS
DEF CON 27 - ALI ISLAM and DAN REGALADO WEAPONIZING HYPERVISORSFelipe Prado
 
Uvm presentation dac2011_final
Uvm presentation dac2011_finalUvm presentation dac2011_final
Uvm presentation dac2011_finalsean chen
 
Software reliability tools and common software errors
Software reliability tools and common software errorsSoftware reliability tools and common software errors
Software reliability tools and common software errorsHimanshu
 
How Automated Vulnerability Analysis Discovered Hundreds of Android 0-days
How Automated Vulnerability Analysis Discovered Hundreds of Android 0-daysHow Automated Vulnerability Analysis Discovered Hundreds of Android 0-days
How Automated Vulnerability Analysis Discovered Hundreds of Android 0-daysPriyanka Aash
 
Production profiling what, why and how technical audience (3)
Production profiling  what, why and how   technical audience (3)Production profiling  what, why and how   technical audience (3)
Production profiling what, why and how technical audience (3)RichardWarburton
 
Dolphin: Regression Test System for Latitude
Dolphin: Regression Test System for LatitudeDolphin: Regression Test System for Latitude
Dolphin: Regression Test System for LatitudeTao Jiang
 
SECURITY SOFTWARE RESOLIUTIONS (SSR) .docx
SECURITY SOFTWARE RESOLIUTIONS (SSR)                              .docxSECURITY SOFTWARE RESOLIUTIONS (SSR)                              .docx
SECURITY SOFTWARE RESOLIUTIONS (SSR) .docxbagotjesusa
 
Evergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival SkillsEvergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival SkillsEvergreen ILS
 
Develop Your Own Operating Systems using Cheap ARM Boards
Develop Your Own Operating Systems using Cheap ARM BoardsDevelop Your Own Operating Systems using Cheap ARM Boards
Develop Your Own Operating Systems using Cheap ARM BoardsNational Cheng Kung University
 

Similar to Tolerating Hardware Device Failures in Software (20)

Fine-grained fault tolerance using device checkpoints
Fine-grained fault tolerance using device checkpointsFine-grained fault tolerance using device checkpoints
Fine-grained fault tolerance using device checkpoints
 
Faults inside System Software
Faults inside System SoftwareFaults inside System Software
Faults inside System Software
 
Getting started with RISC-V verification what's next after compliance testing
Getting started with RISC-V verification what's next after compliance testingGetting started with RISC-V verification what's next after compliance testing
Getting started with RISC-V verification what's next after compliance testing
 
Talos: Neutralizing Vulnerabilities with Security Workarounds for Rapid Respo...
Talos: Neutralizing Vulnerabilities with Security Workarounds for Rapid Respo...Talos: Neutralizing Vulnerabilities with Security Workarounds for Rapid Respo...
Talos: Neutralizing Vulnerabilities with Security Workarounds for Rapid Respo...
 
HKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightHKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with Coresight
 
"Automated Malware Analysis" de Gabriel Negreira Barbosa, Malware Research an...
"Automated Malware Analysis" de Gabriel Negreira Barbosa, Malware Research an..."Automated Malware Analysis" de Gabriel Negreira Barbosa, Malware Research an...
"Automated Malware Analysis" de Gabriel Negreira Barbosa, Malware Research an...
 
Performance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTracePerformance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTrace
 
DEF CON 27 - ALI ISLAM and DAN REGALADO WEAPONIZING HYPERVISORS
DEF CON 27 - ALI ISLAM and DAN REGALADO WEAPONIZING HYPERVISORSDEF CON 27 - ALI ISLAM and DAN REGALADO WEAPONIZING HYPERVISORS
DEF CON 27 - ALI ISLAM and DAN REGALADO WEAPONIZING HYPERVISORS
 
Uvm presentation dac2011_final
Uvm presentation dac2011_finalUvm presentation dac2011_final
Uvm presentation dac2011_final
 
Techno-Fest-15nov16
Techno-Fest-15nov16Techno-Fest-15nov16
Techno-Fest-15nov16
 
Cuda debugger
Cuda debuggerCuda debugger
Cuda debugger
 
Software reliability tools and common software errors
Software reliability tools and common software errorsSoftware reliability tools and common software errors
Software reliability tools and common software errors
 
Choosing the right processor
Choosing the right processorChoosing the right processor
Choosing the right processor
 
Blackfin Device Drivers
Blackfin Device DriversBlackfin Device Drivers
Blackfin Device Drivers
 
How Automated Vulnerability Analysis Discovered Hundreds of Android 0-days
How Automated Vulnerability Analysis Discovered Hundreds of Android 0-daysHow Automated Vulnerability Analysis Discovered Hundreds of Android 0-days
How Automated Vulnerability Analysis Discovered Hundreds of Android 0-days
 
Production profiling what, why and how technical audience (3)
Production profiling  what, why and how   technical audience (3)Production profiling  what, why and how   technical audience (3)
Production profiling what, why and how technical audience (3)
 
Dolphin: Regression Test System for Latitude
Dolphin: Regression Test System for LatitudeDolphin: Regression Test System for Latitude
Dolphin: Regression Test System for Latitude
 
SECURITY SOFTWARE RESOLIUTIONS (SSR) .docx
SECURITY SOFTWARE RESOLIUTIONS (SSR)                              .docxSECURITY SOFTWARE RESOLIUTIONS (SSR)                              .docx
SECURITY SOFTWARE RESOLIUTIONS (SSR) .docx
 
Evergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival SkillsEvergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival Skills
 
Develop Your Own Operating Systems using Cheap ARM Boards
Develop Your Own Operating Systems using Cheap ARM BoardsDevelop Your Own Operating Systems using Cheap ARM Boards
Develop Your Own Operating Systems using Cheap ARM Boards
 

Recently uploaded

Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxjana861314
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 

Recently uploaded (20)

Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 

Tolerating Hardware Device Failures in Software

  • 1. Tolerating Hardware Device Failures in Software Asim Kadav, Matthew J. Renzelmann, Michael M. Swift University ofWisconsin-Madison
  • 2. Current state of OS-hardware interaction • Many device drivers assume device perfection » Common Linux network driver: 3c59x .c 4/14/2014 Tolerating Hardware Device Failures in Software While (ioread16(ioaddr + Wn7_MasterStatus)) & 0x8000) ; Hardware dependence bug: Device malfunction can crash the system HANG!
  • 3. void hptitop_iop_request_callback(...) { arg= readl(...); ... if (readl(&req->result) == IOP_SUCCESS) { arg->result = HPT_IOCTL_OK; } } Current state of OS-hardware interaction • Hardware dependence bugs across driver classes 4/14/2014 Tolerating Hardware Device Failures in Software *Code simplified for presentation purposes Highpoint SCSI driver(hptiop.c)
  • 4. How do the hardware bugs manifest? • Drivers often trust hardware to always work correctly » Drivers use device data in critical control and data paths » Drivers do not report device malfunctions to system log » Drivers do not detect or recover from device failures 4/14/2014 Tolerating Hardware Device Failures in Software
  • 5. An example:Windows servers • Transient hardware failures caused 8% of all crashes and 9% of all unplanned reboots[1] » Systems work fine after reboots » Vendors report returned device was faultless • Existing solution is hand-coded hardened driver: » Crashes reduced from 8% to 3% • Driver isolation systems not yet deployed 4/14/2014 Tolerating Hardware Device Failures in Software [1] Fault resilient drivers for Longhorn server, May 2004. Microsoft Corp.
  • 6. Carburizer • Goal:Tolerate hardware device failures in software through hardware failure detection and recovery • Static analysis tool - analyze and insert code to: » Detect and fix hardware dependence bugs » Detect and generate missing error reporting information • Runtime » Handle interrupt failures » Transparently recover from failures 4/14/2014 Tolerating Hardware Device Failures in Software
  • 7. Outline • Background • Hardening drivers • Reporting errors • Runtime fault tolerance • Cost of carburizing • Conclusion 4/14/2014 Tolerating Hardware Device Failures in Software
  • 8. Hardware unreliability • Sources of hardware misbehavior: » Device wear-out, insufficient burn-in » Bridging faults » Electromagnetic radiation » Firmware bugs • Result of misbehavior: » Corrupted/stuck-at inputs » Timing errors/unpredictable DMA » Interrupt storms/missing interrupts 4/14/2014 Tolerating Hardware Device Failures in Software
  • 9. Vendor recommendations for driver developers 4/14/2014 Tolerating Hardware Device Failures in Software Recommendation Summary Recommended by Intel Sun MS Linux Validation Input validation    Read once& CRC data    DMA protection   Timing Infinite polling    Stuck interrupt  Lost request  Avoid excess delay in OS  Unexpected events   Reporting Report all failures    Recovery Handle all failures   Cleanup correctly   Do not crash on failure    Wrap I/O memory access     Goal: Automatically implement as many recommendations as possible in commodity drivers
  • 10. Carburizer architecture 4/14/2014 Tolerating Hardware Device Failures in Software OS Kernel If (c==0) { . print (“Driver init”); } . . Driver Carburizer If (c==0) { . print (“Driver init”); } . . Compile-time components Run-time components Hardened Driver Binary Faulty Hardware Carburizer Runtime Kernel Interface Compiler
  • 11. Outline • Background • Hardening drivers » Finding sensitive code » Repairing code • Reporting errors • Runtime fault tolerance • Cost of carburizing • Conclusion 4/14/2014 Tolerating Hardware Device Failures in Software
  • 12. Hardening drivers • Goal: Remove hardware dependence bugs » Find driver code that uses data from device » Ensure driver performs validity checks • Carburizer detects and fixes hardware bugs from » Infinite polling » Unsafe static/dynamic array reference » Unsafe pointer dereferences » System panic calls 4/14/2014 Tolerating Hardware Device Failures in Software
  • 13. Hardening drivers • Finding sensitive code » First pass: Identify tainted variables 4/14/2014 Tolerating Hardware Device Failures in Software
  • 14. Finding sensitive code First pass: Identify tainted variables 4/14/2014 Tolerating Hardware Device Failures in Software int test () { a = readl(); b = inb(); c = b; d = c + 2; return d; } int set() { e = test(); } Tainted Variables a b c d test() e
  • 15. Detecting risky uses of tainted variables • Finding sensitive code » Second pass: Identify risky uses of tainted variables • Example: Infinite polling » Driver waiting for device to enter particular state » Solution: Detect loops where all terminating conditions depend on tainted variables 4/14/2014 Tolerating Hardware Device Failures in Software
  • 16. Example: Infinite polling Finding sensitive code 4/14/2014 Tolerating Hardware Device Failures in Software static int amd8111e_read_phy(………) { ... reg_val = readl(mmio + PHY_ACCESS); while (reg_val & PHY_CMD_ACTIVE) reg_val = readl(mmio + PHY_ACCESS) . } AMD 8111e network driver(amd8111e.c)
  • 17. Not all bugs are obvious 4/14/2014 Tolerating Hardware Device Failures in Software while (DAC960_PD_StatusAvailableP(ControllerBaseAddress)) { DAC960_V1_CommandIdentifier_T CommandIdentifier= DAC960_PD_ReadStatusCommandIdentifier (ControllerBaseAddress); DAC960_Command_T *Command = Controller ->Commands [CommandIdentifier-1]; DAC960_V1_CommandMailbox_T *CommandMailbox = &Command->V1.CommandMailbox; DAC960_V1_CommandOpcode_T CommandOpcode=CommandMailbox->Common.CommandOpcode; Command->V1.CommandStatus =DAC960_PD_ReadStatusRegister(ControllerBaseAddress); DAC960_PD_AcknowledgeInterrupt(ControllerBaseAddress); DAC960_PD_AcknowledgeStatus(ControllerBaseAddress); switch (CommandOpcode) { case DAC960_V1_Enquiry_Old: DAC960_P_To_PD_TranslateReadWriteCommand(CommandMailbox); … } DAC960 Raid Controller(DAC960.c)
  • 18. Detecting risky uses of tainted variables • Example II: Unsafe array accesses » Tainted variables used as array index into static or dynamic arrays » Tainted variables used as pointers 4/14/2014 Tolerating Hardware Device Failures in Software
  • 19. Example: Unsafe array accesses Unsafe array accesses 4/14/2014 Tolerating Hardware Device Failures in Software static void __init attach_pas_card(...) { if ((pas_model = pas_read(0xFF88))) { ... sprintf(temp, “%s rev %d”, pas_model_names[(int) pas_model], pas_read(0x2789)); ... } Pro Audio Sound driver (pas2_card.c)
  • 20. Analysis results over the Linux kernel • Analyzed drivers in 2.6.18.8 Linux kernel » 6300 driver source files » 2.8 million lines of code » 37 minutes to analyze and compile code • Additional analyses to detect existing validation code 4/14/2014 Tolerating Hardware Device Failures in Software
  • 21. Analysis results over the Linux kernel • Found 992 bugs in driver code • False positive rate: 7.4% (manual sampling of 190 bugs) 4/14/2014 Tolerating Hardware Device Failures in Software Driver class Infinite polling Static array Dynamic array Panic calls net 117 2 21 2 scsi 298 31 22 121 sound 64 1 0 2 video 174 0 22 22 other 381 9 57 32 Total 860 43 89 179 Many cases of poorly written drivers with hardware dependence bugs
  • 22. Repairing drivers • Hardware dependence bugs difficult to test • Carburizer automatically generates repair code » Inserts timeout code for infinite loops » Inserts checks for unsafe array/pointer references » Replaces calls to panic() with recovery service » Triggers generic recovery service on device failure 4/14/2014 Tolerating Hardware Device Failures in Software
  • 23. Carburizer automatically fixes infinite loops 4/14/2014 Tolerating Hardware Device Failures in Software timeout = rdstcll(start) + (cpu/khz/HZ)*2; reg_val = readl(mmio + PHY_ACCESS); while (reg_val & PHY_CMD_ACTIVE) { reg_val = readl(mmio + PHY_ACCESS); if (_cur < timeout) rdstcll(_cur); else __recover_driver(); } *Code simplified for presentation purposes Timeout code added AMD 8111e network driver(amd8111e.c)
  • 24. Carburizer automatically adds bounds checks 4/14/2014 Tolerating Hardware Device Failures in Software static void __init attach_pas_card(...) { if ((pas_model = pas_read(0xFF88))) { ... if ((pas_model< 0)) || (pas_model>= 5)) __recover_driver(); . sprintf(temp, “%s rev %d”, pas_model_names[(int) pas_model], pas_read(0x2789)); } *Code simplified for presentation purposes Array bounds check added Pro Audio Sound driver (pas2_card.c)
  • 25. Runtime fault recovery • Low cost transparent recovery » Based on shadow drivers » Records state of driver » Transparent restart and state replay on failure • Independent of any isolation mechanism (like Nooks) 4/14/2014 Tolerating Hardware Device Failures in Software Shadow Driver Device Driver Device Taps Driver-Kernel Interface
  • 26. Device/Driver Original Driver Carburizer Behavior Detection Behavior Detection Recovery 3COM 3C905 CRASH None RUNNING Yes Yes DEC DC 21x4x CRASH None RUNNING Yes Yes Experimental validation 4/14/2014 Tolerating Hardware Device Failures in Software • Synthetic fault injection on network drivers • Results Carburizer failure detection and transparent recovery work well for transient device failures
  • 27. Outline • Background • Hardening drivers • Reporting errors • Runtime fault tolerance • Cost of carburizing • Conclusion 4/14/2014 Tolerating Hardware Device Failures in Software
  • 28. Reporting errors • Drivers often fail silently and fail to report device errors » Drivers should proactively report device failures » Fault management systems require these inputs • Driver already detects failure but does not report them • Carburizer analysis performs two functions » Detect when there is a device failure » Report unless the driver is already reporting the failure 4/14/2014 Tolerating Hardware Device Failures in Software
  • 29. Detecting driver detected device failures • Detect code that depends on tainted variables » Perform unreported loop timeouts » Returns negative error constants » Jumps to common cleanup code 4/14/2014 Tolerating Hardware Device Failures in Software while (ioread16 (regA) == 0x0f) { if (timeout++ == 200) { sys_report(“Device timed out %s.n”, mod_name); return (-1); } } Reporting code added by Carburizer
  • 30. Detecting existing reporting code Carburizer detects function calls with string arguments 4/14/2014 Tolerating Hardware Device Failures in Software static u16 gm_phy_read(...) { ... if (__gm_phy_read(...)) printk(KERN_WARNING "%s: ...n”, ...); Carburizer detects existing reporting code SysKonnect network driver(skge.c)
  • 31. Evaluation • Manual analysis of drivers of different classes • No false positives • Fixed 1135 cases of unreported timeouts and 467 cases of unreported device failures in Linux drivers 4/14/2014 Tolerating Hardware Device Failures in Software Driver Class Driver detected device failures Carburizer reported failures bnx2 network 24 17 mptbase scsi 28 17 ens1371 sound 10 9 Carburizer automatically improves the fault diagnosis capabilities of the system
  • 32. Outline • Background • Hardening drivers • Reporting errors • Runtime fault tolerance • Cost of carburizing • Conclusion 4/14/2014 Tolerating Hardware Device Failures in Software
  • 33. Runtime failure detection • Static analysis cannot detect all device failures » Missing interrupts: expected but never arrives » Stuck interrupts (interrupts storm): interrupt cleared by driver but continues to be asserted 4/14/2014 Tolerating Hardware Device Failures in Software
  • 34. Tolerating missing interrupts 4/14/2014 Tolerating Hardware Device Failures in Software Driver Hardware Device Request Interrupt responses • Detect when to expect interrupts » Detect driver activity via referenced bits » Invoke ISR when bits referenced but no interrupt activity • Detect how often to poll » Dynamic polling based on previous invocation result
  • 35. Tolerating stuck interrupts • Driver interrupt handler is called too many times • Convert the device from interrupts to polling 4/14/2014 Tolerating Hardware Device Failures in Software DriverType Driver Name Throughput reduction due to polling Disk ide-core,ide-disk, ide-generic Reduced by 50% Network e1000 Reduced from 750 Mb/s to 130 Mb/s Sound ens1371 Sounds plays with distortion Carburizer ensures system and device make forward progress
  • 36. Outline • Background • Hardening drivers • Reporting errors • Runtime fault tolerance • Cost of carburizing • Conclusion 4/14/2014 Tolerating Hardware Device Failures in Software
  • 37. Throughput overhead 4/14/2014 Tolerating Hardware Device Failures in Software 940 721 935 720 0 200 400 600 800 1000 nVIDIA MCP 55 Intel Pro 1000 ThroughputinMbps NetworkCardType Linux Kernel Carburizer Kernel netperf on 2.2 GHz AMD machines
  • 38. CPU overhead 4/14/2014 Tolerating Hardware Device Failures in Software 31 16 36 16 31 16 0 5 10 15 20 25 30 35 40 nVIDIA MCP 55 Intel Pro 1000 CPUUtilization(%) Network CardType Linux Kernel Carburizer Kernel with recovery Carburizer Kernel w/o recovery Almost no overhead from hardened drivers and automatic recovery netperf on 2.2 GHz AMD machines
  • 39. Conclusion 4/14/2014 Tolerating Hardware Device Failures in Software Recommendation Summary Recommended by Intel Sun MS Linux Validation Input validation    Read once& CRC data    DMA protection   Timing Infinite polling    Stuck interrupt  Lost request  Avoid excess delay in OS  Unexpected events   Reporting Report all failures    Recovery Handle all failures   Cleanup correctly   Do not crash on failure    Wrap I/O memory access    
  • 40. Conclusion 4/14/2014 Tolerating Hardware Device Failures in Software Recommendation Summary Recommended by Carburizer EnsuresIntel Sun MS Linux Validation Input validation     Read once& CRC data    DMA protection   Timing Infinite polling     Stuck interrupt   Lost request   Avoid excess delay in OS  Unexpected events   Reporting Report all failures     Recovery Handle all failures    Cleanup correctly    Do not crash on failure     Wrap I/O memory access     Carburizer improves system reliability by automatically ensuring that hardware failures are tolerated in software
  • 41. ThankYou • Contact » kadav@cs.wisc.edu • Visit our website for research on drivers » http://cs.wisc.edu/~swift/drivers 4/14/2014 Tolerating Hardware Device Failures in Software OS Kernel If (c==0) { . print (“Driver init”); } . . Driver Carburizer If (c==0) { . print (“Driver init”); } . . Compile-time components Run-time components Hardened Driver Binary Faulty Hardware Carburizer Runtime Kernel Interface Compiler
  • 42. Backup slides 4/14/2014 Tolerating Hardware Device Failures in Software
  • 43. Improving analysis accuracy • Detect existing driver validation code » Track variable taint history » Detect existing timeout code » Detect existing sanity checks 4/14/2014 Tolerating Hardware Device Failures in Software while ((inb(nic_base + EN0_ISR) & ENISR_RDC) == 0) if (jiffies - dma_start> 2) { ... break; } ne2000 network driver (ne2k-pci.c)
  • 44. Trend of hardware dependence bugs • Many drivers either had one or two hardware bugs » Developers were mostly careful but forgot in a few places • Small number of drivers were badly written »Developers did not account H/W dependence; many bugs 4/14/2014 Tolerating Hardware Device Failures in Software
  • 45. Implementation efforts • Carburizer static analysis tool » 3230 LOC in OCaml • Carburizer runtime (Interrupt Monitoring) » 1030 lines in C • Carburizer runtime (Shadow drivers) »19000 LOC in C »~70% wrappers – can be automatically generated by scripts 4/14/2014 Tolerating Hardware Device Failures in Software
  • 46. Vendor recommendations for driver developers 4/14/2014 Tolerating Hardware Device Failures in Software Recommendation Summary Vendors Carburizer EnsuresIntel Sun MS Linux Validation Input validation     Read once& CRC data    DMA protection   Timing Infinite polling     Stuck interrupt   Lost request   Avoid excess delay in OS  Unexpected events   Reporting Report all failures     Recovery Handle all failures    Cleanup correctly    Do not crash on failure     Wrap I/O memory access    

Editor's Notes

  1. Thank you today I will be talking about Howto tolerate hardware device failures in software
  2. Operating systems communicate with hardware devices using drivers and many of these drivers assume device to always behave properly.For example If we look at a piece of code from a fairly common network driver, we can see that here that the driver code is reading a device register and waiting in an infinite while loop for it to be set to a particular value. If the device does not return the correct value, the system can hang indefinitely. HANGWe define such code instances as hardware dependence bugs where driver may crash the system when the device does not behave correctly
  3. This figure shows another example of an hardware dependence bug. The first line of the code reads in a pointer directly from the device. In the third line, marked as red, we can see that the pointer is dereferenced. If the read returns a corrupt value, the pointer de-reference can corrupt or crash the system. Such dangerous code is replete in the driver code bases of modern operating systems.
  4. And these hardware bugs, they manifest because driver code assumes that device will always work as per its specifications. More specifically,Drivers use device data to perform critical branching or while indexing kernel memory.Drivers also tend to ignore minor device failures which either result in significant failures later on or appear at a much higher layer in software stack can waste considerable debugging time. Also, since hardware is always assumed to work correctly and transient error free, often there are no provisions to detect or perform device recovery
  5. A study from windows servers at microsoft shows the scope of this problem. 8% of the system crashes were due to failure in storage or network adapters. However, system administrators and vendors found out that these errors are mostly transient. When these adapters are used with hardened drivers,i.e. drivers written correctly such that they perform sanity checks on all hardware inputs, the crashes reduce significantly by two thirds to 3%.This example tells us three things:1. Device failures is a major cause of system failures2. Transient device failures are common3. Drivers that tolerate device failures can improve reliabilityExisting solutions like safe drive and nooks either instrument driver code or execute it in isolated environment. Do not distinguish between s/w and h/w faults – limited deployment since heavy weight
  6. To solve these problems, we present carburizer =&gt; a light-weight system to detect hardware failure and perform recoveryConsists of two components =&gt; a static analysis componentIt also consists of a runtime component which can fix interrupt related failures and can also perform transparent online recovery.
  7. This is the outline of my talk
  8. Hardware unreliability stems from the fact that modern CMOS devices prone to hardware failures.Amongst common sources, errors occur due to device wear out and insufficient burn-in causing bit-flips, stuckats where single bit is flipped transiently or for extended times. There are also bridging faults where adjacent pair of bits are electrically mated producing an undesirable logical and or logical orComputers used in diverse environments so environmental conditions such as electro magnetic interference and radiations can cause hardware errors.Modern embedded firmware have significant logic and bugs in them can cause problems like out of resource/memory leaks etc(some devices have million lines of code in firmware)As a result, these hardware problems cause errors which result in transient or extended corruption in driver inputs from device registers. They can also cause timing errors, stuck interrupt lines or unpredictable DMA in the driver.
  9. To address these problems, OSvendors lay out certain recommendations for driver developers on how to write driver code.These can be classified in four broad classes.First is Validate: Drivers should validate all input coming from the hardwareSecond is Timing: Drivers should ensure all device operations occur in finite time and are synchronized with driver operationsThird is that drivers should appropriately report all device errors to the system administrator or system log.Finally, the vendors also lay out certain guidelines on how drivers should perform device recovery.The goal of our tool, carburizer is to automatically implement as many recommendations as possible.MOVE BOX DOWN. Make automatically box
  10. To reiterate our solution, we perform static analysis on driver code during the build process to produce a kernel binary which checks hardware inputs and contains error reporting information.The hardened kernel binary with the carburizer runtime ensures that driver can tolerate stuck interrupts or interrupt storms and also support online transparent recovery.The resultant system consists hardened driver will detect a failure before corruption happens and trigger recovery The result is much more improved reliability against hardware device failures
  11. I will now talk about how we perform driver hardening
  12. The goal of carburizer here is to take a regular commodity driver and perform static analyses to detect and fix hardware dependence bugs and produce a hardened binary driver. Hardening a driver means to ensure that the driver sanitizes all input from device before using it in critical data or control paths.If the driver does not perform this check, carburizer automatically inserts code that performs these checks and triggers a generic recovery service if the check fails.In terms of the types of checks performed, Carburizer is able to find places in the driver where the system waits infinitely for device to enter a particular state or uses device data to perform unsafe array or pointer arithmetic. It also fixes code that invokes system panic when the driver is in an unexpected state - which is a common bad programming practice.
  13. To find the sensitive code described in previous slide, carburizer implemented in CIL and performs these actions in two passes. The first pass is to identify tainted variables in the driver code. Tainted variables are those variables which contain data directly or indirectly originating from the devices and need to be checked. They are similar to the definition of tainted variables in security because here too their use is risky.
  14. I will show this with an example how we identify tainted variables.Carburizer consults a table of functions known to perform I/O such as readl for memory mapped I/O or inb for port I/O. Initially, carburizer marks all heap and stack variables that receive results from these functions as tainted. Carburizer then propagates taint to variables that are computed from or are aliased to the tainted variables. We alsopropogate taint through return values.The output of the first pass is a table containing all variables and functions that are tainted.
  15. In the second pass, we use this table to find the risky uses of these variables.As an example, one of the analyses in second pass consists of detecting control paths that wait forever for a particular input from the device. For each loop, carburizer computes the set of conditions that cause the loop to terminate through while clauses or conditional break, return or gotostatements. If all conditions are hardware dependent then the loop may iterate infinitely when the device misbehaves.
  16. Here, we show a simple case detected by the tool, where the AMD network driver can hang the system if the device does not set the registerto a particular state.While this is a simple example, our analysis can detect complex examples that are not obvious by eyeballing and may not be caught during manual code inspection. Examples of such code include loops with case statements or loops with nested function call
  17. Here, we see a complex example where device read is called indirectly from these loops. It’s a really long loop with a switch statement thrown in at the end and its not very obvious looking at the code that it’s a hardware dependence bug. Carburizer can propogate taint via return values and can detect such bugs.
  18. In the second pass, we also implement analyses to detect if any tainted variable or combination of tainted variable is used to perform unsafe array or pointer references.When range of a device variable falls outside array bounds, or when dereferencing a random memory location, an incorrect reference can lead to reading an unmapped address or corrupt adjacent data structures.
  19. We can see here the driver reads in a variable from the device and uses it as an array index without any bounds check on the variable. This can crash the system if the memory refers to an unmapped address in the kernel. We now see the results
  20. We ran carburizer over the Linux 2.6.18 kernel driver tree. It analyzed and buil\tover 6300 driver source files consisting of over 2.8 million lines of code in 37 minutes. We try to detect hardware dependence bugs in as few code passes as possible.We also detect existing validation code like timeout code in infinite loops, array bounds check and not-NULL checks. This is to improve the accuracy of our results. The details of these are in the paper.
  21. This table shows the results of our analysis on Linux kernel across different driver classes. Overall, we are able to find approximately 992 bugs at the false positive rate of 7.4%. While we implement analyses to detect false positives, some false positives remain because some drivers performed validation differently. For e.g. some drivers perform timeout validation in a different function called from within the loop. There may also be more bugs in the drivers because carburizer tracks taint by function return variables but not via procedure arguments.The results suggest that modern drivers do a poor job of avoiding hardware bugs and are susceptible to crashes in the face of transient hardware failures. Carburizer can detect these hardware dependence bugs for the developer.Finding bugs is valuable but reliability does not increase until the bug is fixed.TELL STORY OF HOW WE DETECT FALSE POSITIVESWRITE MANY CASES instead of MANY DRIVERS
  22. This is important because the hardware dependance bugs occur infrequently and cannot be detected via regular stress testing. Hence, carburizer also fixes the detected bugs automatically.To fix these bugs, for example, the infinite loops, we insert time out code long enough for any device to respond. For unsafe array references, when array bounds can be determined statically, carburizer computes them and inserts a suitable bounds check. For dynamic arrays bounds and pointers valid values are not known but carburizer inserts a not-NULL check. When any of these checks fail, carburizer invokes a generic recovery service
  23. This is an example of a driver fixed by Carburizer. Carburizer detects the infinite loop and triggers the recovery functions after timeout of two clock ticks. The code inserted by carburzier can never harm the system because we wait for two clock ticks long enough for any device to respond.How we picked two clock ticks???
  24. This is an another example of a driver fixed by Carburizer. Carburizer detects a risky use of array index and adds a bounds check after determining the size of the array statically. The code inserted by carburzier can never harm the system because even if there is a false positive, the only extra overhead is of an additional bounds check.
  25. We now look at the recovery service used by carburzier. We leverage past work on shadow drivers to perform recovery as they conceal failures from applications and OS. A shadow driver is a driver independent kernel agent that monitors and stores the state of the driver. Incase of a failure, it can cleanup the failed driver and recover it to its old working state transparently.Unlike previous work, our implementation of shadow drivers does not depend on any isolation mechanism like nooks and avoids any performance penality that it could impose. We do not need an isolation mechanism because failures are detected before any corruption happens.
  26. To empirically validate carburizer, we performed fault injection on two different network drivers. It is difficult to simulate device failures for many bugs because it is difficult to get hold of different devices with specific chipsets which have hardware dependence bugs in their drivers in them as detected by carburizer.With the devices with driver bugs at our disposal, we injected hardware faults with a simple fault injection tool that modifies the return values of the read and in functions transiently in the register of interest.NOW BRING TABLE With the original driver, the system completely crashed due to fault injection. With hardened driver, In each test, the driver quickly detected the failure and triggered the recovery service. After a short delay, as the driver recovered, it returned to normal function transparently without interfering the applications. SUMMARY STATEMENT: THEREFORE IMPROVES RELIABILITY
  27. 16:25
  28. On many occasions drivers fail silently and do not report device errors. But OS and hardware vendors recommend monitoring hardware failures to allow proactive device repair or replacement. For example, the Solaris Fault Management architecture feeds error reported by device drivers into a diagnosis engine and recommends a high level action such as replacing or disabling a device.The goal of carburizer here is to detect conditions in the driver where there is a device failure. And to add a reporting statement unless driver is already reporting these failures
  29. Make point driver detected device fialresTo insert reporting statements at places where the driver already detects device failures – Carburizer implements analyses to detect code paths which timeout from a loop due to a device error or return a negative error constant due to a device error. Carburizer also inserts error reporting information where driver jumps to common cleanup code due to a device error.For example – in the code snippet shown here, the driver correctly detects a timeout loops but does not report it. Carburizer correctly inserts a reporting statement.
  30. Carburizer also detects existing reporting information implemented in the original driver by looking for calls to functions that contain string messages as arguments. As we can see here in this example…
  31. To evaluate our error reporting techniques, we resorted to manual analysis of the code generated by carburizer and compared the actual device errors vs reported errors.While there are no false positives, carburizer is able to detect most errors and miss few ones. These false negatives exist due to the fact that drivers detect device errors but ignore them and do not return negative error constants or flag them in any fashion. Also, in few cases error values are returned by out parameters which the analysis of carburizer does not track.Overall, carb. detected and fixed a total of 1600 unreported device failures. Many of these are low level device errors which if not reported will make it difficult for the administrator to nail down the problem in the system. Carburizer improves fault diagnosis by promptly detecting these failures.
  32. Carburizer runtime detects failures that cannot be detected through static modifications of the driver code. These are missing and stuck interrupts. Missing interrupts can result in an inoperable device while a stuck interrupt can result in a hung system.
  33. To understand the problem of missing interrupts, consider a request that arrives from the OS to the driver. The driver code executes and calls device functions. The device performs I/O and interrupts the driver to continue or complete the I/O. If this interrupt is not generated, the I/o does not complete and the device becomes inoperable. To detect and solve this problem, we need to detect when the device should generate interrupts and is not doing so. To determine this, we monitor the driver activity by reference bits. If the driver code is executed and these bits are set, then in a particular amount of time the device should generate interrupts elsethe device may be missing interrupts.In this case carburizer itself invokes the driver interrupt function. Carb also checks the return values of this invocation to reduce false invocations and increase useful work done by carburizer, if the invocation actually handled a missing interrupt, then carb increases the invocation frequency else exponentially reduces it. This dynamic polling works well to reduce frequency of spurious interrupt invocation and increase the frequency of useful interrupt invocation.
  34. The carburizer runtime also detects stuck interrupts which will cause the device to never lower the interrupt line and consequently hang the system. This is detected when a driver’s interrupt handler has been called too many times. Carburizer recovers here by disabling the interrupt and calls the driver’s polling function.BRING TABLE NOW – SAY SOMETHING ABT RESULTS..pick disk as e.g.ENTERObviously, polling reduces the device throughput as shown in the figure for different device classes but it is better than a completely hung system due to a stuck interrupt line.Why polling and not shadow recoverySome result indicating that it works
  35. To determine the performance overhead of carburizer, we perform netperf throughput tests on regular and carburized versions of network drivers.Network is most sensitive to performance and these tests can help us know the cost of carburizing.On the y-axisBRING DATA NOWThis graph shows the TCP send throughput for a native kernel and another with a carburizer runtime with shadow driver recovery enabled and a carburized network driver. The network throughput with Carburizer is within one-half percent of native performance.
  36. This figure shows..The CPU utilization increases 5% points for one particular driver. The extra CPU comes from using shadow driver monitoring. If we just use the carburizer to perform static bug fixing, then there is no runtime cost associated with using carb. There is almost no performance overhead and very little CPU overhead from hardened drivers and automatic recovery. Extra CPU comes from shadow drivers to monitor. If we just fix bugs, then no cost
  37. To summarize, vendors recommend driver developers to consider the significant hardware unreliability in practice. However, incorporating a feature that does not affect functionality or is captured by stress testing is easy to miss.
  38. We presented Carburizer – a low overhead reliability solution which implements the recommendations which do not require any semantic information about the device or any hardware support. To conclude, carburizer improves system reliability by automatically implementing vendor recommendations and tolerating hardware device failures in software.
  39. Make this as backup slideSome existing driver code already follow certain vendor guidelines. To make sure false positives do not occur due to existing validation code we augment our analysis to track additional taint information and validity checks. We track variable taint history and ensure that the variable is read into by a device function and not reset before any risky use.We also detect validating infinite loops that already have a counter variable or kernel jiffies clock to enforce a timeout. As we can see in this example, the driver already ensures the loop times out and as a result carburizer does not detect this as a hardware bug.We also check for sanity checks on array references in the form of mask, shift operations before array memory reference and not null checks before pointer dereference.Since we track taint history if a device I/O happens in these variables after this check, then the variable is still considered risky. We also report multiple risky uses of a single variable only once.
  40. Make this as backup slideSome existing driver code already follow certain vendor guidelines. To make sure false positives do not occur due to existing validation code we augment our analysis to track additional taint information and validity checks. We track variable taint history and ensure that the variable is read into by a device function and not reset before any risky use.We also detect validating infinite loops that already have a counter variable or kernel jiffies clock to enforce a timeout. As we can see in this example, the driver already ensures the loop times out and as a result carburizer does not detect this as a hardware bug.We also check for sanity checks on array references in the form of mask, shift operations before array memory reference and not null checks before pointer dereference.Since we track taint history if a device I/O happens in these variables after this check, then the variable is still considered risky. We also report multiple risky uses of a single variable only once.
  41. Make this as backup slideSome existing driver code already follow certain vendor guidelines. To make sure false positives do not occur due to existing validation code we augment our analysis to track additional taint information and validity checks. We track variable taint history and ensure that the variable is read into by a device function and not reset before any risky use.We also detect validating infinite loops that already have a counter variable or kernel jiffies clock to enforce a timeout. As we can see in this example, the driver already ensures the loop times out and as a result carburizer does not detect this as a hardware bug.We also check for sanity checks on array references in the form of mask, shift operations before array memory reference and not null checks before pointer dereference.Since we track taint history if a device I/O happens in these variables after this check, then the variable is still considered risky. We also report multiple risky uses of a single variable only once.