Fault Injection for
Software Certification
Roberto Natella
Software risks in critical systems
Many industries are facing legal trouble because they are liable for accidents caused by computer faults.
The Toyota “unintended acceleration” case is a relevant example of an accident caused by poor software quality and a lack of fault tolerance.
Fault Injection Testing
Fault injection is the process of deliberately introducing faults (from software and hardware components) to validate the fault-tolerance properties of a system.
Fault injection in the DO-178B/C safety standards
The standard recommends robustness test cases “... [able to] demonstrate the ability of the software to respond to abnormal inputs and conditions. Activities include:
○ Real and integer variables should be exercised using equivalence class selection of invalid values.
○ For time-related functions, such as filters, integrators and delays, test cases should be developed for arithmetic overflow protection mechanisms.
○ For state transitions, test cases should be developed to provoke transitions that are not allowed by the software requirements.”
○ ...
* RTCA DO-178B, Software considerations in airborne systems and equipment certification, Sec. 6.4.2.2
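The first activity (equivalence class selection of invalid values) can be illustrated with a minimal sketch. It assumes an integer parameter whose valid range is [lo, hi]; the function name and the choice of representatives are illustrative and are not taken from the standard.

```c
#include <limits.h>
#include <stddef.h>

/* Fills out[] with representatives of the invalid equivalence classes of an
 * integer parameter whose valid range is [lo, hi] (an assumed, hypothetical
 * specification). Returns the number of values written (at most 4). */
size_t invalid_class_values(int lo, int hi, int out[4])
{
    size_t n = 0;

    if (lo > INT_MIN) {
        out[n++] = lo - 1;            /* just below the valid range */
        if (lo - 1 != INT_MIN)
            out[n++] = INT_MIN;       /* extreme low value */
    }
    if (hi < INT_MAX) {
        out[n++] = hi + 1;            /* just above the valid range */
        if (hi + 1 != INT_MAX)
            out[n++] = INT_MAX;       /* extreme high value */
    }
    return n;
}
```

Each returned value would become one robustness test input for the variable under test.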
Fault injection in the ISO 26262 safety standard
Fault Injection in the NASA Software Safety Standards
● The NASA Software Safety Guidebook recommends fault injection for OTS (off-the-shelf) software components
○ Software fault injection (SFI) is a technique used to determine the robustness of the software, and can be used to understand the behavior of OTS software. It injects faults into the software and looks at the results (Did the fault propagate? Was the end result an undesirable outcome?). Basically, the intent is to determine if the software responds gracefully to the injected faults.
Case study: FIN.X-RTOS
● FIN.X-RTOS is a real-time operating system from Leonardo/Finmeccanica, based on open-source software
● Objective of the project: to develop a Linux distribution compliant with the DO-178B recommendations
● Built upon a network of excellence between industries and universities
FIN.X-RTOS overview
● Industrial product management; fully customizable
● Support for hard real-time on multi-core CPUs
● Guaranteed scalability (from embedded devices to high-performance systems, such as workstations and servers)
● No dependence on a commercial product or vendor
● Enhanced IDE for software development
● No export license restriction
● Full control of all source packages and build process (based on Gentoo Linux, a Linux meta-distribution)
Real-time features of FIN.X-RTOS
Certification process of FIN.X-RTOS
[Figure: open-source Linux kernel, FIN.X-RTOS, RTCA/DO-178B Level D]
● The DO-178B recommendations allow the reuse of “previously-developed software”, provided that safety evidence is produced from alternative sources such as additional testing and reverse engineering
● The functional requirements of the kernel were studied, documented, and tested (complying with Level D of DO-178B)
Fault Injection in FIN.X-RTOS
Faults from user-space software (API misuse injection)
Faults from device drivers (code mutation)
Faults from kernel APIs (API error injection)
Fault injection on kernel APIs
● There are many potential cases of kernel API failures:
○ Resource exhaustion (e.g., allocation of I/O regions, pages, slabs, ...)
○ Hardware I/O errors
○ Resource busy (e.g., mutexes, pinned pages)
● Kernel API callers must check and handle errors
A fault injector in the Linux kernel
● The Linux kernel already includes a fault injector that forces erroneous return codes (to simulate failed memory allocations, I/O errors, ...)

void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
    void *objp;

    /* Injection hook: report a failed allocation when the injector says so */
    if (should_failslab(cachep, flags))
        return NULL;
    ...
    return objp;
}

The fault injector is programmed from user space. Examples:
• Fail with X% probability
• Fail one in every X calls to the API
• Fail after X seconds
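How such a policy decides whether a call fails can be sketched in user space. This is a simplified model, not the kernel's implementation: the structure, its field names, and the small pseudo-random generator are assumptions made for illustration, and only the first two examples (probability and one-in-every-X calls) are covered.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an injection policy (illustrative names only). */
struct inject_policy {
    unsigned probability;   /* chance of failure per eligible call, in percent */
    unsigned interval;      /* only every interval-th call is eligible; 0 or 1 = all */
    unsigned long calls;    /* number of calls seen so far */
    uint32_t rng_state;     /* state of a small deterministic generator */
};

/* Linear congruential generator, so that runs are repeatable. */
static uint32_t next_random(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Returns true when the instrumented call should report a failure. */
bool should_fail(struct inject_policy *p)
{
    p->calls++;
    if (p->interval > 1 && p->calls % p->interval != 0)
        return false;                       /* this call is not eligible */
    return next_random(&p->rng_state) % 100 < p->probability;
}
```

With probability set to 100 and interval set to 3, the sketch fails exactly every third call; with probability 0 it never fails.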
Limitations of random fault injection
● Faults are injected with a “blind” (black-box) approach, with random timing
● However, this approach neglects the internal state of the system
○ Many tests are redundant: they are performed on the same state
○ Many important states may be missed by the tests
● The manual definition of test scenarios is not a feasible solution
○ It requires too much effort for a large system, and may still be inaccurate
The SABRINE approach
● Basic idea:
○ the internal state of an OS component (such as the FS) is given by the history of its interactions
○ we profile the history of interactions, and extract behavioral models of the OS component under test
○ based on the behavioral model, we perform distinct fault injections at each state, to efficiently cover different states of the target
[Figure: example interactions ext3_dirty_inode, journal_dirty_metadata, kmem_cache_alloc]
Approach overview
[Figure: the operating system, with OS components 1 to N and the target OS component behind the OS interface; user apps reach the OS through system calls, the hardware through interrupt requests]
Phase 1: monitoring
Phase 2: model learning
Phase 3: model-based testing
Pattern identification
● We get an execution log of the target OS component by running a workload and recording the function calls (interactions) made by the component
● The execution log is divided into sequences (i.e., subsets of interactions that happen during the same system call, interrupt request, or kernel task execution)
● Unique repeated sequences are grouped (patterns)
● Patterns that are similar (even if not identical) are further grouped into clusters
Log fields: INT. TYPE, OPERATION ID, SEQ. ID, TRACE ID, CALLED FUNCTION, CALL POINT
...
OUT, pdflush, 428, 1, ll_rw_block, flush_commit_list:1f3eb
INJ, pdflush, 428, 1, kmem_cache_alloc, flush_commit_list:1f3eb
INJ, pdflush, 428, 1, kmem_cache_alloc, flush_commit_list:1f3eb
IN, close, 491, 1, reiserfs_file_release, __fput:c018efda
INJ, pdflush, 428, 1, generic_make_request, flush_commit_list:1f3eb
OUT, pdflush, 428, 1, __find_get_block, flush_commit_list:1f3cc
...
IN, close, 503, 1, reiserfs_file_release, __fput:c018efda
...
[Figure labels: Seq. A, Seq. B, and Seq. C mark the sequences in the log excerpt]
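Splitting the execution log into sequences can be sketched as follows. The field order (interaction type, operation, sequence ID, trace ID, called function, call point) is an assumption inferred from the excerpt, and the structure and function names are hypothetical.

```c
#include <stdio.h>

/* One log record; the field order is assumed from the excerpt. */
struct interaction {
    char type[8];          /* IN, OUT or INJ */
    char operation[32];    /* e.g. pdflush, close */
    int  sequence_id;      /* e.g. 428 */
    int  trace_id;         /* e.g. 1 */
    char function[64];     /* called function */
    char call_point[64];   /* caller and offset */
};

/* Parses one comma-separated record. Returns 1 on success, 0 otherwise. */
int parse_interaction(const char *line, struct interaction *out)
{
    int n = sscanf(line, "%7[^,], %31[^,], %d, %d, %63[^,], %63s",
                   out->type, out->operation, &out->sequence_id,
                   &out->trace_id, out->function, out->call_point);
    return n == 6;
}

/* Two records belong to the same sequence when they share the sequence ID. */
int same_sequence(const struct interaction *a, const struct interaction *b)
{
    return a->sequence_id == b->sequence_id;
}
```

Grouping records by sequence ID keeps the pdflush interactions (428) together even though a record from a close sequence (491) is interleaved with them in the log.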
Clustering
Sequence 1: ext3_dirty_inode, journal_start, kmem_cache_alloc, __getblk, journal_get_write_access, __alloc_pages, kmem_cache_alloc, journal_dirty_metadata, kmem_cache_alloc, __brelse, journal_stop
Sequence 2: ext3_dirty_inode, journal_start, kmem_cache_alloc, __getblk, journal_get_write_access, __alloc_pages, kmem_cache_alloc, journal_stop
Sequence 3: ll_rw_block, kmem_cache_alloc, kmem_cache_alloc, generic_make_request, __find_get_block
[Figure: clustering groups the sequences, and a finite state machine is built for each group; two finite state machines are shown]
Clustering algorithm
1. For each pair of patterns, we compute a similarity score (Smith-Waterman algorithm)
○ It first searches for the best alignment between two patterns
○ The score is higher when there are many matching symbols and few gaps/mismatches
2. Similar patterns are grouped (spectral clustering)
○ Patterns are the nodes of a weighted graph, and the similarity score is the weight of the edge between two nodes
○ By cutting “weak” edges, the graph is split into partitions that are “strongly connected” (i.e., very similar patterns)
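Step 1 can be sketched with a textbook Smith-Waterman local alignment over sequences of integer symbols, where each integer stands for a called function. The scoring values (+2 match, -1 mismatch, -1 gap) and the length bound are assumptions; the values used in the original work are not given here.

```c
#include <stddef.h>

#define MAX_LEN 64   /* illustrative upper bound on pattern length */

enum { MATCH = 2, MISMATCH = -1, GAP = -1 };   /* assumed scoring scheme */

static int max2(int a, int b) { return a > b ? a : b; }

/* Returns the best local-alignment score between patterns a and b,
 * or -1 if either pattern is longer than MAX_LEN. */
int sw_score(const int *a, size_t la, const int *b, size_t lb)
{
    int h[MAX_LEN + 1][MAX_LEN + 1];
    int best = 0;

    if (la > MAX_LEN || lb > MAX_LEN)
        return -1;
    for (size_t i = 0; i <= la; i++)
        h[i][0] = 0;
    for (size_t j = 0; j <= lb; j++)
        h[0][j] = 0;

    for (size_t i = 1; i <= la; i++) {
        for (size_t j = 1; j <= lb; j++) {
            int diag = h[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? MATCH : MISMATCH);
            int up   = h[i - 1][j] + GAP;     /* gap in b */
            int left = h[i][j - 1] + GAP;     /* gap in a */
            h[i][j] = max2(0, max2(diag, max2(up, left)));
            best = max2(best, h[i][j]);
        }
    }
    return best;
}
```

Under this scoring, two patterns that share a long common run of calls score much higher than two patterns that only have an allocator call in common, which is what the subsequent clustering step relies on.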
Examples of clusters
Cluster (EXT3) | Behavior | Context | # patterns
1 | gets and sets the file metadata | stat syscall | 6
2 | retrieves and stores in memory the file index block, or updates it on the disk | open, unlink syscalls | 5
3 | copies file contents from disk to a cache, and modifies it | write syscall | 8
4 | modifies the contents of a file already in the disk cache | write syscall | 8
5 | copies a large amount of data from a file to a socket | sendfile syscall | 12
6 | copies a small amount of data from a file to a socket | sendfile syscall | 10
7 | flushes a small amount of data from the cache to the disk | pdflush kernel task | 19
8 | flushes a large amount of data from the cache to the disk | pdflush kernel task | 6
9 | updates file metadata to reflect that it has been memory-mapped | mmap2 syscall | 5

 | EXT3 | ReiserFS | SCSI
# interactions | 34,784 | 97,341 | 27,311
# sequences | 432 | 239 | 1,307
# (distinct) sequences | 79 | 57 | 10
# clusters | 9 | 6 | 2
# test cases | 49 | 28 | 10
Behavioral modeling example
[Figure: finite state automaton with states 0 to 11; transitions are labeled with the called functions (ext3_dirty_inode, journal_start, kmem_cache_alloc, __getblk, journal_get_write_access, alloc_pages, journal_dirty_metadata, __brelse, journal_stop); robustness test cases #1, #2, and #3 are marked on the automaton]
1. A partial state automaton is derived from the first pattern in the cluster
2. The automaton is extended with the second pattern (partially overlapping with the first pattern)
3. A robustness test case is generated for each injectable interaction in the automaton
Behavioral modeling
1. For each cluster, we obtain a behavioral model (kBehavior algorithm)
○ A Finite State Automaton (FSA) is incrementally extended with new transitions and states
○ Transitions represent interactions of the patterns
2. A robustness test case is generated for each injectable interaction included in the FSA
○ This allows injections to be performed in different contexts
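The two steps can be illustrated with a deliberately simplified model builder. It is not kBehavior: it merges patterns only on their common prefix, which is an assumption made to keep the sketch short. All names are hypothetical, and function calls are encoded as integers.

```c
#include <stddef.h>

#define MAX_TRANSITIONS 128   /* illustrative capacity */

struct transition { int from; int symbol; int to; };

struct fsa {
    struct transition t[MAX_TRANSITIONS];
    size_t n_transitions;
    int n_states;             /* state 0 is the initial state */
};

void fsa_init(struct fsa *m)
{
    m->n_transitions = 0;
    m->n_states = 1;
}

/* Returns the target of transition (from, symbol), or -1 if it is absent. */
static int fsa_next(const struct fsa *m, int from, int symbol)
{
    for (size_t i = 0; i < m->n_transitions; i++)
        if (m->t[i].from == from && m->t[i].symbol == symbol)
            return m->t[i].to;
    return -1;
}

/* Extends the automaton with one pattern: existing transitions are reused,
 * new states are created where the pattern diverges.
 * Returns 0 on success, -1 if the transition table is full. */
int fsa_add_pattern(struct fsa *m, const int *pattern, size_t len)
{
    int state = 0;
    for (size_t i = 0; i < len; i++) {
        int next = fsa_next(m, state, pattern[i]);
        if (next < 0) {
            if (m->n_transitions == MAX_TRANSITIONS)
                return -1;
            next = m->n_states++;
            m->t[m->n_transitions++] =
                (struct transition){ state, pattern[i], next };
        }
        state = next;
    }
    return 0;
}

/* One test case per transition labeled with the injectable function. */
size_t fsa_count_tests(const struct fsa *m, int injectable)
{
    size_t count = 0;
    for (size_t i = 0; i < m->n_transitions; i++)
        if (m->t[i].symbol == injectable)
            count++;
    return count;
}
```

Because the same injectable function can label several transitions, each occurrence yields its own test case, so the same API failure is exercised in different contexts.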
Test execution
● The interactions of the component under test are initially profiled using kernel debugging tools (SystemTap)
● From the traces, we automatically generate a kernel injection module that keeps track of the OS state automaton at run time
● During robustness tests, the system is executed again with the same workload
● When the injector notices that an injectable function is invoked at a given state, it forces an erroneous return code from that function call
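The run-time behavior of the injector can be sketched as a state tracker with one target per test case. This is a user-space illustration under assumed names, not the generated kernel module; in particular, ignoring calls that are not in the model is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

struct edge { int from; int function; int to; };

/* Hypothetical injector state for one robustness test case. */
struct injector {
    const struct edge *edges;   /* behavioral model learned beforehand */
    size_t n_edges;
    int state;                  /* current state of the model */
    int target_state;           /* inject when the model is in this state ... */
    int target_function;        /* ... and this function is invoked */
    bool fired;                 /* the fault is injected only once */
};

/* Called on every monitored function call. Returns true when the call must
 * return an error code; otherwise it advances the tracked state. */
bool injector_on_call(struct injector *inj, int function)
{
    if (!inj->fired && inj->state == inj->target_state &&
        function == inj->target_function) {
        inj->fired = true;
        return true;
    }
    for (size_t i = 0; i < inj->n_edges; i++) {
        if (inj->edges[i].from == inj->state &&
            inj->edges[i].function == function) {
            inj->state = inj->edges[i].to;
            break;
        }
    }
    return false;   /* calls outside the model leave the state unchanged */
}
```

Because the decision depends on the tracked state, two invocations of the same function can be told apart, and the fault hits the same point of the workload on every run.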
Robustness vulnerabilities
We found two robustness vulnerabilities that affected the EXT3 and ReiserFS filesystems (radix_tree_node_alloc and __get_blk)
STACK FRAME AT THE TIME OF INJECTION:
0 kmem_cache_alloc (function call to the memory allocator: fault injection!)
1 radix_tree_node_alloc
2 radix_tree_insert
3 add_to_page_cache
4 add_to_page_cache_lru
5 mpage_readpages
6 ext3_readpages (function call to the EXT3 filesystem)
7 __do_page_cache_readahead
8 ondemand_readahead
9 page_cache_async_readahead
10 generic_file_splice_read
11 do_splice_to
12 splice_direct_to_actor
13 do_splice_direct
14 do_sendfile
15 sys_sendfile64 (first called function, a system call)
16 sysenter_past_esp
Outcome: kernel crash!
Efficiency and reproducibility
● With random injection, thousands of tests are needed to hit the two robustness vulnerabilities
● With model-based injection, the same vulnerabilities can be found efficiently (only 77 tests are needed), and tests are highly reproducible
[Chart: vulnerable functions, Random vs. SABRINE]
File system | Vulnerable function | Random | SABRINE
EXT3 | __get_blk | 29.0% | 68.8%
EXT3 | radix_tree_node_alloc | 3.8% | 77.7%
ReiserFS | __get_blk | 0.2% | 100.0%
ReiserFS | radix_tree_node_alloc | 9.4% | 100.0%
Fault injection in device drivers
● Drivers come from third-party developers
● They are defect-prone (due to concurrency and hardware dependencies)
● If drivers fail, the OS should avoid an escalation (stalls, data corruptions, …)
Overview of Fault Injection in Device Drivers
[Figure: a safety-critical system made of applications, the FIN.X-RTOS kernel, and device drivers]
1. A fault is injected into the driver’s code
2. The device driver is in an error state
3. The error state propagates to the kernel
The code mutation approach
complex_routine(...) {
    ...
    ...
    if ((GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906) &&
        (off >= NIC_SRAM_STATS_BLK) &&
        (off < NIC_SRAM_TX_BUFFER_DESC))
    {
        *val = 0;
        return;
    }
    ...
    ...
}
Code mutation mimics software faults by making small “faulty” changes to the target code, to emulate programmers’ omissions and mistakes
Examples of faults for the code above:
• Missing variable initialization in a complex IF construct
• Missing logical clause among several
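The “missing logical clause” fault can be shown on a small self-contained function. The constants and function names below are hypothetical stand-ins, not the driver's real definitions; the mutant simply drops one clause of the condition.

```c
#include <stdbool.h>

/* Hypothetical bounds of a memory window (illustrative values only). */
#define STATS_BLK_START 0x300u
#define TX_DESC_START   0x1000u

/* Original check: true only for offsets inside [STATS_BLK_START, TX_DESC_START). */
bool in_stats_block(unsigned off)
{
    return off >= STATS_BLK_START && off < TX_DESC_START;
}

/* Mutant with a missing logical clause: the upper-bound check is omitted,
 * as if the programmer had forgotten it. */
bool in_stats_block_mutant(unsigned off)
{
    return off >= STATS_BLK_START;
}
```

The two versions agree for most offsets and differ only above the upper bound, which is why such faults tend to escape ordinary functional tests.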
SAFE: SoftwAre Fault Emulator
● Automatic generation and execution of fault injection tests
● Seamless integration into the project under development
● Supports the injection of an extensive and realistic fault suite
Automating fault injection using the SAFE tool
[Figure: source code analysis of the target component, together with a fault library, drives code mutation; each mutated target component (for example, “c=1;” changed to “c=2;” inside “if(a && b) { ... }”) is run within the software under test (APP, LIB, MW, OS, DD), producing logs and then results]
1. The target component is replaced with a faulty version
2. The software is exercised under a real or simulated environment
3. These steps are iterated several times (one iteration per faulty version)
4. Dependability measures are computed from raw data
Examples of robustness issues in FIN.X-RTOS
● Fault injection found several robustness issues in how the OS handles faulty drivers
○ Not detectable through traditional testing techniques
[Figure: with fault injection, the Ethernet device driver writes uninitialized data into the FIN.X-RTOS kernel; this leads to the crash of an OS thread that manages periodic events and to a stall of OS services]
Feedback to developers
● The use of corrupted memory caused an illegal memory access exception
● When the sirq-timer kernel thread is killed, timer functions in the kernel can no longer be executed
● To avoid this situation, the kernel’s exception handler should be modified to restart a kernel thread when an exception occurs instead of terminating it
● In this way, the kernel could preserve the execution of other timer functions when a timer function fails due to a faulty driver
Fault injection on user-space interfaces
● The current trend of application complexity increases the opportunities for bad data values to circulate within a system
● The POSIX OS system calls must gracefully deal with such exceptional conditions
○ Which COTS POSIX OS is the most robust?
○ Are errors detected and handled? How?
Robustness testing of POSIX OSs
Based on the IEEE 1003.1b standard, a tester process generates faulty inputs for the OS, using a set of pre-defined data types.
In the ideal case (the test oracle), the OS should return an error code from the syscall to the tester process; crashes, stalls, and wrong error codes are robustness failures.
The outcome of tests is classified according to the C.R.A.S.H. failure scale.
Failure type | Description
Catastrophic | The OS state becomes corrupted; the machine crashes and reboots
Restart | The OS never returns from a system call; the calling process is stalled and needs to be restarted
Abort | The OS terminates the caller process in an abnormal way
Silent | The OS system call does not return an error code
Hindering | The OS system call returns a misleading error code
System calls are invoked with combinations of both valid and invalid parameters (invalid memory addresses, non-existing paths, ...)
write(int filedes, const void * buffer, size_t nbytes)
File descriptor: FD_CLOSED, FD_OPEN_READ, FD_OPEN_WRITE, FD_NOEXIST, ...
Memory buffer: BUF_SMALL_1, BUF_MED_PAGESIZE, BUF_LARGE_512MB, BUF_HUGE_2GB, ...
Size: SIZE_1, SIZE_16, SIZE_PAGE, SIZE_PAGEx16plus1, ...
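A single robustness test on write() can be sketched as follows, assuming a POSIX system. The helper names are hypothetical, only the file descriptor dimension is exercised, and only the outcomes observable from inside the tester process are classified; Catastrophic, Restart, and Abort failures would need an external monitor.

```c
#include <errno.h>
#include <unistd.h>

/* Outcomes that the tester process itself can observe. */
enum outcome { PASS, SILENT, HINDERING };

/* Calls write() on fd with a valid one-byte buffer and compares the result
 * with the oracle: an error return with the expected errno value. */
enum outcome classify_write(int fd, int expected_errno)
{
    char byte = 'x';

    errno = 0;
    ssize_t ret = write(fd, &byte, 1);
    if (ret >= 0)
        return SILENT;                 /* no error reported */
    return errno == expected_errno ? PASS : HINDERING;
}

/* Test value in the spirit of FD_CLOSED: a descriptor that was valid, then closed. */
int make_closed_fd(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);
    close(fds[1]);
    return fds[1];
}

/* Test value in the spirit of FD_OPEN_READ: the read end of a pipe. */
int make_readonly_fd(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    close(fds[1]);
    return fds[0];
}
```

For both test values, POSIX specifies EBADF as the error for a descriptor that is not open for writing, so a robust OS yields PASS for classify_write(fd, EBADF).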
All OSs studied failed to provide correct error handling in a substantial portion of tests. Indeed, the POSIX standard does not require comprehensive exception reporting, but it seems likely that a growing number of applications will need it.
● Invalid inputs that were often associated with a robustness failure:
○ 94.0% of invalid file pointers (excluding NULL)
○ 82.5% of NULL file pointers
○ 49.8% of invalid buffer pointers (excluding NULL)
○ 46.0% of NULL buffer pointers
○ 44.3% of MININT integer values
○ 36.3% of MAXINT integer values
Adding “state” to robustness testing
● The test plan includes an additional state variable S={s1,s2,…,sn}, which reflects the allocated OS resources and processes at the time of the test
● A State Setter program is used in conjunction with the robustness test driver
CUT State Model
● The states to be tested are defined using a state model of the Component Under Test (CUT)
● The state model should:
○ be easy to set and control by the tester
○ represent the state at a level of abstraction high enough to keep the number of test cases reasonably small
○ include those configurations that are the most influential on the component behavior
Modeling the File System state
FileSystem attributes | Type | Notes
Partition type | {Primary, Logical} | The partition on which the FS is installed
Partition size | Byte | Total partition space
Partition allocated | Byte | Used partition space
FS implementation | {ext2, ext3, NTFS} | The filesystem module to be loaded
Number of Files | Integer | Must be preallocated before the tests
Number of Directories | Integer | Must be preallocated before the tests
FS layout | {Balanced, Unbalanced} | Randomly generated (the probability that a new directory d_{k+1} is appended to d_j is (inversely) proportional to depth(d_j))
○ Attributes have been defined to cover a large set of test scenarios while keeping the number of test cases low
◌ Numeric attributes: a subset of values is considered (e.g., free space = {low, medium, high})
◌ Categorical attributes (e.g., partition type = {primary, logical})
◌ Random attributes: values are defined in terms of random distributions (e.g., Balanced and Unbalanced are two distributions for FS layout)
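The FS layout rule can be made concrete with a small sketch that turns directory depths into attachment probabilities. It assumes depths start at 1 for the root, so that the inverse is always defined. Which of the two weightings corresponds to Balanced and which to Unbalanced is not stated here, so the flag only selects between the two readings of “(inversely) proportional”; the function name is hypothetical.

```c
#include <stddef.h>

/* Writes into prob[j] the probability that a new directory is appended to
 * existing directory j. Assumes depth[j] >= 1 for every j.
 * inverse != 0: weight 1/depth(d_j); inverse == 0: weight depth(d_j). */
void attach_probabilities(const unsigned *depth, size_t n,
                          int inverse, double *prob)
{
    double total = 0.0;

    for (size_t j = 0; j < n; j++) {
        prob[j] = inverse ? 1.0 / depth[j] : (double)depth[j];
        total += prob[j];
    }
    for (size_t j = 0; j < n; j++)
        prob[j] /= total;              /* normalize so that the sum is 1 */
}
```

With the inverse weighting, shallow directories attract most new directories; with the direct weighting, deep directories do, so the tree tends to grow in depth.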
OperationalProfile attributes | Type | Notes
Number of tasks performing I/O | Integer | Tasks performing random I/O
Average number of ops./s | Integer | I/O syscalls performed by the tasks
Ratio between read/write ops. | Float | Types of I/O syscalls performed by the tasks
● OperationalProfile defines the number and type of I/O operations running in background
○ Processes are instantiated in order to exercise FS-internal resources (e.g., buffers, caches, locks)
● Attributes of an Item are selected according to random (user-defined) distributions
○ E.g., the name length and size of a file assume a value within a range, selected using a uniform distribution
Examples of failure distribution
Stateful tests caused restart failures that did not happen in the stateless tests
Function | # Tests | Stateless Rob. Testing: # Restart | Stateless Rob. Testing: # Abort | Stateful Rob. Testing: # Restart | Stateful Rob. Testing: # Abort
access() | 3,986 | 0 | 4 | 1 | 4
dup2() | 3,954 | 0 | 0 | 1 | 0
lseek() | 3,977 | 0 | 0 | 0 | 0
mkfifo() | 3,870 | 0 | 5 | 1 | 5
mmap() | 4,003 | 0 | 0 | 0 | 0
open() | 3,988 | 0 | 8 | 40 | 8
read() | 3,924 | 0 | 253 | 1 | 253
unlink() | 500 | 0 | 1 | 0 | 1
write() | 3,989 | 0 | 68 | 4 | 68
Total | 32,191 | 0 | 339 | 48 | 339
Increased coverage for “corner cases”
static struct dentry *real_lookup(...) { // fs/namei.c:478
    /* --- OMISSIS (DECLARATIONS) --- */
    mutex_lock(&dir->i_mutex);
    result = d_lookup(parent, name);    /* the cache lookup */
    if (!result) {
        /* --- OMISSIS (PERFORMS LOOK-UP) --- */
        mutex_unlock(&dir->i_mutex);
        return result;
    }
    /*
     * Uhhuh! Nasty case: the cache was re-populated while
     * we waited on the semaphore. Need to revalidate.
     */
    mutex_unlock(&dir->i_mutex);
    /* code not covered before */
    if (result->d_op && result->d_op->d_revalidate) {
        result = do_revalidate(result, nd);
        if (!result)
            result = ERR_PTR(-ENOENT);
    }
    return result;
}
int try_to_free_buffers(struct page *page) { // fs/buffer.c:3057
    /* --- OMISSIS (declarations) --- */
    BUG_ON(!PageLocked(page));
    if (PageWriteback(page))
        return 0;
    if (mapping == NULL) { /* can this still happen? */
        ret = drop_buffers(page, &buffers_to_free);
        goto out;
    }
    /* --- OMISSIS (page writeback and deallocation) --- */
}
• I/O buffers used by a transaction are marked by “mapping == NULL”
• If free memory is low, the page cache management looks for pages that can be freed (checks with drop_buffers())
• It is a rare condition that happens under stress
Statement coverage
Statement coverage improvement ranged between 0.49% and 15.11% (especially for journal- and driver-related code)
Source file | Stateless Rob. Test. | Stress testing | Stateful Rob. Test.
fs/binfmt_elf.c | 319/850 (37.53%) | 331/850 (38.94%) | 332/850 (39.06%)
fs/buffer.c | 529/1320 (40.08%) | 553/1320 (41.89%) | 565/1320 (42.80%)
fs/dcache.c | 371/880 (42.16%) | 341/880 (38.75%) | 387/880 (43.98%)
fs/exec.c | 479/807 (59.36%) | 392/807 (48.57%) | 486/807 (60.22%)
fs/fs-writeback.c | 146/273 (53.48%) | 169/273 (61.90%) | 174/273 (63.74%)
fs/inode.c | 252/527 (47.82%) | 307/527 (58.25%) | 316/527 (59.96%)
fs/namei.c | 918/1392 (65.95%) | 626/1392 (44.97%) | 925/1392 (66.45%)
fs/select.c | 237/402 (58.96%) | 237/402 (58.96%) | 239/402 (59.45%)
fs/ext3/balloc.c | 384/556 (69.06%) | 385/556 (69.24%) | 398/556 (71.58%)
fs/ext3/dir.c | 140/219 (63.93%) | 143/219 (65.30%) | 144/219 (65.75%)
fs/ext3/ialloc.c | 181/337 (53.71%) | 186/337 (55.19%) | 189/337 (56.08%)
fs/ext3/inode.c | 719/1204 (59.72%) | 729/1204 (60.55%) | 737/1204 (61.21%)
fs/ext3/namei.c | 607/1088 (55.79%) | 654/1088 (60.11%) | 781/1088 (71.78%)
fs/jbd/checkpoint.c | 102/263 (38.78%) | 141/263 (53.61%) | 142/263 (53.99%)
fs/jbd/commit.c | 300/362 (82.87%) | 302/362 (83.43%) | 318/362 (87.85%)
fs/jbd/revoke.c | 108/228 (47.37%) | 105/228 (46.05%) | 116/228 (50.87%)
fs/jbd/transaction.c | 489/697 (70.16%) | 500/697 (71.74%) | 545/697 (78.19%)
Conclusion
● Residual faults are hidden in our software, and they will eventually manifest themselves during operation
● Software Fault Injection is a means to assess and mitigate their impact before releasing the product
● It is a reasonably mature technology that can now be adopted in complex software systems
Research publications
● Assessing Dependability with Software Fault Injection: A Survey
  R. Natella, D. Cotroneo, H. Madeira
  ACM Computing Surveys (CSUR), Vol. 48, No. 3, pp. 44:1-44:55, 2016
● Fault Injection for Software Certification
  D. Cotroneo, R. Natella
  IEEE Security & Privacy, Vol. 11, No. 4, pp. 38-45, July/August 2013
● On Fault Representativeness of Software Fault Injection
  R. Natella, D. Cotroneo, J. Duraes, H. Madeira
  IEEE Transactions on Software Engineering (TSE), Vol. 39, No. 1, pp. 80-96, 2013
● SABRINE: StAte-Based Robustness testIng of operatiNg systEms
  D. Cotroneo, D. Di Leo, F. Fucci, R. Natella
  Proc. 28th IEEE/ACM International Conference on Automated Software Engineering (ASE 2013)
● A Case Study on State-Based Robustness Testing of an Operating System for the Avionic Domain
  D. Cotroneo, D. Di Leo, R. Natella, R. Pietrantuono
  Proc. of the 30th International Conference on Computer Safety, Reliability and Security (SAFECOMP 2011)
