Lecture 3: Random Testing
1. Today
  - Random testing
    - Start off with a practical look, and some useful ideas to get you started on the project: random testing for file systems
    - Then take a deeper look at the notion of feedback and why it is useful: a method for testing OO systems from ICSE a couple of years ago
      - Then back out to take a look at the general idea of random testing, if time permits
2. A Little Background
  - Random testing
    - Generate program inputs at random
    - Drawn from some (possibly changing) probability distribution
    - "Throw darts at the state space, without drawing a bullseye"
    - May generate the same test (or equivalent tests) many times
    - Will perform operations no sane human would ever perform
3. A Somewhat Random Tester (Last Week)

    #include <stdio.h>
    #include <assert.h>

    #define N 5   /* 5 is "big enough"? */
    int testFind () {
      int a[N];
      int p, i;
      for (p = 0; p < N; p++) {
        random_assign(a, N);
        a[p] = 3;                        /* plant the last occurrence at position p */
        for (i = p + 1; i < N; i++) {    /* scrub any later 3s so p stays the last one */
          if (a[i] == 3)
            a[i] = a[i] - 1;
        }
        printf ("TEST: findLast({");
        print_array(a, N);
        printf ("}, %d, 3)", N);
        assert (findLast(a, N, 3) == p);
      }
      return 0;
    }
4. A Considerably More Random Tester

    #include <stdio.h>
    #include <assert.h>

    #define N 50   /* 50 is "big enough"? */
    int testFind () {
      int a[N];
      int p, x, n, i, j;
      for (i = 0; i < NUM_TESTS; i++) {
        pick(n, 0, N);
        pick(x, -2^31, 2^31);    /* i.e., the full int range (slide notation, not C's ^) */
        pick(p, -1, n - 1);
        random_assign(a, n);
        if (p != -1) {
          a[p] = x;              /* plant the last occurrence at position p (or nowhere) */
        }
        for (j = p + 1; j < n; j++) {   /* scrub any later occurrences of x */
          if (a[j] == x)
            a[j] = a[j] - 1;
        }
        printf ("TEST: findLast({");
        print_array(a, n);
        printf ("}, %d, %d) with item at %d", n, x, p);
        assert (findLast(a, n, x) == p);
      }
      return 0;
    }
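The two testers above rely on helpers the slides leave undefined (pick and random_assign). Below is a minimal sketch of plausible versions, offered only as an assumption about what they might look like; pick is written as a macro because the slides use it to assign into a variable.

    #include <stdlib.h>

    /* Hypothetical helper: assign v a roughly uniform value in [lo, hi].
       rand() has limited range, so very wide intervals (such as the full int
       range on the previous slide) are only approximately covered. */
    #define pick(v, lo, hi) \
      ((v) = (int)((lo) + (long long)(((double)rand() / ((double)RAND_MAX + 1.0)) * \
                                      ((double)(hi) - (double)(lo) + 1.0))))

    /* Hypothetical helper: fill the first n slots of a with random ints. */
    void random_assign(int *a, int n) {
      for (int i = 0; i < n; i++)
        a[i] = rand();
    }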
5. Fuzz Testing
  - One night (it was a dark and stormy night) in 1990, Bart Miller (U Wisc.) was logged in over dialup
    - There was a lot of line noise due to the storm
    - His shell and editors kept crashing
    - This gave him an idea…
6. Fuzz Testing
  - Bart Miller et al., "An Empirical Study of the Reliability of UNIX Utilities"
    - Idea: feed "fuzz" (streams of pure randomness, noise from /dev/urandom pretty much) to OS & utility code
      - Watch it break!
      - In 1990, could crash 25-33% of utilities
      - Reports every few years since then
      - Some of the bugs are the same ones in common security exploits (particularly buffer overruns)
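A minimal sketch in the spirit of that study (not a reproduction of Miller's tool): pipe random bytes into a target utility's stdin and flag any run that dies on a signal. The target path, trial count, and byte count are illustrative assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    #define TARGET     "./utility_under_test"   /* hypothetical target program */
    #define FUZZ_BYTES 4096

    int main(void) {
      srand(12345);                              /* fixed seed: failures are reproducible */
      for (int trial = 0; trial < 1000; trial++) {
        FILE *p = popen(TARGET " > /dev/null 2>&1", "w");
        if (!p) { perror("popen"); return 1; }
        for (int i = 0; i < FUZZ_BYTES; i++)
          fputc(rand() & 0xff, p);               /* pure noise, like line static */
        int status = pclose(p);
        if (WIFSIGNALED(status))                 /* a crash (e.g., SIGSEGV) is a find */
          printf("trial %d: target killed by signal %d\n", trial, WTERMSIG(status));
      }
      return 0;
    }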
7. Random Testing for Good & Evil
  - Fuzzers
    - Tools that send malformed/random input to a program and hope to crash it or find a security hole
    - Firefox is internally using random testing to find (security) problems
      - One developer I know says they aren't publishing much because it would be too useful to the bad guys
        - Fuzzing is useful for finding bugs to protect programs ("white hat" work)
        - But also for finding bugs to hack into systems ("black hat")!
8. The Problem at JPL
  - Testing is the net that JPL uses to catch software errors before they show up in mission operation
    - Last line of defense – if a bug gets through it can mean mission failure
  - Traditional software testing nets have big holes
9. The Problem at JPL
  - Most mission testing is integration testing of nominal scenarios:
    - Very thorough checks that when expected things happen, other expected things happen – including fault protection (expected unexpected things)
    - Unfortunately when the unexpected unexpected happens…
10. The Problem at JPL
  - Nominal (or stress) integration testing relies on expensive and slow radiation-hardened flight hardware
    - Lots of competition for limited computational resources
    - Computationally infeasible to make use of statistical approaches, such as random testing
11. Building Better Nets
  - Thorough file system testing is a pilot effort to improve software testing at JPL
    - Reduce bugs found at the final system I&T level – or in operation – by more effective early use of computational power on core modules of flight software
    - Exploit models and reference implementations to reduce developer & tester effort
12. Flash File System Testing
  - We (LaRS) are developing a file system for mission use (NVFS)
    - A key JPL mission component
    - Problems with previous file systems used in missions (MER flash anomaly, others I can't tell you about here)
    - If bugs in our code show up in flight, JPL loses, science loses, etc.
  - High reliability is critical:
    - Must preserve integrity of data
      - in presence of arbitrary system resets
      - in presence of hardware failures
  - How do we thoroughly test such a module?
13. Quick Primer: NAND Flash
  - Before we continue, a bit more detail on the system under test
  - Flash memory is a set of blocks
    - A block is a set of pages
    - A page can be written once; read many times
    - Page must be erased before it can be re-written
    - Erase unit is a full block of pages
  (Diagram: a block of pages goes through successive page writes, each new write obsoleting old data, until a full block erase; pages are shown as used, free, or "dirty".)
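For concreteness, here is a minimal sketch of a simulated NAND layer that enforces the rules above (write-once pages, block-granularity erase, bad blocks). The sizes, names, and error convention are illustrative assumptions, not the actual NVFS simulation layer.

    #include <string.h>

    #define NUM_BLOCKS      8
    #define PAGES_PER_BLOCK 4
    #define PAGE_BYTES      256

    typedef enum { PAGE_FREE, PAGE_USED, PAGE_DIRTY } page_state;

    typedef struct {
      unsigned char data[PAGE_BYTES];
      page_state    state;
    } page_t;

    typedef struct {
      page_t pages[PAGES_PER_BLOCK];
      int    bad;                       /* simulated bad block */
    } block_t;

    static block_t flash[NUM_BLOCKS];   /* zero-initialized: all pages free, no bad blocks */

    /* Write a page: only legal if the page is currently free (write-once rule). */
    int sim_page_write(int b, int p, const unsigned char *buf) {
      if (flash[b].bad || flash[b].pages[p].state != PAGE_FREE)
        return -1;
      memcpy(flash[b].pages[p].data, buf, PAGE_BYTES);
      flash[b].pages[p].state = PAGE_USED;
      return 0;
    }

    /* Erase is only available at block granularity. */
    int sim_block_erase(int b) {
      if (flash[b].bad) return -1;
      for (int p = 0; p < PAGES_PER_BLOCK; p++) {
        memset(flash[b].pages[p].data, 0xff, PAGE_BYTES);
        flash[b].pages[p].state = PAGE_FREE;
      }
      return 0;
    }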
14. The Goals
  - Randomize early testing (since it is not possible to be exhaustive)
    - We don't know where the bugs are
  (Diagram contrasting the reach of nominal scenario tests with randomized testing.)
15. Random testing
  - Simulated flash hardware layer allows random fault injection
  - Most development/early testing can be done on workstations
  - Lots of available compute power – can cover many system behaviors
  - Will stress software in ways nominal testing will not
16. The Goals
  - Automate early testing
    - Run tests all the time, in the background, while continuing development efforts
  - Automate test evaluation
    - Using reference systems for fault detection and diagnosis
    - Automated test minimization techniques to speed debugging and increase regression test effectiveness
  - Automate fault injection
    - Simulate hardware failures in a controlled test environment
17. The Goals
  - Make use of desktop hardware for early testing – vs. expensive (sloooow) flight hardware testbeds
    - Many faults can be exposed without full bit-level hardware simulation
18. Traditional Testing
  - Limited, fixed, unit tests by developers
  - Nominal scenarios on hardware testbeds
    - Small number of scenarios, due to limited resources
    - Test engineers inspect results manually
    - Limited fault injection capability (reset means manually hitting the "red button")
  (Diagram: a test engineer and a day of testing.)
19. Random Testing
  - Millions of operations and scenarios, automatically generated
  - Run on fast & inexpensive workstations
  - Results checked automatically by a reference oracle
  - Hardware simulation for fault injection and reset simulation
  (Diagram: a day (& night) of testing, multiplied by 100,000s of runs.)
20. Differential Testing
  - How can we tell if a test succeeds?
    - POSIX standard for file system operations
      - IEEE-produced, ANSI/ISO-recognized standard for file systems
      - Defines operations and what they should do/return, including nominal and fault behavior

    POSIX operation                    Result
    mkdir ("/eng", …)                  SUCCESS
    mkdir ("/data", …)                 SUCCESS
    creat ("/data/image01", …)         SUCCESS
    creat ("/eng/fsw/code", …)         ENOENT
    mkdir ("/data/telemetry", …)       SUCCESS
    unlink ("/data/image01")           SUCCESS

  (Diagram: the resulting file system tree – /, /eng, /data, /data/image01, /data/telemetry.)
21. Differential Testing
  - How can we tell if a test succeeds?
    - The POSIX standard specifies (mostly) what correct behavior is
    - We have heavily tested implementations of the POSIX standard in every flavor of UNIX, readily available to us
    - We can use UNIX file systems (ext3fs, tmpfs, etc.) as reference systems to verify the correct behavior of flash
    - First differential approach (published) was McKeeman's testing for compilers
22. Random Differential Testing
  (Flow, with optional fault injection at each step:)
  1. Choose a (POSIX) operation F
  2. Perform F on NVFS
  3. Perform F on the reference (if applicable)
  4. Compare return values; compare error codes; compare file systems; check invariants
23. Testing a File System
  - Use simulation layer to imitate flash hardware, operating at RAM-disk speed
    - I.e., much faster than the real flight hardware
    - Making large-scale random testing possible
  - Simulation layer provides same interface as the real hardware driver
  - Simulation layer provides ability to inject faults: bad blocks, system resets, read failures
24. Random Differential Testing
  - Choose file system operations randomly
    - Include standard POSIX calls + other operations (mount, unmount, format)
    - Bias choice by a (coarse) model of file system contents, but allow failing operations
      - Akin to randomized testing with feedback (Pacheco et al., ICSE 07)
  - Perform on both systems:

    fs_fd  = nvfs_creat ("/dp/images/img019", ctime);
    ref_fd = creat ("/dp/images/img019", …);

  - Compare return values; compare error codes; compare file systems; check invariants
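A minimal sketch of what one iteration of this comparison loop might look like. The nvfs_* entry points and the two compare/check helpers are hypothetical stand-ins for the real tester's interfaces, declared here only so the sketch is self-contained; in the real tester the reference file system lives under its own root directory.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>

    /* Hypothetical NVFS entry points and test helpers (declarations only). */
    extern int  nvfs_creat(const char *path, time_t ctime);
    extern int  nvfs_errno(void);
    extern void compare_file_systems(void);
    extern void check_invariants(void);

    /* One step: perform the same creat on NVFS and on the reference, then compare. */
    void differential_creat_step(const char *path) {
      int fs_fd   = nvfs_creat(path, time(NULL));   /* system under test */
      int fs_err  = nvfs_errno();
      int ref_fd  = creat(path, 0644);              /* reference (e.g., ext3fs/tmpfs) */
      int ref_err = errno;

      /* Compare return values: both should succeed or both should fail. */
      if ((fs_fd < 0) != (ref_fd < 0))
        printf("MISMATCH: return values differ for creat(%s)\n", path);

      /* Compare error codes when both fail (POSIX allows some latitude here). */
      if (fs_fd < 0 && ref_fd < 0 && fs_err != ref_err)
        printf("MISMATCH: error codes %d vs %d for %s\n", fs_err, ref_err, path);

      /* Then walk both trees and check NVFS's internal invariants. */
      compare_file_systems();
      check_invariants();
    }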
25. Feedback: How to Pick a Path Name
  - Full random generator: picks a path of length up to n, from fixed components, e.g.:
    - /alpha/beta/beta/gamma/alpha
  - History-based generator: picks a random path from a list of all paths that have ever been created
    - With some probability of adding an extra random component
26. Feedback: How to Pick a Path Name
  (Flowchart:)
  - With probability P(full random): pick a length n, then append n random components.
  - Otherwise: pick a path from the history, with some probability appending one extra random component.
  - Return the chosen path.
  - Tune P(full random) to balance the chance of useful operations with the ability to catch unlikely faults.
  - Note that no operation should ever succeed on a path that can't be produced from history plus one extra component.
  (Example history: /, /alpha, /alpha/beta, /gamma, /delta/delta, /gamma/beta, …, /delta/delta/alpha2, /beta, /beta/alpha)
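A minimal sketch of this feedback-based path picker. P_FULL_RANDOM, the component alphabet, and the flat history array are illustrative assumptions, not the actual tester's data structures.

    #include <stdlib.h>
    #include <string.h>

    #define MAX_PATH      256
    #define MAX_HIST      1024
    #define P_FULL_RANDOM 0.2            /* tunable, as the slide notes */

    static const char *components[] = { "alpha", "beta", "gamma", "delta" };
    static char history[MAX_HIST][MAX_PATH];   /* every path ever created */
    static int  hist_count = 1;                /* history[0] == "" stands for "/" */

    static void append_random_component(char *path) {
      strncat(path, "/", MAX_PATH - strlen(path) - 1);
      strncat(path, components[rand() % 4], MAX_PATH - strlen(path) - 1);
    }

    /* Pick a path: fully random with small probability, otherwise from history,
       sometimes with one extra random component tacked on. */
    void pick_path(char *out) {
      out[0] = '\0';
      if ((double)rand() / RAND_MAX < P_FULL_RANDOM) {
        int len = 1 + rand() % 4;              /* path of length up to n */
        for (int i = 0; i < len; i++)
          append_random_component(out);
      } else {
        strncpy(out, history[rand() % hist_count], MAX_PATH - 1);
        out[MAX_PATH - 1] = '\0';
        if (rand() % 4 == 0)                   /* some probability of an extra component */
          append_random_component(out);
        if (out[0] == '\0')                    /* the root entry */
          strcpy(out, "/");
      }
    }

    /* Call after a successful create so future picks can build on this path. */
    void record_created_path(const char *path) {
      if (hist_count < MAX_HIST)
        strncpy(history[hist_count++], path, MAX_PATH - 1);
    }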
27. Fault Injection Example: Reset
  A test with a random reset scheduled:

    fs_ret  = nvfs_mkdir ("/dp/images/old", ctime);
    ref_ret = mkdir ("/dp/images/old", …);

  - Did a reset occur during the operation?
    - No: compare return values; compare error codes; compare file systems; check invariants (exactly as with no resets).
    - Yes: restart/remount NVFS and compare file system contents against the reference.
      - If they match, the reset took place after the commit point.
      - If they do not match, the reset occurred before commit (the operation should appear never to have happened).
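A minimal sketch of how the tester might handle a scheduled reset, following the flow above. The nvfs_* calls, reset_occurred(), remount_nvfs(), and the comparison/invariant helpers are hypothetical names, declared only so the sketch compiles.

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <time.h>

    /* Hypothetical hooks into the tester and the NVFS build under test. */
    extern int  nvfs_mkdir(const char *path, time_t ctime);
    extern int  reset_occurred(void);            /* did the scheduled reset fire? */
    extern void remount_nvfs(void);
    extern int  nvfs_matches_reference(void);    /* deep comparison of contents */
    extern void check_nvfs_invariants(void);

    void reset_test_step(const char *path) {
      int fs_ret  = nvfs_mkdir(path, time(NULL));
      int ref_ret = mkdir(path, 0755);           /* reference lives under its own root */

      if (!reset_occurred()) {
        /* No reset: the usual differential checks apply. */
        if ((fs_ret < 0) != (ref_ret < 0))
          printf("MISMATCH: mkdir(%s) return values differ\n", path);
        return;
      }

      /* A reset fired during the NVFS operation: results are not comparable, so
         restart/remount and see which side of the commit point we landed on. */
      remount_nvfs();
      if (nvfs_matches_reference())
        return;                                  /* reset after commit: op took effect */
      /* Reset before commit: the op must appear never to have happened, and the
         file system must still mount cleanly and satisfy its invariants. */
      check_nvfs_invariants();
    }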
28. Stress Testing
  - Bugs live in the corner cases, i.e.:
    - File system is (running) out of space
    - High rate of bad blocks
  - Use a small virtual flash device to test for these conditions: 6-13 blocks, 4 pages per block, 200-400 bytes per page
  (Diagram: a small flash volume with used, free, and dirty pages and a bad block.)
29. Part of a Typical Random Test

    5::  (creat /gamma) = 0 *success*
    6::  (rename /gamma /gamma) *EBUSY*
    7::  (rename /gamma /gamma) *EBUSY*
    8::  (truncate /gamma offset 373) *EOPNOTSUPP*
    9::  (rmdir /gamma) *ENOTDIR*
    10:: (unlink /gamma) *success*
    11:: (open /gamma RDWR(2)) *ENOENT*
    12:: (open /gamma RDWR|O_APPEND(1026)) *ENOENT*
    13:: (open /gamma O_RDONLY|O_CREAT|O_EXCL) *success*
    14:: (rmdir /gamma) *ENOTDIR*
    15:: (creat /alpha) = 2 *success*
    16:: (idle compact 0 0) *success*
    17:: (idle compact 0 1) *success*
    18:: (read 0 (399 bytes) /gamma) *EBADF*
    19:: (rmdir /gamma) *ENOTDIR*
    20:: (write 0 479 /gamma) Wrote 479 bytes to FLASH
    . . .
    *********************************************
    Scheduling reset in 1...
    *********************************************
    195:: (rename /delta/gamma/alpha /gamma) *ENOENT*
    196:: (read -9999 400 /delta/gamma/alpha) *EBADF*
    197:: (creat /delta/gamma/delta)
    write of page 7 block 1 failed on reset trap
    *********************************************
    Reset event took place during this operation.
    *********************************************
    (mount) fs Block 4 bad -- hardware memory
    *success*
    *ENOSPC*
    Note: Not comparing results/error codes due to reset.
    Clearing file descriptors and open directories...
    198:: (write -9999 320 /delta/gamma/delta) *EBADF*
    199:: (rmdir /delta) *EROFS*

  Even with some feedback, we get lots of redundant and "pointless" operations. But many errors involve operations that should fail but succeed, so it is hard to filter out the rest in order to improve test efficiency: baby with the bathwater.
30. Difficulties
  - The reference is not "perfect": there are cases where Linux/Solaris file systems return a poor (but POSIX-compliant) choice of error code
  - Special efforts to test operations that are not in the reference system – such as bad block management
  - Sometimes we don't want POSIX: eventually decided that on a spacecraft, using creat to destroy existing files is bad
31. Test Strategies
  - Overnight/daily runs of long sequences of tests
    - Range through random seeds (e.g., from 1 to 1,000,000)
    - When tests fail, add one representative for each suspected cause to regressions
    - Vary test configurations (an art, not a science, alas)
    - Test length varies – an interesting question: how much does it matter?
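A minimal sketch of such an overnight driver, assuming all randomness in a test run derives from its seed so that any failing run can be replayed and minimized later. run_random_test() and log_failure() are hypothetical hooks, and the test length is an arbitrary choice.

    #include <stdlib.h>

    extern int  run_random_test(unsigned seed, int num_ops);   /* 0 = pass, nonzero = fail */
    extern void log_failure(unsigned seed);

    int main(void) {
      const int ops_per_test = 1000;            /* test length: "an interesting question" */
      for (unsigned seed = 1; seed <= 1000000; seed++) {
        srand(seed);                            /* seed fully determines the run */
        if (run_random_test(seed, ops_per_test) != 0)
          log_failure(seed);                    /* candidate for minimization/regression */
      }
      return 0;
    }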
32-34. Run Length and Effectiveness
  (Charts relating test run length to testing effectiveness.)
35. Test Strategies
  - Good news:
    - Finds lots of bugs, very quickly
  - Bad news:
    - Randomness means potentially long test cases to examine, and thousands of variations of the same error
36. Test Case Minimization
  - Solution: automatic minimization of test cases as they are generated
  - Minimized test case: subset of original sequence of operations such that
    - Test case still fails
    - Removing any one operation makes the test case successful
  - Typical improvement: order of magnitude or greater reduction in length of a test case
    - Highly effective technique, essential for quick debugging
37. Test Case Minimization
  - Based on Zeller's delta-debugging tools
    - Automated debugging state-of-the-art
    - Set of Python scripts easily modified to automatically minimize tests in different settings
    - Requires that you be able to
      - Play back test cases and determine success or failure automatically
      - Define the subsets of a test case – provide a test case decomposition
    - We'll cover delta-debugging and variants in depth later
38. Test Case Minimization
  - Based on a clever modification of a "binary search" strategy
  (Diagram: the original test case is repeatedly split into candidate subsets – first half, second half, first and last three-fourths, and so on – looking for a smaller subset that still fails.)
39. Test Case Minimization
  - One problem
    - Sometimes every large test case contains an embedded version of a small test case that fails for a different reason
      - When you delta-debug, these small cases dominate
    - Our solution: only consider a test failing (when minimizing) if the last operation is the same
      - Heuristic seems to work very well in practice

    (Example embedded small test case:)
    fd = creat ("foo")
    write (fd, 128)
    unlink ("foo")
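A minimal sketch of a minimizer in this spirit, including the "same last operation" heuristic. run_test() and the op_t representation are hypothetical, and for brevity this sketch removes one operation at a time; a real delta-debugging implementation removes large chunks (halves, quarters, …) first for speed.

    #include <string.h>

    typedef struct { char text[64]; } op_t;

    extern int run_test(const op_t *ops, int n);   /* 1 = test fails (bug triggers) */

    static int fails_same_way(const op_t *ops, int n, const op_t *last_op) {
      if (n == 0) return 0;
      if (strcmp(ops[n - 1].text, last_op->text) != 0) return 0;  /* heuristic */
      return run_test(ops, n);
    }

    /* Keep removing single operations as long as the shortened sequence still
       fails with the same final operation.  Assumes n <= 1024.  The result is
       1-minimal: removing any one remaining operation makes the test pass. */
    int minimize(op_t *ops, int n) {
      op_t last_op = ops[n - 1];
      int changed = 1;
      while (changed) {
        changed = 0;
        for (int i = 0; i < n; i++) {
          op_t trial[1024];
          int  m = 0;
          for (int j = 0; j < n; j++)
            if (j != i) trial[m++] = ops[j];     /* candidate: drop operation i */
          if (fails_same_way(trial, m, &last_op)) {
            memcpy(ops, trial, m * sizeof(op_t));
            n = m;
            changed = 1;
            break;                               /* restart scan on the smaller case */
          }
        }
      }
      return n;
    }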
40. Test Case Minimization
  - We'll revisit Zeller's delta-debugging when we cover debugging
  - http://www.st.cs.uni-sb.de/dd
    - Check out if you want to get started
    - Could be useful now, for your tests
41. Regression Strategy
  - When regression is clear, re-check all stored runs from the set that produced the test case
    - Add new (minimized only!) regressions if any of these tests fail
    - Re-run all stored test cases on a weekly basis
  (Diagram: a series of overnight test runs feeding a regression suite, which is re-run as a weekly regression.)
42. Some Results: Tests
  - Over two hundred minimized regression test cases
    - No failures over these tests for latest version of file system
    - Success on ~2,000,000 new randomized tests
    - Can continue testing: why stop?
      - Background task on a compute server…
      - Low cost of testing means there's no real reason to stop looking for rare glitches
43. Some Results: Coverage
  - 80-85% typical statement coverage
  - Hand inspection to show that uncovered code is either:
    - Extremely defensive coding to handle (non-provably) impossible conditions – coverage here would indicate a bug in the file system…
    - Cases intentionally not checked, to improve test efficiency: null pointers, invalid filename characters – can (statically) show these do not change (or depend on) file system state
44. Getting Code Coverage
  - Can just use gcov
  - Free tool available for use with gcc
  - Compile program with extra flags
    - -fprofile-arcs -ftest-coverage
  - After all (or each) test case finishes, run
    - gcov -o object-files-location source-files
  - Will produce some output & some files
    - Output gives coverage %s per file
    - And you get an annotated copy of source
45. Code Not Covered by Tests
  - Defensive coding (if this runs, we've found a fault):

        531914:  780:  if (!FS_ASSERT((dp->type & FS_G) == FS_G))
         #####:  781:  { fs_handle_condition(dp->type);
         #####:  782:    FS_SET_ERR(EEASSERT);
             -:  783:  }

  - Trivial parameter checks (this is a bit more subtle…):

      15007634: 1844:  if (want < 0 || b_in == NULL)
         #####: 1845:  { fs_i_release_access(Lp);
         #####: 1846:    FS_SET_ERR(EINVAL);
         #####: 1847:    return FS_ERROR;
             -: 1848:  }

  ##### indicates code not covered by the tests – 0 executions
46. Don't Use Random Testing for Everything!
  - Why not test handing read a null pointer?
    - Because (assuming the code is correct) it guarantees some portion of test operations will not induce failure
    - But if the code is incorrect, it's easier and more efficient to write a single test
    - The file system state doesn't have any impact (we hope!) on whether there is a null check for the buffer passed to read
  - But we have to remember to actually do these non-random fixed tests, or we may miss critical, easy-to-find bugs!
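For such cases a single fixed test is enough. A minimal sketch is below; nvfs_read and nvfs_errno are hypothetical names, and the expected EINVAL mirrors the parameter check shown on the previous slide.

    #include <assert.h>
    #include <errno.h>
    #include <stddef.h>

    extern int nvfs_read(int fd, void *buf, int nbytes);
    extern int nvfs_errno(void);

    void test_read_null_buffer(void) {
      int ret = nvfs_read(0, NULL, 128);   /* fd value is arbitrary for this sketch */
      assert(ret < 0);                     /* must fail cleanly, never crash */
      assert(nvfs_errno() == EINVAL);      /* per the check on slide 45 */
    }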
47. Some Results: After >~10^9 POSIX Ops.
  (Chart comparing, for this volume of testing: all test runs (estimated), runs with potential defects, and defect reports filed.)
48. Some Results: Defect Tracking
49. The Real Results: The Bugs
  - POSIX Divergences:
    - Early testing exposed numerous incorrect choices of POSIX error code – easily resolved, mostly low impact
  - Fault Interactions:
    - A large number of cases involved hardware failure interactions – failure to track the bad block list properly, for the most part
50. The Real Results: The Bugs
  - File System Integrity/Functionality Losses:
    - A substantial number of errors discovered involved low-probability, very high impact scenarios
      - Complete loss of file system contents
      - Loss of file contents (for a file not involved in an operation, in some cases)
      - Null pointer dereference
      - Inability to unmount the file system (!)
      - Failed assertions on global invariants
      - Undead files – I thought I killed that!
  Our version of feedback really helps with finding these
51. The Real Results: The Bugs
  - We believe it is extremely unlikely that traditional testing procedures would have exposed several of these errors
    - Probably would have missed a lot of the POSIX and hardware fault errors too, but those aren't as important
  - Backing out to a larger perspective: why were we able to find them? (The big question)
52. Why We Found the Bugs (We Think!)
  - Design for testability:
    - Scalable down to small Flash systems
    - Very heavy use of assertions and invariant checks
    - Chose system behavior to make the system predictable (thus testable)
  - Performed millions of different automated tests, thanks to randomization with feedback + a powerful oracle (differential testing)
53. Why We Found the Bugs
  - Need both:
    - If only nominal scenarios are executed, design for testability
      - can't take advantage of small configurations
      - gives less chance to exercise assertions
    - Large-scale random testing is less effective if
      - you can't scale down the hardware
      - there are no sanity checks
      - system unpredictability makes it difficult to use a reference oracle
54. Reusing the Test Framework
  - Internal development efforts at JPL:
    - RAMFS
      - Use of code instrumentation for "hardware simulation" (memory is the hardware)
    - NVDS: low level storage module
    - Adaptations for new flash hardware/MSL
  - Request from Discovery-class NASA mission – used to perform acceptance testing on a (non-POSIX) flight file system
    - Exposed serious undetected errors
55. Inheriting Test Code
  - RAMFS: A RAM file system with reliability across warm resets:
    - Used the same test framework and reference file system as for flash
    - Unable to inject faults through a custom driver layer – "write" is C assignment or memcpy
    - Used automatic code instrumentation to simulate arbitrary system resets
      - Add a potential longjmp escape at each write to global memory (everything but stack vars)
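A minimal sketch of that instrumentation idea: route every write to global memory through a hook that can escape via longjmp, simulating a warm reset at exactly that point. GLOBAL_WRITE, the arming flag, and the escape probability are illustrative; the actual work inserted hooks like this by automatic code instrumentation rather than by hand.

    #include <setjmp.h>
    #include <stdlib.h>

    static jmp_buf reset_point;
    static int     reset_armed;          /* set by the tester when a reset is scheduled */

    /* Every write to a global lvalue goes through this macro. */
    #define GLOBAL_WRITE(lvalue, value)                            \
      do {                                                         \
        if (reset_armed && (rand() % 1000) == 0)                   \
          longjmp(reset_point, 1);     /* simulated warm reset */  \
        (lvalue) = (value);                                        \
      } while (0)

    /* The test driver arms resets and catches the escape. */
    int run_one_test(void (*test_body)(void)) {
      reset_armed = 1;
      if (setjmp(reset_point) == 0) {
        test_body();                   /* runs to completion or until a simulated reset */
        return 0;                      /* no reset occurred */
      }
      /* A "reset" happened mid-operation: now remount RAMFS and check that its
         contents are still consistent, as in the flash reset tests. */
      return 1;
    }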
56. Testing an Externally Developed FS
  - Like JPL, contractor decided that past mission flash file systems were inadequate
    - Developed new "highly-reliable" flash file system
    - JPL management wanted to get a better feel for the quality of this system
    - JPL mission management knew of LaRS work
  - Contractor stated that the development process followed and previous testing were first-rate
    - "One of our best developers" (true)
    - "Following our best process" (probably true)
      - CMMI Level 3
      - For mission critical software
    - "Ready to fly"
57. Testing an Externally Developed FS
  - Nonetheless, JPL management requested that LaRS perform additional acceptance testing with our random test methods
    - Validate effectiveness as a highly reliable file system for mission data
    - Evaluate risk to mission
    - Improve quality of file system
58. Performing the Testing
  - LaRS received an Interface Control Document (ICD) and an executable
    - But no source, requirements or design documents, due to IP concerns (black box! or at least very gray box)
  - Prior to receiving the executable, our queries about behavior described in the ICD resulted in two Software Change Requests for serious flaws
    - Tester's job may begin before even receiving code: thinking about how to test a system can expose faults
      - Good case for Beizer's levels – we certainly didn't actually execute any code to find those problems
  - Testing began early January, report delivered February 13
59. Test Results
  - Exposed 16 previously undetected errors – 14 were fixed
    - Each error had potential for file system corruption or loss-of-functionality
    - Delivered a C program with a minimal test case for each error to ease diagnosis
    - 8 new releases to correct errors, sent shortly after our test cases arrived
    - Final version successfully executed hundreds of thousands of operations
      - Modified tester to avoid remaining problems that were not fixed
  Useful idea: have an automatic tester generate stand-alone program test cases automatically: very helpful for sending to developers, whether in-house or outside – and it ensures the bug isn't in the test framework!
60. Sample error
  - Reset during close can cause fatal file system corruption and crash
    - Reset after the 2nd write while closing a file can produce system corruption; when the system is mounted next, it crashes with a segmentation fault
  (Diagram: a file is opened and closed; one write to the flash storage device completes, but the system reboots before the next page can be written; after restart, mounting the file system crashes. This error was fixed.)
61. Reset Testing
  - Discussion with developer revealed that extensive reset testing had been done
  - This means that the testing had been better than is typical, but still covered only a few scenarios
  - Should still be considered incomplete
  (Diagram: a handful of fixed test scenarios, with a reset tried at each point and the result checked.)
62. Reset Testing
  - Random testing can try thousands of scenarios with resets at random points
  - More important: it can also vary flash contents & operations
  Hypothesis: it is more important to vary the states in which a reset takes place extensively than to exhaustively check all placements for reset in a limited set of scenarios
63. Test Results
  - Reported on remaining major vulnerabilities:
    - High mission risk for use of rename operation – can destroy file system contents if used on a full volume
    - Contractor hardware model may not reflect actual hardware behavior
    - The 2 unfixed errors (design flaws) prevented significant testing on a full or nearly full file system
64. Test Results
  - Test efforts well received by the project and by the contractor development team
    - Reliability was improved – corrections for many errors, and more information about remaining risks
    - LaRS team was invited to attend the code and device driver review
65. Principles Used
  - Random testing (with feedback)
  - Test automation
  - Hardware simulation & fault injection
  - Use of a well-tested reference implementation as oracle (differential testing)
  - Automatic test minimization (delta-debugging)
  - Design for testability
    - Assertions
    - Downward scalability (small model property)
    - Preference for predictability
66. Synopsis
  - Random testing is sometimes a powerful method and could likely be applied more broadly in other missions
    - Already applied to four file system-related development efforts
    - Part or all of this approach is applicable to other critical components (esp. with better models to use as references)
67. Ongoing Work
  - Used framework / hardware simulation / reference in model checking of storage system
  - Developing hybrid methods combining model checking, constraint solving, random testing
    - State spaces are still too large
    - Sound abstractions are very difficult to devise
  - Theorem proving efforts on design proved very labor-intensive, even with insights from early efforts & layered design
68. Challenge for "Formal Verification"
  - Traditionally:
    - "Testing is good for the 1/10^3 bugs"
    - "For 1/10^7 bugs, you need model checking or the like"
  - We've found such low-probability errors (checksum overlap with reset partway through memcpy in RAMFS)
    - (Also found some bugs with model checking we did not find with random testing)