Lin Presentation Transcript

  • 1. Whither Generic Recovery from Application Faults? A Fault Study using Open-Source Software. Subhachandra Chandra and Peter M. Chen, University of Michigan. Presentation: Lin Tan. Published in DSN 2000.
  • 2. Hypothesis
    • Most faults in released applications are transient [Gray 86]
      • Transient faults are more difficult to reproduce and to debug
    • Can generic recovery techniques survive most application faults without using application-specific information?
  • 3. Methodology
    • Classify software faults into 3 types
      • One type: eliminated by generic recovery techniques
      • How many faults are this type?
    • Study a subset of faults of 3 applications
      • Apache – widely used HTTP server
      • Gnome – desktop environment
      • MySQL – multithreaded SQL database server
    • Conclusions
  • 4. Fixed environment -> deterministic execution
    • Given a fixed operating environment, a set of concurrent, sequential processes is completely deterministic. [Dijkstra 72]
  • 5. Software Fault Classification
    • Environment-independent (deterministic)
      • Long URL
    • Environment-dependent
      • Environment-dependent non-transient (Subjective)
        • Disk full
      • Environment-dependent transient (Subjective)
        • Race condition
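The three-way classification above can be read as two yes/no questions asked about each fault. The sketch below is only an illustration of that reading: the `FaultReport` fields and the `classify` helper are hypothetical names, and in the study the classes were assigned by manually reading bug reports, not by code.

```python
from dataclasses import dataclass
from enum import Enum


class FaultClass(Enum):
    ENVIRONMENT_INDEPENDENT = "environment-independent"
    ENV_DEPENDENT_NONTRANSIENT = "environment-dependent non-transient"
    ENV_DEPENDENT_TRANSIENT = "environment-dependent transient"


@dataclass
class FaultReport:
    # Hypothetical fields: the answers a reader of the bug report would fill in.
    description: str
    depends_on_environment: bool
    condition_likely_gone_on_retry: bool


def classify(report: FaultReport) -> FaultClass:
    """Map the two judgments onto the three classes from the slide."""
    if not report.depends_on_environment:
        # Same input always triggers the fault: deterministic.
        return FaultClass.ENVIRONMENT_INDEPENDENT
    if report.condition_likely_gone_on_retry:
        return FaultClass.ENV_DEPENDENT_TRANSIENT
    return FaultClass.ENV_DEPENDENT_NONTRANSIENT


# The three example faults named on the slide.
examples = [
    FaultReport("Long URL", False, False),
    FaultReport("Disk full", True, False),
    FaultReport("Race condition", True, True),
]
labels = [classify(r).value for r in examples]
```

The second question (will the condition be gone on retry?) is the subjective one, which is why the slide marks both environment-dependent classes as subjective.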
  • 6. Program Operating Environment
    • Software
      • Other programs
      • Kernel
    • Hardware
      • ECC errors
      • Interrupts
      • Thread scheduler
    • Timing of workload requests (e.g., typing speed)
    • User Input:
      • part of the program
      • NOT part of the environment
  • 7. Selection of Bugs
    • Apache: 50 bugs out of 5220 bug reports
      • Severe or critical bugs
    • Gnome: 45 bugs out of 500 bug reports
      • Only in core files, libraries, and four commonly used Gnome applications
    • MySQL: 44 bugs out of 44,000 messages from the mailing list
      • Serious bugs
  • 8. Example Bugs
    • Apache
      • Long URL causes overflow.
    • MySQL
      • Lack of file descriptors.
    • Gnome
      • Race condition between a request for action from an applet and its removal.
      • Race condition between an image viewer and a property editor.
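To make the applet race concrete, the following sketch replays the two possible orderings of "request an action" and "remove the applet" without real threads. The `Panel` class and its methods are hypothetical stand-ins, not Gnome code; the point is only that the same two events crash in one order and not in the other.

```python
class Panel:
    """Hypothetical stand-in for a panel that hosts applets."""

    def __init__(self):
        self.applets = {"clock": "clock applet"}

    def remove(self, name):
        del self.applets[name]

    def handle_request(self, name):
        # Bug being modeled: assumes the applet still exists.
        return self.applets[name]  # raises KeyError if it was removed first


def run(order):
    """Apply the events in the given order and report the outcome."""
    panel = Panel()
    try:
        for event in order:
            if event == "request":
                panel.handle_request("clock")
            else:
                panel.remove("clock")
        return "ok"
    except KeyError:
        return "crash"


orders = [("request", "remove"), ("remove", "request")]
outcomes = {order: run(order) for order in orders}
```

Because the ordering is decided by the scheduler, which the slides count as part of the environment, a retry may see the harmless order. That is what makes this kind of fault environment-dependent and transient.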
  • 9. Results - Apache (Class: #Faults)
    • Environment-independent: 36
    • Environment-dependent non-transient: 7
    • Environment-dependent transient: 7
  • 10. Results - Gnome (Class: #Faults)
    • Environment-independent: 39
    • Environment-dependent non-transient: 3
    • Environment-dependent transient: 3
  • 11. Results - MySQL (Class: #Faults)
    • Environment-independent: 38
    • Environment-dependent non-transient: 4
    • Environment-dependent transient: 2
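The counts on the three results slides can be turned into shares with a few lines of arithmetic. The sketch below uses only the numbers reported on those slides; the dictionary layout and function name are illustrative choices.

```python
# Fault counts per class, copied from the Apache, Gnome, and MySQL results slides.
counts = {
    "Apache": {"independent": 36, "nontransient": 7, "transient": 7},
    "Gnome": {"independent": 39, "nontransient": 3, "transient": 3},
    "MySQL": {"independent": 38, "nontransient": 4, "transient": 2},
}


def share(by_class, key):
    """Percentage of an application's studied faults that fall in one class."""
    return 100.0 * by_class[key] / sum(by_class.values())


transient_share = {app: round(share(c, "transient"), 1) for app, c in counts.items()}

# Pooled over all three applications.
total = sum(sum(c.values()) for c in counts.values())
total_transient = sum(c["transient"] for c in counts.values())
total_independent = sum(c["independent"] for c in counts.values())
overall_transient = round(100.0 * total_transient / total, 1)
overall_independent = round(100.0 * total_independent / total, 1)
```

Transient faults come to 14.0% for Apache, 6.7% for Gnome, and 4.5% for MySQL. Pooled over the 139 studied faults, 8.6% are transient and 81.3% are environment-independent.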
  • 12. Limitations & Discussions
    • May differ for other applications
      • Only 3 applications
      • Only manually studied reported severe bugs (50/5220, 45/500, 44/44,000)
        • Use automated tools?
    • Better to implement a general recovery approach and verify the results.
  • 13. Limitations & Discussions
    • Why so few transient faults?
      • People tend to not report transient bugs?
      • The study ignores how frequently each bug occurs
      • More reliable systems have more transient bugs?
  • 14. Related Work
    • 5-13% of faults: timing or synchronization related, in the MVS OS and the DB2 and IMS databases. [Sullivan91, Sullivan92]
    • 14% of faults: timing and race conditions, in the Tandem GUARDIAN OS. [Lee and Iyer 93]
    • 29% of faults: transient and recoverable by the Tandem process-pair mechanism. [Lee and Iyer 93]
  • 15. Conclusions
    • Classical application-generic recovery techniques, such as process pairs, used without application-specific information, will NOT be sufficient to enable these applications to survive most software faults.
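The conclusion can be illustrated with a toy model of generic recovery: retry the identical operation and hope the environment has changed. This is a simplified sketch, not the process-pair mechanism itself; the function names, the environment dictionaries, and the choice of which conditions change between attempts are all assumptions made for the example.

```python
def generic_recover(operation, environments):
    """Retry the same operation unchanged, once per environment state.

    Models an application-generic technique: the request is never modified,
    only the environment may differ between attempts.
    """
    for attempt, env in enumerate(environments, start=1):
        try:
            return operation(env), attempt
        except RuntimeError:
            continue
    return None, len(environments)


def long_url_fault(env):
    # Environment-independent: fails no matter what the environment looks like.
    raise RuntimeError("overflow on long URL")


def disk_full_fault(env):
    # Environment-dependent non-transient: the condition persists across retries.
    if env["disk_full"]:
        raise RuntimeError("disk full")
    return "served"


def race_fault(env):
    # Environment-dependent transient: the bad interleaving is unlikely to repeat.
    if env["bad_interleaving"]:
        raise RuntimeError("race")
    return "served"


# Assumed environment on the first attempt and on the retry:
# the disk stays full, but the thread scheduling differs.
environments = [
    {"disk_full": True, "bad_interleaving": True},
    {"disk_full": True, "bad_interleaving": False},
]

results = {
    "long URL": generic_recover(long_url_fault, environments)[0],
    "disk full": generic_recover(disk_full_fault, environments)[0],
    "race condition": generic_recover(race_fault, environments)[0],
}
```

In this model only the race condition is survived by the retry. Since the large majority of the studied faults are environment-independent, a retry-style technique on its own would leave most of them unrecovered, which matches the conclusion above.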