Automated Debugging: Are We There Yet?



This is an updated version of the opening talk I gave on February 4, 2013 at the Dagstuhl Seminar on Fault Prediction, Localization, and Repair (http://goo.gl/vc6Cl).

Abstract: Software debugging, which involves localizing, understanding, and removing the cause of a failure, is a notoriously difficult, extremely time-consuming, and human-intensive activity. For this reason, researchers have invested a great deal of effort in developing automated techniques and tools for supporting various debugging tasks. Although potentially useful, most of these techniques have yet to fully demonstrate their practical effectiveness. Moreover, many current debugging approaches suffer from some common limitations and rely on several strong assumptions about both the characteristics of the code being debugged and how developers behave when debugging such code. This talk provides an overview of the state of the art in the broader area of software debugging, discusses the strengths and weaknesses of the main existing debugging techniques, presents a set of open challenges in this area, and sketches future research directions that may help address these challenges.

The second part of the talk discusses joint work with Chris Parnin presented at ISSTA 2011: C. Parnin and A. Orso, "Are Automated Debugging Techniques Actually Helping Programmers?", in Proceedings of the International Symposium on Software Testing and Analysis (ISSTA 2011), pp 199-209, http://doi.acm.org/10.1145/2001420.2001445



  1. 1. Automated Debugging: Are We There Yet? Alessandro (Alex) Orso School of Computer Science – College of Computing Georgia Institute of Technology http://www.cc.gatech.edu/~orso/ Partially supported by: NSF, IBM, and MSR
  2. 2. Automated Debugging: Are We There Yet? Alessandro (Alex) Orso School of Computer Science – College of Computing Georgia Institute of Technology http://www.cc.gatech.edu/~orso/ Partially supported by: NSF, IBM, and MSR
  3. 3. How are we doing? Are we there yet? Where shall we go next?
  4. 4. How Are We Doing? A Short History of Debugging
  5. 5. ??? The Birth of Debugging First reference to software errors Your guess? 2013
  6. 6. The Birth of Debugging 1840 1843 • Software errors mentioned in Ada Byron's notes on Charles Babbage's analytical engine • Several uses of the term bug to indicate defects in computers and software • First actual bug and actual debugging (Admiral Grace Hopper’s associates working on Mark II Computer at Harvard University) 2013
  7. 7. The Birth of Debugging 1840 ... • Software errors mentioned in Ada Byron's notes on Charles Babbage's analytical engine 1940 • Several uses of the term bug to indicate defects in computers and software • First actual bug and actual debugging (Admiral Grace Hopper’s associates working on Mark II Computer at Harvard University) 2013
  8. 8. The Birth of Debugging 1840 • Software errors mentioned in Ada Byron's notes on Charles Babbage's analytical engine • Several uses of the term bug to indicate defects in computers and software 1947 2013 • First actual bug and actual debugging (Admiral Grace Hopper’s associates working on Mark II Computer at Harvard University)
  9. 9. Symbolic Debugging 1840 • UNIVAC 1100’s FLIT (Fault Location by Interpretive Testing) 1962 2013 • Richard Stallman’s GDB • DDD • ...
  10. 10. Symbolic Debugging 1840 • UNIVAC 1100’s FLIT (Fault Location by Interpretive Testing) 1986 2013 • GDB • DDD • ...
  11. 11. Symbolic Debugging 1840 • UNIVAC 1100’s FLIT (Fault Location by Interpretive Testing) 1996 2013 • GDB • DDD • ...
  12. 12. Symbolic Debugging 1840 • UNIVAC 1100’s FLIT (Fault Location by Interpretive Testing) 1996 2013 • GDB • DDD • ...
  13. 13. Program Slicing 1960 • Intuition: developers “slice” backwards when debugging 1981 2013 • Weiser’s breakthrough paper • Korel and Laski’s dynamic slicing • Agrawal • Ko’s Whyline
  14. 14. Program Slicing 1960 • Intuition: developers “slice” backwards when debugging 1981 2013 • Weiser’s breakthrough paper • Korel and Laski’s dynamic slicing • Agrawal • Ko’s Whyline
  15. 15. Static Slicing Example mid() { int x,y,z,m; 1: read(“Enter 3 numbers:”,x,y,z); 2: m = z; 3: if (y<z) 4: if (x<y) 5: m = y; 6: else if (x<z) 7: m = y; // bug 8: else 9: if (x>y) 10: m = y; 11: else if (x>z) 12: m = x; 13: print(“Middle number is:”, m); }
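For readers who want to run the example, here is a compilable C version of the slide's mid() function; this is a sketch of my own (scanf/printf stand in for the slide's read/print pseudocode, and the seeded bug at statement 7 is preserved). The closing comment notes what a static backward slice from statement 13 on m would keep under the usual data- and control-dependence definition.

```c
/* mid.c: adapted C version of the slides' mid() example.
 * Statement numbers from the slide are shown as comments. */
#include <stdio.h>

int main(void) {
    int x, y, z, m;
    /* 1 */  if (scanf("%d %d %d", &x, &y, &z) != 3) return 1;  /* read x, y, z */
    /* 2 */  m = z;
    /* 3 */  if (y < z) {
    /* 4 */      if (x < y)
    /* 5 */          m = y;
    /* 6 */      else if (x < z)
    /* 7 */          m = y;              /* bug: should be m = x */
    /* 8 */  } else {
    /* 9 */      if (x > y)
    /* 10 */         m = y;
    /* 11 */     else if (x > z)
    /* 12 */         m = x;
             }
    /* 13 */ printf("Middle number is: %d\n", m);
    return 0;
}
/* A static backward slice w.r.t. (statement 13, m) keeps every statement that
 * could affect m on some path: 1-7 and 9-12, i.e., essentially the whole
 * function, which is why static slicing alone rarely narrows things down much. */
```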
  16. 16. Program Slicing 1960 • Intuition: developers “slice” backwards when debugging 1981 2013 • Weiser’s breakthrough paper • Korel and Laski’s dynamic slicing • Agrawal • Ko’s Whyline
  17. 17. Program Slicing 1960 • Intuition: developers “slice” backwards when debugging 1988 1993 2013 • Weiser’s breakthrough paper • Korel and Laski’s dynamic slicing • Agrawal • Ko’s Whyline
  18. 18. Dynamic Slicing Example mid() { int x,y,z,m; 1: read(“Enter 3 numbers:”,x,y,z); 2: m = z; 3: if (y<z) 4: if (x<y) 5: m = y; 6: else if (x<z) 7: m = y; // bug 8: else 9: if (x>y) 10: m = y; 11: else if (x>z) 12: m = x; 13: print(“Middle number is:”, m); }
  19. 19. Dynamic Slicing Example mid() { int x,y,z,m; 1: read(“Enter 3 numbers:”,x,y,z); 2: m = z; 3: if (y<z) 4: if (x<y) 5: m = y; 6: else if (x<z) 7: m = y; // bug 8: else 9: if (x>y) 10: m = y; 11: else if (x>z) 12: m = x; 13: print(“Middle number is:”, m); } [statement-coverage matrix for the test cases 3,3,5; 1,2,3; 3,2,1; 5,5,5; 5,3,4; 2,1,3 with their Pass/Fail outcomes; only 2,1,3 fails]
  20. 20. Dynamic Slicing Example (same code and coverage matrix as the previous slide)
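As a concrete instance of how this differs from static slicing, here is a hand-worked dynamic slice for the failing test (2,1,3) of the C version shown earlier; this is my reconstruction of the standard computation, not data taken from the slides.

```c
/* Failing run of mid() on x=2, y=1, z=3 (correct middle: 2, printed: 1).
 * Executed statements: 1, 2, 3, 4, 6, 7, 13.
 * Dynamic backward slice w.r.t. (13, m) for this run:
 *   13 uses m, which was last defined at 7              (data dependence)
 *    7 executed because 6 took its true branch          (control dependence)
 *    6 executed because 4 took its false branch         (control dependence)
 *    4 executed because 3 took its true branch          (control dependence)
 *    1 defines x, y, z used at 3, 4, 6, and 7           (data dependence)
 * Slice = {1, 3, 4, 6, 7, 13}: statement 2 drops out (its definition of m is
 * overwritten at 7), and the slice still contains the faulty statement 7. */
```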
  21. 21. Program Slicing 1960 • Intuition: developers “slice” backwards when debugging 1988 1993 2013 • Weiser’s breakthrough paper • Korel and Laski’s dynamic slicing • Agrawal • Ko’s Whyline
  22. 22. Program Slicing 1960 • Intuition: developers “slice” backwards when debugging 2008 2013 • Weiser’s breakthrough paper • Korel and Laski’s dynamic slicing • Agrawal • Ko’s Whyline
  23. 23. Delta Debugging 1960 • Intuition: it’s all about differences! • Isolates failure causes automatically • Zeller’s “Yesterday, My Program Worked. Today, It Does Not. Why?” 1999 2013 •
  24. 24. Delta Debugging 1960 • Intuition: it’s all about differences! • Isolates failure causes automatically • Zeller’s “Yesterday, My Program Worked. Today, It Does Not. Why?” 1999 2013 • Applied in several contexts •
  25. 25. Today ✘ ✔ ✔ Yesterday
  26. 26. Today ✘ ✘ ✔ ✔ Yesterday
  27. 27. Today ✘ ✘ ✘ ✔ ✔ Yesterday
  28. 28. Today ✘ ✘ ✘ ✔ ✔ Yesterday
  29. 29. Today ✘ ✘ ✘ … ✘ ✔ … Yesterday Failure cause ✔ ✔
  30. 30. Today ✘ ✘ ✘ … ✘ ✔ … Yesterday Failure cause ✔ ✔ (Applied to states, programs, inputs, ...)
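To make the Yesterday/Today picture concrete, below is a minimal sketch, in C, of the bisection idea behind delta debugging. It assumes a single failure-inducing change applied as part of an ordered series and a deterministic test; NCHANGES, FAULTY, and fails_with_prefix are hypothetical stand-ins for the real change set and test run. Zeller's actual dd/ddmin algorithms generalize this to arbitrary subsets of changes (or inputs, or program states), interference between changes, and unresolved test outcomes.

```c
/* dd_sketch.c: simplified "which change broke it?" bisection. */
#include <stdio.h>

#define NCHANGES 8
#define FAULTY   5   /* hypothetical index of the failure-inducing change */

/* Stand-in test oracle: "apply the first k changes to yesterday's version,
 * rebuild, and run the failing test". Here it just checks whether the
 * faulty change is among the first k. */
static int fails_with_prefix(int k) {
    return k > FAULTY;
}

int main(void) {
    int lo = 0;          /* largest prefix known to pass (yesterday: 0 changes) */
    int hi = NCHANGES;   /* smallest prefix known to fail (today: all changes)  */
    while (hi - lo > 1) {
        int mid = (lo + hi) / 2;
        if (fails_with_prefix(mid))
            hi = mid;    /* failure already induced by the first mid changes */
        else
            lo = mid;    /* first mid changes still pass */
    }
    printf("failure-inducing change: #%d\n", hi - 1);   /* prints 5 here */
    return 0;
}
```

Each iteration halves the set of candidate changes, so the culprit is found with a logarithmic number of test runs instead of retrying changes one by one.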
  31. 31. Statistical Debugging 1960 • Intuition: debugging techniques can leverage multiple executions • Tarantula • Liblit’s CBI 2001 2013 • Many others!
  32. 32. Statistical Debugging 1960 • Intuition: debugging techniques can leverage multiple executions • Tarantula • Liblit’s CBI 2001 2013 • Many others!
  33. 33. Tarantula: Test Cases 3,3,5; 1,2,3; 3,2,1; 5,5,5; 5,3,4; 2,1,3 with Pass/Fail outcomes and per-statement suspiciousness. mid() { int x,y,z,m; 1: read(“Enter 3 numbers:”,x,y,z); 2: m = z; 3: if (y<z) 4: if (x<y) 5: m = y; 6: else if (x<z) 7: m = y; // bug 8: else 9: if (x>y) 10: m = y; 11: else if (x>z) 12: m = x; 13: print(“Middle number is:”, m); } [coverage matrix omitted; suspiciousness for statements 1-13: 0.5 0.5 0.5 0.6 0.0 0.7 0.8 0.0 0.0 0.0 0.0 0.0 0.5]
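A minimal sketch of how the scores on this slide come about: Tarantula's suspiciousness for a statement is the fraction of failing tests that execute it, normalized by the fractions of failing and passing tests that execute it. The coverage matrix below is my transcription of the classic mid() example from Jones and Harrold's Tarantula work, so treat it as an illustration rather than the talk's exact data; the Ochiai metric mentioned on a later slide is included for comparison.

```c
/* tarantula.c: statistical fault localization on the mid() example.
 * Tarantula: (ef/F) / (ef/F + ep/P), where ef/ep are the numbers of failing
 * and passing tests covering a statement, and F/P are the totals. */
#include <math.h>
#include <stdio.h>

#define NSTMT 13
#define NTEST 6

/* cov[t][s] == 1 iff test t executes statement s+1. */
static const int cov[NTEST][NSTMT] = {
    /* stmt:          1  2  3  4  5  6  7  8  9 10 11 12 13 */
    /* 3,3,5  pass */ {1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1},
    /* 1,2,3  pass */ {1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1},
    /* 3,2,1  pass */ {1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1},
    /* 5,5,5  pass */ {1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1},
    /* 5,3,4  pass */ {1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1},
    /* 2,1,3  FAIL */ {1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1},
};
static const int failing[NTEST] = {0, 0, 0, 0, 0, 1};

int main(void) {
    int F = 0, P = 0;
    for (int t = 0; t < NTEST; t++) {
        if (failing[t]) F++; else P++;
    }
    for (int s = 0; s < NSTMT; s++) {
        int ef = 0, ep = 0;   /* failing / passing tests that cover statement s */
        for (int t = 0; t < NTEST; t++)
            if (cov[t][s]) { if (failing[t]) ef++; else ep++; }
        double fr = (double)ef / F, pr = (double)ep / P;
        double tarantula = (fr + pr > 0.0) ? fr / (fr + pr) : 0.0;
        double ochiai = (ef > 0) ? ef / sqrt((double)F * (ef + ep)) : 0.0;
        printf("stmt %2d  tarantula %.2f  ochiai %.2f\n", s + 1, tarantula, ochiai);
    }
    return 0;   /* statement 7 (the seeded bug) gets the highest score, ~0.83 */
}
```

Compiling with, e.g., cc tarantula.c -lm and running it reproduces (up to rounding) the 0.5/0.6/0.7/0.8 column shown on the slide, with the faulty statement 7 ranked first.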
  34. 34. Statistical Debugging 1960 • Intuition: debugging techniques can leverage multiple executions • Tarantula • Liblit’s CBI 2001 2013 • Many others!
  35. 35. Statistical Debugging 1960 • Intuition: debugging techniques can leverage multiple executions • Tarantula • CBI 2003 2013 • Many others!
  36. 36. Statistical Debugging 1960 • Intuition: debugging techniques can leverage multiple executions • Tarantula • CBI • Ochiai 2006 2013 • Many others!
  37. 37. Statistical Debugging 1960 • Intuition: debugging techniques can leverage multiple executions 2010 2013 • Tarantula • CBI • Ochiai • Causal inference based • Many others!
  38. 38. Statistical Debugging 1960 • Intuition: debugging techniques can leverage multiple executions ... 2013 • Tarantula • CBI • Ochiai • Causal inference based • Many others! Workflow integration: GZoltar, Tarantula, EzUnit, ...
  39. 39. Formula-based Debugging (AKA Failure Explanation) 1960 • Intuition: executions can be expressed as formulas that we can reason about 2009 2013 • Darwin • Cause Clue Clauses • Error invariants •
  40. 40. Input I, Formula: Input = I ∧ c1 ∧ c2 ∧ c3 ∧ ... ∧ cn-2 ∧ cn-1 ∧ cn ∧ Assertion A (unsatisfiable); MAX-SAT; Complement { ci }
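To make the encoding concrete, here is a reconstructed instance (my own, not taken from the slides) for the failing mid() run on input (2,1,3), in the style of Bug Assist's Cause Clue Clauses; the SSA-style names m1, m2, and out are introduced here for illustration:

    Hard:  x = 2 ∧ y = 1 ∧ z = 3                                  (Input I)
    Soft:  m1 = z ∧ (y < z) ∧ ¬(x < y) ∧ (x < z) ∧ m2 = y ∧ out = m2   (trace clauses c1 ... cn)
    Hard:  out = 2                                                (Assertion A: expected middle value)

The conjunction is unsatisfiable. A partial MAX-SAT solver keeps the hard clauses and drops as few soft clauses as possible; here dropping m2 = y (or, alternatively, out = m2) restores satisfiability, so the complements of the maximal satisfiable subsets point to statement 7 and the output statement as candidate fault locations.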
  41. 41. Formula-based Debugging (AKA Failure Explanation) 1960 • Intuition: executions can be expressed as formulas that we can reason about 2009 2013 • Darwin • Cause Clue Clauses • Error invariants •
  42. 42. Formula-based Debugging (AKA Failure Explanation) 1960 • Intuition: executions can be expressed as formulas that we can reason about • Darwin • Bug Assist • Error invariants • 2011 2013
  43. 43. Formula-based Debugging (AKA Failure Explanation) 1960 • Intuition: executions can be expressed as formulas that we can reason about • Darwin • Bug Assist • Error invariants • 2011 2013
  44. 44. Formula-based Debugging (AKA Failure Explanation) 1960 • Intuition: executions can be expressed as formulas that we can reason about • Darwin • Bug Assist • Error invariants • Angelic debugging 2011 2013
  45. 45. Additional Techniques (not meant to be comprehensive!) 1960 2013 • Contracts (e.g., Meyer et al.) • Counterexample-based (e.g., Groce et al., Ball et al.) • Tainting-based (e.g., Leek et al.) • Debugging of field failures (e.g., Jin et al.) • Predicate switching (e.g., Zhang et al.) • Fault localization for multiple faults (e.g., Steimann et al.) • Debugging of concurrency failures (e.g., Park et al.) • Automated data structure repair (e.g., Rinard et al.) • Finding patches with genetic programming • Domain specific fixes (tests, web pages, comments, concurrency) • Identifying workarounds/recovery strategies (e.g., Gorla et al.) • Formula based debugging (e.g., Jose et al., Ermis et al.) • ...
  46. 46. Are We There Yet? Can We Debug at the Push of a Button?
  47. 47. Automated Debugging (rank based) 1) 2) 3) 4) … “Here is a list of places to check out …” “Ok, I will check out your suggestions one by one.”
  48. 48. Automated Debugging Conceptual Model ✔ ✔ ✔ 1) 2) 3) 4) … “Found the bug!”
  49. 49. Performance of Automated Debugging Techniques [chart: % of faulty versions (y-axis, 0-100) vs. % of program to be examined to find the fault (x-axis, 0-100), for the Space and Siemens subjects]
  50. 50. Mission Accomplished? Best result: fault in 10% of the code. Great, but... 100 LOC ➡ 10 LOC 10,000 LOC ➡ 1,000 LOC 100,000 LOC ➡ 10,000 LOC
  51. 51. Mission Accomplished? Best result: fault in 10% of the code. Great, but... 100 LOC ➡ 10 LOC 10,000 LOC ➡ 1,000 LOC 100,000 LOC ➡ 10,000 LOC Moreover, strong assumptions
  52. 52. Assumption #1: Programmers exhibit perfect bug understanding Do you see a bug?
  53. 53. Assumption #2: Programmers inspect a list linearly and exhaustively Good for comparison, but is it realistic?
  54. 54. Assumption #2: Programmers inspect a list linearly and exhaustively. Good for comparison, but is it realistic? Does the conceptual model make sense? Have we really evaluated it?
  55. 55. Where Shall We Go Next? Are We Headed in the Right Direction? AKA: “Are Automated Debugging Techniques Actually Helping Programmers?” ISSTA 2011 Chris Parnin and Alessandro Orso
  56. 56. What do we know about automated debugging? Studies on tools / Human studies
  57. 57. What do we know about automated debugging? Let's see… Over 50 years of research on automated debugging: 2001. Statistical Debugging; 1999. Delta Debugging; 1981. Weiser. Program Slicing; 1962. Symbolic Debugging (UNIVAC FLIT). Studies on tools / Human studies
  58. 58. What do we know about automated debugging? Studies on tools / Human studies: Weiser, Kusumoto, Sherwood, Ko, DeLine
  59. 59. Are these Techniques and Tools Actually Helping Programmers? • What if we gave developers a ranked list of statements? • How would they use it? • Would they easily see the bug in the list? • Would ranking make a difference?
  60. 60. Hypotheses H1: Programmers who use automated debugging tools will locate bugs faster than programmers who do not use such tools H2: Effectiveness of automated tools increases with the level of difficulty of the debugging task H3: Effectiveness of debugging with automated tools is affected by the faulty statement’s rank
  61. 61. Research Questions RQ1: How do developers navigate a list of statements ranked by suspiciousness? In order of suspiciousness or jumping from one statement to the other? RQ2: Does perfect bug understanding exist? How much effort is involved in inspecting and assessing potentially faulty statements? RQ3: What are the challenges involved in using automated debugging tools effectively? Can unexpected, emerging strategies be observed?
  62. 62. Experimental Protocol: Setup. Participants: 34 developers (MS students), with different levels of expertise (low, medium, high)
  63. 63. Experimental Protocol: Setup. Tools: • Rank-based tool (Eclipse plug-in, logging) • Eclipse debugger
  64. 64. Experimental Protocol: Setup. Software subjects: • Tetris (~2.5KLOC) • NanoXML (~4.5KLOC)
  65. 65. Tetris Bug (Easier)
  66. 66. NanoXML Bug (Harder)
  67. 67. Experimental Protocol: Setup. Software subjects: • Tetris (~2.5KLOC) • NanoXML (~4.5KLOC)
  68. 68. Experimental Protocol: Setup. Tasks: • Fault in Tetris • Fault in NanoXML • 30 minutes per task • Questionnaire at the end
  69. 69. Experimental Protocol: Studies and Groups
  70. 70. Experimental Protocol: Studies and Groups. Study 1: groups A and B
  71. 71. Experimental Protocol: Studies and Groups. Study 2: groups C and D, with the faulty statement's rank changed (7➡35 and 83➡16)
  72. 72. Study Results (Study 1: groups A, B; Study 2: groups C, D with changed ranks; subjects: Tetris and NanoXML)
  73. 73. Study Results (groups A, B, C, D; Tetris and NanoXML): not significantly different
  74. 74. Study Results (groups A, B, C, D; Tetris and NanoXML): not significantly different in all four comparisons
  75. 75. Study Results, stratifying participants (groups A, B, C, D; Tetris and NanoXML): significantly different for high performers; the remaining comparisons not significantly different
  76. 76. Study Results, stratifying participants and analysis of questionnaire results (groups A, B, C, D; Tetris and NanoXML): significantly different for high performers; the remaining comparisons not significantly different
  77. 77. Findings: Hypotheses H1: Programmers who use automated debugging tools will locate bugs faster than programmers who do not use such tools Experts are faster when using the tool ➡ Support for H1 (with caveats) H2: Effectiveness of automated tools increases with the level of difficulty of the debugging task The tool did not help harder tasks ➡ No support for H2 H3: Effectiveness of debugging with automated tools is affected by the faulty statement’s rank Changes in rank have no significant effects ➡ No support for H3
  78. 78. Findings: RQs RQ1: How do developers navigate a list of statements ranked by suspiciousness? In order of suspiciousness or jumping between statements? Programmers do not visit each statement in the list, they search RQ2: Does perfect bug understanding exist? How much effort is involved in inspecting and assessing potentially faulty statements? Perfect bug understanding is generally not a realistic assumption RQ3: What are the challenges involved in using automated debugging tools effectively? Can unexpected, emerging strategies be observed? 1) The statements in the list were sometimes useful as starting points 2) (Tetris) Several participants preferred to search based on intuition 3) (NanoXML) Several participants gave up on the tool after investigating too many false positives
  79. 79. Research Implications • Percentages will not cut it (e.g., 1.8% == 83rd position) ➡ Implication 1: Techniques should focus on improving absolute rank rather than percentage rank • Ranking can be successfully combined with search ➡ Implication 2: Future tools may focus on searching through (or automatically highlighting) certain suspicious statements • Developers want explanations, not recommendations ➡ Implication 3: We should move away from pure ranking and define techniques that provide context and ability to explore • We must grow the ecosystem ➡ Implication 4: We should aim to create an ecosystem that provides the entire tool chain for fault localization, including managing and orchestrating test cases
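For scale (using the ~4.5KLOC size given earlier for NanoXML, and assuming roughly that many executable lines): 1.8% of about 4,500 lines is roughly 80 statements, in the same ballpark as the 83rd position cited above, so a rank that sounds tiny as a percentage can still be a long list to inspect by hand.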
  80. 80. In Summary • We came a long way since the early days of debugging ... • There is still a long way to go...
  81. 81. Where Shall We Go Next • Hybrid, semi-automated fault localization techniques • Debugging of field failures (with limited information) • Failure understanding and explanation • (Semi-)automated repair and workarounds • User studies, user studies, user studies! (true also for other areas)
  82. 82. With much appreciated input/contributions from • Andy Ko • Wei Jin • Jim Jones • Wes Masri • Chris Parnin • Abhik Roychoudhury • Wes Weimer • Tao Xie • Andreas Zeller • Xiangyu Zhang
