
It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair


In this talk we present lessons learned, good ideas, and thoughts on the future, with an eye toward informing junior researchers about the realities and opportunities of a long-running project. We highlight some notions from the original paper that stood the test of time, some that were not as prescient, and some that became more relevant as industrial practice advanced. We place the work in context, highlighting perceptions from software engineering and evolutionary computing, then and now, of how program repair could possibly work. We discuss the importance of measurable benchmarks and reproducible research in bringing scientists together and advancing the area. We give our thoughts on the role of quality requirements and properties in program repair. From testing to metrics to scalability to human factors to technology transfer, software repair touches many aspects of software engineering, and we hope a behind-the-scenes exploration of some of our struggles and successes may benefit researchers pursuing new projects.



  1. It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair. Westley Weimer, ThanhVu Nguyen, Claire Le Goues, Stephanie Forrest
  2. Survivor Bias
  3. Flashback to ICSE 2009
  4. (…And then extended for TSE)
  5. The Once And Future Problem: Bugs • 2002 NIST survey: software bugs estimated to cost 0.6% of US GDP. • 2003 textbooks: bemoaned that up to 90% of a software project's cost was dedicated to maintenance and bug repair. • 2005 Mozilla developers: complained about 300 bugs appearing per day: "far too many" to handle. • As a graduate student, Wes had worked on SLAM and BLAST and envied Dawson Engler. – Only to hear "we already have tens of thousands of un-fixed bug reports; don't bother finding more bugs".
  6. The Cunning Plan • Automatically, efficiently repair certain classes of bugs in off-the-shelf, unannotated legacy programs. • Basic idea: biased, random search through the space of all programs for a variant that repairs the problem. (Image: https://upload.wikimedia.org/wikipedia/commons/a/a4/13-02-27-spielbank-wiesbaden-by-RalfR-093.jpg)
  7. Genetic programming: the application of evolutionary or genetic algorithms to program source code.
  8. [Diagram: input program → mutate → evaluate fitness → accept or discard → output repair.]
  9. [The same repair-loop diagram, repeated.]
  10. Original Secret Sauces • Use existing test cases to evaluate candidate repairs. • Search by perturbing parts of the program likely to contain the error. • Existing program code and behavior contains the seeds of many repairs. – Leverage existing developer expertise rather than inventing new code!
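The mutate/evaluate/accept loop from the preceding slides can be sketched in a few lines. This is a minimal single-variant illustration, not GenProg's actual implementation: `mutate` and `run_tests` are assumed callbacks supplied by the caller, and fitness is simply the fraction of tests passed.

```python
import random

def repair(program, mutate, run_tests, max_iters=1000, seed=0):
    """Sketch of test-driven repair: hill-climb on the fraction of
    tests passed, accepting equal-or-better variants, until every
    test passes or the mutation budget runs out."""
    rng = random.Random(seed)
    best, best_fit = program, run_tests(program)
    for _ in range(max_iters):
        candidate = mutate(best, rng)   # MUTATE: perturb the current variant
        fit = run_tests(candidate)      # EVALUATE FITNESS: fraction of tests passed
        if fit >= best_fit:             # ACCEPT equal-or-better variants
            best, best_fit = candidate, fit
        if best_fit == 1.0:             # all tests pass: candidate repair found
            return best
    return None                         # DISCARD: no repair within the budget
```

GenProg proper evolves a population of variants; a single-variant climber keeps the sketch short while preserving the secret sauce above, that the test suite is the fitness function.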
  11. Candidate Repairs: Modified Abstract Syntax Trees • Program statements are manipulated. – Reduces search space compared to changing expressions. • Custom spectrum fault localization. – Called the "weighted path" in the papers: it wasn't very good. – Reduces search space compared to changing the whole program. • Simple mutation (e.g., in the style of introductory students). – Choose a statement based on fault localization weights. – Delete S, Replace S with S1, or Insert S2 after S. • Choose S1 and S2 from the entire AST. • Reduces search space compared to inventing statements (synthesis).
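The three mutation operators can be illustrated over a program represented as a flat list of statements (a stand-in for the statement-level AST; the function name and weight scheme here are illustrative, not GenProg's API):

```python
import random

def mutate(stmts, weights, rng):
    """One GenProg-style mutation: pick a target statement S with
    probability proportional to its fault-localization weight, then
    Delete S, Replace S with S1, or Insert S2 after S, where the
    replacement material is drawn from the existing program."""
    i = rng.choices(range(len(stmts)), weights=weights)[0]
    donor = rng.choice(stmts)            # S1/S2 come from the program itself
    op = rng.choice(["delete", "replace", "insert"])
    out = list(stmts)
    if op == "delete":
        del out[i]                       # Delete S
    elif op == "replace":
        out[i] = donor                   # Replace S with S1
    else:
        out.insert(i + 1, donor)         # Insert S2 after S
    return out
```

Drawing donor statements from the program's own AST, rather than synthesizing new code, is what keeps the search space tractable.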
  12. [Repair-loop diagram.] Minimization using delta debugging to "mitigate the risk of breaking untested functionality."
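The minimization step can be sketched as a greedy variant of delta debugging: drop each edit in the candidate patch and keep the smaller patch whenever the tests still pass. (`still_repairs` is an assumed callback; real ddmin partitions the edit list to reduce the number of test runs, but yields the same kind of 1-minimal result.)

```python
def minimize(edits, still_repairs):
    """Greedy patch minimization in the spirit of delta debugging:
    repeatedly drop any edit whose removal leaves a patch that
    still passes all tests."""
    edits = list(edits)
    changed = True
    while changed:
        changed = False
        for e in list(edits):
            trial = [x for x in edits if x != e]
            if still_repairs(trial):     # tests still pass without this edit
                edits = trial
                changed = True
    return edits
```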
  13. Search-Based Software Engineering • This search-based approach admits other (multi-objective) fitness functions. – Energy reduction, graphical fidelity, readability, execution time, etc. • In 2009: bugs via pass/fail tests only. [Harman and Jones. Search-based software engineering. Information & Software Technology (2001)]
  14. The original! Inspired by fuzzing.
  15. The original! Note the overachievement: we were aiming for a minimum of 50k LOC.
  16. Fast Forward to the 2019 ICSE SEIP Track… (Presented just two hours ago!) "Results from repair applied to 6 multi-million line systems." "Facebook, Inc" "one widely-studied [repair] approach uses software testing to guide the repair process, as typified by GenProg."
  17. How did we get here?
  18. An incomplete list of acknowledgements • As a group, we saw an emphasis on taking risks and allowing junior researchers to rise to the occasion. • Mark Harman and collaborators at CREST and SBST • John Knight, Jack Davidson, Anh Nguyen-Tuong, Eric Schulte, Zak Fry, Ethan Fast, and Michael Dewey-Vogt from UVA/UNM • Tom Ball and collaborators from Microsoft Research • Pat Hurley and Sol Greenspan for initial funding support • … and many more!
  19. Another possibility: We were right about everything! All we needed was for Facebook to take an interest and implement it.
  20. … Not quite.
  21. [Repair-loop diagram.] Storing full programs does not scale. Storing patches instead would help make genetic improvement feasible.
  22. Candidate Repairs: Modified Abstract Syntax Trees • Program statements are manipulated. – Reduces search space compared to changing expressions. • Custom spectrum fault localization. – High (1.0) if S is not visited on a passed test. – Low (0.1, 0.0) if S is also visited on a passed test. • Simple mutation. – Choose a statement based on fault localization weights. – Delete S, Replace S with S1, or Insert S2 after S. • Choose S1 and S2 from the entire AST. • Reduces search space compared to inventing statements (synthesis).
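The weighting scheme on this slide is simple enough to state directly. A hypothetical sketch, with statements identified by arbitrary labels:

```python
def localize(fail_visited, pass_visited):
    """Weighted-path fault localization as described above: statements
    visited only by the failing test get weight 1.0, statements also
    visited by a passing test get 0.1, and everything else gets 0.0
    (omitted from the map, so it is never chosen for mutation)."""
    return {s: (0.1 if s in pass_visited else 1.0) for s in fail_visited}
```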
  23. [Repair-loop diagram.] Arbitrary weightings for the failing test did not guide us well.
  24. [Repair-loop diagram.] Minimization does not affect semantic patch quality. ML experts have known since the 1970s that model size can be independent of degree of overfitting.
  25. Selection and Maintenance Debt • If there are multiple high-fitness candidates, how do you pick one? – Used stochastic universal sampling (roulette wheel selection). – Better approaches (e.g., tournament selection) were known. – SUS was first on the list on Wikipedia (no, really…). • First GenProg prototype was made for a Multi University Research Initiatives grant meeting. – Fixed GCD infinite loop and Nullhttpd buffer overrun. – Victim of its own success: no refactoring for years.
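For contrast, minimal sketches of the two selection schemes named on this slide (the signatures are illustrative, not the GenProg code):

```python
import random

def sus(population, fitness, n, rng):
    """Stochastic universal sampling: spin n evenly spaced pointers
    around the cumulative-fitness wheel in a single pass (the scheme
    the first GenProg prototype used)."""
    fits = [fitness(p) for p in population]
    step = sum(fits) / n
    start = rng.uniform(0, step)
    picks, cum, i = [], fits[0], 0
    for k in range(n):
        pointer = start + k * step
        while cum < pointer:             # advance to the slot holding this pointer
            i += 1
            cum += fits[i]
        picks.append(population[i])
    return picks

def tournament(population, fitness, n, rng, k=2):
    """Tournament selection: draw k candidates at random, keep the fittest."""
    return [max(rng.sample(population, k), key=fitness) for _ in range(n)]
```

Tournament selection needs no global fitness normalization and tolerates noisy fitness values, part of why it is usually preferred in practice.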
  26. Lessons from a decade of program repair • Lesson 1: Don’t let the perfect be the enemy of the good.
  27. Another possibility: We were right about everything! All we needed was for Facebook to take an interest and implement it.
  28. The Three Major Challenges: Scalability, Repair quality, Expressive power
  29. The original!
  30. “If I gave you the last 100 bugs from <my project>, how many could <your technique> fix?” – Real Engineers
  31. Systematic Benchmark Construction • Approach: use historical data to approximate discovery and repair of bugs in the wild. • Mine program versions (going back in time on SourceForge, Google Code, Fedora SRPM, etc.) where test case behavior changes. • Corresponds to a human-written repair for the bug tested by the failing test case(s).
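The mining step above can be sketched as scanning consecutive historical versions for a test that flips from failing to passing. `versions` (oldest first) and `run_tests` (revision to per-test results) are illustrative stand-ins for the actual repository and test harness:

```python
def find_defect_pairs(versions, run_tests):
    """Sketch of ManyBugs-style mining: keep (buggy, fixed) version
    pairs where at least one test flips from failing to passing,
    evidence of a human-written repair for that bug."""
    pairs = []
    for buggy, fixed in zip(versions, versions[1:]):
        before, after = run_tests(buggy), run_tests(fixed)
        flipped = [t for t in before if not before[t] and after.get(t, False)]
        if flipped:
            pairs.append((buggy, fixed, flipped))
    return pairs
```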
  32. ManyBugs drove algorithmic changes:
      Program    LOC        Tests   Bugs  Description
      fbc        97,000     773     3     Language (legacy)
      gmp        145,000    146     2     Multiple precision math
      gzip       491,000    12      5     Data compression
      libtiff    77,000     78      24    Image manipulation
      lighttpd   62,000     295     9     Web server
      php        1,046,000  8,471   44    Language (web)
      python     407,000    355     11    Language (general)
      wireshark  2,814,000  63      7     Network packet analyzer
      Total      5,139,000  10,193  105
  35. The Three Major Challenges: Scalability • Both search- and synthesis-based repair techniques underwent fundamental reconfigurations to enable scalability. • A shared dataset of indicative, real-world bugs drove innovation in test-driven program repair. [Nguyen et al. SemFix: program repair via semantic analysis. ICSE 2013.] [Mechtaev et al. Angelix: scalable multiline program patch synthesis via symbolic analysis. ICSE 2016.] [Sim et al. Using benchmarking to advance research: A challenge to software engineering. ICSE 2003.]
  36. One aspect of impactful research? Release your code! • This is still much less common than you’d expect. – I am not and have not been perfect about this, but I try. • Releasing code takes time and energy. Why clean up your code and write documentation when it won’t turn into another paper? • Releasing code and data supports extension and comparison, a cornerstone of the scientific process.
  37. Lessons from a decade of program repair • Lesson 1: Don’t let the perfect be the enemy of the good. • Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation. • Lesson 3: Impact arises from more than just a single paper.
  38. The Three Major Challenges: Scalability, Repair quality
  39. [Repair-loop diagram.]
  40. Using tests was controversial
  41. Tests make for a dangerous fitness function. nullhttpd: a webserver with basic GET + POST functionality. Version 0.5.0: remote-exploitable heap-based buffer overflow in handling of POST. Failing test case: run exploit, see if webserver is still running. Easy passing test cases: 1. “GET index.html” 2. “GET image.jpg” 3. “GET notfound.html” 4. “POST /cgi-bin/hello.pl”. Repair found: “reuse a content-length check from elsewhere in the code”
  42. Tests make for a dangerous fitness function. (Same nullhttpd setup and test cases.) Repair found: “delete handling of POST requests”
  43. Test suite quality definitely matters. How much, practically, is a different question… [Photo: Wes and Claire in grad school, with swords.]
  44. The journal extension! • Scenario: Long-running servers + IDS + generate repairs for detected anomalies. • Workloads: a day/week of unfiltered requests to the UVA CS webserver / php application. [Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI ’04.]
  45. The journal extension! THIS PATCH DELETED CODE. [Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI ’04.]
  46. The journal extension! Even a functionality-reducing repair had little practical impact. [Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI ’04.]
  47. Lessons from a decade of program repair • Lesson 1: Don’t let the perfect be the enemy of the good. • Lesson 2: Shared metrics, benchmarks, and code can drive research progress and innovation. • Lesson 3: Impact arises from more than just a single paper.
  48. 2019: “We follow the standard practice of using test cases to evaluate patch correctness.” – many papers • Majority of 10-12 repair papers at ICSE 2019 use tests. • Research efforts to understand, characterize, measure, and promote patch quality are ongoing. • Tests are good for many reasons! – Developers understand them. – They can be used to check a wide array of properties. • Tests support a general repair paradigm that can (in principle) integrate into existing QA practice.
  49. We were wrong about tests at the time. [Urli et al. How to design a program repair bot?: insights from the Repairnator project. ICSE (SEIP) 2018] • The rise/dominance of continuous integration, a natural extension point for automatic repair, is a fairly recent phenomenon. • Modern workflows make it easy to include a human in the patch review loop, further reducing the risk of low-quality patches. • But: the actual use case we proposed in 2009 was unready for prime time.
  50. Lessons from a decade of program repair • Lesson 1: Don’t let the perfect be the enemy of the good. • Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation. • Lesson 3: Impact arises from more than just a single paper. • Lesson 4: It is OK for SE research to lead SE practice.
  51. Scientific Foundations of Repair • On the one hand, it is surprising that GenProg has worked well. • On the other hand, why hasn't it worked better? • If evolution worked as well in software as it does in biology, we would have seen many more advances.
  52. Software = Engineering + Evolution? • There are already many evident analogues between Darwinian processes and software engineering. – Successful code is copied and reused (clones, COTS, interfaces vs. inheritance). – Programmers make small modifications (localization, churn vs. variation). • This suggests new directions for understanding and improving software and software engineering.
  53. Repair Theory: Evolutionary Computation • Genetic Programming was introduced in 1985. – Rarely scaled beyond polynomials. – Steph thought repair using GP sounded fun, but would not work! • The open question is to bridge the gap between results and techniques in evolutionary biology and software engineering. • One direction is to understand why it works at all. – Transition insights from evolutionary computing to software. [Arcuri. On the automation of fixing software bugs. ICSE Companion 2008]
  54. Taking Evolution Seriously • Potential biological properties of software: – e.g.: Mutational robustness vs. environmental robustness. • Understanding the search space can help us search effectively. – Many bugs are small. Software is not fragile. – Other potentially interesting analogues: • Neutrality and epistasis. • Fitness distributions. • Neutral network topology. • Call to arms: be open to insights from other fields.
  55. Lessons from a decade of program repair • Lesson 1: Don’t let the perfect be the enemy of the good. • Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation. • Lesson 3: Impact arises from more than just a single paper. • Lesson 4: It is OK for research to lead industrial practice. • Lesson 5: Be open to insights from other fields.
  56. How did we get here?
  57. 10 years of progress: Scalability, Repair quality, Expressive power. Big leaps in scalability. Incremental leaps, fundamental research.
  58. 10 years of progress: Scalability, Repair quality, Expressive power. Big leaps in scalability. Incremental leaps, fundamental research. Relaxing of concerns in practice.
  59. 10 years of progress: Scalability.
  60. Future work: the ongoing challenges: Repair quality, Expressive power.
  61. Future work: the ongoing challenges. Expressive power: • More complicated, compositional, multi-part patches. • New ways to use machine learning over prior human edits to inform patch construction. • Integration into existing QA processes. Repair quality: • Better understanding of partial correctness, hostile fitness landscapes, intermediate quality. • Are human-informed repairs more acceptable/trustworthy? • Use of additional, non-test signals available from the development process. [Saha et al. Harnessing Evolution for Multi-Hunk Program Repair. ICSE 2019] [Long. Automatic Patch Generation via Learning from Successful Human Patches. PhD Thesis] [Monperrus. A critical review of "automatic patch generation learned from human-written patches": essay on the problem statement and the evaluation of automatic software repair. ICSE 2014]
  62. Stepping back: A challenge about the future of SE work. Automation: Humans have not traditionally been expected to read, understand, or modify generated code. Human development effort: Compilers/code generators raise the level of abstraction at which humans can operate. Software Bots: Automated synthesis + transformation is integrating into the [complex socio-technical | natural Darwinian] process of SE. How?
  63. Lessons from a decade of program repair • We made a dozen mistakes in algorithmic design. • Many of those mistakes were discovered/addressed via shared code and indicative benchmarks. • …that was a lot of work. • Test cases effectively capture key aspects of acceptability for deployment. • The success/failure of evolutionary approaches surfaces fundamental properties of software. • Lesson 1: Don’t let the perfect be the enemy of the good. • Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation. • Lesson 3: Impact arises from more than just a single paper. • Lesson 4: It is OK for research to lead industrial practice. • Lesson 5: Be open to insights from other fields.
