In this talk we present lessons learned, good ideas, and thoughts on the future, with an eye toward informing junior researchers about the realities and opportunities of a long-running project. We highlight some notions from the original paper that stood the test of time, some that were not as prescient, and some that became more relevant as industrial practice advanced. We place the work in context, highlighting perceptions from software engineering and evolutionary computing, then and now, of how program repair could possibly work. We discuss the importance of measurable benchmarks and reproducible research in bringing scientists together and advancing the area. We give our thoughts on the role of quality requirements and properties in program repair. From testing to metrics to scalability to human factors to technology transfer, software repair touches many aspects of software engineering, and we hope a behind-the-scenes exploration of some of our struggles and successes may benefit researchers pursuing new projects.
5. The Once And Future Problem: Bugs
• 2002 NIST survey: software bugs cost an estimated 0.6% of US GDP.
• 2003 textbooks: bemoaned that up to 90% of a software project's cost was dedicated to maintenance and bug repair.
• 2005 Mozilla developers: complained about 300 bugs appearing per day, "far too many" to handle.
• As a graduate student, Wes had worked on SLAM and BLAST and envied Dawson Engler.
– Only to hear "we already have tens of thousands of unfixed bug reports; don't bother finding more bugs."
6. The Cunning Plan
• Automatically, efficiently repair certain classes of bugs in off-the-shelf, unannotated legacy programs.
• Basic idea: biased, random search through the space of all programs for a variant that repairs the problem.
7. Genetic programming: the application of evolutionary or genetic algorithms to program source code.
10. Original Secret Sauces
• Use existing test cases to evaluate candidate repairs.
• Search by perturbing parts of the program likely to contain the error.
• Existing program code and behavior contain the seeds of many repairs.
– Leverage existing developer expertise rather than inventing new code!
11. Candidate Repairs: Modified Abstract Syntax Trees
• Program statements are manipulated.
– Reduces the search space compared to changing expressions.
• Custom spectrum-based fault localization.
– Called the "weighted path" in the papers; it wasn't very good.
– Reduces the search space compared to changing the whole program.
• Simple mutation (e.g., in the style of introductory students).
– Choose a statement S based on fault-localization weights.
– Delete S, Replace S with S1, or Insert S2 after S.
• Choose S1 and S2 from the entire AST.
• Reduces the search space compared to inventing statements (synthesis).
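The three statement-level mutations above can be sketched in a few lines of Python. This is a minimal illustration, not GenProg's actual OCaml implementation: statements are represented as plain strings, and the weight table is hypothetical.

```python
import random

def mutate(statements, weights):
    """Apply one GenProg-style mutation to a program.

    `statements` is the program as a list of statement nodes (here,
    strings); `weights` maps each statement index to its
    fault-localization weight.  All representations are illustrative.
    """
    # Pick a target statement S, biased toward suspicious locations.
    idxs = list(range(len(statements)))
    target = random.choices(idxs, weights=[weights[i] for i in idxs])[0]
    op = random.choice(["delete", "replace", "insert"])
    variant = list(statements)
    if op == "delete":
        del variant[target]
    elif op == "replace":
        # S1 is drawn from the existing program, not synthesized.
        variant[target] = random.choice(statements)
    else:
        # Insert S2 after S, again reusing existing code.
        variant.insert(target + 1, random.choice(statements))
    return variant
```

Note that every statement in the variant already appeared somewhere in the original program: the "secret sauce" of reusing existing code rather than inventing it.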
13. Search-Based Software Engineering
• This search-based approach admits other (multi-objective) fitness functions.
– Energy reduction, graphical fidelity, readability, execution time, etc.
• In 2009: bugs via pass/fail tests only.
[ Harman and Jones. Search-based software engineering. Information & Software Technology (2001) ]
16. Fast Forward to the 2019 ICSE SEIP Track…
(Presented just two hours ago!)
“Results from repair applied to 6 multi-million line systems.” – Facebook, Inc.
“one widely-studied [repair] approach uses software testing to guide the repair process, as typified by GenProg.”
18. An incomplete list of acknowledgements
• As a group, we saw an emphasis on taking risks and allowing junior researchers to rise to the occasion.
• Mark Harman and collaborators at CREST and SBST
• John Knight, Jack Davidson, Anh Nguyen-Tuong, Eric Schulte, Zak Fry, Ethan Fast, and Michael Dewey-Vogt from UVA/UNM
• Tom Ball and collaborators from Microsoft Research
• Pat Hurley and Sol Greenspan for initial funding support
• … and many more!
19. Another possibility: We were right about everything! All we needed was for Facebook to take an interest and implement it.
22. Candidate Repairs: Modified Abstract Syntax Trees
• Program statements are manipulated.
– Reduces the search space compared to changing expressions.
• Custom spectrum-based fault localization.
– High (1.0) if S is visited on the failing test but not on any passing test.
– Low (0.1 or 0.0) if S is also visited on a passing test.
• Simple mutation.
– Choose a statement S based on fault-localization weights.
– Delete S, Replace S with S1, or Insert S2 after S.
• Choose S1 and S2 from the entire AST.
• Reduces the search space compared to inventing statements (synthesis).
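The "weighted path" scheme above fits in one small function. A sketch under the assumptions of this slide (statement coverage sets for one failing test and the union of passing tests), not the original implementation:

```python
def weighted_path(fail_visited, pass_visited, low=0.1):
    """Assign 'weighted path' suspiciousness to statements.

    fail_visited: set of statement IDs executed by the failing test.
    pass_visited: set of statement IDs executed by any passing test.
    Statements off the failing path get no weight; statements only on
    the failing path get 1.0; shared statements get a small weight
    (0.1 or 0.0, configurable via `low`).
    """
    return {s: (1.0 if s not in pass_visited else low)
            for s in fail_visited}
```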
25. Selection and Maintenance Debt
• If there are multiple high-fitness candidates, how do you pick one?
– Used stochastic universal sampling (roulette-wheel selection).
– Better approaches (e.g., tournament selection) were known.
– SUS was first on the list on Wikipedia (no, really …).
• The first GenProg prototype was made for a Multidisciplinary University Research Initiative grant meeting.
– Fixed a GCD infinite loop and a Nullhttpd buffer overrun.
– Victim of its own success: no refactoring for years.
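For contrast, here is a minimal sketch of both selection schemes mentioned above: roulette-wheel (what the prototype used) versus tournament selection (the better-known alternative). Population and fitness representations are illustrative.

```python
import random

def roulette(pop, fitness):
    """Roulette-wheel selection: pick a candidate with probability
    proportional to its fitness.  Sensitive to fitness scaling."""
    return random.choices(pop, weights=[fitness(p) for p in pop])[0]

def tournament(pop, fitness, k=2):
    """Tournament selection: sample k candidates uniformly at random
    and keep the fittest.  Only the rank order of fitness matters,
    which makes it more robust than roulette-wheel."""
    return max(random.sample(pop, k), key=fitness)
```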
26. Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
27. Another possibility: We were right about everything! All we needed was for Facebook to take an interest and implement it.
30. “If I gave you the last 100 bugs from <my project>, how many could <your technique> fix?”
– Real Engineers
31. Systematic Benchmark Construction
• Approach: use historical data to approximate discovery and repair of bugs in the wild.
• Mine program versions (going back in time on SourceForge, Google Code, Fedora SRPM, etc.) where test case behavior changes.
• Corresponds to a human-written repair for the bug tested by the failing test case(s).
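The mining step above amounts to scanning consecutive version pairs for a test that flips from failing to passing. A sketch, assuming a hypothetical harness helper `failing_tests(v)` that builds version `v` and returns its failing test names:

```python
def mine_defects(versions, failing_tests):
    """Scan consecutive version pairs for tests that flip from
    failing to passing: the earlier version is a candidate buggy
    version and the later one its human-written repair.

    `versions` is a chronologically ordered list of version IDs;
    `failing_tests(v)` is an assumed helper returning the set of
    failing test names for version v.
    """
    scenarios = []
    for old, new in zip(versions, versions[1:]):
        fixed = failing_tests(old) - failing_tests(new)
        if fixed:  # a human-written repair for the tests in `fixed`
            scenarios.append((old, new, sorted(fixed)))
    return scenarios
```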
32. ManyBugs drove algorithmic changes
Program LOC Tests Bugs Description
fbc 97,000 773 3 Language (legacy)
gmp 145,000 146 2 Multiple precision math
gzip 491,000 12 5 Data compression
libtiff 77,000 78 24 Image manipulation
lighttpd 62,000 295 9 Web server
php 1,046,000 8,471 44 Language (web)
python 407,000 355 11 Language (general)
wireshark 2,814,000 63 7 Network packet analyzer
Total 5,139,000 10,193 105
35. Scalability (one of the Three Major Challenges)
• Both search- and synthesis-based repair techniques underwent fundamental reconfigurations to enable scalability.
• A shared dataset of indicative, real-world bugs drove innovation in test-driven program repair.
[ Nguyen et al. SemFix: program repair via semantic analysis. ICSE 2013 ]
[ Mechtaev et al. Angelix: scalable multiline program patch synthesis via symbolic analysis. ICSE 2016 ]
[ Sim et al. Using benchmarking to advance research: A challenge to software engineering. ICSE 2003 ]
36. One aspect of impactful research? Release your code!
• This is still much less common than you’d expect.
– I am not and have not been perfect about this, but I try.
• Releasing code takes time and energy. Why clean up your code and write documentation when it won’t turn into another paper?
• Releasing code and data supports extension and comparison, a cornerstone of the scientific process.
37. Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
41. Tests make for a dangerous fitness function.
• nullhttpd: a webserver with basic GET + POST functionality.
• Version 0.5.0: remote-exploitable heap-based buffer overflow in handling of POST.
• Failing test case: run the exploit, see if the webserver is still running.
• Easy passing test cases:
1. “GET index.html”
2. “GET image.jpg”
3. “GET notfound.html”
4. “POST /cgi-bin/hello.pl”
• Resulting repair: “reuse a content-length check from elsewhere in the code”
42. Tests make for a dangerous fitness function.
• nullhttpd: a webserver with basic GET + POST functionality.
• Version 0.5.0: remote-exploitable heap-based buffer overflow in handling of POST.
• Failing test case: run the exploit, see if the webserver is still running.
• Easy passing test cases:
1. “GET index.html”
2. “GET image.jpg”
3. “GET notfound.html”
4. “POST /cgi-bin/hello.pl”
• Resulting “repair”: “delete handling of POST requests”
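The danger is visible directly in the arithmetic of a test-based fitness function. A sketch in the spirit of these slides; the weights and the `passes` dict (test name → did the variant pass it?) are illustrative, not the exact GenProg constants:

```python
def fitness(passes, pos_tests, neg_tests, w_pos=1, w_neg=10):
    """Test-suite fitness: each initially passing ("positive") test
    the variant still passes earns w_pos; each initially failing
    ("negative") test it now passes earns w_neg.

    `passes` maps a test name to True if the variant passes it.
    """
    score = sum(w_pos for t in pos_tests if passes.get(t, False))
    score += sum(w_neg for t in neg_tests if passes.get(t, False))
    return score
```

A variant that simply deletes POST handling survives the exploit and still serves every GET request; only a positive test that genuinely exercises POST separates it from a real repair, which is why test suite quality matters.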
43. Test suite quality definitely matters.
How much, practically, is a different question…
[Cartoon; labels: Wes, Claire, GRAD SCHOOL, Swords]
44. The journal extension!
• Scenario: long-running servers + IDS + generate repairs for detected anomalies.
• Workloads: a day/week of unfiltered requests to the UVA CS webserver / php application.
[ Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI 2004 ]
45. The journal extension!
• Scenario: long-running servers + IDS + generate repairs for detected anomalies.
• Workloads: a day/week of unfiltered requests to the UVA CS webserver / php application.
• THIS PATCH DELETED CODE
[ Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI 2004 ]
46. The journal extension!
• Even a functionality-reducing repair had little practical impact.
[ Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI 2004 ]
47. Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
48. 2019: “We follow the standard practice of using test cases to evaluate patch correctness.” – many papers
• The majority of the 10-12 repair papers at ICSE 2019 use tests.
• Research efforts to understand, characterize, measure, and promote patch quality are ongoing.
• Tests are good for many reasons!
– Developers understand them.
– They can be used to check a wide array of properties.
• Tests support a general repair paradigm that can (in principle) integrate into existing QA practice.
49. We were wrong about tests at the time.
• The rise/dominance of continuous integration, a natural extension point for automatic repair, is a fairly recent phenomenon.
• Modern workflows make it easy to include a human in the patch review loop, further reducing the risk of low-quality patches.
• But: the actual use case we proposed in 2009 was unready for prime time.
[ Urli et al. How to design a program repair bot? Insights from the Repairnator project. ICSE (SEIP) 2018 ]
50. Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
• Lesson 4: It is OK for SE research to lead SE practice.
51. Scientific Foundations of Repair
• On the one hand, it is surprising that GenProg has worked well.
• On the other hand, why hasn't it worked better?
• If evolution worked as well in software as it does in biology, we would have seen many more advances.
52. Software = Engineering + Evolution?
• There are already many evident analogues between Darwinian processes and software engineering.
– Successful code is copied and reused (clones, COTS, interfaces vs. inheritance).
– Programmers make small modifications (localization, churn vs. variation).
• This suggests new directions for understanding and improving software and software engineering.
53. Repair Theory: Evolutionary Computation
• Genetic Programming was introduced in 1985.
– Rarely scaled beyond polynomials.
– Steph thought repair using GP sounded fun, but would not work!
• The open question is to bridge the gap between results and techniques in evolutionary biology and software engineering.
• One direction is to understand why it works at all.
– Transition insights from evolutionary computing to software.
[ Arcuri. On the automation of fixing software bugs. ICSE Companion 2008 ]
54. Taking Evolution Seriously
• Potential biological properties of software:
– e.g., mutational robustness vs. environmental robustness.
• Understanding the search space can help us search effectively.
– Many bugs are small. Software is not fragile.
– Other potentially interesting analogues:
• Neutrality and epistasis.
• Fitness distributions.
• Neutral network topology.
• Call to arms: be open to insights from other fields.
55. Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
• Lesson 4: It is OK for research to lead industrial practice.
• Lesson 5: Be open to insights from other fields.
58. 10 years of progress
• Scalability: big leaps in scalability.
• Repair quality: incremental leaps, fundamental research.
• Expressive power: relaxing of concerns in practice.
61. Future work: the ongoing challenges
Expressive power
• More complicated, compositional, multi-part patches.
• New ways to use machine learning over prior human edits to inform patch construction.
• Integration into existing QA processes.
Repair quality
• Better understanding of partial correctness, hostile fitness landscapes, intermediate quality.
• Are human-informed repairs more acceptable/trustworthy?
• Use of additional, non-test signals available from the development process.
[ Saha et al. Harnessing Evolution for Multi-Hunk Program Repair. ICSE 2019 ]
[ Long. Automatic Patch Generation via Learning from Successful Human Patches. PhD Thesis ]
[ Monperrus. A critical review of "automatic patch generation learned from human-written patches": essay on the problem statement and the evaluation of automatic software repair. ICSE 2014 ]
62. Stepping back: A challenge about the future of SE work
• Automation: Humans have not traditionally been expected to read, understand, or modify generated code.
• Human development effort: Compilers/code generators raise the level of abstraction at which humans can operate.
• Software Bots: Automated synthesis + transformation is integrating into the [complex socio-technical | natural Darwinian] process of SE. How?
63. Lessons from a decade of program repair
• We made a dozen mistakes in algorithmic design.
• Many of those mistakes were discovered/addressed via shared code and indicative benchmarks.
– …that was a lot of work.
• Test cases effectively capture key aspects of acceptability for deployment.
• The success/failure of evolutionary approaches surfaces fundamental properties of software.
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
• Lesson 4: It is OK for research to lead industrial practice.
• Lesson 5: Be open to insights from other fields.