In this talk we present lessons learned, good ideas, and thoughts on the future, with an eye toward informing junior researchers about the realities and opportunities of a long-running project. We highlight some notions from the original paper that stood the test of time, some that were not as prescient, and some that became more relevant as industrial practice advanced. We place the work in context, highlighting perceptions from software engineering and evolutionary computing, then and now, of how program repair could possibly work. We discuss the importance of measurable benchmarks and reproducible research in bringing scientists together and advancing the area. We give our thoughts on the role of quality requirements and properties in program repair. From testing to metrics to scalability to human factors to technology transfer, software repair touches many aspects of software engineering, and we hope a behind-the-scenes exploration of some of our struggles and successes may benefit researchers pursuing new projects.
5. The Once And Future Problem: Bugs
• 2002 NIST survey: software bugs cost an estimated 0.6% of US GDP.
• 2003 textbooks: bemoaned that up to 90% of a software project's cost was dedicated to maintenance and bug repair.
• 2005 Mozilla developers: complained about 300 bugs appearing per day, "far too many" to handle.
• As a graduate student, Wes had worked on SLAM and BLAST and envied Dawson Engler.
– Only to hear "we already have tens of thousands of unfixed bug reports; don't bother finding more bugs."
6. The Cunning Plan
• Automatically, efficiently repair certain classes of bugs in off-the-shelf, unannotated legacy programs.
• Basic idea: biased, random search through the space of all programs for a variant that repairs the problem.
7. Genetic programming: the application of evolutionary or genetic algorithms to program source code.
10. Original Secret Sauces
• Use existing test cases to evaluate candidate repairs.
• Search by perturbing parts of the program likely to contain the error.
• Existing program code and behavior contain the seeds of many repairs.
– Leverage existing developer expertise rather than inventing new code!
11. Candidate Repairs: Modified Abstract Syntax Trees
• Program statements are manipulated.
– Reduces the search space compared to changing expressions.
• Custom spectrum-based fault localization.
– Called the "weighted path" in the papers; it wasn't very good.
– Reduces the search space compared to changing the whole program.
• Simple mutation (e.g., in the style of introductory students).
– Choose a statement S based on fault-localization weights.
– Delete S, Replace S with S1, or Insert S2 after S.
• Choose S1 and S2 from the entire AST.
• Reduces the search space compared to inventing statements (synthesis).
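The three statement-level mutations above can be sketched in a few lines of Python. This is a minimal illustration, not GenProg's actual OCaml implementation: statements are represented as plain strings, and the weight table is hypothetical.

```python
import random

def mutate(statements, weights):
    """Apply one GenProg-style mutation to a program.

    `statements` is the program as a list of statement nodes (here,
    strings); `weights` maps each statement index to its
    fault-localization weight.  All representations are illustrative.
    """
    # Pick a target statement S, biased toward suspicious locations.
    idxs = list(range(len(statements)))
    target = random.choices(idxs, weights=[weights[i] for i in idxs])[0]
    op = random.choice(["delete", "replace", "insert"])
    variant = list(statements)
    if op == "delete":
        del variant[target]
    elif op == "replace":
        # S1 is drawn from the existing program, not synthesized.
        variant[target] = random.choice(statements)
    else:
        # Insert S2 after S, again reusing existing code.
        variant.insert(target + 1, random.choice(statements))
    return variant
```

Note that every statement in the variant already appeared somewhere in the original program: the "secret sauce" of reusing existing code rather than inventing it.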
13. Search-Based Software Engineering
• This search-based approach admits other (multi-objective) fitness functions.
– Energy reduction, graphical fidelity, readability, execution time, etc.
• In 2009: bugs via pass/fail tests only.
[ Harman and Jones. Search-based software engineering. Information & Software Technology (2001) ]
16. Fast Forward to the 2019 ICSE SEIP Track…
(Presented just two hours ago!)
“Results from repair applied to 6 multi-million line systems.” – Facebook, Inc.
“one widely-studied [repair] approach uses software testing to guide the repair process, as typified by GenProg.”
18. An incomplete list of acknowledgements
• As a group, we saw an emphasis on taking risks and allowing junior researchers to rise to the occasion.
• Mark Harman and collaborators at CREST and SBST
• John Knight, Jack Davidson, Anh Nguyen-Tuong, Eric Schulte, Zak Fry, Ethan Fast, and Michael Dewey-Vogt from UVA/UNM
• Tom Ball and collaborators from Microsoft Research
• Pat Hurley and Sol Greenspan for initial funding support
• … and many more!
19. Another possibility: We were right about everything! All we needed was for Facebook to take an interest and implement it.
22. Candidate Repairs: Modified Abstract Syntax Trees
• Program statements are manipulated.
– Reduces the search space compared to changing expressions.
• Custom spectrum-based fault localization.
– High (1.0) if S is visited on the failing test but not on any passing test.
– Low (0.1 or 0.0) if S is also visited on a passing test.
• Simple mutation.
– Choose a statement S based on fault-localization weights.
– Delete S, Replace S with S1, or Insert S2 after S.
• Choose S1 and S2 from the entire AST.
• Reduces the search space compared to inventing statements (synthesis).
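The "weighted path" scheme above fits in one small function. A sketch under the assumptions of this slide (statement coverage sets for one failing test and the union of passing tests), not the original implementation:

```python
def weighted_path(fail_visited, pass_visited, low=0.1):
    """Assign 'weighted path' suspiciousness to statements.

    fail_visited: set of statement IDs executed by the failing test.
    pass_visited: set of statement IDs executed by any passing test.
    Statements off the failing path get no weight; statements only on
    the failing path get 1.0; shared statements get a small weight
    (0.1 or 0.0, configurable via `low`).
    """
    return {s: (1.0 if s not in pass_visited else low)
            for s in fail_visited}
```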
25. Selection and Maintenance Debt
• If there are multiple high-fitness candidates, how do you pick one?
– Used stochastic universal sampling (roulette-wheel selection).
– Better approaches (e.g., tournament selection) were known.
– SUS was first on the list on Wikipedia (no, really …).
• The first GenProg prototype was made for a Multidisciplinary University Research Initiative grant meeting.
– Fixed a GCD infinite loop and a Nullhttpd buffer overrun.
– Victim of its own success: no refactoring for years.
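For contrast, here is a minimal sketch of both selection schemes mentioned above: roulette-wheel (what the prototype used) versus tournament selection (the better-known alternative). Population and fitness representations are illustrative.

```python
import random

def roulette(pop, fitness):
    """Roulette-wheel selection: pick a candidate with probability
    proportional to its fitness.  Sensitive to fitness scaling."""
    return random.choices(pop, weights=[fitness(p) for p in pop])[0]

def tournament(pop, fitness, k=2):
    """Tournament selection: sample k candidates uniformly at random
    and keep the fittest.  Only the rank order of fitness matters,
    which makes it more robust than roulette-wheel."""
    return max(random.sample(pop, k), key=fitness)
```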
26. Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
27. Another possibility: We were right about everything! All we needed was for Facebook to take an interest and implement it.
30. “If I gave you the last 100 bugs from <my project>, how many could <your technique> fix?”
– Real Engineers
31. Systematic Benchmark Construction
• Approach: use historical data to approximate discovery and repair of bugs in the wild.
• Mine program versions (going back in time on SourceForge, Google Code, Fedora SRPM, etc.) where test case behavior changes.
• Corresponds to a human-written repair for the bug tested by the failing test case(s).
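The mining step above amounts to scanning consecutive version pairs for a test that flips from failing to passing. A sketch, assuming a hypothetical harness helper `failing_tests(v)` that builds version `v` and returns its failing test names:

```python
def mine_defects(versions, failing_tests):
    """Scan consecutive version pairs for tests that flip from
    failing to passing: the earlier version is a candidate buggy
    version and the later one its human-written repair.

    `versions` is a chronologically ordered list of version IDs;
    `failing_tests(v)` is an assumed helper returning the set of
    failing test names for version v.
    """
    scenarios = []
    for old, new in zip(versions, versions[1:]):
        fixed = failing_tests(old) - failing_tests(new)
        if fixed:  # a human-written repair for the tests in `fixed`
            scenarios.append((old, new, sorted(fixed)))
    return scenarios
```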
32. ManyBugs drove algorithmic changes
Program LOC Tests Bugs Description
fbc 97,000 773 3 Language (legacy)
gmp 145,000 146 2 Multiple precision math
gzip 491,000 12 5 Data compression
libtiff 77,000 78 24 Image manipulation
lighttpd 62,000 295 9 Web server
php 1,046,000 8,471 44 Language (web)
python 407,000 355 11 Language (general)
wireshark 2,814,000 63 7 Network packet analyzer
Total 5,139,000 10,193 105
35. Scalability (one of the Three Major Challenges)
• Both search- and synthesis-based repair techniques underwent fundamental reconfigurations to enable scalability.
• A shared dataset of indicative, real-world bugs drove innovation in test-driven program repair.
[ Nguyen et al. SemFix: program repair via semantic analysis. ICSE 2013 ]
[ Mechtaev et al. Angelix: scalable multiline program patch synthesis via symbolic analysis. ICSE 2016 ]
[ Sim et al. Using benchmarking to advance research: A challenge to software engineering. ICSE 2003 ]
36. One aspect of impactful research? Release your code!
• This is still much less common than you’d expect.
– I am not and have not been perfect about this, but I try.
• Releasing code takes time and energy. Why clean up your code and write documentation when it won’t turn into another paper?
• Releasing code and data supports extension and comparison, a cornerstone of the scientific process.
37. Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
41. Tests make for a dangerous fitness function.
• nullhttpd: a webserver with basic GET + POST functionality.
• Version 0.5.0: remote-exploitable heap-based buffer overflow in handling of POST.
• Failing test case: run the exploit, see if the webserver is still running.
• Easy passing test cases:
1. “GET index.html”
2. “GET image.jpg”
3. “GET notfound.html”
4. “POST /cgi-bin/hello.pl”
• Resulting repair: “reuse a content-length check from elsewhere in the code”
42. Tests make for a dangerous fitness function.
• nullhttpd: a webserver with basic GET + POST functionality.
• Version 0.5.0: remote-exploitable heap-based buffer overflow in handling of POST.
• Failing test case: run the exploit, see if the webserver is still running.
• Easy passing test cases:
1. “GET index.html”
2. “GET image.jpg”
3. “GET notfound.html”
4. “POST /cgi-bin/hello.pl”
• Resulting “repair”: “delete handling of POST requests”
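The danger is visible directly in the arithmetic of a test-based fitness function. A sketch in the spirit of these slides; the weights and the `passes` dict (test name → did the variant pass it?) are illustrative, not the exact GenProg constants:

```python
def fitness(passes, pos_tests, neg_tests, w_pos=1, w_neg=10):
    """Test-suite fitness: each initially passing ("positive") test
    the variant still passes earns w_pos; each initially failing
    ("negative") test it now passes earns w_neg.

    `passes` maps a test name to True if the variant passes it.
    """
    score = sum(w_pos for t in pos_tests if passes.get(t, False))
    score += sum(w_neg for t in neg_tests if passes.get(t, False))
    return score
```

A variant that simply deletes POST handling survives the exploit and still serves every GET request; only a positive test that genuinely exercises POST separates it from a real repair, which is why test suite quality matters.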
43. Test suite quality definitely matters.
How much, practically, is a different question…
[Cartoon; labels: Wes, Claire, GRAD SCHOOL, Swords]
44. The journal extension!
• Scenario: long-running servers + IDS + generate repairs for detected anomalies.
• Workloads: a day/week of unfiltered requests to the UVA CS webserver / php application.
[ Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI 2004 ]
45. The journal extension!
• Scenario: long-running servers + IDS + generate repairs for detected anomalies.
• Workloads: a day/week of unfiltered requests to the UVA CS webserver / php application.
• THIS PATCH DELETED CODE
[ Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI 2004 ]
46. The journal extension!
• Even a functionality-reducing repair had little practical impact.
[ Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI 2004 ]
47. Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
48. 2019: “We follow the standard practice of using test cases to evaluate patch correctness.” – many papers
• The majority of the 10-12 repair papers at ICSE 2019 use tests.
• Research efforts to understand, characterize, measure, and promote patch quality are ongoing.
• Tests are good for many reasons!
– Developers understand them.
– They can be used to check a wide array of properties.
• Tests support a general repair paradigm that can (in principle) integrate into existing QA practice.
49. We were wrong about tests at the time.
• The rise/dominance of continuous integration, a natural extension point for automatic repair, is a fairly recent phenomenon.
• Modern workflows make it easy to include a human in the patch review loop, further reducing the risk of low-quality patches.
• But: the actual use case we proposed in 2009 was unready for prime time.
[ Urli et al. How to design a program repair bot? Insights from the Repairnator project. ICSE (SEIP) 2018 ]
50. Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
• Lesson 4: It is OK for SE research to lead SE practice.
51. Scientific Foundations of Repair
• On the one hand, it is surprising that GenProg has worked well.
• On the other hand, why hasn't it worked better?
• If evolution worked as well in software as it does in biology, we would have seen many more advances.
52. Software = Engineering + Evolution?
• There are already many evident analogues between Darwinian processes and software engineering.
– Successful code is copied and reused (clones, COTS, interfaces vs. inheritance).
– Programmers make small modifications (localization, churn vs. variation).
• This suggests new directions for understanding and improving software and software engineering.
53. Repair Theory: Evolutionary Computation
• Genetic Programming was introduced in 1985.
– Rarely scaled beyond polynomials.
– Steph thought repair using GP sounded fun, but would not work!
• The open question is to bridge the gap between results and techniques in evolutionary biology and software engineering.
• One direction is to understand why it works at all.
– Transition insights from evolutionary computing to software.
[ Arcuri. On the automation of fixing software bugs. ICSE Companion 2008 ]
54. Taking Evolution Seriously
• Potential biological properties of software:
– e.g., mutational robustness vs. environmental robustness.
• Understanding the search space can help us search effectively.
– Many bugs are small. Software is not fragile.
– Other potentially interesting analogues:
• Neutrality and epistasis.
• Fitness distributions.
• Neutral network topology.
• Call to arms: be open to insights from other fields.
55. Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
• Lesson 4: It is OK for research to lead industrial practice.
• Lesson 5: Be open to insights from other fields.
58. 10 years of progress
• Scalability: big leaps in scalability.
• Repair quality: incremental leaps, fundamental research.
• Expressive power: relaxing of concerns in practice.
61. Future work: the ongoing challenges
Expressive power
• More complicated, compositional, multi-part patches.
• New ways to use machine learning over prior human edits to inform patch construction.
• Integration into existing QA processes.
Repair quality
• Better understanding of partial correctness, hostile fitness landscapes, intermediate quality.
• Are human-informed repairs more acceptable/trustworthy?
• Use of additional, non-test signals available from the development process.
[ Saha et al. Harnessing Evolution for Multi-Hunk Program Repair. ICSE 2019 ]
[ Long. Automatic Patch Generation via Learning from Successful Human Patches. PhD Thesis ]
[ Monperrus. A critical review of "automatic patch generation learned from human-written patches": essay on the problem statement and the evaluation of automatic software repair. ICSE 2014 ]
62. Stepping back: A challenge about the future of SE work
• Automation: Humans have not traditionally been expected to read, understand, or modify generated code.
• Human development effort: Compilers/code generators raise the level of abstraction at which humans can operate.
• Software Bots: Automated synthesis + transformation is integrating into the [complex socio-technical | natural Darwinian] process of SE. How?
63. Lessons from a decade of program repair
• We made a dozen mistakes in algorithmic design.
• Many of those mistakes were discovered/addressed via shared code and indicative benchmarks.
– …that was a lot of work.
• Test cases effectively capture key aspects of acceptability for deployment.
• The success/failure of evolutionary approaches surfaces fundamental properties of software.
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
• Lesson 4: It is OK for research to lead industrial practice.
• Lesson 5: Be open to insights from other fields.