It Does What You Say, Not What You Mean:
Lessons From A Decade of Program Repair
Westley Weimer
ThanhVu Nguyen
Claire Le Goues
Stephanie Forrest
2
Survivor Bias
Flashback to ICSE 2009
3
(…And then extended for TSE)
4
The Once And Future Problem: Bugs
• 2002 NIST survey: software bugs estimated to cost 0.6% of US GDP.
• 2003 textbooks: bemoaned that up to 90% of a software project's cost was dedicated to maintenance and bug repair.
• 2005 Mozilla developers: complained about 300 bugs appearing per day, "far too many" to handle.
• As a graduate student, Wes had worked on SLAM and BLAST and envied Dawson Engler.
– Only to hear "we already have tens of thousands of un-fixed bug reports; don't bother finding more bugs".
5
The Cunning Plan
• Automatically, efficiently repair certain classes of bugs in off-the-shelf, unannotated legacy programs.
• Basic idea: biased, random search through the space of all programs for a variant that repairs the problem.
6
https://upload.wikimedia.org/wikipedia/commons/a/a4/13-02-27-spielbank-wiesbaden-by-RalfR-093.jpg
Genetic programming: the application
of evolutionary or genetic algorithms
to program source code.
7
[Diagram: the repair loop, with stages INPUT, EVALUATE FITNESS, DISCARD, ACCEPT, MUTATE, OUTPUT]
8
9
Original Secret Sauces
• Use existing test cases to evaluate candidate repairs.
• Search by perturbing parts of the program likely to contain the error.
• Existing program code and behavior contains the seeds of many repairs.
– Leverage existing developer expertise rather than inventing new code!
10
Candidate Repairs:
Modified Abstract Syntax Trees
• Program statements are manipulated.
– Reduces search space compared to changing expressions.
• Custom spectrum fault localization.
– Called the "weighted path" in the papers: it wasn't very good.
– Reduces search space compared to changing the whole program.
• Simple mutation (e.g., in the style of introductory students).
– Choose a statement based on fault localization weights.
– Delete S, Replace S with S1, or Insert S2 after S.
• Choose S1 and S2 from the entire AST.
• Reduces search space compared to inventing statements (synthesis).
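The statement-level mutation described above can be sketched on a toy program. This is a minimal illustration under stated assumptions, not GenProg's implementation: the program is modeled as a flat list of Python statement strings rather than an AST, and the names `run`, `passes`, `mutate`, `repair`, the toy program, the weights, and the tests are all hypothetical.

```python
import random

def run(stmts, x):
    """Execute a straight-line toy program (a list of statement strings) on input x."""
    env = {"x": x}
    exec("\n".join(stmts), {}, env)
    return env["result"]

def passes(stmts, tests):
    """True if the variant returns the expected value on every (input, expected) test."""
    try:
        return all(run(stmts, x) == want for x, want in tests)
    except Exception:
        return False  # variants that crash are simply rejected

def mutate(stmts, weights, rng):
    """Delete S, replace S with S1, or insert S2 after S.

    S is picked using the fault localization weights; S1/S2 are copied from
    the program itself instead of being invented.
    """
    i = rng.choices(range(len(stmts)), weights=weights)[0]
    op = rng.choice(["delete", "replace", "insert"])
    donor = rng.choice(stmts)
    if op == "delete":
        return stmts[:i] + stmts[i + 1:]
    if op == "replace":
        return stmts[:i] + [donor] + stmts[i + 1:]
    return stmts[:i + 1] + [donor] + stmts[i + 1:]

def repair(stmts, weights, tests, budget=2000, seed=0):
    """Random search over single mutations until a variant passes all tests."""
    rng = random.Random(seed)
    for _ in range(budget):
        candidate = mutate(stmts, weights, rng)
        if passes(candidate, tests):
            return candidate
    return None

# Hypothetical buggy program: "clamp to zero, then double", with a stray increment.
BUGGY = ["y = x", "if y < 0: y = 0", "y = y * 2", "y = y + 1", "result = y"]
TESTS = [(3, 6), (-2, 0), (0, 0)]
WEIGHTS = [0.1, 0.1, 0.1, 1.0, 0.1]  # assumed localization output
fixed = repair(BUGGY, WEIGHTS, TESTS)
```

Any variant returned here is only known to pass the three listed tests; several different edits (deleting the stray statement, or replacing it with a harmless one) satisfy them equally.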
11
12
[Repair-loop diagram, annotated:] Minimization using delta debugging to “mitigate the risk of breaking untested functionality.”
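As a rough sketch of what minimizing a patch with delta debugging can look like, the function below shrinks a list of edits to a sublist from which no single edit can be dropped without failing the tests. It is a simplified ddmin-style loop written for illustration; the edit names and the `still_passes` predicate are hypothetical, and edits are assumed to be distinct.

```python
def ddmin(edits, still_passes):
    """Return a 1-minimal sublist of edits for which still_passes(sublist) holds.

    Assumes still_passes(edits) is True for the full list.
    """
    n = 2  # current number of chunks
    while len(edits) >= 2:
        chunk = max(1, len(edits) // n)
        subsets = [edits[i:i + chunk] for i in range(0, len(edits), chunk)]
        reduced = False
        for s in subsets:
            complement = [e for e in edits if e not in s]
            if still_passes(s):            # one chunk alone is enough
                edits, n, reduced = s, 2, True
                break
            if still_passes(complement):   # this chunk was unnecessary
                edits, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(edits):
                break                      # already at single-edit granularity
            n = min(len(edits), n * 2)     # try finer chunks
    return edits

# Hypothetical patch with six edits, of which only e2 and e5 matter to the tests.
needed = {"e2", "e5"}
minimal = ddmin(["e1", "e2", "e3", "e4", "e5", "e6"],
                lambda subset: needed <= set(subset))
```

In this example `minimal` is `["e2", "e5"]`. The predicate stands in for "the program with only these edits applied still passes every test", so the result is minimal with respect to the test suite only.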
Search-Based Software Engineering
• This search-based approach admits other (multi-objective) fitness functions.
– Energy reduction, graphical fidelity, readability, execution time, etc.
• In 2009: bugs via pass/fail tests only.
[ Harman and Jones. Search-based software engineering. Information & Software Technology (2001) ]
13
The original!
14
Inspired By Fuzzing
The original!
Note the overachievement! We were aiming for a minimum of 50k LOC.
15
Inspired By Fuzzing
Fast Forward to the 2019 ICSE SEIP Track…
(Presented just two hours ago!)
16
“Results from repair applied to 6 multi-million line systems.”
“Facebook, Inc”
“one widely-studied [repair] approach uses software testing to guide the repair process, as typified by GenProg.”
How did we get here?
17
An incomplete list of acknowledgements
• As a group, we saw an emphasis on taking risks and allowing junior researchers to rise to the occasion.
• Mark Harman and collaborators at CREST and SBST
• John Knight, Jack Davidson, Anh Nguyen-Tuong, Eric Schulte, Zak Fry, Ethan Fast, and Michael Dewey-Vogt from UVA/UNM
• Tom Ball and collaborators from Microsoft Research
• Pat Hurley and Sol Greenspan for initial funding support
• … and many more!
18
Another possibility: We were right about everything! All we needed was for Facebook to take an interest and implement it.
19
… Not quite.
20
21
Storing full programs does not scale. Storing patches instead would help make genetic improvement feasible.
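One way to picture the difference: instead of keeping a full copy of each variant, keep the shared original once and give each candidate a short list of edits. The sketch below is illustrative only; the edit encoding `(op, index, donor)` and the function `apply_patch` are assumptions, not the representation used by any particular tool.

```python
def apply_patch(original, patch):
    """Rebuild a variant from the shared original plus a short list of edits.

    Each edit is (op, index, donor). Both index and donor are positions in
    the ORIGINAL statement list, so edits do not shift one another.
    """
    deleted, replaced, inserted = set(), {}, {}
    for op, i, donor in patch:
        if op == "delete":
            deleted.add(i)
        elif op == "replace":
            replaced[i] = original[donor]
        elif op == "insert":  # insert a copy of the donor after statement i
            inserted.setdefault(i, []).append(original[donor])
    variant = []
    for i, stmt in enumerate(original):
        if i not in deleted:
            variant.append(replaced.get(i, stmt))
        variant.extend(inserted.get(i, []))
    return variant

# Hypothetical four-statement program and a two-edit candidate.
ORIGINAL = ["a = read()", "b = a + 1", "check(a)", "write(b)"]
PATCH = [("delete", 1, None), ("insert", 0, 2)]
variant = apply_patch(ORIGINAL, PATCH)
```

A population of such patches costs memory proportional to the number of edits rather than to program size, and the full variant is materialized only when it needs to be compiled and tested.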
Candidate Repairs:
Modified Abstract Syntax Trees
• Program statements are manipulated.
– Reduces search space compared to changing expressions.
• Custom spectrum fault localization.
– High (1.0) if S is not visited on a passed test.
– Low (0.1, 0.0) if S is also visited on a passed test.
• Simple mutation.
– Choose a statement based on fault localization weights.
– Delete S, Replace S with S1, or Insert S2 after S.
• Choose S1 and S2 from the entire AST.
• Reduces search space compared to inventing statements (synthesis).
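The weighting rule on this slide can be written down in a few lines. This is a sketch under assumptions: coverage is given as sets of statement indices, statements never visited by the failing test get weight 0.0, and the "low" weight for shared statements is a parameter defaulting to 0.1. The function name and the example coverage sets are hypothetical.

```python
def weighted_path(n_statements, failing_cov, passing_cov, shared_weight=0.1):
    """Assign each statement a mutation weight from test coverage.

    1.0            visited by the failing test but by no passing test
    shared_weight  visited by the failing test and also by a passing test
    0.0            not visited by the failing test (assumed here)
    """
    weights = []
    for s in range(n_statements):
        if s not in failing_cov:
            weights.append(0.0)
        elif s in passing_cov:
            weights.append(shared_weight)
        else:
            weights.append(1.0)
    return weights

# Hypothetical coverage for a five-statement program.
failing = {0, 1, 3, 4}   # statements visited by the failing test
passing = {0, 1, 2, 4}   # statements visited by at least one passing test
weights = weighted_path(5, failing, passing)
```

Here only statement 3 is unique to the failing run, so it receives weight 1.0 and is by far the most likely mutation site.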
22
23
[Repair-loop diagram, annotated:] Arbitrary weightings for the failing test did not guide us well.
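To make the concern concrete, here is a sketch of a test-based fitness function that weights the originally failing test more heavily than the regression tests. The weights 1.0 and 10.0, the function name, and the three example variants are illustrative assumptions only.

```python
def fitness(passed_regression, passed_bug_tests, w_pos=1.0, w_neg=10.0):
    """Weighted count of passing tests.

    passed_regression: booleans for tests the original program already passed.
    passed_bug_tests:  booleans for the test(s) that expose the bug.
    """
    return w_pos * sum(passed_regression) + w_neg * sum(passed_bug_tests)

# Hypothetical variants, each run against four regression tests and one bug test.
original_score = fitness([True, True, True, True], [False])
broke_a_feature = fitness([True, True, True, False], [True])
real_fix = fitness([True, True, True, True], [True])
```

With these weights a variant that breaks a regression test but happens to pass the bug test scores 13.0, far above the original's 4.0 and just below a real fix at 14.0. The ranking depends entirely on the chosen constants, which is one way an arbitrary weighting can steer the search poorly.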
24
[Repair-loop diagram, annotated:] Minimization does not affect semantic patch quality. ML experts have known since the 70's that model size can be independent of degree of overfitting.
Selection and Maintenance Debt
• If there are multiple high-fitness candidates, how do you pick one?
– Used stochastic universal sampling (roulette wheel selection).
– Better approaches (e.g., tournament selection) were known.
– SUS was first on the list on Wikipedia (no, really …).
• First GenProg prototype was made for a Multidisciplinary University Research Initiative (MURI) grant meeting.
– Fixed GCD infinite loop and Nullhttpd buffer overrun.
– Victim of its own success: no refactoring for years.
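For readers unfamiliar with the two selection schemes named above, here is a minimal sketch of each. These are textbook-style versions written for illustration, not code from GenProg; the function names and the example population are hypothetical, fitness values are assumed non-negative, and `k` is assumed positive.

```python
import random

def sus(population, fitnesses, k, rng):
    """Stochastic universal sampling: k evenly spaced pointers over the fitness wheel."""
    total = sum(fitnesses)
    step = total / k
    start = rng.uniform(0, step)
    chosen, cumulative, i = [], fitnesses[0], 0
    for j in range(k):
        pointer = start + j * step
        # Advance to the individual whose slice of the wheel contains the pointer.
        while cumulative < pointer and i < len(population) - 1:
            i += 1
            cumulative += fitnesses[i]
        chosen.append(population[i])
    return chosen

def tournament(population, fitnesses, k, rng, size=2):
    """Tournament selection: k times, sample `size` contenders and keep the fittest."""
    chosen = []
    for _ in range(k):
        contenders = rng.sample(range(len(population)), size)
        best = max(contenders, key=lambda c: fitnesses[c])
        chosen.append(population[best])
    return chosen

# Hypothetical population of three variants with fitness 1, 1 and 8.
POP, FIT = ["a", "b", "c"], [1, 1, 8]
by_sus = sus(POP, FIT, 10, random.Random(0))
by_tournament = tournament(POP, FIT, 10, random.Random(0))
```

SUS hands out slots in proportion to raw fitness, so its behavior depends on how fitness values are scaled. Tournament selection depends only on the ordering of fitness values, which makes it less sensitive to arbitrary weightings.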
25
Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
26
Another possibility: We were right about everything! All we needed was for Facebook to take an interest and implement it.
27
The Three Major Challenges
28
Scalability, Repair quality, Expressive power
The original!
29
“If I gave you the last 100 bugs from <my project>, how many could <your technique> fix?”
– Real Engineers
30
Systematic Benchmark Construction
31
• Approach: use historical data to approximate discovery and repair of bugs in the wild.
• Mine program versions (going back in time on SourceForge, Google Code, Fedora SRPM, etc.) where test case behavior changes.
• Corresponds to a human-written repair for the bug tested by the failing test case(s).
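The mining step above can be sketched as a scan over adjacent versions, looking for tests that flip from failing to passing. This is an illustration under assumptions: test results are already available per version as a mapping from test name to pass/fail, and the function name, version identifiers, and test names are all hypothetical.

```python
def find_bug_scenarios(history):
    """Find adjacent version pairs where some test goes from failing to passing.

    history: list of (version_id, {test_name: passed}) in chronological order.
    Returns a list of (buggy_version, fixed_version, flipped_tests).
    """
    scenarios = []
    for (old_id, old), (new_id, new) in zip(history, history[1:]):
        # A test counts only if it existed and failed in the older version.
        flipped = sorted(t for t in new if new[t] and old.get(t) is False)
        if flipped:
            scenarios.append((old_id, new_id, flipped))
    return scenarios

# Hypothetical revision history with per-version test outcomes.
HISTORY = [
    ("r100", {"t_parse": True, "t_post": False}),
    ("r101", {"t_parse": True, "t_post": True}),
    ("r102", {"t_parse": True, "t_post": True, "t_new": True}),
]
scenarios = find_bug_scenarios(HISTORY)
```

Each returned pair gives a buggy version, the human-repaired version that follows it, and the tests that witness the fix. A test that first appears in the newer version (like `t_new`) is not counted in this sketch.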
ManyBugs drove algorithmic changes
Program      LOC          Tests     Bugs   Description
fbc          97,000       773       3      Language (legacy)
gmp          145,000      146       2      Multiple precision math
gzip         491,000      12        5      Data compression
libtiff      77,000       78        24     Image manipulation
lighttpd     62,000       295       9      Web server
php          1,046,000    8,471     44     Language (web)
python       407,000      355       11     Language (general)
wireshark    2,814,000    63        7      Network packet analyzer
Total        5,139,000    10,193    105
32
33
34
35
Scalability
• Both search- and synthesis-based repair techniques underwent fundamental reconfigurations to enable scalability.
• A shared dataset of indicative, real-world bugs drove innovation in test-driven program repair.
[ Nguyen et al. SemFix: program repair via semantic analysis. ICSE 2013. ]
[ Mechtaev et al. Angelix: scalable multiline program patch synthesis via symbolic analysis. ICSE 2016. ]
[ Sim et al. Using benchmarking to advance research: A challenge to software engineering. ICSE 2003. ]
The Three Major Challenges
One aspect of impactful research?
Release your code!
• This is still much less common than you’d expect.
– I am not and have not been perfect about this, but I try.
• Releasing code takes time and energy. Why clean up your code and write documentation when it won’t turn into another paper?
• Releasing code and data supports extension and comparison, a cornerstone of the scientific process.
36
Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research
progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
37
38
Scalability, Repair quality
The Three Major Challenges
39
[Repair-loop diagram, annotated:] Using tests was controversial
40
Tests make for a dangerous fitness function.
41
• nullhttpd: a webserver with basic GET + POST functionality.
• Version 0.5.0: remote-exploitable heap-based buffer overflow in handling of POST.
Failing test case: run exploit, see if webserver is still running
Easy passing test cases:
1. “GET index.html”
2. “GET image.jpg”
3. “GET notfound.html”
4. “POST /cgi-bin/hello.pl”
Buggy program + test cases = “reuse a content-length check from elsewhere in the code”
Tests make for a dangerous fitness function.
42
• nullhttpd: a webserver with basic GET + POST functionality.
• Version 0.5.0: remote-exploitable heap-based buffer overflow in handling of POST.
Failing test case: run exploit, see if webserver is still running
Easy passing test cases:
1. “GET index.html”
2. “GET image.jpg”
3. “GET notfound.html”
4. “POST /cgi-bin/hello.pl”
Buggy program + test cases = “delete handling of POST requests”
Test suite quality definitely matters.
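The point can be reproduced with a toy model of the scenario. The sketch below is not nullhttpd code: the server is reduced to a function returning a status code, the overflow is simulated with an exception, and the 1024-byte limit, the status codes, and every function name are illustrative assumptions.

```python
def serve_get(path):
    """Shared GET handling for all toy variants."""
    return 404 if path == "/notfound.html" else 200

def original(method, path, body=b""):
    if method == "POST":
        if len(body) > 1024:
            raise MemoryError("simulated heap overflow")
        return 200
    return serve_get(path)

def patch_length_check(method, path, body=b""):
    if method == "POST":
        if len(body) > 1024:
            return 400  # reject oversized bodies instead of overflowing
        return 200
    return serve_get(path)

def patch_delete_post(method, path, body=b""):
    if method != "GET":
        return 501      # POST handling removed entirely
    return serve_get(path)

GET_TESTS = [("GET", "/index.html", b"", 200),
             ("GET", "/image.jpg", b"", 200),
             ("GET", "/notfound.html", b"", 404)]
POST_TEST = ("POST", "/cgi-bin/hello.pl", b"name=x", 200)
EXPLOIT = ("POST", "/cgi-bin/hello.pl", b"A" * 4096)

def survives_exploit(server):
    """The failing test: run the exploit and see whether the server is still up."""
    try:
        server(*EXPLOIT)
        return True
    except MemoryError:
        return False

def passes(server, tests):
    return survives_exploit(server) and all(
        server(m, p, b) == want for m, p, b, want in tests)

weak_suite = GET_TESTS
stronger_suite = GET_TESTS + [POST_TEST]
```

With `weak_suite`, both `patch_length_check` and `patch_delete_post` count as repairs. Adding the single POST test in `stronger_suite` is enough to reject the variant that deletes functionality.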
43
How much, practically, is a different question…
[Image labels: Wes, Claire, GRAD SCHOOL, Swords]
44
The journal extension!
• Scenario: Long-running servers + IDS + generate repairs for detected anomalies.
• Workloads: a day/week of unfiltered requests to the UVA CS webserver/PHP application.
[ Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI 2004. ]
45
The journal extension!
• Scenario: Long-running servers + IDS + generate repairs for detected anomalies.
• Workloads: a day/week of unfiltered requests to the UVA CS webserver/PHP application.
THIS PATCH DELETED CODE
[ Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI 2004. ]
46
The journal extension!
Even a functionality-reducing repair had little practical impact.
[ Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI 2004. ]
Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code can drive
research progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
47
2019: “We follow the standard practice of using test cases to evaluate patch correctness.” –many papers
• Majority of 10-12 repair papers at ICSE 2019 use tests.
• Research efforts to understand, characterize, measure, and promote patch quality are ongoing.
• Tests are good for many reasons!
– Developers understand them.
– They can be used to check a wide array of properties.
• Tests support a general repair paradigm that can (in principle) integrate into existing QA practice.
48
We were wrong about tests at the time.
[ Urli et al. How to design a program repair bot?: insights from the Repairnator project. ICSE (SEIP) 2018 ]
• The rise/dominance of continuous integration, a natural extension point for automatic repair, is a fairly recent phenomenon.
• Modern workflows make it easy to include a human in the patch review loop, further reducing risk of low-quality patches.
• But: the actual use case we proposed in 2009 was unready for prime time.
49
Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research
progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
• Lesson 4: It is OK for SE research to lead SE practice.
50
Scientific Foundations of Repair
• On the one hand, it is surprising that GenProg has worked well.
• On the other hand, why hasn't it worked better?
• If evolution worked as well in software as it does in biology, we would have seen many more advances.
51
Software = Engineering + Evolution ?
• There are already many evident analogues between Darwinian processes and software engineering.
– Successful code is copied and reused (clones, COTS, interfaces vs. inheritance).
– Programmers make small modifications (localization, churn vs. variation).
• This suggests new directions for understanding and improving software and software engineering.
52
Repair Theory: Evolutionary Computation
• Genetic Programming was introduced in 1985.
– Rarely scaled beyond polynomials.
– Steph thought repair using GP sounded fun, but would not work!
• The open question is to bridge the gap between results and techniques in evolutionary biology and software engineering.
• One direction is to understand why it works at all.
– Transition insights from evolutionary computing to software.
[ Arcuri. On the automation of fixing software bugs. ICSE Companion 2008 ]
53
Taking Evolution Seriously
• Potential biological properties of software:
- e.g.: Mutational robustness vs. environmental robustness.
• Understanding the search space can help us search effectively.
- Many bugs are small. Software is not fragile.
- Other potentially interesting analogues:
• Neutrality and epistasis.
• Fitness distributions.
• Neutral network topology.
• Call to arms: be open to insights from other fields.
54
Lessons from a decade of program repair
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research
progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
• Lesson 4: It is OK for research to lead industrial practice.
• Lesson 5: Be open to insights from other fields.
55
How did we get here?
56
10 years of progress
57
Scalability, Repair quality, Expressive power
Big leaps in scalability.
Incremental leaps, fundamental research.
10 years of progress
58
Scalability, Repair quality, Expressive power
Big leaps in scalability.
Incremental leaps, fundamental research.
Relaxing of concerns in practice.
10 years of progress
59
Scalability
Future work: the ongoing challenges
60
Repair quality
Expressive power
Future work: the ongoing challenges
Expressive power
• More complicated, compositional, multi-part patches.
• New ways to use machine learning over prior human edits to inform patch construction.
• Integration into existing QA processes.
Repair quality
• Better understanding of partial correctness, hostile fitness landscapes, intermediate quality.
• Are human-informed repairs more acceptable/trustworthy?
• Use of additional, non-test signals available from the development process.
61
[ Saha et al. Harnessing Evolution for Multi-Hunk Program Repair. ICSE 2019 ]
[ Long. Automatic Patch Generation via Learning from Successful Human Patches. PhD Thesis ]
[ Monperrus. A critical review of "automatic patch generation learned from human-written patches": essay on the problem statement and the evaluation of automatic software repair. ICSE 2014 ]
Stepping back: A challenge about the future of SE work
62
Automation
Humans have not traditionally been expected to read, understand, or modify generated code.
Human development effort
Compilers/code generators raise the level of abstraction at which humans can operate.
Software Bots
Automated synthesis + transformation is integrating into the [complex socio-technical | natural Darwinian] process of SE. How?
Lessons from a decade of program repair
• We made a dozen mistakes in algorithmic design.
• Many of those mistakes were discovered/addressed via shared code and indicative benchmarks.
• …that was a lot of work.
• Test cases effectively capture key aspects of acceptability for deployment.
• The success/failure of evolutionary approaches surfaces fundamental properties of software.
• Lesson 1: Don’t let the perfect be the enemy of the good.
• Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation.
• Lesson 3: Impact arises from more than just a single paper.
• Lesson 4: It is OK for research to lead industrial practice.
• Lesson 5: Be open to insights from other fields.
63

Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Sumanth A
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfisabel213075
 
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMM
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMMchpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMM
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMMNanaAgyeman13
 

Recently uploaded (20)

Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdfDEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________
 
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfPaper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptx
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptx
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdf
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdf
 
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMM
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMMchpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMM
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMM
 

It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair

  • 1. It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair. Westley Weimer, ThanhVu Nguyen, Claire Le Goues, Stephanie Forrest
  • 5. The Once And Future Problem: Bugs • 2002 NIST survey: software bugs estimated to cost 0.6% of the US GDP. • 2003 textbooks: bemoaned that up to 90% of a software project's cost was dedicated to maintenance and bug repair. • 2005 Mozilla developers: complained about 300 bugs appearing per day: "far too many" to handle. • As a graduate student, Wes had worked on SLAM and BLAST and envied Dawson Engler. – Only to hear "we already have tens of thousands of un-fixed bug reports; don't bother finding more bugs".
  • 6. The Cunning Plan • Automatically, efficiently repair certain classes of bugs in off-the-shelf, unannotated legacy programs. • Basic idea: biased, random search through the space of all programs for a variant that repairs the problem. (Image: https://upload.wikimedia.org/wikipedia/commons/a/a4/13-02-27-spielbank-wiesbaden-by-RalfR-093.jpg)
  • 7. Genetic programming: the application of evolutionary or genetic algorithms to program source code.
  • 10. Original Secret Sauces • Use existing test cases to evaluate candidate repairs. • Search by perturbing parts of the program likely to contain the error. • Existing program code and behavior contain the seeds of many repairs. – Leverage existing developer expertise rather than inventing new code!
  • 11. Candidate Repairs: Modified Abstract Syntax Trees • Program statements are manipulated. – Reduces search space compared to changing expressions. • Custom spectrum fault localization. – Called the "weighted path" in the papers: it wasn't very good. – Reduces search space compared to changing the whole program. • Simple mutation (e.g., in the style of introductory students). – Choose a statement based on fault localization weights. – Delete S, Replace S with S1, or Insert S2 after S. • Choose S1 and S2 from the entire AST. • Reduces search space compared to inventing statements (synthesis).
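The mutation step on slide 11 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the program is a flat list of statement strings rather than an AST, and the function name `mutate` and its parameters are hypothetical. It also assumes at least one weight is positive.

```python
import random
from typing import List, Sequence


def mutate(program: List[str], weights: Sequence[float],
           rng: random.Random) -> List[str]:
    """Return a mutated copy of `program` (a flat list of statements).

    Assumption: `weights[i]` is the fault-localization weight of statement i,
    and at least one weight is positive.
    """
    if not program:
        return []
    # Choose the target statement S in proportion to its weight.
    target = rng.choices(range(len(program)), weights=weights, k=1)[0]
    # Donor statements (S1, S2) come from the existing program only;
    # nothing new is synthesized.
    donor = rng.choice(program)
    operator = rng.choice(["delete", "replace", "insert"])
    child = list(program)  # never modify the parent in place
    if operator == "delete":
        del child[target]
    elif operator == "replace":
        child[target] = donor
    else:
        child.insert(target + 1, donor)  # insert S2 after S
    return child
```

Because donors are drawn from the program itself, every statement in the child already existed in the parent, which is the search-space reduction the slide describes.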
  • 12. [Diagram labels: MUTATE, INPUT, DISCARD, ACCEPT, EVALUATE FITNESS] Minimization using delta debugging to “mitigate the risk of breaking untested functionality.”
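Patch minimization, as mentioned on slide 12, can be illustrated with a simplified one-edit-at-a-time reduction. This is a sketch under stated assumptions rather than full delta debugging (which also tries removing larger chunks first): the name `minimize_patch` and the `still_repairs` oracle are hypothetical, and the oracle is assumed to rerun the test suite on the program with only the given edits applied.

```python
from typing import Callable, List, Sequence, TypeVar

E = TypeVar("E")


def minimize_patch(edits: Sequence[E],
                   still_repairs: Callable[[List[E]], bool]) -> List[E]:
    """Drop edits one at a time while the remaining patch still passes.

    Assumption: `still_repairs(edits)` is True for the full input patch.
    The result is 1-minimal: removing any single remaining edit breaks it.
    """
    current = list(edits)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_repairs(candidate):
                current = candidate  # edit i was unnecessary
                changed = True
                break  # restart the scan on the smaller patch
    return current
```

For example, if only edits "b" and "d" are needed for the tests to pass, `minimize_patch(["a", "b", "c", "d"], lambda p: "b" in p and "d" in p)` keeps just those two.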
  • 13. Search-Based Software Engineering • This search-based approach admits other (multi-objective) fitness functions. – Energy reduction, graphical fidelity, readability, execution time, etc. • In 2009: bugs via pass/fail tests only. [ Harman and Jones. Search-based software engineering. Information & Software Technology (2001) ]
  • 15. The original! Note the overachievement! We were aiming for a minimum of 50k LOC. Inspired by fuzzing.
  • 16. Fast Forward to the 2019 ICSE SEIP Track… (Presented just two hours ago!) “Results from repair applied to 6 multi-million line systems.” “Facebook, Inc” “one widely-studied [repair] approach uses software testing to guide the repair process, as typified by GenProg.”
  • 17. How did we get here?
  • 18. An incomplete list of acknowledgements • As a group, we saw an emphasis on taking risks and allowing junior researchers to rise to the occasion. • Mark Harman and collaborators at CREST and SBST • John Knight, Jack Davidson, Anh Nguyen-Tuong, Eric Schulte, Zak Fry, Ethan Fast, and Michael Dewey-Vogt from UVA/UNM • Tom Ball and collaborators from Microsoft Research • Pat Hurley and Sol Greenspan for initial funding support • … and many more!
  • 19. Another possibility: We were right about everything! All we needed was for Facebook to take an interest and implement it.
  • 21. [Diagram labels: MUTATE, DISCARD, INPUT, EVALUATE FITNESS, ACCEPT, OUTPUT] Storing full programs does not scale. Storing patches instead would help make genetic improvement feasible.
  • 22. Candidate Repairs: Modified Abstract Syntax Trees • Program statements are manipulated. – Reduces search space compared to changing expressions. • Custom spectrum fault localization. – High (1.0) if S is not visited on a passed test. – Low (0.1, 0.0) if S is also visited on a passed test. • Simple mutation. – Choose a statement based on fault localization weights. – Delete S, Replace S with S1, or Insert S2 after S. • Choose S1 and S2 from the entire AST. • Reduces search space compared to inventing statements (synthesis).
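The weighting rule on slide 22 can be sketched as a small function over coverage sets. The names here (`weighted_path`, `failing_cov`, `passing_cov`) are hypothetical, statements are identified by integer ids, and giving weight 0.0 to statements the failing test never visits is an assumption that the slide does not spell out.

```python
from typing import Dict, Iterable, Set


def weighted_path(statements: Iterable[int],
                  failing_cov: Set[int],
                  passing_cov: Set[int],
                  shared_weight: float = 0.1) -> Dict[int, float]:
    """Assign a fault-localization weight to each statement id.

    failing_cov: statements visited by the failing test(s).
    passing_cov: statements visited by at least one passing test.
    """
    weights: Dict[int, float] = {}
    for s in statements:
        if s not in failing_cov:
            weights[s] = 0.0            # assumption: never a mutation target
        elif s in passing_cov:
            weights[s] = shared_weight  # low: also visited on a passed test
        else:
            weights[s] = 1.0            # high: visited only when failing
    return weights
```

Setting `shared_weight` to 0.0 corresponds to the second "low" value on the slide, which restricts mutation to statements that only the failing test visits.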
  • 24. [Diagram labels: MUTATE, INPUT, DISCARD, ACCEPT, EVALUATE FITNESS] Minimization does not affect semantic patch quality. ML experts have known since the 70's that model size can be independent of degree of overfitting.
  • 25. Selection and Maintenance Debt • If there are multiple high-fitness candidates, how do you pick one? – Used stochastic universal sampling (roulette wheel selection). – Better approaches (e.g., tournament selection) were known. – SUS was first on the list on Wikipedia (no, really …). • First GenProg prototype was made for a Multi University Research Initiatives grant meeting. – Fixed GCD infinite loop and Nullhttpd buffer overrun. – Victim of its own success: no refactoring for years.
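The two selection schemes named on slide 25 can be compared with a short sketch. These are textbook forms of stochastic universal sampling and tournament selection, not code from GenProg; the function names and parameters are hypothetical, and fitness values are assumed to be non-negative with a positive sum.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def stochastic_universal_sampling(population: Sequence[T],
                                  fitnesses: Sequence[float],
                                  n: int, rng: random.Random) -> List[T]:
    """Pick n individuals using n evenly spaced pointers and one random spin."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("fitnesses must have a positive sum")
    step = total / n
    start = rng.uniform(0, step)
    chosen: List[T] = []
    idx, cumulative = 0, fitnesses[0]
    for k in range(n):
        pointer = start + k * step
        # Advance to the individual whose fitness interval holds the pointer.
        while cumulative <= pointer and idx < len(population) - 1:
            idx += 1
            cumulative += fitnesses[idx]
        chosen.append(population[idx])
    return chosen


def tournament_selection(population: Sequence[T],
                         fitnesses: Sequence[float],
                         n: int, rng: random.Random, k: int = 2) -> List[T]:
    """Pick n individuals; each is the fittest of k random contenders."""
    chosen: List[T] = []
    for _ in range(n):
        contenders = rng.sample(range(len(population)), k)
        best = max(contenders, key=lambda i: fitnesses[i])
        chosen.append(population[best])
    return chosen
```

One practical difference is that tournament selection depends only on the ordering of fitness values, while the sampling-based scheme depends on their magnitudes, so it is more sensitive to how the fitness function is scaled.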
  • 26. Lessons from a decade of program repair • Lesson 1: Don’t let the perfect be the enemy of the good.
  • 27. Another possibility: We were right about everything! All we needed was for Facebook to take an interest and implement it.
  • 30. “If I gave you the last 100 bugs from <my project>, how many could <your technique> fix?” – Real Engineers
  • 31. Systematic Benchmark Construction • Approach: use historical data to approximate discovery and repair of bugs in the wild. • Mine program versions (going back in time on SourceForge, Google Code, Fedora SRPM, etc.) where test case behavior changes. • Each such change corresponds to a human-written repair for the bug tested by the failing test case(s).
  • 32. ManyBugs drove algorithmic changes
      Program     LOC         Tests    Bugs   Description
      fbc         97,000      773      3      Language (legacy)
      gmp         145,000     146      2      Multiple precision math
      gzip        491,000     12       5      Data compression
      libtiff     77,000      78       24     Image manipulation
      lighttpd    62,000      295      9      Web server
      php         1,046,000   8,471    44     Language (web)
      python      407,000     355      11     Language (general)
      wireshark   2,814,000   63       7      Network packet analyzer
      Total       5,139,000   10,193   105
  • 35. The Three Major Challenges: Scalability • Both search- and synthesis-based repair techniques underwent fundamental reconfigurations to enable scalability. • A shared dataset of indicative, real-world bugs drove innovation in test-driven program repair. [ Nguyen et al. SemFix: program repair via semantic analysis. ICSE 2013. ] [ Mechtaev et al. Angelix: scalable multiline program patch synthesis via symbolic analysis. ICSE 2016. ] [ Sim et al. Using benchmarking to advance research: A challenge to software engineering. ICSE 2003. ]
  • 36. One aspect of impactful research? Release your code! • This is still much less common than you’d expect. – I am not and have not been perfect about this, but I try. • Releasing code takes time and energy. Why clean up your code and write documentation when it won’t turn into another paper? • Releasing code and data supports extension and comparison, a cornerstone of the scientific process.
  • 37. Lessons from a decade of program repair • Lesson 1: Don’t let the perfect be the enemy of the good. • Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation. • Lesson 3: Impact arises from more than just a single paper.
  • 38. The Three Major Challenges: Scalability, Repair quality
  • 40. Using tests was controversial
  • 41. Tests make for a dangerous fitness function. • nullhttpd: a webserver with basic GET + POST functionality. Version 0.5.0: remote-exploitable heap-based buffer overflow in handling of POST. • Failing test case: run exploit, see if webserver is still running. • Easy passing test cases: 1. “GET index.html” 2. “GET image.jpg” 3. “GET notfound.html” 4. “POST /cgi-bin/hello.pl” • Candidate repair found: “reuse a content-length check from elsewhere in the code”
  • 42. Tests make for a dangerous fitness function. • nullhttpd: a webserver with basic GET + POST functionality. Version 0.5.0: remote-exploitable heap-based buffer overflow in handling of POST. • Failing test case: run exploit, see if webserver is still running. • Easy passing test cases: 1. “GET index.html” 2. “GET image.jpg” 3. “GET notfound.html” 4. “POST /cgi-bin/hello.pl” • Candidate repair found: “delete handling of POST requests”
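The risk described on slides 41 and 42 can be made concrete with a toy fitness function. Everything in this sketch is hypothetical: programs are modeled as dictionaries of behaviors rather than real code, the test names are invented, and the 1.0 and 10.0 weights are illustrative assumptions. It shows only that a test-based fitness cannot distinguish a repair that adds a check from one that deletes functionality unless some test exercises that functionality.

```python
from typing import Callable, Dict, Sequence

Program = Dict[str, bool]           # toy model: behavior name -> works?
Test = Callable[[Program], bool]


def fitness(program: Program,
            passing_tests: Sequence[Test],
            failing_tests: Sequence[Test],
            w_pass: float = 1.0, w_fail: float = 10.0) -> float:
    """Weighted count of tests the candidate passes.

    passing_tests: tests the original program already passed.
    failing_tests: tests that exposed the bug.
    """
    kept = sum(1 for t in passing_tests if t(program))
    fixed = sum(1 for t in failing_tests if t(program))
    return w_pass * kept + w_fail * fixed


def get_works(p: Program) -> bool:
    return p["get"]


def post_works(p: Program) -> bool:
    return p["post"]


def survives_exploit(p: Program) -> bool:
    return p["survives_exploit"]


original = {"get": True, "post": True, "survives_exploit": False}
deletes_post = {"get": True, "post": False, "survives_exploit": True}
adds_check = {"get": True, "post": True, "survives_exploit": True}

weak_suite = [get_works]                # no test covers POST
strong_suite = [get_works, post_works]  # POST is tested
```

With `weak_suite`, `deletes_post` and `adds_check` receive the same score, so the search has no reason to prefer the better repair; with `strong_suite`, `adds_check` scores strictly higher.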
  • 43. Test suite quality definitely matters. How much, practically, is a different question… [Cartoon labels: Wes, Claire, GRAD SCHOOL, Swords]
  • 44. The journal extension! • Scenario: Long-running servers + IDS + generate repairs for detected anomalies. • Workloads: a day/week of unfiltered requests to the UVA CS webserver / PHP application. [ Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI ’04. ]
  • 45. The journal extension! • Scenario: Long-running servers + IDS + generate repairs for detected anomalies. • Workloads: a day/week of unfiltered requests to the UVA CS webserver / PHP application. THIS PATCH DELETED CODE [ Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI ’04. ]
  • 46. The journal extension! Even a functionality-reducing repair had little practical impact. [ Rinard et al. Enhancing server availability and security through failure-oblivious computing. OSDI ’04. ]
  • 47. Lessons from a decade of program repair • Lesson 1: Don’t let the perfect be the enemy of the good. • Lesson 2: Shared metrics, benchmarks, and code can drive research progress and innovation. • Lesson 3: Impact arises from more than just a single paper.
  • 48. 2019: “We follow the standard practice of using test cases to evaluate patch correctness.” – many papers • Majority of 10-12 repair papers at ICSE 2019 use tests. • Research efforts to understand, characterize, measure, and promote patch quality are ongoing. • Tests are good for many reasons! – Developers understand them. – They can be used to check a wide array of properties. • Tests support a general repair paradigm that can (in principle) integrate into existing QA practice.
  • 49. We were wrong about tests at the time. [ Urli et al. How to design a program repair bot?: insights from the Repairnator project. ICSE (SEIP) 2018 ] • The rise/dominance of continuous integration, a natural extension point for automatic repair, is a fairly recent phenomenon. • Modern workflows make it easy to include a human in the patch review loop, further reducing risk of low-quality patches. • But: the actual use case we proposed in 2009 was unready for prime time.
  • 50. Lessons from a decade of program repair • Lesson 1: Don’t let the perfect be the enemy of the good. • Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation. • Lesson 3: Impact arises from more than just a single paper. • Lesson 4: It is OK for SE research to lead SE practice.
  • 51. Scientific Foundations of Repair • On the one hand, it is surprising that GenProg has worked well. • On the other hand, why hasn't it worked better? • If evolution worked as well in software as it does in biology, we would have seen many more advances.
  • 52. Software = Engineering + Evolution ? • There are already many evident analogues between Darwinian processes and software engineering. – Successful code is copied and reused (clones, COTS, interfaces vs. inheritance). – Programmers make small modifications (localization, churn vs. variation). • This suggests new directions for understanding and improving software and software engineering.
  • 53. Repair Theory: Evolutionary Computation • Genetic Programming was introduced in 1985. – Rarely scaled beyond polynomials. – Steph thought repair using GP sounded fun, but would not work! • The open question is to bridge the gap between results and techniques in evolutionary biology and software engineering. • One direction is to understand why it works at all. – Transition insights from evolutionary computing to software. [ Arcuri. On the automation of fixing software bugs. ICSE Companion 2008 ]
  • 54. Taking Evolution Seriously • Potential biological properties of software: – e.g.: Mutational robustness vs. environmental robustness. • Understanding the search space can help us search effectively. – Many bugs are small. Software is not fragile. – Other potentially interesting analogues: neutrality and epistasis; fitness distributions; neutral network topology. • Call to arms: be open to insights from other fields.
  • 55. Lessons from a decade of program repair • Lesson 1: Don’t let the perfect be the enemy of the good. • Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation. • Lesson 3: Impact arises from more than just a single paper. • Lesson 4: It is OK for research to lead industrial practice. • Lesson 5: Be open to insights from other fields.
  • 56. How did we get here?
  • 57. 10 years of progress [Figure labels: Scalability, Repair quality, Expressive power] • Big leaps in scalability. • Incremental leaps, fundamental research.
  • 58. 10 years of progress [Figure labels: Scalability, Repair quality, Expressive power] • Big leaps in scalability. • Incremental leaps, fundamental research. • Relaxing of concerns in practice.
  • 61. Future work: the ongoing challenges • Expressive power: – More complicated, compositional, multi-part patches. – New ways to use machine learning over prior human edits to inform patch construction. – Integration into existing QA processes. • Repair quality: – Better understanding of partial correctness, hostile fitness landscapes, intermediate quality. – Are human-informed repairs more acceptable/trustworthy? – Use of additional, non-test signals available from the development process. [ Saha et al. Harnessing Evolution for Multi-Hunk Program Repair. ICSE 2019 ] [ Long. Automatic Patch Generation via Learning from Successful Human Patches. PhD Thesis ] [ Monperrus. A critical review of "automatic patch generation learned from human-written patches": essay on the problem statement and the evaluation of automatic software repair. ICSE 2014 ]
  • 62. Stepping back: A challenge about the future of SE work [Figure labels: Automation, Human development effort, Software Bots] • Humans have not traditionally been expected to read, understand, or modify generated code. • Compilers/code generators raise the level of abstraction at which humans can operate. • Automated synthesis + transformation is integrating into the [complex socio-technical | natural Darwinian] process of SE. How?
  • 63. Lessons from a decade of program repair • We made a dozen mistakes in algorithmic design. • Many of those mistakes were discovered/addressed via shared code and indicative benchmarks. • …that was a lot of work. • Test cases effectively capture key aspects of acceptability for deployment. • The success/failure of evolutionary approaches surfaces fundamental properties of software. • Lesson 1: Don’t let the perfect be the enemy of the good. • Lesson 2: Shared metrics, benchmarks, and code drive research progress and innovation. • Lesson 3: Impact arises from more than just a single paper. • Lesson 4: It is OK for research to lead industrial practice. • Lesson 5: Be open to insights from other fields.

Editor's Notes

  1. 17:46
  2. Image credits. 3D growth: https://www.flickr.com/photos/86530412@N02/7935377706, www.stockmonkeys.com, labeled CC BY 2.0. Quality: https://pixabay.com/en/approved-control-quality-stamp-147677/, public domain. Diversity: https://www.flickr.com/photos/cimmyt/5219256862, CC BY-NC-SA 2.0, photo credit: Xochiquetzal Fonseca/CIMMYT.
  3. Figure from: https://cloud.google.com/solutions/continuous-integration/