Human-In-The-Loop Automatic Program Repair
1. Human-In-The-Loop
Automatic Program Repair
Marcel Böhme
Monash University
Van-Thuan Pham
University of Melbourne
Charaka Geethal
Monash University
2. • Test-driven Automated Program Repair (APR)
• Given a failing test suite, change the program such that all test cases pass.
Test-driven APR without the tests?
1. Generate a repair candidate.
• Use test cases to localize potential repair sites.
• Change one or more repair sites to generate a repair candidate.
2. Validate repair candidate.
• Use test cases to measure the “quality” of the repair candidate.
• If more tests pass, use repair candidate and go back to 1.
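The two-step generate-and-validate loop above can be sketched in Python. This is a minimal illustration rather than any specific APR tool; `repair_loop`, `run_test`, `mutate_program`, and the toy one-parameter "program" below are hypothetical stand-ins.

```python
import random

def repair_loop(program, tests, run_test, mutate_program, budget=1000):
    """Minimal generate-and-validate APR sketch.

    run_test(program, test) -> True if the test passes.
    mutate_program(program) -> a repair candidate.
    """
    def score(p):
        # Validate a candidate: quality = number of passing test cases.
        return sum(run_test(p, t) for t in tests)

    best, best_score = program, score(program)
    for _ in range(budget):
        candidate = mutate_program(best)   # 1. generate a repair candidate
        s = score(candidate)               # 2. validate it against the tests
        if s > best_score:                 # if more tests pass, keep it
            best, best_score = candidate, s
        if best_score == len(tests):       # all tests pass: plausible repair
            break
    return best

# Toy subject: the program computes k * t; the tests expect 2 * t,
# so the "repair" is to drive the constant k to 2.
random.seed(0)
tests = [1, 2, 3]
run_test = lambda p, t: p["k"] * t == 2 * t
mutate = lambda p: {"k": p["k"] + random.randint(-5, 5)}
fixed = repair_loop({"k": 5}, tests, run_test, mutate)
```

Note that the quality signal here is exactly the test suite, which is why the next slides ask what happens when the suite is weak or missing.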
5. • Test-driven Automated Program Repair (APR)
• Given a failing test suite, change the program such that all test cases pass.
• Huh? What about overfitting?
• The quality of the auto-repair strictly depends on the quality of the test suite.
• Huh? What if no test suite is available at all?
• We have no automated oracle that judges whether the bug is observed.
• We do have access to the bug-reporting user, however.
Test-driven APR without the tests?
7. Learn2Fix Overview
• Learn2Fix negotiates with the bug-reporting user the
condition under which the bug is observed.
• By systematic queries, Learn2Fix learns an automated bug oracle that
becomes increasingly more accurate in predicting the user’s response.
• When a budget of queries is exhausted, it attempts to repair the bug.
• Hypothesis: A machine that classifies more accurately which inputs
are bug-revealing can also auto-generate repairs of better quality.
“When executing this alternative test input,
the program produces the following output;
is the bug observed?”
A Learn2Fix query.
“The bug is observed only if the input is an
equilateral triangle with sides of unit-length.”
A learned automated bug oracle.
Our key challenge is to
maximize the oracle’s accuracy
given only a small budget of queries.
15. Learn2Fix Example
• The buggy subject program (a triangle classifier) returns an unexpected output
for all equilateral triangles except [1, 1, 1], and
for all isosceles triangles where c == 1.
• Bug: classify(2,2,2) returns 2 while we expect 1.
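The actual Codeflaws subject is not shown in this transcript; the following hypothetical Python version is one way to reproduce the described behaviour (1 = equilateral, 2 = isosceles, 3 = scalene; the injected fault is an assumption consistent with the slide, not the real program):

```python
def classify(a, b, c):
    """Buggy triangle classifier: 1 = equilateral, 2 = isosceles, 3 = scalene."""
    if a == b and c == 1:   # BUG (hypothetical): correct test is "a == b and b == c"
        return 1
    if a == b or b == c or a == c:
        return 2
    return 3

classify(1, 1, 1)  # -> 1, as expected
classify(2, 2, 2)  # -> 2, but 1 (equilateral) is expected: the reported bug
classify(3, 3, 1)  # -> 1, but 2 (isosceles) is expected
```

Exactly as the slide describes, only (1, 1, 1) is classified correctly among equilateral triangles, and isosceles triangles with two equal sides and c == 1 are misclassified as equilateral.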
18. Learn2Fix Ingredients
• Mutational fuzzing to generate more test cases
in the “neighborhood” of the failing test case.
• Active learning to construct an SMT(LRA) constraint
that is satisfied only by test cases the user would classify as failing.
• Test-driven APR to auto-generate the repair from the labeled test cases.
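The first ingredient, mutational fuzzing around the failing test, might look like the sketch below. The mutation operator and ranges are illustrative assumptions, not the paper's exact implementation:

```python
import random

def mutate_input(inputs, max_delta=2):
    """Generate a new test input in the 'neighborhood' of a failing one
    by slightly perturbing one randomly chosen numeric field."""
    mutated = list(inputs)
    i = random.randrange(len(mutated))
    # Pick a small non-zero perturbation so the neighbor differs from the original.
    delta = random.choice([d for d in range(-max_delta, max_delta + 1) if d != 0])
    mutated[i] += delta
    return mutated

# Neighborhood of the failing test input [2, 2, 2] from the example:
random.seed(1)
neighbors = [mutate_input([2, 2, 2]) for _ in range(5)]
```

Inputs generated this way are close to the known failing input, which makes them promising candidates for also revealing the bug.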
21. Learn2Fix: Active Learning
• A test is a tuple: <input vector, output vector>.
• A labeled test is a user-classified test; labeled either passing or failing.
• Given a set of labeled tests, INCAL learns an SMT(LRA) constraint that is
satisfied by all failing tests but no passing tests.
• Learn2Fix extends INCAL’s passive learning into an active learning approach.
• The learner (Learn2Fix) queries the teacher (user)
about the labels of the most informative samples (tests).
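INCAL itself learns general SMT(LRA) formulas; as a much simplified stand-in (not INCAL), the sketch below learns a conjunction of linear bounds, i.e. an axis-aligned box, that is satisfied by all failing tests, and checks that it excludes every passing test. Tests are reduced here to their numeric input vectors:

```python
def learn_box_oracle(failing, passing):
    """Simplified stand-in for INCAL: learn a conjunction of bounds
    lo[i] <= t[i] <= hi[i] (a special case of an SMT(LRA) formula) that is
    satisfied by all failing tests; report whether it excludes all passing tests."""
    dims = range(len(failing[0]))
    lo = [min(t[i] for t in failing) for i in dims]
    hi = [max(t[i] for t in failing) for i in dims]

    def oracle(t):
        # The learned constraint: a conjunction of linear inequalities.
        return all(lo[i] <= t[i] <= hi[i] for i in range(len(t)))

    consistent = not any(oracle(t) for t in passing)
    return oracle, consistent

# Failing tests cluster around the buggy region; passing tests lie outside it.
oracle, consistent = learn_box_oracle(
    failing=[[2, 2, 2], [3, 3, 3]],
    passing=[[1, 1, 1], [2, 3, 4]],
)
```

A real SMT(LRA) learner can express far richer separating constraints (arbitrary linear inequalities and their boolean combinations); the box is only the simplest instance of that family.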
23. Learn2Fix: Active Learning
• How to maximise oracle accuracy given a limited budget of queries?
• Learn2Fix maximizes the probability that the user labels failing tests.
• To reduce labelling effort.
• The user can’t label every generated test case.
• To address the class imbalance problem.
• Learn2Fix better learns to identify failing test cases.
• How do we know the failure probability before we even ask the user?
Our automated oracle is only binary: the bug is observed, or it is not.
• From our oracle, we construct an unbiased committee of classifiers.
• Given a test t, we estimate its failure probability
as the proportion of classifiers that label t as failing.
• If more than 50% of classifiers rate t as failing,
then present t to the user for labeling.
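The committee-based selection can be sketched as follows. The committee members here are arbitrary illustrative boolean classifiers; in Learn2Fix they are derived from the learned oracle:

```python
def failure_probability(t, committee):
    """Estimate the failure probability of test t as the fraction of
    committee members that classify t as failing."""
    votes = sum(1 for classifier in committee if classifier(t))
    return votes / len(committee)

def should_query_user(t, committee):
    """Present t to the user only if a majority (> 50%) of the
    committee predicts that t is failing."""
    return failure_probability(t, committee) > 0.5

# Illustrative committee of three hypothetical classifiers over input triples:
committee = [
    lambda t: t[0] == t[1],   # first two sides equal
    lambda t: t[1] == t[2],   # last two sides equal
    lambda t: sum(t) > 3,     # larger than the unit triangle
]
```

Only tests that the committee deems likely failing reach the user, which both limits labelling effort and counteracts the class imbalance between failing and passing tests.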
31. • Our repair benchmark: Codeflaws
• 3902 C programs from programming contests of 1653 users @ Codeforces
• Selected 552 programs that contain at least 1 failing test and
only take numeric inputs.
• Each program comes with
• a buggy version and a golden version where the bug is fixed,
• manually written training test cases, and
• manually written heldout test cases.
Experiment Setup
32. • Methodology:
• As input for Learn2Fix, select a random failing test case from training set.
• To label our generated tests, use the golden version (instead of the user).
• Experiment parameters:
• 10 to 30 labels maximally allowed (query budget).
• 20 members in the committee of classifiers (committee size).
• 10 minutes maximal repair time (APR timeout).
• 30 repetitions of each experiment (#trials).
• Reproduce our experiments!
• All artefacts @ https://github.com/mboehme/learn2fix
Experiment Setup
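The methodology's "golden version as simulated user" can be mimicked like this; `buggy` and `golden` below are toy stand-ins for a Codeflaws subject's two versions, not the real programs:

```python
def make_simulated_user(buggy, golden):
    """In the experiments, the golden (fixed) version stands in for the
    bug-reporting user: a test is labeled 'failing' iff the buggy version's
    output differs from the golden version's output."""
    def label(test_input):
        return "failing" if buggy(*test_input) != golden(*test_input) else "passing"
    return label

# Toy stand-ins mirroring the triangle example (1 = equilateral, 2 = other):
buggy  = lambda a, b, c: 1 if (a == b and c == 1) else 2   # hypothetical fault
golden = lambda a, b, c: 1 if (a == b == c) else 2
label = make_simulated_user(buggy, golden)
```

This lets the experiments run at scale without a human in the loop, while still measuring how many queries a real user would have received.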
36. • RQ.1 Oracle Quality. How accurately does Learn2Fix’s learned
automatic oracle label the repair benchmark’s (failing) tests?
Experiments
• Learn2Fix has only ever seen a single
failing test case from the manually labeled
validation test cases.
• Yet, Learn2Fix’s automatic oracle is able to
accurately predict the label of 75–84% of
validation tests for the median subject.
• Even for the minority class (failing tests),
at least 78% of failing validation tests are
correctly labeled for the majority of subjects.
39. • RQ.2 Human Effort. What is the proportion of generated test cases
that are sent to the user for labelling?
Experiments
• As the prediction accuracy of the
automated oracle increases,
• the user is asked to label a smaller
proportion of generated tests.
• Yet, the user is asked to label
roughly the same proportion of
actually failing test cases!
Proportion of (failing) tests that the user is asked to label.
42. • RQ.2 Human Effort. What is the proportion of generated test cases
that are sent to the user for labelling?
Experiments
Probability to generate / label a failing test across all subjects.
• As the prediction accuracy of the
automated oracle increases,
• the probability that the user is asked
to label a test that is indeed failing
also increases.
• Meanwhile, the probability to
generate a failing test
remains the same.
43. • RQ.3 Patch Quality. How do Learn2Fix auto-patches compare to
auto-patches using the manually constructed training test suite?
Experiments
46. • Test-driven Automated Program Repair (APR)
• Given a failing test suite, change the program such that all test cases pass.
• Huh? What about overfitting?
• The quality of the auto-repair strictly depends on the quality of the test suite.
• Huh? What if no test suite is available at all?
• We have no automated oracle that judges whether the bug is observed.
• We do have access to the bug-reporting user, however.
Test-driven APR without the tests?
👩‍💻 https://github.com/mboehme/learn2fix