Human-In-The-Loop Automatic Program Repair
1. Human-In-The-Loop
Automatic Program Repair
Marcel Böhme
Monash University
Van-Thuan Pham
University of Melbourne
Charaka Geethal
Monash University
2. • Test-driven Automated Program Repair (APR)
• Given a failing test suite, change the program such that all test cases pass.
Test-driven APR without the tests?
1. Generate a repair candidate.
• Use test cases to localize potential repair sites.
• Change one or more repair sites to generate a repair candidate.
2. Validate repair candidate.
• Use test cases to measure the “quality” of the repair candidate.
• If more tests pass, use repair candidate and go back to 1.
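The two-step generate-and-validate loop above can be sketched in Python. This is a minimal illustration rather than any specific APR tool; `repair_loop`, `run_test`, `mutate_program`, and the toy one-parameter "program" below are hypothetical stand-ins.

```python
import random

def repair_loop(program, tests, run_test, mutate_program, budget=1000):
    """Minimal generate-and-validate APR sketch.

    run_test(program, test) -> True if the test passes.
    mutate_program(program) -> a repair candidate.
    """
    def score(p):
        # Validate a candidate: quality = number of passing test cases.
        return sum(run_test(p, t) for t in tests)

    best, best_score = program, score(program)
    for _ in range(budget):
        candidate = mutate_program(best)   # 1. generate a repair candidate
        s = score(candidate)               # 2. validate it against the tests
        if s > best_score:                 # if more tests pass, keep it
            best, best_score = candidate, s
        if best_score == len(tests):       # all tests pass: plausible repair
            break
    return best

# Toy subject: the program computes k * t; the tests expect 2 * t,
# so the "repair" is to drive the constant k to 2.
random.seed(0)
tests = [1, 2, 3]
run_test = lambda p, t: p["k"] * t == 2 * t
mutate = lambda p: {"k": p["k"] + random.randint(-5, 5)}
fixed = repair_loop({"k": 5}, tests, run_test, mutate)
```

Note that the quality signal here is exactly the test suite, which is why the next slides ask what happens when the suite is weak or missing.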
5. • Test-driven Automated Program Repair (APR)
• Given a failing test suite, change the program such that all test cases pass.
• Huh? What about overfitting?
• The quality of the auto-repair strictly depends on the quality of the test suite.
• Huh? What if no test suite is available at all?
• We have no automated oracle that judges whether the bug is observed.
• We do have access to the bug-reporting user, however.
Test-driven APR without the tests?
7. Learn2Fix Overview
• Learn2Fix negotiates with the bug-reporting user the
condition under which the bug is observed.
• By systematic queries, Learn2Fix learns an automated bug oracle that
becomes increasingly more accurate in predicting the user’s response.
• When a budget of queries is exhausted, it attempts to repair the bug.
• Hypothesis: A machine that classifies more accurately which inputs
are bug-revealing can also auto-generate repairs of better quality.
“When executing this alternative test input,
the program produces the following output;
is the bug observed?”
A Learn2Fix query.
“The bug is observed only if the input is an
equilateral triangle with sides of unit-length.”
A learned automated bug oracle.
Our key challenge is to
maximize the oracle’s accuracy
given only a small budget of queries.
15. Learn2Fix Example
• The buggy subject program (a triangle classifier) returns an unexpected output
for all equilateral triangles except [1, 1, 1], and
for all isosceles triangles where c == 1.
• Bug: classify(2,2,2) returns 2 while we expect 1.
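The actual Codeflaws subject is not shown in this transcript; the following hypothetical Python version is one way to reproduce the described behaviour (1 = equilateral, 2 = isosceles, 3 = scalene; the injected fault is an assumption consistent with the slide, not the real program):

```python
def classify(a, b, c):
    """Buggy triangle classifier: 1 = equilateral, 2 = isosceles, 3 = scalene."""
    if a == b and c == 1:   # BUG (hypothetical): correct test is "a == b and b == c"
        return 1
    if a == b or b == c or a == c:
        return 2
    return 3

classify(1, 1, 1)  # -> 1, as expected
classify(2, 2, 2)  # -> 2, but 1 (equilateral) is expected: the reported bug
classify(3, 3, 1)  # -> 1, but 2 (isosceles) is expected
```

Exactly as the slide describes, only (1, 1, 1) is classified correctly among equilateral triangles, and isosceles triangles with two equal sides and c == 1 are misclassified as equilateral.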
18. Learn2Fix Ingredients
• Mutational fuzzing to generate more test cases
in the “neighborhood” of the failing test case.
• Active learning to construct an SMT(LRA) constraint
that is satisfied only by test cases the user would classify as failing.
• Test-driven APR to auto-generate the repair from the labeled test cases.
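The first ingredient, mutational fuzzing around the failing test, might look like the sketch below. The mutation operator and ranges are illustrative assumptions, not the paper's exact implementation:

```python
import random

def mutate_input(inputs, max_delta=2):
    """Generate a new test input in the 'neighborhood' of a failing one
    by slightly perturbing one randomly chosen numeric field."""
    mutated = list(inputs)
    i = random.randrange(len(mutated))
    # Pick a small non-zero perturbation so the neighbor differs from the original.
    delta = random.choice([d for d in range(-max_delta, max_delta + 1) if d != 0])
    mutated[i] += delta
    return mutated

# Neighborhood of the failing test input [2, 2, 2] from the example:
random.seed(1)
neighbors = [mutate_input([2, 2, 2]) for _ in range(5)]
```

Inputs generated this way are close to the known failing input, which makes them promising candidates for also revealing the bug.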
21. Learn2Fix: Active Learning
• A test is a tuple: <input vector, output vector>.
• A labeled test is a user-classified test; labeled either passing or failing.
• Given a set of labeled tests, INCAL learns an SMT(LRA) constraint that is
satisfied by all failing tests but no passing tests.
• Learn2Fix extends INCAL’s passive learning into an active learning approach.
• The learner (Learn2Fix) queries the teacher (user)
about the labels of the most informative samples (tests).
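INCAL itself learns general SMT(LRA) formulas; as a much simplified stand-in (not INCAL), the sketch below learns a conjunction of linear bounds, i.e. an axis-aligned box, that is satisfied by all failing tests, and checks that it excludes every passing test. Tests are reduced here to their numeric input vectors:

```python
def learn_box_oracle(failing, passing):
    """Simplified stand-in for INCAL: learn a conjunction of bounds
    lo[i] <= t[i] <= hi[i] (a special case of an SMT(LRA) formula) that is
    satisfied by all failing tests; report whether it excludes all passing tests."""
    dims = range(len(failing[0]))
    lo = [min(t[i] for t in failing) for i in dims]
    hi = [max(t[i] for t in failing) for i in dims]

    def oracle(t):
        # The learned constraint: a conjunction of linear inequalities.
        return all(lo[i] <= t[i] <= hi[i] for i in range(len(t)))

    consistent = not any(oracle(t) for t in passing)
    return oracle, consistent

# Failing tests cluster around the buggy region; passing tests lie outside it.
oracle, consistent = learn_box_oracle(
    failing=[[2, 2, 2], [3, 3, 3]],
    passing=[[1, 1, 1], [2, 3, 4]],
)
```

A real SMT(LRA) learner can express far richer separating constraints (arbitrary linear inequalities and their boolean combinations); the box is only the simplest instance of that family.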
23. Learn2Fix: Active Learning
• How to maximise oracle accuracy given a limited budget of queries?
• Learn2Fix maximizes the probability that the user labels failing tests.
• To reduce labelling effort.
• The user can’t label every generated test case.
• To address the class imbalance problem.
• Learn2Fix better learns to identify failing test cases.
• How do we know the failure probability before we even ask the user?
Our automated oracle is only binary: the bug is observed, or it is not.
• From our oracle, we construct an unbiased committee of classifiers.
• Given a test t, we estimate its failure probability
as the proportion of classifiers that label t as failing.
• If more than 50% of classifiers rate t as failing,
then present t to the user for labeling.
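The committee-based selection can be sketched as follows. The committee members here are arbitrary illustrative boolean classifiers; in Learn2Fix they are derived from the learned oracle:

```python
def failure_probability(t, committee):
    """Estimate the failure probability of test t as the fraction of
    committee members that classify t as failing."""
    votes = sum(1 for classifier in committee if classifier(t))
    return votes / len(committee)

def should_query_user(t, committee):
    """Present t to the user only if a majority (> 50%) of the
    committee predicts that t is failing."""
    return failure_probability(t, committee) > 0.5

# Illustrative committee of three hypothetical classifiers over input triples:
committee = [
    lambda t: t[0] == t[1],   # first two sides equal
    lambda t: t[1] == t[2],   # last two sides equal
    lambda t: sum(t) > 3,     # larger than the unit triangle
]
```

Only tests that the committee deems likely failing reach the user, which both limits labelling effort and counteracts the class imbalance between failing and passing tests.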
31. • Our repair benchmark: Codeflaws
• 3902 C programs from programming contests of 1653 users @ Codeforces
• Selected 552 programs that contain at least 1 failing test and
only take numeric inputs.
• Each program comes with
• a buggy version and a golden version where the bug is fixed,
• manually written training test cases, and
• manually written heldout test cases.
Experiment Setup
32. • Methodology:
• As input for Learn2Fix, select a random failing test case from training set.
• To label our generated tests, use the golden version (instead of the user).
• Experiment parameters:
• 10 to 30 labels maximally allowed (query budget).
• 20 members in the committee of classifiers (committee size).
• 10 minutes maximal repair time (APR timeout).
• 30 repetitions of each experiment (#trials).
• Reproduce our experiments!
• All artefacts @ https://github.com/mboehme/learn2fix
Experiment Setup
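The methodology's "golden version as simulated user" can be mimicked like this; `buggy` and `golden` below are toy stand-ins for a Codeflaws subject's two versions, not the real programs:

```python
def make_simulated_user(buggy, golden):
    """In the experiments, the golden (fixed) version stands in for the
    bug-reporting user: a test is labeled 'failing' iff the buggy version's
    output differs from the golden version's output."""
    def label(test_input):
        return "failing" if buggy(*test_input) != golden(*test_input) else "passing"
    return label

# Toy stand-ins mirroring the triangle example (1 = equilateral, 2 = other):
buggy  = lambda a, b, c: 1 if (a == b and c == 1) else 2   # hypothetical fault
golden = lambda a, b, c: 1 if (a == b == c) else 2
label = make_simulated_user(buggy, golden)
```

This lets the experiments run at scale without a human in the loop, while still measuring how many queries a real user would have received.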
36. • RQ.1 Oracle Quality. How accurately does Learn2Fix’s learned
automatic oracle label the repair benchmark’s (failing) tests?
Experiments
• Learn2Fix has only ever seen a single
failing test case from the manually labeled
validation test cases.
• Yet, Learn2Fix’s automatic oracle is able to
accurately predict the label of 75–84% of
validation tests for the median subject.
• Even for the minority class (failing tests),
at least 78% of failing validation tests are
correctly labeled for the majority of subjects.
39. • RQ.2 Human Effort. What is the proportion of generated test cases
that are sent to the user for labelling?
Experiments
• As the prediction accuracy of the
automated oracle increases,
• the user is asked to label a smaller
proportion of generated tests.
• Yet, the user is asked to label
roughly the same proportion of
actually failing test cases!
Proportion of (failing) tests that the user is asked to label.
42. • RQ.2 Human Effort. What is the proportion of generated test cases
that are sent to the user for labelling?
Experiments
Probability to generate / label a failing test across all subjects.
• As the prediction accuracy of the
automated oracle increases,
• the probability that the user is asked
to label a test that is indeed failing
also increases.
• Meanwhile, the probability to
generate a failing test
remains the same.
43. • RQ.3 Patch Quality. How do Learn2Fix auto-patches compare to
auto-patches using the manually constructed training test suite?
Experiments
46. • Test-driven Automated Program Repair (APR)
• Given a failing test suite, change the program such that all test cases pass.
• Huh? What about overfitting?
• The quality of the auto-repair strictly depends on the quality of the test suite.
• Huh? What if no test suite is available at all?
• We have no automated oracle that judges whether the bug is observed.
• We do have access to the bug-reporting user, however.
Test-driven APR without the tests?
👩‍💻 https://github.com/mboehme/learn2fix