Human-In-The-Loop Automatic Program Repair

Marcel Böhme, Monash University
Van-Thuan Pham, University of Melbourne
Charaka Gheetal, Monash University
Test-driven APR without the tests?

• Test-driven Automated Program Repair (APR)

• Given a failing test suite, change the program such that all test cases pass.

1. Generate a repair candidate.

• Use test cases to localize potential repair sites.

• Change one or more repair sites to generate a repair candidate.

2. Validate the repair candidate.

• Use test cases to measure the “quality” of the repair candidate.

• If more tests pass, keep the repair candidate and go back to step 1.
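The generate-and-validate loop above can be sketched in a few lines of Python. This is only an illustrative toy, not the implementation of any particular APR tool: the `repair` function, the fixed pool of candidates, and the `abs`-like subject program are all assumptions made up for this example.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Program = Callable[[int], int]
Test = Tuple[int, int]  # (input, expected output)


def passing(program: Program, tests: List[Test]) -> int:
    """Count the tests for which the program produces the expected output."""
    return sum(1 for x, expected in tests if program(x) == expected)


def repair(buggy: Program, candidates: Iterable[Program],
           tests: List[Test]) -> Optional[Program]:
    """Greedy generate-and-validate loop over a fixed pool of candidates."""
    best, best_score = buggy, passing(buggy, tests)
    for candidate in candidates:              # 1. generate a repair candidate
        score = passing(candidate, tests)     # 2. validate it against the tests
        if score > best_score:                # more tests pass: keep it
            best, best_score = candidate, score
        if best_score == len(tests):          # the whole suite passes: done
            return best
    return None  # no candidate made the whole suite pass


# Hypothetical subject: the intended behaviour is abs(x),
# but the buggy version forgets to negate negative inputs.
tests = [(3, 3), (0, 0), (-2, 2)]
buggy = lambda x: x
pool = [lambda x: x + 1, lambda x: -x, lambda x: x if x >= 0 else -x]
patched = repair(buggy, pool, tests)
```

Note that `repair` accepts any candidate that passes `tests`; nothing in the loop checks behaviour on inputs outside the suite.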

• Huh? What about overfitting?

• The quality of the auto-repair strictly depends on the quality of the test suite.

• Huh? What if no test suite is available, at all?

• We have no automated oracle that judges whether the bug is observed.

• We do have access to the bug-reporting user, however.

Learn2Fix Overview

Learn2Fix negotiates with the bug-reporting user the condition under which the bug is observed.

• When a budget of queries is exhausted, it attempts to repair the bug.

• By systematic queries, Learn2Fix learns an automated bug oracle that
becomes increasingly more accurate in predicting the user’s response.
A Learn2Fix query: “When executing this alternative test input, the program produces the following output; is the bug observed?”
A learned automated bug oracle: “The bug is observed only if the input is an equilateral triangle with sides of unit-length.”

• Hypothesis: A machine that classifies more accurately which inputs are bug-revealing can also auto-generate repairs of better quality.
Our key challenge is to maximize the oracle’s accuracy given only a small budget of queries.
Learn2Fix Example

Bug: classify(2,2,2) returns 2 while we expect 1.
The program returns an unexpected output for all equilateral triangles except [1, 1, 1], and for all isosceles triangles where c == 1.
Our automated oracle:

[Figure: the automated oracle learned for the classify example]
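The slide's source code is not part of this transcript. As a stand-in, here is a minimal sketch of a triangle classifier with a fault that reproduces the behaviour described above for positive integer sides. Only the return codes 1 and 2 and the call `classify(2,2,2)` come from the slides; the codes 3 and 4, the function names, and the specific faulty condition (`b * c == a` in place of `b == c`) are assumptions chosen for illustration.

```python
def is_invalid(a: int, b: int, c: int) -> bool:
    """Non-positive sides or a violated triangle inequality."""
    return (a <= 0 or b <= 0 or c <= 0
            or a + b <= c or a + c <= b or b + c <= a)


def classify_golden(a: int, b: int, c: int) -> int:
    """Assumed codes: 1 = equilateral, 2 = isosceles, 3 = scalene, 4 = invalid."""
    if is_invalid(a, b, c):
        return 4
    if a == b and b == c:
        return 1
    if a == b or b == c or a == c:
        return 2
    return 3


def classify_buggy(a: int, b: int, c: int) -> int:
    """Hypothetical fault: the equilateral check tests b * c == a, not b == c."""
    if is_invalid(a, b, c):
        return 4
    if a == b and b * c == a:   # should be: a == b and b == c
        return 1
    if a == b or b == c or a == c:
        return 2
    return 3


def bug_observed(a: int, b: int, c: int) -> bool:
    """Ground truth a user would give: does the output differ from the expected one?"""
    return classify_buggy(a, b, c) != classify_golden(a, b, c)
```

With this fault, `classify_buggy(2, 2, 2)` returns 2 instead of 1, `(1, 1, 1)` is still classified correctly, and an isosceles input such as `(3, 3, 1)` is wrongly reported as equilateral.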
Learn2Fix Ingredients

• Mutational fuzzing to generate more test cases in the “neighborhood” of the failing test case.

• Active learning to construct an SMT(LRA) constraint that is satisfied only by test cases the user would classify as failing.

• Test-driven APR to auto-generate the repair from the labeled test cases.
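To make the first ingredient concrete, the following is a rough sketch of mutational fuzzing around a single failing input. The slides do not say which mutation operators Learn2Fix uses, so the operator here (nudging one numeric component by a small delta) and the names `mutate` and `neighborhood` are assumptions for illustration only.

```python
import random
from typing import List, Tuple

Inputs = Tuple[int, ...]


def mutate(seed: Inputs, rng: random.Random, max_delta: int = 2) -> Inputs:
    """Copy `seed` and nudge one randomly chosen component by a small non-zero delta."""
    index = rng.randrange(len(seed))
    deltas = [d for d in range(-max_delta, max_delta + 1) if d != 0]
    mutant = list(seed)
    mutant[index] += rng.choice(deltas)
    return tuple(mutant)


def neighborhood(seed: Inputs, count: int, rng: random.Random) -> List[Inputs]:
    """Generate `count` mutants; parents are drawn from the seed and earlier mutants."""
    pool = [seed]
    while len(pool) < count + 1:
        parent = rng.choice(pool)
        pool.append(mutate(parent, rng))
    return pool[1:]  # drop the original seed


rng = random.Random(0)                       # fixed seed for reproducibility
generated = neighborhood((2, 2, 2), 10, rng)
```

Drawing parents from earlier mutants lets the generated inputs drift gradually away from the failing test while most of them stay close to it.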
Learn2Fix: Active Learning

• A test is a tuple: <input vector, output vector>.

• A labeled test is a user-classified test; labeled either passing or failing.

• Given a set of labeled tests, INCAL learns an SMT(LRA) constraint that is satisfied by all failing tests but no passing tests.

• Learn2Fix extends INCAL’s passive learning into an active learning approach.

• The learner (Learn2Fix) queries the teacher (user) about the labels of the most informative samples (tests).
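The requirement "satisfied by all failing tests but no passing tests" can be illustrated with a small checker. This sketch does not use INCAL or an SMT solver and does not reflect INCAL's interface; it simply represents a constraint as a disjunction of conjunctions of linear inequalities, which is one shape a formula over linear real arithmetic (LRA) can take. All type and function names are made up for this example.

```python
from typing import Dict, List, Tuple

# A linear atom: sum(coeffs[v] * point[v]) <op> bound, with op in {"<=", "<"}.
Atom = Tuple[Dict[str, float], str, float]
Conjunction = List[Atom]
Formula = List[Conjunction]          # disjunction of conjunctions
Point = Dict[str, float]             # input and output variables of one test


def holds(atom: Atom, point: Point) -> bool:
    coeffs, op, bound = atom
    value = sum(c * point[v] for v, c in coeffs.items())
    return value <= bound if op == "<=" else value < bound


def satisfies(formula: Formula, point: Point) -> bool:
    return any(all(holds(atom, point) for atom in conj) for conj in formula)


def consistent(formula: Formula, labeled: List[Tuple[Point, bool]]) -> bool:
    """True iff the formula holds for every failing test and for no passing test."""
    return all(satisfies(formula, point) == failing for point, failing in labeled)


# Candidate oracle: a == b and b == c and a > 1 (each equality as two inequalities).
candidate: Formula = [[
    ({"a": 1, "b": -1}, "<=", 0), ({"a": -1, "b": 1}, "<=", 0),
    ({"b": 1, "c": -1}, "<=", 0), ({"b": -1, "c": 1}, "<=", 0),
    ({"a": -1}, "<", -1),
]]
labeled = [
    ({"a": 2, "b": 2, "c": 2, "out": 2}, True),    # failing
    ({"a": 1, "b": 1, "c": 1, "out": 1}, False),   # passing
    ({"a": 3, "b": 4, "c": 5, "out": 3}, False),   # passing
]
```

Here `consistent(candidate, labeled)` holds, but adding a failing test that the candidate does not cover (for instance an isosceles input with `c == 1`) makes it inconsistent, which is the signal that the constraint has to be relearned.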
• How to maximise oracle accuracy given a limited budget of queries?

• Learn2Fix maximizes the probability that the user labels failing tests.

• To reduce labelling effort: the user can’t label every generated test case.

• To address the class imbalance problem: Learn2Fix better learns to identify failing test cases.
• How do we know failure probability before we even ask the user? Our automated oracle is only binary: the bug is observed, or it is not.

• From our oracle, we construct an unbiased committee of classifiers.

• Given a test t, we estimate its failure probability as the proportion of classifiers that label t as failing.

• If more than 50% of the classifiers rate t as failing, then present t to the user for labeling.
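The committee vote described above amounts to a simple proportion and a threshold. The sketch below shows only that step; how Learn2Fix derives its committee from the learned oracle is not covered on the slides, so the four hand-written member classifiers are placeholders invented for this example.

```python
from typing import Callable, Sequence, Tuple

Test = Tuple[int, ...]
Classifier = Callable[[Test], bool]   # True means "labels the test as failing"


def failure_probability(test: Test, committee: Sequence[Classifier]) -> float:
    """Proportion of committee members that label `test` as failing."""
    votes = sum(1 for member in committee if member(test))
    return votes / len(committee)


def should_ask_user(test: Test, committee: Sequence[Classifier],
                    threshold: float = 0.5) -> bool:
    """Present the test to the user only if more than `threshold` of members vote failing."""
    return failure_probability(test, committee) > threshold


# Placeholder committee: four slightly different guesses at the failure condition.
committee = [
    lambda t: t[0] == t[1] == t[2],
    lambda t: t[0] == t[1] and t[2] > 1,
    lambda t: t[0] == t[1] == t[2] and t[0] > 1,
    lambda t: t[1] == t[2],
]
```

For the input `(2, 2, 2)` all four members vote failing, so the test would be shown to the user; for `(1, 1, 1)` exactly half do, which does not exceed the 50% threshold.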
[Figure: Active oracle learning in Learn2Fix. Should the user label? Ask the oracle, or ask the committee of oracles.]

Experiment Setup

• Our repair benchmark: Codeflaws

• 3902 C programs from programming contests of 1653 users @ Codeforces

• Selected 552 programs that contain at least one failing test and only take numeric inputs.

• Each program comes with

• a buggy version and a golden version where the bug is fixed,

• manually written training test cases, and

• manually written heldout test cases.

• Methodology:

• As input for Learn2Fix, select a random failing test case from training set.

• To label our generated tests, use the golden version (instead of the user).

• Experiment parameters:

• 10 to 30 labels maximally allowed (query budget).

• 20 members in the committee of classifiers (committee size).

• 10 minutes maximal repair time (APR timeout).

• 30 repetitions of each experiment (#trials).

• Reproduce our experiments!
• All artefacts @ https://github.com/mboehme/learn2fix
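The methodology of replacing the user with the golden version can be sketched as follows. This is a simplified illustration, not the scripts from the repository above: the function name, the toy `max` subject, and the choice to label every generated test until the budget runs out (rather than only tests selected by the committee) are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Inputs = Tuple[int, ...]
LabeledTest = Tuple[Inputs, int, bool]   # (inputs, observed output, is_failing)


def label_with_golden(buggy: Callable[..., int], golden: Callable[..., int],
                      generated: Sequence[Inputs], budget: int) -> List[LabeledTest]:
    """Simulate the user: a test fails iff the buggy and golden outputs differ."""
    labeled = []
    for inputs in generated[:budget]:        # stop once the query budget is spent
        observed = buggy(*inputs)
        labeled.append((inputs, observed, observed != golden(*inputs)))
    return labeled


# Toy subject: the golden version returns the maximum, the buggy one always returns a.
golden = lambda a, b: max(a, b)
buggy = lambda a, b: a
labels = label_with_golden(buggy, golden, [(1, 2), (3, 1), (2, 2)], budget=2)
```

With a budget of 2, only the first two generated inputs are labeled: `(1, 2)` as failing and `(3, 1)` as passing.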

Experiments

• RQ.1 Oracle Quality. How accurately does Learn2Fix’s learned automatic oracle label the repair benchmark’s (failing) tests?

• Learn2Fix has only ever seen a single failing test case from the manually labeled validation test cases.

• Yet, Learn2Fix’s automatic oracle is able to accurately predict the label of 75–84% of validation tests for the median subject.

• Even for the minority class (failing tests), at least 78% of failing validation tests are correctly labeled for the majority of subjects.
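The two quantities reported for RQ.1 can be computed per subject roughly as below. This is a sketch under the assumption that oracle quality is measured as plain label agreement on the validation tests; the slides do not spell out the exact formulas, and the function name and sample labels are invented.

```python
from typing import List, Tuple


def oracle_scores(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Return (overall accuracy, share of actually failing tests predicted as failing).

    True stands for "failing". The second value is NaN if no test actually fails.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    on_failing = [p for p, a in zip(predicted, actual) if a]
    accuracy = correct / len(actual)
    failing_share = sum(on_failing) / len(on_failing) if on_failing else float("nan")
    return accuracy, failing_share


# Invented labels for one subject: two of five validation tests actually fail.
actual = [True, False, False, False, True]
predicted = [True, False, True, False, False]
accuracy, failing_share = oracle_scores(predicted, actual)
```

Reporting the failing class separately matters because failing tests are the minority: an oracle that labels everything as passing would still score a high overall accuracy.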
• RQ.2 Human Effort. What is the proportion of generated test cases that are sent to the user for labelling?

• As the prediction accuracy of the automated oracle increases, the user is asked to label a smaller proportion of generated tests.

• Yet, the user is asked to label roughly the same proportion of actually failing test cases!

[Figure: Proportion of (failing) tests that the user is asked to label.]
• As the prediction accuracy of the automated oracle increases, the probability that the user is asked to label a test that is indeed failing also increases.

• Meanwhile, the probability to generate a test that is failing remains the same.

[Figure: Probability to generate / label a failing test across all subjects.]
• RQ.3 Patch Quality. How do Learn2Fix auto-patches compare to auto-patches using the manually constructed training test suite?

👩💻 https://github.com/mboehme/learn2fix

More Related Content

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 

Recently uploaded (20)

Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
Velocity and Acceleration PowerPoint.ppt
Velocity and Acceleration PowerPoint.pptVelocity and Acceleration PowerPoint.ppt
Velocity and Acceleration PowerPoint.ppt
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curve
 

Featured

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Human-In-The-Loop Automatic Program Repair

  • 1. Human-In-The-Loop 
 Automatic Program Repair Marcel Böhme Monash University Van-Thuan Pham
 University of Melbourne Charaka Gheetal Monash University
  • 2. • Test-driven Automated Program Repair (APR) • Given a failing test suite, change the program such that all test cases pass. Test-driven APR without the tests?
 1. Generate a repair candidate. • Use test cases to localize potential repair sites. • Change one or more repair sites to generate a repair candidate. 2. Validate repair candidate. • Use test cases to measure the “quality” of the repair candidate. • If more tests pass, use repair candidate and go back to 1.
  • 3. • Test-driven Automated Program Repair (APR) • Given a failing test suite, change the program such that all test cases pass. • Huh? What about overfitting? • The quality of the auto-repair strictly depends on the quality of the test suite. Test-driven APR without the tests?

  • 4. • Test-driven Automated Program Repair (APR) • Given a failing test suite, change the program such that all test cases pass. • Huh? What about overfitting? • The quality of the auto-repair strictly depends on the quality of the test suite. • Huh? What if no test suite is available, at all? Test-driven APR without the tests?

  • 5. • Test-driven Automated Program Repair (APR) • Given a failing test suite, change the program such that all test cases pass. • Huh? What about overfitting? • The quality of the auto-repair strictly depends on the quality of the test suite. • Huh? What if no test suite is available, at all? • We have no automated oracle that judges whether the bug is observed. • We do have access to the bug-reporting user, however. Test-driven APR without the tests?

  • 6. Learn2Fix Overview
 negotiates with the bug-reporting user the 
 condition under which the bug is observed.
  • 7. negotiates with the bug-reporting user the 
 condition under which the bug is observed. Learn2Fix Overview
 • When a budget of queries is exhausted, it attempts to repair the bug.
  • 8. negotiates with the bug-reporting user the 
 condition under which the bug is observed. Learn2Fix Overview
 • When a budget of queries is exhausted, it attempts to repair the bug. • By systematic queries, Learn2Fix learns an automated bug oracle that becomes increasingly more accurate in predicting the user’s response.
  • 9. negotiates with the bug-reporting user the 
 condition under which the bug is observed. Learn2Fix Overview
 • When a budget of queries is exhausted, it attempts to repair the bug. • By systematic queries, Learn2Fix learns an automated bug oracle that becomes increasingly more accurate in predicting the user’s response. “When executing this alternative test input, the program produces the following output; is the bug observed?” A learn2fix query.
  • 10. negotiates with the bug-reporting user the 
 condition under which the bug is observed. Learn2Fix Overview
 • When a budget of queries is exhausted, it attempts to repair the bug. • By systematic queries, Learn2Fix learns an automated bug oracle that becomes increasingly more accurate in predicting the user’s response. “The bug is observed only if the input is an equilateral triangle with sides of unit-length.” A learned automated bug oracle.
  • 11. negotiates with the bug-reporting user the 
 condition under which the bug is observed. Learn2Fix Overview
 • When a budget of queries is exhausted, it attempts to repair the bug. • By systematic queries, Learn2Fix learns an automated bug oracle that becomes increasingly more accurate in predicting the user’s response. • Hypothesis: A machine that classifies more accurately which inputs
 are bug-revealing, can also auto-generate repairs of better quality.
  • 12. • When a budget of queries is exhausted, it attempts to repair the bug. • By systematic queries, Learn2Fix learns an automated bug oracle that becomes increasingly more accurate in predicting the user’s response. • Hypothesis: A machine that classifies more accurately which inputs
 are bug-revealing, can also auto-generate repairs of better quality. negotiates with the bug-reporting user the 
 condition under which the bug is observed. Learn2Fix Overview
 Our key challenge is to maximize the oracle’s accuracy given only a small budget of queries.
  • 14. Learn2Fix Example
 Bug: classify(2,2,2) returns 2 while we expect 1.
  • 15. Learn2Fix Example
 Returns an unexpected output 
 for all equilateral triangles except [1, 1, 1], and 
 for all isosceles triangles where c == 1. Bug: classify(2,2,2) returns 2 while we expect 1.
  • 16. Learn2Fix Example
 Our automated oracle: Bug: classify(2,2,2) returns 2 while we expect 1.
  • 17. Learn2Fix Example
 Our automated oracle: Bug: classify(2,2,2) returns 2 while we expect 1.
  • 18. Learn2Fix Ingredients
 • Mutational fuzzing to generate more test cases 
 in the “neighborhood” of the failing test case. • Active learning to construct an SMT(LRA) constraint 
 that is satisfied only by test cases the user would classify as failing. • Test-driven APR to auto-generate the repair from the labeled test cases.
  • 19. Learn2Fix Ingredients
 • Mutational fuzzing to generate more test cases 
 in the “neighborhood” of the failing test case. • Active learning to construct an SMT(LRA) constraint 
 that is satisfied only by test cases the user would classify as failing. • Test-driven APR to auto-generate the repair from the labeled test cases.
  • 20. Learn2Fix: Active Learning
 • A test is a tuple: <input vector, output vector>. • A labeled test is a user-classified test; labeled either passing or failing. • Given a set of labeled tests, INCAL learns an SMT(LRA) constraint that is satisfied by all failing tests but no passing tests.
  • 21. Learn2Fix: Active Learning
 • A test is a tuple: <input vector, output vector>. • A labeled test is a user-classified test; labeled either passing or failing. • Given a set of labeled tests, INCAL learns an SMT(LRA) constraint that is satisfied by all failing tests but no passing tests. • Learn2Fix extends INCAL’s passive learning into an active learning approach. • The learner (Learn2Fix) queries the teacher (user)
 about the labels of the most informative samples (tests).
  • 22. Learn2Fix: Active Learning
 • How to maximise oracle accuracy given a limited budget of queries?
  • 23. Learn2Fix: Active Learning
 • How to maximise oracle accuracy given a limited budget of queries? • Learn2Fix maximizes the probability that the user labels failing tests. • To reduce labelling effort. • The user can’t label every generated test case. • To address the class imbalance problem. • Learn2Fix better learns to identify failing tests cases.
  • 25. Learn2Fix: Active Learning
 • How to maximise oracle accuracy given a limited budget of queries?
 • Learn2Fix maximizes the probability that the user labels failing tests.
 • How do we know failure probability before we even ask the user? Our automated oracle is only binary: the bug is observed, or it is not.
 • From our oracle, we construct an unbiased committee of classifiers.
 • Given a test t, we estimate its failure probability as the proportion of classifiers that label t as failing.
 • If more than 50% of the classifiers rate t as failing, then present t to the user for labeling.
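 The voting rule above can be sketched in a few lines. How Learn2Fix derives the committee from its oracle is not described on this slide, so the four hand-written classifiers below are placeholders, and `failure_probability` and `should_query_user` are hypothetical names.

```python
def failure_probability(test, committee):
    """Proportion of committee members that label the test as failing."""
    votes = sum(1 for classifier in committee if classifier(test))
    return votes / len(committee)

def should_query_user(test, committee, threshold=0.5):
    """Present the test to the user only if more than `threshold` of the members vote failing."""
    return failure_probability(test, committee) > threshold

# Placeholder committee over three-element input vectors (an assumption, not the real one).
committee = [
    lambda t: t[0] == t[1],
    lambda t: t[1] == t[2],
    lambda t: t[0] == t[2],
    lambda t: t[0] == t[1] == t[2],
]

print(failure_probability((2, 2, 2), committee))  # all four members vote failing
print(should_query_user((2, 2, 3), committee))    # only one of four votes failing
```

 The comparison is strict, so a test on which exactly half of the members vote failing is not sent to the user.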
  • 31. Experiment Setup
 • Our repair benchmark: Codeflaws
 • 3902 C programs from programming contests of 1653 users @ Codeforces
 • Selected 552 programs that contain at least one failing test and only take numeric inputs.
 • Each program comes with
   • a buggy version and a golden version where the bug is fixed,
   • manually written training test cases, and
   • manually written heldout test cases.

  • 32. Experiment Setup
 • Methodology:
   • As input for Learn2Fix, select a random failing test case from the training set.
   • To label our generated tests, use the golden version (instead of the user).
 • Experiment parameters:
   • 10 to 30 labels maximally allowed (query budget).
   • 20 members in the committee of classifiers (committee size).
   • 10 minutes maximal repair time (APR timeout).
   • 30 repetitions of each experiment (#trials).
 • Reproduce our experiments!
 • All artefacts @ https://github.com/mboehme/learn2fix
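 The labelling step of this methodology can be sketched as a simulated user backed by the golden version. The slide does not spell out the comparison rule; the sketch assumes a test is failing when the buggy program's observed output differs from the golden output. `SimulatedUser`, `toy_golden`, and `toy_buggy` are hypothetical stand-ins, not Codeflaws programs.

```python
class SimulatedUser:
    """Answers label queries with the golden program until the query budget runs out."""

    def __init__(self, golden, budget):
        self.golden = golden
        self.remaining = budget

    def label(self, inputs, observed_output):
        """Return True (failing) if the observed output differs from the golden output."""
        if self.remaining <= 0:
            raise RuntimeError("query budget exhausted")
        self.remaining -= 1
        return self.golden(*inputs) != observed_output

# Toy program pair for illustration only.
def toy_golden(x):
    return abs(x)

def toy_buggy(x):
    return x  # wrong for negative inputs

user = SimulatedUser(toy_golden, budget=10)  # 10 is the low end of the stated budget range
print(user.label((-3,), toy_buggy(-3)))      # outputs differ, so the test is failing
```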

  • 36. Experiments
 • RQ.1 Oracle Quality. How accurately does Learn2Fix’s learned automatic oracle label the repair benchmark’s (failing) tests?
 • Learn2Fix has only ever seen a single failing test case from the manually labeled validation test cases.
 • Yet, Learn2Fix’s automatic oracle is able to accurately predict the label of 75–84% of validation tests for the median subject.
 • Even for the minority class (failing tests), at least 78% of failing validation tests are correctly labeled for the majority of subjects.
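 The two quantities reported for RQ.1 are overall label accuracy and accuracy on the failing (minority) class. A minimal sketch of how one might compute them for a single subject follows. The function names are hypothetical, and the label vectors are toy data, not experimental results.

```python
def accuracy(predicted, actual):
    """Fraction of tests whose predicted label matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def failing_recall(predicted, actual):
    """Fraction of truly failing tests that the oracle also labels failing.

    Assumes at least one test in `actual` is failing.
    """
    on_failing = [p for p, a in zip(predicted, actual) if a]
    return sum(on_failing) / len(on_failing)

# Toy labels (True = failing); purely illustrative.
actual    = [True, False, False, True, False]
predicted = [True, False, True,  False, False]

print(accuracy(predicted, actual))
print(failing_recall(predicted, actual))
```

 Reporting both numbers matters under class imbalance: an oracle that labels every test as passing can score a high overall accuracy while recognising no failing test at all.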
  • 39. Experiments
 • RQ.2 Human Effort. What is the proportion of generated test cases that are sent to the user for labelling?
 • As the prediction accuracy of the automated oracle increases, the user is asked to label a smaller proportion of generated tests.
 • Yet, the user is asked to label roughly the same proportion of actually failing test cases!
 Figure: Proportion of (failing) tests that the user is asked to label.
  • 42. Experiments
 • RQ.2 Human Effort. What is the proportion of generated test cases that are sent to the user for labelling?
 • As the prediction accuracy of the automated oracle increases, the probability that the user is asked to label a test that is indeed failing also increases.
 • Meanwhile, the probability of generating a failing test remains the same.
 Figure: Probability to generate / label a failing test across all subjects.
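 The RQ.2 quantities can be made concrete with a small sketch that derives them from a log of generated tests. The log format (one `(was_queried, is_failing)` pair per generated test) and the name `effort_metrics` are assumptions, and the log below is toy data rather than a measurement.

```python
def effort_metrics(log):
    """Compute human-effort metrics from (was_queried, is_failing) pairs.

    Assumes at least one test was generated and at least one was queried.
    """
    generated = len(log)
    queried = [is_failing for was_queried, is_failing in log if was_queried]
    return {
        # Share of generated tests that were sent to the user.
        "proportion_queried": len(queried) / generated,
        # Probability that a generated test is failing.
        "p_generated_failing": sum(f for _, f in log) / generated,
        # Probability that a test shown to the user is failing.
        "p_labeled_failing": sum(queried) / len(queried),
    }

# Toy log of eight generated tests; purely illustrative.
log = [
    (True, True), (True, False), (False, False), (False, False),
    (False, True), (False, False), (False, False), (False, False),
]
metrics = effort_metrics(log)
print(metrics)
```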
  • 43. Experiments
 • RQ.3 Patch Quality. How do Learn2Fix auto-patches compare to auto-patches using the manually constructed training test suite?

  • 46. Test-driven APR without the tests?
 • Test-driven Automated Program Repair (APR)
 • Given a failing test suite, change the program such that all test cases pass.
 • Huh? What about overfitting?
 • The quality of the auto-repair strictly depends on the quality of the test suite.
 • Huh? What if no test suite is available, at all?
 • We have no automated oracle that judges whether the bug is observed.
 • We do have access to the bug-reporting user, however.
 Learn2Fix Overview
 • Learn2Fix negotiates with the bug-reporting user the condition under which the bug is observed.
 • When a budget of queries is exhausted, it attempts to repair the bug.
 • By systematic queries, Learn2Fix learns an automated bug oracle that becomes increasingly accurate in predicting the user’s response.
 • Hypothesis: A machine that classifies more accurately which inputs are bug-revealing can also auto-generate repairs of better quality.
 Learn2Fix: Active Learning
 • How to maximise oracle accuracy given a limited budget of queries?
 • Learn2Fix maximizes the probability that the user labels failing tests.
 • How do we know failure probability before we even ask the user? Our automated oracle is only binary: the bug is observed, or it is not.
 • From our oracle, we construct an unbiased committee of classifiers.
 • Given a test t, we estimate its failure probability as the proportion of classifiers that label t as failing.
 • If more than 50% of the classifiers rate t as failing, then present t to the user for labeling.
 Experiments
 • RQ.1 Oracle Quality. How accurately does Learn2Fix’s learned automatic oracle label the repair benchmark’s (failing) tests?
 • Learn2Fix has only ever seen a single failing test case from the manually labeled validation test cases.
 • Yet, Learn2Fix’s automatic oracle is able to accurately predict the label of 75–84% of validation tests for the median subject.
 • Even for the minority class (failing tests), at least 78% of failing validation tests are correctly labeled for the majority of subjects.
 https://github.com/mboehme/learn2fix