
You Cannot Fix What You Cannot Find! --- An Investigation of Fault Localization Bias in Benchmarking Automated Program Repair Systems


Download to read offline

Properly benchmarking Automated Program Repair (APR) systems should contribute to the development and adoption of the research outputs by practitioners. To that end, the research community must ensure that it reaches significant milestones by reliably comparing state-of-the-art tools for a better understanding of their strengths and weaknesses. In this work, we identify and investigate a practical bias caused by the fault localization (FL) step in a repair pipeline. We propose to highlight the different fault localization configurations used in the literature, and their impact on APR systems when applied to the Defects4J benchmark. Then, we explore the performance variations that can be achieved by “tweaking” the FL step. Eventually, we expect to create a new momentum for (1) full disclosure of APR experimental procedures with respect to FL, (2) realistic expectations of repairing bugs in Defects4J, as well as (3) reliable performance comparison among the state-of-the-art APR systems, and against the baseline performance results of our thoroughly assessed kPAR repair tool. Our main findings include: (a) only a subset of Defects4J bugs can be currently localized by commonly-used FL techniques; (b) current practice of comparing state-of-the-art APR systems (i.e., counting the number of fixed bugs) is potentially misleading due to the bias of FL configurations; and (c) APR authors do not properly qualify their performance achievement with respect to the different tuning parameters implemented in APR systems.



  1. You Cannot Fix What You Cannot Find! An Investigation of Fault Localization Bias in Benchmarking Automated Program Repair Systems. Kui Liu, Anil Koyuncu, Tegawendé F. Bissyandé, Dongsun Kim, Jacques Klein and Yves Le Traon. SnT, University of Luxembourg, Luxembourg. 12th ICST 2019, Xi'an, China, April 24, 2019.
  2. > CONTEXT
  3. > Automated Program Repair (APR) Basic Process. [Pipeline figure: a buggy program with its passing and failing tests goes through Fault Localization (FL tools produce a ranked list of suspicious code locations), then Patch Generation (the APR approach produces patch candidates), then Patch Validation against the tests; a candidate for which all tests pass is a plausible patch.]
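As an illustration of the pipeline sketched above, here is a minimal generate-and-validate loop in Python. It is not taken from the paper; `localize`, `mutate`, and `run_tests` are placeholder callables standing in for an FL tool, a patch-generation strategy, and the project's test suite.

```python
# Minimal sketch of a generate-and-validate APR loop (illustrative only).
def repair(buggy_program, tests, localize, mutate, run_tests):
    """localize  -> ranked list of (file, line, suspiciousness), most suspicious first
       mutate    -> yields patch candidates for a given suspicious location
       run_tests -> True if all tests (including the previously failing ones) pass"""
    suspicious = localize(buggy_program, tests)
    for file, line, score in suspicious:
        for candidate in mutate(buggy_program, file, line):
            if run_tests(candidate, tests):
                return candidate        # first plausible patch found
    return None                         # no plausible patch within the search space
```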
  4. > Repair Performance Assessment of APR Tools: GenProg, ACS, CapGen, SimFix, etc.
  5. > Repair Performance Assessment of APR Tools
TABLE I. Table excerpted from [33] with the caption “Correct patches generated by different techniques”.
Defects4J project | SimFix | jGP | jKali | Nopol | ACS | HDR | ssFix | ELIXIR | JAID
Chart | 4 | 0 | 0 | 1 | 2 | -(2) | 3 | 4 | 2(4)
Closure | 6 | 0 | 0 | 0 | 0 | -(7) | 2 | 0 | 5(9)
Math | 14 | 5 | 1 | 1 | 12 | -(7) | 10 | 12 | 1(7)
Lang | 9 | 0 | 0 | 3 | 3 | -(6) | 5 | 8 | 1(5)
Time | 1 | 0 | 0 | 0 | 1 | -(1) | 0 | 2 | 0(0)
Total | 34 | 5 | 1 | 5 | 18 | 13(23) | 20 | 26 | 9(25)
*The numbers in parentheses denote the number of bugs fixed by APR tools when ignoring the patch ranking.
Is this kind of comparison fair enough?
  6. > PROBLEM
  7. > Automated Program Repair (APR) Basic Process. [Same pipeline figure, annotated: the input (buggy program and tests) is the same across tools, patch validation is the same, patch generation is each tool's contribution, and fault localization receives no discussion.] There is no disclosed discussion of the impact of fault localization on APR tool repair performance.
  8. > Repair Performance Assessment of APR Tools. [Same Table I as above, excerpted from [33].] BIASED: if the fault localization configurations of the APR tools differ from each other, this kind of comparison could be biased or misleading.
  9. > FAULT LOCALIZATION
  10. > Fault Localization used in Automated Program Repair for Java. Spectrum-based fault localization, e.g.:
S_{Ochiai}(s) = \frac{failed(s)}{\sqrt{(failed(s) + passed(s)) \cdot (failed(s) + failed(\neg s))}}
S_{Tarantula}(s) = \frac{\frac{failed(s)}{failed(s) + failed(\neg s)}}{\frac{failed(s)}{failed(s) + failed(\neg s)} + \frac{passed(s)}{passed(s) + passed(\neg s)}}
S_{Barinel}(s) = 1 - \frac{passed(s)}{failed(s) + passed(s)}
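For concreteness, a small Python sketch of these three ranking metrics, assuming per-statement counts of passing and failing tests that do or do not execute the statement (the parameter names are my own, not GZoltar's API):

```python
import math

# failed_s / passed_s:   failing / passing tests that execute statement s
# failed_ns / passed_ns: failing / passing tests that do not execute s

def ochiai(failed_s, passed_s, failed_ns):
    denom = math.sqrt((failed_s + passed_s) * (failed_s + failed_ns))
    return failed_s / denom if denom else 0.0

def tarantula(failed_s, passed_s, failed_ns, passed_ns):
    fail_ratio = failed_s / (failed_s + failed_ns) if failed_s + failed_ns else 0.0
    pass_ratio = passed_s / (passed_s + passed_ns) if passed_s + passed_ns else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total else 0.0

def barinel(failed_s, passed_s):
    executed = failed_s + passed_s
    return 1.0 - passed_s / executed if executed else 0.0
```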
  11. > Fault Localization used in Automated Program Repair for Java: example of a ranked list of suspicious code locations.
Suspicious Class Name | Line | Suspiciousness Value
org.jfree.chart.plot.CategoryPlot | 1613 | 0.316227766
org.jfree.chart.plot.CategoryPlot | 1684 | 0.25
org.jfree.chart.plot.CategoryPlot | 1682 | 0.25
org.jfree.chart.plot.CategoryPlot | 1665 | 0.25
org.jfree.chart.plot.CategoryPlot | 1340 | 0.1825741858
org.jfree.chart.plot.CategoryPlot | 1339 | 0.1825741858
org.jfree.chart.plot.CategoryPlot | 1358 | 0.1507556723
org.jfree.chart.plot.CategoryPlot | 1367 | 0.1474419562
org.jfree.chart.plot.CategoryPlot | 1045 | 0.1443375673
org.jfree.chart.renderer.category.AbstractCategoryItemRenderer | 1798 | 0.115
org.jfree.chart.renderer.category.AbstractCategoryItemRenderer | 1797 | 0.115
org.jfree.chart.renderer.category.AbstractCategoryItemRenderer | 1796 | 0.115
org.jfree.data.category.DefaultCategoryDataset | 259 | 0.0733235575
  12. > Fault Localization Bias in APR. [Same APR pipeline figure as before: buggy program with passing and failing tests, Fault Localization producing a ranked list of suspicious code locations, Patch Generation producing patch candidates, and Patch Validation against the tests.]
  13. > Research Questions
• RQ1: How do APR systems leverage fault localization techniques?
• RQ2: How many bugs from a common APR benchmark are actually localizable?
• RQ3: To what extent can APR performance vary under different fault localization configurations?
  14. RQ1: Integration of Fault Localization Tools in APR Pipelines
  15. > FL Techniques used in APR tools. (Highlighted on the slide: HDRepair, ssFix, ACS, SimFix.)
APR Tool | FL framework | Framework version | FL ranking metric | Supplementary information
jGenProg | GZoltar | 0.1.1 | Ochiai | ∅
jKali | GZoltar | 0.1.1 | Ochiai | ∅
jMutRepair | GZoltar | 0.1.1 | Ochiai | ∅
HDRepair | ? | ? | ? | Faulty method is known.
Nopol | GZoltar | 0.0.10 | Ochiai | ∅
ACS | GZoltar | 0.1.1 | Ochiai | Predicate switching [53].
ELIXIR | ? | ? | Ochiai | ?
JAID | ? | ? | ? | ?
ssFix | GZoltar | 0.1.1 | ? | Statements in crashed stack trace.
CapGen | GZoltar | 0.1.1 | Ochiai | ?
SketchFix | ? | ? | Ochiai | ?
FixMiner | GZoltar | 0.1.1 | Ochiai | ∅
LSRepair | GZoltar | 0.1.1 | Ochiai | ∅
SimFix | GZoltar | 1.6.0 | Ochiai | Test case purification [54].
  16. > Take-away Message
1. State-of-the-art APR systems in the literature add adaptations to the usual FL process to improve its accuracy. Unfortunately, researchers have so far failed to disclose the contribution of this improvement to the overall repair performance, leading to biased comparisons.
2. APR systems do not fully disclose their fault localization tuning parameters, thus preventing reliable replication and comparison.
  17. RQ2: Localizability of Defects4J Bugs
  18. > Localizability: how many bugs in the Defects4J benchmark can actually be localized by current automated fault localization tools? Considered at three levels: File Level, Method Level, Line Level.
Example developer patch:
--- a/src/org/jfree/data/time/Week.java
+++ b/src/org/jfree/data/time/Week.java
@@ -173,1 +173,1 @@
 public Week(Date time, TimeZone zone) {
     // defer argument checking...
-    this(time, RegularTimePeriod.DEFAULT_TIME_ZONE, Locale.getDefault());
+    this(time, zone, Locale.getDefault());
 }
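A possible way to operationalize these three localizability levels, as an illustrative sketch rather than the paper's implementation (the tuple layout is an assumption):

```python
# Check whether a bug is "localizable" at file, method, and line granularity,
# given FL output and the ground-truth locations taken from the developer patch.
def localizability(suspicious, bug_locations):
    """suspicious:    ranked list of (file, method, line) reported by the FL tool
       bug_locations: set of (file, method, line) touched by the developer fix"""
    bug_files = {f for f, _, _ in bug_locations}
    bug_methods = {(f, m) for f, m, _ in bug_locations}
    return {
        "file":   any(f in bug_files for f, _, _ in suspicious),
        "method": any((f, m) in bug_methods for f, m, _ in suspicious),
        "line":   any(loc in bug_locations for loc in suspicious),
    }
```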
  19. > How many bugs in the Defects4J benchmark can actually be localized by current automated fault localization tools?
Number of Defects4J bugs localized with GZoltar + Ochiai (GZ-0.1.1 / GZ-1.6.0):
Project | # Bugs | File | Method | Line
Chart | 26 | 25 / 25 | 22 / 24 | 22 / 24
Closure | 133 | 113 / 128 | 78 / 96 | 78 / 95
Lang | 65 | 54 / 64 | 32 / 59 | 29 / 57
Math | 106 | 101 / 105 | 92 / 100 | 91 / 100
Mockito | 38 | 25 / 26 | 22 / 24 | 21 / 23
Time | 27 | 26 / 26 | 22 / 22 | 22 / 22
Total | 395 | 344 / 374 | 268 / 325 | 263 / 321
One third of the bugs in the Defects4J dataset cannot be localized at line level by the commonly used automated fault localization tool.
  20. > How many bugs in the Defects4J benchmark can actually be localized by current automated fault localization tools?
Number of bugs localized at Top-1 and Top-10 positions (File / Method / Line):
Ranking Metric | GZoltar-0.1.1 | GZoltar-1.6.0
Top-1 position:
Tarantula | 171 / 101 / 45 | 169 / 106 / 35
Ochiai | 173 / 102 / 45 | 172 / 111 / 38
DStar | 173 / 102 / 45 | 175 / 114 / 40
Barinel | 171 / 101 / 45 | 169 / 107 / 36
Opt2 | 175 / 97 / 39 | 179 / 115 / 39
Muse | 170 / 98 / 40 | 178 / 118 / 41
Jaccard | 173 / 102 / 45 | 171 / 112 / 39
Top-10 position:
Tarantula | 240 / 180 / 135 | 242 / 189 / 144
Ochiai | 244 / 184 / 140 | 242 / 191 / 145
DStar | 245 / 184 / 139 | 242 / 190 / 142
Barinel | 240 / 180 / 135 | 242 / 190 / 145
Opt2 | 237 / 168 / 128 | 239 / 184 / 135
Muse | 234 / 169 / 129 | 239 / 186 / 140
Jaccard | 245 / 184 / 139 | 241 / 188 / 142
Only a small fraction of the bugs can be localized at the top positions of the ranked list of suspicious locations.
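The Top-1/Top-10 figures above can be computed with a simple counting routine like the following sketch (the data layout is assumed, not taken from the paper's artifact):

```python
# Count bugs whose actual faulty location appears within the top-k suspicious
# locations reported by an FL tool.
def top_k_localized(fl_results, ground_truth, k=10):
    """fl_results:   {bug_id: ranked list of locations, most suspicious first}
       ground_truth: {bug_id: set of actual faulty locations}"""
    return sum(
        1
        for bug_id, ranked in fl_results.items()
        if any(loc in ground_truth.get(bug_id, set()) for loc in ranked[:k])
    )
```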
  21. > Impact of Effective Ranking in Fault Localization. APR tools are prone to correctly fix the subset of Defects4J bugs that can be accurately localized.
  22. > Take-away Message
1. One third of the bugs in the Defects4J dataset cannot be localized by the commonly used automated fault localization tool.
2. APR tools are prone to correctly fix the subset of Defects4J bugs that can be accurately localized.
  23. RQ3: Evaluation of kPAR with Specific Fault Localization Configurations
  24. > kPAR with Four Different Fault Localization Configurations
kPAR: a Java implementation of PAR (Kim et al., ICSE 2013) + GZoltar-0.1.1 + Ochiai.
Normal FL: relies on the ranked list of suspicious code locations reported by a given FL tool.
File Assumption: assumes that the faulty code files are known.
Method Assumption: assumes that the faulty methods are known.
Line Assumption: assumes that the faulty code lines are known; no fault localization is then used.
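The four configurations can be pictured as progressively restricting (or replacing) the FL ranked list with ground-truth knowledge. A hedged sketch, with names of my own choosing rather than kPAR's actual interface:

```python
# Restrict the FL output according to the chosen configuration.
def configure_fl(suspicious, faulty_files, faulty_methods, faulty_lines, mode):
    """suspicious: ranked list of (file, method, line) from the FL tool."""
    if mode == "normal":      # Normal FL: use the ranked list as reported
        return suspicious
    if mode == "file":        # File Assumption: keep only lines in the known faulty files
        return [s for s in suspicious if s[0] in faulty_files]
    if mode == "method":      # Method Assumption: keep only lines in the known faulty methods
        return [s for s in suspicious if (s[0], s[1]) in faulty_methods]
    if mode == "line":        # Line Assumption: feed the exact faulty lines, bypassing FL
        return list(faulty_lines)
    raise ValueError(f"unknown FL configuration: {mode}")
```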
  25. > Repair Performance of kPAR with Four FL Configurations
Number of Defects4J bugs fixed by kPAR with the four FL configurations (correctly fixed / plausibly fixed):
FL Conf. | Chart (C) | Closure (Cl) | Lang (L) | Math (M) | Mockito (Moc) | Time (T) | Total
Normal FL | 3/10 | 5/9 | 1/8 | 7/18 | 1/2 | 1/2 | 18/49
File Assumption | 4/7 | 6/13 | 1/8 | 7/15 | 2/2 | 2/3 | 22/48
Method Assumption | 4/6 | 7/16 | 1/7 | 7/15 | 2/2 | 2/3 | 23/49
Line Assumption | 7/8 | 11/16 | 4/9 | 9/16 | 2/2 | 3/4 | 36/55
[Figure from the slide, grouping correctly fixed bug IDs by configuration. Normal FL: C-1, 4, 7; L-59; Cl-2, 38, 62, 63, 73; M-15, 33, 58, 70, 75, 85, 89; Moc-38; T-7. File Assumption: C-26, Cl-4, Moc-29, T-19. Method Assumption: Cl-10. Line Assumption: C-8, 14, 19; Cl-31, 38, 40, 70; M-4, 82; T-26; L-6, 22, 24.]
With better fault localization results, kPAR can correctly fix more bugs.
  26. > Take-away Message
Accuracy of fault localization has a direct and substantial impact on the performance of APR repair pipelines.
  27. Discussion
  28. > Discussions
1. Full disclosure of FL parameters.
2. Qualification of APR performance.
3. Patch generation step vs. repair pipeline.
4. Sensitivity of the search space.
  29. > Summary. [Repeats the Fault Localization Bias in APR slide: the APR pipeline from the buggy program and its tests, through fault localization (ranked list of suspicious code locations), patch generation, and patch validation.]

Editor's Notes

  • Good afternoon, everyone. My name is Jacques Klein; I'm a professor at the University of Luxembourg.
    Today, I am going to present our recent work, entitled "You Cannot Fix What You Cannot Find!".


    The main outcome of this work is the demonstration of bias caused by fault localization in the APR pipeline.

    Before starting, let me introduce the authors of this paper.
  • Now, let’s take a look at the basic process of automated program repair.

    The first step is fault localization, which identifies the code positions that are likely to be buggy.

    The second step is patch generation, which mutates the buggy code to generate patch candidates with the proposed approach.

    The last step is patch validation, which checks whether a generated patch fixes the bug or not.
  • Then, how do we evaluate the repair performance of APR tools?

    To assess the repair performance of automated program repair tools,
    researchers proposed to compare APR tools, on a dataset with a limited number of bugs, by the number of bugs fixed with plausible patches and with correct patches, respectively.
  • For example, this table is excerpted from SimFix Paper, which shows the number of bugs in Defects4J correctly fixed by different APR tools.

    OK, here is a question: is this kind of comparison fair enough?
  • OK, let’s go back to the APR basic process.

    The inputs of the APR tools are the same, since they come from the same benchmark, Defects4J.

    The last step is patch validation, which is conducted by regression testing, so it is also the same across APR tools.

    The third step is patch generation, which is the key contribution of each APR tool: how to generate patches for bugs.

    Patch generation is highly dependent on the fault localization step, which identifies which code should be modified by the APR tool to generate patches.

    However, there is no disclosed discussion of the impact of fault localization on APR tool repair performance.
  • If the fault localization configurations of the APR tools differ from each other, this kind of comparison could be biased or misleading.
  • Why could the proposed comparison be biased if APR tools use different fault localization tools?

    Now, let’s take a look at the fault localization techniques used in automated program repair for Java programs.

    In state-of-the-art APR tools for Java programs, the widely used fault localization technique is spectrum-based fault localization, as implemented, for example, by the GZoltar framework.
    Spectrum-based fault localization relies on a ranking metric to calculate the suspiciousness value of each code statement.
    Here, we list three ranking metrics, Ochiai, Tarantula, and Barinel, used to calculate the suspiciousness value of each statement in the buggy program.
    In these formulas, lower case letter s denotes one statement, failed(s) denotes the number of failed test cases that execute this statement,
    passed(s) denotes the number of passed test cases that execute this statement.
    failed(¬s) denotes the number of failed test cases that do not execute this statement.
    passed(¬s) denotes the number of passed test cases that do not execute this statement.

    Ok, with this kind of spectrum-based fault localization, suspicious code positions of buggy programs are reported in this way.
    The first column lists the suspicious class name, the second column lists the suspicious line, and the last column lists the suspiciousness value.
    APR tools try to mutate each suspicious line one by one until the first plausible patch is generated.
    So, this suspicious code position list can affect the repair performance of APR tools.
  • So, if the fault localization techniques are different, APR tools can have different repair performance even if they use the same patch generation technique.
  • In this work, we investigate the fault localization bias in automated program repair systems with these three research questions.
  • To answer RQ1, we first investigate the FL techniques used by APR systems in the literature.
    This reveals which FL tool and ranking formula are integrated in each APR system.
  • This table illustrates the fault localization techniques used by APR tools.
    Most APR tools use the GZoltar framework, version 0.1.1, and the Ochiai ranking metric to identify bug positions.
    Unspecified or unconfirmed information for an APR tool is marked with ‘?’. If an APR tool does not use any supplementary information for FL, the corresponding table cell is marked with ‘∅’.

    Furthermore, HDRepair assumes that the faulty method is known to improve fault localization accuracy.
    ACS leverages predicate switching, ssFix uses statements in the crashed stack trace, and SimFix leverages test case purification to improve fault localization accuracy.
  • Now, let’s take a look at the second research question.
  • To answer the second research question, we define localizability at three levels: file, method, and line.

    File level: we consider that the faulty code is accurately localized if an FL tool reports at least one line from the buggy code file as suspicious.

    Method Level: we consider that the faulty code is accurately localized if at least one code line in the buggy method is reported by an FL tool as suspicious.

    Line level: we consider that the faulty code is accurately localized if suspicious code lines reported by an FL tool contain any of the buggy code lines.
  • This table shows the number of bugs that can be localized at the three granularities with two versions of GZoltar.

    We note that some bugs cannot be localized by GZoltar.
  • APR tools repair bugs by mutating the suspicious statements one by one, from the highest suspiciousness value to the lowest.

    If we only consider the suspicious statements with the highest suspiciousness value (Top-1) at line granularity, fewer than 50 bugs can be localized.
    If we enlarge the search space of suspicious statements to the Top-10 at line granularity, only about one third of the bugs can be localized.
  • To understand the effect of fault localization on automated program repair, we further investigate the localization ranking positions of fixed and unfixed bugs.

    If the value is closer to 1.0, the exact buggy statements rank higher among the suspicious statements reported by the fault localization technique.

    This figure shows that most fixed bugs can be localized more accurately than the unfixed bugs.


    where bug_pos refers to the position of the actual bug location in the ranked list of suspicious locations reported by the FL step. If the bug location is found at a higher position in the ranked list, the value of Reciprocal is closer to 1.
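    A sketch of the metric these notes describe (my reconstruction, assuming the standard reciprocal-rank definition; the exact formula is in the paper):
    \[ \mathit{Reciprocal} = \frac{1}{\mathit{bug\_pos}} \]
    so Reciprocal equals 1 when the actual bug location is ranked first.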
  • To what extent can APR performance vary under different fault localization configurations?
  • To answer this question, we implemented kPAR, a Java implementation of PAR, with four different fault localization configurations.

    File Assumption: it means that only the suspicious statements in the buggy code file are considered as the input of APR tools to generate patch candidates.

    Method Assumption: it means that only the suspicious statements in the buggy methods are considered as the input of APR tools.
  • In this table, each cell has two numbers: the left one is the number of bugs correctly fixed by kPAR,
    and the right one is the number of bugs plausibly fixed by kPAR.

    According to the results shown in this table, we note that, with better fault localization results, kPAR can correctly fix more bugs.

    Additionally, the bugs correctly fixed with less accurate fault localization results can also be correctly fixed with more accurate fault localization results.
  • Given that many APR systems do not release their source code, it is important that the experimental reports clearly indicate the protocol used for fault localization. (Preferably, authors should strive to assess their performance under a standard and replicable configuration of fault localization.)

    2. To ensure that novel approaches to APR are indeed improving over the state-of-the-art, authors must qualify the performance gain brought by the different ingredients of their approaches.

    3. There are two distinct directions of repair benchmarking that APR researchers should consider. In the first, a novel contribution to the patch generation problem must be assessed directly by assuming a perfect fault localization. In the second, for ensuring realistic assessment w.r.t. industry adoption, the full pipeline should be tested with no assumptions on fault localization accuracy.

    4. Given that fault localization produces a ranked list of suspicious locations, it is essential to characterize whether exact locations are strictly necessary for the APR approach to generate the correct patches. For example, an APR system may not focus only on a suspected location but on the context around this location. APR approaches may also use heuristics to curate the FL results.
  • Voilà.
    To conclude, I repeat the key message: to avoid biased assessments in the APR pipeline, take care of the FL step.
