Automatically Generated Patches as Debugging Aids: A Human Study (FSE 2014)

  1. Automatically Generated Patches as Debugging Aids: A Human Study Yida Tao, Jindae Kim, Sunghun Kim Dept. of CSE, The Hong Kong University of Science and Technology Chang Xu State Key Lab for Novel Software Technology, Nanjing University
  2. Automatic Program Repair • Promising research progress • ClearView[1]: prevented all 10 Firefox exploits • GenProg[2]: fixed 55/105 real bugs. [1] Automatically Patching Errors in Deployed Software. Perkins et al. SOSP’09. [2] A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each. Le Goues et al. ICSE’12
  3. Automatic Program Repair
  4. Automatic Program Repair. “It won't get your bug patched any quicker. You’ll just have shifted the coders' attention away from their own app's bugs, and onto the repair tool’s bugs.” Slashdot discussion: http://science.slashdot.org/story/09/10/29/2248246/Fixing-Bugs-But-Bypassing-the-Source-Code
  5. #what-could-possibly-go-wrong • Blackbox repair • Increasing maintenance cost • Vulnerable to attack. Sources: the Slashdot discussion above; A human study of patch maintainability. ISSTA’12; Automatic patch generation learned from human-written patches. ICSE’13
  6. #what-could-possibly-go-wrong #program-out-of-control • Blackbox repair • Increasing maintenance cost • Vulnerable to attack (same sources as above)
  7. Use automatically generated patches as debugging aids
  8. Our Human Study • Investigate the usefulness of generated patches as debugging aids • Discuss the impact of patch quality on debugging performance • Explore practitioners’ feedback on adopting automatic program repair
  9. Methodology
  10. Debugging aid → is given to → Participants → who debug → Bugs
  11. Debugging aid · Participants · Bugs (first: the debugging aids)
  12. Debugging aid type 1: Low-quality generated patch
  13. Debugging aid type 2: High-quality generated patch
  14. Debugging aid type 3 (baseline): Buggy method location
  15. 95 Participants • Grad: 44 CS graduate students • Engr: 28 industrial software engineers • MTurk: 23 Amazon Mechanical Turk workers
  16. Debugging aid · Participants · Bugs (next: the participants)
  17. 44 graduate students • Between-group design: 3 groups of 14, 15, and 15 students with similar programming experience
  18. Each group receives exactly one debugging aid: low-quality generated patch, high-quality generated patch, or buggy method location
  19. Onsite setting • Eclipse IDE • Supervised session
  20. Remote participants (28 Engr + 23 MTurk) • Within-group design
  21. Remote participants use an online debugging system
  22. Debugging aid · Participants · Bugs (next: the bugs)
  23. Bug Selection Criteria • Real bugs • The bug has accepted patches written by developers • A proper number of bugs • The bug has generated patches of different quality
  25. Automatic patch generation learned from human-written patches. Kim et al. ICSE’13. Two auto-generated patches for the same bug:
      Auto-generated patch A:
          for (int i = 0; i < parenCount; i++) {
              SubString sub = (SubString) parens.get(i);
              if (sub != null) {
                  args[i+1] = sub.toString();
              }
          }
      Auto-generated patch B:
          for (int i = 0; i < parenCount; i++) {
              SubString sub = (SubString) parens.get(i);
              args[parenCount+1] = new Integer(reImpl.leftContext.length);
          }
  26. Average acceptability ranking from 85 developers and students: patch A = 1.6, patch B = 2.8 (a lower rank means more acceptable)
  27. Patch A is therefore labeled the High-Quality Patch and patch B the Low-Quality Patch
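  To see concretely why patch A's null check matters: an unmatched capture group is represented as null, so converting it to a string without a check can throw a NullPointerException. Below is a self-contained sketch of this failure mode; it uses java.util.regex as a stand-in for Rhino's regex internals, which are not shown on the slide:

      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      // Demonstrates the failure mode patch A guards against: an
      // unmatched capture group is null, so calling toString() on it
      // without a null check would throw a NullPointerException.
      public class NullGroupDemo {
          public static void main(String[] args) {
              Matcher m = Pattern.compile("(a)|(b)").matcher("a");
              if (m.matches()) {
                  System.out.println(m.group(1)); // prints "a"
                  System.out.println(m.group(2)); // prints "null": group 2 did not match
              }
          }
      }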
  28. Debugging aid · Participants · Bugs (finally: the debugging outcome)
  29. Participants submitted 337 patches as their debugging outcome
  30. # of submitted patches w.r.t. debugging aid: Location 109, LowQ 112, HighQ 116
  31. # of submitted patches w.r.t. bugs: Bug1 66, Bug2 74, Bug3 59, Bug4 76, Bug5 62
  32. Evaluation of debugging performance
  33. Patch Correctness
  34. Patch Correctness • Passing test cases
  35. Patch Correctness • Passing test cases • Matching the semantics of the original accepted patches
  36. Patch Correctness • Passing test cases • Matching the semantics of the original accepted patches • Semantic matching judged by 3 evaluators
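  The "passing test cases" criterion can be pictured as an ordinary regression test run against each submitted patch. A minimal sketch assuming JUnit 4 is available; the class, method, and expected values are hypothetical stand-ins, not the study's actual test suite:

      import org.junit.Test;
      import static org.junit.Assert.assertEquals;

      // Hypothetical check: a submitted patch counts as "passing test
      // cases" only if tests like this succeed on the patched build;
      // semantic matching is then judged separately by the evaluators.
      public class SubmittedPatchTest {
          // Hypothetical stand-in for the method participants patched.
          static int compute(int a, int b) {
              return a * b;
          }

          @Test
          public void patchedMethodReturnsExpectedValue() {
              assertEquals(42, compute(6, 7));
          }
      }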
  37. Debugging Time • Measured with an Eclipse plug-in (onsite sessions) and a website timer (remote sessions)
  38. Independent variables that may affect correctness and debugging time • Debugging aids • Bugs • Participant types • Programming experience
  39. Multiple Regression Analysis • correctness = α0 + α1·x1 + α2·x2 + α3·x3 + α4·x4 • debugging time = β0 + β1·x1 + β2·x2 + β3·x3 + β4·x4, where x1-x4 encode the debugging aid, bug, participant type, and programming experience
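  As a concrete picture of what fitting such a model looks like in code, here is a minimal sketch using ordinary least squares, assuming Apache Commons Math 3 is on the classpath; the predictor encoding and the sample values are illustrative, not the study's data:

      import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;

      public class DebuggingRegression {
          public static void main(String[] args) {
              // Outcome: 1 = participant's patch judged correct, 0 = incorrect.
              double[] correctness = {1, 0, 1, 1, 0, 1, 0, 1};

              // Illustrative predictors per observation:
              // x1 = debugging aid (0 Location, 1 LowQ, 2 HighQ),
              // x2 = bug id, x3 = participant type, x4 = years of experience.
              double[][] predictors = {
                  {2, 1, 0, 5}, {1, 3, 1, 2}, {2, 2, 2, 6}, {0, 1, 0, 4},
                  {1, 3, 1, 1}, {2, 5, 2, 7}, {0, 4, 0, 3}, {2, 2, 1, 5},
              };

              OLSMultipleLinearRegression ols = new OLSMultipleLinearRegression();
              ols.newSampleData(correctness, predictors);

              // First value is the intercept (alpha_0), then alpha_1..alpha_4.
              double[] alpha = ols.estimateRegressionParameters();
              for (int i = 0; i < alpha.length; i++) {
                  System.out.printf("alpha_%d = %.3f%n", i, alpha[i]);
              }
          }
      }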
  40. Post-study Survey • Helpfulness of debugging aids • Difficulty of bugs • Opinions on using generated patches as debugging aids
  41. Results
  42. Finding 1: High-quality patches significantly improve debugging correctness
  43. % of correct patches by group: Location 48%, LowQ 33%, HighQ 71%
  44. Finding 1 regression result: positive coefficient = 1.25, p-value = 0.00 < 0.05
  45. Finding 2: Low-quality patches slightly undermine debugging correctness (LowQ 33% vs. Location 48%)
  46. Finding 2 regression result: negative coefficient = -0.55, p-value = 0.09 (not statistically significant)
  47. Finding 2, restated: Low-quality patches can undermine debugging correctness
  48. Finding 3: High-quality patches are more useful for difficult bugs
  49. Bug difficulty ratings from the survey: Bug1 Math-280, Bug2 Rhino-114493, Bug3 Rhino-192226 (rated the most difficult), Bug4 Rhino-217379, Bug5 Rhino-76683
  50. [Chart: % of correct patches per bug (Bug1-Bug5) for the Location, LowQ, and HighQ groups; for Bug3, the most difficult bug, only participants given the high-quality patch fixed it correctly]
  51. Finding 4: The type of debugging aid does not affect debugging time
  52. [Chart: debugging time in minutes (0-80) for the Location, LowQ, and HighQ groups]
  53. Finding 5: Other factors’ impact on debugging performance • Difficult bugs significantly slow down debugging • Engr and MTurk participants are more likely to debug correctly • Novices tend to benefit more from HighQ patches
  54. Finding 6: Participants consider high-quality generated patches much more helpful than low-quality patches • Helpfulness rated on a 5-point scale (Very helpful / Helpful / Medium / Slightly helpful / Not helpful) • Mann-Whitney U test: p-value = 0.001
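  For reference, a comparison like this can be reproduced with a standard Mann-Whitney U test. A minimal sketch assuming Apache Commons Math 3; the helpfulness ratings below (5 = very helpful … 1 = not helpful) are made up, not the study's survey data:

      import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

      public class HelpfulnessComparison {
          public static void main(String[] args) {
              // Illustrative 5-point helpfulness ratings for each aid type.
              double[] lowQ  = {2, 1, 3, 2, 2, 1, 3, 2};
              double[] highQ = {4, 5, 4, 3, 5, 4, 4, 5};

              MannWhitneyUTest test = new MannWhitneyUTest();
              // Two-sided p-value for the null hypothesis that both samples
              // come from the same distribution.
              double p = test.mannWhitneyUTest(lowQ, highQ);
              System.out.printf("Mann-Whitney U p-value = %.4f%n", p);
          }
      }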
  55. Feedback
  56. [image slide]
  57. Pros: a quick starting point • Points to the buggy area • Helps brainstorming. “They would seem to be useful in helping find various ideas around fixing the issue, even if the patch isn’t always correct on its own.”
  58. Cons: confusing, incomplete, misleading • Can give a wrong lead, especially for novices • Requires further refinement by human developers
  59. “Generated patches would be good at recognizing obvious problems” “…but may not recognize more involved defects.”
  60. “Generated patches simplify the problem” “…but they may over-simplify it by not addressing the root cause.”
  61. “I would use generated patches as debugging aids, as they provide extra diagnostic information”
  62. “…along with access to standard debugging tools.”
  63. Threats to Validity
  64. Threats to Validity • Bugs and generated patches may not be representative • The quality measure of generated patches may not generalize • Results may not generalize to domain experts • Participants might blindly reuse generated patches (mitigation: patches submitted in under 1 minute were removed)
  65. Takeaway • Auto-generated patches can be useful as debugging aids: participants fix bugs more correctly with them • Quality control is required: participants’ debugging correctness is compromised by low-quality generated patches • The benefits are greatest for difficult bugs and for novice developers

Editor's Notes

  1. This is a work with …
  2. Automatic program repair has been a very hot topic in recent years. We’ve seen quite promising research progress in this area. For example, Perkins et al. proposed ClearView, a self-defending software system, which successfully prevented all 10 Firefox exploits created by a red team and generated patches for 7 of them. As another successful example, Le Goues et al. proposed GenProg and used it to fix 55 out of 105 real bugs.
  3. However, there are also skeptics and worries about automatic program repair. Here is a quote from an online discussion.
  4. Here is a quote from an online discussion.
  5. Following this general concern, we’ve observed worries from the online community and the literature about things that could possibly go wrong with program repair techniques. For example, whether they create a sort of blackbox repair that hardly makes sense, whether they increase maintenance cost, and whether machine-generated patches are vulnerable to attack.
  6. In general, people are worried about whether a program, after being repaired automatically, still works as intended, or becomes unpredictable and out of control. Because of these concerns, direct deployment of automatic program repair seems problematic at this point. But can we still benefit from this technique?
  7. How about using ..? In this case, developers can refer to generated patches when they debug, but they don’t necessarily have to use them. In other words, they still take full control over the content of the patch. This sounds like a more comfortable usage scenario.
  8. Which is also the focus of our human study. First … And because some of the controversy around program repair comes from the quality of automatically generated patches, we also want to disc… Finally, we explore…
  9. Here is our methodology
  10. Which is actually quite intuitive. Basically, we conducted controlled experiments, where we give a certain type of debugging aid to participants, who use it to debug. Next, I’ll introduce these 3 parts in detail.
  11. First, we have 3 different types of debugging aids.
  12. And for the last type of debugging aid, we need some kind of baseline, because the first two debugging aids already suggest a candidate fix.
  13. For a fair comparison, the baseline, or control, group is given only the buggy method location as the debugging aid. This is common in practice, where developers typically know the general buggy area from bug reports or stack traces before they start to debug. Those are the 3 types of debugging aids we give to participants.
  14. We recruited 95 participants from a wide population, which includes 44 CS graduate students, 28 software engineers from industry, and 23 workers on Amazon Mechanical Turk, a crowdsourcing marketplace. Average years of experience: Grad: 4.1, Engr: 2.4 (1-10), MTurk: 5.7 (1-14).
  15. Now the question is how we assign debugging aids to participants.
  16. For the 44 graduate students, we adopt a between-group design by evenly dividing the students into 3 groups of similar programming experience.
  17. Each group is given only one of the debugging aids.
  18. These students use Eclipse to debug in a supervised session.
  19. For remote participants, namely the 28 engineers and 23 MTurk workers, it’s unlikely that we can determine their numbers and expertise beforehand, so a between-group design is not appropriate here if we want to ensure a fair group division. Instead, we adopt a within-group design, such that each participant can be exposed to different debugging aids. To balance the experimental conditions, whenever a participant selects a bug, we assign the type of debugging aid for that particular bug in a round-robin fashion, such that each aid is equally likely to be given for each bug.
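  A minimal sketch of such a round-robin assignment; the class and method names are hypothetical, since the study's actual implementation is not shown:

      import java.util.HashMap;
      import java.util.Map;

      // Hypothetical sketch: cycle through the three aid types per bug,
      // so each aid is equally likely to be given for each bug.
      public class AidAssigner {
          enum Aid { LOCATION, LOW_QUALITY_PATCH, HIGH_QUALITY_PATCH }

          // Next aid index to hand out, tracked separately for each bug.
          private final Map<String, Integer> nextIndex = new HashMap<>();

          public synchronized Aid assign(String bugId) {
              int i = nextIndex.getOrDefault(bugId, 0);
              nextIndex.put(bugId, (i + 1) % Aid.values().length);
              return Aid.values()[i];
          }
      }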
  20. We developed an online … for them to complete debugging tasks.
  21. Next, how do we select bugs?
  22. Accordingly, we selected all 5 bugs reported in this work…..
  23. For each of the 5 bugs, this work reported two patches generated by different program repair techniques,
  24. And they presented these different patches of the same bug to 85 … , and asked them to rank the patches based on the question, “which one is more acceptable?” In the end, that work reported this ranking of the different patches for the same bug.
  25. And, for the purpose of our human study, we label the better-ranked patch as the “high-quality patch”, and its peer patch for the same bug, with the worse ranking, as the “low-quality” patch.
  26. That’s basically how we design this debugging human study.
  27. In total, participants submit 337 patches ……
  28. Here is the # of submitted patches that were created with each of the debugging aids.
  29. And here is the # of submitted patches that were created for each bug. Our design basically ensures that these two distributions are well balanced.
  30. Next, I’ll describe how we evaluate participants’ debugging performance.
  31. First, we evaluate the correctness of participants’ submitted patches.
  32. A patch is labeled correct only if it passes our test cases
  33. … and match the …
  34. For this part we have 3 evaluators to check and discuss the semantic matching.
  35. We also measure participants’ debugging time by developing an Eclipse plug-in and a website timer to record the time they spend on each bug.
  36. Up to this point, several factors can affect debugging correctness and time. For example, the type of debugging aids, of course, and also bugs, participant types, and their expertise.
  37. So, we use multiple regression analysis to quantify the relation between these independent variables and the outcomes. That is, we use multiple regression to compute the coefficient values and statistical significance, so that we can understand whether the corresponding factors really have a positive or negative impact on debugging performance, and if so, how large the impact is.
  38. Our evaluation also includes a post-study survey, in which we asked participants to rate the …, the …, and offer opinions.
  39. Results
  40. First, high-q patches DO improve debugging correctness, SIGNIFICANTLY
  41. Here is the % of correct patches made by these two groups. It’s pretty straightforward that the group with HighQ patches made a much higher % of correct patches.
  42. The regression analysis also shows that the high-quality patch has a statistically significant positive coefficient on debugging correctness.
  43. Surprisingly, the group with low-quality patches made fewer correct patches, EVEN when compared to the control group.
  44. Regression also shows a negative coefficient for low-quality patches, although it is not statistically significant.
  45. But we do observe that low… can indeed …
  46. Next, we find…
  47. Here’s participants’ survey feedback on bug difficulty. We can see that they consider the third bug, Rhino …, to be the most difficult one to debug.
  48. And when we check, for each bug, the percentage of correct patches made by each group, we observe an obvious trend: for the 3rd bug, no one except the participants using high-quality patches fixed the bug correctly.
  49. On the other hand, we also found that …
  50. We can see from this figure that the debugging time of these three groups is not that different. And the regression analysis also suggests the same.
  51. We also found other … . For example, the last bullet: we found that novices, whose programming experience is below the average among all participants, tend to
  52. Next, when we analyzed the survey results, where we asked participants to rate how helpful each debugging aid is, we found that they consider HighQ generated patches much more helpful than LowQ generated patches.
  53. Now let’s listen to what participants said about their experience using generated patches in debugging during our human study.
  54. As usual, things have a positive and a negative side.
  55. Quote…
  56. But, on the other hand, such a quick starting point may be confusing… And they might require further refinement from human developers.
  57. Since we distinguish HighQ and LowQ patches based on their acceptability ranking reported in another work, this may not generalize to other quality measures, such as metric-based ones. Another threat is that participants may blindly… Actually, we took several measures to prevent such behavior. … When participants submit their patches, we ask them to justify their patches in an input box.
  58. Finally, the take-away of this work. BUT, strict quality … If we gave …, it could be misleading and indeed compromise their debugging performance. And the benefits of using auto-generated patches as debugging aids could be much more obvious for difficult debugging tasks, or for novice developers.