Automatically Generated Patches as Debugging Aids: A Human Study (FSE 2014)
Yida Tao, Jindae Kim, Sunghun Kim
Dept. of CSE, The Hong Kong University of Science and Technology
Chang Xu
State Key Lab for Novel Software Technology, Nanjing University
Automatic Program Repair
• Promising research progress
• ClearView [1]: prevented all 10 Firefox exploits created by a red team
• GenProg [2]: fixed 55 of 105 real bugs
[1] Perkins et al. Automatically Patching Errors in Deployed Software. SOSP '09.
[2] Le Goues et al. A Systematic Study of Automated Program Repair: Fixing 55 out of 105 Bugs for $8 Each. ICSE '12.
Automatic Program Repair
"It won't get your bug patched any quicker. You'll just have shifted the coders' attention away from their own app's bugs, and onto the repair tool's bugs."
- Slashdot discussion: http://science.slashdot.org/story/09/10/29/2248246/Fixing-Bugs-But-Bypassing-the-Source-Code
Automatic Program Repair
#what-could-possibly-go-wrong
#program-out-of-control
• Blackbox repair
• Increasing maintenance cost
• Vulnerable to attack
- Slashdot discussion: http://science.slashdot.org/story/09/10/29/2248246/Fixing-Bugs-But-Bypassing-the-Source-Code
- A Human Study of Patch Maintainability. ISSTA '12
- Automatic Patch Generation Learned from Human-Written Patches. ICSE '13

Use automatically generated patches as debugging aids
Our Human Study
• Investigate the usefulness of generated patches as debugging aids
• Discuss the impact of patch quality on debugging performance
• Explore practitioners' feedback on adopting automatic program repair
Bug Selection Criteria
• Real bugs
• The bug has accepted patches written by developers
• A proper number of bugs
• The bug has generated patches of different quality
Debugging aid · Participants · Bugs

Automatic patch generation learned from human-written patches. Kim et al. ICSE '13

Auto-generated patch A (high-quality patch; avg. ranking 1.6 from 85 devs and students, lower rank = more acceptable):

    for (int i = 0; i < parenCount; i++) {
        SubString sub = (SubString) parens.get(i);
        if (sub != null) {
            args[i+1] = sub.toString();
        }
    }

Auto-generated patch B (low-quality patch; avg. ranking 2.8 from 85 devs and students):

    for (int i = 0; i < parenCount; i++) {
        SubString sub = (SubString) parens.get(i);
        args[parenCount+1] = new Integer(reImpl.leftContext.length);
    }
Post-study Survey
• Helpfulness of debugging aids
• Difficulty of bugs
• Opinions on using generated patches as debugging aids

Results: Correctness · Debugging time · Survey feedback
Finding 4: The type of debugging aid does not affect debugging time

[Boxplot: debugging time (min), 0-80, for the Location, LowQ, and HighQ groups]
5
Other factors’ impact on debugging
performance
Difficult bugs significantly slow down debugging
Engr and MTurk are more likely to debug correctly
Novices tend to benefit more from HighQ patches
53
Finding 6: Helpfulness of debugging aids
Participants consider high-quality generated patches much more helpful than low-quality patches

[Chart: helpfulness ratings (Not Helpful, Slightly Helpful, Medium, Helpful, Very Helpful) for low-quality vs. high-quality generated patches]

Mann-Whitney U test, p-value = 0.001
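For reference, the standard Mann-Whitney U statistic behind this comparison (a textbook definition, not specific to this study):

    U_1 = R_1 - \frac{n_1(n_1+1)}{2}, \qquad U = \min(U_1, U_2)

where R_1 is the rank sum of the first sample in the pooled ranking and n_1, n_2 are the two sample sizes.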
Quick starting point
• Point to the buggy area
• Brainstorm
"They would seem to be useful in helping find various ideas around fixing the issue, even if the patch isn't always correct on its own."

Confusing, incomplete, misleading
• Wrong lead, especially for novices
• Require further refinement by human developers
"Generated patches would be good at recognizing obvious problems"
"…but may not recognize more involved defects."

"Generated patches simplify the problem"
"…but they may over-simplify it by not addressing the root cause."

"I would use generated patches as debugging aids, as they provide extra diagnostic information"
"…along with access to standard debugging tools."
Threats to Validity
• Bugs and generated patches may not be representative
• Quality measure of generated patches may not generalize
• May not generalize to domain experts
• Possibility of blindly reusing generated patches (we removed patches submitted in less than one minute)
Takeaway
• Auto-generated patches can be useful as debugging aids
  • Participants fix bugs more correctly with auto-generated patches
• Quality control is required
  • Participants' debugging correctness is compromised by low-quality generated patches
• Maximize the benefits
  • Benefits are greatest for difficult bugs and for novice developers
Editor's Notes
This is a work with …
Automatic program repair has been a very hot topic in recent years.
We’ve seen quite promising research progress in this area.
For example, Perkins et al. proposed ClearView, a self-defending software system, which successfully prevented all 10 of the Firefox exploits created by a red team and generated patches for 7 of them.
As another successful example, Le Goues et al. proposed GenProg and used it to fix 55 out of 105 real bugs.
However, there are also skeptics and worries about automatic program repair. Here is a quote from an online discussion.
Following this general concern, we've observed in online communities and the literature worries about things that could possibly go wrong with program repair techniques. For example, whether they create a sort of blackbox repair that hardly makes sense, whether they increase maintenance cost, and whether machine-generated patches are vulnerable to attack.
In general, people are worried about whether a program, after being repaired automatically, still works as intended, or becomes unpredictable and out of control.
Because of these concerns, direct deployment of automatic program repair seems problematic at this point.
But can we still benefit from this technique?
How about using ..?
In this case, developers can refer to generated patches when they debug, but they don't necessarily have to use them. In other words, they still have full control over the content of the patch.
This sounds like a more comfortable usage scenario.
Which is also the focus of our human study. First …
And because some of the controversy of program repair comes from the quality of automatically generated patches, we also want to disc…
Finally, we explore…
Here is our methodology
It is actually quite intuitive. Basically, we conducted controlled experiments in which we give a certain type of debugging aid to participants, who use it to debug.
Next, I’ll introduce these 3 parts in detail.
First, we have 3 different types of debugging aids.
And for the last type of debugging aid, we need some kind of baseline.
Because the first two debugging aids already suggest candidate fixes.
For a fair comparison, the baseline (the control group) receives only the buggy method location as its debugging aid,
which is common in practice: developers typically know the general buggy area from bug reports or stack traces before they start to debug.
Those are the 3 types of debugging aids we're going to give to participants.
We recruited 95 participants from a wide population.
This includes 44 CS graduate students, 28 software engineers from industry, and 23 workers on Amazon Mechanical Turk, a crowdsourcing marketplace.
Average years of programming experience: Grad: 4.1, Engr: 2.4 (range 1-10), MTurk: 5.7 (range 1-14).
Now the question is, how do we assign debugging aids to participants?
For the 44 graduate students, we adopt a between-group design, evenly dividing the students into 3 groups of similar programming experience.
Each group is given only one of the debugging aids.
These students use Eclipse to debug in a supervised session.
For remote participants, namely the 28 engineers and 23 MTurk workers, we cannot determine their numbers and expertise beforehand, so a between-group design is not appropriate if we want to ensure fair group division.
Instead, we adopt a within-group design, so that each participant can be exposed to different debugging aids.
To balance the experimental conditions, whenever a participant selects a bug, we assign the type of debugging aid for that particular bug in round-robin fashion, so that each aid is equally likely to be given for each bug; a minimal sketch of this idea follows.
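A minimal Java sketch of such a round-robin assignment (the names DebuggingAid and AidAssigner are hypothetical; the study's actual tooling is not described at this level of detail):

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: rotate through the three aid types per bug,
    // so that each aid is handed out equally often for every bug.
    enum DebuggingAid { BUGGY_LOCATION, LOW_QUALITY_PATCH, HIGH_QUALITY_PATCH }

    class AidAssigner {
        // Tracks, per bug id, which aid to hand out next.
        private final Map<String, Integer> nextAidIndex = new HashMap<>();

        DebuggingAid assign(String bugId) {
            int i = nextAidIndex.getOrDefault(bugId, 0);
            nextAidIndex.put(bugId, (i + 1) % DebuggingAid.values().length);
            return DebuggingAid.values()[i];
        }
    }

For example, three consecutive participants selecting the same bug would receive BUGGY_LOCATION, LOW_QUALITY_PATCH, and HIGH_QUALITY_PATCH in turn.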
We developed an online … for them to complete debugging tasks.
Next, how do we select bugs?
Accordingly, we selected all 5 bugs reported in this work…
For each of the 5 bugs, this work reported two patches generated by different program repair techniques.
They presented these different patches for the same bug to 85 … and asked them to rank the patches based on the question, "Which one is more acceptable?"
In the end, this work reported a ranking of the different patches for each bug.
For the purposes of our human study, we label the patch with the higher ranking as the "high-quality" patch, and its peer patch for the same bug, with the lower ranking, as the "low-quality" patch.
That's basically how we designed this debugging human study.
In total, participants submitted 337 patches…
Here is the number of submitted patches created with each debugging aid,
and here is the number of submitted patches created for each bug. Our design ensures that these two distributions are well balanced.
Next, I’ll describe how we evaluate participants’ debugging performance.
First, we evaluate the correctness of participants’ submitted patches.
A patch is labeled correct only if it passes our test cases
… and matches the …
For this part, we had 3 evaluators check and discuss the semantic matching. A minimal sketch of the test-passing half of this check is shown below.
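An illustrative Java sketch of the test-passing criterion (JUnit 4 is shown here as an assumption; the study's actual harness is not specified):

    import org.junit.runner.JUnitCore;
    import org.junit.runner.Result;

    // A submitted patch is only a *candidate* for "correct" if the bug's
    // regression tests all pass against the patched program. Semantic
    // equivalence to the developers' accepted patch was then judged
    // manually by three evaluators.
    public class CorrectnessCheck {
        public static boolean passesTests(Class<?> testSuite) {
            Result result = JUnitCore.runClasses(testSuite);
            return result.wasSuccessful();
        }
    }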
We also measure participants' debugging time by developing an Eclipse plug-in and a website timer to record the time they spent on each bug; a minimal sketch of the timing idea follows.
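A hypothetical Java sketch of the per-bug timing (the class name DebugTimer is illustrative, not taken from the actual plug-in):

    // Start a clock when a participant opens a bug; read it on submission.
    public class DebugTimer {
        private long startMillis;

        public void startBug() {
            startMillis = System.currentTimeMillis();
        }

        public double elapsedMinutes() {
            return (System.currentTimeMillis() - startMillis) / 60000.0;
        }
    }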
Up to this point, several factors can affect debugging correctness and time: the type of debugging aid, of course, but also the bugs, the participant types, and their expertise.
So, we use multiple regression analysis to quantify the relation between these independent variables and the outcomes. That is, we use multiple regression to compute coefficient values and statistical significance, so that we can understand whether the corresponding factors really have a positive or negative impact on debugging performance, and if so, how large the impact is; an illustrative model form follows.
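A plausible form of such a regression model, written purely as an illustration (the exact predictors and their coding are assumptions, not taken from the paper):

    \text{correctness} \sim \beta_0 + \beta_1\,\text{HighQ} + \beta_2\,\text{LowQ}
        + \beta_3\,\text{bug} + \beta_4\,\text{participant type} + \beta_5\,\text{experience}

The sign and statistical significance of each \beta_i indicate whether the corresponding factor helps or hurts debugging performance, and by how much; an analogous model is fit for debugging time.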
Our evaluation also includes a post-study survey, in which we asked participants to rate the …, the …, and offer opinions.
Results
First, high-quality patches DO improve debugging correctness, SIGNIFICANTLY.
Here is the percentage of correct patches made by these two groups. It is pretty clear that the group with high-quality patches made a much higher percentage of correct patches.
The regression analysis also shows that high-quality patches have a statistically significant positive coefficient on debugging correctness.
Surprisingly, the group with low-quality patches made fewer correct patches, EVEN when compared to the control group.
The regression also shows a negative coefficient for low-quality patches, although it is not statistically significant.
But we do observe that low… can indeed …
Next, we find…
Here is participants' survey feedback on bug difficulty. We can see that they consider the third bug, Rhino …, to be the most difficult one to debug.
And when we check, for each bug, the percentage of correct patches made by each group, we observe an obvious trend.
For the 3rd bug, no one except the participants using high-quality patches fixed the bug correctly.
On the other hand, we also found that …
We can see from this figure that the debugging time of the three groups is not that different.
And regression analysis also suggests the same.
We also found other factors that impact debugging performance … . For example, the last bullet:
we found that novices, whose programming experience is below the average among all participants, tend to benefit more from high-quality patches.
Next, when we analyze the survey results, where we ask participants to rate how helpful each debugging aid is, we find that they consider high-quality generated patches much more helpful than low-quality generated patches.
Now let’s listen to what participants said about their human study experience in using generated patches in debugging.
As usual, things always have a positive and a negative side.
Quote…
But, on the other hand, such a quick starting point may be confusing…
And they might require further refinement from human developers.
Since we distinguish high-quality and low-quality patches based on their acceptability ranking reported in another work, this may not generalize to other quality measures, such as metric-based ones.
Another threat is that participants may blindly…
Actually, we took several measures to prevent such behavior. …
When participants submit their patches, we ask them to justify their patches in an input box.
Finally, the take-away of this work.
BUT, strict quality control is required. If we gave …, it could be misleading and indeed compromise their debugging performance.
Finally, the benefits of using auto-generated patches as debugging aids could be much more obvious for difficult debugging tasks, or for novice developers