Identifying Bug-Prone API Methods using Crowdsourced Knowledge
1. IDENTIFICATION OF BUG-PRONE API METHODS USING CROWDSOURCED KNOWLEDGE
Mohammad Masudur Rahman
Department of Computer Science
University of Saskatchewan, Canada
CMPT-842: Mobile and Cloud Computing
Course Instructor: Dr. Ralph Deters
2. AN EXAMPLE BUGGY CODE!
7 API classes from 2 packages
7 Constructors
7 API method invocations
Fig: Zip file creation
4. GOOD NEWS: STACK OVERFLOW!
Launched in 2008
4M users; 10M questions; 21M answers
A massive body of information: programming languages, code examples, API issues & bugs, relevant knowledge
6. OUTLINE OF THE TALK
Stack Overflow Q & A
Exploratory study: 2 research questions, API method invocation database
BRACK
Evaluation using 8 systems
Validation with 2 studies
Take-home messages
8. EXPLORATORY STUDY: CONSTRUCTION OF
API METHOD INVOCATION DATABASE
Phase 1: SO Q&A threads → preprocessing → topic modeling → bug/error-related topics → bug/error-related threads (165,580)
Phase 2: SO Q & A thread → defective code & rectified code → island parsing → defective method calls & corrected method calls → API invocation database (49,425)
9. EXPLORATORY STUDY: RESEARCH QUESTIONS
RQ1: Are programming issues, errors or exceptions
reported at Stack Overflow frequently associated with
API method invocations?
RQ2: Are certain APIs and their methods more prone to
programming errors or bugs than the others?
12. EXPLORATORY STUDY SUMMARY
Programming issues, errors or exceptions reported at Stack Overflow are frequently associated with API method invocations.
Some APIs and their methods are more prone to programming errors or bugs than others.
14. BRACK: API BUG-PRONENESS
HEURISTICS—H1
API Context-Susceptibility (ACS)
Defective code
Dependency of an API invocation on its context
Context can alter the expected behaviour of the invocation
ACS estimates how vulnerable an API method invocation (e.g., BufferedReader.readLine()) is to its context
Based on programming errors reported at Stack Overflow
15. BRACK: API BUG-PRONENESS
HEURISTICS—H2
API Error-Associativity (AEA)
Code segments from bug-related Q & A of SO
AEA calculates the co-occurrence of an API method invocation in both defective and rectified code segments
Defective code
Rectified code
16. BRACK: API BUG-PRONENESS RANKING
Input: Defective code
Defective code → island parsing → API invocations → heuristic collector (using the API invocation database) → bug-proneness score calculator → bug-proneness ranking → bug-prone API method invocations
Output: Ranked bug-prone API method invocations
Detailed algorithm in the paper.
18. EXPERIMENTAL DESIGN
8 OSS systems; 3,821 bug-fixing commits; bug reports
Island parsing
Test cases & gold set → evaluation & validation
19. EXPERIMENT: RESEARCH QUESTIONS
RQ1: How does BRACK perform in identifying bug-prone API method
invocations from a given code segment?
RQ2: How effective are those heuristics, ACS and AEA, in identifying bug-prone API method invocations?
RQ3: Does BRACK show any bias to particular subject systems or API packages in such identification?
RQ4: Is BRACK comparable to the state-of-the-art in identifying bug-prone API method invocations from buggy code?
20. PERFORMANCE: ANSWER TO RQ1
Metric Top-3
Top-3 Accuracy 75.93%
Mean Reciprocal Rank@3 0.47
Mean Average Precision@3 59.04%
Mean Recall@3 34.44%
Fig: Performance for different Top-K
21. EFFECTIVENESS: ANSWER TO RQ2
Metric ACS (H1) AEA (H2) Combined (H1+H2)
Top-3 Accuracy 75.54% 61.77% 75.93%
MRR@3 0.47 0.44 0.47
MAP@3 58.47% 51.47% 59.04%
MR@3 33.18% 21.20% 34.44%
ACS is found to be more effective than AEA
Combination marginally improves the performance
Detailed analysis in the paper.
22. BIAS: ANSWER TO RQ3
Metric Small Systems (4) Medium Systems (4)
Top-3 Accuracy 77.23% 74.63%
MRR@3 0.50 0.44
MAP@3 61.41% 56.65%
MR@3 34.95% 33.93%
Small systems: <150 commits; medium systems: >400 commits.
MWU test on Top-3 accuracy: p-value = 0.75 > 0.05; the performance difference is NOT significant
Similar findings about API packages (in the paper)
23. STATE-OF-THE-ART
Chen and Kim, FSE 2015
Detects defective code in Stack Overflow and suggests
corresponding rectified code.
Subject to the availability of code clones.
Kim et al. FSE 2015
Applies 28 source code metrics and 12 software
process metrics.
Random Forest based machine learning classifier.
Limited generalizability.
25. THREATS TO VALIDITY
Internal Validity: Replication of existing studies in
our environment.
Best performing settings applied.
External Validity: Generalization of BRACK.
API invocation convention similar across various
languages.
Construct Validity: Appropriateness of the
performance metrics.
Metrics taken from existing literature.
Bias in gold set: Overlapping method invocation
assumption
JDK bug fixing history should be added.
Hello everyone!
My name is Mohammad Masudur Rahman
I am a 2nd-year PhD student at the University of Saskatchewan, Canada.
Today, I am going to talk about an automated technique for identifying bug-prone API methods from a given buggy code.
Let's take a look at this code. This code compiles, runs without any error, and produces a zip file.
The only problem is that the zip file is corrupted, which means the code is buggy.
Now this code contains 7 API classes, 7 constructors and 7 method invocations.
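The slides do not reproduce the snippet itself, so here is a minimal sketch of how such a failure can arise. This is my own reconstruction, under the assumption that the corruption comes from a never-closed stream; it is not the actual code from the talk:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipExample {

    // Write one entry; when closeStream is false, the stream is never closed,
    // so the zip's central directory is never written and the file is corrupted.
    public static void writeZip(Path target, boolean closeStream) throws IOException {
        ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(target.toFile()));
        zos.putNextEntry(new ZipEntry("hello.txt"));
        zos.write("hello".getBytes(StandardCharsets.UTF_8));
        zos.closeEntry();
        if (closeStream) {
            zos.close(); // forgetting this line is the hypothetical bug
        }
    }

    // A corrupted archive makes ZipFile throw, even though writing "succeeded".
    public static boolean isReadable(Path target) {
        try (ZipFile zf = new ZipFile(target.toFile())) {
            return zf.getEntry("hello.txt") != null;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path good = Files.createTempFile("good", ".zip");
        Path bad = Files.createTempFile("bad", ".zip");
        writeZip(good, true);
        writeZip(bad, false);
        System.out.println("closed: " + isReadable(good) + ", unclosed: " + isReadable(bad));
    }
}
```

Without `close()` (or `finish()`), the deflated data and the central directory are never flushed, so a file is produced but readers reject it: exactly the kind of bug that compiles and runs silently.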
A developer’s responsibility is to debug this code line by line, check different parameter values and check for suspicious patterns.
Now, a debugging context could be bigger and might involve more API invocations.
Now, if there exists a tool that can predict which API invocations are more bug-prone, that could be a very helpful information for the developer during debugging.
Then the developer's inspection could be small but effective.
Our project provides exactly this type of support to the developer.
Now the task is not easy. Such prediction about API methods involves several challenges.
First: lack of sufficient and reliable sources for such information
-- No repository provides direct info on API method bug-proneness.
--API documentation does not contain such info, they just explain the simple usages.
-- Bug reports are a possible alternative source, but they might not be sufficient. Because, just from bug-report, one cannot simply determine which API methods are responsible for the bug.
Second: knowledge of API bug-proneness comes from long work experience; it cannot be learned overnight.
So, this knowledge is not trivial and cannot be gained quickly.
Good news is Stack Overflow. It’s a programming Q & A site launched in 2008.
It contains a massive body of relevant information for our task.
It has 4M registered users, 10 million questions, and 20M+ answers.
The questions are mostly related to programming languages such as Java, C#, Javascript, PHP, Android and so on.
The questions and the answers contain thousands of code examples.
Most importantly, they discuss various API issues, errors and bugs, which can be mined to provide support to developers.
Now let's take a look at this buggy code example related to the Java reflection API.
The question shows the defective code, and the invoke method is the source of the bug or error.
Then, in the rectified code, that error is corrected by another developer from the community, and this is the accepted answer.
Now, if we can collect such defective and rectified code segment pairs, and find that the same API invocations cause errors in various contexts,
then that suggests the target API invocation is bug-prone, i.e., prone to errors, misunderstanding, or confusion.
This is the outline of today's talk.
I would first discuss our exploratory study.
Then, based on the findings, we propose our technique, BRACK, for bug-prone method identification.
Then we discuss our experiments, evaluation and validations.
And then we finally conclude with discussions.
Now, this is what we do during the exploratory study.
Since we are interested in API errors and bugs, we collect bug/error-related questions from Stack Overflow.
For that we collect 500K question titles, perform natural language preprocessing and then perform topic modeling on them using LDA.
This provides a list of 200 topics from which we manually analyze and select 48 topics related to programming errors and bugs.
Then we separate questions discussing those topics—we got 165K questions like that.
Then in the second phase, we analyze each of those bug related questions and answers, and extract the defective and rectified code segments.
We then perform island parsing on the code segments, and extract the API method invocations.
Based on our observation, we conjecture that the invocations that overlap between defective and rectified code are likely connected to the bug. So, we store all the invocations from both code segments and develop an API invocation database.
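A rough sketch of these two steps, extracting invocations via a simplified island parse and keeping the overlap between defective and rectified code, might look like this. The regex and class names here are illustrative assumptions; the paper uses a proper island grammar:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvocationExtractor {

    // Simplified "island parsing": skip everything except receiver.method( islands.
    // A real island grammar is far more robust than this regex approximation.
    private static final Pattern CALL =
            Pattern.compile("\\b(\\w+)\\s*\\.\\s*([a-z]\\w*)\\s*\\(");

    public static Set<String> extract(String code) {
        Set<String> calls = new LinkedHashSet<>();
        Matcher m = CALL.matcher(code);
        while (m.find()) {
            calls.add(m.group(1) + "." + m.group(2));
        }
        return calls;
    }

    // Invocations present in BOTH the defective and the rectified segment are
    // the ones the study conjectures to be connected to the reported bug.
    public static Set<String> overlap(String defective, String rectified) {
        Set<String> common = new LinkedHashSet<>(extract(defective));
        common.retainAll(extract(rectified));
        return common;
    }

    public static void main(String[] args) {
        String defective = "Method m = c.getMethod(\"run\"); m.invoke(null);";
        String rectified = "Method m = c.getMethod(\"run\"); m.invoke(obj);";
        System.out.println(overlap(defective, rectified)); // [c.getMethod, m.invoke]
    }
}
```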
Then in the exploratory study, we ask two research questions.
Are programming issues or errors related to API method invocations?
--If yes, then our support will make sense.
Do different API classes/methods have different levels of bug-proneness?
--If yes, then a ranking of bug-proneness will make sense.
We analyze the API invocation database to answer these research questions.
Now, this is the frequency distribution of the API invocations in the bug-related questions.
From both the probability mass function and the cumulative density function, it's clear that the distribution is heavy-tailed.
That means a small number of Q & A threads contain most of the density.
From the box plot, we can see a median invocation frequency of 3.
More importantly, the overlapped invocation frequency between defective and rectified code is close to 2.
So, yes, API invocations are pretty much associated with programming errors and bugs.
The JDK contains about 3K classes spread over various packages, and different packages have different numbers of API classes.
To determine package-level relative bug-proneness, we thus randomly choose 20 API classes from each package.
Then we determine their API method invocation frequency from the database we developed.
We repeated this random selection and counting process 10 times and obtained these statistics.
This shows that API classes from different packages have different proneness to errors or bugs.
In this case, we found Java IO and SQL classes have the maximum proneness to errors.
So, we can summarize the findings from the exploratory study.
--Programming errors/bugs are associated with API method invocations.
--Some APIs and their classes are more bug-prone than others.
Based on these exploratory findings, we propose our technique, BRACK, which identifies bug-prone API methods using crowdsourced knowledge.
Now, we use two heuristics to capture bug-proneness of an API method invocation.
For example, let's look at this buggy code; it throws a NullPointerException.
Now, if you consider these two invocations, which one is likely to cause such an exception? Obviously this one, right?
That's because it is heavily dependent on this context: the other API invocations.
We capture this concept as API Context-Susceptibility.
That means how vulnerable an invocation is to errors due to its context, i.e., the surrounding API invocations.
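As a concrete illustration, ACS could be operationalized like this. The formula below is my own simplified assumption, not necessarily the one in the paper:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ContextSusceptibility {

    // Hypothetical scoring (the paper's exact formula may differ): among the
    // defective Stack Overflow segments that contain the target invocation,
    // the fraction that also contain at least one invocation from the current
    // context. Higher means the target fails together with this context more often.
    public static double acs(String target, Set<String> context,
                             List<Set<String>> defectiveSegments) {
        int withTarget = 0;
        int withTargetAndContext = 0;
        for (Set<String> segment : defectiveSegments) {
            if (!segment.contains(target)) {
                continue;
            }
            withTarget++;
            for (String c : context) {
                if (segment.contains(c)) {
                    withTargetAndContext++;
                    break;
                }
            }
        }
        return withTarget == 0 ? 0.0 : (double) withTargetAndContext / withTarget;
    }

    public static void main(String[] args) {
        List<Set<String>> segments = Arrays.<Set<String>>asList(
                new HashSet<>(Arrays.asList("BufferedReader.readLine", "FileReader.read")),
                new HashSet<>(Arrays.asList("BufferedReader.readLine")));
        Set<String> context = new HashSet<>(Arrays.asList("FileReader.read"));
        System.out.println(acs("BufferedReader.readLine", context, segments)); // 0.5
    }
}
```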
Another heuristic we consider is called API Error-Associativity: how likely an invocation is to be associated with an error.
On Stack Overflow, we saw that API invocations appearing in the defective code and repeated in the rectified code are mostly associated with the reported error.
So, this heuristic calculates such occurrences from Stack Overflow code segments.
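A minimal sketch of that calculation, under the simplifying assumption that AEA is just the fraction of threads where the invocation survives from defective to rectified code:

```java
import java.util.Arrays;
import java.util.List;

public class ErrorAssociativity {

    // Hypothetical scoring (the paper's exact formula may differ): the fraction
    // of Q & A threads whose defective AND rectified code segments both contain
    // the invocation. Each thread is modeled as {defectiveCode, rectifiedCode}.
    public static double aea(String invocation, List<String[]> threads) {
        if (threads.isEmpty()) {
            return 0.0;
        }
        int both = 0;
        for (String[] thread : threads) {
            if (thread[0].contains(invocation) && thread[1].contains(invocation)) {
                both++;
            }
        }
        return (double) both / threads.size();
    }

    public static void main(String[] args) {
        List<String[]> threads = Arrays.asList(
                new String[] {"m.invoke(null)", "m.invoke(obj)"},
                new String[] {"m.invoke(null)", "r.run()"});
        System.out.println(aea("m.invoke", threads)); // 0.5
    }
}
```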
The next steps are pretty much straightforward.
So, for an input buggy code, we perform island parsing, extract the API invocations and collect those two heuristics for each of the invocations.
We then produce a bug-proneness score based on those heuristics as well as code contextual similarity for each invocation.
Then we rank those invocations based on bug-proneness, and recommend the Top-3 invocations.
As mentioned, besides the heuristics, we consider code contextual similarity.
When we calculate heuristics of the invocations from SO code, we also determine code similarity between input code and the defective code.
Thus, our bug-proneness is based on two heuristics and the contextual similarity.
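Putting the pieces together, the ranking step could be sketched as follows. The equal-weight sum is an illustrative assumption on my part; the actual score calculator is detailed in the paper:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BugPronenessRanker {

    // Hypothetical combination (the paper's weights and algorithm may differ):
    // score = ACS + AEA + contextual similarity.
    static double score(double[] features) {
        return features[0] + features[1] + features[2];
    }

    // features: invocation -> {acs, aea, contextualSimilarity};
    // returns the Top-3 invocations by descending bug-proneness score.
    public static List<String> rankTop3(Map<String, double[]> features) {
        List<String> invocations = new ArrayList<>(features.keySet());
        invocations.sort(Comparator.comparingDouble(
                (String inv) -> score(features.get(inv))).reversed());
        return invocations.subList(0, Math.min(3, invocations.size()));
    }

    public static void main(String[] args) {
        Map<String, double[]> features = new LinkedHashMap<>();
        features.put("ZipOutputStream.close", new double[] {0.9, 0.8, 0.5});
        features.put("FileOutputStream.write", new double[] {0.1, 0.2, 0.1});
        features.put("ZipEntry.setSize", new double[] {0.5, 0.5, 0.5});
        features.put("String.format", new double[] {0.0, 0.0, 0.0});
        System.out.println(rankTop3(features)); // highest combined scores first
    }
}
```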
Now, this is how we design our experiment.
We consult 8 open-source software systems and their bug reports.
Then we collect the bug-fixing commits and apply island parsing to the diff of each commit.
This provides the test cases and the gold set, which are used for evaluation and validation.
In our experiment we ask these four research questions.
How does our technique perform in identifying the bug-prone API invocations in terms of traditional performance metrics?
How effective are our proposed heuristics?
Does it show any bias to subject systems or API packages?
How does it perform compared to the state-of-the-art?
Well, this is our performance.
For Top-3 recommendation, we get 76% accuracy with 59% precision, which is quite promising.
When we check various Top-K values, we see that accuracy and precision rise logarithmically.
The recall is a bit low at 35%, but 76% accuracy still shows promise.
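For reference, the Top-K metrics quoted above follow their standard definitions, which can be sketched as below; this is generic metric code, not the paper's evaluation harness:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class TopKMetrics {

    // Top-K accuracy: fraction of test cases where at least one gold invocation
    // appears within the top K recommendations.
    public static double topKAccuracy(List<List<String>> ranked,
                                      List<Set<String>> gold, int k) {
        int hits = 0;
        for (int i = 0; i < ranked.size(); i++) {
            List<String> top = ranked.get(i)
                    .subList(0, Math.min(k, ranked.get(i).size()));
            for (String inv : top) {
                if (gold.get(i).contains(inv)) {
                    hits++;
                    break;
                }
            }
        }
        return (double) hits / ranked.size();
    }

    // Reciprocal rank of the first gold hit within the top K (0 if none);
    // MRR@K is this value averaged over all test cases.
    public static double rrAtK(List<String> ranked, Set<String> gold, int k) {
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (gold.contains(ranked.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("a.x", "b.y", "c.z");
        System.out.println(rrAtK(ranked, Collections.singleton("b.y"), 3)); // 0.5
    }
}
```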
When we consider the heuristics, we found Context-Susceptibility to be more effective.
The second heuristic marginally improves the performance, which justifies their combination in our ranking algorithm.
We then divide our subject systems into two groups: small systems with fewer than 150 commits and medium systems with more than 400 commits in our dataset.
These are average performance for both groups. Interestingly, we see their performance is pretty much similar.
From, the statistical tests, we also found that their performance is not significantly different.
We also found similar findings for API packages.
So, based on our experiments, our technique does not show bias to any subject system or API packages.
Then we compare with 2 existing systems.
The first one applies code clone detection on SO defective and rectified codes, and returns the rectified code as solution.
-- This is limited, because Stack Overflow needs to contain the matching code clones in this case.
The second study applies machine learning on source code and process metrics to determine bug-proneness of API classes.
Now, these are the findings.
We see that for each of the subject systems, our proposed technique provides noticeably better results, especially in accuracy.
The closest competitor is Kim et al., the technique based on metrics and machine learning.
Then when we consider the box plots, we see our performance is significantly higher than the state-of-the-art.
The recall is a bit lower.
But, still, the experiment demonstrates the potential of our technique.
We also identified a few threats to the validity of our findings.
Replication of the existing systems. We used their best settings for experiment.
Generalization of our technique. The API invocation convention is pretty much similar for various languages. We did for Java language, but it can be done for other languages as well.
Use of appropriate metrics: yes, we used metrics from the relevant literature, such as precision and recall, so they are appropriate.
Bias in gold set: Yes, there might be some bias in gold set development, but we are working on it.
So, to summarize, we propose a technique that identifies bug-prone API method invocations from a buggy code.
We used defective and rectified code from SO, developed an invocation database, and answered 2 research questions.
Then we proposed BRACK and conducted experiments using bug-fixing commits from OSS projects.
Then we evaluated and validated against the state-of-the-art.
All findings suggest that our technique has potential.
That’s all I have to say.
Thanks for your attention. Questions?