Automatic Identification of Informative Code in Stack Overflow Posts
Preetha Chatterjee
1st International Workshop on Natural Language-based Software Engineering (NLBSE)
Preprint: https://preethac.github.io/files/NLBSE_22.pdf
@PreethaChatterj
preethac@drexel.edu https://preethac.github.io
Stack Overflow Usage: Challenges
Information Overload
o The top 5 SO questions for the query “Differences Hashmap Hashtable Java” consist of 51 answers with a total of 6,771 words [Xu et al. 2017]
o Software engineers focus on only 27% of the code in a post to understand and apply the relevant information to their context [Chatterjee et al. 2020]
We aim to:
Extract informative code (problematic and suggested code pairs) within a Stack Overflow post
Prior techniques for helping information seekers on Stack Overflow:
 Reformulation of queries [Li et al. ‘16, Nie et al. ‘16]
 High-quality post detection [Ponzanelli et al. ‘14, Yao et al. ‘15]
 Automatic tag recommendation [Beyer and Pinzger ‘15, Saha et al. ‘13]
 Identification of cue sentences for navigation [Nadi and Treude ‘20]
Problematic and Suggested Code Pairs
Problematic and Suggested Code Pairs from Answer
Textual Patterns
change/modify <prob code> to <sugg code>
replace <prob code> with <sugg code>
instead of <prob code> add/try/use <sugg code>
switch <prob code> to <sugg code>
<sugg code> instead of <prob code>
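As a rough sketch, patterns like these could be matched with regular expressions over answer sentences. Everything below is illustrative, not the paper's implementation: the pattern set simply mirrors the five patterns above, and the `<code>` markup (how Stack Overflow's HTML wraps inline code) is an assumption about the input format:

```python
import re

# Inline code on Stack Overflow is wrapped in <code>...</code> tags.
CODE = r"<code>(.+?)</code>"

# (regex, roles of the two captured groups) -- order matters:
# more specific patterns are tried before the generic "X instead of Y".
PATTERNS = [
    (rf"(?:change|modify)\s+{CODE}\s+to\s+{CODE}", ("prob", "sugg")),
    (rf"replace\s+{CODE}\s+with\s+{CODE}", ("prob", "sugg")),
    (rf"instead of\s+{CODE}\s*,?\s*(?:add|try|use)\s+{CODE}", ("prob", "sugg")),
    (rf"switch\s+{CODE}\s+to\s+{CODE}", ("prob", "sugg")),
    (rf"{CODE}\s+instead of\s+{CODE}", ("sugg", "prob")),
]

def extract_code_pair(sentence):
    """Return (problematic_code, suggested_code) or None if no pattern matches."""
    for pattern, roles in PATTERNS:
        m = re.search(pattern, sentence, flags=re.IGNORECASE)
        if m:
            pair = dict(zip(roles, m.groups()))
            return pair["prob"], pair["sugg"]
    return None
```

For example, `extract_code_pair("change <code>map.get(k)</code> to <code>map.getOrDefault(k, 0)</code>")` yields the pair in problematic/suggested order.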
Problematic Code from Question and Suggested Code from Answer
Identification of sugg code from answer
- DECA [Di Sorbo et al. ‘15]
- Identify short imperative sentences that suggest a solution (e.g., “Perform an interactive rebase”).
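The imperative-sentence heuristic could be sketched roughly as follows. The verb list and length cutoff here are illustrative assumptions, not the ones used in the actual approach (which builds on DECA and part-of-speech information):

```python
# Hypothetical sketch: flag short sentences that begin with a base-form
# verb (imperative mood), as in "Perform an interactive rebase".
SUGGESTION_VERBS = {
    "use", "try", "add", "change", "replace", "remove", "perform",
    "call", "set", "switch", "modify", "update", "wrap",
}

def is_imperative_suggestion(sentence, max_words=10):
    """True if the sentence is short, starts with a suggestion verb,
    and has at least one following word acting as the verb's object."""
    words = sentence.strip().rstrip(".!").split()
    return (1 < len(words) <= max_words
            and words[0].lower() in SUGGESTION_VERBS)
```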
Identification of prob code from question
- Extract code elements from the sugg code.
- Determine the approximate location of the prob code.
- Measure textual similarity between the prob code in the question and the sugg code in the answer.
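These three steps might be sketched as below. The identifier-based tokenizer, the `difflib` similarity measure, and the threshold value are illustrative assumptions, not the paper's actual implementation:

```python
import re
from difflib import SequenceMatcher

def code_tokens(line):
    """Split a line of code into identifier-like tokens (code elements)."""
    return set(re.findall(r"[A-Za-z_]\w*", line))

def find_problematic_lines(question_code, suggested_line, threshold=0.5):
    """Return question-code lines that (1) share code elements with the
    suggested code and (2) exceed a textual-similarity threshold."""
    sugg_tokens = code_tokens(suggested_line)
    hits = []
    for line in question_code.splitlines():
        if not (code_tokens(line) & sugg_tokens):
            continue  # no co-occurring code elements: skip this line
        sim = SequenceMatcher(None, line.strip(), suggested_line.strip()).ratio()
        if sim >= threshold:
            hits.append(line)
    return hits
```

For instance, given question code containing `Map m = new Hashtable();` and the suggested line `Map m = new HashMap();`, the shared tokens and high character-level similarity would flag the Hashtable line as problematic code.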
Evaluation: Preliminary Survey
Q1. Focus - Which parts of the code do software engineers focus on to reduce their time in locating information?
 Lines of code in the question causing the error or exception
 Targeted fix from the suggested code snippets in the answers
Q2. Feedback - Is highlighted code helpful for users to understand the problem and choose a potential solution?
 14/25 (56%) indicated ‘Very useful’
 8/25 (32%) indicated ‘Useful’
 3/25 (12%) indicated ‘Somewhat useful’
Summary and Overview
NL-based approaches show promise to extract prob and sugg code from a post
Significance:
 improve the Q&A forum interface
 guide tools for mining forums
 improve granularity of traceability mappings involving forum posts

Editor's Notes

  • #3 SO posts often contain long threads of discussion. A previous study suggests that reading the answers of a popular SO post can take about 30 minutes! Because of this information overload, finding help on SO is difficult for software engineers. In our previous work, we found … Several techniques and tools have been developed to help information seekers on Stack Overflow; however, to our knowledge, no one has investigated how to help information seekers focus their attention on identifying relevant targeted code changes. In this paper, we aim to automatically identify informative code segments in a Stack Overflow post. Specifically, we focus on identifying the problematic and suggested code pairs from questions and answers in a post.
  • #4 We propose a set of natural language-based approaches to automatically extract problematic and suggested code pairs from Stack Overflow posts in two scenarios: (1) presence of problematic and suggested code in the same answer (e.g., Figure 1a), and (2) presence of problematic code in the question and suggested code in the answer (e.g., Figure 1b).
  • #5 When developers suggest short targeted fixes in answers, they tend to use recurrent textual patterns. These patterns consist of short phrases or words accompanied by code segments. For instance, in Figure 1a, the code suggestion (highlighted in blue) follows the pattern "change <problematic code> to <suggested code>". We observed 5 common patterns in our data set for identifying the problematic and suggested code pairs in Stack Overflow answers.
  • #6 We decompose the problem of code-pair extraction into two sub-problems. First, we identify suggested code from answers using linguistic patterns. Next, we extract the problematic code from the question using textual heuristics. We start by applying an existing tool, DECA, which was able to identify some of the instances in our data set. To improve performance, we also identify short imperative sentences that suggest a solution, i.e., sentences that start with a verb followed by its object (e.g., "Perform an interactive rebase"). Next, to identify the problematic code, we first extract a list of code elements from the previously identified suggested code in the answer. If one or more code elements found in the suggested code co-occur in the question code, we extract that line of code in the question as potential problematic code. Finally, we measure the textual similarity between each line of approximate problematic code in the question and each line of suggested code in the answer. If the similarity crosses a certain threshold, we consider that line of code to be problematic code.
  • #7 To obtain early feedback on the potential of our idea to be beneficial for software developers, we conducted a pilot survey to understand "How effective is our approach in helping developers find a relevant solution to a bug-related issue from a Stack Overflow post?". … Responses to question (1) in the survey suggest … Next, we manually annotated a set of 100 posts selected randomly from our data set. We asked …
  • #8 In summary, NL-based approaches … This technique can be leveraged to guide tools for mining forums and to improve the granularity of traceability mappings involving forum posts.