Mining Code Examples with Descriptive Text from Software Artifacts

Preetha Chatterjee
PhD Student
preethac@udel.edu
http://sites.udel.edu/preethac/

Lori Pollock
Professor
pollock@udel.edu
https://www.eecis.udel.edu/~pollock/

Collaborators:
Kostadin Damevski, Manziba Akanda Nishi (Virginia Commonwealth University)
Nicholas A. Kraft, Vinay Augustine (ABB Corporate Research)
Benjamin Gause, Hunter Hedinger (University of Delaware)
Code is everywhere!
Mining Code Segments for Software Tools
Emails and bug reports:
• Re-documenting source code [Panichella 2012]
• Recommending mentors in software projects [Canfora 2012]
Tutorials: API learning [Jiang 2017, Petrosyan 2015]
Q&A forums:
• IDE recommendation [DeSouza 2014, Rahman 2014, Cordeiro 2012, Ponzanelli 2014, Bacchelli 2012, Amintaber 2015]
• Learning and recommendation of APIs [Chen 2016, Rahman 2016, Wang 2013]
• Automatic generation of comments for source code [Wong 2013, Rahman 2015]
• Building thesauri of software-specific terms [Tian 2014, Chen 2017]
Research Articles?
Chats?
Today’s Talk

An Exploratory Study: What Information about Code Snippets Is Available in Different Software-Related Documents?
Case Study: Extracting Code Segments and Their Descriptions from Research Articles
Empirical Study: Chat Communities as a Mining Source for SE Tasks
An Exploratory Study: What Information about Code Snippets Is Available in Different Software-Related Documents?
P. Chatterjee, M. A. Nishi, K. Damevski, V. Augustine, L. Pollock and N. A. Kraft,
"What information about code snippets is available in different software-related
documents? An exploratory study," 2017 IEEE 24th International Conference on Software
Analysis, Evolution and Reengineering (SANER), Klagenfurt, 2017, pp. 382-386.
Research Questions

What kinds of information about the embedded code snippets are available in different document types?
What are the characteristics of the code snippets embedded in different types of software-related documents?
What cues indicate code-snippet-related information? How do the cues differ across document types?
Software-Related Documents

Document Type    | Example Origin
Benchmarks       | OpenMP, NAS
Blog Posts       | MSDN, WordPress
Bug Repositories | GitHub Issues, Bugzilla
Code Reviews     | GitHub Pull Requests, Gerrit
Course Materials | cs.*.edu
Documentation    | readthedocs.org
E-Books          | WikiBooks
Mailing Lists    | lkml.org
Presentations    | SlideShare, Speaker Deck
Public Chats     | Gitter, Slack
Q&A Forums       | StackOverflow, MSDN
Research Papers  | IEEE Xplore, arxiv.org
Videos           | YouTube
Study Methodology
Annotation Example
Labels and Sub-Labels

Label: Sub-Labels
Explanatory: Rationale, Functionality, Methodology, Output of Code
Similarity: Modification, Origin
Structure: Data Structure, Control Flow, Data Flow, Lines of Code
Design: Programming Language, Framework, Time/Space Complexity
Efficiency: Efficient, Inefficient
Assumptions
Testing
Clarity: High, Low
Erroneous: Compilation, Runtime
Example: Documentation

Observation 1: Code to perform a 3D Fourier Transform
Labels/Sub-labels: Explanatory (Functionality)

Observation 2: Code uses a multidimensional array
Labels/Sub-labels: Structure (Data Structure)
Frequency of occurrence of different kinds of information for the code snippets by document type
Code Snippet and Description Availability by Document Type

Document Type    | # Docs | Mean # Code Snippets | Mean # LOC per Snippet | Mean # Lines of Text
Benchmarks       | 2      | 4.5                  | 13.0                   | 86.7
Blog Posts       | 10     | 8.6                  | 13.1                   | 88.3
Bug Reports      | 6      | 2.7                  | 20.1                   | 17.2
Code Reviews     | 7      | 7.1                  | 33.3                   | 64.9
Course Materials | 3      | 3.0                  | 12.8                   | 12.6
Documentation    | 6      | 3.5                  | 7.8                    | 22.8
E-books          | 5      | 2.6                  | 21.4                   | 37.4
Mailing Lists    | 5      | 1.6                  | 46.6                   | 17.6
Papers           | 5      | 8.6                  | 10.3                   | 439.9
Presentations    | 3      | 8.3                  | 11.6                   | 18.7
Public Chat      | 5      | 1.2                  | 8.3                    | 15.6
Q&A Sites        | 3      | 4.7                  | 12.6                   | 32.8
Total            | 60     |                      |                        |
Case Study: Extracting Code Segments and Their Descriptions from Research Articles
P. Chatterjee, B. Gause, H. Hedinger and L. Pollock, "Extracting Code Segments and Their
Descriptions from Research Articles," 2017 IEEE/ACM 14th International Conference on
Mining Software Repositories (MSR), Buenos Aires, 2017, pp. 91-101.
Among the software-related document types studied (bug reports, emails, blog posts, Q&A forums, code reviews, documentation, e-books, course materials, presentations, public chats, benchmarks), we focus here on research papers.

Research Papers
DL          | Domain                  | # of Articles
ACM DL      | Computer Science        | > 300,000
IEEE Xplore | Computer Science        | > 3,500,000
DBLP        | Mostly Computer Science | > 3,729,582

https://en.wikipedia.org/wiki/IEEE_Xplore
https://cacm.acm.org/magazines/2011/7/109905-acm-aggregates-publication-statistics-in-the-acm-digital-library/fulltext
http://dblp.uni-trier.de/

70% of the articles contain one or more code segments, with an average of 3-4 code segments per article.
What information can be learned from the text?

"To understand the difficulty of fixing a memory leak, let us take a look at an example program in Fig. 1. This is a contrived example mimicking recurring leak patterns we found in real C programs. Procedure check_records checks whether there is any bad records in a large file, and the caller could either check all records, or specify a search condition to check only part of records. In this example, both get_next and search_for_next will allocate and return a heap structure, which is expected to be freed at line 12. However, the execution may break out the loop at line 10, causing a memory leak."

Without analyzing the code, we can learn:
• Indication of a problem in the code
• The functionality of the code
• Individual method functionalities
• The type of data structure used
• The cause of the issue presented earlier
• The programming language
Challenges

Identify code segments in unstructured documents — addressed by prior work:
• Bacchelli et al. (ICPC’10)
• Tang et al. (KDD’05)
• Bettenburg et al. (MSR’08)
• Subramanian et al. (MSR’13)
• Rigby et al. (ICSE’13)

Our focus: identify the text describing each code segment (e.g., the memory-leak paragraph above).
Contributions

• Automatically identifying and mapping text describing code segments in research articles
• A prototype, CoDesNPub Miner, that outputs an XML-based representation associating code segments with their descriptions
• Evaluation of the effectiveness of code description identification techniques
  • Seeds
  • Neighbors
Overview of CoDesNPub Miner
Code Description Identification
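To make the pipeline's output concrete, here is a minimal sketch of how an XML representation associating each code segment with its description sentences could be assembled. The element and attribute names are illustrative assumptions, not CoDesNPub Miner's actual schema.

    import xml.etree.ElementTree as ET

    def build_output(article_id, segments):
        """segments: list of (code_text, [description sentences]) pairs.
        Returns an XML string pairing each code segment with its text.
        Tag names here are hypothetical, not the tool's real schema."""
        root = ET.Element("article", id=article_id)
        for idx, (code, descriptions) in enumerate(segments, start=1):
            seg = ET.SubElement(root, "codeSegment", id=str(idx))
            ET.SubElement(seg, "code").text = code
            desc = ET.SubElement(seg, "description")
            for sentence in descriptions:
                ET.SubElement(desc, "sentence").text = sentence
        return ET.tostring(root, encoding="unicode")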
Identify Seeds: ReferencesCodeFigure Heuristic

"Fig.5 shows a typical test method of this pattern. The method tests a set of basic functionality of API class BasicAuthCache, including the method put, get, remove and clear. There are three test scenarios in the method: line 4-5, line 6-7, line 8-10. They share two data objects, cache and authScheme. Their method invocation sequences are not same and there is no unified test target method. But there is a common subsequence among three method invocation sequences, i.e., the invocations of get and HttpHost."

*****Code Segment appears here*****

"Listing 9 shows an example of three statements that were single statement blocks after the first phases, but can be merged into a single block because they have similar RHSs."

*****Code Segment appears here*****
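A minimal sketch of how this heuristic could be implemented: scan each sentence for figure/listing references and keep those that point at a figure the preprocessor already classified as code. The regular expression and the normalized figure keys are assumptions for illustration, not the tool's actual code.

    import re

    def references_code_figure(sentence, code_figures):
        """Return the (kind, number) of a referenced code figure, or None.
        code_figures: set of normalized keys, e.g. {("fig", 5)}, built by
        the preprocessing step that classifies figures as code."""
        for kind, number in re.findall(
                r"\b(Fig(?:ure)?|Listing)\.?\s*(\d+)", sentence, re.IGNORECASE):
            key = ("fig" if kind.lower().startswith("fig") else "listing",
                   int(number))
            if key in code_figures:
                return key
        return None

    # e.g. references_code_figure("Fig.5 shows a typical test method ...",
    #                             {("fig", 5)})  ->  ("fig", 5)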
Identify Seeds: TextBefore and TextAfter Heuristic

[Fig. 5 example paragraph, as above]
*****Code Segment appears here*****

*****Code Segment appears here*****
"A major obstacle to extracting API examples from test code is the multiple test scenarios in a test method. Fig. 1 depicts such a test method. Lines 2-4 are the declaration of some data objects. Lines 5-13 depict a test scenario that contains the usage of some API methods, such as keySetByValue, put, and getKey. Lines 14-22 depict another test scenario, which contains a similar usage to the previous one. Such multiple test scenarios are quite reasonable when aiming at covering testing input domains. But they bring redundant code for API users to read. In fact, there are actually 200+ code lines containing similar test scenarios in the test method in Fig.1. It is necessary to separate different test scenarios from one test method and cluster the similar usages to remove redundancy."
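A sketch of the TextBefore/TextAfter idea, assuming the article has already been preprocessed into an ordered list of text sentences and in-lined code segments; the block representation and function name are hypothetical.

    def text_before_after_seeds(blocks):
        """blocks: ordered list of ("text", sentence) or ("code", segment_id)
        tuples. Returns segment_id -> list of adjacent candidate sentences."""
        seeds = {}
        for i, (kind, value) in enumerate(blocks):
            if kind != "code":
                continue
            if i > 0 and blocks[i - 1][0] == "text":
                seeds.setdefault(value, []).append(blocks[i - 1][1])  # TextBefore
            if i + 1 < len(blocks) and blocks[i + 1][0] == "text":
                seeds.setdefault(value, []).append(blocks[i + 1][1])  # TextAfter
        return seeds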
Identify Seeds: ContainsCodeIdentifiers Heuristic

[Fig. 5 example paragraph, as above — sentences containing code identifiers such as BasicAuthCache, cache, authScheme, and HttpHost are selected]
*****Code Segment appears here*****
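One plausible implementation sketch: extract identifier-like tokens from each code segment and select sentences that contain them. The tokenization rule here (camelCase, snake_case, or dotted names) is an assumption for illustration, not necessarily the paper's exact rule.

    import re

    def code_identifiers(code):
        """Identifier-like tokens from a code segment (assumed rule:
        camelCase, snake_case, or dotted names)."""
        tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", code)
        return {t for t in tokens
                if "_" in t or "." in t or re.search(r"[a-z][A-Z]", t)}

    def contains_code_identifiers(sentence, identifiers):
        """True if the sentence mentions any identifier from the segment."""
        return any(ident in sentence for ident in identifiers)

    # e.g. identifiers from the Fig. 5 test method would include
    # "BasicAuthCache" and "authScheme", so the sentences naming them
    # are selected and mapped to that code segment.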
Identify Seeds: ReferencesCodeByPosition Heuristic

[Fig. 5 example paragraph, as above — cue phrases such as "Their method invocation" indicate sentences describing the code]

"This code snippet obtains a user name (userName) by invoking request.getParameter("name") and uses it to construct a query to be passed to a database for execution (con.execute(query)). This seemingly innocent piece of code may allow an attacker to gain access to unauthorized information: if an attacker has full control of string userName obtained from an HTTP request, he can for example set it to ' OR 1 = 1;--. Two dashes are used to indicate comments in the Oracle dialect of SQL, so the WHERE clause of the query effectively becomes the tautology name = '' OR 1 = 1. This allows the attacker to circumvent the name check and get access to all user records in the database."
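A sketch using a small cue-phrase lexicon; the phrase list below is illustrative (drawn from the examples on this slide) and does not reproduce the tool's actual cue words.

    # Cue phrases that position a code segment relative to the text.
    CUE_PHRASES = ("this code snippet", "the following code",
                   "the code above", "this piece of code",
                   "their method invocation")

    def references_code_by_position(sentence):
        """True if the sentence contains a positional cue phrase."""
        lowered = sentence.lower()
        return any(cue in lowered for cue in CUE_PHRASES)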
Putting All Seed Heuristics Together

• Scoring sentences
  • Equal
  • Accuracy-based
• Threshold analysis

In the Fig. 5 example paragraph (above), each sentence accumulates a score from every heuristic that fires on it:
• ReferencesCodeFigure: score = 3
• ContainsCodeIdentifiers: score = 2
• ReferencesCodeByPosition: score = 2
• TextBefore: score = 1

*****Code Segment appears here*****
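A sketch of the combination step. The per-heuristic weights mirror the scores shown above under accuracy-based scoring (equal scoring would set every weight to 1), and the threshold value is an assumption — the paper selects it via threshold analysis.

    HEURISTIC_SCORES = {
        "ReferencesCodeFigure": 3,
        "ContainsCodeIdentifiers": 2,
        "ReferencesCodeByPosition": 2,
        "TextBefore": 1,
        "TextAfter": 1,
    }

    def seed_sentences(sentences, fired, threshold=2):
        """sentences: list of sentence strings; fired: dict mapping sentence
        index -> list of heuristic names that matched it. A sentence whose
        total score meets the threshold becomes a seed."""
        return [i for i in range(len(sentences))
                if sum(HEURISTIC_SCORES[h] for h in fired.get(i, []))
                   >= threshold]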
Identifying Neighboring Code-related Text

• Heuristic 1: At least one sentence in the paragraph is a seed
• Heuristic 2: At least 25%, 50%, or 75% of the sentences in the paragraph are seeds

In the Fig. 5 example paragraph (above), 5 out of 6 sentences (more than 75%) are seeds, so under both heuristics the whole paragraph is taken as a description.
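A sketch of both neighbor heuristics as a single parameterized check; the function and parameter names are hypothetical.

    def paragraph_is_description(seed_flags, min_fraction=0.0):
        """seed_flags: one boolean per sentence in the paragraph.
        Heuristic 1 corresponds to min_fraction=0.0 (any seed suffices);
        Heuristic 2 uses 0.25, 0.50, or 0.75."""
        n_seeds = sum(seed_flags)
        return n_seeds > 0 and n_seeds / len(seed_flags) >= min_fraction

For the Fig. 5 paragraph, 5 of the 6 flags are True, so the paragraph qualifies under Heuristic 1 and all three Heuristic 2 levels.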
Evaluation Methodology
• Research Question: How effective is our approach at automatically identifying code descriptions in the natural language text of research articles?
• Subjects: 100 code segments from ACM DL and IEEE Xplore journal and conference software engineering papers
• Gold Set: built by 10 human annotators (non-authors)
• Measures:
  • Overall code description identification: precision and recall
  • Seed identification: precision
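For clarity, a minimal sketch of how the measures are computed against the gold set; here predicted and gold are assumed to be sets of sentence identifiers for a given code segment.

    def precision_recall(predicted, gold):
        """predicted, gold: sets of sentence ids judged to describe a
        code segment. Returns (precision, recall)."""
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        return precision, recall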
Evaluation Results

Minimum # of Seeds | Precision (%) | Recall (%)
1-24%              | 39.05         | 70.20
>= 25%             | 53.41         | 50.33
>= 50%             | 66.04         | 28.45
>= 75%             | 68.30         | 20.53

Overall system effectiveness
Main Threats to Validity
• Unable to distinguish between pseudocode and code fragments
  • Mitigation: we selected papers with no pseudocode; we plan to extend the approach to identify both.
• Evaluation relies on human judges
  • Mitigation: judges had experience in programming and reading research papers, and each code segment was judged by at least two judges.
• Scaling to a larger evaluation set might lead to different results
  • Mitigation: we plan to expand the evaluation with more participants and research papers containing more code segments.
Empirical Study of Chat Communities as a Mining Source for SE Tasks
P. Chatterjee, M. A. Nishi, K. Damevski, V. Augustine, L. Pollock and N. A. Kraft, "Empirical
Study of Chat Communities as a Mining Source for SE Tasks," In Progress
Public Chats

Chat Community | # of Active Users per Week
Slack          | > 1 million
IRC            | > 1 million

https://blog.standuply.com/the-full-list-of-1000-slack-communities-2c412054ea30
https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1570&context=open_access_theses

Among the software-related document types studied (bug reports, emails, blog posts, Q&A forums, code reviews, documentation, e-books, research papers, course materials, presentations, benchmarks), we now focus on public chats.
Research Questions

Is the same information that is successfully mined from Q&A forums available and prevalent in chat communities?
What information is available in chat communities that is not typically found in Q&A forums, and how might this information be useful for SE tool improvement?
What characteristics of chat communities are similar to and different from those of Q&A forums?
How do the differences impact the transfer of automatic Q&A forum mining techniques to chat communities?
Methodology
• Characteristics of conversations?
• Length
• Noise
• Opinions
• Topics
• Characteristics of code examples?
• Prevalence
• Length
• Characteristics of code descriptions?
  • Functionality
  • API-related information
  • Errors and exceptions
• Software-specific terms?
  • Prevalence

Q&A chats and Q&A forums share a similar intent: learning and sharing information among developers.
Dataset

Slack:
• Access to data: request an API token from the admin to read and store data
• Download limit: the free tier stores only the most recent 10,000 messages, so scripts download data from each channel every day
Stack Overflow: download data with specific tags from the Stack Exchange data dump
Slack-Stack Overflow Comparison Dataset:
• LDA topic analysis on samples from specific communities (see the sketch below)
• Select data for topics strongly exhibited in both datasets
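A minimal sketch of the LDA selection step, using scikit-learn rather than whatever toolkit the study actually used; the topic count (20) and dominance cutoff (0.5) are assumptions, not the study's actual settings.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def dominant_topics(docs, n_topics=20, min_weight=0.5):
        """Fit LDA on a community's messages and keep (doc index, topic)
        pairs where one topic strongly dominates the document, so that
        topics shared across Slack and Stack Overflow can be matched."""
        vectorizer = CountVectorizer(stop_words="english", min_df=2)
        counts = vectorizer.fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        weights = lda.fit_transform(counts)  # per-document topic distribution
        return [(i, w.argmax()) for i, w in enumerate(weights)
                if w.max() >= min_weight]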
Slack Team | Slack Channel | # Messages | # Days | # Users
clojurians | clojure       | 14,126     | 107    | 576
elmlang    | beginners     | 40,186     | 146    | 976
elmlang    | general       | 28,840     | 145    | 788
pythondev  | help          | 17,880     | 146    | 622
racket     | general       | 5,196      | 156    | 73

Tag     | # Questions | # Answers
Clojure | 13,920      | 25,019
Elm     | 1,019       | 1,416
Python  | 806,763     | 1,270,948
Racket  | 3,592       | 5,692
Cost: Challenges in Mining Chats

Conversations:
• Interleaving
• Varied length
• Informal (incomplete sentences, emoji, colloquial terms)
Availability:
• Only available to people who joined the community channel; free tiers store only the most recent 10,000 messages
Quality:
• No metric to determine whether an answer is correct, as opposed to best-voted answers on Stack Overflow
• No easy way to determine duplicate questions
Topic of discussion:
• Topics vary from bug reports and requests for solutions to problems, to sharing of learning resources, to casual chat
• Finding information related to a specific programming query:
  • Slack: keyword search
  • Stack Overflow: Google search, keyword search, tag search, related questions
Summary: Mining Software Artifacts
• Availability of information
• Challenges and cost of mining information
• Automatic extraction of information
• Helping software developers and improving tools

Implications and Future Work
Quality of information:
• Designing quality measures
• Assessment of quality
• Improving quality


Editor's Notes

  • #2 Hello everyone, my name is Preetha Chatterjee and I am a PhD student working with Dr. Pollock in the research area of software engineering. Today, I am going to talk about our projects on mining code examples and their descriptive text from software artifacts, conducted in collaboration with ABB and Virginia Commonwealth University.
  • #3 TA: With the increased online sharing, code is now available in many places beyond repositories and documentation. (click) Software developers can learn from other developers by searching for examples and advice left by others in blog posts, Q&A forums, emails, bug reports, research papers, tutorials, and other online sites. TS: Now, let us take a look at how different resources of source code have been mined to develop software tools.
  • #4 Emails and bug reports are mined by researchers for re-documenting source code and recommending mentors in software projects. Information mined from online tutorials is used in API learning. Q&A forums such as Stack Overflow have been mined for a variety of software engineering tasks, including IDE recommendation, automatic generation of comments, and building thesauri of software-specific and commonly used terms in software engineering. Collectively, these prior works suggest that information embedded in all the aforementioned software-related documents could be used in building or improving software engineering tools. However, less work has focused on mining unconventional resources such as research articles and developer chats.
  • #5 TA: As others have shown in their research, code segments and their descriptions can be used to improve (read the 4 bullets).
  • #6 In today’s talk I am going to present an exploratory study we conducted to gather insights on the different types of information about code segments available in different types of documents, an approach to extract code segments and their descriptions from research articles, and an ongoing empirical study investigating the potential of developer chats as a mining resource for software engineering tasks.
  • #8 For the purpose of the study, we investigate 3 RQs, read bullets.
  • #9 This slide shows the 12 document types that we studied, some example sources from which we sampled the documents, and the unit of granularity that we studied for each document type. We collected 60 document instances, with at least two instances for each type. We randomly selected document instances from 51 distinct sources, all well-known, popular sites or high-profile projects. We excluded documents that had no code snippets.
  • #10 After selecting a set of documents of each document type, we redacted the code snippets, and asked human annotators to make as many observations about the missing code snippets strictly from the text and also to highlight the text on which they based their observations. We then developed a labeling scheme to code the observations. We analyzed both the codings and other information gained from the frequency and sizes of the code snippets across different document types. 35 annotators (20 undergraduate, 13 graduate, and 2 professional researchers, all with prior programming experience); 3 documents were assigned to each person.
  • #11 Here is an example of a document being analyzed from start to finish. In step 1, an annotator highlights the parts of the document related to the redacted code snippet and numbers each highlighted text segment. In step 2, the annotator writes and numbers observations for each highlight. In step 3, we determine a label, sub-label, and the phrases or words in the original highlighted text that cued the labeling.
  • #12 We defined eight major categories of labels, or codes, for the observed code properties. We further defined subcategories for each of these labels, which we call sublabels, to provide more detailed categorization of the observations for qualitative analysis.
  • #13 Let us look at a sample annotation from the document type documentation. Without analyzing the code example we can learn the following from the description.
  • #14 After coding the annotator responses into the discussed labels and sub-labels, we created a heatmap of the annotated data. This indicates that Explanatory information is prevalent across most of the document types, and is the dominant category for several document types, such as blog posts, documentation, e-books, papers, and public chats. This aligns with the fact that the main purpose of these document types is to explain aspects of the implementation in the code snippets. Another kind of information that shows up fairly often across different document types is design information, which includes programming language, framework, and time/space complexity of the code snippet. From the perspective of an individual document type, bug reports, code reviews, documentation, mailing lists, papers and public chats appear to contain the highest amount of diversity in information about code snippets, relative to other document types.
  • #15 The table shows the number of instances in the study (with a total of 60 document instances), the mean number of code snippets in the unit of analysis for that document type, the mean number of lines of code (LOC) in those code snippets, and the mean number of lines of natural language text in the unit of analysis for that document type. Blog posts, code reviews, research papers, and presentations have the highest numbers of code snippets (> 7). The mean length of code snippets varies from 8 lines of code in public chats and documentation to 47 lines of code in mailing lists. In the case of mailing lists, developers often included entire classes or methods to provide context. Many of the document types contain code snippets in the range of 10 to 13 lines of code. As expected, research papers are outliers in size of natural language text. Benchmarks, blog posts, and code reviews provide from 65 to 88 mean lines of text, while all the other document types range between 12 and 33 lines of text. The results of this study motivated us to look into less conventional mining resources, which we discuss in the following slides.
  • #17 Among all the resources for code segments we discussed before, research articles are one example of a less conventional resource for mining. (click) In this paper, we look at the potential for learning from code examples in research papers, as there are now large libraries of them available online. (click) Notice there are over 3,000,000 articles in IEEE Xplore, with around 20,000 being added each month. Among these, 70% of the papers contain an average of 3-4 code segments.
  • #18 TA: Here, we show an example to demonstrate the kinds of information that can be gained with respect to a code example in a research paper. Without analyzing the code, we can learn: (go through the information as animation goes.) Considerable information is gained from just a paragraph.
  • #19 TA: Previous work on extracting code segments focuses on extraction from developer emails, bug reports, and StackOverflow. Several of these researchers have developed techniques to identify code segments in unstructured documents such as these, where the code segments are intermingled with natural language text. Thus, in this paper, we focus on identifying the natural language text in the document that is associated with a particular identified code segment.
  • #20 TA: In particular, this paper’s contributions include: (read bullets) Based on a manual observation of research papers, we divide the problem of identification into two subproblems (click): Identifying seeds, which are text sentences able to be identified through direct clues, and identifying neighboring sentences that also provide information about the code, without direct clues for their relation to the code.
  • #21 TA: Here is the overall process of our approach, which we call CoDesNPub Miner. First, the research articles need to be preprocessed to identify the code segments, which are sometimes in figures and sometimes in-lined with the text. The output is an XML representation of the article with the text and code tagged. This processing uses existing techniques. Our main work, the code description identification, is based on a set of heuristics that we developed based on manual observation of many examples. The output is an XML representation with each code segment and its related text. TS: Let us look at the heuristics in detail.
  • #22 TA: The simplest heuristic, References_Code_Figure, identifies sentences that contain the word “figure” or “listing”. Then it uses the figure or listing number to check whether the referenced figure has been classified as code. The preprocessing has already identified whether a given figure contains a code segment.
  • #23 TA: The Text_Before and Text_After heuristics identify the sentences immediately before and immediately after any in-lined code segment as potential code descriptions. This heuristic assumes that an in-lined code segment is described just before or after the code. Some constant number of sentences can be used here.
  • #24 TA: The Contains_Code_Identifiers heuristic identifies all the sentences that contain a string that also appears in any of the code segments as a code identifier. The heuristic maps the sentence to the appropriate code segment based on the identifier.
  • #25 TA: The References_Code_By_Position heuristic identifies the sentences that have specific cue words or phrases that suggest that a sentence is describing a code segment in the document. For example, you see "Their method invocation" and "this code snippet" which are phrases that indicate the sentences are describing code. TS: These are the four current heuristics that we use to identify and map sentences directly related to code segments in research papers. Now let us look at one example of a paragraph of text extracted using all the heuristics.
  • #26 TA: This paragraph shows sentences extracted individually using all the heuristics. Note that a given sentence may be identified by more than one heuristic. For example, a given sentence might contain more than one identifier, which is a strong indication that it is describing something about a given code segment. TS: Thus, we combine the heuristics by assigning a score to a sentence each time a heuristic indicates that it is potentially a code description. All seeds are identified by performing the heuristics, computing the sentence scores, and using a threshold. We performed a threshold analysis to determine the best threshold to use with the scores. TA: We investigated two possible approaches to using scores for sentences. Equal scores: all cues are treated as equally contributing to the potential for a sentence to be a code description. Accuracy-based scores: different scores are assigned to each instance of different heuristics on the basis of our observations of relative accuracy during our work with the development set.
  • #27 TA: Next, we examine whether neighboring sentences are also code descriptions. If a part of a paragraph of text contains sentences identified by the heuristics to describe a code segment, then the entire paragraph often describes the code segment extensively. However, not every paragraph with an identified seed sentence is a code description in its entirety. Therefore, we explored several thresholds needed to consider the whole paragraph as code description text. In our example, we would capture the whole paragraph under both heuristics: heuristic one only needs one seed, and the paragraph also meets the 75% threshold for heuristic two, with 5 out of 6 sentences being seeds. TS: The seed heuristics, scoring and threshold were combined with the neighbor heuristics to provide a complete system for automatic identification. Next, we evaluate how the variations work.
  • #28 TA: To evaluate the effectiveness of our approach, we took a random set of 100 code segments from research articles. We then created a gold set with human annotators who annotated 745 sentences, ran our tool, and computed the overall precision and recall of the code description identification. We also computed the precision of the seed identification. We do not compute recall of the seed identification because we did not want to reveal details of our approach to the human annotators and bias them.
  • #29 TA: As part of evaluating the effectiveness, we considered several configurations for code description identification: Based on these results, we can conclude that for the overall system effectiveness we achieve the highest precision of around 68% with the minimum no. of seeds >=75%. However the recall falls to 21%. If you are interested to have a more balanced configuration, you can choose the minimum no. of seeds to be >=25% to achieve each of the precision and recall at around 50%. Note that we are most interested in higher precision than recall because we want the identified descriptions to indeed be descriptive, whereas missing some descriptions is not critical. This is the first system of this kind, so we have no comparison. We refer you to our paper to look for examples from our qualitative analysis to gain more insight on how our system works.
  • #30 TA: research articles rarely contain overly complex code examples, since they mostly describe novel ways to address a problem rather than going into the details of code complexity. We see that a wide variety of information can be gained from descriptions associated with code segments in digital libraries. The most significant information we can retrieve from research articles is “Methodology”, which describes how a piece of code is implemented. The other significant ones are the Rationale, Data Structure and the Control Flow.
  • #31 Next, we examined how authors typically reference code segments within their code description text in research articles. TA: References Figure Containing Code is the most prevalent heuristic. The next most prevalent heuristic is Neighboring Code-related Text, which helps in identifying sentences that describe less obvious details about code segments.
  • #32 TA: Finally there are a number of potential threats to validity to our evaluation and other data collection. The main threats are on this slide, we tried to mitigate them or plan to improve on how we address them in the future work of the paper.
  • #34 Beyond research papers, we investigate another less conventional resource for mining for software engineering tasks. (click) In this paper, we look at the potential for chats to be a mining resource, since developers increasingly coordinate and collaborate using different chat communities such as Slack, IRC, HipChat, Flowdock, etc. (click) Notice there are over 1 million active users per week on Slack and IRC.
  • #35 We report on an exploratory study to compare the content in Q&A-focused public chat communities (e.g., Slack) with Q&A-based discussion forums (e.g., Stack Overflow), since both resources are believed to share a similar intent of learning and sharing information among developers. We explore the availability and prevalence of information in chat messages, which provides us with the first insight into the prospect of chat communities as a source of mining. Additionally, we compared the characteristics and the different types of information present in Q&A forums vs. chat communities, to investigate the feasibility of applying automatic information extraction techniques previously developed for Q&A forums on chat communities.
  • #36 To investigate the RQs discussed before, we dig deeper to analyze: (1) characteristics of embedded code snippets, (2) characteristics of text describing the code snippets, (3) characteristics of conversations, and (4) prevalence of software-specific terms in the natural language text overall.
  • #37 Determining the granularity of data samples from Stack Overflow and Slack for comparison is challenging due to differences between the two forums. So, the goal is to get enough data for analysis based on topics that are shared between the two forums and are also popular topics.