Mining Code Examples with Descriptive Text from Software Artifacts

Preetha Chatterjee
PhD Student
preethac@udel.edu
http://sites.udel.edu/preethac/

Lori Pollock
Professor
pollock@udel.edu
https://www.eecis.udel.edu/~pollock/

Collaborators:
Kostadin Damevski, Manziba Akanda Nishi (Virginia Commonwealth University)
Nicholas A. Kraft, Vinay Augustine (ABB Corporate Research)
Benjamin Gause, Hunter Hedinger (University of Delaware)
Code is everywhere!
Mining Code Segments for Software Tools
Emails and bug reports:
• Re-documenting source code [Panichella 2012]
• Recommending mentors in software projects [Canfora 2012]
Tutorials: API learning [Jiang 2017, Petrosyan 2015]
Q&A forums:
• IDE recommendation [DeSouza 2014, Rahman 2014, Cordeiro 2012, Ponzanelli 2014, Bacchelli 2012, Amintaber 2015]
• Learning and recommendation of APIs [Chen 2016, Rahman 2016, Wang 2013]
• Automatic generation of comments for source code [Wong 2013, Rahman 2015]
• Building thesauri of software-specific terms [Tian 2014, Chen 2017]
Research Articles?
Chats?
Today’s Talk

An Exploratory Study: What Information about Code Snippets Is Available in Different Software-Related Documents?
Case Study: Extracting Code Segments and Their Descriptions from Research Articles
Empirical Study: Chat Communities as a Mining Source for SE Tasks
An Exploratory Study: What Information about Code Snippets Is Available in Different Software-Related Documents?
P. Chatterjee, M. A. Nishi, K. Damevski, V. Augustine, L. Pollock and N. A. Kraft,
"What information about code snippets is available in different software-related
documents? An exploratory study," 2017 IEEE 24th International Conference on Software
Analysis, Evolution and Reengineering (SANER), Klagenfurt, 2017, pp. 382-386.
Research Questions

What kinds of information about the embedded code snippets are available in different document types?
What are the characteristics of the code snippets embedded in different types of software-related documents?
What cues indicate code-snippet-related information? How do the cues differ across document types?
Software-Related Documents

Document Type    | Example Origin
Benchmarks       | OpenMP, NAS
Blog Posts       | MSDN, WordPress
Bug Repositories | GitHub Issues, Bugzilla
Code Reviews     | GitHub Pull Requests, Gerrit
Course Materials | cs.*.edu
Documentation    | readthedocs.org
E-Books          | WikiBooks
Mailing Lists    | lkml.org
Presentations    | SlideShare, Speaker Deck
Public Chats     | Gitter, Slack
Q&A Forums       | StackOverflow, MSDN
Research Papers  | IEEE Xplore, arxiv.org
Videos           | YouTube
Study Methodology
Annotation Example
Labels and Sub-Labels

Label: Sub-Labels
Explanatory: Rationale, Functionality, Methodology, Output of Code
Similarity: Modification, Origin
Structure: Data Structure, Control Flow, Data Flow, Lines of Code
Design: Programming Language, Framework, Time/Space Complexity
Efficiency: Efficient, Inefficient
Assumptions
Testing
Clarity: High, Low
Erroneous: Compilation, Runtime
Example: Documentation

Observation 1: Code to perform a 3D Fourier Transform
Labels/Sub-labels: Explanatory (Functionality)

Observation 2: Code uses a multidimensional array
Labels/Sub-labels: Structure (Data Structure)
Frequency of occurrence of different kinds of information for the code snippets by document type
Code Snippet and Description Availability by Document Type

Document Type    | # Docs | Mean # Code Snippets | Mean # LOC per Snippet | Mean # Lines of Text
Benchmarks       | 2      | 4.5                  | 13.0                   | 86.7
Blog Posts       | 10     | 8.6                  | 13.1                   | 88.3
Bug Reports      | 6      | 2.7                  | 20.1                   | 17.2
Code Reviews     | 7      | 7.1                  | 33.3                   | 64.9
Course Materials | 3      | 3.0                  | 12.8                   | 12.6
Documentation    | 6      | 3.5                  | 7.8                    | 22.8
E-books          | 5      | 2.6                  | 21.4                   | 37.4
Mailing Lists    | 5      | 1.6                  | 46.6                   | 17.6
Papers           | 5      | 8.6                  | 10.3                   | 439.9
Presentations    | 3      | 8.3                  | 11.6                   | 18.7
Public Chat      | 5      | 1.2                  | 8.3                    | 15.6
Q&A Sites        | 3      | 4.7                  | 12.6                   | 32.8
Total            | 60     |                      |                        |
Case Study: Extracting Code Segments and Their Descriptions from Research Articles
P. Chatterjee, B. Gause, H. Hedinger and L. Pollock, "Extracting Code Segments and Their
Descriptions from Research Articles," 2017 IEEE/ACM 14th International Conference on
Mining Software Repositories (MSR), Buenos Aires, 2017, pp. 91-101.
Among the software-related document types studied (bug reports, emails, blog posts, Q&A forums, code reviews, documentation, e-books, course materials, presentations, public chats, benchmarks), we focus here on research papers.

Research Papers
DL          | Domain                  | # of Articles
ACM DL      | Computer Science        | > 300,000
IEEE Xplore | Computer Science        | > 3,500,000
DBLP        | Mostly Computer Science | > 3,729,582

https://en.wikipedia.org/wiki/IEEE_Xplore
https://cacm.acm.org/magazines/2011/7/109905-acm-aggregates-publication-statistics-in-the-acm-digital-library/fulltext
http://dblp.uni-trier.de/

70% of the articles contain one or more code segments, with an average of 3-4 code segments per article.
What information can be learned from the text?

"To understand the difficulty of fixing a memory leak, let us take a look at an example program in Fig. 1. This is a contrived example mimicking recurring leak patterns we found in real C programs. Procedure check_records checks whether there is any bad records in a large file, and the caller could either check all records, or specify a search condition to check only part of records. In this example, both get_next and search_for_next will allocate and return a heap structure, which is expected to be freed at line 12. However, the execution may break out the loop at line 10, causing a memory leak."

Without analyzing the code, we can learn:
• Indication of a problem in the code
• The functionality of the code
• Individual method functionalities
• The type of data structure used
• The cause of the issue presented earlier
• The programming language
Challenges

Identify code segments in unstructured documents — addressed by prior work:
• Bacchelli et al. (ICPC’10)
• Tang et al. (KDD’05)
• Bettenburg et al. (MSR’08)
• Subramanian et al. (MSR’13)
• Rigby et al. (ICSE’13)

Our focus: identify the text describing each code segment (e.g., the memory-leak paragraph above).
Contributions

• Automatically identifying and mapping text describing code segments in research articles
• A prototype, CoDesNPub Miner, that outputs an XML-based representation associating code segments with their descriptions
• Evaluation of the effectiveness of code description identification techniques
  • Seeds
  • Neighbors
Overview of CoDesNPub Miner
Code Description Identification
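To make the pipeline's output concrete, here is a minimal sketch of how an XML representation associating each code segment with its description sentences could be assembled. The element and attribute names are illustrative assumptions, not CoDesNPub Miner's actual schema.

    import xml.etree.ElementTree as ET

    def build_output(article_id, segments):
        """segments: list of (code_text, [description sentences]) pairs.
        Returns an XML string pairing each code segment with its text.
        Tag names here are hypothetical, not the tool's real schema."""
        root = ET.Element("article", id=article_id)
        for idx, (code, descriptions) in enumerate(segments, start=1):
            seg = ET.SubElement(root, "codeSegment", id=str(idx))
            ET.SubElement(seg, "code").text = code
            desc = ET.SubElement(seg, "description")
            for sentence in descriptions:
                ET.SubElement(desc, "sentence").text = sentence
        return ET.tostring(root, encoding="unicode")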
Identify Seeds: ReferencesCodeFigure Heuristic

"Fig.5 shows a typical test method of this pattern. The method tests a set of basic functionality of API class BasicAuthCache, including the method put, get, remove and clear. There are three test scenarios in the method: line 4-5, line 6-7, line 8-10. They share two data objects, cache and authScheme. Their method invocation sequences are not same and there is no unified test target method. But there is a common subsequence among three method invocation sequences, i.e., the invocations of get and HttpHost."

*****Code Segment appears here*****

"Listing 9 shows an example of three statements that were single statement blocks after the first phases, but can be merged into a single block because they have similar RHSs."

*****Code Segment appears here*****
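A minimal sketch of how this heuristic could be implemented: scan each sentence for figure/listing references and keep those that point at a figure the preprocessor already classified as code. The regular expression and the normalized figure keys are assumptions for illustration, not the tool's actual code.

    import re

    def references_code_figure(sentence, code_figures):
        """Return the (kind, number) of a referenced code figure, or None.
        code_figures: set of normalized keys, e.g. {("fig", 5)}, built by
        the preprocessing step that classifies figures as code."""
        for kind, number in re.findall(
                r"\b(Fig(?:ure)?|Listing)\.?\s*(\d+)", sentence, re.IGNORECASE):
            key = ("fig" if kind.lower().startswith("fig") else "listing",
                   int(number))
            if key in code_figures:
                return key
        return None

    # e.g. references_code_figure("Fig.5 shows a typical test method ...",
    #                             {("fig", 5)})  ->  ("fig", 5)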
Identify Seeds: TextBefore and TextAfter Heuristic

[Fig. 5 example paragraph, as above]
*****Code Segment appears here*****

*****Code Segment appears here*****
"A major obstacle to extracting API examples from test code is the multiple test scenarios in a test method. Fig. 1 depicts such a test method. Lines 2-4 are the declaration of some data objects. Lines 5-13 depict a test scenario that contains the usage of some API methods, such as keySetByValue, put, and getKey. Lines 14-22 depict another test scenario, which contains a similar usage to the previous one. Such multiple test scenarios are quite reasonable when aiming at covering testing input domains. But they bring redundant code for API users to read. In fact, there are actually 200+ code lines containing similar test scenarios in the test method in Fig.1. It is necessary to separate different test scenarios from one test method and cluster the similar usages to remove redundancy."
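A sketch of the TextBefore/TextAfter idea, assuming the article has already been preprocessed into an ordered list of text sentences and in-lined code segments; the block representation and function name are hypothetical.

    def text_before_after_seeds(blocks):
        """blocks: ordered list of ("text", sentence) or ("code", segment_id)
        tuples. Returns segment_id -> list of adjacent candidate sentences."""
        seeds = {}
        for i, (kind, value) in enumerate(blocks):
            if kind != "code":
                continue
            if i > 0 and blocks[i - 1][0] == "text":
                seeds.setdefault(value, []).append(blocks[i - 1][1])  # TextBefore
            if i + 1 < len(blocks) and blocks[i + 1][0] == "text":
                seeds.setdefault(value, []).append(blocks[i + 1][1])  # TextAfter
        return seeds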
Identify Seeds: ContainsCodeIdentifiers Heuristic

[Fig. 5 example paragraph, as above — sentences containing code identifiers such as BasicAuthCache, cache, authScheme, and HttpHost are selected]
*****Code Segment appears here*****
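One plausible implementation sketch: extract identifier-like tokens from each code segment and select sentences that contain them. The tokenization rule here (camelCase, snake_case, or dotted names) is an assumption for illustration, not necessarily the paper's exact rule.

    import re

    def code_identifiers(code):
        """Identifier-like tokens from a code segment (assumed rule:
        camelCase, snake_case, or dotted names)."""
        tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", code)
        return {t for t in tokens
                if "_" in t or "." in t or re.search(r"[a-z][A-Z]", t)}

    def contains_code_identifiers(sentence, identifiers):
        """True if the sentence mentions any identifier from the segment."""
        return any(ident in sentence for ident in identifiers)

    # e.g. identifiers from the Fig. 5 test method would include
    # "BasicAuthCache" and "authScheme", so the sentences naming them
    # are selected and mapped to that code segment.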
Identify Seeds: ReferencesCodeByPosition Heuristic

[Fig. 5 example paragraph, as above — cue phrases such as "Their method invocation" indicate sentences describing the code]

"This code snippet obtains a user name (userName) by invoking request.getParameter("name") and uses it to construct a query to be passed to a database for execution (con.execute(query)). This seemingly innocent piece of code may allow an attacker to gain access to unauthorized information: if an attacker has full control of string userName obtained from an HTTP request, he can for example set it to ' OR 1 = 1;--. Two dashes are used to indicate comments in the Oracle dialect of SQL, so the WHERE clause of the query effectively becomes the tautology name = '' OR 1 = 1. This allows the attacker to circumvent the name check and get access to all user records in the database."
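A sketch using a small cue-phrase lexicon; the phrase list below is illustrative (drawn from the examples on this slide) and does not reproduce the tool's actual cue words.

    # Cue phrases that position a code segment relative to the text.
    CUE_PHRASES = ("this code snippet", "the following code",
                   "the code above", "this piece of code",
                   "their method invocation")

    def references_code_by_position(sentence):
        """True if the sentence contains a positional cue phrase."""
        lowered = sentence.lower()
        return any(cue in lowered for cue in CUE_PHRASES)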
Putting All Seed Heuristics Together

• Scoring sentences
  • Equal
  • Accuracy-based
• Threshold analysis

In the Fig. 5 example paragraph (above), each sentence accumulates a score from every heuristic that fires on it:
• ReferencesCodeFigure: score = 3
• ContainsCodeIdentifiers: score = 2
• ReferencesCodeByPosition: score = 2
• TextBefore: score = 1

*****Code Segment appears here*****
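A sketch of the combination step. The per-heuristic weights mirror the scores shown above under accuracy-based scoring (equal scoring would set every weight to 1), and the threshold value is an assumption — the paper selects it via threshold analysis.

    HEURISTIC_SCORES = {
        "ReferencesCodeFigure": 3,
        "ContainsCodeIdentifiers": 2,
        "ReferencesCodeByPosition": 2,
        "TextBefore": 1,
        "TextAfter": 1,
    }

    def seed_sentences(sentences, fired, threshold=2):
        """sentences: list of sentence strings; fired: dict mapping sentence
        index -> list of heuristic names that matched it. A sentence whose
        total score meets the threshold becomes a seed."""
        return [i for i in range(len(sentences))
                if sum(HEURISTIC_SCORES[h] for h in fired.get(i, []))
                   >= threshold]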
Identifying Neighboring Code-related Text

• Heuristic 1: At least one sentence in the paragraph is a seed
• Heuristic 2: At least 25%, 50%, or 75% of the sentences in the paragraph are seeds

In the Fig. 5 example paragraph (above), 5 out of 6 sentences (more than 75%) are seeds, so under both heuristics the whole paragraph is taken as a description.
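A sketch of both neighbor heuristics as a single parameterized check; the function and parameter names are hypothetical.

    def paragraph_is_description(seed_flags, min_fraction=0.0):
        """seed_flags: one boolean per sentence in the paragraph.
        Heuristic 1 corresponds to min_fraction=0.0 (any seed suffices);
        Heuristic 2 uses 0.25, 0.50, or 0.75."""
        n_seeds = sum(seed_flags)
        return n_seeds > 0 and n_seeds / len(seed_flags) >= min_fraction

For the Fig. 5 paragraph, 5 of the 6 flags are True, so the paragraph qualifies under Heuristic 1 and all three Heuristic 2 levels.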
Evaluation Methodology
• Research Question: How effective is our approach at automatically identifying code descriptions in the natural language text of research articles?
• Subjects: 100 code segments from ACM DL and IEEE Xplore journal and conference software engineering papers
• Gold Set: built by 10 human annotators (non-authors)
• Measures:
  • Overall code description identification: precision and recall
  • Seed identification: precision
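For clarity, a minimal sketch of how the measures are computed against the gold set; here predicted and gold are assumed to be sets of sentence identifiers for a given code segment.

    def precision_recall(predicted, gold):
        """predicted, gold: sets of sentence ids judged to describe a
        code segment. Returns (precision, recall)."""
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        return precision, recall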
Evaluation Results

Minimum # of Seeds | Precision (%) | Recall (%)
1-24%              | 39.05         | 70.20
>= 25%             | 53.41         | 50.33
>= 50%             | 66.04         | 28.45
>= 75%             | 68.30         | 20.53

Overall system effectiveness
Main Threats to Validity
• Unable to distinguish between pseudocode and code fragments
  • Mitigation: we selected papers with no pseudocode; we plan to extend the approach to identify both.
• Evaluation relies on human judges
  • Mitigation: judges had experience in programming and reading research papers, and each code segment was judged by at least two judges.
• Scaling to a larger evaluation set might lead to different results
  • Mitigation: we plan to expand the evaluation with more participants and research papers containing more code segments.
Empirical Study of Chat Communities as a Mining Source for SE Tasks
P. Chatterjee, M. A. Nishi, K. Damevski, V. Augustine, L. Pollock and N. A. Kraft, "Empirical
Study of Chat Communities as a Mining Source for SE Tasks," In Progress
Public Chats

Chat Community | # of Active Users per Week
Slack          | > 1 million
IRC            | > 1 million

https://blog.standuply.com/the-full-list-of-1000-slack-communities-2c412054ea30
https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1570&context=open_access_theses

Among the software-related document types studied (bug reports, emails, blog posts, Q&A forums, code reviews, documentation, e-books, research papers, course materials, presentations, benchmarks), we now focus on public chats.
Research Questions

Is the same information that is successfully mined from Q&A forums available and prevalent in chat communities?
What information is available in chat communities that is not typically found in Q&A forums, and how might this information be useful for SE tool improvement?
What characteristics of chat communities are similar to and different from those of Q&A forums?
How do the differences impact the transfer of automatic Q&A forum mining techniques to chat communities?
Methodology
• Characteristics of conversations?
• Length
• Noise
• Opinions
• Topics
• Characteristics of code examples?
• Prevalence
• Length
• Characteristics of code descriptions?
  • Functionality
  • API-related information
  • Errors and exceptions
• Software-specific terms?
  • Prevalence

Q&A chats and Q&A forums share a similar intent: learning and sharing information among developers.
Dataset

Slack:
• Access to data: request an API token from the admin to read and store data
• Download limit: the free tier stores only the most recent 10,000 messages, so scripts download data from each channel every day
Stack Overflow: download data with specific tags from the Stack Exchange data dump
Slack-Stack Overflow Comparison Dataset:
• LDA topic analysis on samples from specific communities (see the sketch below)
• Select data for topics strongly exhibited in both datasets
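A minimal sketch of the LDA selection step, using scikit-learn rather than whatever toolkit the study actually used; the topic count (20) and dominance cutoff (0.5) are assumptions, not the study's actual settings.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def dominant_topics(docs, n_topics=20, min_weight=0.5):
        """Fit LDA on a community's messages and keep (doc index, topic)
        pairs where one topic strongly dominates the document, so that
        topics shared across Slack and Stack Overflow can be matched."""
        vectorizer = CountVectorizer(stop_words="english", min_df=2)
        counts = vectorizer.fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        weights = lda.fit_transform(counts)  # per-document topic distribution
        return [(i, w.argmax()) for i, w in enumerate(weights)
                if w.max() >= min_weight]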
Slack Team | Slack Channel | # Messages | # Days | # Users
clojurians | clojure       | 14,126     | 107    | 576
elmlang    | beginners     | 40,186     | 146    | 976
elmlang    | general       | 28,840     | 145    | 788
pythondev  | help          | 17,880     | 146    | 622
racket     | general       | 5,196      | 156    | 73

Tag     | # Questions | # Answers
Clojure | 13,920      | 25,019
Elm     | 1,019       | 1,416
Python  | 806,763     | 1,270,948
Racket  | 3,592       | 5,692
Cost: Challenges in Mining Chats

Conversations:
• Interleaving
• Varied length
• Informal (incomplete sentences, emoji, colloquial terms)
Availability:
• Only available to people who joined the community channel; free tiers store only the most recent 10,000 messages
Quality:
• No metric to determine whether an answer is correct, as opposed to best-voted answers on Stack Overflow
• No easy way to determine duplicate questions
Topic of discussion:
• Topics vary from bug reports and requests for solutions to problems, to sharing of learning resources, to casual chat
• Finding information related to a specific programming query:
  • Slack: keyword search
  • Stack Overflow: Google search, keyword search, tag search, related questions
Summary: Mining Software Artifacts
• Availability of information
• Challenges and cost of mining information
• Automatic extraction of information
• Helping software developers and improving tools

Implications and Future Work
Quality of information:
• Designing quality measures
• Assessment of quality
• Improving quality


Editor's Notes

  • #2 Hello everyone, my name is Preetha Chatterjee and I am a PhD student working with Dr. Pollock in the research area of software engineering. Today, I am going to talk about our projects on mining code examples and their descriptive text from software artifacts, conducted in collaboration with ABB and Virginia Commonwealth University.
  • #3 TA: With the increased online sharing, code is now available in many places beyond repositories and documentation. (click) Software developers can learn from other developers by searching for examples and advice left by others in blog posts, Q&A forums, emails, bug reports, research papers, tutorials, and other online sites. TS: Now, let us take a look at how different resources of source code have been mined to develop software tools.
  • #4 Emails and bug reports are mined by researchers for re-documenting source code and recommending mentors in software projects. Information mined from online tutorials is used in API learning. Q&A forums such as Stack Overflow have been mined for a variety of software engineering tasks, including IDE recommendation, automatic generation of comments, and building thesauri of software-specific and commonly used terms in software engineering. Collectively, these prior works suggest that information embedded in all the aforementioned software-related documents could be used in building or improving software engineering tools. However, less work has focused on mining unconventional resources such as research articles and developer chats.
  • #5 TA: As others have shown in their research, code segments and their descriptions can be used to improve (read the 4 bullets).
  • #6 In today’s talk I am going to present an exploratory study we conducted to gather insights on the different types of information about code segments available in different types of documents, an approach to extract code segments and their descriptions from research articles, and an ongoing empirical study investigating the potential of developer chats as a mining resource for software engineering tasks.
  • #8 For the purpose of the study, we investigate 3 RQs, read bullets.
  • #9 This slide shows the 12 document types that we studied, some example sources from which we sampled the documents, and the unit of granularity that we studied for each document type. We collected 60 document instances, with at least two instances for each type. We randomly selected document instances from 51 distinct sources, all well-known, popular sites or high-profile projects. We excluded documents that had no code snippets.
  • #10 After selecting a set of documents of each document type, we redacted the code snippets, and asked human annotators to make as many observations about the missing code snippets strictly from the text and also to highlight the text on which they based their observations. We then developed a labeling scheme to code the observations. We analyzed both the codings and other information gained from the frequency and sizes of the code snippets across different document types. 35 annotators (20 undergraduate, 13 graduate, and 2 professional researchers, all with prior programming experience); 3 documents were assigned to each person.
  • #11 Here is an example of a document being analyzed from start to finish. In step 1, an annotator highlights the parts of the document related to the redacted code snippet and numbers each highlighted text segment. In step 2, the annotator writes and numbers observations for each highlight. In step 3, we determine a label, sub-label, and the phrases or words in the original highlighted text that cued the labeling.
  • #12 We defined eight major categories of labels, or codes, for the observed code properties. We further defined subcategories for each of these labels, which we call sublabels, to provide more detailed categorization of the observations for qualitative analysis.
  • #13 Let us look at a sample annotation from the document type documentation. Without analyzing the code example we can learn the following from the description.
  • #14 After coding the annotator responses into the discussed labels and sub-labels, we created a heatmap of the annotated data. This indicates that Explanatory information is prevalent across most of the document types, and is the dominant category for several document types, such as blog posts, documentation, e-books, papers, and public chats. This aligns with the fact that the main purpose of these document types is to explain aspects of the implementation in the code snippets. Another kind of information that shows up fairly often across different document types is design information, which includes programming language, framework, and time/space complexity of the code snippet. From the perspective of an individual document type, bug reports, code reviews, documentation, mailing lists, papers and public chats appear to contain the highest amount of diversity in information about code snippets, relative to other document types.
  • #15 The table shows the number of instances in the study (with a total of 60 document instances), the mean number of code snippets in the unit of analysis for that document type, the mean number of lines of code (LOC) in those code snippets, and the mean number of lines of natural language text in the unit of analysis for that document type. Blog posts, code reviews, research papers, and presentations have the highest numbers of code snippets (> 7). The mean length of code snippets varies from 8 lines of code in public chats and documentation to 47 lines of code in mailing lists. In the case of mailing lists, developers often included entire classes or methods to provide context. Many of the document types contain code snippets in the range of 10 to 13 lines of code. As expected, research papers are outliers in size of natural language text. Benchmarks, blog posts, and code reviews provide from 65 to 88 mean lines of text, while all the other document types range between 12 and 33 lines of text. The results of this study motivated us to look into less conventional mining resources, which we discuss in the following slides.
  • #17 Among all the resources for code segments we discussed before, research articles are one example of a less conventional resource for mining. (click) In this paper, we look at the potential for learning from code examples in research papers, as there are now large libraries of them available online. (click) Notice there are over 3,000,000 articles in IEEE Xplore, with around 20,000 being added each month. Among these, 70% of the papers contain an average of 3-4 code segments.
  • #18 TA: Here, we show an example to demonstrate the kinds of information that can be gained with respect to a code example in a research paper. Without analyzing the code, we can learn: (go through the information as animation goes.) Considerable information is gained from just a paragraph.
  • #19 TA: Previous work on extracting code segments focuses on extraction from developer emails, bug reports, and StackOverflow. Several of these researchers have developed techniques to identify code segments in unstructured documents such as these, where the code segments are intermingled with natural language text. Thus, in this paper, we focus on identifying the natural language text in the document that is associated with a particular identified code segment.
  • #20 TA: In particular, this paper’s contributions include: (read bullets) Based on a manual observation of research papers, we divide the problem of identification into two subproblems (click): Identifying seeds, which are text sentences able to be identified through direct clues, and identifying neighboring sentences that also provide information about the code, without direct clues for their relation to the code.
  • #21 TA: Here is the overall process of our approach, which we call CoDesNPub Miner. First, the research articles need to be preprocessed to identify the code segments, which are sometimes in figures and sometimes in-lined with the text. The output is an XML representation of the article with the text and code tagged. This processing uses existing techniques. Our main work, the code description identification, is based on a set of heuristics that we developed based on manual observation of many examples. The output is an XML representation with each code segment and its related text. TS: Let us look at the heuristics in detail.
  • #22 TA: The simplest heuristic, References_Code_Figure, identifies sentences that contain the word “figure” or “listing”. Then it uses the figure or listing number to check whether the referenced figure has been classified as code. The preprocessing has already identified whether a given figure contains a code segment.
  • #23 TA: The Text_Before and Text_After heuristics identify the sentences immediately before and immediately after any in-lined code segment as potential code descriptions. This heuristic assumes that an in-lined code segment is described just before or after the code. Some constant number of sentences can be used here.
  • #24 TA: The Contains_Code_Identifiers heuristic identifies all the sentences that contain a string that also appears in any of the code segments as a code identifier. The heuristic maps the sentence to the appropriate code segment based on the identifier.
  • #25 TA: The References_Code_By_Position heuristic identifies the sentences that have specific cue words or phrases that suggest that a sentence is describing a code segment in the document. For example, you see "Their method invocation" and "this code snippet" which are phrases that indicate the sentences are describing code. TS: These are the four current heuristics that we use to identify and map sentences directly related to code segments in research papers. Now let us look at one example of a paragraph of text extracted using all the heuristics.
  • #26 TA: This paragraph shows sentences extracted individually using all the heuristics. Note that a given sentence may be identified by more than one heuristic. For example, a given sentence might contain more than one identifier, which is a strong indication that it is describing something about a given code segment. TS: Thus, we combine the heuristics by assigning a score to a sentence each time a heuristic indicates that it is potentially a code description. All seeds are identified by performing the heuristics, computing the sentence scores, and using a threshold. We performed a threshold analysis to determine the best threshold to use with the scores. TA: We investigated two possible approaches to using scores for sentences. Equal scores: all cues are treated as equally contributing to the potential for a sentence to be a code description. Accuracy-based scores: different scores are assigned to each instance of different heuristics on the basis of our observations of relative accuracy during our work with the development set.
  • #27 TA: Next, we examine whether neighboring sentences are also code descriptions. If a part of a paragraph of text contains sentences identified by the heuristics to describe a code segment, then the entire paragraph often describes the code segment extensively. However, not every paragraph with an identified seed sentence is a code description in its entirety. Therefore, we explored several thresholds needed to consider the whole paragraph as code description text. In our example, we would capture the whole paragraph under both heuristics: heuristic one only needs one seed, and the paragraph also meets the 75% threshold for heuristic two, with 5 out of 6 sentences being seeds. TS: The seed heuristics, scoring and threshold were combined with the neighbor heuristics to provide a complete system for automatic identification. Next, we evaluate how the variations work.
  • #28 TA: To evaluate the effectiveness of our approach, we took a random set of 100 code segments from research articles. We then created a gold set with human annotators who annotated 745 sentences, ran our tool, and computed the overall precision and recall of the code description identification. We also computed the precision of the seed identification. We do not compute recall of the seed identification because we did not want to reveal details of our approach to the human annotators and bias them.
  • #29 TA: As part of evaluating the effectiveness, we considered several configurations for code description identification: Based on these results, we can conclude that for the overall system effectiveness we achieve the highest precision of around 68% with the minimum no. of seeds >=75%. However the recall falls to 21%. If you are interested to have a more balanced configuration, you can choose the minimum no. of seeds to be >=25% to achieve each of the precision and recall at around 50%. Note that we are most interested in higher precision than recall because we want the identified descriptions to indeed be descriptive, whereas missing some descriptions is not critical. This is the first system of this kind, so we have no comparison. We refer you to our paper to look for examples from our qualitative analysis to gain more insight on how our system works.
  • #30 TA: research articles rarely contain overly complex code examples, since they mostly describe novel ways to address a problem rather than going into the details of code complexity. We see that a wide variety of information can be gained from descriptions associated with code segments in digital libraries. The most significant information we can retrieve from research articles is “Methodology”, which describes how a piece of code is implemented. The other significant ones are the Rationale, Data Structure and the Control Flow.
  • #31 Next, we examined how authors typically reference code segments within their code description text in research articles. TA: References Figure Containing Code is the most prevalent heuristic. The next most prevalent heuristic is Neighboring Code-related Text, which helps in identifying sentences that describe less obvious details about code segments.
  • #32 TA: Finally there are a number of potential threats to validity to our evaluation and other data collection. The main threats are on this slide, we tried to mitigate them or plan to improve on how we address them in the future work of the paper.
  • #34 Beyond research papers, we investigate another less conventional resource for mining for software engineering tasks. (click) In this paper, we look at the potential for chats to be a mining resource, since developers increasingly coordinate and collaborate using different chat communities such as Slack, IRC, HipChat, Flowdock, etc. (click) Notice there are over 1 million active users per week on Slack and IRC.
  • #35 We report on an exploratory study to compare the content in Q&A-focused public chat communities (e.g., Slack) with Q&A-based discussion forums (e.g., Stack Overflow), since both resources are believed to share a similar intent of learning and sharing information among developers. We explore the availability and prevalence of information in chat messages, which provides us with the first insight into the prospect of chat communities as a source of mining. Additionally, we compared the characteristics and the different types of information present in Q&A forums vs. chat communities, to investigate the feasibility of applying automatic information extraction techniques previously developed for Q&A forums on chat communities.
  • #36 To investigate the RQs discussed before, we dig deeper to analyze: (1) characteristics of embedded code snippets, (2) characteristics of text describing the code snippets, (3) characteristics of conversations, and (4) prevalence of software-specific terms in the natural language text overall.
  • #37 Determining the granularity of data samples from Stack Overflow and Slack for comparison is challenging due to differences between the two forums. So, the goal is to get enough data for analysis based on topics that are shared between the two forums and are also popular topics.