Msr2021 tutorial-di penta

CRAFTING

YOUR NEXT MSR PAPER
SUGGESTIONS FROM MY (GOOD AND BAD)
EXPERIENCE
Massimiliano Di Pent
a

University of Sannio, Italy

MY MSR EXPERIENCE
• 19 papers published (17 full research papers
)

• Program committee member in 9 edition
s

• Program co-chair in 2012 and 201
3

• General chair in 201
5

• Steering committee member 2011-2018

GOALS OFTHISTUTORIAL
• Explain different ways for contributing to MSR
researc
h

• Go over the paper’s evaluation criteria and try to
satisfy them

NOTES
• I will refer to some exemplar paper
s

• Those are just examples, but some of them quite
representative one
s

• All are MSR-related papers, not only from the
MSR conference

ANALYSIS OF MSR REPORTING
• I’m studying this with Davide Falessi and Alexander Serebrenik
• We are interested to hear your opinion, especially if you are a senior
member of the community (SurveyHero, takes 15 min.
)

https://tinyurl.com/MiningReporting

CHAPTER I -

HOW CAN I CONTRIBUTE

TO MSR RESEARCH?

DIFFERENT WAYS FOR
CONTRIBUTINGTO MSR

METHODOLOGICAL PAPERS
Providing techniques that will hopefully help
future mining research

FIX INDUCING CHANGES
(SZZ ALGORITHM)
When Do Changes Induce Fixes?
(On Fridays.)
Jacek Śliwerski
International Max Planck Research School
Max Planck Institute for Computer Science
Saarbrücken, Germany
sliwers@mpi-sb.mpg.de
Thomas Zimmermann Andreas Zeller
Department of Computer Science
Saarland University
{tz, zeller}@acm.org
ABSTRACT
As a software system evolves, programmers make changes that
sometimes cause problems. We analyze CVS archives for fix-in-
ducing changes—changes that lead to problems, indicated by fixes.
We show how to automatically locate fix-inducing changes by link-
ing a version archive (such as CVS) to a bug database (such as
BUGZILLA). In a first investigation of the MOZILLA and ECLIPSE
history, it turns out that fix-inducing changes show distinct patterns
with respect to their size and the day of week they were applied.
Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance, and
Enhancement—corrections, version control; D.2.8 [Metrics]: Com-
plexity measures
General Terms
Management, Measurement
1. INTRODUCTION
Which change properties may lead to problems? We can inves-
tigate which properties of a change correlate with inducing
fixes, for instance, changes made on a specific day or by a
specific group of developers.
How error-prone is my product? We can assign a metric to the
product—on average, how likely is it that a change induces a
later fix?
How can I filter out problematic changes? When extracting the
architecture via co-changes from a version archive, there is
no need to consider fix-inducing changes, as they get undone
later.
Can I improve guidance along related changes? When using co-
changes to guide programmers along related changes, we
would like to avoid fix-inducing changes in our suggestions.
This paper describes our first experiences with fix-inducing chang-
es. We discuss how to extract data from version and bug archives
(Section 2), and how we link bug reports to changes (Section 3).
In Section 4, we describe how to identify and locate fix-inducing
changes. Section 5 shows the results of our investigation of the

LINKING ISSUES TO
COMMITS
“
fi
x 367920 setting pop3 Messages as junk/not junk ignored when Message
quarantining turned on sr=mscott
”

Solution: Regular expression matching e.g.

$l=~/BR (d+)/ || $l=~/fixs+(d+)/i || $l=~/PRs+(d+)/ ||  
$l=~/Bugzillas+(d+)/i || 
$l=~/Bugs+(d+)/i || $l=~/^#(d+)/i

IDENTIFYING FIX INDUCING
CHANGES
bug

fi
xing
fi
x inducing

changes
fi
x inducing

change
Affected

lines
Affected

lines
fi
le
source
code
lines
cn
^cn
before

bug

fi
xing
ci cj ck

INFRASTRUCTURE
• Setting up tools or data for other researcher
s

• Sometimes a consequence of a methodological
contribution

SRCML
An XML-Based Lightweight C++ Fact Extractor
Michael L. Collard, Huzefa H. Kagdi, Jonathan I. Maletic
Department of Computer Science
Kent State University
Kent Ohio 44242
330 672 9039
collard@cs.kent.edu, hkagdi@cs.kent.edu, jmaletic@cs.kent.edu
Abstract
A lightweight fact extractor is presented that utilizes
XML tools, such as XPath and XSLT, to extract static
information from C++ source code programs. The
source code is first converted into an XML
representation, srcML, to facilitate the use of a wide
variety of XML tools. The method is deemed lightweight
because only a partial parsing of the source is done.
Additionally, the technique is quite robust and can be
applied to incomplete and non-compile-able source code.
The trade off to this approach is that queries on some low
level details cannot be directly addressed. This approach
is applied to a fact extractor benchmark as comparison
with other, abet heavier weight, fact extractors. Fact
extractors are widely used to support understanding
tasks associated with maintenance, reverse engineering
and various other software engineering tasks.
a lightweight, robust, and tolerant C++ fact extractor.
We use the term lightweight to highlight the fact that
only lightweight parsing is done and a number of very
low-level type facts can not be directly derived from the
data source (i.e., srcML markup of the C++ source).
Our method allows the extraction of high-level entities
such as functions, classes, namespaces, and templates, as
well as middle-level entities such as individual
statements (if, while, etc.), declarations and expressions.
Lower-level entities such as variables and function calls
can also be queried. Additionally, it allows the extraction
of entities that are typically discarded during pre-
processing such as comments, pre-processor directives,
and macros. The entities are extracted with full lexical
information such as white space and all original source
code information.
The following section will address some of the
problems encountered during fact extraction and address
the related work in the field of fact extraction. We then
describe srcML and our C++ to srcML translator.

SRCML
https://www.srcml.org
• Parses source code and produces the output in XM
L

• Multi-languag
e

• Also supports transformations, lightweight slicing/
data
fl
ow analysis

PERCEVAL
Perceval: Software Project Data at Your Will
Santiago Dueñas
Bitergia
sduenas@bitergia.com
Valerio Cosentino
Bitergia
valcos@bitergia.com
Gregorio Robles
Universidad Rey Juan Carlos
grex@gsyc.urjc.es
Jesus M. Gonzalez-Barahona
Universidad Rey Juan Carlos
jgb@gsyc.urjc.es
ABSTRACT
Software development projects, in particular open source ones,
heavily rely on the use of tools to support, coordinate and promote
development activities. Despite their paramount value, they con-
tribute to fragment the project data, thus challenging practitioners
and researchers willing to derive insightful analytics about software
projects. In this demo we present Perceval, a loyal helper able to
perform automatic and incremental data gathering from almost any
tool related with contributing to open source development, among
others, source code management, issue tracking systems, mailing
lists, forums, and social media. Perceval is an industry strong free
software tool that has been widely used in Bitergia, a company
devoted to offer commercial software analytics of software projects.
It hides the technical complexities related to data acquisition and
eases the definition of analytics. A video showcasing the main
features of Perceval can be found at https://youtu.be/eH1sYF0Hdc8.
KEYWORDS
Software mining, empirical software engineering, open source soft-
However, accessing and gathering this data is often a time-
consuming and an error-prone task, that entails many considera-
tions and technical expertise [1, 12, 16]. It may require to understand
how to obtain an OAuth [11] token (e.g., StackExchange, GitHub)
or prepare storage to download the data (e.g., Git repositories, mail-
ing list archives); when dealing with development support tools
that expose their data via APIs, special attention has to be paid to
the terms of service (e.g., an excessive number of requests could
lead to temporary or permanent bans); recovery solutions to tackle
connection problems when fetching remote data should also taken
into account; storing the data already received and retrying failed
API calls may speed up the overall gathering process and reduce
the risk of corrupted data. Nonetheless, even if these problems are
known, many scholars and practitioners tend to re-invent the wheel
by retrieving the data themselves with ad-hoc scripts.
In this paper, we present Perceval, a tool that simplifies the col-
lection of project data by covering more than 20 well-known tools
and platforms related to contributing to open source development,
thus enabling the definition of software analytics. It rebuilts and
2018 ACM/IEEE 40th International Conference on Software Engineering: Companion Proceedings

PERCEVAL
https://github.com/chaoss/grimoirelab-perceval

•Gathers data from a wide number of software
repositories

•git, GitHub, issue trackers, Slack, Gerrit, Docker hub,
and many others

PYDRILLER
PyDriller: Python Framework for Mining So�ware Repositories
Davide Spadini
Delft University of Technology
Software Improvement Group
Delft, The Netherlands
d.spadini@sig.eu
Maurício Aniche
Delft, The Netherlands
m.f.aniche@tudelft.nl
Alberto Bacchelli
University of Zurich
Zurich, Switzerland
bacchelli@i�.uzh.ch
ABSTRACT
Software repositories contain historical and valuable information
about the overall development of software systems. Mining software
repositories (MSR) is nowadays considered one of the most inter-
esting growing �elds within software engineering. MSR focuses
on extracting and analyzing data available in software repositories
to uncover interesting, useful, and actionable information about
the system. Even though MSR plays an important role in software
engineering research, few tools have been created and made public
to support developers in extracting information from Git reposi-
tory. In this paper, we present P��, a Python Framework that
eases the process of mining Git. We compare our tool against the
state-of-the-art Python Framework GitPython, demonstrating that
P�� can achieve the same results with, on average, 50% less
LOC and signi�cantly lower complexity.
URL: https://github.com/ishepard/pydriller,
Materials: https://doi.org/10.5281/zenodo.1327363,
Pre-print: https://doi.org/10.5281/zenodo.1327411
CCS CONCEPTS
• Software and its engineering;
actionable insights for software engineering, such as understanding
the impact of code smells [13–15], exploring how developers are
doing code reviews [2, 4, 10, 21] and which testing practices they
follow [20], predicting classes that are more prone to change/de-
fects [3, 6, 16, 17], and identifying the core developers of a software
team to transfer knowledge [12].
Among the di�erent sources of information researchers can use,
version control systems, such as Git, are among the most used ones.
Indeed, version control systems provide researchers with precise
information about the source code, its evolution, the developers of
the software, and the commit messages (which explain the reasons
for changing).
Nevertheless, extracting information from Git repositories is
not trivial. Indeed, many frameworks can be used to interact with
Git (depending on the preferred programming language), such as
GitPython [1] for Python, or JGit for Java [8]. However, these tools
are often di�cult to use. One of the main reasons for such di�culty
is that they encapsulate all the features from Git, hence, developers
are forced to write long and complex implementations to extract
even simple data from a Git repository.
In this paper, we present P��, a Python framework that
helps developers to mine software repositories. P�� provides

PYDRILLER
https://github.com/ishepard/pydrille
r

• Python-based mining framewor
k

• Changed
fi
les, diffs, metric
s

• Watch back this morning Tutorial
 
by Mauricio Aniche and Alberto Bacchelli

PERSPECTIVE PAPERS
Provide insights on how (not to) mine certain
repositorie
s

Lessons learned, things to avoid

ON MINING GIT…
The Promises and Perils of Mining Git
Christian Bird⇤, Peter C. Rigby†, Earl T. Barr⇤, David J. Hamilton⇤, Daniel M. German†, Prem Devanbu⇤
⇤University of California, Davis, USA
†University of Victoria, Canada
{bird,barr,hamiltod,devanbu}@cs.ucdavis.edu {pcr,dmg}@cs.uvic.ca
Abstract
We are now witnessing the rapid growth of decentralized
source code management (DSCM) systems, in which every
developer has her own repository. DSCMs facilitate a style
of collaboration in which work output can flow sideways
(and privately) between collaborators, rather than always
up and down (and publicly) via a central repository. Decen-
tralization comes with both the promise of new data and the
peril of its misinterpretation. We focus on git, a very popular
DSCM used in high-profile projects. Decentralization, and
other features of git, such as automatically recorded con-
500
1000
1500
2000
2500
3000
Number
of
Projects
Subversion
Git
Bazaar
CVS
Darcs
Hg

… AND GITHUB
The Promises and Perils of Mining GitHub
Eirini Kalliamvakou
University of Victoria
ikaliam@uvic.ca
Georgios Gousios
G.Gousios@tudelft.nl
Kelly Blincoe
kblincoe@acm.org
Leif Singer
lsinger@uvic.ca
Daniel M. German⇤
dmg@uvic.ca
Daniela Damian
danielad@cs.uvic.ca
ABSTRACT
With over 10 million git repositories, GitHub is becoming
one of the most important source of software artifacts on
the Internet. Researchers are starting to mine the infor-
mation stored in GitHub’s event logs, trying to understand
how its users employ the site to collaborate on software.
However, so far there have been no studies describing the
quality and properties of the data available from GitHub.
We document the results of an empirical study aimed at un-
derstanding the characteristics of the repositories in GitHub
and how users take advantage of GitHub’s main features—
namely commits, pull requests, and issues. Our results indi-
cate that, while GitHub is a rich source of data on software
development, mining GitHub for research purposes should
take various potential perils into consideration. We show,
for example, that the majority of the projects are personal
and inactive; that GitHub is also being used for free storage
and as a Web hosting service; and that almost 40% of all pull
requests do not appear as merged, even though they were.
We provide a set of recommendations for software engineer-
ing researchers on how to approach the data in GitHub.
D.2.8 [Software Engineering]: Management—Software con-
“fork & pull” model in which developers create their own
copy of a repository and submit a pull request when they
want the project maintainer to pull their changes into the
main branch. In addition to code hosting, collaborative code
review, and integrated issue tracking, GitHub has integrated
social features. Users are able to subscribe to information by
“watching” projects and “following” users, resulting in a feed
of information on those projects and users of interest. Users
also have profiles that can be populated with identifying
information and contain their recent activity within the site.
With over 10.6 million repositories1
hosted as of January
2014, GitHub is currently the largest code hosting site in the
world. Its popularity, the integrated social features, and the
availability of metadata through an accessible api have made
GitHub very attractive for software engineering researchers.
Existing research has been both qualitative [4, 7, 16, 17, 19]
and quantitative [10, 24, 25, 26]. Qualitative studies have fo-
cused on how developers use GitHub’s social features to form
impressions and draw conclusions on other developers’ and
projects’ activity to assess success, performance, and possi-
ble collaboration opportunities. Quantitative studies have
aimed to systematically archive GitHub’s publicly available
data and use that to investigate development practices and
network structure in the GitHub environment.
As part of our research on collaboration on GitHub [15],

LOOK ATTHE FIRST MSR!
https://dblp.org/db/conf/msr/msr2004.html

ABOUT EMPIRICAL RESEARCH
Quantitative, Qualitative, or both
Observing patterns in a project
Finding correlations between variables

QUANTITATIVE STUDY
An Empirical Analysis of the
Docker Container Ecosystem on GitHub
Jürgen Cito∗, Gerald Schermann∗, John Erik Wittern†, Philipp Leitner∗, Sali Zumberi∗, Harald C. Gall∗
∗ Software Evolution and Architecture Lab
University of Zurich, Switzerland
{lastname}@ifi.uzh.ch
† IBM T. J. Watson Research Center
Yorktown Heights, NY, USA
witternj@us.ibm.com
Abstract—Docker allows packaging an application with its
dependencies into a standardized, self-contained unit (a so-called
container), which can be used for software development and to
run the application on any system. Dockerfiles are declarative
definitions of an environment that aim to enable reproducible
builds of the container. They can often be found in source code
repositories and enable the hosted software to come to life in
its execution environment. We conduct an exploratory empirical
study with the goal of characterizing the Docker ecosystem,
prevalent quality issues, and the evolution of Dockerfiles. We base
our study on a data set of over 70000 Dockerfiles, and contrast
this general population with samplings that contain the Top-100
and Top-1000 most popular Docker-using projects. We find that
most quality issues (28.6%) arise from missing version pinning
(i.e., specifying a concrete version for dependencies). Further, we
were not able to build 34% of Dockerfiles from a representative
sample of 560 projects. Integrating quality checks, e.g., to issue
version pinning warnings, into the container build process could
result into more reproducible builds. The most popular projects
change more often than the rest of the Docker population, with
5.81 revisions per year and 5 lines of code changed on average.
ity [4], we study the Docker ecosystem with respect to quality
of Dockerfiles and their change and evolution behavior within
software repositories. We developed a tool chain that trans-
forms Dockerfiles and their evolution in Git repositories into
a relational database model. We mined the entire population
of Dockerfiles on GitHub as of October 2016, and summarize
our findings on the ecosystem in general, quality aspects,
and evolution behavior. The results of our study can inform
standard bodies around containers and tool developers to
develop better support to improve quality and drive ecosystem
change.
We make the following contributions through our ex-
ploratory study:
Ecosystem Overview. We characterize the ecosystem of
Docker containers on GitHub by analyzing the distribution of
projects using Docker, broken down by primary programming
language, project size, and the base infrastructure (base image)
2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR)

QUALITATIVE STUDY

(ONE PROJECT)
Communication in Open Source Software
Development Mailing Lists
Anja Guzzi1
, Alberto Bacchelli2
, Michele Lanza2
, Martin Pinzger3
, Arie van Deursen1
1: Department of Software and Computer Technology - Delft University of Technology, The Netherlands
2: REVEAL @ Faculty of Informatics - University of Lugano, Switzerland
3: Institute for Informatics Systems - University of Klagenfurt, Austria
Abstract—Open source software (OSS) development teams use
electronic means, such as emails, instant messaging, or forums,
to conduct open and public discussions. Researchers investigated
mailing lists considering them as a hub for project communica-
tion. Prior work focused on specific aspects of emails, for example
the handling of patches, traceability concerns, or social networks.
This led to insights pertaining to the investigated aspects, but not
to a comprehensive view of what developers communicate about.
Our objective is to increase the understanding of development
mailing lists communication.
We quantitatively and qualitatively analyzed a sample of 506
email threads from the development mailing list of a major OSS
project, Lucene. Our investigation reveals that implementation
details are discussed only in about 35% of the threads, and that
a range of other topics is discussed. Moreover, core developers
participate in less than 75% of the threads. We observed that the
development mailing list is not the main player in OSS project
communication, as it also includes other channels such as the
issue repository.
I. Introduction
Open source software (OSS) development teams use elec-
tronic means, such as emails, instant messaging, or forums,
Nevertheless, there is no clear, updated, and well-rounded
picture of the communication taking place in open source
development mailing lists that supports these assumptions. In
fact, at our disposal, we only have either abstract and outdated
knowledge (e.g., obtained as a side e↵ect of the analysis of
the Linux project), which does not consider the recent shift
of interest to new social platforms (e.g., GitHub and Jira),
or a very specialized understanding (e.g., regarding specific
information, such as the process of code review [25]), which
does not take into account all the information that can be
distilled from development emails.
Our goal is to increase our understanding of development
mailing lists communication: What do participants talk about?
How much do they discuss each topic? What is the role of
the development mailing lists for OSS project communication?
Answering these questions can confirm or cast doubts on the
previous assumptions, and it can provide insights for future
research on mining developers’ communication and for building
tools to help project teams communicate e↵ectively.
To answer these questions, we conducted an in-depth analysis
of the communication taking place in the development mailing

TECHNOLOGICAL
• Those should be the ice on the cak
e

• Consequence of all previous researc
h

• Exploiting software repositories to help
developers

RECOMMENDING RELEVANT
STACKOVERFLOW DISCUSSIONS
Mining StackOverflow to Turn the IDE into a
Self-Confident Programming Prompter
Luca Ponzanelli1, Gabriele Bavota2, Massimiliano Di Penta2, Rocco Oliveto3, Michele Lanza1
1: REVEAL @ Faculty of Informatics – University of Lugano, Switzerland
2: University of Sannio, Benevento, Italy 3: University of Molise, Pesche (IS), Italy
ABSTRACT
Developers often require knowledge beyond the one they possess,
which often boils down to consulting sources of information like
Application Programming Interfaces (API) documentation, forums,
Q&A websites, etc. Knowing what to search for and how is non-
trivial, and developers spend time and energy to formulate their
problems as queries and to peruse and process the results.
We propose a novel approach that, given a context in the IDE,
automatically retrieves pertinent discussions from Stack Overflow,
evaluates their relevance, and, if a given confidence threshold is
surpassed, notifies the developer about the available help. We
have implemented our approach in Prompter, an Eclipse plug-in.
Prompter has been evaluated through two studies. The first was
aimed at evaluating the devised ranking model, while the second
was conducted to evaluate the usefulness of Prompter.
problems, the main one being the absence of automation: Every time
developers need to look for information, they interrupt their work
flow, leave the IDE, and use a Web browser to perform and refine
searches, and assess the results. Finally, they transfer the obtained
knowledge to the problem context in the IDE. The information is
retrieved from di↵erent sources, such as forums, mailing lists [2],
blogs, Q&A websites, bug trackers [1], etc. A prominent example is
Stack Overflow, popular among developers as a venue for sharing
programming knowledge. Stack Overflow is vast: In 2010 it already
had 300k users, and millions of questions, answers, and comments
[23]. This makes finding the right piece of information cumbersome
and challenging.
Recommender systems [33] represent a possible solution to this
problem. A recommender system gathers and analyzes data, iden-
tifies useful artifacts, and suggests them to the developer. Seminal

APPROACH
Search Service
Eclipse
Prompter
Query Generation
Service
Search Engines
Google
Bing
Blekko
Stack Overflow
API Service
Ranking
Model
Search Engine
Proxy
Code
Context
1
3
2
Code
Context
Query &
Triggering
Info
Query &
Code Context
4 Query
5 Results
6
Discussion
IDs
7 Documents
8
Ranked
Results

EVALUATION
NP P
20
40
60
80
100
Treatment
Completeness
•User stud
y

•Developers
performing the task
with and without
the tool

METHODOLOGICAL INFRASTRUCTURE
PERSPECTIVE
EMPIRICAL TECHNOLOGICAL

RECAP
There are different ways you can contribute to MSR
research, beyond empirical studies

CHAPTER II - HOWTO
PREVENT REJECTIONS?

ANSWER -YOU CAN’T
There is always a chance reviewers won’t like your paper

This is an opportunity
to make our work more
convincing
Don’t despair!
In the end we will thank
the reviewers

LET’STRYTO MINIMIZ
E

THE RISK…

HOW IS MY PAPER GOINGTO
BE EVALUATED?

FROM MSR 202
1

CALL FOR PAPERS
• Soundness of approac
h

• Relevance to software engineerin
g

• Clarity of relation with related wor
k

• Quality of presentatio
n

• Quality of evaluation [for long papers
]

• Ability to replicate [for long papers
]

• Novelty
https://2021.msrconf.org/track/msr-2021-technical-papers?#Call-for-Papers

RELEVANCE
Ok my paper is about software engineering,
 
so it’s
fi
ne…

QUESTIONSTO ASK
• Does the paper solve a problem relevant for any
stakeholder
?

• Does phenomenon being investigated by the study
frequently occur and impact real projects
?

• Is the achieved improvement tangible for the interested
stakeholder?

RELEVANCE: EXAMPLES OF
WEAK CONTRIBUTION
The investigated code bad smell occurs in the 1% of
the studied projects

RELEVANCE: EXAMPLES OF
WEAK CONTRIBUTION
We improve defect prediction precision 30%
precision to 40%

THAT BEING SAID…
Sometimes very small improvements pave the road
towards tangible, signi
fi
cant ones!

MSR RESEARCHER
TEMPTATION
Here’s a new dataset… let’s try to do something
with that!

PROBLEM-DRIVENVS OF
DATA-DRIVEN RESEARCH
How would my study (help to) solve a problem
developers have?

EXAMPLES OF NOVEL
CONTRIBUTIONS
• Novel approach: propose an approach improving the state-
of-the-ar
t

• New empirical results: New, possibly unexpected, empirical
evidenc
e

• Negative result: Shows that something does not work

• Replication: Con
fi
rms (in a different context) previous
results

TECHNICAL SOUNDNESS

(OFTHE MINING PROCESS)

VERSIONING MINING
Details to describe and justify
:

• History rang
e

• Branche
s

• Commit orderin
g

• On excluding merge commits

THREATSTO DISCUSS
• History can be rewritte
n

• When mining repositories, there’s little you can d
o

• At least, discuss the threats

NOT ALL BUG-RELATED ISSUES

ARE BUGS
0
150
300
450
600
Mozilla Eclipse JBoss
156
24
121
99
382
209
345
194
270
Bugs
Non bugs
Others
Giuliano Antoniol, Kamel Ayari, Massimiliano Di Penta, Foutse Khomh, Yann-Gaël
Guéhéneuc: Is it a bug or an enhancement?: a text-based approach to classify change
requests. CASCON 2008: 23

It’s not a Bug, it’s a Feature:
How Misclassification Impacts Bug Prediction
Kim Herzig
Saarland University
herzig@cs.uni-saarland.de
Sascha Just
Saarland University
just@st.cs.uni-saarland.de
Andreas Zeller
Saarland University
zeller@cs.uni-saarland.de
Abstract—In a manual examination of more than 7,000 issue
reports from the bug databases of five open-source projects,
we found 33.8% of all bug reports to be misclassified—that
is, rather than referring to a code fix, they resulted in a new
feature, an update to documentation, or an internal refactoring.
This misclassification introduces bias in bug prediction models,
confusing bugs and features: On average, 39% of files marked
as defective actually never had a bug. We estimate the impact of
this misclassification on earlier studies and recommend manual
data validation for future studies.
Index Terms—mining software repositories; bug reports; data
quality; noise; bias
I. INTRODUCTION
In empirical software engineering, it has become common-
place to mine data from change and bug databases to detect
where bugs have occurred in the past, or to predict where they
will occur in the future. The accuracy of such measurements
and predictions depends on the quality of the data. Therefore,
TABLE I
PROJECT DETAILS.
Maintainer Tracker type # reports
HTTPClient APACHE Jira 746
Jackrabbit APACHE Jira 2,402
Lucene-Java APACHE Jira 2,443
Rhino MOZILLA Bugzilla 1,226
Tomcat5 APACHE Bugzilla 584
These are the questions we address in this paper. From
five open source projects (Section II), we manually classified
more than 7,000 issue reports into a fixed set of issue report
categories clearly distinguishing the kind of maintenance work
required to resolve the task (Section III). Our findings indicate
substantial data quality issues:
Issue report classifications are unreliable. In the five bug
databases investigated, more than 40% of issue reports

THREAT: MISSING LINKS
nmbd_incomingdgrams.c: Fix bug with Syntax 5.1 servers
reported by SGI where they do host announcements to
LOCAL_MASTER_BROWSER_NAME<00 rather than
WORKGROUP<1d
Quieten level 0 debug when probing for modules.We shouldn't
display so loud an error when a smb_probe_module() fails.Also
tidy up debugs a bit. Bug 375.

MISSING LINKS
The Missing Links: Bugs and Bug-fix Commits
Adrian Bachmann1
, Christian Bird2
, Foyzur Rahman2
,
Premkumar Devanbu2
and Abraham Bernstein1
1
Department of Informatics, University of Zurich, Switzerland
2
Computer Science Department, University of California, Davis, USA
{bachmann,bernstein}@ifi.uzh.ch
{cabird,mfrahman,ptdevanbu}@ucdavis.edu
ABSTRACT
Empirical studies of software defects rely on links between
bug databases and program code repositories. This linkage
is typically based on bug-fixes identified in developer-entered
commit logs. Unfortunately, developers do not always report
which commits perform bug-fixes. Prior work suggests that
such links can be a biased sample of the entire population
of fixed bugs. The validity of statistical hypotheses-testing
based on linked data could well be affected by bias. Given
the wide use of linked defect data, it is vital to gauge the
nature and extent of the bias, and try to develop testable
theories and models of the bias. To do this, we must establish
ground truth: manually analyze a complete version history
corpus, and nail down those commits that fix defects, and
those that do not. This is a difficult task, requiring an ex-
pert to compare versions, analyze changes, find related bugs
in the bug database, reverse-engineer missing links, and fi-
1. INTRODUCTION
Software process data, especially bug reports and commit
logs, are widely used in software engineering research. The
integration of these two provides valuable information on the
history and evolution of a software project. It is used, e.g.,
to predict the number and locale of bugs in future software
releases (e.g., [27, 31, 17, 6]). The two data sources are nor-
mally integrated by scanning through the version control
log messages for potential bug report numbers; conscien-
tious developers enter this information when they check-in
bug fixes (e.g., see [14]). We used similar techniques in our
previous work, and, in fact, improved current practice by
adding heuristics to check the results [3, 4]. Even so, the
links (between program code commits and bug reports) thus
extracted cannot be guaranteed to be correct, as they are
reliant on voluntary developer annotations in commit logs.
In prior work, we have shown that such data sets are

ONTHE USE OFTOOLS
• You are not reinventing the whee
l

• The MSR community is contributing with great
tool
s

• Consider about reusing them

ISTHETOOL WORKING?
• A minimal validation to check
whether a tool correctly
work
s

• We gave up in using a popular
tool as its results were wrong

MACHINE LEARNING
RETRAINING/TUNING
Machine learning-based tools may need to be
retrained/tuned if applied in a completely different
context

EMPIRICAL EVALUATION
SOUNDNESS
• This topic would require a separate tutorial (and
there are many
)

• Suitable design, appropriate use of statistical
procedures, threats to validity discussed/
mitigated,
…

• We will focus on projects’ selection

HOW BIG? - 20YEARS AGO
The evaluation is very small… only one project is
analyzed

HOW BIG? - 10YEARS AGO
The evaluation is very small… only
fi
ve projects are
analyzed

HOW BIG? -TODAY
The evaluation is very small… only 100 projects are
analyzed

JOKE APART…
I use this argument very rarely against (and in
favor) of a paper

ONE SIZE DOES NOT FIT ALL
The size and type of the dataset depends o
n

• the goals of the pape
r

• the research method being use
d

• depth vs. breadth

CHOICE OF DATASETS
• Existing datasets: are they appropriate to your
research? Are they too obsolete
?

• Mining your own dataset: de
fi
ne a clear selection
criteria

ON PROJECTS’ SELECTION
• Toy projects, tutorial
s

• Forked project
s

• Inactive projects

STARS MAY NO
T

BETHE BESTTHING…
The Journal of Systems and Software 146 (2018) 112–129
Contents lists available at ScienceDirect
The Journal of Systems and Software
journal homepage: www.elsevier.com/locate/jss
Controversy Corner
What’s in a GitHub Star? Understanding Repository Starring Practices
in a Social Coding Platform
Hudson Borges∗
, Marco Tulio Valente
Department of Computer Science, UFMG, Brazil
a r t i c l e i n f o
Article history:
Received 4 September 2017
Revised 27 August 2018
Accepted 7 September 2018
Available online 10 September 2018
Keywords:
a b s t r a c t
Besides a git-based version control system, GitHub integrates several social coding features. Particularly,
GitHub users can star a repository, presumably to manifest interest or satisfaction with an open source
project. However, the real and practical meaning of starring a project was never the subject of an in-
depth and well-founded empirical investigation. Therefore, we provide in this paper a throughout study
on the meaning, characteristics, and dynamic growth of GitHub stars. First, by surveying 791 developers,
we report that three out of four developers consider the number of stars before using or contributing

DIVERSITY (WHEN NEEDED)
Diversity in Software Engineering Research
Meiyappan Nagappan
Software Analysis and Intelligence Lab
Queen’s University, Kingston, Canada
mei@cs.queensu.ca
Thomas Zimmermann
Microsoft Research
Redmond, WA, USA
tzimmer@microsoft.com
Christian Bird
Microsoft Research
Redmond, WA, USA
Christian.Bird@microsoft.com
ABSTRACT
One of the goals of software engineering research is to achieve gen-
erality: Are the phenomena found in a few projects reflective of
others? Will a technique perform as well on projects other than the
projects it is evaluated on? While it is common sense to select a
sample that is representative of a population, the importance of di-
versity is often overlooked, yet as important. In this paper, we com-
bine ideas from representativeness and diversity and introduce a
measure called sample coverage, defined as the percentage of pro-
jects in a population that are similar to the given sample. We intro-
duce algorithms to compute the sample coverage for a given set of
projects and to select the projects that increase the coverage the
most. We demonstrate our technique on research presented over
the span of two years at ICSE and FSE with respect to a population
of 20,000 active open source projects monitored by Ohloh.net.
Knowing the coverage of a sample enhances our ability to reason
about the findings of a study. Furthermore, we propose reporting
guidelines for research: in addition to coverage scores, papers
should discuss the target population of the research (universe) and
dimensions that potentially can influence the outcomes of a re-
search (space).
D.2.6 [Software Engineering]: Metrics
et al. [2] examined 1,000 projects. Another example is the study
by Gabel and Su that examined 6,000 projects [3]. But if care isn’t
taken when selecting which projects to analyze, then increasing the
sample size does not actually contribute to the goal of increased
generality. More is not necessarily better.
As an example, consider a researcher who wants to investigate a
hypothesis about say distributed development on a large number of
projects in an effort to demonstrate generality. The researcher goes
to the json.org website and randomly selects twenty projects, all of
them JSON parsers. Because of the narrow range of functionality
of the projects in the sample, any findings will not be very repre-
sentative; we would learn about JSON parsers, but little about other
types of software. While this is an extreme and contrived example,
it shows the importance of systematically selecting projects for em-
pirical research rather than selecting projects that are convenient.
With this paper we provide techniques to (1) assess the quality of a
sample, and to (2) identify projects that could be added to further
improve the quality of the sample.
Other fields such as medicine and sociology have published and
accepted methodological guidelines for subject selection [2] [4].
While it is common sense to select a sample that is representative
of a population, the importance of diversity is often overlooked yet
as important [5]. As stated by the Research Governance Framework

RECENTTOOL SUPPORT
https://seart-ghs.si.usi.ch

REPRODUCIBILITY
• Not just about replication package
s

• Including details in your paper, which should be
self-contained

SUPPORTINGTECHNOLOGY
Zenodo, Jupyter notebooks, Docker containers,
Virtual Machines

REPOSITORIES AREVOLATILE!
• Q&A posts get delete
d

• GitHub repositories become private, archived, or
get delete
d

• The same may happen to any content available on
the Internet

78% OF PROMPTER’S
RECOMMENDATIONS
CHANGED AFTER ONEYEAR
Empir Software Eng
DOI 10.1007/s10664-015-9397-1
Prompter
Turning the IDE into a self-confident programming assistant
Luca Ponzanelli1 · Gabriele Bavota2 ·
Massimiliano Di Penta3 · Rocco Oliveto4 ·
Michele Lanza1
© Springer Science+Business Media New York 2015
Abstract Developers often require knowledge beyond the one they possess, which boils
down to asking co-workers for help or consulting additional sources of information, such
as Application Programming Interfaces (API) documentation, forums, and Q&A websites.
However, it requires time and energy to formulate one’s problem, peruse and process the

IN CONCLUSION…
If you run a study today, this may not be reproduced
from scratch tomorrow unless having all data

PRESENTATION QUALITY
• I rarely reject a paper because of tha
t

• Not just matter of getting your paper in, but rather
to let others better understanding your work

FOLLOWING ATEMPLATE
• There are recurring templates for papers
belonging to different categorie
s

• Such templates may help the reader know where
to
fi
nd what

EMPIRICAL PAPER
• Introductio
n

• Study design (include the data extraction process
)

• Study result
s

• Threats to validit
y

• Related wor
k

• Conclusion

EMPIRICAL STUDY DESIGN
• De
fi
nitio
n

• Research Questions / Hypothese
s

• Context Selectio
n

• Data extraction methodolog
y

• Data analysis methodology

TECHNOLOGICAL PAPER
• Introductio
n

• Backgrounds (if any
)

• Approac
h

• Empirical evaluation (may be split
)

• Related Wor
k

• Conclusions

NOTE
• You do not have to stick to those template
s

• There are good reasons to avoid that

FOR EXAMPLE

YOU MAY HAVE
• First study needed to understand the proble
m

• Approach de
fi
nition (based on
fi
rst study
)

• Approach evaluation

GAME SMELL PAPER

(MSR 2020)
Detecting Video Game-Specific Bad Smells in Unity Projects
Antonio Borrelli
University of Sannio
Benevento, Italy
aborrelli@unisannio.it
Vittoria Nardone
Benevento, Italy
vnardone@unisannio.it
Giuseppe A. Di Lucca
Benevento, Italy
dilucca@unisannio.it
Gerardo Canfora
Benevento, Italy
canfora@unisannio.it
Massimiliano Di Penta
Benevento, Italy
dipenta@unisannio.it
ABSTRACT
The growth of the video game market, the large proportion of games
targeting mobile devices or streaming services, and the increasing
complexity of video games trigger the availability of video game-
speci�c tools to assess performance and maintainability problems.
This paper proposes UnityLinter, a static analysis tool that supports
Unity video game developers to detect seven types of bad smells
we have identi�ed as relevant in video game development. Such
smell types pertain to performance, maintainability and incorrect
behavior problems. After having de�ned the smells by analyzing
the existing literature and discussion forums, we have assessed
their relevance with a survey involving 68 participants. Then, we
have analyzed the occurrence of the studied smells in 100 open-
source Unity projects, and also assessed UnityLinter’s accuracy.
Results of our empirical investigation indicate that developers well-
received performance- and behavior-related issues, while some
maintainability issues are more controversial. UnityLinter is, in
general, accurate enough in detecting smells (86%-100% precision
and 50%-100% recall), and our study shows that the studied smell
types occur in 39%-97% of the analyzed projects.
1 INTRODUCTION
Video games represent a conspicuous and increasing share of the
software development market. In 2018, the video game industry
has generated 134.9 billion dollars, with over 10% increase over
2017 [25]. Such a market is changing continuously also in terms of
platforms on which video games are deployed. In the past, video
games mainly targeted consoles and desktop computers; nowadays
mobile devices account for nearly half of the market [24], and the
current trend is the streaming of video game contents.
While the video game market is increasing, development skills in
this area still represent a niche. Just to give an idea, Stack Over�ow
features over 1.5M discussions tagged [java] and 1.2M tagged An-
droid, while only 50k are about Unity3D. It is therefore clear how in
this context developers may need suitable support while creating
their video games, helping them to avoid introducing performance
bottlenecks, or making the game di�cult to maintain and evolve.
Static code analysis tools (SCAT) are a typical support developers
have while coding. Such tools, known also as “linters” (from the
�rst tool developed by Johnson for the C language [28]) analyze the
source code or the compiled (e.g., bytecode) program to highlight

YET…
One reviewer complained that research questions
weren’t all addressed in a single place

RELEASE NOTE GENERATION
(TSE 2017)
ARENA: An Approach for the Automated
Generation of Release Notes
Laura Moreno, Member, IEEE, Gabriele Bavota, Member, IEEE, Massimiliano Di Penta, Member, IEEE,
Rocco Oliveto, Member, IEEE, Andrian Marcus, Member, IEEE, and Gerardo Canfora
Abstract—Release notes document corrections, enhancements, and, in general, changes that were implemented in a new release of
a software project. They are usually created manually and may include hundreds of different items, such as descriptions of new
features, bug fixes, structural changes, new or deprecated APIs, and changes to software licenses. Thus, producing them can be a
time-consuming and daunting task. This paper describes ARENA (Automatic RElease Notes generAtor), an approach for the
automatic generation of release notes. ARENA extracts changes from the source code, summarizes them, and integrates them with
information from versioning systems and issue trackers. ARENA was designed based on the manual analysis of 990 existing release
notes. In order to evaluate the quality of the release notes automatically generated by ARENA, we performed four empirical studies
involving a total of 56 participants (48 professional developers and eight students). The obtained results indicate that the generated
release notes are very good approximations of the ones manually produced by developers and often include important information that
is missing in the manually created release notes.
Index Terms—Release notes, software documentation, software evolution
Ç
1 INTRODUCTION
RELEASE notes summarize the main changes that occurred
in a software system since its previous release, such as,
the addition of new features, bug fixes, changes to licenses
this task by generating simplified release notes (e.g., the Atlas-
sian OnDemand release note generator1
), yet such notes are lim-
ited to list closed issues that developers have manually
106 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 43, NO. 2, FEBRUARY 2017

FROM MSR 2021
CALL FOR PAPERS
• Soundness of approach
• Relevance to software engineering
• Clarity of relation with related work
• Quality of presentation
• Quality of evaluation [for long papers]
• Ability to replicate [for long papers]
• Novelty
PERSPECTIVE

FROM MSR 2021
CALL FOR PAPERS
• Novelty
PERSPECTIVE
TAKEAWAYS
Different types of contributions to MSR,
beyond studies, are highly needed
Dataset size and type depends on the study
goals and research method
Mining process must be documented and justified
in detail

FROM MSR 2021
CALL FOR PAPERS
• Novelty
PERSPECTIVE
TAKEAWAYS
Different types of contributions to MSR,
beyond studies, are highly needed
Dataset size and type depends on the study
goals and research method
Mining process must be documented and justified
in detail
dipenta@unisannio.it
@mdipenta

Msr2021 tutorial-di penta

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Msr2021 tutorial-di penta

Similar to Msr2021 tutorial-di penta (20)

More from Massimiliano Di Penta

More from Massimiliano Di Penta (9)

Recently uploaded

Recently uploaded (20)

Msr2021 tutorial-di penta