CRAFTING YOUR NEXT MSR PAPER
SUGGESTIONS FROM MY (GOOD AND BAD) EXPERIENCE
Massimiliano Di Penta
University of Sannio, Italy
MY MSR EXPERIENCE
• 19 papers published (17 full research papers)
• Program committee member in 9 editions
• Program co-chair in 2012 and 2013
• General chair in 2015
• Steering committee member 2011-2018
GOALS OF THIS TUTORIAL
• Explain different ways for contributing to MSR research
• Go over the paper’s evaluation criteria and try to satisfy them
NOTES
• I will refer to some exemplar papers
• Those are just examples, but some of them are quite representative
• All are MSR-related papers, not only from the MSR conference
ANALYSIS OF MSR REPORTING
• I’m studying this with Davide Falessi and Alexander Serebrenik
• We are interested in hearing your opinion, especially if you are a senior
member of the community (SurveyHero, takes 15 min.)
https://tinyurl.com/MiningReporting
CHAPTER I - HOW CAN I CONTRIBUTE TO MSR RESEARCH?
DIFFERENT WAYS FOR CONTRIBUTING TO MSR
METHODOLOGICAL
METHODOLOGICAL PAPERS
Providing techniques that will hopefully help
future mining research
FIX INDUCING CHANGES
(SZZ ALGORITHM)
When Do Changes Induce Fixes?
(On Fridays.)
Jacek Śliwerski
International Max Planck Research School
Max Planck Institute for Computer Science
Saarbrücken, Germany
sliwers@mpi-sb.mpg.de
Thomas Zimmermann Andreas Zeller
Department of Computer Science
Saarland University
Saarbrücken, Germany
{tz, zeller}@acm.org
ABSTRACT
As a software system evolves, programmers make changes that
sometimes cause problems. We analyze CVS archives for fix-in-
ducing changes—changes that lead to problems, indicated by fixes.
We show how to automatically locate fix-inducing changes by link-
ing a version archive (such as CVS) to a bug database (such as
BUGZILLA). In a first investigation of the MOZILLA and ECLIPSE
history, it turns out that fix-inducing changes show distinct patterns
with respect to their size and the day of week they were applied.
Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance, and
Enhancement—corrections, version control; D.2.8 [Metrics]: Com-
plexity measures
General Terms
Management, Measurement
1. INTRODUCTION
Which change properties may lead to problems? We can inves-
tigate which properties of a change correlate with inducing
fixes, for instance, changes made on a specific day or by a
specific group of developers.
How error-prone is my product? We can assign a metric to the
product—on average, how likely is it that a change induces a
later fix?
How can I filter out problematic changes? When extracting the
architecture via co-changes from a version archive, there is
no need to consider fix-inducing changes, as they get undone
later.
Can I improve guidance along related changes? When using co-
changes to guide programmers along related changes, we
would like to avoid fix-inducing changes in our suggestions.
This paper describes our first experiences with fix-inducing chang-
es. We discuss how to extract data from version and bug archives
(Section 2), and how we link bug reports to changes (Section 3).
In Section 4, we describe how to identify and locate fix-inducing
changes. Section 5 shows the results of our investigation of the
LINKING ISSUES TO
COMMITS
“fix 367920 setting pop3 Messages as junk/not junk ignored when Message
quarantining turned on sr=mscott”

Solution: Regular expression matching, e.g.

$l=~/BR (\d+)/ || $l=~/fix\s+(\d+)/i || $l=~/PR\s+(\d+)/ ||
$l=~/Bugzilla\s+(\d+)/i ||
$l=~/Bug\s+(\d+)/i || $l=~/^#(\d+)/i
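
The matching above can be sketched in Python as follows. This is only an illustration: the function name `extract_issue_ids` is hypothetical, the patterns simply mirror the ones on the slide, and any number it returns is a candidate that would still have to be checked against the issue tracker.

```python
import re

# Patterns mirroring the slide's heuristics (an assumption: real projects
# usually need project-specific conventions on top of these).
ISSUE_PATTERNS = [
    re.compile(r"BR (\d+)"),
    re.compile(r"fix\s+(\d+)", re.IGNORECASE),
    re.compile(r"PR\s+(\d+)"),
    re.compile(r"Bugzilla\s+(\d+)", re.IGNORECASE),
    re.compile(r"Bug\s+(\d+)", re.IGNORECASE),
    re.compile(r"^#(\d+)"),
]


def extract_issue_ids(message):
    """Return candidate issue IDs found in a commit message, without duplicates."""
    ids = []
    for pattern in ISSUE_PATTERNS:
        for match in pattern.finditer(message):
            if match.group(1) not in ids:
                ids.append(match.group(1))
    return ids


message = ("fix 367920 setting pop3 Messages as junk/not junk ignored "
           "when Message quarantining turned on sr=mscott")
print(extract_issue_ids(message))  # ['367920']
```

Unlike the `||` chain on the slide, which stops at the first matching pattern, this sketch collects every candidate so that ambiguous messages can be inspected manually.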
IDENTIFYING FIX INDUCING CHANGES
[Figure: lines of a source file affected by the bug-fixing change cn are traced back, starting from the revision before the bug fix (^cn), to the fix-inducing changes ci, cj, ck]
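
The core idea of the figure can be illustrated with a toy example. The sketch below assumes that blame information for the revision before the fix is already available as a plain dictionary; the function name and the data are hypothetical, and it leaves out any further filtering that a real implementation of the approach would apply.

```python
def fix_inducing_candidates(blame_before_fix, lines_changed_by_fix):
    """Return the commits that last touched the lines changed by a bug fix.

    blame_before_fix: dict mapping a line number to the commit that last
        modified that line in the revision before the fix (hypothetical input).
    lines_changed_by_fix: line numbers deleted or modified by the fix.
    """
    return {
        blame_before_fix[line]
        for line in lines_changed_by_fix
        if line in blame_before_fix
    }


# Toy history: lines 3 and 4 are the lines affected by the bug-fixing change.
blame = {1: "ci", 2: "ci", 3: "cj", 4: "ck", 5: "cj"}
candidates = fix_inducing_candidates(blame, [3, 4])
print(sorted(candidates))  # ['cj', 'ck']
```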
INFRASTRUCTURE
INFRASTRUCTURE
• Setting up tools or data for other researchers
• Sometimes a consequence of a methodological contribution
SRCML
An XML-Based Lightweight C++ Fact Extractor
Michael L. Collard, Huzefa H. Kagdi, Jonathan I. Maletic
Department of Computer Science
Kent State University
Kent Ohio 44242
330 672 9039
collard@cs.kent.edu, hkagdi@cs.kent.edu, jmaletic@cs.kent.edu
Abstract
A lightweight fact extractor is presented that utilizes
XML tools, such as XPath and XSLT, to extract static
information from C++ source code programs. The
source code is first converted into an XML
representation, srcML, to facilitate the use of a wide
variety of XML tools. The method is deemed lightweight
because only a partial parsing of the source is done.
Additionally, the technique is quite robust and can be
applied to incomplete and non-compile-able source code.
The trade off to this approach is that queries on some low
level details cannot be directly addressed. This approach
is applied to a fact extractor benchmark as comparison
with other, abet heavier weight, fact extractors. Fact
extractors are widely used to support understanding
tasks associated with maintenance, reverse engineering
and various other software engineering tasks.
a lightweight, robust, and tolerant C++ fact extractor.
We use the term lightweight to highlight the fact that
only lightweight parsing is done and a number of very
low-level type facts can not be directly derived from the
data source (i.e., srcML markup of the C++ source).
Our method allows the extraction of high-level entities
such as functions, classes, namespaces, and templates, as
well as middle-level entities such as individual
statements (if, while, etc.), declarations and expressions.
Lower-level entities such as variables and function calls
can also be queried. Additionally, it allows the extraction
of entities that are typically discarded during pre-
processing such as comments, pre-processor directives,
and macros. The entities are extracted with full lexical
information such as white space and all original source
code information.
The following section will address some of the
problems encountered during fact extraction and address
the related work in the field of fact extraction. We then
describe srcML and our C++ to srcML translator.
SRCML
https://www.srcml.org
• Parses source code and produces the output in XML
• Multi-language
• Also supports transformations, lightweight slicing/data flow analysis
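
Because the output is XML, standard XML tooling can query it. The sketch below uses Python's standard library on a hand-written, heavily simplified srcML-like fragment; the fragment and the helper name `function_names` are assumptions made for illustration, and real srcML output contains many more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Namespace used by srcML documents.
SRC_NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written, simplified fragment (an assumption, not actual tool output).
SRCML_FRAGMENT = """<unit xmlns="http://www.srcML.org/srcML/src" language="C++">
<function><type><name>int</name></type> <name>add</name><parameter_list>()</parameter_list><block>{}</block></function>
<function><type><name>void</name></type> <name>reset</name><parameter_list>()</parameter_list><block>{}</block></function>
</unit>"""


def function_names(srcml_text):
    """Return the names of the functions declared in a srcML document."""
    root = ET.fromstring(srcml_text)
    # Only <name> elements that are direct children of <function> are selected,
    # so the return type's <name> (nested in <type>) is skipped.
    return [n.text for n in root.findall(".//src:function/src:name", SRC_NS)]


print(function_names(SRCML_FRAGMENT))  # ['add', 'reset']
```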
PERCEVAL
Perceval: Software Project Data at Your Will
Santiago Dueñas
Bitergia
sduenas@bitergia.com
Valerio Cosentino
Bitergia
valcos@bitergia.com
Gregorio Robles
Universidad Rey Juan Carlos
grex@gsyc.urjc.es
Jesus M. Gonzalez-Barahona
Universidad Rey Juan Carlos
jgb@gsyc.urjc.es
ABSTRACT
Software development projects, in particular open source ones,
heavily rely on the use of tools to support, coordinate and promote
development activities. Despite their paramount value, they con-
tribute to fragment the project data, thus challenging practitioners
and researchers willing to derive insightful analytics about software
projects. In this demo we present Perceval, a loyal helper able to
perform automatic and incremental data gathering from almost any
tool related with contributing to open source development, among
others, source code management, issue tracking systems, mailing
lists, forums, and social media. Perceval is an industry strong free
software tool that has been widely used in Bitergia, a company
devoted to offer commercial software analytics of software projects.
It hides the technical complexities related to data acquisition and
eases the definition of analytics. A video showcasing the main
features of Perceval can be found at https://youtu.be/eH1sYF0Hdc8.
KEYWORDS
Software mining, empirical software engineering, open source soft-
However, accessing and gathering this data is often a time-
consuming and an error-prone task, that entails many considera-
tions and technical expertise [1, 12, 16]. It may require to understand
how to obtain an OAuth [11] token (e.g., StackExchange, GitHub)
or prepare storage to download the data (e.g., Git repositories, mail-
ing list archives); when dealing with development support tools
that expose their data via APIs, special attention has to be paid to
the terms of service (e.g., an excessive number of requests could
lead to temporary or permanent bans); recovery solutions to tackle
connection problems when fetching remote data should also taken
into account; storing the data already received and retrying failed
API calls may speed up the overall gathering process and reduce
the risk of corrupted data. Nonetheless, even if these problems are
known, many scholars and practitioners tend to re-invent the wheel
by retrieving the data themselves with ad-hoc scripts.
In this paper, we present Perceval, a tool that simplifies the col-
lection of project data by covering more than 20 well-known tools
and platforms related to contributing to open source development,
thus enabling the definition of software analytics. It rebuilts and
2018 ACM/IEEE 40th International Conference on Software Engineering: Companion Proceedings
PERCEVAL
https://github.com/chaoss/grimoirelab-perceval
• Gathers data from a wide range of software repositories
• git, GitHub, issue trackers, Slack, Gerrit, Docker hub, and many others
PYDRILLER
PyDriller: Python Framework for Mining Software Repositories
Davide Spadini
Delft University of Technology
Software Improvement Group
Delft, The Netherlands
d.spadini@sig.eu
Maurício Aniche
Delft University of Technology
Delft, The Netherlands
m.f.aniche@tudelft.nl
Alberto Bacchelli
University of Zurich
Zurich, Switzerland
bacchelli@ifi.uzh.ch
ABSTRACT
Software repositories contain historical and valuable information
about the overall development of software systems. Mining software
repositories (MSR) is nowadays considered one of the most interesting
growing fields within software engineering. MSR focuses
on extracting and analyzing data available in software repositories
to uncover interesting, useful, and actionable information about
the system. Even though MSR plays an important role in software
engineering research, few tools have been created and made public
to support developers in extracting information from Git repository.
In this paper, we present PyDriller, a Python Framework that
eases the process of mining Git. We compare our tool against the
state-of-the-art Python Framework GitPython, demonstrating that
PyDriller can achieve the same results with, on average, 50% less
LOC and significantly lower complexity.
URL: https://github.com/ishepard/pydriller,
Materials: https://doi.org/10.5281/zenodo.1327363,
Pre-print: https://doi.org/10.5281/zenodo.1327411
CCS CONCEPTS
• Software and its engineering;
actionable insights for software engineering, such as understanding
the impact of code smells [13–15], exploring how developers are
doing code reviews [2, 4, 10, 21] and which testing practices they
follow [20], predicting classes that are more prone to change/de-
fects [3, 6, 16, 17], and identifying the core developers of a software
team to transfer knowledge [12].
Among the different sources of information researchers can use,
version control systems, such as Git, are among the most used ones.
Indeed, version control systems provide researchers with precise
information about the source code, its evolution, the developers of
the software, and the commit messages (which explain the reasons
for changing).
Nevertheless, extracting information from Git repositories is
not trivial. Indeed, many frameworks can be used to interact with
Git (depending on the preferred programming language), such as
GitPython [1] for Python, or JGit for Java [8]. However, these tools
are often difficult to use. One of the main reasons for such difficulty
is that they encapsulate all the features from Git, hence, developers
are forced to write long and complex implementations to extract
even simple data from a Git repository.
In this paper, we present PyDriller, a Python framework that
helps developers to mine software repositories. PyDriller provides
PYDRILLER
https://github.com/ishepard/pydriller
• Python-based mining framework
• Changed files, diffs, metrics
• Rewatch this morning’s tutorial by Mauricio Aniche and Alberto Bacchelli
GHTORRENT
TRAVISTORRENT
SOFTWARE HERITAGE
YESTERDAY’S SESSION
PERSPECTIVE
PERSPECTIVE PAPERS
Provide insights on how (not to) mine certain repositories
Lessons learned, things to avoid
ON MINING GIT…
The Promises and Perils of Mining Git
Christian Bird∗, Peter C. Rigby†, Earl T. Barr∗, David J. Hamilton∗, Daniel M. German†, Prem Devanbu∗
∗University of California, Davis, USA
†University of Victoria, Canada
{bird,barr,hamiltod,devanbu}@cs.ucdavis.edu {pcr,dmg}@cs.uvic.ca
Abstract
We are now witnessing the rapid growth of decentralized
source code management (DSCM) systems, in which every
developer has her own repository. DSCMs facilitate a style
of collaboration in which work output can flow sideways
(and privately) between collaborators, rather than always
up and down (and publicly) via a central repository. Decen-
tralization comes with both the promise of new data and the
peril of its misinterpretation. We focus on git, a very popular
DSCM used in high-profile projects. Decentralization, and
other features of git, such as automatically recorded con-
[Figure: number of projects (y-axis, 500 to 3000) for Subversion, Git, Bazaar, CVS, Darcs, and Hg]
… AND GITHUB
The Promises and Perils of Mining GitHub
Eirini Kalliamvakou
University of Victoria
ikaliam@uvic.ca
Georgios Gousios
Delft University of Technology
G.Gousios@tudelft.nl
Kelly Blincoe
University of Victoria
kblincoe@acm.org
Leif Singer
University of Victoria
lsinger@uvic.ca
Daniel M. German∗
University of Victoria
dmg@uvic.ca
Daniela Damian
University of Victoria
danielad@cs.uvic.ca
ABSTRACT
With over 10 million git repositories, GitHub is becoming
one of the most important source of software artifacts on
the Internet. Researchers are starting to mine the infor-
mation stored in GitHub’s event logs, trying to understand
how its users employ the site to collaborate on software.
However, so far there have been no studies describing the
quality and properties of the data available from GitHub.
We document the results of an empirical study aimed at un-
derstanding the characteristics of the repositories in GitHub
and how users take advantage of GitHub’s main features—
namely commits, pull requests, and issues. Our results indi-
cate that, while GitHub is a rich source of data on software
development, mining GitHub for research purposes should
take various potential perils into consideration. We show,
for example, that the majority of the projects are personal
and inactive; that GitHub is also being used for free storage
and as a Web hosting service; and that almost 40% of all pull
requests do not appear as merged, even though they were.
We provide a set of recommendations for software engineer-
ing researchers on how to approach the data in GitHub.
Categories and Subject Descriptors
D.2.8 [Software Engineering]: Management—Software con-
“fork & pull” model in which developers create their own
copy of a repository and submit a pull request when they
want the project maintainer to pull their changes into the
main branch. In addition to code hosting, collaborative code
review, and integrated issue tracking, GitHub has integrated
social features. Users are able to subscribe to information by
“watching” projects and “following” users, resulting in a feed
of information on those projects and users of interest. Users
also have profiles that can be populated with identifying
information and contain their recent activity within the site.
With over 10.6 million repositories hosted as of January
2014, GitHub is currently the largest code hosting site in the
world. Its popularity, the integrated social features, and the
availability of metadata through an accessible api have made
GitHub very attractive for software engineering researchers.
Existing research has been both qualitative [4, 7, 16, 17, 19]
and quantitative [10, 24, 25, 26]. Qualitative studies have fo-
cused on how developers use GitHub’s social features to form
impressions and draw conclusions on other developers’ and
projects’ activity to assess success, performance, and possi-
ble collaboration opportunities. Quantitative studies have
aimed to systematically archive GitHub’s publicly available
data and use that to investigate development practices and
network structure in the GitHub environment.
As part of our research on collaboration on GitHub [15],
LOOK AT THE FIRST MSR!
https://dblp.org/db/conf/msr/msr2004.html
EMPIRICAL
ABOUT EMPIRICAL RESEARCH
Quantitative, Qualitative, or both
Observing patterns in a project
Finding correlations between variables
QUANTITATIVE STUDY
An Empirical Analysis of the
Docker Container Ecosystem on GitHub
Jürgen Cito∗, Gerald Schermann∗, John Erik Wittern†, Philipp Leitner∗, Sali Zumberi∗, Harald C. Gall∗
∗ Software Evolution and Architecture Lab
University of Zurich, Switzerland
{lastname}@ifi.uzh.ch
† IBM T. J. Watson Research Center
Yorktown Heights, NY, USA
witternj@us.ibm.com
Abstract—Docker allows packaging an application with its
dependencies into a standardized, self-contained unit (a so-called
container), which can be used for software development and to
run the application on any system. Dockerfiles are declarative
definitions of an environment that aim to enable reproducible
builds of the container. They can often be found in source code
repositories and enable the hosted software to come to life in
its execution environment. We conduct an exploratory empirical
study with the goal of characterizing the Docker ecosystem,
prevalent quality issues, and the evolution of Dockerfiles. We base
our study on a data set of over 70000 Dockerfiles, and contrast
this general population with samplings that contain the Top-100
and Top-1000 most popular Docker-using projects. We find that
most quality issues (28.6%) arise from missing version pinning
(i.e., specifying a concrete version for dependencies). Further, we
were not able to build 34% of Dockerfiles from a representative
sample of 560 projects. Integrating quality checks, e.g., to issue
version pinning warnings, into the container build process could
result into more reproducible builds. The most popular projects
change more often than the rest of the Docker population, with
5.81 revisions per year and 5 lines of code changed on average.
ity [4], we study the Docker ecosystem with respect to quality
of Dockerfiles and their change and evolution behavior within
software repositories. We developed a tool chain that trans-
forms Dockerfiles and their evolution in Git repositories into
a relational database model. We mined the entire population
of Dockerfiles on GitHub as of October 2016, and summarize
our findings on the ecosystem in general, quality aspects,
and evolution behavior. The results of our study can inform
standard bodies around containers and tool developers to
develop better support to improve quality and drive ecosystem
change.
We make the following contributions through our ex-
ploratory study:
Ecosystem Overview. We characterize the ecosystem of
Docker containers on GitHub by analyzing the distribution of
projects using Docker, broken down by primary programming
language, project size, and the base infrastructure (base image)
2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR)
QUALITATIVE STUDY (ONE PROJECT)
Communication in Open Source Software
Development Mailing Lists
Anja Guzzi1, Alberto Bacchelli2, Michele Lanza2, Martin Pinzger3, Arie van Deursen1
1: Department of Software and Computer Technology - Delft University of Technology, The Netherlands
2: REVEAL @ Faculty of Informatics - University of Lugano, Switzerland
3: Institute for Informatics Systems - University of Klagenfurt, Austria
Abstract—Open source software (OSS) development teams use
electronic means, such as emails, instant messaging, or forums,
to conduct open and public discussions. Researchers investigated
mailing lists considering them as a hub for project communica-
tion. Prior work focused on specific aspects of emails, for example
the handling of patches, traceability concerns, or social networks.
This led to insights pertaining to the investigated aspects, but not
to a comprehensive view of what developers communicate about.
Our objective is to increase the understanding of development
mailing lists communication.
We quantitatively and qualitatively analyzed a sample of 506
email threads from the development mailing list of a major OSS
project, Lucene. Our investigation reveals that implementation
details are discussed only in about 35% of the threads, and that
a range of other topics is discussed. Moreover, core developers
participate in less than 75% of the threads. We observed that the
development mailing list is not the main player in OSS project
communication, as it also includes other channels such as the
issue repository.
I. Introduction
Open source software (OSS) development teams use elec-
tronic means, such as emails, instant messaging, or forums,
Nevertheless, there is no clear, updated, and well-rounded
picture of the communication taking place in open source
development mailing lists that supports these assumptions. In
fact, at our disposal, we only have either abstract and outdated
knowledge (e.g., obtained as a side effect of the analysis of
the Linux project), which does not consider the recent shift
of interest to new social platforms (e.g., GitHub and Jira),
or a very specialized understanding (e.g., regarding specific
information, such as the process of code review [25]), which
does not take into account all the information that can be
distilled from development emails.
Our goal is to increase our understanding of development
mailing lists communication: What do participants talk about?
How much do they discuss each topic? What is the role of
the development mailing lists for OSS project communication?
Answering these questions can confirm or cast doubts on the
previous assumptions, and it can provide insights for future
research on mining developers’ communication and for building
tools to help project teams communicate effectively.
To answer these questions, we conducted an in-depth analysis
of the communication taking place in the development mailing
TECHNOLOGICAL
TECHNOLOGICAL
• These should be the icing on the cake
• A consequence of all previous research
• Exploiting software repositories to help developers
RECOMMENDING RELEVANT
STACKOVERFLOW DISCUSSIONS
Mining StackOverflow to Turn the IDE into a
Self-Confident Programming Prompter
Luca Ponzanelli1, Gabriele Bavota2, Massimiliano Di Penta2, Rocco Oliveto3, Michele Lanza1
1: REVEAL @ Faculty of Informatics – University of Lugano, Switzerland
2: University of Sannio, Benevento, Italy 3: University of Molise, Pesche (IS), Italy
ABSTRACT
Developers often require knowledge beyond the one they possess,
which often boils down to consulting sources of information like
Application Programming Interfaces (API) documentation, forums,
Q&A websites, etc. Knowing what to search for and how is non-
trivial, and developers spend time and energy to formulate their
problems as queries and to peruse and process the results.
We propose a novel approach that, given a context in the IDE,
automatically retrieves pertinent discussions from Stack Overflow,
evaluates their relevance, and, if a given confidence threshold is
surpassed, notifies the developer about the available help. We
have implemented our approach in Prompter, an Eclipse plug-in.
Prompter has been evaluated through two studies. The first was
aimed at evaluating the devised ranking model, while the second
was conducted to evaluate the usefulness of Prompter.
problems, the main one being the absence of automation: Every time
developers need to look for information, they interrupt their work
flow, leave the IDE, and use a Web browser to perform and refine
searches, and assess the results. Finally, they transfer the obtained
knowledge to the problem context in the IDE. The information is
retrieved from different sources, such as forums, mailing lists [2],
blogs, Q&A websites, bug trackers [1], etc. A prominent example is
Stack Overflow, popular among developers as a venue for sharing
programming knowledge. Stack Overflow is vast: In 2010 it already
had 300k users, and millions of questions, answers, and comments
[23]. This makes finding the right piece of information cumbersome
and challenging.
Recommender systems [33] represent a possible solution to this
problem. A recommender system gathers and analyzes data, iden-
tifies useful artifacts, and suggests them to the developer. Seminal
APPROACH
[Figure: Prompter architecture, connecting the Eclipse Prompter plug-in, the Query Generation Service, the Search Service (with its Search Engine Proxy, Stack Overflow API Service, and Ranking Model), and the search engines (Google, Bing, Blekko). Labeled flows: code context; query and triggering info; query and code context; query (4); results (5); discussion IDs (6); documents (7); ranked results (8)]
TOOL (PROMPTER)
EVALUATION
[Figure: completeness (20 to 100) by treatment, NP vs. P]
• User study
• Developers performing the task with and without the tool
METHODOLOGICAL INFRASTRUCTURE
PERSPECTIVE
EMPIRICAL TECHNOLOGICAL
RECAP
There are different ways you can contribute to MSR
research, beyond empirical studies
CHAPTER II - HOW TO PREVENT REJECTIONS?
ANSWER - YOU CAN’T
There is always a chance reviewers won’t like your paper
This is an opportunity to make our work more convincing
Don’t despair!
In the end we will thank the reviewers
LET’S TRY TO MINIMIZE THE RISK…
HOW IS MY PAPER GOING TO BE EVALUATED?
FROM MSR 2021 CALL FOR PAPERS
• Soundness of approach
• Relevance to software engineering
• Clarity of relation with related work
• Quality of presentation
• Quality of evaluation [for long papers]
• Ability to replicate [for long papers]
• Novelty
https://2021.msrconf.org/track/msr-2021-technical-papers?#Call-for-Papers
RELEVANCE
RELEVANCE
OK, my paper is about software engineering, so it’s fine…
QUESTIONS TO ASK
• Does the paper solve a problem relevant for any stakeholder?
• Does the phenomenon being investigated by the study frequently occur and impact real projects?
• Is the achieved improvement tangible for the interested stakeholder?
RELEVANCE: EXAMPLES OF WEAK CONTRIBUTION
The investigated code smell occurs in 1% of the studied projects
RELEVANCE: EXAMPLES OF WEAK CONTRIBUTION
We improve defect prediction precision from 30% to 40%
THAT BEING SAID…
Sometimes very small improvements pave the road
towards tangible, significant ones!
MSR RESEARCHER
TEMPTATION
Here’s a new dataset… let’s try to do something
with that!
PROBLEM-DRIVEN VS. DATA-DRIVEN RESEARCH
How would my study (help to) solve a problem
developers have?
NOVELTY
EXAMPLES OF NOVEL CONTRIBUTIONS
• Novel approach: propose an approach improving the state-of-the-art
• New empirical results: new, possibly unexpected, empirical evidence
• Negative result: shows that something does not work
• Replication: confirms (in a different context) previous results
TECHNICAL SOUNDNESS (OF THE MINING PROCESS)
VERSIONING MINING
Details to describe and justify:
• History range
• Branches
• Commit ordering
• On excluding merge commits
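
One way to keep these choices explicit, and easy to report in the paper, is to encode them in the command used to list commits. The sketch below only builds a `git log` command line; the function name, defaults, and example dates are hypothetical, while the options themselves are standard `git log` flags.

```python
def build_git_log_command(branch, since=None, until=None,
                          exclude_merges=True, first_parent=False):
    """Build a `git log` command that makes the mining choices explicit."""
    # Commit ordering: topological order, oldest commit first.
    cmd = ["git", "log", "--format=%H", "--topo-order", "--reverse"]
    if exclude_merges:
        cmd.append("--no-merges")      # merge commits are skipped
    if first_parent:
        cmd.append("--first-parent")   # follow only the branch's own line
    if since:
        cmd.append(f"--since={since}")  # history range, lower bound
    if until:
        cmd.append(f"--until={until}")  # history range, upper bound
    cmd.append(branch)                 # branch being mined
    return cmd


cmd = build_git_log_command("main", since="2020-01-01", until="2020-12-31")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, cwd=repo_path, capture_output=True, text=True, check=True)`, where `repo_path` is the local clone. Reporting the exact command in the paper documents all four choices at once.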
THREATS TO DISCUSS
• History can be rewritten
• When mining repositories, there’s little you can do
• At least, discuss the threats
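NOT ALL BUG-RELATED ISSUES ARE BUGS
[Figure: chart (y-axis 0 to 600) of issues classified as Bugs, Non bugs, and Others for Mozilla, Eclipse, and JBoss. Values as extracted, mapping to the chart not recoverable: 156, 24, 121, 99, 382, 209, 345, 194, 270]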
Giuliano Antoniol, Kamel Ayari, Massimiliano Di Penta, Foutse Khomh, Yann-Gaël
Guéhéneuc: Is it a bug or an enhancement?: a text-based approach to classify change
requests. CASCON 2008: 23
It’s not a Bug, it’s a Feature:
How Misclassification Impacts Bug Prediction
Kim Herzig
Saarland University
Saarbrücken, Germany
herzig@cs.uni-saarland.de
Sascha Just
Saarland University
Saarbrücken, Germany
just@st.cs.uni-saarland.de
Andreas Zeller
Saarland University
Saarbrücken, Germany
zeller@cs.uni-saarland.de
Abstract—In a manual examination of more than 7,000 issue
reports from the bug databases of five open-source projects,
we found 33.8% of all bug reports to be misclassified—that
is, rather than referring to a code fix, they resulted in a new
feature, an update to documentation, or an internal refactoring.
This misclassification introduces bias in bug prediction models,
confusing bugs and features: On average, 39% of files marked
as defective actually never had a bug. We estimate the impact of
this misclassification on earlier studies and recommend manual
data validation for future studies.
Index Terms—mining software repositories; bug reports; data
quality; noise; bias
I. INTRODUCTION
In empirical software engineering, it has become common-
place to mine data from change and bug databases to detect
where bugs have occurred in the past, or to predict where they
will occur in the future. The accuracy of such measurements
and predictions depends on the quality of the data. Therefore,
TABLE I
PROJECT DETAILS.
Project       Maintainer   Tracker type   # reports
HTTPClient    APACHE       Jira           746
Jackrabbit    APACHE       Jira           2,402
Lucene-Java   APACHE       Jira           2,443
Rhino         MOZILLA      Bugzilla       1,226
Tomcat5       APACHE       Bugzilla       584
These are the questions we address in this paper. From
five open source projects (Section II), we manually classified
more than 7,000 issue reports into a fixed set of issue report
categories clearly distinguishing the kind of maintenance work
required to resolve the task (Section III). Our findings indicate
substantial data quality issues:
Issue report classifications are unreliable. In the five bug
databases investigated, more than 40% of issue reports
THREAT: MISSING LINKS
nmbd_incomingdgrams.c: Fix bug with Syntax 5.1 servers
reported by SGI where they do host announcements to
LOCAL_MASTER_BROWSER_NAME<00 rather than
WORKGROUP<1d
Quieten level 0 debug when probing for modules. We shouldn't
display so loud an error when a smb_probe_module() fails. Also
tidy up debugs a bit. Bug 375.
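
The two commit messages above show the threat concretely: both describe bug fixes, but only the second carries an identifier that a pattern can pick up. The sketch below uses a single illustrative pattern (`BUG_ID` and `split_by_link` are hypothetical names) to separate linkable from unlinkable messages.

```python
import re

# Illustrative pattern: "bug" followed by a number (an assumption, not the
# heuristic of any specific study).
BUG_ID = re.compile(r"\bbug\s+#?(\d+)", re.IGNORECASE)

messages = [
    "nmbd_incomingdgrams.c: Fix bug with Syntax 5.1 servers reported by SGI "
    "where they do host announcements to LOCAL_MASTER_BROWSER_NAME<00 "
    "rather than WORKGROUP<1d",
    "Quieten level 0 debug when probing for modules. We shouldn't display "
    "so loud an error when a smb_probe_module() fails. Also tidy up debugs "
    "a bit. Bug 375.",
]


def split_by_link(commit_messages):
    """Split messages into (linked, unlinked) based on an explicit bug ID."""
    linked, unlinked = {}, []
    for msg in commit_messages:
        match = BUG_ID.search(msg)
        if match:
            linked[msg] = match.group(1)
        else:
            unlinked.append(msg)
    return linked, unlinked


linked, unlinked = split_by_link(messages)
print(len(linked), "linked,", len(unlinked), "missing link")
```

Reporting the share of fix commits that end up in the unlinked group gives readers a first indication of how severe the missing-links threat is for the studied project.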
MISSING LINKS
The Missing Links: Bugs and Bug-fix Commits
Adrian Bachmann1, Christian Bird2, Foyzur Rahman2,
Premkumar Devanbu2 and Abraham Bernstein1
1 Department of Informatics, University of Zurich, Switzerland
2 Computer Science Department, University of California, Davis, USA
{bachmann,bernstein}@ifi.uzh.ch
{cabird,mfrahman,ptdevanbu}@ucdavis.edu
ABSTRACT
Empirical studies of software defects rely on links between
bug databases and program code repositories. This linkage
is typically based on bug-fixes identified in developer-entered
commit logs. Unfortunately, developers do not always report
which commits perform bug-fixes. Prior work suggests that
such links can be a biased sample of the entire population
of fixed bugs. The validity of statistical hypotheses-testing
based on linked data could well be affected by bias. Given
the wide use of linked defect data, it is vital to gauge the
nature and extent of the bias, and try to develop testable
theories and models of the bias. To do this, we must establish
ground truth: manually analyze a complete version history
corpus, and nail down those commits that fix defects, and
those that do not. This is a difficult task, requiring an ex-
pert to compare versions, analyze changes, find related bugs
in the bug database, reverse-engineer missing links, and fi-
1. INTRODUCTION
Software process data, especially bug reports and commit
logs, are widely used in software engineering research. The
integration of these two provides valuable information on the
history and evolution of a software project. It is used, e.g.,
to predict the number and locale of bugs in future software
releases (e.g., [27, 31, 17, 6]). The two data sources are nor-
mally integrated by scanning through the version control
log messages for potential bug report numbers; conscien-
tious developers enter this information when they check-in
bug fixes (e.g., see [14]). We used similar techniques in our
previous work, and, in fact, improved current practice by
adding heuristics to check the results [3, 4]. Even so, the
links (between program code commits and bug reports) thus
extracted cannot be guaranteed to be correct, as they are
reliant on voluntary developer annotations in commit logs.
In prior work, we have shown that such data sets are
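The linking heuristic described in the excerpt above (scanning commit log messages for bug report numbers) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the regular expression, the function names, and the commit and bug identifiers are assumptions made for the example.

```python
import re

# Assumed convention: a keyword such as "bug", "issue" or "fixes" followed by a number.
# Real projects use different conventions, so this pattern must be adapted.
BUG_REF = re.compile(r"\b(?:bug|issue|fix(?:es|ed)?)\s*#?\s*(\d+)", re.IGNORECASE)


def extract_bug_ids(message):
    """Return the sorted bug numbers mentioned in a commit message."""
    return sorted({int(number) for number in BUG_REF.findall(message)})


def link_commits(commits, known_bug_ids):
    """Map each commit id to the mentioned bug ids that exist in the bug tracker."""
    links = {}
    for commit_id, message in commits.items():
        ids = [bug for bug in extract_bug_ids(message) if bug in known_bug_ids]
        if ids:
            links[commit_id] = ids
    return links


# Hypothetical commit ids; the messages are taken from the examples shown earlier.
commits = {
    "c1": "Quieten level 0 debug when probing for modules. Tidy up debugs a bit. Bug 375.",
    "c2": "nmbd_incomingdgrams.c: Fix bug with Syntax 5.1 servers reported by SGI",
}
links = link_commits(commits, known_bug_ids={375, 812})
print(links)
```

Only the first commit is linked. The second one is also a bug fix but carries no bug number, which is exactly the kind of missing link the paper discusses.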
ON THE USE OF TOOLS
• You are not reinventing the wheel
• The MSR community is contributing great tools
• Consider reusing them
IS THE TOOL WORKING?
• A minimal validation to check whether a tool works correctly
• We gave up using a popular tool because its results were wrong
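One possible form of such a minimal validation is to compare the tool's output with a small, manually labeled sample and compute precision and recall. The sketch below assumes that both the tool output and the manual labels are available as sets of item identifiers; the identifiers are made up for the example.

```python
def precision_recall(tool_output, manual_labels):
    """Compare the items flagged by a tool with a manually labeled sample."""
    flagged = set(tool_output)
    relevant = set(manual_labels)
    true_positives = len(flagged & relevant)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical sample: commits flagged by the tool vs. commits labeled by hand.
tool_output = {"c1", "c2", "c3", "c4"}
manual_labels = {"c2", "c3", "c4", "c5", "c6"}
precision, recall = precision_recall(tool_output, manual_labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Both sets should be restricted to the items that were actually inspected by hand, otherwise unlabeled items are silently counted as errors.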
MACHINE LEARNING
RETRAINING/TUNING
Machine learning-based tools may need to be
retrained/tuned if applied in a completely different
context
EVALUATION SOUNDNESS
EMPIRICAL EVALUATION
SOUNDNESS
• This topic would require a separate tutorial (and there are many)
• Suitable design, appropriate use of statistical procedures, threats to validity discussed/mitigated, …
• We will focus on projects’ selection
HOW BIG? - 20 YEARS AGO
The evaluation is very small… only one project is analyzed
HOW BIG? - 10 YEARS AGO
The evaluation is very small… only five projects are analyzed
HOW BIG? - TODAY
The evaluation is very small… only 100 projects are analyzed
JOKES APART…
I very rarely use this argument against (or in favor of) a paper
ONE SIZE DOES NOT FIT ALL
The size and type of the dataset depend on
• the goals of the paper
• the research method being used
• depth vs. breadth
CHOICE OF DATASETS
• Existing datasets: are they appropriate for your research? Are they obsolete?
• Mining your own dataset: define clear selection criteria
ON PROJECTS’ SELECTION
• Toy projects, tutorials
• Forked projects
• Inactive projects
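Selection criteria that exclude the kinds of projects listed above can be written down explicitly, which also makes them easy to report in the paper. The following is only a sketch: the field names and the thresholds (100 commits, one year of inactivity) are assumptions chosen for the example, not recommended values.

```python
from datetime import date

# Hypothetical thresholds: justify and report your own in the paper.
MIN_COMMITS = 100
MAX_IDLE_DAYS = 365


def select_projects(projects, today):
    """Keep projects that are not forks, not toy-sized, and still active."""
    selected = []
    for project in projects:
        if project["is_fork"]:
            continue  # forked project
        if project["commits"] < MIN_COMMITS:
            continue  # likely a toy project or a tutorial
        if (today - project["last_commit"]).days > MAX_IDLE_DAYS:
            continue  # inactive project
        selected.append(project["name"])
    return selected


# Made-up candidate projects.
projects = [
    {"name": "org/server", "is_fork": False, "commits": 5200, "last_commit": date(2021, 4, 30)},
    {"name": "me/hello-world", "is_fork": False, "commits": 3, "last_commit": date(2021, 5, 1)},
    {"name": "me/server", "is_fork": True, "commits": 5210, "last_commit": date(2021, 5, 2)},
    {"name": "org/legacy", "is_fork": False, "commits": 900, "last_commit": date(2017, 1, 15)},
]
selected = select_projects(projects, today=date(2021, 5, 17))
print(selected)
```

Passing the reference date explicitly, instead of reading the system clock, keeps the selection repeatable.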
STARS MAY NOT BE THE BEST THING…
The Journal of Systems and Software 146 (2018) 112–129
Controversy Corner
What’s in a GitHub Star? Understanding Repository Starring Practices
in a Social Coding Platform
Hudson Borges∗, Marco Tulio Valente
Department of Computer Science, UFMG, Brazil
ARTICLE INFO
Article history:
Received 4 September 2017
Revised 27 August 2018
Accepted 7 September 2018
Available online 10 September 2018
ABSTRACT
Besides a git-based version control system, GitHub integrates several social coding features. Particularly,
GitHub users can star a repository, presumably to manifest interest or satisfaction with an open source
project. However, the real and practical meaning of starring a project was never the subject of an in-
depth and well-founded empirical investigation. Therefore, we provide in this paper a throughout study
on the meaning, characteristics, and dynamic growth of GitHub stars. First, by surveying 791 developers,
we report that three out of four developers consider the number of stars before using or contributing
DIVERSITY (WHEN NEEDED)
Diversity in Software Engineering Research
Meiyappan Nagappan
Software Analysis and Intelligence Lab
Queen’s University, Kingston, Canada
mei@cs.queensu.ca
Thomas Zimmermann
Microsoft Research
Redmond, WA, USA
tzimmer@microsoft.com
Christian Bird
Microsoft Research
Redmond, WA, USA
Christian.Bird@microsoft.com
ABSTRACT
One of the goals of software engineering research is to achieve gen-
erality: Are the phenomena found in a few projects reflective of
others? Will a technique perform as well on projects other than the
projects it is evaluated on? While it is common sense to select a
sample that is representative of a population, the importance of di-
versity is often overlooked, yet as important. In this paper, we com-
bine ideas from representativeness and diversity and introduce a
measure called sample coverage, defined as the percentage of pro-
jects in a population that are similar to the given sample. We intro-
duce algorithms to compute the sample coverage for a given set of
projects and to select the projects that increase the coverage the
most. We demonstrate our technique on research presented over
the span of two years at ICSE and FSE with respect to a population
of 20,000 active open source projects monitored by Ohloh.net.
Knowing the coverage of a sample enhances our ability to reason
about the findings of a study. Furthermore, we propose reporting
guidelines for research: in addition to coverage scores, papers
should discuss the target population of the research (universe) and
dimensions that potentially can influence the outcomes of a re-
search (space).
Categories and Subject Descriptors
D.2.6 [Software Engineering]: Metrics
et al. [2] examined 1,000 projects. Another example is the study
by Gabel and Su that examined 6,000 projects [3]. But if care isn’t
taken when selecting which projects to analyze, then increasing the
sample size does not actually contribute to the goal of increased
generality. More is not necessarily better.
As an example, consider a researcher who wants to investigate a
hypothesis about say distributed development on a large number of
projects in an effort to demonstrate generality. The researcher goes
to the json.org website and randomly selects twenty projects, all of
them JSON parsers. Because of the narrow range of functionality
of the projects in the sample, any findings will not be very repre-
sentative; we would learn about JSON parsers, but little about other
types of software. While this is an extreme and contrived example,
it shows the importance of systematically selecting projects for em-
pirical research rather than selecting projects that are convenient.
With this paper we provide techniques to (1) assess the quality of a
sample, and to (2) identify projects that could be added to further
improve the quality of the sample.
Other fields such as medicine and sociology have published and
accepted methodological guidelines for subject selection [2] [4].
While it is common sense to select a sample that is representative
of a population, the importance of diversity is often overlooked yet
as important [5]. As stated by the Research Governance Framework
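The idea of sample coverage from the abstract above (the percentage of projects in a population that are similar to the given sample) can be illustrated with a simplified sketch. The similarity function used here, a relative tolerance on every numeric dimension, is an assumption made for the example and is not the definition used in the paper; the dimensions and values are also made up.

```python
def similar(a, b, tolerance=0.5):
    """Assumed similarity: on every dimension the two values differ by at most
    `tolerance` times the larger of the two."""
    return all(abs(a[d] - b[d]) <= tolerance * max(a[d], b[d]) for d in a)


def sample_coverage(sample, population, tolerance=0.5):
    """Percentage of population projects similar to at least one sampled project."""
    covered = sum(
        1 for project in population
        if any(similar(project, chosen, tolerance) for chosen in sample)
    )
    return 100.0 * covered / len(population)


# Made-up population described along two dimensions.
population = [
    {"kloc": 10, "developers": 3},
    {"kloc": 12, "developers": 4},
    {"kloc": 500, "developers": 80},
    {"kloc": 450, "developers": 60},
]
sample = [{"kloc": 11, "developers": 3}]
print(sample_coverage(sample, population))
print(sample_coverage(sample + [{"kloc": 480, "developers": 70}], population))
```

Adding one large project to the sample raises the coverage from half of the population to all of it, whereas adding a second small project would not change it. This is the sense in which more is not necessarily better.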
RECENT TOOL SUPPORT
https://seart-ghs.si.usi.ch
REPRODUCIBILITY
REPRODUCIBILITY
• Not just about replication packages
• Include details in your paper, which should be self-contained
SUPPORTING TECHNOLOGY
Zenodo, Jupyter notebooks, Docker containers,
Virtual Machines
REPOSITORIES ARE VOLATILE!
• Q&A posts get deleted
• GitHub repositories become private, archived, or get deleted
• The same may happen to any content available on the Internet
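Because mined content can disappear, one option is to archive what was downloaded together with a small manifest recording where and when each artifact was retrieved, plus a checksum. The sketch below shows one way to build such a record; the field names, the URL, and the content are assumptions for illustration only.

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot_record(url, content):
    """Describe one downloaded artifact so the archived copy can be verified later."""
    return {
        "url": url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
    }


# Hypothetical artifact: the raw bytes of a Q&A post as downloaded.
content = b"How do I parse JSON in Java?\n"
record = snapshot_record("https://example.org/questions/123", content)
manifest = json.dumps([record], indent=2)
print(manifest)
```

The manifest can then be stored next to the archived data, for example in a replication package.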
78% OF PROMPTER’S RECOMMENDATIONS
CHANGED AFTER ONE YEAR
Empir Software Eng
DOI 10.1007/s10664-015-9397-1
Prompter
Turning the IDE into a self-confident programming assistant
Luca Ponzanelli1 · Gabriele Bavota2 ·
Massimiliano Di Penta3 · Rocco Oliveto4 ·
Michele Lanza1
© Springer Science+Business Media New York 2015
Abstract Developers often require knowledge beyond the one they possess, which boils
down to asking co-workers for help or consulting additional sources of information, such
as Application Programming Interfaces (API) documentation, forums, and Q&A websites.
However, it requires time and energy to formulate one’s problem, peruse and process the
IN CONCLUSION…
If you run a study today, it may not be reproducible from scratch
tomorrow unless you have archived all the data
PRESENTATION QUALITY
PRESENTATION QUALITY
• I rarely reject a paper because of that
• It is not just a matter of getting your paper in, but rather of helping others better understand your work
FOLLOWING A TEMPLATE
• There are recurring templates for papers belonging to different categories
• Such templates may help the reader know where to find what
EMPIRICAL PAPER
• Introduction
• Study design (include the data extraction process)
• Study results
• Threats to validity
• Related work
• Conclusion
EMPIRICAL STUDY DESIGN
• Definition
• Research Questions / Hypotheses
• Context Selection
• Data extraction methodology
• Data analysis methodology
TECHNOLOGICAL PAPER
• Introduction
• Background (if any)
• Approach
• Empirical evaluation (may be split)
• Related work
• Conclusions
NOTE
• You do not have to stick to those templates
• There can be good reasons to deviate from them
FOR EXAMPLE, YOU MAY HAVE
• A first study needed to understand the problem
• Approach definition (based on the first study)
• Approach evaluation
GAME SMELL PAPER (MSR 2020)
Detecting Video Game-Specific Bad Smells in Unity Projects
Antonio Borrelli
University of Sannio
Benevento, Italy
aborrelli@unisannio.it
Vittoria Nardone
University of Sannio
Benevento, Italy
vnardone@unisannio.it
Giuseppe A. Di Lucca
University of Sannio
Benevento, Italy
dilucca@unisannio.it
Gerardo Canfora
University of Sannio
Benevento, Italy
canfora@unisannio.it
Massimiliano Di Penta
University of Sannio
Benevento, Italy
dipenta@unisannio.it
ABSTRACT
The growth of the video game market, the large proportion of games
targeting mobile devices or streaming services, and the increasing
complexity of video games trigger the availability of video game-
specific tools to assess performance and maintainability problems.
This paper proposes UnityLinter, a static analysis tool that supports
Unity video game developers to detect seven types of bad smells
we have identified as relevant in video game development. Such
smell types pertain to performance, maintainability and incorrect
behavior problems. After having defined the smells by analyzing
the existing literature and discussion forums, we have assessed
their relevance with a survey involving 68 participants. Then, we
have analyzed the occurrence of the studied smells in 100 open-
source Unity projects, and also assessed UnityLinter’s accuracy.
Results of our empirical investigation indicate that developers well-
received performance- and behavior-related issues, while some
maintainability issues are more controversial. UnityLinter is, in
general, accurate enough in detecting smells (86%-100% precision
and 50%-100% recall), and our study shows that the studied smell
types occur in 39%-97% of the analyzed projects.
1 INTRODUCTION
Video games represent a conspicuous and increasing share of the
software development market. In 2018, the video game industry
has generated 134.9 billion dollars, with over 10% increase over
2017 [25]. Such a market is changing continuously also in terms of
platforms on which video games are deployed. In the past, video
games mainly targeted consoles and desktop computers; nowadays
mobile devices account for nearly half of the market [24], and the
current trend is the streaming of video game contents.
While the video game market is increasing, development skills in
this area still represent a niche. Just to give an idea, Stack Overflow
features over 1.5M discussions tagged [java] and 1.2M tagged An-
droid, while only 50k are about Unity3D. It is therefore clear how in
this context developers may need suitable support while creating
their video games, helping them to avoid introducing performance
bottlenecks, or making the game difficult to maintain and evolve.
Static code analysis tools (SCAT) are a typical support developers
have while coding. Such tools, known also as “linters” (from the
first tool developed by Johnson for the C language [28]) analyze the
source code or the compiled (e.g., bytecode) program to highlight
YET…
One reviewer complained that research questions
weren’t all addressed in a single place
RELEASE NOTE GENERATION
(TSE 2017)
ARENA: An Approach for the Automated
Generation of Release Notes
Laura Moreno, Member, IEEE, Gabriele Bavota, Member, IEEE, Massimiliano Di Penta, Member, IEEE,
Rocco Oliveto, Member, IEEE, Andrian Marcus, Member, IEEE, and Gerardo Canfora
Abstract—Release notes document corrections, enhancements, and, in general, changes that were implemented in a new release of
a software project. They are usually created manually and may include hundreds of different items, such as descriptions of new
features, bug fixes, structural changes, new or deprecated APIs, and changes to software licenses. Thus, producing them can be a
time-consuming and daunting task. This paper describes ARENA (Automatic RElease Notes generAtor), an approach for the
automatic generation of release notes. ARENA extracts changes from the source code, summarizes them, and integrates them with
information from versioning systems and issue trackers. ARENA was designed based on the manual analysis of 990 existing release
notes. In order to evaluate the quality of the release notes automatically generated by ARENA, we performed four empirical studies
involving a total of 56 participants (48 professional developers and eight students). The obtained results indicate that the generated
release notes are very good approximations of the ones manually produced by developers and often include important information that
is missing in the manually created release notes.
Index Terms—Release notes, software documentation, software evolution
1 INTRODUCTION
RELEASE notes summarize the main changes that occurred
in a software system since its previous release, such as,
the addition of new features, bug fixes, changes to licenses
this task by generating simplified release notes (e.g., the Atlas-
sian OnDemand release note generator1
), yet such notes are lim-
ited to list closed issues that developers have manually
106 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 43, NO. 2, FEBRUARY 2017
CONCLUSION
METHODOLOGICAL INFRASTRUCTURE
PERSPECTIVE
EMPIRICAL TECHNOLOGICAL
FROM MSR 2021
CALL FOR PAPERS
• Soundness of approach
• Relevance to software engineering
• Clarity of relation with related work
• Quality of presentation
• Quality of evaluation [for long papers]
• Ability to replicate [for long papers]
• Novelty
https://2021.msrconf.org/track/msr-2021-technical-papers?#Call-for-Papers
TAKEAWAYS
Different types of contributions to MSR,
beyond studies, are highly needed
Dataset size and type depend on the study
goals and research method
The mining process must be documented and justified
in detail
dipenta@unisannio.it
@mdipenta

Dipenta msr2011-renaming
 

Recently uploaded

The work to make the piecework work: An ethnographic study of food delivery w...
The work to make the piecework work: An ethnographic study of food delivery w...The work to make the piecework work: An ethnographic study of food delivery w...
The work to make the piecework work: An ethnographic study of food delivery w...stockholm university
 
Transport in Open Pits______SM_MI10415MI
Transport in Open Pits______SM_MI10415MITransport in Open Pits______SM_MI10415MI
Transport in Open Pits______SM_MI10415MIRomil Mishra
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
full stack practical assignment msc cs.pdf
full stack practical assignment msc cs.pdffull stack practical assignment msc cs.pdf
full stack practical assignment msc cs.pdfHulkTheDevil
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...amber724300
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
Tetracrom printing process for packaging with CMYK+
Tetracrom printing process for packaging with CMYK+Tetracrom printing process for packaging with CMYK+
Tetracrom printing process for packaging with CMYK+Antonio de Llamas
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024BookNet Canada
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
A PowerPoint Presentation on Vikram Lander pptx
A PowerPoint Presentation on Vikram Lander pptxA PowerPoint Presentation on Vikram Lander pptx
A PowerPoint Presentation on Vikram Lander pptxatharvdev2010
 
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...Transcript: Green paths: Learning from publishers’ sustainability journeys - ...
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...BookNet Canada
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023Joshua Flannery
 
Dublin_mulesoft_meetup_API_specifications.pptx
Dublin_mulesoft_meetup_API_specifications.pptxDublin_mulesoft_meetup_API_specifications.pptx
Dublin_mulesoft_meetup_API_specifications.pptxKunal Gupta
 

Recently uploaded (20)

The work to make the piecework work: An ethnographic study of food delivery w...
The work to make the piecework work: An ethnographic study of food delivery w...The work to make the piecework work: An ethnographic study of food delivery w...
The work to make the piecework work: An ethnographic study of food delivery w...
 
Transport in Open Pits______SM_MI10415MI
Transport in Open Pits______SM_MI10415MITransport in Open Pits______SM_MI10415MI
Transport in Open Pits______SM_MI10415MI
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
full stack practical assignment msc cs.pdf
full stack practical assignment msc cs.pdffull stack practical assignment msc cs.pdf
full stack practical assignment msc cs.pdf
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Tetracrom printing process for packaging with CMYK+
Tetracrom printing process for packaging with CMYK+Tetracrom printing process for packaging with CMYK+
Tetracrom printing process for packaging with CMYK+
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024
Green paths: Learning from publishers’ sustainability journeys - Tech Forum 2024
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
A PowerPoint Presentation on Vikram Lander pptx
A PowerPoint Presentation on Vikram Lander pptxA PowerPoint Presentation on Vikram Lander pptx
A PowerPoint Presentation on Vikram Lander pptx
 
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...Transcript: Green paths: Learning from publishers’ sustainability journeys - ...
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
 
Dublin_mulesoft_meetup_API_specifications.pptx
Dublin_mulesoft_meetup_API_specifications.pptxDublin_mulesoft_meetup_API_specifications.pptx
Dublin_mulesoft_meetup_API_specifications.pptx
 

Msr2021 tutorial-di penta

  • 1. CRAFTING YOUR NEXT MSR PAPER SUGGESTIONS FROM MY (GOOD AND BAD) EXPERIENCE Massimiliano Di Penta University of Sannio, Italy
  • 2. MY MSR EXPERIENCE • 19 papers published (17 full research papers) • Program committee member in 9 editions • Program co-chair in 2012 and 2013 • General chair in 2015 • Steering committee member 2011-2018
  • 3. GOALS OF THIS TUTORIAL • Explain different ways for contributing to MSR research • Go over the paper’s evaluation criteria and try to satisfy them
  • 4. NOTES • I will refer to some exemplar papers • Those are just examples, but some of them quite representative ones • All are MSR-related papers, not only from the MSR conference
  • 5. ANALYSIS OF MSR REPORTING • I’m studying this with Davide Falessi and Alexander Serebrenik • We are interested to hear your opinion, especially if you are a senior member of the community (SurveyHero, takes 15 min.) https://tinyurl.com/MiningReporting
  • 6. CHAPTER I - HOW CAN I CONTRIBUTE TO MSR RESEARCH?
  • 9. METHODOLOGICAL PAPERS Providing techniques that will hopefully help future mining research
  • 10. FIX INDUCING CHANGES (SZZ ALGORITHM) When Do Changes Induce Fixes? (On Fridays.) Jacek Śliwerski International Max Planck Research School Max Planck Institute for Computer Science Saarbrücken, Germany sliwers@mpi-sb.mpg.de Thomas Zimmermann Andreas Zeller Department of Computer Science Saarland University Saarbrücken, Germany {tz, zeller}@acm.org ABSTRACT As a software system evolves, programmers make changes that sometimes cause problems. We analyze CVS archives for fix-inducing changes—changes that lead to problems, indicated by fixes. We show how to automatically locate fix-inducing changes by linking a version archive (such as CVS) to a bug database (such as BUGZILLA). In a first investigation of the MOZILLA and ECLIPSE history, it turns out that fix-inducing changes show distinct patterns with respect to their size and the day of week they were applied. Categories and Subject Descriptors D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement—corrections, version control; D.2.8 [Metrics]: Complexity measures General Terms Management, Measurement 1. INTRODUCTION Which change properties may lead to problems? We can investigate which properties of a change correlate with inducing fixes, for instance, changes made on a specific day or by a specific group of developers. How error-prone is my product? We can assign a metric to the product—on average, how likely is it that a change induces a later fix? How can I filter out problematic changes? When extracting the architecture via co-changes from a version archive, there is no need to consider fix-inducing changes, as they get undone later. Can I improve guidance along related changes? When using co-changes to guide programmers along related changes, we would like to avoid fix-inducing changes in our suggestions. This paper describes our first experiences with fix-inducing changes.
We discuss how to extract data from version and bug archives (Section 2), and how we link bug reports to changes (Section 3). In Section 4, we describe how to identify and locate fix-inducing changes. Section 5 shows the results of our investigation of the
  • 11. LINKING ISSUES TO COMMITS “fix 367920 setting pop3 Messages as junk/not junk ignored when Message quarantining turned on sr=mscott” Solution: Regular expression matching e.g. $l=~/BR (\d+)/ || $l=~/fix\s+(\d+)/i || $l=~/PR\s+(\d+)/ ||
 $l=~/Bugzilla\s+(\d+)/i ||
 $l=~/Bug\s+(\d+)/i || $l=~/^#(\d+)/i
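The matching on this slide can be sketched in Python as follows. This is only an illustration: the patterns mirror the Perl ones above, on the assumption that the `\d` and `\s` escapes were lost when the slide text was extracted, and `extract_issue_ids` is a hypothetical helper name. Unlike the Perl `||` chain, which stops at the first pattern that matches, the sketch collects every candidate number.

```python
import re

# Python counterparts of the Perl patterns on the slide (escapes assumed).
ISSUE_PATTERNS = [
    re.compile(r"BR (\d+)"),
    re.compile(r"fix\s+(\d+)", re.IGNORECASE),
    re.compile(r"PR\s+(\d+)"),
    re.compile(r"Bugzilla\s+(\d+)", re.IGNORECASE),
    re.compile(r"Bug\s+(\d+)", re.IGNORECASE),
    re.compile(r"^#(\d+)"),
]


def extract_issue_ids(message):
    """Return the distinct candidate issue ids mentioned in a commit message.

    Ids are listed in pattern order (not in order of appearance).
    """
    ids = []
    for pattern in ISSUE_PATTERNS:
        for match in pattern.finditer(message):
            issue_id = int(match.group(1))
            if issue_id not in ids:
                ids.append(issue_id)
    return ids


if __name__ == "__main__":
    msg = ("fix 367920 setting pop3 Messages as junk/not junk ignored "
           "when Message quarantining turned on sr=mscott")
    print(extract_issue_ids(msg))
```

The numbers found this way are only candidates: a number in a commit message is not necessarily an issue id, so they would normally be checked against the issue tracker before being used as links.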
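  • 12. IDENTIFYING FIX INDUCING CHANGES [Figure: source code lines of a file across changes ci, cj, ck and cn; the lines affected by the bug fixing change cn are looked up in ^cn, the revision before the bug fixing, and the earlier changes that touched them are marked as fix inducing changes.]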
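The idea in the figure can be sketched as a small function. This is a simplified illustration and not the SZZ implementation: it assumes the blame (annotate) information for the revision just before the fix has already been parsed into a dictionary, and both the function name and the input shapes are hypothetical.

```python
from collections import defaultdict


def fix_inducing_changes(affected_lines, blame_before_fix):
    """Group the lines touched by a bug fix by the change that last modified them.

    affected_lines: line numbers modified or deleted by the bug-fixing change.
    blame_before_fix: {line number: change id} for the file revision just
        before the fix (hypothetical input, e.g. parsed from annotate output).
    Returns {candidate fix-inducing change id: [affected line numbers]}.
    """
    inducing = defaultdict(list)
    for line in sorted(affected_lines):
        change = blame_before_fix.get(line)
        if change is not None:  # lines unknown to the blame data are skipped
            inducing[change].append(line)
    return dict(inducing)


if __name__ == "__main__":
    blame = {1: "ci", 2: "cj", 3: "cj", 4: "ck"}
    print(fix_inducing_changes({2, 3, 4}, blame))
```

A real pipeline would still need to filter the candidates, for example by discarding changes made after the bug was reported.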
  • 14. INFRASTRUCTURE • Setting up tools or data for other researchers • Sometimes a consequence of a methodological contribution
  • 15. SRCML An XML-Based Lightweight C++ Fact Extractor Michael L. Collard, Huzefa H. Kagdi, Jonathan I. Maletic Department of Computer Science Kent State University Kent Ohio 44242 330 672 9039 collard@cs.kent.edu, hkagdi@cs.kent.edu, jmaletic@cs.kent.edu Abstract A lightweight fact extractor is presented that utilizes XML tools, such as XPath and XSLT, to extract static information from C++ source code programs. The source code is first converted into an XML representation, srcML, to facilitate the use of a wide variety of XML tools. The method is deemed lightweight because only a partial parsing of the source is done. Additionally, the technique is quite robust and can be applied to incomplete and non-compile-able source code. The trade off to this approach is that queries on some low level details cannot be directly addressed. This approach is applied to a fact extractor benchmark as comparison with other, albeit heavier weight, fact extractors. Fact extractors are widely used to support understanding tasks associated with maintenance, reverse engineering and various other software engineering tasks. a lightweight, robust, and tolerant C++ fact extractor. We use the term lightweight to highlight the fact that only lightweight parsing is done and a number of very low-level type facts can not be directly derived from the data source (i.e., srcML markup of the C++ source). Our method allows the extraction of high-level entities such as functions, classes, namespaces, and templates, as well as middle-level entities such as individual statements (if, while, etc.), declarations and expressions. Lower-level entities such as variables and function calls can also be queried. Additionally, it allows the extraction of entities that are typically discarded during pre-processing such as comments, pre-processor directives, and macros. The entities are extracted with full lexical information such as white space and all original source code information. 
The following section will address some of the problems encountered during fact extraction and address the related work in the field of fact extraction. We then describe srcML and our C++ to srcML translator.
  • 16. SRCML https://www.srcml.org • Parses source code and produces the output in XML • Multi-language • Also supports transformations, lightweight slicing/data flow analysis
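Because srcML output is plain XML, it can be queried with any XML library. The sketch below uses only the Python standard library. The `SAMPLE` document is hand-written to resemble srcML output for a two-function C++ file (real output carries more markup), and the namespace URI and element names are given to the best of my knowledge and should be checked against an actual srcML run.

```python
import xml.etree.ElementTree as ET

# Namespace used by srcML elements (assumption to verify against real output).
SRC_NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written, simplified stand-in for srcML output; not produced by the tool.
SAMPLE = (
    '<unit xmlns="http://www.srcML.org/srcML/src" language="C++" filename="demo.cpp">'
    "<function><type><name>int</name></type> <name>add</name>"
    "<parameter_list>()</parameter_list><block>{}</block></function>\n"
    "<function><type><name>int</name></type> <name>main</name>"
    "<parameter_list>()</parameter_list><block>{}</block></function>"
    "</unit>"
)


def function_names(srcml_text):
    """Return the names of all function definitions in a srcML document."""
    root = ET.fromstring(srcml_text)
    names = []
    for function in root.iterfind(".//src:function", SRC_NS):
        # The direct <name> child is the function name; the return type's
        # name is nested inside <type>, so it is not picked up here.
        name = function.find("src:name", SRC_NS)
        if name is not None:
            names.append("".join(name.itertext()))
    return names


if __name__ == "__main__":
    print(function_names(SAMPLE))
```

The same query could be written as an XPath expression and run with srcML's own tooling; the standard-library version is used here only to keep the sketch self-contained.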
  • 17. PERCEVAL Perceval: Software Project Data at Your Will Santiago Dueñas Bitergia sduenas@bitergia.com Valerio Cosentino Bitergia valcos@bitergia.com Gregorio Robles Universidad Rey Juan Carlos grex@gsyc.urjc.es Jesus M. Gonzalez-Barahona Universidad Rey Juan Carlos jgb@gsyc.urjc.es ABSTRACT Software development projects, in particular open source ones, heavily rely on the use of tools to support, coordinate and promote development activities. Despite their paramount value, they contribute to fragment the project data, thus challenging practitioners and researchers willing to derive insightful analytics about software projects. In this demo we present Perceval, a loyal helper able to perform automatic and incremental data gathering from almost any tool related with contributing to open source development, among others, source code management, issue tracking systems, mailing lists, forums, and social media. Perceval is an industry strong free software tool that has been widely used in Bitergia, a company devoted to offer commercial software analytics of software projects. It hides the technical complexities related to data acquisition and eases the definition of analytics. A video showcasing the main features of Perceval can be found at https://youtu.be/eH1sYF0Hdc8. KEYWORDS Software mining, empirical software engineering, open source soft- However, accessing and gathering this data is often a time-consuming and an error-prone task, that entails many considerations and technical expertise [1, 12, 16]. 
It may require to understand how to obtain an OAuth [11] token (e.g., StackExchange, GitHub) or prepare storage to download the data (e.g., Git repositories, mailing list archives); when dealing with development support tools that expose their data via APIs, special attention has to be paid to the terms of service (e.g., an excessive number of requests could lead to temporary or permanent bans); recovery solutions to tackle connection problems when fetching remote data should also taken into account; storing the data already received and retrying failed API calls may speed up the overall gathering process and reduce the risk of corrupted data. Nonetheless, even if these problems are known, many scholars and practitioners tend to re-invent the wheel by retrieving the data themselves with ad-hoc scripts. In this paper, we present Perceval, a tool that simplifies the collection of project data by covering more than 20 well-known tools and platforms related to contributing to open source development, thus enabling the definition of software analytics. It rebuilts and 2018 ACM/IEEE 40th International Conference on Software Engineering: Companion Proceedings
  • 18. PERCEVAL https://github.com/chaoss/grimoirelab-perceval •Gathers data from a wide number of software repositories •git, GitHub, issue trackers, Slack, Gerrit, Docker hub, and many others
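A minimal sketch of how Perceval might be used from Python is shown below. It assumes the `perceval` package is installed and that its Git backend exposes `Git(uri=..., gitpath=...).fetch()`, as the project README describes. The `"Author"` key inside each item's `data` dictionary is an assumption to verify against the installed version, and `fetch_git_items` and `commits_per_author` are hypothetical helper names.

```python
from collections import Counter


def fetch_git_items(uri, gitpath):
    """Return an iterator over Perceval items for a Git repository.

    uri: repository URL; gitpath: local directory where Perceval keeps its clone.
    """
    # Imported lazily so the rest of the sketch runs without perceval installed.
    from perceval.backends.core.git import Git
    return Git(uri=uri, gitpath=gitpath).fetch()


def commits_per_author(items):
    """Count commits per author string in an iterable of Perceval-style items."""
    counts = Counter()
    for item in items:
        # Assumed item layout: {"data": {"Author": "Name <email>", ...}, ...}
        counts[item["data"]["Author"]] += 1
    return counts


if __name__ == "__main__":
    # Hand-made items standing in for real Perceval output.
    demo = [
        {"data": {"Author": "A <a@example.org>"}},
        {"data": {"Author": "B <b@example.org>"}},
        {"data": {"Author": "A <a@example.org>"}},
    ]
    print(commits_per_author(demo).most_common())
```

Keeping the analysis function separate from the fetching step means the same code can later be fed items from other Perceval backends, provided the field names are adjusted.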
  • 19. PYDRILLER PyDriller: Python Framework for Mining Software Repositories Davide Spadini Delft University of Technology Software Improvement Group Delft, The Netherlands d.spadini@sig.eu Maurício Aniche Delft University of Technology Delft, The Netherlands m.f.aniche@tudelft.nl Alberto Bacchelli University of Zurich Zurich, Switzerland bacchelli@ifi.uzh.ch ABSTRACT Software repositories contain historical and valuable information about the overall development of software systems. Mining software repositories (MSR) is nowadays considered one of the most interesting growing fields within software engineering. MSR focuses on extracting and analyzing data available in software repositories to uncover interesting, useful, and actionable information about the system. Even though MSR plays an important role in software engineering research, few tools have been created and made public to support developers in extracting information from Git repository. In this paper, we present PyDriller, a Python Framework that eases the process of mining Git. We compare our tool against the state-of-the-art Python Framework GitPython, demonstrating that PyDriller can achieve the same results with, on average, 50% less LOC and significantly lower complexity. URL: https://github.com/ishepard/pydriller, Materials: https://doi.org/10.5281/zenodo.1327363, Pre-print: https://doi.org/10.5281/zenodo.1327411 CCS CONCEPTS • Software and its engineering; actionable insights for software engineering, such as understanding the impact of code smells [13–15], exploring how developers are doing code reviews [2, 4, 10, 21] and which testing practices they follow [20], predicting classes that are more prone to change/defects [3, 6, 16, 17], and identifying the core developers of a software team to transfer knowledge [12]. Among the different sources of information researchers can use, version control systems, such as Git, are among the most used ones. 
Indeed, version control systems provide researchers with precise information about the source code, its evolution, the developers of the software, and the commit messages (which explain the reasons for changing). Nevertheless, extracting information from Git repositories is not trivial. Indeed, many frameworks can be used to interact with Git (depending on the preferred programming language), such as GitPython [1] for Python, or JGit for Java [8]. However, these tools are often difficult to use. One of the main reasons for such difficulty is that they encapsulate all the features from Git, hence, developers are forced to write long and complex implementations to extract even simple data from a Git repository. In this paper, we present PyDriller, a Python framework that helps developers to mine software repositories. PyDriller provides
  • 20. PYDRILLER https://github.com/ishepard/pydriller • Python-based mining framework • Changed files, diffs, metrics • Watch again this morning's tutorial by Mauricio Aniche and Alberto Bacchelli
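As a rough illustration of the "changed files, diffs, metrics" bullet, the sketch below sums added and deleted lines per file. It assumes the PyDriller 2.x API (`Repository(...).traverse_commits()`, `commit.modified_files`, and the `filename`, `added_lines` and `deleted_lines` attributes); older 1.x releases use different names. `churn_per_file` and `mine` are hypothetical helper names.

```python
from collections import Counter


def churn_per_file(commits):
    """Sum added plus deleted lines per file name over an iterable of commits.

    Each commit only needs a `modified_files` attribute whose entries expose
    `filename`, `added_lines` and `deleted_lines` (the PyDriller 2.x names).
    """
    churn = Counter()
    for commit in commits:
        for modified in commit.modified_files:
            churn[modified.filename] += modified.added_lines + modified.deleted_lines
    return churn


def mine(path_or_url):
    """Compute per-file churn for a local path or remote URL of a Git repository."""
    # Imported lazily so churn_per_file can be used without pydriller installed.
    from pydriller import Repository
    return churn_per_file(Repository(path_or_url).traverse_commits())
```

A call such as `mine("path/to/repo").most_common(10)` would then list the ten files with the highest churn; the path is a placeholder.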
  • 26. PERSPECTIVE PAPERS Provide insights on how (not to) mine certain repositories Lessons learned, things to avoid
  • 27. ON MINING GIT… The Promises and Perils of Mining Git Christian Bird∗, Peter C. Rigby†, Earl T. Barr∗, David J. Hamilton∗, Daniel M. German†, Prem Devanbu∗ ∗University of California, Davis, USA †University of Victoria, Canada {bird,barr,hamiltod,devanbu}@cs.ucdavis.edu {pcr,dmg}@cs.uvic.ca Abstract We are now witnessing the rapid growth of decentralized source code management (DSCM) systems, in which every developer has her own repository. DSCMs facilitate a style of collaboration in which work output can flow sideways (and privately) between collaborators, rather than always up and down (and publicly) via a central repository. Decentralization comes with both the promise of new data and the peril of its misinterpretation. We focus on git, a very popular DSCM used in high-profile projects. Decentralization, and other features of git, such as automatically recorded con- [Figure: Number of Projects (axis ticks 500 to 3000) for Subversion, Git, Bazaar, CVS, Darcs, Hg]
  • 28. … AND GITHUB The Promises and Perils of Mining GitHub Eirini Kalliamvakou University of Victoria ikaliam@uvic.ca Georgios Gousios Delft University of Technology G.Gousios@tudelft.nl Kelly Blincoe University of Victoria kblincoe@acm.org Leif Singer University of Victoria lsinger@uvic.ca Daniel M. German∗ University of Victoria dmg@uvic.ca Daniela Damian University of Victoria danielad@cs.uvic.ca ABSTRACT With over 10 million git repositories, GitHub is becoming one of the most important source of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub’s event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub’s main features—namely commits, pull requests, and issues. Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub. Categories and Subject Descriptors D.2.8 [Software Engineering]: Management—Software con- “fork & pull” model in which developers create their own copy of a repository and submit a pull request when they want the project maintainer to pull their changes into the main branch. In addition to code hosting, collaborative code review, and integrated issue tracking, GitHub has integrated social features. 
Users are able to subscribe to information by “watching” projects and “following” users, resulting in a feed of information on those projects and users of interest. Users also have profiles that can be populated with identifying information and contain their recent activity within the site. With over 10.6 million repositories hosted as of January 2014, GitHub is currently the largest code hosting site in the world. Its popularity, the integrated social features, and the availability of metadata through an accessible api have made GitHub very attractive for software engineering researchers. Existing research has been both qualitative [4, 7, 16, 17, 19] and quantitative [10, 24, 25, 26]. Qualitative studies have focused on how developers use GitHub’s social features to form impressions and draw conclusions on other developers’ and projects’ activity to assess success, performance, and possible collaboration opportunities. Quantitative studies have aimed to systematically archive GitHub’s publicly available data and use that to investigate development practices and network structure in the GitHub environment. As part of our research on collaboration on GitHub [15],
  • 29. LOOK AT THE FIRST MSR! https://dblp.org/db/conf/msr/msr2004.html
  • 31. ABOUT EMPIRICAL RESEARCH Quantitative, Qualitative, or both Observing patterns in a project Finding correlations between variables
  • 32. QUANTITATIVE STUDY An Empirical Analysis of the Docker Container Ecosystem on GitHub Jürgen Cito∗, Gerald Schermann∗, John Erik Wittern†, Philipp Leitner∗, Sali Zumberi∗, Harald C. Gall∗ ∗ Software Evolution and Architecture Lab University of Zurich, Switzerland {lastname}@ifi.uzh.ch † IBM T. J. Watson Research Center Yorktown Heights, NY, USA witternj@us.ibm.com Abstract—Docker allows packaging an application with its dependencies into a standardized, self-contained unit (a so-called container), which can be used for software development and to run the application on any system. Dockerfiles are declarative definitions of an environment that aim to enable reproducible builds of the container. They can often be found in source code repositories and enable the hosted software to come to life in its execution environment. We conduct an exploratory empirical study with the goal of characterizing the Docker ecosystem, prevalent quality issues, and the evolution of Dockerfiles. We base our study on a data set of over 70000 Dockerfiles, and contrast this general population with samplings that contain the Top-100 and Top-1000 most popular Docker-using projects. We find that most quality issues (28.6%) arise from missing version pinning (i.e., specifying a concrete version for dependencies). Further, we were not able to build 34% of Dockerfiles from a representative sample of 560 projects. Integrating quality checks, e.g., to issue version pinning warnings, into the container build process could result into more reproducible builds. The most popular projects change more often than the rest of the Docker population, with 5.81 revisions per year and 5 lines of code changed on average. ity [4], we study the Docker ecosystem with respect to quality of Dockerfiles and their change and evolution behavior within software repositories. We developed a tool chain that transforms Dockerfiles and their evolution in Git repositories into a relational database model. 
We mined the entire population of Dockerfiles on GitHub as of October 2016, and summarize our findings on the ecosystem in general, quality aspects, and evolution behavior. The results of our study can inform standard bodies around containers and tool developers to develop better support to improve quality and drive ecosystem change. We make the following contributions through our exploratory study: Ecosystem Overview. We characterize the ecosystem of Docker containers on GitHub by analyzing the distribution of projects using Docker, broken down by primary programming language, project size, and the base infrastructure (base image) 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR)
  • 33. QUALITATIVE STUDY (ONE PROJECT) Communication in Open Source Software Development Mailing Lists Anja Guzzi1 , Alberto Bacchelli2 , Michele Lanza2 , Martin Pinzger3 , Arie van Deursen1 1: Department of Software and Computer Technology - Delft University of Technology, The Netherlands 2: REVEAL @ Faculty of Informatics - University of Lugano, Switzerland 3: Institute for Informatics Systems - University of Klagenfurt, Austria Abstract—Open source software (OSS) development teams use electronic means, such as emails, instant messaging, or forums, to conduct open and public discussions. Researchers investigated mailing lists considering them as a hub for project communica- tion. Prior work focused on specific aspects of emails, for example the handling of patches, traceability concerns, or social networks. This led to insights pertaining to the investigated aspects, but not to a comprehensive view of what developers communicate about. Our objective is to increase the understanding of development mailing lists communication. We quantitatively and qualitatively analyzed a sample of 506 email threads from the development mailing list of a major OSS project, Lucene. Our investigation reveals that implementation details are discussed only in about 35% of the threads, and that a range of other topics is discussed. Moreover, core developers participate in less than 75% of the threads. We observed that the development mailing list is not the main player in OSS project communication, as it also includes other channels such as the issue repository. I. Introduction Open source software (OSS) development teams use elec- tronic means, such as emails, instant messaging, or forums, Nevertheless, there is no clear, updated, and well-rounded picture of the communication taking place in open source development mailing lists that supports these assumptions. 
In fact, at our disposal, we only have either abstract and outdated knowledge (e.g., obtained as a side effect of the analysis of the Linux project), which does not consider the recent shift of interest to new social platforms (e.g., GitHub and Jira), or a very specialized understanding (e.g., regarding specific information, such as the process of code review [25]), which does not take into account all the information that can be distilled from development emails. Our goal is to increase our understanding of development mailing lists communication: What do participants talk about? How much do they discuss each topic? What is the role of the development mailing lists for OSS project communication? Answering these questions can confirm or cast doubts on the previous assumptions, and it can provide insights for future research on mining developers’ communication and for building tools to help project teams communicate effectively. To answer these questions, we conducted an in-depth analysis of the communication taking place in the development mailing
  • 35. TECHNOLOGICAL • Those should be the icing on the cake • Consequence of all previous research • Exploiting software repositories to help developers
  • 36. RECOMMENDING RELEVANT STACK OVERFLOW DISCUSSIONS Mining StackOverflow to Turn the IDE into a Self-Confident Programming Prompter Luca Ponzanelli1, Gabriele Bavota2, Massimiliano Di Penta2, Rocco Oliveto3, Michele Lanza1 1: REVEAL @ Faculty of Informatics – University of Lugano, Switzerland 2: University of Sannio, Benevento, Italy 3: University of Molise, Pesche (IS), Italy ABSTRACT Developers often require knowledge beyond the one they possess, which often boils down to consulting sources of information like Application Programming Interfaces (API) documentation, forums, Q&A websites, etc. Knowing what to search for and how is non-trivial, and developers spend time and energy to formulate their problems as queries and to peruse and process the results. We propose a novel approach that, given a context in the IDE, automatically retrieves pertinent discussions from Stack Overflow, evaluates their relevance, and, if a given confidence threshold is surpassed, notifies the developer about the available help. We have implemented our approach in Prompter, an Eclipse plug-in. Prompter has been evaluated through two studies. The first was aimed at evaluating the devised ranking model, while the second was conducted to evaluate the usefulness of Prompter. problems, the main one being the absence of automation: Every time developers need to look for information, they interrupt their work flow, leave the IDE, and use a Web browser to perform and refine searches, and assess the results. Finally, they transfer the obtained knowledge to the problem context in the IDE. The information is retrieved from different sources, such as forums, mailing lists [2], blogs, Q&A websites, bug trackers [1], etc. A prominent example is Stack Overflow, popular among developers as a venue for sharing programming knowledge. Stack Overflow is vast: In 2010 it already had 300k users, and millions of questions, answers, and comments [23]. 
This makes finding the right piece of information cumbersome and challenging. Recommender systems [33] represent a possible solution to this problem. A recommender system gathers and analyzes data, identifies useful artifacts, and suggests them to the developer. Seminal
  • 37. APPROACH [Figure: Prompter architecture. Components: Eclipse Prompter, Query Generation Service, Search Service, Search Engine Proxy, Search Engines (Google, Bing, Blekko), Stack Overflow API Service, Ranking Model. Labeled flows: Code Context; Query & Triggering Info; Query & Code Context; Query; Results; Discussion IDs; Documents; Ranked Results.]
  • 41. RECAP There are different ways you can contribute to MSR research, beyond empirical studies
  • 42. CHAPTER II - HOW TO PREVENT REJECTIONS?
  • 43. ANSWER - YOU CAN’T There is always a chance reviewers won’t like your paper
  • 44. This is an opportunity to make our work more convincing Don’t despair! In the end we will thank the reviewers
  • 46. HOW IS MY PAPER GOING TO BE EVALUATED?
  • 47. FROM MSR 2021 CALL FOR PAPERS • Soundness of approach • Relevance to software engineering • Clarity of relation with related work • Quality of presentation • Quality of evaluation [for long papers] • Ability to replicate [for long papers] • Novelty https://2021.msrconf.org/track/msr-2021-technical-papers?#Call-for-Papers
  • 49. RELEVANCE Ok, my paper is about software engineering, so it’s fine…
  • 50. QUESTIONS TO ASK • Does the paper solve a problem relevant for any stakeholder? • Does the phenomenon being investigated by the study frequently occur and impact real projects? • Is the achieved improvement tangible for the interested stakeholder?
  • 51. RELEVANCE: EXAMPLES OF WEAK CONTRIBUTION The investigated code bad smell occurs in only 1% of the studied projects
  • 52. RELEVANCE: EXAMPLES OF WEAK CONTRIBUTION We improve defect prediction precision from 30% to 40%
  • 53. THAT BEING SAID… Sometimes very small improvements pave the road towards tangible, significant ones!
  • 54. MSR RESEARCHER TEMPTATION Here’s a new dataset… let’s try to do something with that!
  • 55. PROBLEM-DRIVEN VS. DATA-DRIVEN RESEARCH How would my study (help to) solve a problem developers have?
  • 57. EXAMPLES OF NOVEL CONTRIBUTIONS • Novel approach: propose an approach improving the state-of-the-art • New empirical results: New, possibly unexpected, empirical evidence • Negative result: Shows that something does not work • Replication: Confirms (in a different context) previous results
  • 58. TECHNICAL SOUNDNESS (OF THE MINING PROCESS)
  • 59. VERSIONING MINING Details to describe and justify: • History range • Branches • Commit ordering • Whether merge commits are excluded
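One way to keep these choices visible is to encode them in the mining script, so that the paper can report exactly what was run. The sketch below only builds a `git log` command line; the function name `build_log_command`, the branch name, and the dates are illustrative assumptions, not something taken from the slides.

```python
from typing import List, Optional


def build_log_command(
    branch: str,
    since: Optional[str] = None,
    until: Optional[str] = None,
    exclude_merges: bool = True,
    first_parent: bool = False,
) -> List[str]:
    """Build a `git log` invocation that states every mining choice explicitly."""
    # Commit ordering: topological order, oldest commit first.
    cmd = ["git", "log", "--reverse", "--topo-order", "--format=%H %cI"]
    if exclude_merges:
        cmd.append("--no-merges")      # merge commits are skipped
    if first_parent:
        cmd.append("--first-parent")   # follow only the mainline of the branch
    if since:
        cmd.append(f"--since={since}")  # start of the history range
    if until:
        cmd.append(f"--until={until}")  # end of the history range
    cmd.append(branch)                 # the branch being mined
    return cmd


if __name__ == "__main__":
    # Hypothetical settings, shown only to illustrate the reporting idea.
    print(" ".join(build_log_command("main", since="2015-01-01", until="2020-12-31")))
```

Printing the resulting command into the replication package is a cheap way to document range, branch, ordering, and merge handling in one place.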
  • 60. THREATS TO DISCUSS • History can be rewritten • When mining repositories, there’s little you can do • At least, discuss the threats
  • 61. NOT ALL BUG-RELATED ISSUES ARE BUGS [Figure: bar chart of issue counts classified as Bugs, Non bugs, and Others for Mozilla, Eclipse, and JBoss; vertical axis from 0 to 600.] Giuliano Antoniol, Kamel Ayari, Massimiliano Di Penta, Foutse Khomh, Yann-Gaël Guéhéneuc: Is it a bug or an enhancement?: a text-based approach to classify change requests. CASCON 2008: 23
  • 62. It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction Kim Herzig Saarland University Saarbrücken, Germany herzig@cs.uni-saarland.de Sascha Just Saarland University Saarbrücken, Germany just@st.cs.uni-saarland.de Andreas Zeller Saarland University Saarbrücken, Germany zeller@cs.uni-saarland.de Abstract: In a manual examination of more than 7,000 issue reports from the bug databases of five open-source projects, we found 33.8% of all bug reports to be misclassified, that is, rather than referring to a code fix, they resulted in a new feature, an update to documentation, or an internal refactoring. This misclassification introduces bias in bug prediction models, confusing bugs and features: On average, 39% of files marked as defective actually never had a bug. We estimate the impact of this misclassification on earlier studies and recommend manual data validation for future studies. Index Terms: mining software repositories; bug reports; data quality; noise; bias I. INTRODUCTION In empirical software engineering, it has become commonplace to mine data from change and bug databases to detect where bugs have occurred in the past, or to predict where they will occur in the future. The accuracy of such measurements and predictions depends on the quality of the data. Therefore,
TABLE I. PROJECT DETAILS.
Project     | Maintainer | Tracker type | # reports
HTTPClient  | APACHE     | Jira         | 746
Jackrabbit  | APACHE     | Jira         | 2,402
Lucene-Java | APACHE     | Jira         | 2,443
Rhino       | MOZILLA    | Bugzilla     | 1,226
Tomcat5     | APACHE     | Bugzilla     | 584
These are the questions we address in this paper. From five open source projects (Section II), we manually classified more than 7,000 issue reports into a fixed set of issue report categories clearly distinguishing the kind of maintenance work required to resolve the task (Section III). Our findings indicate substantial data quality issues: Issue report classifications are unreliable. In the five bug databases investigated, more than 40% of issue reports
  • 63. THREAT: MISSING LINKS nmbd_incomingdgrams.c: Fix bug with Syntax 5.1 servers reported by SGI where they do host announcements to LOCAL_MASTER_BROWSER_NAME<00 rather than WORKGROUP<1d Quieten level 0 debug when probing for modules. We shouldn't display so loud an error when a smb_probe_module() fails. Also tidy up debugs a bit. Bug 375.
  • 64. MISSING LINKS The Missing Links: Bugs and Bug-fix Commits Adrian Bachmann1, Christian Bird2, Foyzur Rahman2, Premkumar Devanbu2 and Abraham Bernstein1 1 Department of Informatics, University of Zurich, Switzerland 2 Computer Science Department, University of California, Davis, USA {bachmann,bernstein}@ifi.uzh.ch {cabird,mfrahman,ptdevanbu}@ucdavis.edu ABSTRACT Empirical studies of software defects rely on links between bug databases and program code repositories. This linkage is typically based on bug-fixes identified in developer-entered commit logs. Unfortunately, developers do not always report which commits perform bug-fixes. Prior work suggests that such links can be a biased sample of the entire population of fixed bugs. The validity of statistical hypotheses-testing based on linked data could well be affected by bias. Given the wide use of linked defect data, it is vital to gauge the nature and extent of the bias, and try to develop testable theories and models of the bias. To do this, we must establish ground truth: manually analyze a complete version history corpus, and nail down those commits that fix defects, and those that do not. This is a difficult task, requiring an expert to compare versions, analyze changes, find related bugs in the bug database, reverse-engineer missing links, and fi- 1. INTRODUCTION Software process data, especially bug reports and commit logs, are widely used in software engineering research. The integration of these two provides valuable information on the history and evolution of a software project. It is used, e.g., to predict the number and locale of bugs in future software releases (e.g., [27, 31, 17, 6]). The two data sources are normally integrated by scanning through the version control log messages for potential bug report numbers; conscientious developers enter this information when they check-in bug fixes (e.g., see [14]). 
We used similar techniques in our previous work, and, in fact, improved current practice by adding heuristics to check the results [3, 4]. Even so, the links (between program code commits and bug reports) thus extracted cannot be guaranteed to be correct, as they are reliant on voluntary developer annotations in commit logs. In prior work, we have shown that such data sets are
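The log-scanning heuristic described above can be sketched in a few lines. This is a minimal illustration, not the heuristic used in the cited papers: the regular expression `BUG_REF` and the function `linked_bug_ids` are assumptions, and the two messages are shortened from the commit messages quoted on the previous slide. The first message fixes a bug but names no report, so no link is recovered.

```python
import re
from typing import List

# Assumed pattern: the standalone word "bug", an optional '#', then a number.
BUG_REF = re.compile(r"\bbug\s*#?\s*(\d+)", re.IGNORECASE)


def linked_bug_ids(message: str) -> List[int]:
    """Return the bug report numbers explicitly mentioned in a commit message."""
    return [int(number) for number in BUG_REF.findall(message)]


messages = [
    "nmbd_incomingdgrams.c: Fix bug with Syntax 5.1 servers reported by SGI",
    "Quieten level 0 debug when probing for modules. Bug 375.",
]

if __name__ == "__main__":
    for msg in messages:
        ids = linked_bug_ids(msg)
        print(ids if ids else "no link found", "<-", msg)
```

Commits like the first one are exactly the missing links: they are invisible to any study that relies only on such annotations, which is why the threat deserves an explicit discussion.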
  • 65. ON THE USE OF TOOLS • You do not need to reinvent the wheel • The MSR community is contributing great tools • Consider reusing them
  • 66. IS THE TOOL WORKING? • Run a minimal validation to check whether a tool works correctly • We gave up using a popular tool because its results were wrong
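A minimal validation can be as simple as comparing the tool's output with a small, manually labeled sample. The sketch below assumes the tool reports a set of item identifiers (for example, commits flagged as bug fixes); the function name and the sample data are hypothetical.

```python
from typing import Set, Tuple


def precision_recall(reported: Set[str], expected: Set[str]) -> Tuple[float, float]:
    """Compare tool output with a manually labeled sample."""
    true_positives = len(reported & expected)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical sample: identifiers flagged by the tool vs. by a human rater.
tool_output = {"c1", "c2", "c3", "c4"}
manual_labels = {"c1", "c2", "c5"}
p, r = precision_recall(tool_output, manual_labels)

if __name__ == "__main__":
    print(f"precision={p:.2f} recall={r:.2f}")
```

Reporting these two numbers, together with how the sample was drawn, is usually enough to show reviewers that the tool was checked in the study's own context.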
  • 67. MACHINE LEARNING RETRAINING/TUNING Machine learning-based tools may need to be retrained/tuned if applied in a completely different context
  • 69. EMPIRICAL EVALUATION SOUNDNESS • This topic would require a separate tutorial (and there are many) • Suitable design, appropriate use of statistical procedures, threats to validity discussed/mitigated, … • We will focus on projects’ selection
  • 70. HOW BIG? - 20 YEARS AGO The evaluation is very small… only one project is analyzed
  • 71. HOW BIG? - 10 YEARS AGO The evaluation is very small… only five projects are analyzed
  • 72. HOW BIG? - TODAY The evaluation is very small… only 100 projects are analyzed
  • 73. JOKES APART… I very rarely use this argument against (or in favor of) a paper
  • 74. ONE SIZE DOES NOT FIT ALL The size and type of the dataset depend on • the goals of the paper • the research method being used • depth vs. breadth
  • 75. CHOICE OF DATASETS • Existing datasets: are they appropriate to your research? Are they too obsolete? • Mining your own dataset: define clear selection criteria
  • 76. ON PROJECTS’ SELECTION Projects to watch out for: • Toy projects, tutorials • Forked projects • Inactive projects
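Selection criteria are easier to report and replicate when they are written as an explicit filter. The sketch below is one possible encoding; the `Project` fields, the commit threshold used as a proxy for toy projects, and the activity cutoff are all assumptions to be replaced by criteria that fit the study.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Project:
    name: str
    is_fork: bool
    commits: int
    last_commit: date


def select_projects(
    projects: List[Project], min_commits: int, active_since: date
) -> List[Project]:
    """Keep projects that are not forks, not tiny, and recently active."""
    return [
        p
        for p in projects
        if not p.is_fork                    # drop forked projects
        and p.commits >= min_commits        # rough proxy for toy projects, tutorials
        and p.last_commit >= active_since   # drop inactive projects
    ]


# Hypothetical candidates, for illustration only.
candidates = [
    Project("real-app", False, 1200, date(2021, 3, 1)),
    Project("hello-world", False, 4, date(2021, 2, 1)),
    Project("real-app-fork", True, 1200, date(2021, 3, 1)),
    Project("abandoned-lib", False, 900, date(2016, 5, 1)),
]
selected = select_projects(candidates, min_commits=100, active_since=date(2020, 1, 1))

if __name__ == "__main__":
    print([p.name for p in selected])
```

Whatever thresholds are chosen, stating them in the paper (and why) matters more than their exact values.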
  • 77. STARS MAY NOT BE THE BEST THING… The Journal of Systems and Software 146 (2018) 112–129 Controversy Corner What’s in a GitHub Star? Understanding Repository Starring Practices in a Social Coding Platform Hudson Borges, Marco Tulio Valente Department of Computer Science, UFMG, Brazil Article history: Received 4 September 2017 Revised 27 August 2018 Accepted 7 September 2018 Available online 10 September 2018 Abstract: Besides a git-based version control system, GitHub integrates several social coding features. Particularly, GitHub users can star a repository, presumably to manifest interest or satisfaction with an open source project. However, the real and practical meaning of starring a project was never the subject of an in-depth and well-founded empirical investigation. Therefore, we provide in this paper a throughout study on the meaning, characteristics, and dynamic growth of GitHub stars. First, by surveying 791 developers, we report that three out of four developers consider the number of stars before using or contributing
  • 78. DIVERSITY (WHEN NEEDED) Diversity in Software Engineering Research Meiyappan Nagappan Software Analysis and Intelligence Lab Queen’s University, Kingston, Canada mei@cs.queensu.ca Thomas Zimmermann Microsoft Research Redmond, WA, USA tzimmer@microsoft.com Christian Bird Microsoft Research Redmond, WA, USA Christian.Bird@microsoft.com ABSTRACT One of the goals of software engineering research is to achieve generality: Are the phenomena found in a few projects reflective of others? Will a technique perform as well on projects other than the projects it is evaluated on? While it is common sense to select a sample that is representative of a population, the importance of diversity is often overlooked, yet as important. In this paper, we combine ideas from representativeness and diversity and introduce a measure called sample coverage, defined as the percentage of projects in a population that are similar to the given sample. We introduce algorithms to compute the sample coverage for a given set of projects and to select the projects that increase the coverage the most. We demonstrate our technique on research presented over the span of two years at ICSE and FSE with respect to a population of 20,000 active open source projects monitored by Ohloh.net. Knowing the coverage of a sample enhances our ability to reason about the findings of a study. Furthermore, we propose reporting guidelines for research: in addition to coverage scores, papers should discuss the target population of the research (universe) and dimensions that potentially can influence the outcomes of a research (space). Categories and Subject Descriptors D.2.6 [Software Engineering]: Metrics et al. [2] examined 1,000 projects. Another example is the study by Gabel and Su that examined 6,000 projects [3]. But if care isn’t taken when selecting which projects to analyze, then increasing the sample size does not actually contribute to the goal of increased generality. 
More is not necessarily better. As an example, consider a researcher who wants to investigate a hypothesis about say distributed development on a large number of projects in an effort to demonstrate generality. The researcher goes to the json.org website and randomly selects twenty projects, all of them JSON parsers. Because of the narrow range of functionality of the projects in the sample, any findings will not be very representative; we would learn about JSON parsers, but little about other types of software. While this is an extreme and contrived example, it shows the importance of systematically selecting projects for empirical research rather than selecting projects that are convenient. With this paper we provide techniques to (1) assess the quality of a sample, and to (2) identify projects that could be added to further improve the quality of the sample. Other fields such as medicine and sociology have published and accepted methodological guidelines for subject selection [2] [4]. While it is common sense to select a sample that is representative of a population, the importance of diversity is often overlooked yet as important [5]. As stated by the Research Governance Framework
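The sample coverage idea quoted above (the share of a population that is similar to at least one sampled project) can be illustrated as follows. The excerpt does not say how similarity is computed, so the rule used here, values within half an order of magnitude on every dimension, is purely an assumption, as are the dimension names and the toy population. The function returns a fraction rather than a percentage.

```python
import math
from typing import Dict, List, Sequence

Project = Dict[str, float]


def similar(a: Project, b: Project, dimensions: Sequence[str], tolerance: float = 0.5) -> bool:
    """Assumed rule: values differ by at most `tolerance` orders of magnitude
    on every dimension (all values must be positive)."""
    return all(
        abs(math.log10(a[d]) - math.log10(b[d])) <= tolerance for d in dimensions
    )


def sample_coverage(
    sample: List[Project], population: List[Project], dimensions: Sequence[str]
) -> float:
    """Fraction of the population similar to at least one sampled project."""
    covered = sum(
        1 for p in population if any(similar(p, s, dimensions) for s in sample)
    )
    return covered / len(population)


# Toy population described by two assumed dimensions: size and team size.
dims = ["loc", "devs"]
population = [
    {"loc": 1_000, "devs": 2},
    {"loc": 2_000, "devs": 3},
    {"loc": 200_000, "devs": 100},
    {"loc": 500_000, "devs": 200},
]

if __name__ == "__main__":
    print(sample_coverage([population[0]], population, dims))
    print(sample_coverage([population[0], population[3]], population, dims))
```

In this toy setting, adding one large project to a sample of small ones raises the coverage, which is the point of the slide: diversity of the sample can matter more than its size.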
  • 81. REPRODUCIBILITY • Not just about replication packages • Include details in your paper, which should be self-contained
  • 82. SUPPORTING TECHNOLOGY Zenodo, Jupyter notebooks, Docker containers, Virtual Machines
  • 83. REPOSITORIES ARE VOLATILE! • Q&A posts get deleted • GitHub repositories become private, archived, or get deleted • The same may happen to any content available on the Internet
  • 84. 78% OF PROMPTER’S RECOMMENDATIONS CHANGED AFTER ONE YEAR Empir Software Eng DOI 10.1007/s10664-015-9397-1 Prompter: Turning the IDE into a self-confident programming assistant Luca Ponzanelli1 · Gabriele Bavota2 · Massimiliano Di Penta3 · Rocco Oliveto4 · Michele Lanza1 © Springer Science+Business Media New York 2015 Abstract Developers often require knowledge beyond the one they possess, which boils down to asking co-workers for help or consulting additional sources of information, such as Application Programming Interfaces (API) documentation, forums, and Q&A websites. However, it requires time and energy to formulate one’s problem, peruse and process the
  • 85. IN CONCLUSION… If you run a study today, it may not be reproducible from scratch tomorrow unless you have archived all the data
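Because online content changes, it can help to freeze the mined data at retrieval time. The following sketch wraps a list of records with a retrieval timestamp and a SHA-256 checksum so that later readers can verify they hold the same data; the function name `snapshot`, the record fields, and the source label are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def snapshot(records: List[Dict[str, Any]], source: str) -> Dict[str, Any]:
    """Bundle mined records with provenance and a content checksum."""
    # Sorting keys makes the checksum independent of dictionary key order.
    payload = json.dumps(records, sort_keys=True, ensure_ascii=False)
    return {
        "source": source,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "records": records,
    }


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    snap = snapshot([{"id": 1, "title": "How do I parse JSON?"}], "example Q&A site")
    print(snap["retrieved_at"], snap["sha256"])
```

A snapshot like this can be deposited in an archive such as Zenodo together with the scripts that produced it.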
  • 87. PRESENTATION QUALITY • I rarely reject a paper because of that • It is not just a matter of getting your paper in, but rather of letting others better understand your work
  • 88. FOLLOWING A TEMPLATE • There are recurring templates for papers belonging to different categories • Such templates may help the reader know where to find what
  • 89. EMPIRICAL PAPER • Introduction • Study design (include the data extraction process) • Study results • Threats to validity • Related work • Conclusion
  • 90. EMPIRICAL STUDY DESIGN • Definition • Research Questions / Hypotheses • Context Selection • Data extraction methodology • Data analysis methodology
  • 91. TECHNOLOGICAL PAPER • Introduction • Background (if any) • Approach • Empirical evaluation (may be split) • Related Work • Conclusions
  • 92. NOTE • You do not have to stick to those templates • There can be good reasons to depart from them
  • 93. FOR EXAMPLE YOU MAY HAVE • First study needed to understand the problem • Approach definition (based on first study) • Approach evaluation
  • 94. GAME SMELL PAPER (MSR 2020) Detecting Video Game-Specific Bad Smells in Unity Projects Antonio Borrelli University of Sannio Benevento, Italy aborrelli@unisannio.it Vittoria Nardone University of Sannio Benevento, Italy vnardone@unisannio.it Giuseppe A. Di Lucca University of Sannio Benevento, Italy dilucca@unisannio.it Gerardo Canfora University of Sannio Benevento, Italy canfora@unisannio.it Massimiliano Di Penta University of Sannio Benevento, Italy dipenta@unisannio.it ABSTRACT The growth of the video game market, the large proportion of games targeting mobile devices or streaming services, and the increasing complexity of video games trigger the availability of video game-specific tools to assess performance and maintainability problems. This paper proposes UnityLinter, a static analysis tool that supports Unity video game developers to detect seven types of bad smells we have identified as relevant in video game development. Such smell types pertain to performance, maintainability and incorrect behavior problems. After having defined the smells by analyzing the existing literature and discussion forums, we have assessed their relevance with a survey involving 68 participants. Then, we have analyzed the occurrence of the studied smells in 100 open-source Unity projects, and also assessed UnityLinter’s accuracy. Results of our empirical investigation indicate that developers well-received performance- and behavior-related issues, while some maintainability issues are more controversial. UnityLinter is, in general, accurate enough in detecting smells (86%-100% precision and 50%-100% recall), and our study shows that the studied smell types occur in 39%-97% of the analyzed projects. 1 INTRODUCTION Video games represent a conspicuous and increasing share of the software development market. In 2018, the video game industry has generated 134.9 billion dollars, with over 10% increase over 2017 [25]. 
Such a market is changing continuously also in terms of platforms on which video games are deployed. In the past, video games mainly targeted consoles and desktop computers; nowadays mobile devices account for nearly half of the market [24], and the current trend is the streaming of video game contents. While the video game market is increasing, development skills in this area still represent a niche. Just to give an idea, Stack Overflow features over 1.5M discussions tagged [java] and 1.2M tagged Android, while only 50k are about Unity3D. It is therefore clear how in this context developers may need suitable support while creating their video games, helping them to avoid introducing performance bottlenecks, or making the game difficult to maintain and evolve. Static code analysis tools (SCAT) are a typical support developers have while coding. Such tools, known also as “linters” (from the first tool developed by Johnson for the C language [28]) analyze the source code or the compiled (e.g., bytecode) program to highlight
  • 95. YET… One reviewer complained that research questions weren’t all addressed in a single place
  • 96. RELEASE NOTE GENERATION (TSE 2017) ARENA: An Approach for the Automated Generation of Release Notes Laura Moreno, Member, IEEE, Gabriele Bavota, Member, IEEE, Massimiliano Di Penta, Member, IEEE, Rocco Oliveto, Member, IEEE, Andrian Marcus, Member, IEEE, and Gerardo Canfora Abstract: Release notes document corrections, enhancements, and, in general, changes that were implemented in a new release of a software project. They are usually created manually and may include hundreds of different items, such as descriptions of new features, bug fixes, structural changes, new or deprecated APIs, and changes to software licenses. Thus, producing them can be a time-consuming and daunting task. This paper describes ARENA (Automatic RElease Notes generAtor), an approach for the automatic generation of release notes. ARENA extracts changes from the source code, summarizes them, and integrates them with information from versioning systems and issue trackers. ARENA was designed based on the manual analysis of 990 existing release notes. In order to evaluate the quality of the release notes automatically generated by ARENA, we performed four empirical studies involving a total of 56 participants (48 professional developers and eight students). The obtained results indicate that the generated release notes are very good approximations of the ones manually produced by developers and often include important information that is missing in the manually created release notes. Index Terms: Release notes, software documentation, software evolution 1 INTRODUCTION Release notes summarize the main changes that occurred in a software system since its previous release, such as, the addition of new features, bug fixes, changes to licenses this task by generating simplified release notes (e.g., the Atlassian OnDemand release note generator), yet such notes are limited to list closed issues that developers have manually 106 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 43, NO. 2, FEBRUARY 2017
  • 102. FROM MSR 2021 CALL FOR PAPERS • Soundness of approach • Relevance to software engineering • Clarity of relation with related work • Quality of presentation • Quality of evaluation [for long papers] • Ability to replicate [for long papers] • Novelty https://2021.msrconf.org/track/msr-2021-technical-papers?#Call-for-Papers Contribution types: METHODOLOGICAL, INFRASTRUCTURE, PERSPECTIVE, EMPIRICAL, TECHNOLOGICAL TAKEAWAYS • Different types of contributions to MSR, beyond studies, are highly needed • Dataset size and type depend on the study goals and research method • Mining process must be documented and justified in detail dipenta@unisannio.it @mdipenta