Data has always been used in every company, irrespective of domain, to improve operational
efficiency and the products themselves. However, analyzing and extracting information from “Big Data”
is the next revolution in technology, since previously unknown nuggets of information are now made
visible. In fact, over 90% of the data available in the world has been generated in the last two years.
“Big Data” analytics has become the next hot topic for most companies, from financial institutions to
technology companies to service providers. Likewise, in software engineering, data collected about the
development of software, its operation in the field, and users’ feedback on it has
been used before. However, collecting and analyzing this information across hundreds of thousands
or millions of software projects gives us the unique ability to reason about the ecosystem at large, and
about software in general. At no time in history has there been easier access to extremely powerful
computational resources than there is today, thanks to advances in cloud computing, from both the
technology and business perspectives. Therefore, it is easier today than ever before to analyze big data.
In this technical briefing, we will present the state of the art in big data analytics research in
software engineering. We will present the research along three
dimensions:
1) What are the software engineering problems being solved? Examples of problems include: How
much source code is newly written and how much is reused from past projects? Can we
recommend best practices to developers by observing the development of software among
hundreds of thousands of software projects?
2) What are the datasets that are being used? Examples of such datasets include: all the mobile apps
in the Google Play store, all of the world’s open source projects, and hundreds of gigabytes of
execution logs. Such large datasets provide us with a unique view into the SE field.
3) What are the tools and techniques available to analyze the large datasets? We intend to present
generic software solutions that have been applied to big datasets in other areas of research, and
the tools and techniques created by software engineering researchers.
Finally, we will present the challenges inherent in large datasets: volume, variety, velocity,
and veracity. Such challenges often complicate the analysis of the data and can invalidate the
interpretation of the results. We will conclude with the future opportunities that big data
analytics presents for software engineering research.
Big(ger) Data in Software Engineering
1. Big(ger) Data
Open Source Software: GitHub, Apache, SourceForge
App Store Data: Google App Store, Windows Store
Execution Logs: Amazon, Microsoft
Crowdsourced Data: Stack Overflow, TopCoder
Big(ger) Data in Software Engineering
Meiyappan Nagappan, Mehdi Mirakhorli
Rochester Institute of Technology
2. Speakers
Nagappan & Mirakhorli, ICSE 2015
Meiyappan Nagappan
Dept of Software Engineering, Rochester Institute of Technology
mei@se.rit.edu
http://mei-nagappan.com
Mehdi Mirakhorli
Dept of Software Engineering, Rochester Institute of Technology
mehdi@se.rit.edu
http://www.se.rit.edu/~mehdi/
5. Sam Malek - George Mason University
Rick Kazman - SEI
Yuanfang Cai - Drexel University
Patrick Maeder - TU Ilmenau
Bob Hanmer - Alcatel-Lucent
Muhammad Ali Babar - University of Adelaide
Robert L. Nord - SEI
Jane Cleland-Huang - DePaul University
6. Agenda for the Technical Briefing of Big(ger) Data in Software Engineering
Why Big(ger) Data in software engineering
• Introduction
• Defining concepts
• One Minute Madness activity
State of the art in empirical SE and large datasets
• Summary
• Advances
• Challenges
Public datasets
• Repositories
• Properties
• Accessibility
Tools and techniques to analyze large datasets
• Infrastructure
• Languages
• Techniques
• Example
25. World of Code: What Area do SE Studies Cover?
26. Why is Big(ger) Data useful?
Patterns: data brings knowledge. Can you find new patterns?
Generalizability: when only a limited set of projects is examined, the results are valid only in that context.
28. Datasets
• 100’s of GBs of execution logs per day
• App store data
• All the open source projects in the world
• Crowdsourced data
29. Sourcerer
Sourcerer provides a collection of tools for automated crawling, parsing, and fingerprinting of open source applications.
Repositories: Apache, Java.net, Google Code, and SourceForge.
Collected info:
– Versioned source code across multiple releases
– Documentation (if available)
– Projects’ metadata
– A coarse-grained structural analysis of each project
Size: over 20,000 open source systems.
Also available: usage data of Koders.com, and sourcerer-maven-aug12, containing 2,232 projects from the Maven Central repository (~80 GB).
Download: http://www.ics.uci.edu/~lopes/datasets/ (lopes@ics.uci.edu)
Sushil Bajracharya, Joel Ossher, and Cristina Lopes. 2014. Sourcerer: An infrastructure for large-scale collection and analysis of open-source code. Sci. Comput. Program. 79 (January 2014), 241-259.
30. Boa
A domain-specific language and infrastructure for software repository mining.
• The Boa project has collected the source code of 23K Java projects (Subversion only).
• Metadata of 600K projects.
• Offers a domain-specific language to query the data; it is primarily useful for replicating existing research where the concepts are known and well understood.
31. GHTorrent
GitHub - http://ghtorrent.org/
GHTorrent creates a scalable, queryable, offline mirror of the data offered through the GitHub REST API. Every two months, the project releases the collected data.
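The mirroring idea, paging through a rate-limited REST API once and answering all later queries from the local copy, can be sketched as follows. `fetch_page` is a hypothetical stand-in for the HTTP call to the GitHub REST API, and the real GHTorrent persists into MongoDB and MySQL rather than a Python list; this is only an illustration of the pattern.

```python
def build_offline_mirror(fetch_page, max_pages=1000):
    """Drain a paginated REST-style endpoint into a local store.

    fetch_page(page) stands in for an HTTP GET against the GitHub
    REST API; it returns a list of JSON records, empty when the
    remote side has no more pages.
    """
    mirror = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:          # no more pages on the remote side
            break
        mirror.extend(batch)   # later queries run locally, API-free
    return mirror

# Simulated endpoint with two pages of repository records
pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
mirror = build_offline_mirror(lambda p: pages.get(p, []))
# mirror holds all three records
```

The payoff of this design is that the expensive, rate-limited crawl happens once, and every downstream analysis becomes a cheap local query.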
34. Our summary, and other related books (2014-2016)
• TeraPromise - the MSR community and others
• Perspectives on Data Science for Software Engineering - Tim Menzies, Laurie Williams, Thomas Zimmermann
40. Hadoop’s Major Subsystems
• HDFS is designed for large, streaming reads of files.
• Files in HDFS are write-once.
41. Map-Reduce Example
1. Read: sequentially read a lot of data
2. Map: extract something you care about
3. Group by key: sort and shuffle
4. Reduce: aggregate, summarize, filter, or transform
5. Write the result
Depending on the problem, you only define the map and reduce functions.
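The five steps above can be sketched with the canonical word-count example. This is a minimal, single-machine Python sketch of the data flow only (a real job would distribute the same map and reduce functions across a Hadoop cluster; all function names here are ours, not Hadoop's API):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # 1. Read + 2. Map: extract (key, value) pairs from each input record
    pairs = [kv for record in records for kv in mapper(record)]
    # 3. Group by key: sort, then shuffle values sharing a key together
    grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
    # 4. Reduce + 5. Write: aggregate each key's values into the result
    return {key: reducer(key, [v for _, v in group])
            for key, group in grouped}

def mapper(line):
    # Emit (word, 1) for every word in the line
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Sum the per-word counts
    return sum(counts)

counts = run_mapreduce(
    ["big data in software engineering", "big data analytics"],
    mapper, reducer)
# counts["big"] == 2, counts["data"] == 2
```

Note that, exactly as the slide says, only `mapper` and `reducer` are problem-specific; the read, shuffle, and write machinery stays the same for every job.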
42. Data-Mining Libraries
Apache Mahout: a framework for building scalable algorithms, with many new Scala + Spark algorithms (H2O in progress) and Mahout’s mature Hadoop MapReduce algorithms.
Dimensionality reduction algorithms:
• Singular Value Decomposition
• Lanczos Algorithm
• Stochastic SVD
• PCA
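As a minimal illustration of the last item, PCA can be obtained from the SVD of the centered data matrix. This single-machine NumPy sketch shows the idea that Mahout's distributed SVD routines implement at scale (the function and variable names are ours, not Mahout's API):

```python
import numpy as np

def pca_via_svd(X, k):
    # Center the data, then take the top-k right singular vectors
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]            # principal directions (k x features)
    projected = Xc @ components.T  # data expressed in the reduced space
    return projected, components

# Four 2-D points lying nearly on a line: a single component
# captures almost all of the variance
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]])
Z, comps = pca_via_svd(X, k=1)   # Z has shape (4, 1)
```

The distributed versions (Lanczos, Stochastic SVD) exist precisely because this direct SVD does not fit in memory once the data matrix has millions of rows.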
43. Data-Mining Libraries
MATLAB Parallel Computing Toolbox™: supports solving computationally and data-intensive problems using multicore processors, GPUs, and computer clusters. http://it.mathworks.com/products/parallel-computing/
Mr.LDA: an open-source package for flexible, scalable, multilingual topic modeling using variational inference in MapReduce. https://github.com/lintool/Mr.LDA, http://arxiv.org/pdf/1502.07989v1.pdf
A collection of different statistical methods and computing for Big Data.
RHadoop: a collection of R packages that allow users to manage and analyze data with Hadoop. https://github.com/RevolutionAnalytics/RHadoop/
46. SE problems using Big Data (to name a few)
47. Big Data Analytics Applications
• Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds
• Code Evolution Analysis
• Clone Detection
• Log Analysis
48. Mobile Apps
• API Change and Fault Proneness: A Threat to the Success of Android Apps
• An Examination of the Current Rating System used in Mobile App Stores
• On the Relationship between the Number of Ad Libraries in an Android App and its Rating
49. Programming Languages
• A Large-Scale Empirical Study of the Relationship Between Build Technology and Build Maintenance
• A large scale study of programming languages and code quality in github
• An empirical study of goto in C code
50. Big(ger) Data Analysis in the Requirements Engineering Domain
On-demand Feature Recommendations Derived from Mining Public Product Descriptions
53. Big(ger) Data Analysis in the Software Architecture Domain
Variability Points and Design Pattern Usage in Architectural Tactics
Learn from millions of open source developers: how do we implement a high-level design decision (fault detection) using low-level implementation techniques (design patterns)?
56. Our Research Manifesto
Assist various stakeholders (developers, maintainers, operators, and managers) to build better software.
Editor's Notes
I would also like to acknowledge some of my industrial and academic collaborators from different parts of the world.
I am also grateful to have worked with other upcoming academics from around the world, like Thorsten, Yasu, and Romain. Now that I have acknowledged a small percentage of the people who I have been able to work with, I will dive into my research.
I have focused on using Big Data to deliver on my research goals. However, the term Big Data is an absolute, and as a researcher, absolutes do not sit well with me.
I prefer the term Bigger Data, because the truth is that the size of the data is very relative. What may be big data for software engineers is very small data for climate scientists. I will therefore give some examples of the data that I use, for context. These are datasets that are several orders of magnitude bigger than typical SE datasets, which in the past have looked at maybe a handful of case-study subjects.
Another example of bigger data in SE is considering all of the hundreds of thousands of apps in the Google Play market.
So why study bigger data in SE now? There are two reasons.
(1) We have access to various pieces of data on millions of software projects. We have development data, bug data, user review data, and software execution data.
And (2) We also have the computing power necessary to analyze these terabytes of data – from resource providers like amazon.
But Big Data brings big challenges. The research community on Big Data has identified four V’s, …
namely volume, or just the size of the dataset,
and velocity, or the rate at which data is generated. These two present challenges with respect to what kind of analysis can be applied to the dataset. We need algorithms that are not just quick and efficient, but that also scale well.
Then there is variety in the data. One example is the mobile apps in the Google Play store: there are apps for banks that are built by software companies, and game apps that are built by one developer in their spare time. Each can be equally popular, but the development practices and purpose of each are very different.
And finally there is the veracity of the data, or how we filter noise. For example, when we look at development practices in open source repositories like GitHub, we have to filter out the student repositories that were created for assignments. The last two, variety and veracity, affect the conclusions that we arrive at. We may arrive at conclusions that are not valid for regular software development if the noise remains in the data.
So, given the world of source code that we know about, and that we have data about as researchers,
we then measure the diversity of the case-study subjects used in research papers from two of the top SE research venues, ICSE and FSE, against the diversity of the Ohloh dataset.
When all 7 attributes are taken together, we find that SE research has very low diversity. Even the exemplary papers do not have very high diversity among their case-study systems.
We wanted to ask what percentage of this world of code is covered by current SE studies. Are a majority of studies just focused on a small area of the WoC?
For example, one such dataset that I examined is the millions of log lines generated every day by data centers and cloud platforms, which are stored in execution log files. These files are typically hundreds of GBs in size. In the interest of time, I will not be presenting my research on log files today, but if you are interested, please do let me know and we can talk about it later.