Data has always been used in every company, irrespective of domain, to improve operational
efficiency and the products themselves. However, analyzing and extracting information from “Big Data”
is the next revolution in technology, since previously unknown nuggets of information are now made
visible. In fact, over 90% of the data available in the world has been generated in the last two years.
“Big Data” analytics has become the next hot topic for most companies, from financial institutions to
technology companies to service providers. Likewise, in software engineering, data collected about the
development of software, its operation in the field, and users’ feedback on it has
been used before. However, collecting and analyzing this information across hundreds of thousands
or millions of software projects gives us the unique ability to reason about the ecosystem at large, and
about software in general. At no time in history has there been easier access to extremely powerful
computational resources than there is today, thanks to advances in cloud computing, from both the
technology and business perspectives. Therefore, it is easier today than ever before to analyze big data.
In this technical briefing, we will present the state of the art in big data analytics research in
software engineering. We will present the research along three
dimensions:
1) What are the software engineering problems being solved? Examples of problems include: How
much source code is newly written and how much is reused from past projects? Can we
recommend best practices to developers by observing the development of software among
hundreds of thousands of software projects?
2) What are the datasets that are being used? Examples of such datasets include: all the mobile apps
in the Google Play store, all of the world’s open source projects, and hundreds of gigabytes of
execution logs. Such large datasets provide us with a unique view into the SE field.
3) What are the tools and techniques available to analyze the large datasets? We intend to present
generic software solutions that have been applied to big datasets in other areas of research, and
the tools and techniques created by software engineering researchers.
Finally, we will present the challenges inherent in large datasets: volume, variety, velocity,
and veracity. Such challenges often complicate the analysis of the data and can invalidate the
interpretation of the results. We will conclude with the future opportunities that big data
analytics presents for software engineering research.
Big(ger) Data in Software Engineering
1. Big(ger) Data
Open Source Software: GitHub, Apache, SourceForge
App Store Data: Google App Store, Windows Store
Execution Logs: Amazon, Microsoft
Crowdsourced Data: Stack Overflow, TopCoder
Big(ger) Data in Software Engineering
Meiyappan Nagappan, Mehdi Mirakhorli
Rochester Institute of Technology
2. Speakers
Nagappan & Mirakhorli, ICSE 2015
Meiyappan Nagappan
Dept of Software Engineering, Rochester Institute of Technology
mei@se.rit.edu
http://mei-nagappan.com
Mehdi Mirakhorli
Dept of Software Engineering, Rochester Institute of Technology
mehdi@se.rit.edu
http://www.se.rit.edu/~mehdi/
5. Sam Malek - George Mason University
Rick Kazman - SEI
Yuanfang Cai - Drexel University
Patrick Maeder - TU Ilmenau
Bob Hanmer - Alcatel-Lucent
Muhammad Ali Babar - University of Adelaide
Robert L. Nord - SEI
Jane Cleland-Huang - DePaul University
6. Agenda for the Technical Briefing of Big(ger) Data in Software Engineering
Why Big(ger) Data in software engineering
• Introduction
• Defining concepts
• One Minute Madness activity
State of the art in empirical SE and large datasets
• Summary
• Advances
• Challenges
Public datasets
• Repositories
• Properties
• Accessibility
Tools and techniques to analyze large datasets
• Infrastructure
• Languages
• Techniques
• Example
25. World of Code: What Area do SE Studies Cover?
26. Why is Big(ger) Data useful?
Patterns: data brings knowledge. Can you find new patterns?
Generalizability: when only a limited set of projects is examined, the results are valid only in that context.
28. Datasets
• 100’s of GBs of execution logs per day
• App store data
• All the open source projects in the world
• Crowdsourced data
29. Sourcerer
Sourcerer provides a collection of tools for automated crawling, parsing, and fingerprinting of open source applications.
Repositories: Apache, Java.net, Google Code, and SourceForge.
Collected info:
– Versioned source code across multiple releases
– Documentation (if available)
– Projects’ metadata
– A coarse-grained structural analysis of each project
Size: over 20,000 open source systems.
Also available: usage data of Koders.com, and sourcerer-maven-aug12, containing 2,232 projects from the Maven Central repository (~80 GB).
Download: http://www.ics.uci.edu/~lopes/datasets/ (lopes@ics.uci.edu)
Sushil Bajracharya, Joel Ossher, and Cristina Lopes. 2014. Sourcerer: An infrastructure for large-scale collection and analysis of open-source code. Sci. Comput. Program. 79 (January 2014), 241-259.
30. Boa
A domain-specific language and infrastructure for software repository mining.
• The Boa project has collected the source code of 23K Java projects (Subversion only).
• Metadata of 600K projects.
• Offers a domain-specific language to query the data; it is primarily useful for replicating existing research where the concepts are known and well understood.
31. GHTorrent
GitHub - http://ghtorrent.org/
GHTorrent creates a scalable, queryable, offline mirror of the data offered through the GitHub REST API. Every two months, the project releases the collected data.
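The mirroring idea, paging through a rate-limited REST API once and answering all later queries from the local copy, can be sketched as follows. `fetch_page` is a hypothetical stand-in for the HTTP call to the GitHub REST API, and the real GHTorrent persists into MongoDB and MySQL rather than a Python list; this is only an illustration of the pattern.

```python
def build_offline_mirror(fetch_page, max_pages=1000):
    """Drain a paginated REST-style endpoint into a local store.

    fetch_page(page) stands in for an HTTP GET against the GitHub
    REST API; it returns a list of JSON records, empty when the
    remote side has no more pages.
    """
    mirror = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:          # no more pages on the remote side
            break
        mirror.extend(batch)   # later queries run locally, API-free
    return mirror

# Simulated endpoint with two pages of repository records
pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
mirror = build_offline_mirror(lambda p: pages.get(p, []))
# mirror holds all three records
```

The payoff of this design is that the expensive, rate-limited crawl happens once, and every downstream analysis becomes a cheap local query.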
34. Our summary, and other related books (2014-2016)
• TeraPromise - the MSR community and others
• Perspectives on Data Science for Software Engineering - Tim Menzies, Laurie Williams, Thomas Zimmermann
40. Hadoop’s Major Subsystems
• HDFS is designed for large, streaming reads of files.
• Files in HDFS are write-once.
41. Map-Reduce Example
1. Read: sequentially read a lot of data
2. Map: extract something you care about
3. Group by key: sort and shuffle
4. Reduce: aggregate, summarize, filter, or transform
5. Write the result
Depending on the problem, you only define the map and reduce functions.
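The five steps above can be sketched with the canonical word-count example. This is a minimal, single-machine Python sketch of the data flow only (a real job would distribute the same map and reduce functions across a Hadoop cluster; all function names here are ours, not Hadoop's API):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # 1. Read + 2. Map: extract (key, value) pairs from each input record
    pairs = [kv for record in records for kv in mapper(record)]
    # 3. Group by key: sort, then shuffle values sharing a key together
    grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
    # 4. Reduce + 5. Write: aggregate each key's values into the result
    return {key: reducer(key, [v for _, v in group])
            for key, group in grouped}

def mapper(line):
    # Emit (word, 1) for every word in the line
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Sum the per-word counts
    return sum(counts)

counts = run_mapreduce(
    ["big data in software engineering", "big data analytics"],
    mapper, reducer)
# counts["big"] == 2, counts["data"] == 2
```

Note that, exactly as the slide says, only `mapper` and `reducer` are problem-specific; the read, shuffle, and write machinery stays the same for every job.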
42. Data-Mining Libraries
Apache Mahout: a framework for building scalable algorithms, with many new Scala + Spark algorithms (H2O in progress) and Mahout’s mature Hadoop MapReduce algorithms.
Dimensionality reduction algorithms:
• Singular Value Decomposition
• Lanczos Algorithm
• Stochastic SVD
• PCA
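As a minimal illustration of the last item, PCA can be obtained from the SVD of the centered data matrix. This single-machine NumPy sketch shows the idea that Mahout's distributed SVD routines implement at scale (the function and variable names are ours, not Mahout's API):

```python
import numpy as np

def pca_via_svd(X, k):
    # Center the data, then take the top-k right singular vectors
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]            # principal directions (k x features)
    projected = Xc @ components.T  # data expressed in the reduced space
    return projected, components

# Four 2-D points lying nearly on a line: a single component
# captures almost all of the variance
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]])
Z, comps = pca_via_svd(X, k=1)   # Z has shape (4, 1)
```

The distributed versions (Lanczos, Stochastic SVD) exist precisely because this direct SVD does not fit in memory once the data matrix has millions of rows.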
43. Data-Mining Libraries
MATLAB Parallel Computing Toolbox™: supports solving computationally and data-intensive problems using multicore processors, GPUs, and computer clusters. http://it.mathworks.com/products/parallel-computing/
Mr.LDA: an open-source package for flexible, scalable, multilingual topic modeling using variational inference in MapReduce. https://github.com/lintool/Mr.LDA, http://arxiv.org/pdf/1502.07989v1.pdf
A collection of different statistical methods and computing for Big Data.
RHadoop: a collection of R packages that allow users to manage and analyze data with Hadoop. https://github.com/RevolutionAnalytics/RHadoop/
46. SE problems using Big Data (to name a few)
47. Big Data Analytics Applications
• Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds
• Code Evolution Analysis
• Clone Detection
• Log Analysis
48. Mobile Apps
• API Change and Fault Proneness: A Threat to the Success of Android Apps
• An Examination of the Current Rating System used in Mobile App Stores
• On the Relationship between the Number of Ad Libraries in an Android App and its Rating
49. Programming Languages
• A Large-Scale Empirical Study of the Relationship Between Build Technology and Build Maintenance
• A large scale study of programming languages and code quality in github
• An empirical study of goto in C code
50. Big(ger) Data Analysis in the Requirements Engineering Domain
On-demand Feature Recommendations Derived from Mining Public Product Descriptions
53. Big(ger) Data Analysis in the Software Architecture Domain
Variability Points and Design Pattern Usage in Architectural Tactics
Learn from millions of open source developers: how do we implement a high-level design decision (fault detection) using low-level implementation techniques (design patterns)?
56. Our Research Manifesto
Assist various stakeholders (developers, maintainers, operators, and managers) to build better software.
Editor's Notes
I would also like to acknowledge some of my industrial and academic collaborators from different parts of the world.
I am also grateful to have worked with other upcoming academics from around the world, like Thorsten, Yasu, and Romain. Now that I have acknowledged a small percentage of the people who I have been able to work with, I will dive into my research.
I have focused on using Big Data to deliver on my research goals. However, the term Big Data is an absolute, and as a researcher, absolutes do not sit well with me.
I prefer the term Bigger Data, because the truth is that the size of the data is very relative. What may be big data for software engineers is very small data for climate scientists. I will therefore give some examples of the data that I use, for context. These are datasets that are several orders of magnitude bigger than typical SE datasets, which in the past have looked at maybe a handful of case-study subjects.
Another example of bigger data in SE is considering all of the hundreds of thousands of apps in the Google Play market.
So why study bigger data in SE now? There are two reasons.
(1) We have access to various pieces of data on millions of software projects. We have development data, bug data, user review data, and software execution data.
And (2) We also have the computing power necessary to analyze these terabytes of data – from resource providers like amazon.
But Big Data brings big challenges. The research community on Big Data has identified four V’s, …
namely volume, or just the size of the dataset,
and velocity, or the rate at which data is generated. These two present challenges with respect to what kind of analysis can be applied to the dataset. We need algorithms that are not just quick and efficient, but that also scale well.
Then there is variety in the data. One example is the mobile apps in the Google Play store: there are apps for banks that are built by software companies, and game apps that are built by one developer in their spare time. Each can be equally popular, but the development practices and purpose of each are very different.
And finally there is the veracity of the data, or how we filter noise. For example, when we look at development practices in open source repositories like GitHub, we have to filter out the student repositories that were created for assignments. The last two, variety and veracity, affect the conclusions that we arrive at. We may arrive at conclusions that are not valid for regular software development if the noise remains in the data.
So, given the world of source code that we know about, and that we have data about as researchers,
we then measure the diversity of the case-study subjects used in research papers from two of the top SE research venues, ICSE and FSE, against the diversity of the Ohloh dataset.
When all 7 attributes are taken together, we find that SE research has very low diversity. Even the exemplary papers do not have very high diversity among their case-study systems.
We wanted to ask what percentage of this world of code is covered by current SE studies. Are a majority of studies just focused on a small area of the WoC?
For example, one such dataset that I examined is the millions of log lines generated every day by data centers and cloud platforms, which are stored in execution log files. These files are typically hundreds of GBs in size. In the interest of time, I will not be presenting my research on log files today, but if you are interested, please do let me know and we can talk about it later.