Big(ger) Data
Open Source Software: Github, Apache, Sourceforge
App Store Data: App Store, Google, App, Windows store
Execution Logs: Amazon, Microsoft
Stackoverflow, TopCoder
Big(ger) Data in Software Engineering
Meiyappan Nagappan, Mehdi Mirakhorli
Rochester Institute of Technology
Nagappan & Mirakhorli ICSE 2015
Speakers
Meiyappan Nagappan
Dept of Software Engineering
Rochester Institute of Technology
mei@se.rit.edu
http://mei-nagappan.com
Mehdi Mirakhorli
Dept of Software Engineering
Rochester Institute of Technology
mehdi@se.rit.edu
http://www.se.rit.edu/~mehdi/
Our Research Collaborations
Sam Malek, George Mason University
Rick Kazman, SEI
Yuanfang Cai, Drexel University
Patrick Maeder, Technical University of Ilmenau
Bob Hanmer, Alcatel-Lucent
Muhammad Ali Babar, University of Adelaide
Robert L. Nord, SEI
Jane Cleland-Huang, DePaul University
Agenda
Agenda for the Technical Briefing of Big(ger) Data in Software Engineering
Why Big(ger) Data in software engineering
• Introduction
• Defining concepts
• One Minute Madness activity
State of the art in empirical SE and large datasets
• Summary
• Advances
• Challenges
Public datasets
• Repositories
• Properties
• Accessibility
Tools and techniques to analyze large datasets
• Infrastructure
• Languages
• Techniques
• Example
What are you passionate about?
BIG DATA in SE
BIG(ger) DATA in SE
SE datasets that are several orders of magnitude BIGGER
All Android Apps in Google Play
So Why BIG(ger) Data in SE Now?
Access to Data
Computing Power
But, Big DATA => Big CHALLENGES
Volume
Velocity
Variety
Veracity/Noise
But why should SE Research adopt BIG(ger) Data?
World of Code
FSE13: Extracted Case Study Subject Systems from All Research Papers
Diversity in ICSE/FSE
[Figure: project dimensions compared: TLOC, Language Type, # Dev, Churn, # Commits, Age, Activity]
World of Code: What Area do SE Studies Cover?
Why is Big(ger) Data useful?
Patterns:
• Data brings knowledge
• Can you find new patterns?
Generalizability:
• Only a limited set of projects is examined.
• Results are valid only in the context of those projects.
Datasets
100s of GBs of Execution Logs per Day
App Store Data
100+ 300K+ 6.9M+
All the Open Source Projects in the World
Crowd-sourced Data
Sourcerer
Sourcerer provides a collection of tools for automated crawling, parsing and fingerprinting of open source applications.
Repositories: Apache, Java.net, Google Code and Sourceforge.
Collected Info:
– Versioned source code across multiple releases
– Documentation (if available)
– Projects’ metadata
– A coarse-grained structural analysis of each project.
Size: Over 20,000 open source systems.
Usage data of Koders.com
sourcerer-maven-aug12 containing 2,232 projects from the Maven Central repository (~80GB).
Download: http://www.ics.uci.edu/~lopes/datasets/
lopes@ics.uci.edu
Sushil Bajracharya, Joel Ossher, and Cristina Lopes. 2014. Sourcerer: An infrastructure for large-scale collection and analysis of open-source code. Sci. Comput. Program. 79 (January 2014), 241-259.
Boa
Domain-specific language and infrastructure for software repository mining.
• The Boa project has collected the source code of 23K Java projects (Subversion only).
• Metadata of 600K projects.
• Offers a domain-specific language to query the data; it is primarily useful for replicating existing research where the concepts are known and well understood.
GHTorrent
GitHub - http://ghtorrent.org/
Creates a scalable, queryable, offline mirror of the data offered through the GitHub REST API.
Every two months, the project releases the collected data.
Apache
Apache - http://svn-dump.apache.org/
Download 250 Apache projects, in 24 categories (domains).
TeraPromise
https://terapromise.csc.ncsu.edu/
http://openscience.us/repo/
Our summary. And other related books
2014 2015 2016
The MSR community and others
Perspectives on Data Science for Software Engineering
Tim Menzies, Laurie Williams, Thomas Zimmermann
StackOverflow
http://blog.stackoverflow.com/2014/01/stack-exchange-cc-data-now-hosted-by-the-internet-archive/
http://2013.msrconf.org/challenge.php#challenge_data
Anonymized dump of all user-contributed Stack Exchange content since 2009.
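A dump of this size is usually read as a stream rather than loaded whole. The sketch below counts question tags with Python's standard library. It assumes the dump's Posts.xml layout of <row> elements carrying PostTypeId and Tags attributes, with tags written as "<java><android>"; check the readme shipped with the dump you download, since the layout may differ between releases. The sample data and the function name count_question_tags are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny in-memory stand-in for Posts.xml (assumed layout, made-up rows).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Tags="&lt;java&gt;&lt;android&gt;" />
  <row Id="2" PostTypeId="2" />
  <row Id="3" PostTypeId="1" Tags="&lt;java&gt;" />
</posts>
"""


def count_question_tags(xml_stream):
    """Stream over <row> elements and count tags on questions only."""
    counts = Counter()
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        # PostTypeId "1" is assumed to mark a question.
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            counts.update(re.findall(r"<([^<>]+)>", elem.get("Tags", "")))
        elem.clear()  # drop the element's data so memory stays bounded
    return counts


tag_counts = count_question_tags(io.BytesIO(SAMPLE))
print(tag_counts.most_common())
```

For a real dump, pass an open file object instead of the in-memory sample.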
Google Play
Hadoop
Who uses Hadoop?
Amazon
Facebook
Google
IBM
Intel Research
Joost
Last.fm
New York Times
PowerSet
Veoh
Yahoo!
Hadoop’s Major Subsystems
• HDFS is designed for large, streaming reads of files.
• Files in HDFS are write once.
Map-Reduce Example
1. Read: Sequentially read a lot of data
2. Map: Extract something you care about
3. Group by key: Sort and Shuffle
4. Reduce: Aggregate, summarize, filter or transform
5. Write the result
Depending on the problem, you only define the map and reduce functions.
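The five steps above can be sketched on a single machine in plain Python. This is an illustration of the programming model only, not Hadoop's API: the driver map_reduce, the functions map_fn and reduce_fn, and the "commit id,author" record format are all hypothetical. Only map_fn and reduce_fn depend on the problem; the driver stays the same.

```python
from collections import defaultdict
from itertools import chain


def map_fn(record):
    """Map: extract the part we care about, here the commit author."""
    parts = record.split(",")
    if len(parts) >= 2:
        yield parts[1].strip(), 1


def reduce_fn(key, values):
    """Reduce: aggregate all values that share one key."""
    return key, sum(values)


def map_reduce(records, map_fn, reduce_fn):
    # Steps 1 and 2: read records sequentially and map each one.
    pairs = chain.from_iterable(map_fn(r) for r in records)
    # Step 3: group by key (the shuffle), then visit keys in sorted order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Step 4: reduce each group to one result.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


# Hypothetical input: one "commit id,author" record per line.
records = ["c1,alice", "c2,bob", "c3,alice"]
result = map_reduce(records, map_fn, reduce_fn)
# Step 5: write the result.
for author, commits in result:
    print(author, commits)
```

On a cluster, the framework runs the map and reduce calls in parallel across machines and handles the grouping step itself.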
Data-Mining Libraries
Mahout: a framework for building scalable algorithms, many new Scala + Spark (H2O in progress) algorithms, and Mahout's mature Hadoop MapReduce algorithms.
Dimensionality Reduction
• Singular Value Decomposition
• Lanczos Algorithm
• Stochastic SVD
• PCA
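As a rough illustration of the PCA entry in this list, the sketch below finds the first principal component of a small dataset by power iteration on the covariance matrix. It is a single-machine toy in plain Python, not the library's implementation; the function name top_principal_component and the sample points are hypothetical.

```python
import math
import random


def top_principal_component(rows, iterations=200, seed=0):
    """Return (direction, variance) of the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance along the direction found.
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    variance = sum(v[i] * cov_v[i] for i in range(d))
    return v, variance


# Made-up points lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, variance = top_principal_component(points)
print(direction, variance)
```

Power iteration is the simplest member of this family; the Lanczos and stochastic SVD methods listed above address the same problem at much larger scale.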
Data-Mining Libraries
Parallel Computing Toolbox™ supports solving computationally and data-intensive problems using multicore processors, GPUs, and computer clusters.
http://it.mathworks.com/products/parallel-computing/
Mr.LDA
Mr.LDA is an open-source package for flexible, scalable, multilingual topic modeling using variational inference in MapReduce.
https://github.com/lintool/Mr.LDA
Collection of Different Statistical Methods and Computing for Big Data
http://arxiv.org/pdf/1502.07989v1.pdf
RHadoop
A collection of R packages that allow users to manage and analyze data with Hadoop.
https://github.com/RevolutionAnalytics/RHadoop/
Data-Mining Libraries
Is Hadoop THE SOLUTION?
SE problems using Big Data (to name a few)
Big Data Analytics Applications
Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds
Code Evolution Analysis
Clone Detection
Log Analysis
Mobile Apps
• API Change and Fault Proneness: A Threat to Success of Android Apps
• An Examination of the Current Rating System used in Mobile App Stores
• On the Relationship between the Number of Ad Libraries in an Android App and its Rating
Programming Languages
• A Large-Scale Empirical Study of the Relationship Between Build Technology and Build Maintenance
• A large scale study of programming languages and code quality in github
• An empirical study of goto in C code
Big(ger) Data Analysis in Requirements Engineering Domain
On-demand Feature Recommendations Derived from Mining Public Product Descriptions
Big(ger) Data Analysis in Software Architecture Domain
Variability Points and Design Pattern Usage in Architectural Tactics
Learn from millions of open source developers.
How can a high-level design decision (e.g., fault detection) be implemented using low-level implementation techniques (design patterns)?
Big(ger) Data Analysis and Rapid Development
Our Research Manifesto
Assist various stakeholders to build better software:
Developer, Maintainer, Operator, Manager

More Related Content

What's hot

Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Anubhav Jain
 
Towards Effective Bug Triage with Software Data Reduction Techniques
Towards Effective Bug Triage with Software Data Reduction TechniquesTowards Effective Bug Triage with Software Data Reduction Techniques
Towards Effective Bug Triage with Software Data Reduction Techniques1crore projects
 
Past, Present, and Future of Analyzing Software Data
Past, Present, and Future of Analyzing Software DataPast, Present, and Future of Analyzing Software Data
Past, Present, and Future of Analyzing Software DataJeongwhan Choi
 
Planning and Executing Practice-Impactful Research
Planning and Executing Practice-Impactful ResearchPlanning and Executing Practice-Impactful Research
Planning and Executing Practice-Impactful ResearchTao Xie
 
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William EnckHotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William EnckTao Xie
 
Automating Environmental Computing Applications with Scientific Workflows
Automating Environmental Computing Applications with Scientific WorkflowsAutomating Environmental Computing Applications with Scientific Workflows
Automating Environmental Computing Applications with Scientific WorkflowsRafael Ferreira da Silva
 
Runtime Behavior of JavaScript Programs
Runtime Behavior of JavaScript ProgramsRuntime Behavior of JavaScript Programs
Runtime Behavior of JavaScript ProgramsIRJET Journal
 
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...Keiichiro Ono
 
A preliminary study on using code smells to improve bug localization
A preliminary study on using code smells to improve bug localizationA preliminary study on using code smells to improve bug localization
A preliminary study on using code smells to improve bug localizationkrws
 
DuraMat Data Management and Analytics
DuraMat Data Management and AnalyticsDuraMat Data Management and Analytics
DuraMat Data Management and AnalyticsAnubhav Jain
 
KunGao_Resume.
KunGao_Resume.KunGao_Resume.
KunGao_Resume.Gao Kun
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudDatabricks
 
Sharing is Caring: Understanding and Measuring Threat Intelligence Sharing Ef...
Sharing is Caring: Understanding and Measuring Threat Intelligence Sharing Ef...Sharing is Caring: Understanding and Measuring Threat Intelligence Sharing Ef...
Sharing is Caring: Understanding and Measuring Threat Intelligence Sharing Ef...Alex Pinto
 
Beyond Matching: Applying Data Science Techniques to IOC-based Detection
Beyond Matching: Applying Data Science Techniques to IOC-based DetectionBeyond Matching: Applying Data Science Techniques to IOC-based Detection
Beyond Matching: Applying Data Science Techniques to IOC-based DetectionAlex Pinto
 
Bug or Not? Bug Report Classification using N-Gram Idf
Bug or Not? Bug Report Classification using N-Gram IdfBug or Not? Bug Report Classification using N-Gram Idf
Bug or Not? Bug Report Classification using N-Gram IdfHideaki Hata
 
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...Raffaele Montella
 
Software Mining and Software Datasets
Software Mining and Software DatasetsSoftware Mining and Software Datasets
Software Mining and Software DatasetsTao Xie
 
Databases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyDatabases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyYannick Pouliot
 
Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?Felix Z. Hoffmann
 

What's hot (20)

Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
 
Towards Effective Bug Triage with Software Data Reduction Techniques
Towards Effective Bug Triage with Software Data Reduction TechniquesTowards Effective Bug Triage with Software Data Reduction Techniques
Towards Effective Bug Triage with Software Data Reduction Techniques
 
Past, Present, and Future of Analyzing Software Data
Past, Present, and Future of Analyzing Software DataPast, Present, and Future of Analyzing Software Data
Past, Present, and Future of Analyzing Software Data
 
Planning and Executing Practice-Impactful Research
Planning and Executing Practice-Impactful ResearchPlanning and Executing Practice-Impactful Research
Planning and Executing Practice-Impactful Research
 
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William EnckHotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
 
Automating Environmental Computing Applications with Scientific Workflows
Automating Environmental Computing Applications with Scientific WorkflowsAutomating Environmental Computing Applications with Scientific Workflows
Automating Environmental Computing Applications with Scientific Workflows
 
Runtime Behavior of JavaScript Programs
Runtime Behavior of JavaScript ProgramsRuntime Behavior of JavaScript Programs
Runtime Behavior of JavaScript Programs
 
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...
 
A preliminary study on using code smells to improve bug localization
A preliminary study on using code smells to improve bug localizationA preliminary study on using code smells to improve bug localization
A preliminary study on using code smells to improve bug localization
 
DuraMat Data Management and Analytics
DuraMat Data Management and AnalyticsDuraMat Data Management and Analytics
DuraMat Data Management and Analytics
 
KunGao_Resume.
KunGao_Resume.KunGao_Resume.
KunGao_Resume.
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
 
Sharing is Caring: Understanding and Measuring Threat Intelligence Sharing Ef...
Sharing is Caring: Understanding and Measuring Threat Intelligence Sharing Ef...Sharing is Caring: Understanding and Measuring Threat Intelligence Sharing Ef...
Sharing is Caring: Understanding and Measuring Threat Intelligence Sharing Ef...
 
Beyond Matching: Applying Data Science Techniques to IOC-based Detection
Beyond Matching: Applying Data Science Techniques to IOC-based DetectionBeyond Matching: Applying Data Science Techniques to IOC-based Detection
Beyond Matching: Applying Data Science Techniques to IOC-based Detection
 
Bug or Not? Bug Report Classification using N-Gram Idf
Bug or Not? Bug Report Classification using N-Gram IdfBug or Not? Bug Report Classification using N-Gram Idf
Bug or Not? Bug Report Classification using N-Gram Idf
 
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
 
Software Mining and Software Datasets
Software Mining and Software DatasetsSoftware Mining and Software Datasets
Software Mining and Software Datasets
 
Databases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyDatabases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems Immunology
 
On impact in Software Engineering Research (ICSE 2018 New Faculty Symposium)
On impact in Software Engineering Research (ICSE 2018 New Faculty Symposium)On impact in Software Engineering Research (ICSE 2018 New Faculty Symposium)
On impact in Software Engineering Research (ICSE 2018 New Faculty Symposium)
 
Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?
 

Similar to Big(ger) Data in Software Engineering

Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!Cloudera, Inc.
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009yhadoop
 
Big Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavBig Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavSwapnil (Neil) Jadhav
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
Career opportunities in open source framework
Career opportunities in open source frameworkCareer opportunities in open source framework
Career opportunities in open source frameworkedunextgen
 
Career opportunities in open source framework
Career opportunities in open source framework Career opportunities in open source framework
Career opportunities in open source framework edunextgen
 
2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato Review2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato ReviewHang Li
 
BDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkBDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkJerry Wen
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data ScienceDataWorks Summit
 
2019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 42019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 4Ferdin Joe John Joseph PhD
 
Containers at Netflx - An Evolving Story QConSF2015
Containers at Netflx - An Evolving Story QConSF2015Containers at Netflx - An Evolving Story QConSF2015
Containers at Netflx - An Evolving Story QConSF2015Sangeeta Narayanan
 
Apache AGE and the synergy effect in the combination of Postgres and NoSQL
 Apache AGE and the synergy effect in the combination of Postgres and NoSQL Apache AGE and the synergy effect in the combination of Postgres and NoSQL
Apache AGE and the synergy effect in the combination of Postgres and NoSQLEDB
 
BDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 DebriefingBDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 DebriefingDavid Lauzon
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceFerdin Joe John Joseph PhD
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
 
Ronak Agrawal 2018 Computer Science
Ronak Agrawal 2018 Computer Science Ronak Agrawal 2018 Computer Science
Ronak Agrawal 2018 Computer Science Ronak Agrawal
 
PPT5: Neuron Introduction
PPT5: Neuron IntroductionPPT5: Neuron Introduction
PPT5: Neuron Introductionakira-ai
 

Similar to Big(ger) Data in Software Engineering (20)

Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
 
Big Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavBig Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil Jadhav
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Career opportunities in open source framework
Career opportunities in open source frameworkCareer opportunities in open source framework
Career opportunities in open source framework
 
Career opportunities in open source framework
Career opportunities in open source framework Career opportunities in open source framework
Career opportunities in open source framework
 
2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato Review2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato Review
 
BDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkBDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of spark
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
2019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 42019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 4
 
Containers at Netflx - An Evolving Story QConSF2015
Containers at Netflx - An Evolving Story QConSF2015Containers at Netflx - An Evolving Story QConSF2015
Containers at Netflx - An Evolving Story QConSF2015
 
Apache AGE and the synergy effect in the combination of Postgres and NoSQL
 Apache AGE and the synergy effect in the combination of Postgres and NoSQL Apache AGE and the synergy effect in the combination of Postgres and NoSQL
Apache AGE and the synergy effect in the combination of Postgres and NoSQL
 
BDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 DebriefingBDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 Debriefing
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
 
Resume
ResumeResume
Resume
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Ronak Agrawal 2018 Computer Science
Ronak Agrawal 2018 Computer Science Ronak Agrawal 2018 Computer Science
Ronak Agrawal 2018 Computer Science
 
PPT5: Neuron Introduction
PPT5: Neuron IntroductionPPT5: Neuron Introduction
PPT5: Neuron Introduction
 

Recently uploaded

dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf


Big(ger) Data in Software Engineering

  • 1. Big(ger) Data in Software Engineering. Meiyappan Nagappan, Mehdi Mirakhorli, Rochester Institute of Technology. Big(ger) Data: Open Source Software (Github, Apache, Sourceforge); App Store Data (App Store, Google, App, Windows store); Execution Logs (Amazon, Microsoft); Stackoverflow; TopCoder
  • 2. Speakers: Meiyappan Nagappan, Dept of Software Engineering, Rochester Institute of Technology, mei@se.rit.edu, http://mei-nagappan.com; Mehdi Mirakhorli, Dept of Software Engineering, Rochester Institute of Technology, mehdi@se.rit.edu, http://www.se.rit.edu/~mehdi/ Nagappan & Mirakhorli ICSE 2015
  • 3. Our Research Collaborations
  • 5. Sam Malek (George Mason University); Rick Kazman (SEI); Yuanfang Cai (Drexel University); Patrick Maeder (University of Ilmenau); Bob Hanmer (Alcatel Lucent); Muhammad Ali Babar (University of Adelaide); Robert L. Nord (SEI); Jane Cleland-Huang (DePaul University)
  • 6. Agenda for the Technical Briefing of Big(ger) Data in Software Engineering. (1) Why Big(ger) Data in software engineering: Introduction; Defining concepts; One Minute Madness activity. (2) State of art in empirical SE and large datasets: Summary; Advances; Challenges. (3) Public datasets: Repositories; Properties; Accessibility. (4) Tools and techniques to analyze large datasets: Infrastructure; Languages; Techniques; Example.
  • 7. What are you passionate about?
  • 9. BIG DATA in SE
  • 10. BIG(ger) DATA in SE
  • 11. BIG(ger) DATA in SE: SE datasets that are several orders of magnitude BIGGER
  • 12. All Android Apps in Google Play
  • 13. So Why BIG(ger) Data in SE Now?
  • 14. Why BIG(ger) Data in SE Now? Access to Data
  • 15. Why BIG(ger) Data in SE Now? Access to Data; Computing Power
  • 16. But, Big DATA => Big CHALLENGES
  • 17. But, Big DATA => Big CHALLENGES: Volume
  • 18. But, Big DATA => Big CHALLENGES: Volume, Velocity
  • 19. But, Big DATA => Big CHALLENGES: Volume, Velocity, Variety
  • 20. But, Big DATA => Big CHALLENGES: Volume, Velocity, Variety, Veracity/Noise
  • 21. But why should SE Research adopt BIG(ger) Data?
  • 22. World of Code
  • 23. FSE13: Extracted Case Study Subject Systems from All Research Papers
  • 24. Diversity in ICSE/FSE. TLOC Language Type # Dev Churn # Commits Age Activity
  • 25. World of Code: What Area do SE Studies Cover?
  • 26. Why is Big(ger) Data useful? Patterns: Data brings knowledge. Can you find new patterns? Generalizability: A limited set of projects examined. Results are valid in the context
  • 28. Datasets: 100’s of GBs of Execution Logs per Day; App Store Data; All the Open Source Projects in the World; Crowd sourced Data. Counts shown on the slide: 100+, 300K+, 6.9M+
  • 29. Sourcerer: provides a collection of tools for automated crawling, parsing and fingerprinting of open source applications. Repositories: Apache, Java.net, Google Code and Sourceforge. Collected info: versioned source code across multiple releases; documentation (if available); projects’ metadata; a coarse-grained structural analysis of each project. Size: over 20,000 open source systems. Usage data of Koders.com. sourcerer-maven-aug12, containing 2,232 projects from the Maven Central repository (~80GB). Download: http://www.ics.uci.edu/~lopes/datasets/ lopes@ics.uci.edu Sushil Bajracharya, Joel Ossher, and Cristina Lopes. 2014. Sourcerer: An infrastructure for large-scale collection and analysis of open-source code. Sci. Comput. Program. 79 (January 2014), 241-259.
  • 30. Boa: Domain-specific language and infrastructure for software repository mining. • The Boa project has collected the source code of 23K Java projects (Subversion only) • Meta-data of 600K projects • Offers a domain-specific language to query the data; it is primarily useful for replicating existing research where the concepts are known and well understood
  • 31. GHTorrent (Github): http://ghtorrent.org/ Creates a scalable, queriable, offline mirror of data offered through the Github REST API. Every two months, the project releases the collected data.
  • 32. Apache: http://svn-dump.apache.org/ Download 250 Apache projects, in 24 categories (domains)
  • 34. 2014, 2015, 2016. TeraPromise. Our summary. And other related books. The MSR community and others. Perspectives on Data Science for Software Engineering (Tim Menzies, Laurie Williams, Thomas Zimmermann)
  • 36. Google Play
  • 39. Who uses Hadoop? Amazon, Facebook, Google, IBM, Intel Research, Joost, Last.fm, New York Times, PowerSet, Veoh, Yahoo!
  • 40. Hadoop’s Major Subsystems • HDFS is designed for large, streaming reads of files. • Files in HDFS are write once.
  • 41. Map-Reduce Example. 1. Read: Sequentially read a lot of data 2. Map: Extract something you care about 3. Group by key: Sort and Shuffle 4. Reduce: Aggregate, summarize, filter or transform 5. Write the result. Depending on the problem, you only define map and reduce functions.
  • 42. Data-Mining Libraries. A framework for building scalable algorithms, many new Scala + Spark (H2O in progress) algorithms, and Mahout's mature Hadoop MapReduce algorithms. Dimensionality Reduction: Singular Value Decomposition, Lanczos Algorithm, Stochastic SVD, PCA
  • 43. Data-Mining Libraries. Parallel Computing Toolbox™ supports solving computationally and data-intensive problems using multicore processors, GPUs, and computer clusters. http://it.mathworks.com/products/parallel-computing/ Mr.LDA is an open-source package for flexible, scalable, multilingual topic modeling using variational inference in MapReduce. http://arxiv.org/pdf/1502.07989v1.pdf Collection of Different Statistical Methods and Computing for Big Data. Mr.LDA: https://github.com/lintool/Mr.LDA RHadoop: https://github.com/RevolutionAnalytics/RHadoop/ A collection of R packages that allow users to manage and analyze data with Hadoop.
  • 44. Data-Mining Libraries: Is Hadoop THE SOLUTION?
  • 46. SE problems using Big Data (to name a few)
  • 47. Big Data Analytics Applications: Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds; Code Evolution Analysis; Clone Detection; Log Analysis
  • 48. Mobile Apps: API Change and Fault Proneness: A Threat to Success of Android Apps; An Examination of the Current Rating System used in Mobile App Stores; On the Relationship between the Number of Ad Libraries in an Android App and its Rating
  • 49. Programming Languages: A Large-Scale Empirical Study of the Relationship Between Build Technology and Build Maintenance; A large scale study of programming languages and code quality in github; An empirical study of goto in C code
  • 50. Big(ger) Data Analysis in Requirement Engineering Domain: On-demand Feature Recommendations Derived from Mining Public Product Descriptions
  • 53. Big(ger) Data Analysis in Software Architecture Domain: Variability Points and Design Pattern Usage in Architectural Tactics. Learn from millions of open source developers. How to implement high-level design decisions (fault detection) using low-level implementation techniques (design patterns)?
  • 54. Big(ger) Data Analysis and Rapid Development
  • 56. Our Research Manifesto: Assist various stakeholders (Developer, Maintainer, Operator, Manager) to build better software
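
The five Map-Reduce steps listed on slide 41 can be sketched in plain Python, without Hadoop. This is only an illustrative single-process sketch: the log-line format, the sample data, and the names `map_line`, `reduce_counts` and `run_map_reduce` are assumptions made for the example and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical sample input: execution-log lines of the form "<LEVEL> <message>".
LOG_LINES = [
    "INFO job started",
    "WARN disk usage high",
    "INFO job finished",
    "ERROR task failed",
    "INFO job started",
]


def map_line(line):
    """Map: extract the part we care about, here a (log level, 1) pair."""
    level = line.split(" ", 1)[0]
    yield level, 1


def reduce_counts(key, values):
    """Reduce: aggregate all values that share the same key."""
    return key, sum(values)


def run_map_reduce(lines, mapper, reducer):
    # 1. Read: walk through the input sequentially.
    # 2. Map: emit (key, value) pairs for every record.
    pairs = [pair for line in lines for pair in mapper(line)]
    # 3. Group by key: sort and shuffle so that equal keys end up together.
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    # 4. Reduce: summarize each group.
    # 5. Write the result (returned as a dict in this sketch).
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_map_reduce(LOG_LINES, map_line, reduce_counts))
```

As the slide says, only the map and reduce functions change from problem to problem; the read, group and write steps in `run_map_reduce` stay the same, and in a real Hadoop job they would be handled by the framework.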

Editor's Notes

  1. I would also like to acknowledge some of my industrial and academic collaborators from different parts of the world.
  2. I am also grateful to have worked with other upcoming academics from around the world, like Thorsten, Yasu, and Romain. Now that I have acknowledged a small percentage of the people who I have been able to work with, I will dive into my research.
  3. I have focused on using Big Data to deliver on my research goals. However, the term Big Data is absolute, and as a researcher absolutes do not sit well with me.
  4. I prefer the term Bigger Data, because the truth is that the size of the data is very relative. What may be big data for software engineers is very small data for climate scientists. I will therefore give some examples of the data that I have examined, for context. These are datasets that are several orders of magnitude bigger than those of typical SE studies, which in the past have looked at maybe a handful of case study subjects.
  6. Another example of bigger data in SE is the set of all the hundreds of thousands of apps in the Google Play market.
  7. So why study bigger data in SE now? There are two reasons.
  8. (1) We have access to various pieces of data on millions of software projects. We have development data, bug data, user review data, and software execution data.
  9. And (2) we also have the computing power necessary to analyze these terabytes of data, from resource providers like Amazon.
  10. But, Big DATA => Big CHALLENGES. The research community on Big Data has identified 4 V’s, …
  11. namely Volume or just the size of the dataset
  12. and velocity, or the rate at which data is generated. These two issues present challenges with respect to what kind of analysis can be applied to the dataset. We need algorithms that are not just quick and efficient, but that also scale well.
  13. Then there is variety in the data. One example is mobile apps in the Google Play store: there are apps for banks that are built by software companies, and game apps that are built by one developer in their spare time. Each is equally popular, but the development practices and purpose of each are very different.
  14. And finally the veracity of the data, or how we filter noise. For example, when we look at the development practices in open source repositories like Github, we have to filter out the student repositories that were created for assignments. The last two, namely variety and veracity, affect the conclusions that we arrive at. If the noise remains in the data, we may arrive at conclusions that are not valid for regular software development.
  16. So given the world of source code that we know about, that we have data about as researchers,
  17. We then measure the diversity of the case study subjects used in the research papers in two of the top SE research venues, ICSE and FSE, against the diversity of the Ohloh dataset.
  18. When all 7 attributes are taken together, we find that SE research has very low diversity. Even the exemplary papers do not have very high diversity among their case study systems.
  19. We wanted to ask what percentage of this world of code is covered by current SE studies. Are a majority of studies just focused on a small area of the WoC?
  20. For example one such dataset that I examined are the millions of log lines generated every day by data centers and cloud platforms that are stored in execution log files. These files are typically 100’s of GBs big. In the interest of time, I will not be presenting my research on log files today. But if you are interested, please do let me know and we can talk about it later.
  21. Dimensionality Reduction: Singular Value Decomposition, Lanczos Algorithm, Stochastic SVD, PCA
  23. Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds
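
Note 14 describes filtering noise, such as student assignment repositories, out of the data before drawing conclusions. One possible shape for such a filter is sketched below. The `Repo` fields, the keyword list and the thresholds are all hypothetical choices made for illustration; the notes do not say which criteria were actually used.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    # Hypothetical metadata fields for one repository.
    name: str
    commits: int
    contributors: int


# Assumed heuristics, not taken from the talk.
NOISE_KEYWORDS = ("homework", "assignment", "coursework")
MIN_COMMITS = 50
MIN_CONTRIBUTORS = 2


def looks_like_noise(repo):
    """Flag repositories that resemble student or toy projects."""
    name = repo.name.lower()
    if any(word in name for word in NOISE_KEYWORDS):
        return True
    return repo.commits < MIN_COMMITS or repo.contributors < MIN_CONTRIBUTORS


def filter_noise(repos):
    """Keep only the repositories that pass every heuristic."""
    return [repo for repo in repos if not looks_like_noise(repo)]


SAMPLE = [
    Repo("cs101-homework-3", commits=12, contributors=1),
    Repo("payments-service", commits=840, contributors=9),
    Repo("weekend-experiment", commits=4, contributors=1),
    Repo("Assignment-Tracker", commits=300, contributors=5),
]

if __name__ == "__main__":
    for repo in filter_noise(SAMPLE):
        print(repo.name)
```

The last sample entry shows the risk the note points at from the other side: a name-based rule also discards a substantial project, so any such heuristic affects the conclusions and should be reported alongside the results.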