An Experience Report on Scaling Tools for MSR Studies Using MapReduce
Weiyi Shang, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen’s University
Mining Software Repositories: Propagating code changes
Method A is changed. Method A calls Method B, and Method C calls Method A. → Change methods B and C.
Method A is changed. When method A is changed, 90% of the time method D is changed. → Change method D.
History helps!
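
The 90% figure on this slide is a co-change confidence: of the commits that changed method A, the share that also changed method D. A minimal sketch of that calculation follows, assuming the change history is already available as one set of changed method names per commit. The `history` data and the `co_change_confidence` helper are hypothetical and only illustrate the idea; they are not taken from the slides.

```python
from typing import Iterable, Set


def co_change_confidence(history: Iterable[Set[str]], changed: str, other: str) -> float:
    """Share of the commits touching `changed` that also touch `other`."""
    commits_with_changed = [commit for commit in history if changed in commit]
    if not commits_with_changed:
        return 0.0
    together = sum(1 for commit in commits_with_changed if other in commit)
    return together / len(commits_with_changed)


# Hypothetical toy history: each set lists the methods changed in one commit.
history = [{"A", "D"}] * 9 + [{"A", "B"}] + [{"C"}] * 3

confidence = co_change_confidence(history, "A", "D")
print(f"When A changes, D also changes in {confidence:.0%} of the commits")
```

In this toy history A changes in ten commits and D changes alongside it in nine of them, which is the kind of evidence the slide uses to suggest changing method D.
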
Traditional pipeline for MSR studies
Software repositories → Data preparation (ETL): Extraction, Transformation, Loading → Data Warehouse → Data Analysis
Software repositories: source code history, bug database, mailing list, system log
The repositories continue to grow and the analyses use more complex algorithms.
MSR studies must scale
Existing solutions to scale
• powerful machines
• ad hoc distributed computing
• multi-threaded and multi-core
Example: D-CCFinder Clone Detector
40 days on 1 PC; 52 hours on an 80-machine cluster
Web analysis is similar to MSR studies
• Large-scale data
• Scan-centric
• Rapidly evolving
Web-scale platforms
We believe that the MSR field can benefit from web-scale platforms to overcome the limitations of current approaches.
In our previous research
Feasibility study using Hadoop to scale a software evolution study on Eclipse.
Hadoop is up to 3 times faster on a 4-machine cluster.
In this paper
1. Does MapReduce scale to other MSR studies and larger clusters?
2. What are the challenges and experiences of scaling MSR studies?
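An example of MapReduce
Counting the frequency of word lengths

Data: good, hello, fish, cat, school, night, happy, dog

Map output (key = word length, value = word):
Key | Value
4 | good
5 | hello
4 | fish
3 | cat
6 | school
5 | night
5 | happy
3 | dog

Map output grouped by key:
Key | Value
3 | dog
3 | cat
4 | fish
4 | good
5 | hello
5 | night
5 | happy
6 | school

Reduce output (key = word length, value = frequency):
Key | Value
3 | 2
4 | 2
5 | 3
6 | 1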
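
The word-length example on this slide can be sketched in a few lines of plain Python. This is an in-memory illustration of the Map, group-by-key, and Reduce steps only, not Hadoop code; the function names `map_phase`, `shuffle`, and `reduce_phase` are chosen here for illustration and do not come from the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(words: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Map: emit (word length, word) for every input word."""
    for word in words:
        yield len(word), word


def shuffle(pairs: Iterable[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Group the mapped values by key, as a framework does between the two phases."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[int, List[str]]) -> Dict[int, int]:
    """Reduce: count the words collected under each length, in key order."""
    return {key: len(values) for key, values in sorted(groups.items())}


words = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
frequencies = reduce_phase(shuffle(map_phase(words)))
print(frequencies)  # {3: 2, 4: 2, 5: 3, 6: 1}
```

In a real MapReduce job the map and reduce functions run on different machines and the framework performs the grouping; here all three steps run in one process so the data flow is easy to follow.
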
Three large-scale MSR studies
• Software evolution study
– J-REX: code-change information abstractor for Java from line level to program entity level
• Code clone detection
– CCFinder: code clone detection tool
• Log analysis
– JACK: log analysis tool for detecting system anomalies during load testing
Experimental environment
CPU type | #machines | Memory size | Operating system
Intel Quad Core Q6600 (2.40 GHz) | 18 | 3 GB | Ubuntu 8.04
8 Xeon (3.0 GHz) | 10 | 8 GB | CentOS 5.2
Input data
Data | Size | Data type | #Files
Eclipse | 10.4 GB | CVS repository | 189,156
Datatools | 227 MB | CVS repository | 10,629
FreeBSD | 5.1 GB | source code | 317,740
Log files No.1 | 9.9 GB | execution log | 54
Log files No.2 | 2.1 GB | execution log | 54
1. Does MapReduce scale to other MSR studies and larger clusters?
Software Evolution & Log analysis
[Figure: two bar charts of running time in minutes (axis 0 to 800), comparing 1 machine with SHARCNET (×10). Bar values: 580 vs. 98 and 755 vs. 80. Speedup labels: J-REX ×9, JACK ×6.]
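
The speedup labels on this slide can be checked against the bar values (580 and 755 minutes on 1 machine; 98 and 80 minutes on SHARCNET ×10). The sketch below assumes that "×10" means 10 machines and defines parallel efficiency as speedup divided by the number of machines; that definition is a common convention, not something stated on the slide.

```python
from typing import List, Tuple


def speedup(single_machine_minutes: float, cluster_minutes: float) -> float:
    """Ratio of the single-machine running time to the cluster running time."""
    return single_machine_minutes / cluster_minutes


def efficiency(speedup_factor: float, machines: int) -> float:
    """Fraction of the ideal linear speedup that was achieved."""
    return speedup_factor / machines


MACHINES = 10  # assumption: "SHARCNET (×10)" denotes a 10-machine cluster

# (minutes on 1 machine, minutes on the cluster), as read from the two charts
runs: List[Tuple[int, int]] = [(580, 98), (755, 80)]

results = [
    (single, cluster, speedup(single, cluster), efficiency(speedup(single, cluster), MACHINES))
    for single, cluster in runs
]
for single, cluster, s, e in results:
    print(f"{single} min -> {cluster} min: speedup {s:.1f}, efficiency {e:.0%}")
```

The two ratios come out at roughly 5.9 and 9.4, which is consistent with the ×6 and ×9 labels on the slide.
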
Code clone detection
Can MapReduce scale up CCFinder?
Yes! 58 hours on an 18-machine cluster.
2. What are the challenges and experiences of scaling MSR studies?
Challenge 1: Locality of MSR analysis
• Local analysis
• Semi-local analysis
• Global analysis
[Figure: Web and MSR studies placed along the local, semi-local, and global scale]
Challenge 2: Granularity of MSR analysis
• Fine-grained analysis
• Coarse-grained analysis
• Web community experience:
– #Map tasks: 10 ~ 100 × #machines
– #Reduce tasks: 0.95 or 1.75 × #CPU cores
• MSR experience:
– #Reduce tasks = #CPU cores (fine-grained analysis)
– #Reduce tasks = #input records (coarse-grained analysis)
[Figure: Web and MSR studies placed along the fine-grained to coarse-grained scale]
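
The rules of thumb on this slide can be written down as small helper functions. This is a sketch under stated assumptions: the function names are hypothetical, fractional results are rounded down (the slide does not say how to round), and the example uses 72 cores, taken as 18 quad-core machines from the experimental environment, with a made-up input record count.

```python
import math
from typing import Tuple


def web_map_task_range(machines: int) -> Tuple[int, int]:
    """Web community rule: 10 to 100 map tasks per machine."""
    return 10 * machines, 100 * machines


def web_reduce_tasks(cpu_cores: int, factor: float = 0.95) -> int:
    """Web community rule: 0.95 or 1.75 times the number of CPU cores."""
    if factor not in (0.95, 1.75):
        raise ValueError("factor must be 0.95 or 1.75")
    return math.floor(factor * cpu_cores)  # rounding down is an assumption


def msr_reduce_tasks(granularity: str, cpu_cores: int, input_records: int) -> int:
    """MSR rule: one reduce task per core (fine) or per input record (coarse)."""
    if granularity == "fine":
        return cpu_cores
    if granularity == "coarse":
        return input_records
    raise ValueError("granularity must be 'fine' or 'coarse'")


machines, cpu_cores = 18, 18 * 4  # 18 quad-core machines
input_records = 54                # hypothetical record count for illustration

print(web_map_task_range(machines))
print(web_reduce_tasks(cpu_cores), web_reduce_tasks(cpu_cores, 1.75))
print(msr_reduce_tasks("fine", cpu_cores, input_records))
print(msr_reduce_tasks("coarse", cpu_cores, input_records))
```

Keeping the two sets of rules in separate functions makes the slide's point visible: the Web community sizes reduce tasks from the hardware alone, while the MSR experience also depends on the granularity of the analysis.
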
Challenges of migrating MSR studies to MapReduce
1. Locality of MSR analysis
2. Granularity of MSR analysis
3. Locating a suitable cluster
4. Managing data during analysis
5. Recovering from errors
Questions?

Ase2010 shang

  • 1. An Experience Report on Scaling Tools for MSR Studies Using MapReduce. Weiyi Shang, Bram Adams, Ahmed E. Hassan. Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen’s University.
  • 2. Mining Software Repositories: Propagating code changes. Method A is changed; Method A calls Method B and Method C calls Method A: change methods B and C. Method A is changed; when method A is changed, 90% of the time method D is changed: change method D. History helps!
  • 3. Traditional pipeline for MSR studies: Software repositories (source code history, bug database, mailing list, system log) → Data preparation (ETL: Extraction, Transformation, Loading) → Data Warehouse → Data Analysis. Continues to grow; more complex algorithms: MSR studies must scale.
  • 4. Existing solutions to scale: powerful machines; ad hoc distributed computing; multi-threaded and multi-core.
  • 5. Example: D-CCFinder Clone Detector. 40 days on 1 PC; 52 hours on an 80-machine cluster.
  • 6. Web Analysis is similar to MSR studies: large-scale data, scan-centric, rapidly evolving.
  • 7. Web-scale platforms. We believe that the MSR field can benefit from web-scale platforms to overcome the limitations of current approaches.
  • 8. In our previous research: feasibility study using Hadoop to scale a software evolution study on Eclipse. Hadoop is up to 3 times faster on a 4-machine cluster.
  • 9. In this paper: (1) Does MapReduce scale to other MSR studies and larger clusters? (2) What are the challenges and experiences of scaling MSR studies?
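  • 10. An example of MapReduce: counting the frequency of word lengths. Data: good, hello, fish, cat, school, night, happy, dog. Map (key = word length, value = word): (4, good), (5, hello), (4, fish), (3, cat), (6, school), (5, night), (5, happy), (3, dog). Sorted by key: (3, dog), (3, cat), (4, fish), (4, good), (5, hello), (5, night), (5, happy), (6, school). Reduce (key = word length, value = frequency): (3, 2), (4, 2), (5, 3), (6, 1).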
  • 11. Three large-scale MSR studies. Software evolution study: J-REX, a code-change information abstractor for Java, from line level to program entity level. Code clone detection: CC-Finder, a code clone detection tool. Log analysis: JACK, a log analysis tool for detecting system anomalies during load testing.
  • 12. Experimental environment (CPU type | #machines | memory size | operating system): Intel Quad Core Q6600 (2.40 GHz) | 18 | 3 GB | Ubuntu 8.04; 8 Xeon (3.0 GHz) | 10 | 8 GB | CentOS 5.2.
  • 13. Input data (data | size | data type | #files): Eclipse | 10.4 GB | CVS repository | 189,156; Datatools | 227 MB | CVS repository | 10,629; FreeBSD | 5.1 GB | source code | 317,740; Log files No.1 | 9.9 GB | execution log | 54; Log files No.2 | 2.1 GB | execution log | 54.
  • 14. Question 1: Does MapReduce scale to other MSR studies and larger clusters?
  • 15. Software Evolution & Log analysis. [Figure: two bar charts, J-REX and JACK, showing running time in minutes (axis 0 to 800) on SHARCNET (×10) vs. 1 machine. Bar values: 98 and 580 in one chart, 80 and 755 in the other; speedup annotations ×9 and ×6.]
  • 16. Code clone detection. Can MapReduce scale up CCFinder? Yes! 58 hours on an 18-machine cluster.
  • 17. Question 2: What are the challenges and experiences of scaling MSR studies?
  • 18. Challenge 1: Locality of MSR analysis. Local analysis, semi-local analysis, global analysis. [Diagram labels: Web, MSR, MSR, MSR.]
  • 19. Challenge 2: Granularity of MSR analysis (fine-grained analysis vs. coarse-grained analysis). Web community experience: #Map = 10 ~ 100 × #machines; #Reduce = 0.95 or 1.75 × #CPU cores. MSR experience: #Reduce tasks = #CPU cores (fine-grained analysis); #Reduce tasks = #input records (coarse-grained analysis). [Diagram labels: Web, MSR, MSR.]
  • 20. Challenges of migrating MSR studies to MapReduce: (1) Locality of MSR analysis; (2) Granularity of MSR analysis; (3) Locating a suitable cluster; (4) Managing data during analysis; (5) Recovering from errors.
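
The word-length example on slide 10 can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle and reduce steps, not Hadoop code and not the authors' implementation; the function names `map_word`, `reduce_group` and `word_length_frequency` are hypothetical.

```python
from collections import defaultdict


def map_word(word):
    """Map step: emit (key, value) = (word length, word)."""
    return (len(word), word)


def reduce_group(length, words):
    """Reduce step: count how many words share one length."""
    return (length, len(words))


def word_length_frequency(words):
    """Run map, then group by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for key, value in map(map_word, words):
        groups[key].append(value)
    # Sorting by key mirrors the sorted key column shown on the slide.
    return dict(reduce_group(k, v) for k, v in sorted(groups.items()))


data = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
print(word_length_frequency(data))  # {3: 2, 4: 2, 5: 3, 6: 1}
```

In a real MapReduce framework the grouping step is done by the platform, and the map and reduce calls run in parallel on different machines; only the two small functions are written by the user.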

Editor's Notes

  1. Using software repositories, such as the software's history, we can support software engineering tasks.
  2. animation
  3. Let’s give an example of this (ad hoc)
  4. In ICSE 2002. Need FreeBSD icon here. It’s not easy to scale something up: it’s expensive and requires a lot of tricks. We actually found existing techniques from a different field.
  5. From web analysis, e.g. Google, Yahoo. Put Google, Yahoo and Facebook logos here. Scan => read from the beginning, not random reads.
  6. Identify challenges, distill experiences.
  7. Add a column with only the key.
  8. Not get the line stuff
  9. The inputs vary in size: some of the maps have more input data, some have less work. Run multiple times.
  10. Think about this (baseline)
  11. Each file can be analyzed in isolation, without context or other data: ultimately parallelizable. Put semi-local last. CCFinder is natively global; we do some transformation to make it semi-local. Red circles on idle ones.
  12. Summary of results for Q1??? Distill experiences, to analyze, to parallelize.
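
The task-count rules of thumb on slide 19 amount to simple arithmetic, sketched below. This is a minimal illustration under stated assumptions: the function names and the `"fine"` / `"coarse"` labels are hypothetical, and the numbers in the usage lines are example inputs rather than a configuration reported in the slides.

```python
def map_task_range(machines):
    """Web community experience: #Map is 10 to 100 times the number of machines."""
    return (10 * machines, 100 * machines)


def web_reduce_tasks(cpu_cores, factor=0.95):
    """Web community experience: #Reduce is 0.95 or 1.75 times the number of CPU cores."""
    if factor not in (0.95, 1.75):
        raise ValueError("factor must be 0.95 or 1.75")
    return round(cpu_cores * factor)


def msr_reduce_tasks(granularity, cpu_cores, input_records):
    """MSR experience: one reduce task per CPU core for fine-grained analysis,
    one reduce task per input record for coarse-grained analysis."""
    if granularity == "fine":
        return cpu_cores
    if granularity == "coarse":
        return input_records
    raise ValueError("granularity must be 'fine' or 'coarse'")


# Example inputs only (hypothetical cluster of 18 machines with 40 cores in total).
print(map_task_range(18))
print(web_reduce_tasks(40), web_reduce_tasks(40, factor=1.75))
print(msr_reduce_tasks("coarse", cpu_cores=40, input_records=54))
```

The contrast is that the web heuristics depend only on cluster size, while the MSR experience also depends on how much work each input record carries.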