Ase2010 shang

SAIL_QU

An Experience Report on Scaling Tools for MSR Studies Using MapReduce
Weiyi Shang, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen’s University

Mining Software Repositories: Propagating code changes
• Method A is changed. Method A calls Method B and Method C calls Method A: change methods B and C.
• Method A is changed. When method A is changed, 90% of the time method D is changed: change method D.
History helps!
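
The "90% of the time" figure on this slide is the kind of number a co-change analysis over the version history produces. A minimal sketch, assuming the history is available as a list of change sets (each a set of method names changed together); the function name `co_change_confidence` and the toy history are illustrative and not taken from the talk:

```python
def co_change_confidence(history, trigger, candidate):
    """Fraction of change sets touching `trigger` that also touch `candidate`."""
    with_trigger = [change_set for change_set in history if trigger in change_set]
    if not with_trigger:
        return 0.0  # no evidence in the history for this trigger
    both = sum(1 for change_set in with_trigger if candidate in change_set)
    return both / len(with_trigger)


# Hypothetical history: 10 change sets touch A, and 9 of them also touch D.
history = [{"A", "D"}] * 9 + [{"A", "B"}] + [{"C"}]
confidence = co_change_confidence(history, "A", "D")
print(f"A -> D: {confidence:.0%}")  # A -> D: 90%
```

A static call graph would suggest B and C; only the history reveals the link between A and D.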

Traditional pipeline for MSR studies
• Software repositories: source code history, bug database, mailing list, system log
• Data preparation (ETL): extraction, transformation, loading
• Data warehouse
• Data analysis
The data continues to grow and the algorithms become more complex: MSR studies must scale.

Existing solutions to scale
• Powerful machines
• Ad hoc distributed computing
• Multi-threaded and multi-core

Example: D-CCFinder Clone Detector
• 40 days on 1 PC
• 52 hours on an 80-machine cluster

Web analysis is similar to MSR studies
• Large-scale data
• Scan-centric
• Rapidly evolving

Web-scale platforms
We believe that the MSR field can benefit from web-scale platforms to overcome the limitations of current approaches.

In our previous research
• Feasibility study using Hadoop to scale a software evolution study on Eclipse.
• Hadoop is up to 3 times faster on a 4-machine cluster.

In this paper
1. Does MapReduce scale to other MSR studies and larger clusters?
2. What are the challenges and experiences of scaling MSR studies?

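An example of MapReduce: counting the frequency of word lengths

Data: good, hello, fish, cat, school, night, happy, dog

Map output (key = word length, value = word):
Key   Value
4     good
5     hello
4     fish
3     cat
6     school
5     night
5     happy
3     dog

Sorted by key (input to Reduce):
Key   Value
3     dog
3     cat
4     fish
4     good
5     hello
5     night
5     happy
6     school

Reduce output (key = word length, value = number of words):
Key   Value
3     2
4     2
5     3
6     1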
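
The word-length example can be sketched in a few lines of plain Python. This is a single-process illustration of the Map, sort/group, and Reduce steps, not Hadoop code; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are chosen here for illustration only:

```python
from collections import defaultdict


def map_phase(words):
    """Map: emit one (key, value) pair per word, keyed by word length."""
    return [(len(word), word) for word in words]


def shuffle_phase(pairs):
    """Group values by key and sort by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: collapse each group to the number of words of that length."""
    return {key: len(values) for key, values in groups.items()}


words = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
counts = reduce_phase(shuffle_phase(map_phase(words)))
print(counts)  # {3: 2, 4: 2, 5: 3, 6: 1}
```

In a real MapReduce job the Map calls run in parallel on different input splits and each Reduce call receives all values for one key; the grouping step in between is done by the framework rather than by user code.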

Three large-scale MSR studies
• Software evolution study
– J-REX: code-change information abstractor for Java, from line level to program entity level
• Code clone detection
– CCFinder: code clone detection tool
• Log analysis
– JACK: log analysis tool for detecting system anomalies during load testing

Experimental environment
CPU type                            #machines   Memory size   Operating system
Intel Quad Core Q6600 (2.40 GHz)    18          3 GB          Ubuntu 8.04
8 Xeon (3.0 GHz)                    10          8 GB          CentOS 5.2

Input data
Data              Size      Data type        #Files
Eclipse           10.4 GB   CVS repository   189,156
Datatools         227 MB    CVS repository   10,629
FreeBSD           5.1 GB    source code      317,740
Log files No.1    9.9 GB    execution log    54
Log files No.2    2.1 GB    execution log    54

1. Does MapReduce scale to other MSR studies and larger clusters?

Software Evolution & Log analysis
Running time in minutes, 1 machine vs. SHARCNET (×10):
• J-REX: 755 min on 1 machine, 80 min on SHARCNET (×9)
• JACK: 580 min on 1 machine, 98 min on SHARCNET (×6)
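
The ×9 and ×6 labels on this slide can be checked with a little arithmetic. A short sketch, under two assumptions that are not stated explicitly on the slide: the 755/80 pair belongs to J-REX and the 580/98 pair to JACK (inferred from the ×9 and ×6 labels), and "×10" means ten machines:

```python
def speedup(single_machine_min, cluster_min):
    """Ratio of single-machine running time to cluster running time."""
    return single_machine_min / cluster_min


# (minutes on 1 machine, minutes on the cluster); pairing is an assumption.
runs = {"J-REX": (755, 80), "JACK": (580, 98)}
machines = 10  # assumed meaning of "x10"

for tool, (single, cluster) in runs.items():
    s = speedup(single, cluster)
    # Efficiency: achieved speedup relative to the ideal (one per machine).
    print(f"{tool}: speedup {s:.1f}x, efficiency {s / machines:.0%}")
```

Under these assumptions the first pair gives a speedup of about 9.4 and the second about 5.9, which round to the ×9 and ×6 shown on the slide.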

Code clone detection
Can MapReduce scale up CCFinder?
Yes! 58 hours on an 18-machine cluster.

2. What are the challenges and experiences of scaling MSR studies?

Challenge 1: Locality of MSR analysis
• Local analysis
• Semi-local analysis
• Global analysis
[Figure labels: Web; MSR, MSR, MSR]

Challenge 2: Granularity of MSR analysis
Fine-grained analysis vs. coarse-grained analysis
[Figure labels: Web; MSR, MSR]
• Web community experience:
– #Map: 10 ~ 100 × #machines
– #Reduce: 0.95 or 1.75 × #CPU cores
• MSR experience:
– #Reduce tasks = #CPU cores (fine-grained analysis)
– #Reduce tasks = #input records (coarse-grained analysis)
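
The task-count rules of thumb on this slide translate directly into arithmetic. A minimal sketch, assuming the counts are rounded down to whole tasks (the slide does not say how to round); the function names and the 18-machine, 72-core, 54-record example are hypothetical illustrations:

```python
import math


def web_rule_of_thumb(machines, cpu_cores):
    """Task-count guidance quoted on the slide for web-style workloads."""
    return {
        # 10 to 100 map tasks per machine, as a (low, high) range.
        "map_tasks": (10 * machines, 100 * machines),
        # 0.95 or 1.75 reduce tasks per CPU core, rounded down (assumption).
        "reduce_tasks": (math.floor(0.95 * cpu_cores), math.floor(1.75 * cpu_cores)),
    }


def msr_reduce_tasks(cpu_cores, input_records, fine_grained):
    """Reduce-task count reported on the slide as the MSR experience."""
    return cpu_cores if fine_grained else input_records


# Hypothetical cluster: 18 quad-core machines, i.e. 72 cores; 54 input records.
print(web_rule_of_thumb(machines=18, cpu_cores=72))
print(msr_reduce_tasks(cpu_cores=72, input_records=54, fine_grained=True))
print(msr_reduce_tasks(cpu_cores=72, input_records=54, fine_grained=False))
```

The contrast is that the web guidance scales with the hardware only, while the coarse-grained MSR setting ties the number of reduce tasks to the input itself.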

Challenges of migrating MSR studies to MapReduce
1. Locality of MSR analysis
2. Granularity of MSR analysis
3. Locating a suitable cluster
4. Managing data during analysis
5. Recovering from errors

Editor's Notes

Using software repositories, such as the software history, we can support software engineering tasks.
animation
Let’s give an example of this (ad hoc)
In ICSE 2002. Need FreeBSD icon here. It's not easy to scale something up: it's expensive and requires a lot of tricks. We actually found existing techniques in a different field.
From web analysis, e.g., Google, Yahoo. Put Google, Yahoo and Facebook logos here. Scan => read from the beginning, not random reads.
Identify challenges, distill experiences,
Add a column with only key,
Not get the line stuff
The maps vary in size: some have more input data, some have less work. Run multiple times.
Think about this (baseline)
Each file can be analyzed in isolation, without context or other data: ultimately parallelizable. Put semi-local last. CCFinder is natively global; we do some transformation to make it semi-local. Red circles on idle ones.
Summary of results for Q1??? Distill experiences: to analyze, to parallelize.
