ASE2010


An Experience Report on Scaling Tools for MSR Studies Using MapReduce

Weiyi Shang, Bram Adams, Ahmed E. Hassan

Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen's University

Mining Software Repositories: Propagating code changes

•  Method A is changed. Method A calls Method B, and Method C calls Method A. → Change methods B and C. Not enough.
•  Method A is changed. When method A is changed, 90% of the time method D is changed. → Change method D. History helps!
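The "90% of the time" rule on this slide is a co-change confidence mined from history. A minimal sketch of how such a figure could be computed is shown below; the commit data and the function name `co_change_confidence` are hypothetical and are not taken from the authors' tools.

```python
from collections import Counter
from itertools import permutations


def co_change_confidence(commits):
    """Return {(a, b): fraction of commits changing a that also change b}.

    Each commit is a set of method names that were changed together.
    """
    changed = Counter()   # number of commits touching each method
    together = Counter()  # number of commits touching both a and b
    for commit in commits:
        changed.update(commit)
        together.update(permutations(sorted(commit), 2))
    return {(a, b): n / changed[a] for (a, b), n in together.items()}


# Hypothetical history: A changes in 10 commits, and D accompanies it in 9.
history = [{"A", "D"}] * 9 + [{"A", "B"}]
confidence = co_change_confidence(history)
print(confidence[("A", "D")])  # 0.9
```

With this made-up history, a change to A would suggest also changing D, even though no call relationship between A and D is recorded.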

Traditional pipeline for MSR studies

Software repositories → Data preparation (ETL: Extraction, Transformation, Loading) → Data Warehouse → Data Analysis

•  Software repositories: source code history, bug database, mailing list, system log
•  Continues to grow
•  More complex algorithms

MSR studies must scale

Existing solutions to scale

•  Powerful machines
•  Ad hoc distributed computing
•  Multi-threaded and multi-core

Drawbacks: EXPENSIVE, LARGE PROGRAMMING EFFORT, NOT RE-USABLE

Example: D-CCFinder Clone Detector

•  40 days on 1 PC
•  52 hours on an 80-machine cluster

Web Analysis is similar to MSR studies

•  Large-scale data
•  Scan-centric
•  Rapidly evolving

Web-scale platforms

We believe that the MSR field can benefit from web-scale platforms to overcome the limitations of current approaches.

In our previous research

Feasibility study using Hadoop to scale a software evolution study on Eclipse.

Hadoop is up to 3 times faster on a 4-machine cluster

In this paper

1. Does MapReduce scale to other MSR studies and larger clusters?
2. What are the challenges and experiences of scaling MSR studies?

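An example of MapReduce: counting the frequency of word lengths

Data: good, hello, fish, cat, school, night, happy, dog

Map assigns each word a key (its length): good 4, hello 5, fish 4, cat 3, school 6, night 5, happy 5, dog 3

Map output, sorted by key:

| Key | Value |
|---|---|
| 3 | dog |
| 3 | cat |
| 4 | fish |
| 4 | good |
| 5 | hello |
| 5 | night |
| 5 | happy |
| 6 | school |

Reduce output:

| Key | Value |
|---|---|
| 3 | 2 |
| 4 | 2 |
| 5 | 3 |
| 6 | 1 |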
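The word-length example can be sketched in plain Python to make the three steps explicit. This is an in-memory illustration only, not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(words):
    """Map: emit one (key, value) pair per word, keyed by word length."""
    return [(len(word), word) for word in words]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: count the values collected for each key."""
    return {key: len(values) for key, values in sorted(groups.items())}


words = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
frequencies = reduce_phase(shuffle(map_phase(words)))
print(frequencies)  # {3: 2, 4: 2, 5: 3, 6: 1}
```

In a real MapReduce job the map and reduce functions run on different machines and the grouping step is done by the framework; only the two user-supplied functions change from one analysis to the next.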

Three large-scale MSR studies

•  Software evolution study
– J-REX: code-change information abstractor for Java from line level to program entity level
•  Code clone detection
– CC-Finder: code clone detection tool
•  Log analysis
– JACK: log analysis tool for detecting system anomalies during load testing

Experimental environment

| CPU type | #machines | Memory size | Operating system |
|---|---|---|---|
| Intel Quad Core Q6600 (2.40 GHz) | 18 | 3 GB | Ubuntu 8.04 |
| 8 Xeon (3.0 GHz) | 10 | 8 GB | CentOS 5.2 |

Input data

| Data | Data size | Data type | #Files |
|---|---|---|---|
| Eclipse | 10.4 GB | CVS repository | 189,156 |
| Datatools | 227 MB | CVS repository | 10,629 |
| FreeBSD | 5.1 GB | source code | 317,740 |
| Log files No.1 | 9.9 GB | execution log | 54 |
| Log files No.2 | 2.1 GB | execution log | 54 |

1. Does MapReduce scale to other MSR studies and larger clusters?

Software Evolution & Log analysis

[Figure: two bar charts of running time in minutes for J-REX and JACK, comparing SHARCNET (×10) with 1 machine. Bar values: 98 vs. 580 min, and 80 vs. 755 min. Speed-up labels: ×9 and ×6.]
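The speed-up labels on the charts follow from the bar values by simple division. The sketch below reproduces that arithmetic; the efficiency column is an addition made here, and it assumes that "×10" denotes a 10-machine SHARCNET configuration.

```python
def speedup(baseline_minutes, cluster_minutes):
    """Speed-up = single-machine running time / cluster running time."""
    return baseline_minutes / cluster_minutes


# (1-machine minutes, SHARCNET minutes) pairs as read off the two charts.
runs = [(580, 98), (755, 80)]
MACHINES = 10  # assumption: "×10" means 10 machines

for baseline, cluster in runs:
    s = speedup(baseline, cluster)
    print(f"{baseline} -> {cluster} min: "
          f"speed-up {s:.1f}, efficiency {s / MACHINES:.0%}")
```

The two ratios round to 6 and 9, which matches the ×6 and ×9 labels on the slide.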

Code clone detection

Can MapReduce scale up CCFinder? Yes!

58 hours on an 18-machine cluster.

2. What are the challenges and experiences of scaling MSR studies?

Challenge 1: Locality of MSR analysis

•  Local analysis
•  Semi-local analysis
•  Global analysis

[Figure: the three kinds of analysis, annotated with the labels Web (once) and MSR (three times).]

Challenge 2: Granularity of MSR analysis

[Figure: fine-grained analysis and coarse-grained analysis, annotated with the labels Web, MSR, MSR.]

•  Web community experience:
– #Map tasks: 10 ~ 100 × #machines
– #Reduce tasks: 0.95 or 1.75 × #CPU cores
•  MSR experience:
– #Reduce tasks = #CPU cores (fine-grained analysis)
– #Reduce tasks = #input records (coarse-grained analysis)
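The task-count guidelines above can be written down as a small helper. This is a sketch of the arithmetic only; the function names are hypothetical, and the example figures (72 cores from 18 quad-core machines, 54 input records) are illustrative inputs rather than settings reported on the slide.

```python
def web_rule_of_thumb(machines, cpu_cores):
    """Task-count ranges quoted as the web community's experience."""
    return {
        # 10 to 100 map tasks per machine
        "map_tasks": (10 * machines, 100 * machines),
        # 0.95 or 1.75 reduce tasks per CPU core
        "reduce_tasks": (round(0.95 * cpu_cores), round(1.75 * cpu_cores)),
    }


def msr_reduce_tasks(cpu_cores, input_records, coarse_grained):
    """Reduce-task count following the MSR experience on the slide."""
    return input_records if coarse_grained else cpu_cores


# Illustrative cluster: 18 quad-core machines, i.e. 72 cores.
print(web_rule_of_thumb(machines=18, cpu_cores=72))
print(msr_reduce_tasks(cpu_cores=72, input_records=54, coarse_grained=False))
print(msr_reduce_tasks(cpu_cores=72, input_records=54, coarse_grained=True))
```

The contrast is that the web guideline depends only on cluster size, while the coarse-grained MSR guideline depends on the input: one reduce task per input record.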

Challenges of migrating MSR studies to MapReduce

1.  Locality of MSR analysis
2.  Granularity of MSR analysis
3.  Locating a suitable cluster
4.  Managing data during analysis
5.  Recovering from errors


