1. An Experience Report on Scaling Tools for MSR Studies Using MapReduce
Weiyi Shang, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen's University
2. Mining Software Repositories: Propagating code changes
Method A is changed. Method A calls Method B, and Method C calls Method A. Change methods B and C. Not enough.
Method A is changed. When method A is changed, 90% of the time method D is changed. Change method D. History helps!
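The "90% of the time" figure on this slide is a co-change confidence mined from history. A minimal sketch of how such a number could be computed follows; the commit data and the function name `co_change_confidence` are hypothetical illustrations, not the authors' tooling.

```python
from collections import Counter
from itertools import permutations


def co_change_confidence(commits):
    """Map (a, b) to the fraction of commits changing a that also change b."""
    changed = Counter()   # number of commits that touch each method
    together = Counter()  # number of commits that touch both a and b
    for commit in commits:
        methods = set(commit)
        changed.update(methods)
        together.update(permutations(methods, 2))
    return {(a, b): n / changed[a] for (a, b), n in together.items()}


# Hypothetical history: 10 commits change A, and 9 of them also change D.
history = [{"A", "D"}] * 9 + [{"A", "B"}] + [{"C"}]
confidence = co_change_confidence(history)
print(confidence[("A", "D")])  # 0.9
```

With this made-up history, a change to A suggests also changing D, even though no call relationship between A and D is recorded.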
3. Traditional pipeline for MSR studies
Pipeline: Software repositories -> Data preparation (ETL: Extraction, Transformation, Loading) -> Data warehouse -> Data analysis
Software repositories: source code history, bug database, mailing list, system log
• Continues to grow
• More complex algorithms
MSR studies must scale
4. Existing solutions to scale
• Powerful machines
• Ad hoc distributed computing
• Multi-threaded and multi-core
EXPENSIVE
LARGE PROGRAMMING EFFORT
NOT RE-USABLE
6. Web Analysis is similar to MSR studies
• Large-scale data
• Scan-centric
• Rapidly evolving
7. Web-scale platforms
We believe that the MSR field can benefit from web-scale platforms to overcome the limitations of current approaches.
8. In our previous research
Feasibility study using Hadoop to scale a software evolution study on Eclipse.
Hadoop is up to 3 times faster on a 4-machine cluster.
9. In this paper
1. Does MapReduce scale to other MSR studies and larger clusters?
2. What are the challenges and experiences of scaling MSR studies?
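10. An example of MapReduce: Counting the frequency of word lengths
Data: good, hello, fish, cat, school, night, happy, dog
Map (key = word length, value = word): 4 good, 5 hello, 4 fish, 3 cat, 6 school, 5 night, 5 happy, 3 dog
Map output grouped by key: 3 dog, 3 cat, 4 fish, 4 good, 5 hello, 5 night, 5 happy, 6 school
Reduce (key = word length, value = number of words): 3: 2, 4: 2, 5: 3, 6: 1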
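The word-length example can be traced with a small single-process simulation. This is only a sketch of the map, group, and reduce steps in plain Python; the function names are illustrative, and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(words):
    """Emit one (key, value) pair per word: key = word length, value = word."""
    for word in words:
        yield len(word), word


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Count the number of words collected for each word length."""
    return {key: len(values) for key, values in sorted(groups.items())}


words = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
counts = reduce_phase(shuffle(map_phase(words)))
print(counts)  # {3: 2, 4: 2, 5: 3, 6: 1}
```

On a real cluster the map calls run in parallel over input splits, and each reduce task receives all values for a subset of the keys.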
11. Three large-scale MSR studies
• Software evolution study
– J-REX: code-change information abstractor for Java from line level to program entity level
• Code clone detection
– CC-Finder: code clone detection tool
• Log analysis
– JACK: log analysis tool for detecting system anomalies during load testing
12. Experimental environment
CPU type | #machines | Memory size | Operating system
Intel Quad Core Q6600 (2.40 GHz) | 18 | 3GB | Ubuntu 8.04
8 Xeon (3.0 GHz) | 10 | 8GB | CentOS 5.2
16. Code clone detection
Can MapReduce scale up CCFinder? Yes! 58 hours on an 18-machine cluster.
17. 2. What are the challenges and experiences of scaling MSR studies?
18. Challenge 1: Locality of MSR analysis
[Figure: local analysis, semi-local analysis, and global analysis, labeled Web, MSR, MSR, MSR]
19. Challenge 2: Granularity of MSR analysis
[Figure: fine-grained analysis and coarse-grained analysis, labeled Web, MSR, MSR]
• Web community experience:
– #Map: 10 to 100 × #machines
– #Reduce: 0.95 or 1.75 × #CPU cores
• MSR experience:
– #Reduce tasks = #CPU cores (fine-grained analysis)
– #Reduce tasks = #input records (coarse-grained analysis)
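The rules of thumb above can be written down as simple arithmetic. The helper names below are hypothetical, and the cluster size in the usage lines (18 machines with 4 cores each, 500 input records) is an assumed illustration, not a configuration reported on this slide.

```python
def web_map_tasks(machines):
    """Web community rule: between 10 and 100 map tasks per machine."""
    return 10 * machines, 100 * machines


def web_reduce_tasks(cpu_cores):
    """Web community rule: 0.95 or 1.75 reduce tasks per CPU core."""
    return round(0.95 * cpu_cores), round(1.75 * cpu_cores)


def msr_reduce_tasks(cpu_cores, input_records, fine_grained):
    """MSR rule: one reduce task per core (fine-grained analysis)
    or one per input record (coarse-grained analysis)."""
    return cpu_cores if fine_grained else input_records


# Assumed cluster: 18 machines with 4 cores each, 500 input records.
machines, cores = 18, 18 * 4
print(web_map_tasks(machines))                                # (180, 1800)
print(web_reduce_tasks(cores))                                # (68, 126)
print(msr_reduce_tasks(cores, 500, fine_grained=True))        # 72
print(msr_reduce_tasks(cores, 500, fine_grained=False))       # 500
```

The coarse-grained rule reflects that each input record is expensive to process, so giving every record its own reduce task keeps the load balanced.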
20. Challenges of migrating MSR studies to MapReduce
1. Locality of MSR analysis
2. Granularity of MSR analysis
3. Locating a suitable cluster
4. Managing data during analysis
5. Recovering from errors