Ase2010 shang
1. An Experience Report on Scaling Tools for MSR Studies Using MapReduce
Weiyi Shang, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen’s University
2. Mining Software Repositories: Propagating code changes
Method A is changed. Method A calls Method B; Method C calls Method A. Change methods B and C.
Method A is changed. When method A is changed, 90% of the time method D is changed. Change method D.
History helps!
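The "90% of the time" observation above can be read as a co-change confidence mined from history. The following is a minimal sketch of that idea, not the tooling used in the study; the function name `co_change_confidence` and the toy commit history are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def co_change_confidence(commits):
    """Return {(a, b): fraction of commits changing a that also change b}."""
    changed = Counter()    # number of commits touching each method
    together = Counter()   # number of commits touching each ordered pair
    for methods in commits:
        unique = set(methods)
        changed.update(unique)
        together.update(permutations(unique, 2))
    return {(a, b): n / changed[a] for (a, b), n in together.items()}


# Hypothetical history: A changes in 10 commits, 9 of which also change D.
history = [{"A", "D"}] * 9 + [{"A"}]
confidence = co_change_confidence(history)
print(confidence[("A", "D")])  # 0.9
```

With such a table, a change to method A would suggest also inspecting method D, even though no call relationship links the two.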
3. Traditional pipeline for MSR studies
Software repositories (continue to grow):
• Source code history
• Bug database
• Mailing list
• System log
Data preparation (ETL): Extraction, Transformation, Loading
Data Warehouse
Data Analysis (more complex algorithms)
MSR studies must scale
4. Existing solutions to scale
powerful machines
ad hoc distributed computing
multi-threaded and multi-core
6. Web Analysis is similar to MSR studies
• Large-scale data
• Scan-centric
• Rapidly evolving
7. Web-scale platforms
We believe that the MSR field can benefit
from web-scale platforms to overcome
the limitations of current approaches.
8. In our previous research
Hadoop is up to 3 times faster
on a 4-machine cluster
Feasibility study using Hadoop to scale a
software evolution study on Eclipse.
9. In this paper
1. Does MapReduce scale to other MSR studies and larger clusters?
2. What are the challenges and experiences of scaling MSR studies?
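10. An example of MapReduce
Counting the frequency of word lengths
Data: good, hello, fish, cat, school, night, happy, dog
Map output (Key = word length, Value = word):
4 good
5 hello
4 fish
3 cat
6 school
5 night
5 happy
3 dog
Sorted by key (Key, Value):
3 dog
3 cat
4 fish
4 good
5 hello
5 night
5 happy
6 school
Reduce output (Key = word length, Value = number of words):
3 2
4 2
5 3
6 1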
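The word-length example above can be sketched in plain Python to make the map, sort/group, and reduce steps explicit. This is a single-process illustration under the assumption that an in-memory dictionary stands in for the shuffle phase; it does not use Hadoop, and the function names are chosen for this sketch only.

```python
from collections import defaultdict


def map_word(word):
    """Map: emit (key, value) = (word length, word)."""
    yield len(word), word


def shuffle(pairs):
    """Group values by key and order by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_group(key, values):
    """Reduce: count how many words share this length."""
    return key, len(values)


words = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
mapped = [pair for word in words for pair in map_word(word)]
grouped = shuffle(mapped)
counts = dict(reduce_group(key, values) for key, values in grouped.items())
print(counts)  # {3: 2, 4: 2, 5: 3, 6: 1}
```

Because each call to `map_word` depends only on its own input word, and each call to `reduce_group` only on one key's values, both phases can be spread across machines without coordination.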
11. Three large-scale MSR studies
• Software evolution study
– J-REX: code-change information abstractor for Java from line level to program entity level
• Code clone detection
– CC-Finder: code clone detection tool
• Log analysis
– JACK: log analysis tool for detecting system anomalies during load testing
12. Experimental environment
CPU type | #machines | Memory size | Operating system
Intel Quad Core Q6600 (2.40 GHz) | 18 | 3 GB | Ubuntu 8.04
8 Xeon (3.0 GHz) | 10 | 8 GB | CentOS 5.2
20. Challenges of migrating MSR studies to MapReduce
1. Locality of MSR analysis
2. Granularity of MSR analysis
3. Locating a suitable cluster
4. Managing data during analysis
5. Recovering from errors
Using software repositories such as source code history, we can support software engineering tasks
animation
Let’s give an example of this (ad hoc)
In ICSE 2002, Need FreeBSD icon here. It's not easy to scale something up. It's expensive and requires a lot of tricks. We actually found existing techniques from a different field.
From web analysis, e.g. Google and Yahoo.
Put Google, Yahoo and Facebook logos here
Scan => read sequentially from the beginning, not random reads
Identify challenges, distill experiences,
Add a column with only key,
Not get the line stuff
Inputs vary in size: some of the maps have more input data, some have less work. Run multiple times.
Think about this (baseline)
Each file can be analyzed in isolation, without context or other data: ultimately parallelizable. Put semi-local last.
CC-Finder is natively global; we do some transformation to make it semi-local. Red circles on idle ones
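A local analysis in the sense of the note above can be sketched as a function that sees exactly one file and nothing else, so the files can be processed in any order and in parallel. The file names, file contents, and the non-blank line count used as the "analysis" are hypothetical stand-ins; a thread pool replaces a real cluster here.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_file(item):
    """Local analysis: uses only this file's content, no other data."""
    name, text = item
    # Hypothetical per-file metric: number of non-blank lines.
    return name, sum(1 for line in text.splitlines() if line.strip())


# Hypothetical inputs standing in for files from a repository.
files = {
    "A.java": "class A {\n\n  void run() {}\n}\n",
    "B.java": "class B {}\n",
}

with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(analyze_file, files.items()))

print(results)
```

A global analysis such as clone detection cannot be written this way directly, since each result depends on pairs of files; that is why some transformation is needed before it fits the same pattern.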
Summary of result q1???
Distill experiences, to analyze, to parallelize