MSR 2009

MapReduce as a General Framework to Support Research in Mining Software Repositories (MSR)

Weiyi Shang, Zhen Ming Jiang, Bram Adams, Ahmed Hassan

Software Analysis and Intelligence Lab (SAIL)

School of Computing, Queen’s University

As an MSR researcher, have you ever been in such a situation?
• Analyzing gigabytes of data?
• Waiting hours for experimental results?
• Experiments fail with “out of memory” exceptions?

To overcome these problems, you could …
… buy more powerful machines
… spend weeks to make your tools more efficient

However!
• The data will keep on growing
• Spend time on research, not on speeding up experiments

Debian doubles in size approximately every two years

•  Idle computing power is available in every lab
•  We can bundle these computers together
•  A distributed framework can help us do so

General requirements for a distributed framework:
1.  Efficiency: speed up the process significantly
2.  Scalability: scale with data size and computing power
3.  Adaptability: require only minimal programming effort
4.  Flexibility: run in various environments

Google’s MapReduce is a programming model for distributed computation

•  An open-source MapReduce implementation is available
•  Well documented, with many examples available
•  Well supported by a large user base and newsgroups
•  Straightforward API

Example: counting the frequency of word lengths

Words: dog, cat, fish, good, hello, night, happy, school

# Words   Length
2         3
2         4
3         5
1         6
1. Deploy data into a distributed file system

[Figure: data, network, computing environment]

2. Read data as records

Data: dog, cat, fish, hello, good, night, happy, school

3. Generate a key for each record with Mappers

Value    Key
dog      3
cat      3
fish     4
hello    5
good     4
night    5
happy    5
school   6

[Figure: the records are split across four Mappers]

4. Group and sort records by keys

Value    Key
dog      3
cat      3
fish     4
good     4
hello    5
night    5
happy    5
school   6

5. Send records with the same key to one reducer

Reducer (key 3): dog, cat
Reducer (key 4): fish, good
Reducer (key 5): hello, night, happy
Reducer (key 6): school

6. Generate outputs by Reducers

Value   Key
2       3
2       4
3       5
1       6
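
The six steps above can be mimicked on a single machine. The following is a minimal Python sketch, not the API of any MapReduce framework: the function names (`map_word`, `shuffle`, `reduce_count`) are assumptions made for illustration, and a real framework would run the mappers and reducers on different machines.

```python
from collections import defaultdict

def map_word(word):
    """Mapper (step 3): emit (key, value) = (word length, word)."""
    return (len(word), word)

def shuffle(pairs):
    """Steps 4 and 5: group values by key and order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_count(key, values):
    """Reducer (step 6): count how many words share this length."""
    return (key, len(values))

def word_length_frequency(words):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = [map_word(word) for word in words]
    return [reduce_count(key, values) for key, values in shuffle(pairs)]

words = ["dog", "cat", "fish", "hello", "good", "night", "happy", "school"]
print(word_length_frequency(words))  # [(3, 2), (4, 2), (5, 3), (6, 1)]
```

Each result pair is (length, number of words), matching the output table of step 6.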

A typical MSR analysis
• Extract all versions of all files
• Analyze each version
• Compare versions to each other

We apply MapReduce to a typical MSR tool

Applying MapReduce to typical MSR tools

Data read from the repository: a0.java, a1.java, b0.java, a2.java, b1.java

Mappers assign each version its file as key:

Value     Key
a0.java   a.java
a1.java   a.java
b0.java   b.java
a2.java   a.java
b1.java   b.java

Records are grouped and sorted by key:

Value     Key
a0.java   a.java
a1.java   a.java
a2.java   a.java
b0.java   b.java
b1.java   b.java

Records with the same key go to one Reducer:

Reducer (key a.java): a0.java, a1.java, a2.java
Reducer (key b.java): b0.java, b1.java

Reducers generate one output per key:

Value      Key
a.output   a.java
b.output   b.java
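
The same pattern can be sketched in Python for the file-version example. This is an illustration under stated assumptions, not the tool's code: the `<name><revision>.java` naming scheme is taken from the slides, the helper names are hypothetical, and the list of adjacent version pairs stands in for the real per-file output (`a.output`, `b.output`).

```python
import re
from collections import defaultdict

def split_version(version_name):
    """Split e.g. 'a1.java' into its file name 'a.java' and revision 1."""
    match = re.fullmatch(r"([A-Za-z_]+)(\d+)(\.\w+)", version_name)
    if match is None:
        raise ValueError(f"unexpected version name: {version_name}")
    name, revision, extension = match.groups()
    return name + extension, int(revision)

def map_version(version_name):
    """Mapper: key each version by the file it belongs to."""
    file_name, _revision = split_version(version_name)
    return (file_name, version_name)

def reduce_file(file_name, versions):
    """Reducer: order one file's versions and pair up adjacent ones.

    A real tool would compare each pair here; the sketch only lists them.
    """
    ordered = sorted(versions, key=lambda v: split_version(v)[1])
    return (file_name, list(zip(ordered, ordered[1:])))

def analyze(version_names):
    """Group versions by key, then reduce each group independently."""
    groups = defaultdict(list)
    for key, value in map(map_version, version_names):
        groups[key].append(value)
    return dict(reduce_file(key, values) for key, values in sorted(groups.items()))

data = ["a0.java", "a1.java", "b0.java", "a2.java", "b1.java"]
for file_name, pairs in analyze(data).items():
    print(file_name, pairs)
```

Because each Reducer sees all versions of exactly one file, the comparison of versions needs no communication between Reducers.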

Case study: J-REX
• Extraction phase: extract snapshots from the CVS repository (snapshot extractor)
• Parsing phase: use Eclipse JDT to parse source code to XML files
• Analysis phase: compare each XML file to generate evolution information (evolution analyzer)

[Figure: CVS → snapshot extractor → Snapshot 1 … Snapshot n → JDT → XML output 1 … XML output n → Evolution Analyzer → Evolutionary Change Data]

Case study: data

Repository   Size    #Source Code Files   Length of History   #Revisions
Datatools    394MB   10,552               2 years             2,398
BIRT         810MB   13,002               4 years             19,583
Eclipse      4.2GB   56,851               8 years             82,682

Case study: experimental setup

Machine   CPU type                           #CPU   Memory size   Disk type
Desktop   Intel Quad Core Q6600 @ 2.40 GHz   4      2GB           SATA
Server    Intel Quad Core Q6600 @ 2.40 GHz   4      8GB           RAID5
Server    Intel Core i7 920 @ 2.67 GHz       8      6GB           SSD

Efficiency: significant reduction of running time by using MapReduce

[Figure: running time (hours) with MapReduce on Desktop and Server (SSD); annotated reductions: 70% less, 64% less, 59% less]
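
A reduction in running time can be restated as a speedup factor: a run that takes r% less time is 100 / (100 - r) times faster. The short Python sketch below applies this arithmetic to the percentages annotated on the chart; the helper name `speedup_from_reduction` is an assumption for illustration.

```python
def speedup_from_reduction(reduction_percent):
    """Convert 'r% less running time' into a speedup factor."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    return 100 / (100 - reduction_percent)

for reduction in (59, 64, 70):
    factor = speedup_from_reduction(reduction)
    print(f"{reduction}% less running time is a {factor:.2f}x speedup")
```

For example, 70% less running time means the job finishes in 30% of the original time, which is roughly a 3.33x speedup.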

Scalability: drastic reduction of run time by adding machines
•  When adding machines
– Time to deploy data increases
– Time to process decreases

[Figure: running time with 2 nodes, 3 nodes, and 4 nodes; more nodes are faster]
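
The trade-off between the two bullets can be illustrated with a toy cost model in Python. Everything in it is an assumption, not a measurement from the case study: deployment cost is taken to grow linearly with the number of nodes, processing is taken to parallelize perfectly, and the constants (0.1 hours per node, 8 hours of single-machine processing) are hypothetical.

```python
def total_time(nodes, deploy_hours_per_node, single_node_process_hours):
    """Toy model: deployment grows with nodes, processing shrinks with nodes."""
    if nodes < 1:
        raise ValueError("need at least one node")
    deploy = deploy_hours_per_node * nodes         # assumed linear growth
    process = single_node_process_hours / nodes    # assumed perfect parallelism
    return deploy + process

# Hypothetical constants, chosen only to show the shape of the curve.
DEPLOY_HOURS_PER_NODE = 0.1
SINGLE_NODE_PROCESS_HOURS = 8.0

for nodes in (2, 3, 4):
    hours = total_time(nodes, DEPLOY_HOURS_PER_NODE, SINGLE_NODE_PROCESS_HOURS)
    print(f"{nodes} nodes: {hours:.2f} hours")
```

Under these assumptions the total keeps dropping as long as the processing time saved by an extra node exceeds the added deployment time; with enough nodes, deployment would eventually dominate.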

Adaptability: little effort to apply MapReduce to an MSR tool
•  J-REX logic unchanged
•  Only 300-400 LOC to implement Map and Reduce
•  Typical MapReduce examples available
•  Less than one hour for deployment

Flexibility: runs in various environments

Conclusions
•  Distributed frameworks are needed to
– deal with growing data
– make best use of available computing resources
•  A MapReduce solution of a typical MSR analysis is:
– straightforward
– scalable
– efficient
