MSR 2009
1. MapReduce as a General Framework to Support Research in Mining Software Repositories (MSR)
Weiyi Shang, Zhen Ming Jiang, Bram Adams, Ahmed Hassan
Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen’s University
2. As an MSR researcher, have you ever been in such a situation?
• Analyzing gigabytes of data?
• Waiting hours for experimental results?
• Experiments fail with “out of memory” exceptions?
3. To overcome these problems, you could …
… buy more powerful machines
… spend weeks to make your tools more efficient
4. However!
• The data will keep on growing
• Spend time on research, not on speeding up experiments
Debian doubles in size approximately every two years
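The doubling claim above can be turned into simple arithmetic. The following is a minimal sketch, assuming steady exponential growth with a two-year doubling period; the function name `projected_size` and the 1 GB starting size are hypothetical and only serve as illustration, they are not figures from the slides.

```python
def projected_size(current_size: float, years: float, doubling_period: float = 2.0) -> float:
    """Projected size after `years`, assuming the size doubles every `doubling_period` years."""
    return current_size * 2 ** (years / doubling_period)


# Hypothetical starting point of 1 GB, chosen only to make the growth factor visible.
for years in (2, 4, 10):
    print(f"after {years} years: {projected_size(1.0, years)} GB")
```

Under this assumption, data grows 32-fold over ten years, which is why a one-time hardware upgrade or tool optimization does not keep up for long.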
5. • Idle computing power is available in every lab
• We can bundle these computers together
• A distributed framework can help us do so
6. General requirements for a distributed framework:
   1. Efficiency: speed up the process significantly
   2. Scalability: scale with data size and computing power
   3. Adaptability: require only minimal programming effort
   4. Flexibility: run in various environments
8. Google’s MapReduce is an idea of distributed computation
• Open-source MapReduce implementation
• Well documented and many examples available
• Well supported by large user base and news groups
• Straightforward API
9. Example: counting the frequency of word lengths
Input: dog, cat, fish, good, hello, night, happy, school
# Words  Length
2        3
2        4
3        5
1        6
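For reference, the same result can be computed on a single machine in a few lines. This is a minimal sequential sketch, not part of the slides; it gives a baseline against which the distributed MapReduce steps on the following slides can be compared.

```python
from collections import Counter

words = ["dog", "cat", "fish", "good", "hello", "night", "happy", "school"]

# Count how many words have each length (key = length, value = number of words).
length_counts = Counter(len(word) for word in words)

for length, count in sorted(length_counts.items()):
    print(f"{count} word(s) of length {length}")
```

This works as long as the input fits on one machine; the MapReduce formulation splits the same computation into independent map and reduce steps so that it can be spread over several machines.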
10. Example: counting the frequency of word length
   1. Deploy data into a distributed file system
Data: dog, cat, fish, hello, good, night, happy, school
[Figure labels: data, network, computing environment]
11. Example: counting the frequency of word length
   2. Read data as records
Data: dog, cat, fish, hello, good, night, happy, school
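Reading data as records can be sketched as follows. This is a simplified illustration, assuming one word per line and a round-robin assignment of records to input splits; real MapReduce implementations typically split input by file blocks instead. The helper names `read_records` and `split_records` are hypothetical.

```python
from typing import List


def read_records(text: str) -> List[str]:
    """Treat every non-empty line of the input as one record."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_records(records: List[str], num_splits: int) -> List[List[str]]:
    """Deal the records round-robin into `num_splits` input splits (a simplification)."""
    return [records[i::num_splits] for i in range(num_splits)]


text = "dog\ncat\nfish\nhello\ngood\nnight\nhappy\nschool\n"
records = read_records(text)
splits = split_records(records, 4)

for index, split in enumerate(splits):
    print(f"split {index}: {split}")
```

Each split can then be handed to a separate mapper, since the records are independent of each other.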
12. Example: counting the frequency of word length
   3. Generate keys of each record by Mappers
Data: dog, cat, fish, hello, good, night, happy, school
[Figure: four Mappers turn the records into the value/key pairs below]
Value   Key
dog     3
cat     3
fish    4
hello   5
good    4
night   5
happy   5
school  6
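The map step above can be sketched as a function that emits one pair per record. This is a minimal illustration in plain Python rather than the API of any particular MapReduce implementation; the name `mapper` and the tuple order (key first, then value) are choices made for this sketch.

```python
from typing import Iterator, List, Tuple


def mapper(record: str) -> Iterator[Tuple[int, str]]:
    """Emit one (key, value) pair per record: key = word length, value = the word."""
    yield len(record), record


records = ["dog", "cat", "fish", "hello", "good", "night", "happy", "school"]

# Each record is mapped independently, so the calls could run on different machines.
pairs: List[Tuple[int, str]] = [pair for record in records for pair in mapper(record)]

for key, value in pairs:
    print(value, key)
```

Because a mapper looks at one record at a time and keeps no shared state, adding more mappers does not change the result.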
13. Example: counting the frequency of word length
   4. Group and sort records by keys
Before:
Value   Key
dog     3
cat     3
fish    4
hello   5
good    4
night   5
happy   5
school  6
After:
Value   Key
dog     3
cat     3
fish    4
good    4
hello   5
night   5
happy   5
school  6
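Grouping and sorting by key can be sketched with the standard library. This is a single-machine illustration of the idea, assuming the mapper output is available as a list of (key, value) tuples; in a distributed setting the framework performs this step between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter

# Mapper output as (key, value) pairs: key = word length, value = the word.
pairs = [(3, "dog"), (3, "cat"), (4, "fish"), (5, "hello"),
         (4, "good"), (5, "night"), (5, "happy"), (6, "school")]

# sorted() is stable, so words with the same key keep their original relative order.
sorted_pairs = sorted(pairs, key=itemgetter(0))

# groupby() only merges adjacent items, which is why the sort has to come first.
grouped = {key: [word for _, word in group]
           for key, group in groupby(sorted_pairs, key=itemgetter(0))}

for key, words in grouped.items():
    print(key, words)
```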
14. Example: counting the frequency of word length
   5. Send records with the same key to one reducer
Sorted records (value key): dog 3, cat 3, fish 4, good 4, hello 5, night 5, happy 5, school 6
Reducer: dog 3, cat 3
Reducer: fish 4, good 4
Reducer: hello 5, night 5, happy 5
Reducer: school 6
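One common way to guarantee that all records with the same key reach the same reducer is to derive the reducer index from the key itself. The sketch below assumes integer keys and a modulo-based `partition` function; both the function and the choice of four reducers are assumptions for illustration, and the slides do not state which partitioning scheme is used.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def partition(key: int, num_reducers: int) -> int:
    """Pick a reducer index from the key alone, so equal keys always land together."""
    return key % num_reducers


sorted_pairs = [(3, "dog"), (3, "cat"), (4, "fish"), (4, "good"),
                (5, "hello"), (5, "night"), (5, "happy"), (6, "school")]

reducer_inputs: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
for key, word in sorted_pairs:
    reducer_inputs[partition(key, 4)].append((key, word))

for reducer_index in sorted(reducer_inputs):
    print(f"reducer {reducer_index}: {reducer_inputs[reducer_index]}")
```

With four reducers and the keys 3, 4, 5 and 6, each key happens to get its own reducer; with fewer reducers, several keys would share one reducer, but a single key would still never be split across two.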
15. Example: counting the frequency of word length
   6. Generate outputs by Reducers
Reducer: dog 3, cat 3
Reducer: fish 4, good 4
Reducer: hello 5, night 5, happy 5
Reducer: school 6
Output:
Value  Key
2      3
2      4
3      5
1      6
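The reduce step can be sketched as a function that receives one key together with all of its values and emits a single output pair. This is a plain-Python illustration, not the API of a specific MapReduce implementation; the name `reducer` and the dictionary of grouped records are assumptions of the sketch.

```python
from typing import Dict, List, Tuple


def reducer(key: int, values: List[str]) -> Tuple[int, int]:
    """Emit (key, count): the word length and how many words have that length."""
    return key, len(values)


# Records as they arrive at the reducers, grouped by key (word length).
grouped: Dict[int, List[str]] = {
    3: ["dog", "cat"],
    4: ["fish", "good"],
    5: ["hello", "night", "happy"],
    6: ["school"],
}

output = [reducer(key, values) for key, values in sorted(grouped.items())]

for key, count in output:
    print(count, key)
```

The printed pairs match the output table on the slide: two words of length 3, two of length 4, three of length 5 and one of length 6.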
16. A typical MSR analysis
Repository → Extract all versions of all files → Analyze each version → Compare versions to each other
We implement MapReduce on a typical MSR tool
17. Applying MapReduce to typical MSR tools
Repository
Data: a0.java, a1.java, b0.java, a2.java, b1.java
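The transcript does not show how the file versions above are assigned to mappers and reducers, so the following is only one plausible decomposition. It assumes that the digit in each name is a revision number, that the mapper keys every version by its file name, and that the reducer pairs up consecutive versions of the same file for comparison. The names `mapper`, `reducer` and `VERSION_PATTERN` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumption: names look like "<file><revision>.java", e.g. "a2.java" is revision 2 of file "a".
VERSION_PATTERN = re.compile(r"^(?P<name>[A-Za-z_]+)(?P<rev>\d+)\.java$")


def mapper(file_name: str) -> Tuple[str, Tuple[int, str]]:
    """Key each version by its file name so all versions of a file meet at one reducer."""
    match = VERSION_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return match.group("name"), (int(match.group("rev")), file_name)


def reducer(versions: List[Tuple[int, str]]) -> List[Tuple[str, str]]:
    """Order the versions of one file by revision and pair up consecutive ones."""
    ordered = [file_name for _, file_name in sorted(versions)]
    return list(zip(ordered, ordered[1:]))


data = ["a0.java", "a1.java", "b0.java", "a2.java", "b1.java"]

grouped: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
for file_name in data:
    key, value = mapper(file_name)
    grouped[key].append(value)

comparisons = {name: reducer(versions) for name, versions in grouped.items()}
print(comparisons)
```

In this sketch the comparison work for file `a` and file `b` is independent, which is what would allow it to run on different machines.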
22. Case study: J-REX
• Extract snapshots from CVS repository
• Use Eclipse JDT to parse source code to XML files
• Compare each XML file to generate evolution information
[Figure: Extraction phase: CVS → Snapshot extractor → Snapshot 1 … Snapshot n; Parsing phase: JDT → XML output 1 … XML output n; Analysis phase: Evolution Analyzer → Evolutionary Change Data]
23. Case study: data
Repository  Size   #Source Code Files  Length of History  #Revisions
Datatools   394MB  10,552              2 years            2,398
BIRT        810MB  13,002              4 years            19,583
Eclipse     4.2GB  56,851              8 years            82,682
24. Case study: experimental setup
Machine  CPU type                          #CPU  Memory size  Disk type
Desktop  Intel Quad Core Q6600 @ 2.40 GHz  4     2GB          SATA
Server   Intel Quad Core Q6600 @ 2.40 GHz  4     8GB          RAID5
Server   Intel Core i7 920 @ 2.67 GHz      8     6GB          SSD
25. Efficiency: significant reduction of running time by using MapReduce
[Figure: running time (hour) on Desktop and Server (SSD), with MapReduce; bar labels: 70% less, 64% less, 59% less; arrow: faster]
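A reduction in running time can be converted into a speedup factor with one line of arithmetic: if a run takes p% less time, the remaining time is (1 - p/100) of the original, so the speedup is 1 / (1 - p/100). The sketch below applies this to the three percentages on the slide; the function name `speedup` is hypothetical, and the transcript does not say which percentage belongs to which machine or repository.

```python
def speedup(reduction_percent: float) -> float:
    """Speedup factor implied by a running time that is `reduction_percent` shorter."""
    return 1.0 / (1.0 - reduction_percent / 100.0)


for percent in (70, 64, 59):
    print(f"{percent}% less running time is a {speedup(percent):.2f}x speedup")
```

By this conversion, the reported reductions correspond to speedups of roughly 2.4x to 3.3x.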
26. Scalability: drastic reduction of run time by adding machines
• When adding machines
– Time to deploy data increases
– Time to process decreases
[Figure: run time with 2 nodes, 3 nodes, 4 nodes; arrow: faster]
27. Adaptability: little effort to apply MapReduce to MSR tool
• J-REX logic unchanged
• Only 300-400 LOC to implement Map and Reduce
• Typical MapReduce examples available
• Less than one hour for deployment
29. Conclusions
• Distributed frameworks are needed to
– deal with growing data
– make best use of available computing resources
• A MapReduce solution of a typical MSR analysis is:
– straightforward
– scalable
– efficient