1. Scaling Classical Clone Detection Tools for Ultra-Large Datasets
Jeffrey Svajlenko, Iman Keivanloo, Chanchal Roy
IWSC 2013
2. Inter-Project Clone Detection
• Active research topic in the community.
• Goal: Construct an inter-project clone corpus.
• Applications:
  • Study global developer behavior
  • Discover potential APIs and libraries
  • Internet-scale clone search
  • API recommendation
  • API usage support
  • …
3. Problem: Inter-Project Detection
• Many state-of-the-art (classical) tools do not scale to large datasets.
  • Memory requirements
  • Computational complexity
  • Execution time
• Underlying limitations in their algorithms or data structures.
• Instead, novel scalable techniques are used.
  • Challenging to develop.
• We wish to use tools from a variety of domains when building an inter-project clone corpus.
4. Goal and Motivation
GOAL: To scale classical clone detection tools to ultra-large datasets.
MOTIVATION: To allow classical clone detection tools to contribute to inter-project clone corpora.
5. Shuffling Framework
• Scales classical tools to ultra-large datasets.
  • Using standard hardware.
  • Without modifying the original tool.
  • Incurs a loss of recall.
• Method: Non-Deterministic Dataset Partitioning
6. Shuffling Framework - Procedure
1. The source files of the dataset are randomly partitioned into n equally sized subsets.
(Diagram: an ultra-large dataset split into subsets 1 through 16.)
• Subset size is dictated by the clone detection tool's scalability limits.
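The random partitioning in step (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the file names and the choice of n are hypothetical.

```python
import random

def partition(files, n, seed=None):
    """Randomly split `files` into n subsets of (near-)equal size."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    # Deal files round-robin so subset sizes differ by at most one.
    return [shuffled[i::n] for i in range(n)]

# 16 hypothetical source files split into 4 subsets of 4.
subsets = partition([f"File{i}.java" for i in range(16)], n=4, seed=1)
```

In practice, n would be chosen so that each subset fits within the tool's memory and runtime limits.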
7. Shuffling Framework - Procedure
2. Each subset is searched independently by the clone detection tool.
(Diagram: subsets 1 through 16 each fed to a separate run of the clone detection tool.)
8. Shuffling Framework - Procedure
3. The detected clone pairs are added to a clone repository.
(Diagram: detection results from subsets 1 through 16 flowing into the detected-clones repository.)
9. Shuffling Framework - Procedure
4. Steps (1) through (3) are repeated for r rounds.
(Diagram: the dataset is shuffled into the clone repository over r rounds.)
• n*r detection experiments in total.
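Steps (1) through (3), repeated for r rounds, can be sketched as follows. The toy detector stands in for the unmodified clone detection tool; the file tuples and pair representation are illustrative only.

```python
import random

def shuffle_detect(files, n, r, run_tool, seed=0):
    """Run n*r independent detection experiments and pool unique clone pairs."""
    rng = random.Random(seed)
    repository = set()  # clone repository: unique pairs across all rounds
    for _ in range(r):
        shuffled = list(files)
        rng.shuffle(shuffled)                         # step 1: shuffle...
        subsets = [shuffled[i::n] for i in range(n)]  # ...and partition
        for subset in subsets:
            repository |= run_tool(subset)            # steps 2-3: detect, pool
    return repository

# Toy detector: reports a pair whenever two files share identical "content".
def toy_tool(subset):
    return {frozenset((a, b))
            for i, (a, ca) in enumerate(subset)
            for b, cb in subset[i + 1:] if ca == cb}

files = [("A.java", "x"), ("B.java", "x"), ("C.java", "y"), ("D.java", "y")]
found = shuffle_detect(files, n=2, r=50, run_tool=toy_tool)
```

A pair is only found in a round when both files land in the same subset, so a single round misses pairs; repeating rounds recovers them probabilistically, which is the source of the recall loss.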
10. Shuffling Framework - Evaluation
• Gold Standard: the clone detection report of the tool executed natively (without shuffling).
• Total Recall: % of the gold standard found after r shuffling rounds of n partitions.
  • Measured for unique clone pairs or unique cloned fragments.
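Both recall measures can be computed directly from the native report and the shuffled repository. A sketch, assuming clone pairs are stored as unordered frozensets of fragment identifiers:

```python
def total_recall(gold_pairs, shuffled_pairs):
    """% of gold-standard clone pairs recovered after shuffling."""
    return 100.0 * len(gold_pairs & shuffled_pairs) / len(gold_pairs)

def fragment_recall(gold_pairs, shuffled_pairs):
    """The same measure over unique cloned fragments rather than pairs."""
    gold_frags = {f for pair in gold_pairs for f in pair}
    found_frags = {f for pair in shuffled_pairs for f in pair}
    return 100.0 * len(gold_frags & found_frags) / len(gold_frags)

# Hypothetical reports: pairs as unordered frozensets of fragment ids.
gold = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("D", "E")]}
found = {frozenset(p) for p in [("A", "B"), ("D", "E")]}
```

Fragment-level recall is never lower than pair-level recall, since a fragment counts as found as soon as any one of its pairs is detected.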
11. Preliminary Study
• Test with "regular size" systems:
  • JHotDraw (20 KLOC, 285 files)
  • ArgoUML (190 KLOC, 1845 files)
  • JDK 1.7 (900 KLOC, 6916 files)
• Tools: CCFinder, Deckard, iClones, NiCad, SimCad, Simian
• Shuffling: 15 subsets, 30 shuffling rounds
• Measured: total recall after each round
13. Preliminary Study
• ~60-90% total recall achievable.
• Shuffling performance varies by detection tool.
• Generally, a larger gold standard requires more rounds to reach the same total recall.
14. Main Experiment: Dataset
IJaDataset 2.0: An Inter-Project Java Corpus
• Keivanloo et al., 2012 (Proc. MSR)
• Crawled 25,000 open-source Java projects
• 3 million Java source files, 356 MLOC
• Outliers (>2000 lines): 6238 files removed
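The outlier removal amounts to a simple size filter. A sketch with hypothetical file names and line counts; the deck only states the >2000-line threshold and that 6238 files were removed.

```python
def remove_outliers(line_counts, max_lines=2000):
    """Drop files longer than the threshold; return the kept file names."""
    return [f for f, lines in line_counts.items() if lines <= max_lines]

# Hypothetical line counts for three files.
counts = {"Util.java": 120, "Parser.java": 1999, "Generated.java": 5400}
kept = remove_outliers(counts)
```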
15. Experiment - Hardware
Clone detection (shuffling):
• Workstation-class hardware
  • Quad-core CPU
  • 12-16 GB of RAM
  • Above-average disk I/O
  • ~$1000 PC
• Allocated on shared cloud resources:
  • Western Canada Research Grid (Bugaboo Cluster)
  • Amazon EC2 instances
25. Main Experiment Conclusions
• The shuffling framework finds cloned fragments faster than the clone pair relationships between them.
• A large number of rounds may be needed to detect a sizable number of the clone pairs.
• Appropriate when a loss of recall is acceptable.
  • Ex: contributing towards a multi-tool clone corpus.
• Processing the clones found in an inter-project clone corpus can itself become a scalability issue.
26. Clone Recovery
How can we improve clone pair discovery?
• Without a significant increase in rounds?
IDEA: Leverage Cloned Fragment Detection Ability
• Apply the transitive property to the clone repository.
  • If (A,B) and (B,C), then (A,C).
• Perform clone search amongst the cloned fragments.
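The transitive-property idea can be sketched with a union-find structure: detected pairs merge fragments into clone classes, and every pair within a class becomes an inferred clone pair. A minimal sketch, with illustrative fragment names:

```python
from itertools import combinations

def close_pairs(pairs):
    """Infer (A,C) from (A,B) and (B,C) via union-find clone classes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two clone classes

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    # Every unordered pair within a clone class is an inferred clone pair.
    return {frozenset(p) for members in classes.values()
            for p in combinations(sorted(members), 2)}

inferred = close_pairs([("A", "B"), ("B", "C"), ("D", "E")])
```

Note that this inference can introduce false positives: a single spurious pair merges two entire clone classes, so every cross pair between them is inferred.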
29. Future Work
1. Investigate additional tools.
2. Investigate efficient clone recovery methods.
3. Directly compare with a deterministic approach.
4. Use the shuffling framework to contribute towards an inter-project clone corpus (IJaDataset 2.0).