Scaling classical clone detection tools for ultra large datasets

Scaling Classical Clone
Detec/on Tools for Ultra-‐
Large Datasets

Jeﬀrey
Svajlenko,
Iman
Keivanloo,
Chanchal
Roy

IWSC
2013

Inter-‐Project Clone Detec/on
•  Ac>ve
research
topic
in
the
community.

•  Goal:
Construct
inter-‐project
clone
corpus.

•  Applica*ons

•  Study
Global
Developer
Behavior

•  Discover
Poten>al
APIs
and
Libraries

•  Internet-‐Scale
Clone
Search

•  API
Recommenda>on

•  API
Usage
Support

•  …

Problem: Inter-‐Project Detec/on
•  Many
state
of
the
art
tools
do
not
scale
to
large

datasets.
(classical
tools)

•  Memory
Requirements

•  Computa>onal
Complexity

•  Execu>on
Time

•  Underlying
limita>ons
in
their
algorithms
or
data
structures.

•  Instead
novel
scalable
techniques
are
used.

•  Challenging
to
develop.

•  Wish
to
use
tools
from
a
variety
of
domains
when

building
an
inter-‐project
clone
corpus.

Goal and Mo/va/on
GOAL

To
scale
classical
clone
detec,on
tools
to
ultra
large

dataset.

MOTIVATION

To
allow
classical
clone
detec>on
tools
to
contribute

to
inter-‐project
clone
corpuses.

Shuﬄing Framework
•  Scales
classical
tools
to
ultra-‐large
datasets.

•  Using
standard
hardware.

•  Without
modifying
the
original
tool.

•  Incurs
a
loss
of
recall.

•  Method:
Non-‐Determinis>c
Dataset
Par>>oning

Shuﬄing Framework -‐ Procedure
1.  The
source
ﬁles
of
the
dataset
are
randomly

par>>oned
into
n
equally
sized
subsets.

Ultra-‐Large
Dataset

1
2
3
4

5
6
7
8

9
10
11
12

13
14
15
16

Subset
size
dictated
by
clone
detec>on
tool’s
scalability
limits.

2.  Each
subset
is
searched
independently
by
the

clone
detec>on
tool.

1
Clone
Detec>on
Tool

2
Clone
Detec>on
Tool

16
Clone
Detec>on
Tool
16

.
.
.
2

1

3.  The
detected
clone
pairs
are
added
to
a
clone

repository.

1
2
3
4

5
6
7
8

9
10
11
12

13
14
15
16

Detected

Clones

4.  Steps
(1)
through
(3)
are
repeated
for
r
rounds.

Dataset
Clone

Repository

r
rounds

n*r
detec>on
experiments

Shuffling Framework -‐ Evalua/on
Gold
Standard

•  Clone
detec>on
report
of
a
tool
executed
na>vely

(without
shuffling).

Total
Recall

•  %
of
gold
standard
found
afer
r
shuffling
rounds
of
n

par>>ons.

•  Measure
for
unique
clone
pairs
or
unique
cloned

fragments.

Preliminary Study
•  Test
with
“regular
size”
systems:

•  JHotDraw
(20
KLOC,
285
files)

•  ArgoUML
(190KLOC,
1845
files)

•  JDK1.7
(900KLOC,
6916
files)

•  Tools:

•  CCFinder,
Deckard,
iClones,
NiCad,
SimCad,
Simian

•  Shuffling:
15
subsets,
30
shuffling
rounds

•  Measured:
total
recall
afer
each
round

Preliminary Study – JDK1.7
0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

Recall

Round

Deckard
(1834042)

iClones
(49716)

NiCad
(8105)

SimCad
(549923)

Simian
(217409)

n
=
15
subsets,
r
=
30
rounds

Preliminary Study
•  ~60-‐90%
total
recall
achievable

•  Shuﬄing
performance
varies
by
detec>on
tool.

•  Generally,
a
larger
gold
standard
requires
more

rounds
to
get
the
same
total
recall.

Main Experiment: Dataset
IJaDataset
2.0:
An
Inter-‐Project
Java
Corpus

•  Keivanloo
et
al,
2012
(Proc.
MSR)

•  Crawled
25,000
Open-‐Source
Java
Projects

•  3
million
java
source
ﬁles,
356
MLOC

•  Outliers
(>2000
lines)

•  6238
removed

Experiment -‐ Hardware
Clone
detec>on
(shuﬄing):

•  Worksta>on-‐Class
Hardware

•  Quad
Core
CPU

•  12-‐16GB
of
RAM

•  Above
Average
Disk
IO

•  ~$1000
PC

•  Allocated
on
shared
cloud
resources.

•  Western
Canada
Research
Grid
(Bugaboo
Cluster)

•  Amazon
EC2
Instances

Experiment -‐ Tools
•  Simian

•  NiCad

•  Deckard

•  CCFinderX

•  Terminated
without
explana>on.

•  SimCad

•  Execu>on
aborts
on
troublesome
ﬁle.

•  iClones

•  Compa>bility
issue.

Simian
•  IJaDataset2

•  Scalability
limit:
RAM

•  50,000
ﬁle
subsets
(58
par>>ons),
30
rounds

•  8-‐12hr
to
par>>on,
4-‐10hr
for
detec>on
(per
round)

•  Serng

•  Minimum
Clone
Size:
6
lines

•  No
source
normaliza>on
(execu>on
>me)

•  Gold
Standard

•  Amazon
EC2
instance
with
68GB
of
RAM

•  300
billion
clone
pairs,
11
million
cloned
fragments

Simian: Cloned Fragment Recall
0.166903883

0.476927684

0.626533533

0.715431474

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

Clone
Fragment
Recall

Round

Considering
only
clone
classes
with
<=
100
fragments.

Simian: Clone Recall (Trim)
0.24792718

0.619514665

y
=
0.0067x
+
0.0533

R²
=
0.99585

y
=
0.1364ln(x)
+
0.1199

R²
=
0.95064

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

Total
Recall

Rounds

Clone
Pairs
Cloned
Fragments
Linear
(Clone
Pairs)
Log.
(Cloned
Fragments)

NiCad
•  IJaDataset2

•  Scalability:
Limited
data-‐structure
size.

•  10,000
ﬁle
subsets,
289
par>>ons,
20
rounds

•  7-‐15hr
par>>oning,
23-‐31hr
detec>on
(per
round)

•  Serngs:

•  Clone
Size:
10-‐2500
lines.

•  Minimum
clone
similarity:
70%

•  Gold
Standard

•  Not
possible.

NiCad – Detec/on vs. Rounds
y
=
245387x
+
767852

R²
=
0.99993

0.00E+00

1.00E+05

2.00E+05

3.00E+05

4.00E+05

5.00E+05

6.00E+05

7.00E+05

8.00E+05

9.00E+05

1.00E+06

0.00E+00

1.00E+06

2.00E+06

3.00E+06

4.00E+06

5.00E+06

6.00E+06

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

Unique
Cloned
Fragments
Found

Unique
Clones
Found

Round

Unique
Clones
Found

Unique
Clone
Fragments
Found

Deckard
•  IJaDataset

•  Scalability
Limit:
Execu>on
>me.

•  10,000
ﬁle
subsets,
289
par>>ons,
20
rounds

•  7-‐15hr
par>>oning,
5-‐7
days
detec>on
(per
round)

•  Serngs:

•  Minimum
Fragment
Size:
50
tokens

•  Sliding
Window:
5
tokens

•  Minimum
Clone
Similarity:
90%
(tree)

•  Gold
Standard

•  Execu>on
>me
too
long.

Deckard: Detec/on vs. Rounds
1.00E+07

1.10E+07

1.20E+07

1.30E+07

1.40E+07

1.50E+07

1.60E+07

1.70E+07

1.80E+07

1.90E+07

1
2
3
4
5
6
7
8
9
10

Unique
Reported
Clone
Fragments

Round

Deckard – Detec/on vs. Rounds (Trim)
Considering
only
clone
classes
with
<=
10
fragments.

0.00E+00

2.00E+06

4.00E+06

6.00E+06

8.00E+06

1.00E+07

1.20E+07

1.40E+07

1.60E+07

1.80E+07

0.00E+00

2.00E+07

4.00E+07

6.00E+07

8.00E+07

1.00E+08

1.20E+08

1
2
3
4
5
6
7
8
9
10

Unique
Clone
Fragments
Found

Round

Clones

Fragments

Main Experiment Conclusions
•  Shuﬄing
framework
ﬁnds
cloned
fragments
faster
than

the
clone
pair
rela>onships
between
them.

•  A
large
number
of
rounds
may
be
needed
to
detect
a

sizable
number
of
the
clone
pairs.

•  Appropriate
when
loss
of
recall
is
acceptable.

•  Ex:
contribu>ng
towards
mul>-‐tool
clone
corpus.

•  Processing
the
clones
found
in
a
inter-‐project
clone

corpus
can
become
itself
a
scalability
issue.

Clone Recovery
How
can
we
improve
clone
pair
discovery?

•  Without
a
signiﬁcant
increase
in
rounds?

IDEA:
Leverage
Cloned
Fragment
Detec2on
Ability

•  Apply
Transi>ve
Property
on
Clone
Repository.

•  If
(A,B)
and
(B,C)
then
(A,C)

•  Perform
clone
search
amongst
cloned
fragments.

Transi/ve Clone Recovery Test
0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

Recall

Round

Clone
Recall
Heuris>c
Recall
Recovered
Recall

NiCad,
JDK1.7

Transi/ve Clone Recovery Test
0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

Recall

Round

Clone
Recall
Heuris>c
Recall
Recovered
Recall

Simian,
JDK1.7

Future Work
1.  Inves>gate
addi>onal
tools.

2.  Inves>gate
eﬃcient
clone
recovery
methods.

3.  Directly
compare
with
determinis>c
approach.

4.  Use
the
shuﬄing
framework
to
contribute

towards
an
inter-‐project
clone
corpus
(IJaDataset

2.0).

Scaling classical clone detection tools for ultra large datasets

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Scaling classical clone detection tools for ultra large datasets

Similar to Scaling classical clone detection tools for ultra large datasets (20)

Recently uploaded

Recently uploaded (20)

Scaling classical clone detection tools for ultra large datasets