BigCloneBench
1. TOWARDS A BIG DATA
CURATED BENCHMARK OF
INTER-PROJECT CODE CLONES
Jeff Svajlenko, Judith F. Islam, Iman Keivanloo*,
Chanchal Roy, Mohammad Mamun Mia
Computer Science, University of Saskatchewan
*Electrical and Computer Engineering, Queen’s
University
2. Overview
• BigCloneBench
• Big Data Clone Benchmark
• 6 million clone pairs of 10 functionalities.
• To Evaluate Clone Detection Performance
• Emerging Big Data Clone Detection Tools
ICSME 2014
4. Background – Code Clone
Code Snippets
A contiguous region of code: (srcfile, startline, endline).
Clone Pair
A pair of code snippets that are similar.
Clone Class
A set of code snippets that are similar.
Clone Type
Describes the similarity.
5. Background – Type 1
if (a >= b) {
  c = d + b; // Comment 1
  d = d + 1;
} else
  c = d - a; // Comment 2

if (a >= b)
{
  c = d + b; // MyComment 1
  d = d + 1;
}
else
  c = d - a; // MyComment 2
Syntactically identical code snippets, except for
differences in white space, layout and comments.
6. Background – Type 2
if (a >= b) {
  c = d + b; // Comment 1
  d = d + 1;
} else
  c = d - a; // Comment 2

if (a >= y)
{
  x = d + y; // MyComment1
  d = d + 10;
}
else
  x = d - a; // MyComment2
Syntactically identical code snippets, except for
differences in identifier names, literal values, white
space, layout and comments.
7. Background – Type 3
if (a >= b) {
  c = d + b; // Comment 1
  d = d + 1;
} else
  c = d - a; // Comment 2

if (a >= y)
{
  x = d + y; // MyComment1
}
else {
  x = d - a - 10; // MyComment2
}
Syntactically similar code snippets that differ at the statement level.
8. Background – Type 4
int i, j=1;
for (i=1; i<=VALUE; i++)
j=j*i;
int factorial(int n)
{
if (n == 0)
return 1;
else
return n * factorial(n-1);
}
Syntactically dissimilar code snippets that implement
the same functionality.
9. Background – Clone Detection
• Clone Detection Tool
• Locates clones within a software system or source repository.
• Types: Classical, Semantic, Big Data, Search
• Performance Evaluation
• Recall
• Precision
• Benchmarks Focus on Recall
10. Motivation
• Emerging Big Data Clone Detection and Search
• Applications:
• Building inter-project clone corpora.
• Mine new APIs
• Detect License Violation
• Find Code Examples
• …
• Problem:
• No Benchmark for Big Data Clone Detection!
• Existing benchmarks are small or outdated.
11. Methodology
• Mine Clones in Big Data
• Target: IJaDataset 2.0
• 25,000 open-source systems crawled
• 2.4 million files, 365MLOC
• Challenge
• 250 million candidate function clone pairs.
• Need
• To reduce and optimize search space.
12. Methodology – General Procedure
Solution: Mine Clones of Specific Functionalities
1. Select a Functionality.
2. Create a Search Heuristic
3. Identify candidate snippets using heuristic.
4. Manually tag each snippet as true or false positive.
5. Populate the benchmark with true/false clones.
13. Methodology – Mine & Tag Snippets
[Pipeline diagram: Select Functionality → Identify Possible Implementations → Create Specification → Create Sample Snippets → Create Search Heuristic. The search heuristic and specification are used to build a candidate set from IJaDataset; judges then tag each candidate as a true or false positive.]
14. Methodology – Mine & Tag Snippets
e.g., Bubble Sort
15. Methodology – Mine & Judge Snippets
16. Methodology – Mine & Judge Snippets
e.g., “Sort any linear collection of
data using the bubble sort
algorithm.”
17. Methodology – Mine & Judge Snippets
Hello, my name is Jeff Svajlenko, and today I am presenting the paper Towards a Big Data Curated Benchmark of Inter-Project Code Clones. Contributors to this project include myself, Judith Islam, Iman Keivanloo, Chanchal Roy, and Mohammad Mamun Mia.
Today I will be talking about BigCloneBench, which is a big data clone benchmark containing 6 million clone pairs of 10 distinct functionalities. We created this benchmark for evaluating clone detection performance, especially that of emerging big data clone detection algorithms.
I will begin by discussing some background details, before proceeding into the motivation of our work. I will then discuss our methodology for building this benchmark, and a summary of its contents. I will discuss how the benchmark can be used to evaluate the clone detectors, before proceeding into future work.
I will begin the background with some code clone definitions.
A code snippet is a contiguous region of code, specified by its source file and its start and end lines.
A clone pair is a pair of code snippets that are similar.
Clone pairs are sometimes summarized as a clone class, which is a set of code snippets that are similar.
Clones are assigned a clone type, which describes this similarity. The community agrees upon four fundamental clone types.
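These definitions can be sketched as simple Java types. The names and shapes here are illustrative, not the benchmark's actual schema:

```java
import java.util.Set;

public class CloneModel {
    // A contiguous region of code: (srcfile, startline, endline).
    record Snippet(String srcFile, int startLine, int endLine) {}

    // A pair of similar snippets, labelled with a clone type (1-4).
    record ClonePair(Snippet a, Snippet b, int cloneType) {}

    // A clone class: a set of snippets that are all similar to one another.
    record CloneClass(Set<Snippet> members) {}
}
```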
A Type-1 clone contains code snippets that are syntactically identical, except for differences in white-space, layout and comments.
A Type-2 clone extends this definition to also allow differences in variable names and literal values.
A Type-3 clone includes snippets that are syntactically similar, but contain differences at the statement level. Snippets may have statements added, removed or modified with respect to each other. There is no agreement on the required similarity of a Type-3 clone, although studies typically require them to be 70-90% similar by syntax, using some similarity metric.
A Type-4 clone includes code snippets that are syntactically dissimilar but implement the same functionality. For example, here we have two very syntactically dissimilar implementations of factorial.
Clone detection tools locate clones within a software system or source repository. There are multiple types of clone detection tools. Classical clone detection tools locate syntactically similar clones, the first three clone types, within a single subject system or between a handful of subject systems. Semantic clone detectors aim to expand detection to Type-4 clones, using methods such as program dependency graphs. Big data clone detection is concerned with the detection of clones between thousands of subject systems. Search algorithms search for clones of a target code snippet, often within big data repositories.
All of these clone detectors are evaluated using the information-retrieval metrics recall and precision. Recall is the ratio of the clones within a subject system or repository that a clone detector is able to detect, while precision is the ratio of the clones reported by the clone detector that are in fact true clones, not false positives.
Benchmarks typically focus on enabling the measurement of recall. Tool developers can measure their tool's precision by validating their tool's output for a variety of subject systems. However, it is impossible for them to measure recall without pre-knowledge of the clones that exist within a subject system.
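These two metrics can be sketched as set operations, assuming each clone pair is represented by some canonical string key; the class and method names are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

public class IRMetrics {
    // Recall: fraction of the reference (known true) clones the tool detected.
    static double recall(Set<String> reference, Set<String> detected) {
        Set<String> hit = new HashSet<>(reference);
        hit.retainAll(detected); // intersection of reference and detected
        return (double) hit.size() / reference.size();
    }

    // Precision: fraction of the tool's reported clones that are true clones.
    static double precision(Set<String> reference, Set<String> detected) {
        Set<String> hit = new HashSet<>(detected);
        hit.retainAll(reference);
        return (double) hit.size() / detected.size();
    }
}
```

Recall divides by the size of the reference set (which a benchmark supplies), while precision divides by the size of the tool's own output, which is why a benchmark is needed for the former but not the latter.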
The motivation for our work is emerging big data clone detection and search algorithms, which have numerous applications, such as: building inter-project clone corpora for research studies and as a data set for some tools, mining for new APIs, detecting licensing violations, finding code examples, and so on.
However, there are currently no benchmarks for big data clone tools. Additionally, the current benchmarks that target classical clone detection tools are either too small for big data or outdated. For example, the most common benchmark was published in 2007, and was built using tools circa 2002.
We built our benchmark by mining clones in big data. Our target was the inter-project Java dataset IJaDataset 2.0, which contains source code crawled from 25,000 open-source systems, including 2.4 million files and 365 million lines of code.
The challenge is that even if we only consider function clones, there are 250 million candidate function clone pairs to examine. Looking at any significant number of these candidates is impractical, and a random sample of them is also not efficient due to the rarity of inter-project clones.
We therefore needed to reduce and optimize the search space for locating clones.
Our solution was to mine clones of specific functionalities.
This is an overview of our procedure. We begin by selecting a functionality. We then create a search heuristic that can identify if a snippet *might* implement the functionality. We use this heuristic to identify the candidate snippets in the dataset. The candidates are tagged by expert judges as true or false positives of the functionality. We then use these snippets to populate the benchmark with true and false clones. This process is repeated for any number of functionalities.
Next I will go into the details of these stages.
This diagram shows the snippet mining and tagging steps.
We begin by selecting a functionality that we believe will appear many times in the dataset.
For example, we chose the functionality Bubble Sort.
Next, we research the different ways this functionality can be implemented in Java. We investigate official and 3rd party libraries, instructional material such as text books and online tutorials, as well as online discussions such as Stack Overflow.
For bubble sort, we found many possible implementations, with variations in data types, data structures, loop structures, and data-comparison logic.
Next, considering the previous research, we create a formal specification for the functionality. This is the minimum set of features or steps a snippet must realize to be a true positive of this functionality.
For bubble sort, this was “Sort any linear collection of data using the bubble sort algorithm”.
As part of our research, we collect sample snippets that implement the functionality. These are minimum working examples that meet the specification. These are added to the dataset as additional crawled source code, and play a role in populating the benchmark.
We create a search heuristic, which is a logical combination of keywords and source-code patterns that are intrinsic to the identified implementations of the functionality. The search heuristic can therefore identify whether a snippet might implement the functionality.
For example, bubble sort.
The search heuristic is executed for each snippet in the dataset, producing a set of candidates that might implement the functionality.
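A search heuristic of this kind can be sketched in Java as a keyword test combined with structural source-code patterns. The patterns below are hypothetical stand-ins, not the benchmark's actual bubble sort heuristic:

```java
import java.util.regex.Pattern;

public class BubbleSortHeuristic {
    // Hypothetical patterns; the benchmark's real heuristics are hand-crafted
    // per functionality from the identified implementations.
    private static final Pattern KEYWORD =
        Pattern.compile("(?i)bubble\\s*sort");
    private static final Pattern NESTED_LOOPS =
        Pattern.compile("for\\s*\\([^)]*\\)[\\s\\S]*for\\s*\\([^)]*\\)");
    private static final Pattern SWAP_TEMP =
        Pattern.compile("\\btemp\\b|\\btmp\\b");

    // A snippet is a candidate if it names the algorithm, or if it contains
    // nested loops together with a swap via a temporary variable.
    static boolean isCandidate(String source) {
        return KEYWORD.matcher(source).find()
            || (NESTED_LOOPS.matcher(source).find()
                && SWAP_TEMP.matcher(source).find());
    }
}
```

The heuristic only has to be cheap and permissive: false positives are acceptable because the judges filter them out in the tagging step.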
For this version of the benchmark we have chosen the function snippet granularity, as functions best encapsulate whole functionalities, and it is a granularity that nearly all clone detectors support.
The candidate snippets are then given to expert judges, who tag them as true or false positives using the formal specification. A snippet is tagged as a true positive so long as it meets the specification, even if it performs additional related or unrelated steps. Otherwise it is tagged as a false positive. The specification helps reduce individual bias in the tagging process.
From the previous tasks, we have identified a set of code snippets that implement a functionality. This includes the sample snippets of the functionality, and the candidate snippets tagged as true positives. Together they form a clone class of snippets that are similar by functionality. If this clone class has n such snippets, we have found on the order of n-squared clone pairs, specifically n(n-1)/2. These clones are then automatically typified using a TXL-based clone typifier. This perfectly typifies Type-1 and Type-2 clones, but the separation of Type-3 and Type-4 clones is trickier, which I will discuss in a few slides.
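Turning a tagged clone class into clone pairs can be sketched as enumerating all unordered pairs; a class of n snippets yields n(n-1)/2 pairs:

```java
import java.util.ArrayList;
import java.util.List;

public class PairBuilder {
    // Enumerate every unordered pair from a clone class of n snippets,
    // producing n * (n - 1) / 2 clone pairs.
    static <T> List<List<T>> allPairs(List<T> cloneClass) {
        List<List<T>> pairs = new ArrayList<>();
        for (int i = 0; i < cloneClass.size(); i++) {
            for (int j = i + 1; j < cloneClass.size(); j++) {
                pairs.add(List.of(cloneClass.get(i), cloneClass.get(j)));
            }
        }
        return pairs;
    }
}
```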
We have also identified many false positives of the functionality. We know that the sample snippets exactly meet the specification of the functionality, and the false positives do not. So each pair of sample snippet and false positive snippet is a false clone. While they may share some syntactical similarity, this similarity is coincidental.
We executed this procedure for ten distinct functionalities. In total 60 thousand snippets were tagged by three graduate student judges.
In total we have identified 6.2 million clone pairs.
Each of these clone pairs share semantics, specifically the functionality they implement.
Here we summarize the syntactical similarity by dividing the clones by their clone type. Type-1 and Type-2 are easy, but it is difficult to separate the Type-3 and Type-4 clones, due to the lack of agreement on where Type-3 ends and Type-4 begins. Instead we separate them by their syntactical similarity. Syntactical clone detectors are typically configured to locate clones that share 70% or more of their syntax, which we consider strong Type-3 clones. We consider the clones that share at least half their syntax, but less than 70%, moderate Type-3 clones. We then consider the clones that share less than half their syntax to be, subjectively, either weak Type-3 clones or Type-4 clones.
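The similarity bins described above can be sketched as a simple classifier. The thresholds are as stated in the talk; the labels apply only to clones already known not to be Type-1 or Type-2, and the similarity metric itself is left abstract:

```java
public class CloneCategory {
    // Bin a non-Type-1/2 clone pair by its syntactical similarity in [0, 1].
    static String categorize(double similarity) {
        if (similarity >= 0.7) return "Strong Type-3";
        if (similarity >= 0.5) return "Moderate Type-3";
        return "Weak Type-3 / Type-4"; // boundary is subjective
    }
}
```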
We also identified 259 thousand false clones.
The benchmark can be used to evaluate clone detectors by measuring recall and precision.
We measure recall as the ratio of the true clones in the benchmark that the detector was able to detect.
While our focus was recall, we provide a limited precision measure. Precision is measured as the ratio of the known clones detected by the tool that are true clones. It ignores any clones that are unknown in the benchmark. The dataset contains many orders of magnitude more false clones than true clones, so our sampling of the false clones is small. So this is not a replacement for a standard measurement of precision, but it does provide some hints.
The benchmark was designed for the big data tools, but it may also be used to evaluate the other classifications of tools.
The big data tools can be executed for the entire dataset. Recall can then be measured for the full benchmark, but also per clone type, per functionality, or even per ranges of syntactical similarity.
Classical clone detection tools won’t be scalable to the full benchmark. However, they could be evaluated for subsets of the benchmark. Confidence can be achieved by evaluating the tool for many subsets. Subsets could be randomly selected, or even be the intra-project clones within one of the subject systems in the dataset.
Semantic clone detectors will also need to be evaluated for subsets due to potential scalability issues. Good subsets would be the clones of a functionality.
Clone search algorithms can be evaluated by using one of the sample snippets as a target, and evaluating which of the true positive snippets the tool locates.
As future work, we plan to expand the benchmark in a number of ways. We will grow the benchmark by adding new functionalities, which will increase the number and variety of clones in the benchmark. We plan to increase the number of judges, including using multiple judges per functionality. This will improve our data confidence, and allow us to measure the tagging accuracy. We also plan to investigate and add more clone metadata (such as additional similarity metrics) to better describe our clones, and to investigate how we can better separate our Type-3 and Type-4 clones.
The benchmark is available at the following URL. Please feel free to ask me any questions.