BigCloneBench
1. TOWARDS A BIG DATA
CURATED BENCHMARK OF
INTER-PROJECT CODE CLONES
Jeff Svajlenko, Judith F. Islam, Iman Keivanloo*,
Chanchal Roy, Mohammad Mamun Mia
Computer Science, University of Saskatchewan
*Electrical and Computer Engineering, Queen’s
University
2. Overview
• BigCloneBench
• Big Data Clone Benchmark
• 6 million clone pairs of 10 functionalities.
• To Evaluate Clone Detection Performance
• Emerging Big Data Clone Detection Tools
ICSME 2014
4. Background – Code Clone
Code Snippets
A contiguous region of code: (srcfile, startline, endline).
Clone Pair
A pair of code snippets that are similar.
Clone Class
A set of code snippets that are similar.
Clone Type
Describes the similarity.
5. Background – Type 1
if (a >= b) {
  c = d + b; // Comment 1
  d = d + 1;
} else
  c = d - a; // Comment 2

if (a >= b)
{
  c = d + b; // MyComment 1
  d = d + 1;
}
else
  c = d - a; // MyComment 2
Syntactically identical code snippets, except for
differences in white space, layout and comments.
6. Background – Type 2
if (a >= b) {
  c = d + b; // Comment 1
  d = d + 1;
} else
  c = d - a; // Comment 2

if (a >= y)
{
  x = d + y; // MyComment1
  d = d + 10;
}
else
  x = d - a; // MyComment2
Syntactically identical code snippets, except for
differences in identifier names, literal values, white
space, layout and comments.
7. Background – Type 3
if (a >= b) {
  c = d + b; // Comment 1
  d = d + 1;
} else
  c = d - a; // Comment 2

if (a >= y)
{
  x = d + y; // MyComment1
}
else {
  x = d - a - 10; // MyComment2
}
Syntactically similar code snippets that differ at the statement level.
8. Background – Type 4
int i, j=1;
for (i=1; i<=VALUE; i++)
j=j*i;
int factorial(int n)
{
if (n == 0)
return 1;
else
return n * factorial(n-1);
}
Syntactically dissimilar code snippets that implement
the same functionality.
9. Background – Clone Detection
• Clone Detection Tool
• Locates clones within a software system or source repository.
• Types: Classical, Semantic, Big Data, Search
• Performance Evaluation
• Recall
• Precision
• Benchmarks Focus on Recall
10. Motivation
• Emerging Big Data Clone Detection and Search
• Applications:
• Building inter-project clone corpora.
• Mine new APIs
• Detect License Violation
• Find Code Examples
• …
• Problem:
• No Benchmark for Big Data Clone Detection!
• Existing benchmarks are small or outdated.
11. Methodology
• Mine Clones in Big Data
• Target: IJaDataset 2.0
• 25,000 open-source systems crawled
• 2.4 million files, 365MLOC
• Challenge
• 250 million candidate function clone pairs.
• Need
• To reduce and optimize search space.
12. Methodology – General Procedure
Solution: Mine Clones of Specific Functionalities
1. Select a Functionality.
2. Create a Search Heuristic
3. Identify candidate snippets using heuristic.
4. Manually tag each snippet as true or false positive.
5. Populate the benchmark with true/false clones.
13. Methodology – Mine & Tag Snippets
[Pipeline diagram: Select Functionality → Identify Possible Implementations → Create Specification → Create Sample Snippets → Create Search Heuristic. The search heuristic and specification are used to build a candidate set from IJaDataset; judges then tag each candidate as a true or false positive.]
14. Methodology – Mine & Tag Snippets
e.g., Bubble Sort
15. Methodology – Mine & Judge Snippets
16. Methodology – Mine & Judge Snippets
e.g., “Sort any linear collection of
data using the bubble sort
algorithm.”
17. Methodology – Mine & Judge Snippets
Hello, my name is Jeff Svajlenko, and today I am presenting the paper Towards a Big Data Curated Benchmark of Inter-Project Code Clones. Contributors to this project include myself, Judith Islam, Iman Keivanloo, Chanchal Roy, and Mohammad Mamun Mia.
Today I will be talking about BigCloneBench, which is a big data clone benchmark containing 6 million clone pairs of 10 distinct functionalities. We created this benchmark for evaluating clone detection performance, especially that of emerging big data clone detection algorithms.
I will begin by discussing some background details, before proceeding into the motivation of our work. I will then discuss our methodology for building this benchmark, and a summary of its contents. I will discuss how the benchmark can be used to evaluate the clone detectors, before proceeding into future work.
I will begin the background with some code clone definitions.
A code snippet is a contiguous region of code, specified by its source file and its start and end lines.
A clone pair is a pair of code snippets that are similar.
Clone pairs are sometimes summarized as a clone class, which is a set of code snippets that are similar.
Clones are assigned a clone type, which describes this similarity. The community agrees upon four fundamental clone types.
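These definitions can be sketched as simple Java types. The names and shapes here are illustrative, not the benchmark's actual schema:

```java
import java.util.Set;

public class CloneModel {
    // A contiguous region of code: (srcfile, startline, endline).
    record Snippet(String srcFile, int startLine, int endLine) {}

    // A pair of similar snippets, labelled with a clone type (1-4).
    record ClonePair(Snippet a, Snippet b, int cloneType) {}

    // A clone class: a set of snippets that are all similar to one another.
    record CloneClass(Set<Snippet> members) {}
}
```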
A Type-1 clone contains code snippets that are syntactically identical, except for differences in white-space, layout and comments.
A Type-2 clone extends this definition to also allow differences in variable names and literal values.
A Type-3 clone includes snippets that are syntactically similar, but contain differences at the statement level. Snippets may have statements added, removed or modified with respect to each other. There is no agreement on the required similarity of a Type-3 clone, although studies typically require them to be 70-90% similar by syntax, using some similarity metric.
A Type-4 clone includes code snippets that are syntactically dissimilar but implement the same functionality. For example, here we have two very syntactically dissimilar implementations of factorial.
Clone detection tools locate clones within a software system or source repository. There are multiple types of clone detection tools. Classical clone detection tools locate syntactically similar clones, the first three clone types, within a single subject system or between a handful of subject systems. Semantic clone detectors aim to expand detection to Type-4 clones, using methods such as program dependency graphs. Big data clone detection is concerned with the detection of clones between thousands of subject systems. Search algorithms search for clones of a target code snippet, often within big data repositories.
All of these clone detectors are evaluated using the information-retrieval metrics recall and precision. Recall is the ratio of the clones within a subject system or repository that a clone detector is able to detect, while precision is the ratio of the clones reported by the clone detector that are in fact true clones, not false positives.
Benchmarks typically focus on enabling the measurement of recall. Tool developers can measure their tool's precision by validating their tool's output for a variety of subject systems. However, it is impossible for them to measure recall without pre-knowledge of the clones that exist within a subject system.
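These two metrics can be sketched as set operations, assuming each clone pair is represented by some canonical string key; the class and method names are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

public class IRMetrics {
    // Recall: fraction of the reference (known true) clones the tool detected.
    static double recall(Set<String> reference, Set<String> detected) {
        Set<String> hit = new HashSet<>(reference);
        hit.retainAll(detected); // intersection of reference and detected
        return (double) hit.size() / reference.size();
    }

    // Precision: fraction of the tool's reported clones that are true clones.
    static double precision(Set<String> reference, Set<String> detected) {
        Set<String> hit = new HashSet<>(detected);
        hit.retainAll(reference);
        return (double) hit.size() / detected.size();
    }
}
```

Recall divides by the size of the reference set (which a benchmark supplies), while precision divides by the size of the tool's own output, which is why a benchmark is needed for the former but not the latter.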
The motivation for our work is emerging big data clone detection and search algorithms, which have numerous applications, such as: building inter-project clone corpora for research studies and as a data set for some tools, mining for new APIs, detecting licensing violations, finding code examples, and so on.
However, there are currently no benchmarks for big data clone tools. Additionally, the current benchmarks that target classical clone detection tools are either too small for big data or outdated. For example, the most common benchmark was published in 2007, and was built using tools circa 2002.
We built our benchmark by mining clones in big data. Our target was the inter-project Java dataset IJaDataset 2.0, which contains source code crawled from 25,000 open-source systems, including 2.4 million files and 365 million lines of code.
The challenge is that even if we only consider function clones, there are 250 million candidate function clone pairs to examine. Looking at any significant number of these candidates is impractical, and a random sample of them is also not efficient due to the rarity of inter-project clones.
We therefore needed to reduce and optimize the search space for locating clones.
Our solution was to mine clones of specific functionalities.
This is an overview of our procedure. We begin by selecting a functionality. We then create a search heuristic that can identify if a snippet *might* implement the functionality. We use this heuristic to identify the candidate snippets in the dataset. The candidates are tagged by expert judges as true or false positives of the functionality. We then use these snippets to populate the benchmark with true and false clones. This process is repeated for any number of functionalities.
Next I will go into the details of these stages.
This diagram shows the snippet mining and tagging steps.
We begin by selecting a functionality that we believe will appear many times in the dataset.
For example, we chose the functionality Bubble Sort.
Next, we research the different ways this functionality can be implemented in Java. We investigate official and 3rd party libraries, instructional material such as text books and online tutorials, as well as online discussions such as Stack Overflow.
For bubble sort, we found many possible implementations, with variations in data types, data structures, loop structures, and data-comparison logic.
Next, considering the previous research, we create a formal specification for the functionality. This is the minimum set of features or steps a snippet must realize to be a true positive of this functionality.
For bubble sort, this was “Sort any linear collection of data using the bubble sort algorithm”.
As part of our research, we collect sample snippets that implement the functionality. These are minimum working examples that meet the specification. These are added to the dataset as additional crawled source code, and play a role in populating the benchmark.
We create a search heuristic, which is a logical combination of keywords and source-code patterns that are intrinsic to the identified implementations of the functionality. The search heuristic can therefore identify whether a snippet might implement the functionality.
For example, bubble sort.
The search heuristic is executed for each snippet in the dataset, producing a set of candidates that might implement the functionality.
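A search heuristic of this kind can be sketched in Java as a keyword test combined with structural source-code patterns. The patterns below are hypothetical stand-ins, not the benchmark's actual bubble sort heuristic:

```java
import java.util.regex.Pattern;

public class BubbleSortHeuristic {
    // Hypothetical patterns; the benchmark's real heuristics are hand-crafted
    // per functionality from the identified implementations.
    private static final Pattern KEYWORD =
        Pattern.compile("(?i)bubble\\s*sort");
    private static final Pattern NESTED_LOOPS =
        Pattern.compile("for\\s*\\([^)]*\\)[\\s\\S]*for\\s*\\([^)]*\\)");
    private static final Pattern SWAP_TEMP =
        Pattern.compile("\\btemp\\b|\\btmp\\b");

    // A snippet is a candidate if it names the algorithm, or if it contains
    // nested loops together with a swap via a temporary variable.
    static boolean isCandidate(String source) {
        return KEYWORD.matcher(source).find()
            || (NESTED_LOOPS.matcher(source).find()
                && SWAP_TEMP.matcher(source).find());
    }
}
```

The heuristic only has to be cheap and permissive: false positives are acceptable because the judges filter them out in the tagging step.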
For this version of the benchmark we have chosen the function snippet granularity, as functions best encapsulate whole functionalities, and it is a granularity that nearly all clone detectors support.
The candidate snippets are then given to expert judges, who tag them as true or false positives using the formal specification. A snippet is tagged as a true positive so long as it meets the specification, even if it performs additional related or unrelated steps. Otherwise it is tagged as a false positive. The specification helps reduce individual bias in the tagging process.
From the previous tasks, we have identified a set of code snippets that implement a functionality. This includes the sample snippets of the functionality, and the candidate snippets tagged as true positives. Together they form a clone class of snippets that are similar by functionality. If this clone class has n such snippets, we have found on the order of n-squared clone pairs, specifically n(n-1)/2. These clones are then automatically typified using a TXL-based clone typifier. This perfectly typifies Type-1 and Type-2 clones, but the separation of Type-3 and Type-4 clones is trickier, which I will discuss in a few slides.
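Turning a tagged clone class into clone pairs can be sketched as enumerating all unordered pairs; a class of n snippets yields n(n-1)/2 pairs:

```java
import java.util.ArrayList;
import java.util.List;

public class PairBuilder {
    // Enumerate every unordered pair from a clone class of n snippets,
    // producing n * (n - 1) / 2 clone pairs.
    static <T> List<List<T>> allPairs(List<T> cloneClass) {
        List<List<T>> pairs = new ArrayList<>();
        for (int i = 0; i < cloneClass.size(); i++) {
            for (int j = i + 1; j < cloneClass.size(); j++) {
                pairs.add(List.of(cloneClass.get(i), cloneClass.get(j)));
            }
        }
        return pairs;
    }
}
```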
We have also identified many false positives of the functionality. We know that the sample snippets exactly meet the specification of the functionality, and the false positives do not. So each pair of sample snippet and false positive snippet is a false clone. While they may share some syntactical similarity, this similarity is coincidental.
We executed this procedure for ten distinct functionalities. In total 60 thousand snippets were tagged by three graduate student judges.
In total we have identified 6.2 million clone pairs.
Each of these clone pairs share semantics, specifically the functionality they implement.
Here we summarize the syntactical similarity by dividing the clones by their clone type. Type-1 and Type-2 are easy, but it is difficult to separate the Type-3 and Type-4 clones, due to the lack of agreement on where Type-3 ends and Type-4 begins. Instead we separate them by their syntactical similarity. Syntactical clone detectors are typically configured to locate clones that share 70% or more of their syntax, which we consider strong Type-3 clones. We consider the clones that share at least half their syntax, but less than 70%, moderate Type-3 clones. We then consider the clones that share less than half their syntax to be, subjectively, either weak Type-3 clones or Type-4 clones.
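The similarity bins described above can be sketched as a simple classifier. The thresholds are as stated in the talk; the labels apply only to clones already known not to be Type-1 or Type-2, and the similarity metric itself is left abstract:

```java
public class CloneCategory {
    // Bin a non-Type-1/2 clone pair by its syntactical similarity in [0, 1].
    static String categorize(double similarity) {
        if (similarity >= 0.7) return "Strong Type-3";
        if (similarity >= 0.5) return "Moderate Type-3";
        return "Weak Type-3 / Type-4"; // boundary is subjective
    }
}
```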
We also identified 259 thousand false clones.
The benchmark can be used to evaluate clone detectors by measuring recall and precision.
We measure recall as the ratio of the true clones in the benchmark that the detector was able to detect.
While our focus was recall, we provide a limited precision measure. Precision is measured as the ratio of the known clones detected by the tool that are true clones. It ignores any clones that are unknown in the benchmark. The dataset contains many orders of magnitude more false clones than true clones, so our sampling of the false clones is small. So this is not a replacement for a standard measurement of precision, but it does provide some hints.
The benchmark was designed for the big data tools, but it may also be used to evaluate the other classifications of tools.
The big data tools can be executed for the entire dataset. Recall can then be measured for the full benchmark, but also per clone type, per functionality, or even per ranges of syntactical similarity.
Classical clone detection tools won’t be scalable to the full benchmark. However, they could be evaluated for subsets of the benchmark. Confidence can be achieved by evaluating the tool for many subsets. Subsets could be randomly selected, or even be the intra-project clones within one of the subject systems in the dataset.
Semantic clone detectors will also need to be evaluated for subsets due to potential scalability issues. Good subsets would be the clones of a functionality.
Clone search algorithms can be evaluated by using one of the sample snippets as a target, and evaluating which of the true positive snippets the tool locates.
As future work, we plan to expand the benchmark in a number of ways. We will grow the benchmark by adding new functionalities, which will increase the number and variety of clones in the benchmark. We plan to increase the number of judges, including using multiple judges per functionality. This will improve our data confidence, and allow us to measure the tagging accuracy. We also plan to investigate and add more clone metadata (such as additional similarity metrics) to better describe our clones, and to investigate how we can better separate our Type-3 and Type-4 clones.
The benchmark is available at the following URL. Please feel free to ask me any questions.