TOWARDS A BIG DATA CURATED BENCHMARK OF INTER-PROJECT CODE CLONES
Jeff Svajlenko, Judith F. Islam, Iman Keivanloo*, Chanchal Roy, Mohammad Mamun Mia
Computer Science, University of Saskatchewan
*Electrical and Computer Engineering, Queen’s University
Overview
• BigCloneBench
  • Big Data Clone Benchmark
  • 6 million clone pairs of 10 functionalities.
• To Evaluate Clone Detection Performance
  • Emerging Big Data Clone Detection Tools
ICSME 2014
Overview
• Background
• Motivation
• Methodology
• Data Summary
• Evaluating Clone Detectors
• Future Work
Background – Code Clone
Code Snippet: A contiguous region of code, identified by (srcfile, startline, endline).
Clone Pair: A pair of code snippets that are similar.
Clone Class: A set of code snippets that are similar.
Clone Type: Describes how the snippets are similar.
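The definitions above map onto a small data model. The following is a minimal sketch in Python, assuming hypothetical names (`Snippet`, `clone_pair`) and invented file names; it is not the benchmark's actual schema.

```python
from typing import NamedTuple


class Snippet(NamedTuple):
    """A region of code: (srcfile, startline, endline)."""
    srcfile: str
    startline: int
    endline: int


def clone_pair(a, b):
    """An unordered pair, so (a, b) and (b, a) denote the same clone pair."""
    return frozenset((a, b))


# Invented example snippets, for illustration only.
s1 = Snippet("A.java", 10, 25)
s2 = Snippet("B.java", 40, 58)
s3 = Snippet("C.java", 3, 19)

# A clone class is a set of mutually similar snippets.
clone_class = {s1, s2, s3}
```

Storing a pair as a `frozenset` is one design choice among several; it keeps a detector's (b, a) report from being counted as different from the benchmark's (a, b).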
Background – Type 1
Original:
    if (a>=b) {
        c = d+b; // Comment 1
        d = d+1;
    } else
        c = d-a; // Comment 2
Type-1 clone:
    if (a >= b)
    {
        c = d + b; // MyComment 1
        d = d + 1;
    }
    else
        c = d - a; // MyComment 2
Syntactically identical code snippets, except for differences in white space, layout and comments.
Background – Type 2
Original:
    if (a>=b) {
        c = d+b; // Comment 1
        d = d+1;
    } else
        c = d-a; // Comment 2
Type-2 clone:
    if (a >= y)
    {
        x = d + y; // MyComment1
        d = d + 10;
    }
    else
        x = d - a; // MyComment2
Syntactically identical code snippets, except for differences in identifier names, literal values, white space, layout and comments.
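The Type-1 and Type-2 definitions can be read as two levels of normalization before comparison. The sketch below is a rough Python illustration under that reading, not the method of any particular clone detector: the names `TOKEN`, `KEYWORDS`, `tokens`, `normalized` and `clone_type` are hypothetical, and the tokenizer only handles simple C/Java-like code without string literals.

```python
import re

# Comments, identifiers/keywords, integer literals, or any other single symbol.
TOKEN = re.compile(r"//[^\n]*|/\*.*?\*/|[A-Za-z_]\w*|\d+|\S", re.DOTALL)
KEYWORDS = {"if", "else", "for", "while", "return", "int"}


def tokens(code):
    """Tokenize, dropping comments; white space and layout vanish on the way."""
    return [t for t in TOKEN.findall(code) if not t.startswith(("//", "/*"))]


def normalized(code):
    """Additionally replace identifiers with ID and integer literals with LIT."""
    out = []
    for t in tokens(code):
        if t[0].isdigit():
            out.append("LIT")
        elif (t[0].isalpha() or t[0] == "_") and t not in KEYWORDS:
            out.append("ID")
        else:
            out.append(t)
    return out


def clone_type(a, b):
    """Return 'Type-1', 'Type-2', or None if neither definition applies."""
    if tokens(a) == tokens(b):
        return "Type-1"
    if normalized(a) == normalized(b):
        return "Type-2"
    return None


original = "if (a>=b) {\n c = d+b; // Comment 1\n d = d+1;\n} else\n c = d-a; // Comment 2"
type1 = "if (a >= b)\n{\n c = d + b; // MyComment 1\n d = d + 1;\n}\nelse\n c = d - a; // MyComment 2"
type2 = "if (a >= y)\n{\n x = d + y; // MyComment1\n d = d + 10;\n}\nelse\n x = d - a; // MyComment2"
```

Under these assumptions, `clone_type(original, type1)` yields `"Type-1"` and `clone_type(original, type2)` yields `"Type-2"`. A pair that differs at the statement level falls through to `None` and would need a similarity measure instead.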
Background – Type 3
Original:
    if (a>=b) {
        c = d+b; // Comment 1
        d = d+1;
    } else
        c = d-a; // Comment 2
Type-3 clone:
    if (a >= y)
    {
        x = d + y; // MyComment1
    }
    else {
        x = d - a - 10; // MyComment2
    }
Syntactically similar code snippets that differ at the statement level.
Background – Type 4
Snippet 1:
    int i, j=1;
    for (i=1; i<=VALUE; i++)
        j=j*i;
Snippet 2:
    int factorial(int n)
    {
        if (n == 0)
            return 1;
        else
            return n * factorial(n-1);
    }
Syntactically dissimilar code snippets that implement the same functionality.
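To make "same functionality" concrete, the two snippets can be transliterated and compared on their outputs. This is a Python transliteration for illustration only; the function names are hypothetical, and `value` stands in for the slide's `VALUE`.

```python
def factorial_iterative(value):
    """Loop form of the first snippet: j accumulates 1 * 2 * ... * value."""
    j = 1
    for i in range(1, value + 1):
        j = j * i
    return j


def factorial_recursive(n):
    """Recursive form of the second snippet."""
    if n == 0:
        return 1
    return n * factorial_recursive(n - 1)


# The two share almost no syntax, yet agree on every input tried here.
same = all(factorial_iterative(n) == factorial_recursive(n) for n in range(10))
```

Agreement on sampled inputs only suggests equivalence; it does not prove it.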
Background – Clone Detection
• Clone Detection Tool
  • Locates clones within a software system or source repository.
  • Types: Classical, Semantic, Big Data, Search
• Performance Evaluation
  • Recall
  • Precision
  • Benchmarks Focus on Recall
Motivation
• Emerging Big Data Clone Detection and Search
• Applications:
  • Build inter-project clone corpora
  • Mine new APIs
  • Detect license violations
  • Find code examples
  • …
• Problem:
  • No Benchmark for Big Data Clone Detection!
  • Existing benchmarks are small or outdated.
Methodology
• Mine Clones in Big Data
  • Target: IJaDataset 2.0
  • 25,000 open-source systems crawled
  • 2.4 million files, 365 MLOC
• Challenge
  • 250 million candidate function clone pairs.
• Need
  • To reduce and optimize search space.
Methodology – General Procedure
Solution: Mine Clones of Specific Functionalities
1. Select a functionality.
2. Create a search heuristic.
3. Identify candidate snippets using the heuristic.
4. Manually tag each snippet as a true or false positive.
5. Populate the benchmark with true/false clones.
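Steps 3 and 4 can be sketched in a few lines of Python. This is a toy illustration, not the authors' tooling: the corpus, the function names (`build_candidate_set`, `tag`) and the regular expression are all assumptions, and the `judge` callable stands in for a human judge's manual decision.

```python
import re


def build_candidate_set(snippets, heuristic):
    """Step 3: keep the snippets whose code matches the search heuristic."""
    pattern = re.compile(heuristic)
    return [name for name, code in snippets.items() if pattern.search(code)]


def tag(candidates, judge):
    """Step 4: split candidates into true and false positives."""
    true_pos = [c for c in candidates if judge(c)]
    false_pos = [c for c in candidates if not judge(c)]
    return true_pos, false_pos


# Invented toy corpus; a real run would scan a large source repository.
snippets = {
    "bubble_sort": "if (arr[i] > arr[i+1]) { t = arr[i]; arr[i] = arr[i+1]; arr[i+1] = t; }",
    "shift_left": "for (i = 0; i < n-1; i++) arr[i] = arr[i+1];",
    "sum": "for (i = 0; i < n; i++) total += arr[i];",
}

# Hypothetical heuristic for bubble sort: an element overwritten by its neighbour.
heuristic = r"arr\[i\]\s*=\s*arr\[i\+1\]"

candidates = build_candidate_set(snippets, heuristic)
true_pos, false_pos = tag(candidates, judge=lambda name: name == "bubble_sort")
```

Here the heuristic admits `shift_left` as a candidate even though it does not sort anything, which is why the manual tagging step is needed: the heuristic only narrows the search space.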
Methodology – Mine & Tag Snippets
[Workflow diagram, repeated across eight build slides. Steps: Select Functionality; Identify Possible Implementations; Create Specification; Create Sample Snippets; Create Search Heuristic; Build Candidate Set; Tag. Other labels: IJaDataset, Judges, Specification, Sample, Search Heuristic, Candidate, True +, False +.]
Examples shown on the builds:
• e.g., Bubble Sort
• e.g., “Sort any linear collection of data using the bubble sort algorithm.”
• e.g., arr[i] = arr[i+1]
Methodology – True Clones
$\#\,\text{clone pairs} = \dfrac{n(n-1)}{2} = O(n^2)$
$n = \#\,\text{true positive snippets} + \#\,\text{sample snippets}$
Metadata:
• Clone Type
• Syntactical Similarity
[Diagram labels: True +, Sample, Clone Class, Clone Pairs (f1, f2, m), Typify Clone Pairs, Benchmark.]
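The pair count follows from taking every unordered pair within the clone class. A minimal Python sketch, assuming a hypothetical function name and placeholder snippet identifiers:

```python
from itertools import combinations


def true_clone_pairs(true_positives, samples):
    """Every unordered pair among the true positive and sample snippets."""
    snippets = list(true_positives) + list(samples)
    return list(combinations(snippets, 2))


# Placeholder identifiers: 3 true positives + 2 samples, so n = 5.
pairs = true_clone_pairs(["t1", "t2", "t3"], ["s1", "s2"])
n = 5
expected = n * (n - 1) // 2  # 10 pairs for n = 5
```

The quadratic growth is what lets a modest number of tagged snippets yield a very large number of clone pairs.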
Methodology – False Clones
[Diagram labels: Sample, False +, False Clone Pairs, Benchmark.]
$\#\,\text{false clone pairs} = s \cdot f$
$s = \#\,\text{sample snippets}$
$f = \#\,\text{false positive snippets}$
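The product s · f corresponds to pairing each sample snippet with each false positive snippet. A short Python sketch of that reading, with a hypothetical function name and placeholder identifiers:

```python
from itertools import product


def false_clone_pairs(samples, false_positives):
    """Pair every sample snippet with every false positive snippet."""
    return list(product(samples, false_positives))


# Placeholder identifiers: s = 2 samples, f = 3 false positives.
false_pairs = false_clone_pairs(["s1", "s2"], ["f1", "f2", "f3"])
```

Unlike the true clone pairs, this construction forms no pairs among the false positives themselves.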
Data Summary
• 10 Functionalities
  • Web Download
  • Secure Cryptographic Hashing
  • Copy File
  • Decompress Zip File
  • FTP Authenticated Login
  • Bubble Sort
  • Initialize Scrolling Graphical Viewer with Model
  • Setup Scrolling Graphical Viewer Event Handler
  • Create Java Project (Eclipse API)
  • Database Update and Rollback (SQL)
• Three Judges Tagged 60 Thousand Snippets
Data Summary
• 6.2 Million True Clones
• 259 Thousand False Clones
Clone Type                        # Clone Pairs
Type-1                                   16,168
Type-2                                    3,733
Strong Type-3 [70-100)%                  11,286
Moderate Type-3 [50-70)%                 53,880
Weak Type-3 & Type-4 [0-50)%          6,079,886
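The table can be checked and applied with a few lines of Python. The counts are copied from the table; the function name `similarity_category` is hypothetical, and it assumes syntactical similarity is given as a fraction in [0, 1) for a pair that is neither Type-1 nor Type-2.

```python
# Clone pair counts per category, as listed in the table.
COUNTS = {
    "Type-1": 16_168,
    "Type-2": 3_733,
    "Strong Type-3": 11_286,
    "Moderate Type-3": 53_880,
    "Weak Type-3 & Type-4": 6_079_886,
}

total = sum(COUNTS.values())  # 6,164,953, i.e. about 6.2 million
shares = {name: count / total for name, count in COUNTS.items()}


def similarity_category(similarity):
    """Bucket a non-Type-1/Type-2 pair by similarity, using the table's ranges."""
    if not 0.0 <= similarity < 1.0:
        raise ValueError("similarity must lie in [0, 1)")
    if similarity >= 0.7:
        return "Strong Type-3"
    if similarity >= 0.5:
        return "Moderate Type-3"
    return "Weak Type-3 & Type-4"
```

The rows sum to 6,164,953, consistent with the 6.2 million figure, and the weakest category accounts for over 98% of the pairs.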
Evaluating Clone Detectors
$\text{recall} = \dfrac{|D \cap B_{tc}|}{|B_{tc}|}$
$\text{precision} = \dfrac{|D \cap B_{tc}|}{|D \cap (B_{tc} \cup B_{fc})|}$
$D$ = Detected Candidate Clones
$B_{tc}$ = True Clone Pairs in Benchmark
$B_{fc}$ = False Clone Pairs in Benchmark
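Both measures reduce to set operations over clone pairs. The sketch below is a minimal Python illustration with invented snippet ids; the helper names (`pair`, `recall`, `precision`) are hypothetical, and it assumes a detected pair counts only when it matches a benchmark pair exactly.

```python
def pair(a, b):
    """Unordered clone pair, so (a, b) and (b, a) compare equal."""
    return frozenset((a, b))


def recall(detected, true_clones):
    """|D ∩ Btc| / |Btc|"""
    return len(detected & true_clones) / len(true_clones)


def precision(detected, true_clones, false_clones):
    """|D ∩ Btc| / |D ∩ (Btc ∪ Bfc)|: only pairs the benchmark has judged count."""
    judged = detected & (true_clones | false_clones)
    return len(detected & true_clones) / len(judged)


# Invented snippet ids, for illustration only.
B_tc = {pair(1, 2), pair(1, 3), pair(2, 3), pair(4, 5)}
B_fc = {pair(1, 9), pair(2, 9)}
D = {pair(2, 1), pair(1, 3), pair(1, 9), pair(7, 8)}

r = recall(D, B_tc)           # 2 of the 4 true clone pairs are detected
p = precision(D, B_tc, B_fc)  # 2 of the 3 judged detections are true clones
```

The detected pair (7, 8) appears in neither benchmark set, so it affects neither measure. Both functions divide by zero on empty denominators; a real harness would need to guard against that.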
Evaluating Clone Detectors – Usage
• Big Data Clone Detectors
• Classical Clone Detection Tools
• Semantic Clone Detectors
• Clone Search
Future Work
• More functionalities.
• More judges.
• More metadata.
• Better Type-3/Type-4 separation.
Questions?
Latest Version:
github.com/clonebench/bigclonebench
28ICSME 2014

More Related Content

What's hot

Image classification on Imagenet (D1L4 2017 UPC Deep Learning for Computer Vi...
Image classification on Imagenet (D1L4 2017 UPC Deep Learning for Computer Vi...Image classification on Imagenet (D1L4 2017 UPC Deep Learning for Computer Vi...
Image classification on Imagenet (D1L4 2017 UPC Deep Learning for Computer Vi...Universitat Politècnica de Catalunya
 
Case study on deep learning
Case study on deep learningCase study on deep learning
Case study on deep learningHarshitBarde
 
Graph neural network 2부 recommendation 개요
Graph neural network  2부  recommendation 개요Graph neural network  2부  recommendation 개요
Graph neural network 2부 recommendation 개요seungwoo kim
 
Master's Thesis Presentation
Master's Thesis PresentationMaster's Thesis Presentation
Master's Thesis PresentationWajdi Khattel
 
Style space analysis paper review !
Style space analysis paper review !Style space analysis paper review !
Style space analysis paper review !taeseon ryu
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...Edge AI and Vision Alliance
 
Introduction to Visual transformers
Introduction to Visual transformers Introduction to Visual transformers
Introduction to Visual transformers leopauly
 
Resnet.pptx
Resnet.pptxResnet.pptx
Resnet.pptxYanhuaSi
 
Handling inline assembly in Clang and LLVM
Handling inline assembly in Clang and LLVMHandling inline assembly in Clang and LLVM
Handling inline assembly in Clang and LLVMMin-Yih Hsu
 
Notes of AI for everyone - by Andrew Ng
Notes of AI for everyone - by Andrew NgNotes of AI for everyone - by Andrew Ng
Notes of AI for everyone - by Andrew Ngmgopalani
 
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersEmerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersSungchul Kim
 
BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 Linaro
 
Skip, residual and densely connected RNN architectures
Skip, residual and densely connected RNN architecturesSkip, residual and densely connected RNN architectures
Skip, residual and densely connected RNN architecturesfgodin
 
Survey of Attention mechanism
Survey of Attention mechanismSurvey of Attention mechanism
Survey of Attention mechanismSwatiNarkhede1
 
[PR12] Generative Models as Distributions of Functions
[PR12] Generative Models as Distributions of Functions[PR12] Generative Models as Distributions of Functions
[PR12] Generative Models as Distributions of FunctionsJaeJun Yoo
 
用 Kotlin 做自動化工具
用 Kotlin 做自動化工具用 Kotlin 做自動化工具
用 Kotlin 做自動化工具Shengyou Fan
 
Transformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxTransformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxDeep Learning Italia
 

What's hot (20)

Image classification on Imagenet (D1L4 2017 UPC Deep Learning for Computer Vi...
Image classification on Imagenet (D1L4 2017 UPC Deep Learning for Computer Vi...Image classification on Imagenet (D1L4 2017 UPC Deep Learning for Computer Vi...
Image classification on Imagenet (D1L4 2017 UPC Deep Learning for Computer Vi...
 
Case study on deep learning
Case study on deep learningCase study on deep learning
Case study on deep learning
 
Graph neural network 2부 recommendation 개요
Graph neural network  2부  recommendation 개요Graph neural network  2부  recommendation 개요
Graph neural network 2부 recommendation 개요
 
Gnn overview
Gnn overviewGnn overview
Gnn overview
 
Master's Thesis Presentation
Master's Thesis PresentationMaster's Thesis Presentation
Master's Thesis Presentation
 
Style space analysis paper review !
Style space analysis paper review !Style space analysis paper review !
Style space analysis paper review !
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
 
Introduction to Visual transformers
Introduction to Visual transformers Introduction to Visual transformers
Introduction to Visual transformers
 
LSTM
LSTMLSTM
LSTM
 
Self-organizing map
Self-organizing mapSelf-organizing map
Self-organizing map
 
Resnet.pptx
Resnet.pptxResnet.pptx
Resnet.pptx
 
Handling inline assembly in Clang and LLVM
Handling inline assembly in Clang and LLVMHandling inline assembly in Clang and LLVM
Handling inline assembly in Clang and LLVM
 
Notes of AI for everyone - by Andrew Ng
Notes of AI for everyone - by Andrew NgNotes of AI for everyone - by Andrew Ng
Notes of AI for everyone - by Andrew Ng
 
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersEmerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
 
BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2
 
Skip, residual and densely connected RNN architectures
Skip, residual and densely connected RNN architecturesSkip, residual and densely connected RNN architectures
Skip, residual and densely connected RNN architectures
 
Survey of Attention mechanism
Survey of Attention mechanismSurvey of Attention mechanism
Survey of Attention mechanism
 
[PR12] Generative Models as Distributions of Functions
[PR12] Generative Models as Distributions of Functions[PR12] Generative Models as Distributions of Functions
[PR12] Generative Models as Distributions of Functions
 
用 Kotlin 做自動化工具
用 Kotlin 做自動化工具用 Kotlin 做自動化工具
用 Kotlin 做自動化工具
 
Transformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxTransformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptx
 

Similar to BigCloneBench

A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...PAPIs.io
 
Applications of Machine Learning and Metaheuristic Search to Security Testing
Applications of Machine Learning and Metaheuristic Search to Security TestingApplications of Machine Learning and Metaheuristic Search to Security Testing
Applications of Machine Learning and Metaheuristic Search to Security TestingLionel Briand
 
IBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for EveryoneIBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for EveryoneNick Pentreath
 
Machine Learning for (JVM) Developers
Machine Learning for (JVM) DevelopersMachine Learning for (JVM) Developers
Machine Learning for (JVM) DevelopersMateusz Dymczyk
 
Automated Production Ready ML at Scale
Automated Production Ready ML at ScaleAutomated Production Ready ML at Scale
Automated Production Ready ML at ScaleDatabricks
 
Software Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesSoftware Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesTao Xie
 
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance ManagementAn AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance ManagementDatabricks
 
Back to the Basics: Principles for Constructing Quality Software
Back to the Basics: Principles for Constructing Quality SoftwareBack to the Basics: Principles for Constructing Quality Software
Back to the Basics: Principles for Constructing Quality SoftwareTechWell
 
"Hands Off! Best Practices for Code Hand Offs"
"Hands Off!  Best Practices for Code Hand Offs""Hands Off!  Best Practices for Code Hand Offs"
"Hands Off! Best Practices for Code Hand Offs"Naomi Dushay
 
CTFs, Bugbounty and your security career
CTFs, Bugbounty and your security careerCTFs, Bugbounty and your security career
CTFs, Bugbounty and your security careerIbrahim El-Sayed
 
Proactive Security AppSec Case Study
Proactive Security AppSec Case StudyProactive Security AppSec Case Study
Proactive Security AppSec Case StudyAndy Hoernecke
 
AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software EngineeringMiroslaw Staron
 
Agile development of data science projects | Part 1
Agile development of data science projects | Part 1 Agile development of data science projects | Part 1
Agile development of data science projects | Part 1 Anubhav Dhiman
 
Protect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesProtect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesLeo Loobeek
 
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python”
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python” Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python”
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python” Lviv Startup Club
 
A whirlwind tour of graph databases
A whirlwind tour of graph databasesA whirlwind tour of graph databases
A whirlwind tour of graph databasesjexp
 
2016 03-16 digital energy luncheon
2016 03-16 digital energy luncheon2016 03-16 digital energy luncheon
2016 03-16 digital energy luncheonMark Reynolds
 
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malwareDefcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malwareDaveEdwards12
 

Similar to BigCloneBench (20)

A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
 
Applications of Machine Learning and Metaheuristic Search to Security Testing
Applications of Machine Learning and Metaheuristic Search to Security TestingApplications of Machine Learning and Metaheuristic Search to Security Testing
Applications of Machine Learning and Metaheuristic Search to Security Testing
 
IBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for EveryoneIBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for Everyone
 
Machine Learning for (JVM) Developers
Machine Learning for (JVM) DevelopersMachine Learning for (JVM) Developers
Machine Learning for (JVM) Developers
 
Automated Production Ready ML at Scale
Automated Production Ready ML at ScaleAutomated Production Ready ML at Scale
Automated Production Ready ML at Scale
 
Software Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesSoftware Analytics - Achievements and Challenges
Software Analytics - Achievements and Challenges
 
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance ManagementAn AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
 
Back to the Basics: Principles for Constructing Quality Software
Back to the Basics: Principles for Constructing Quality SoftwareBack to the Basics: Principles for Constructing Quality Software
Back to the Basics: Principles for Constructing Quality Software
 
"Hands Off! Best Practices for Code Hand Offs"
"Hands Off!  Best Practices for Code Hand Offs""Hands Off!  Best Practices for Code Hand Offs"
"Hands Off! Best Practices for Code Hand Offs"
 
CTFs, Bugbounty and your security career
CTFs, Bugbounty and your security careerCTFs, Bugbounty and your security career
CTFs, Bugbounty and your security career
 
Proactive Security AppSec Case Study
Proactive Security AppSec Case StudyProactive Security AppSec Case Study
Proactive Security AppSec Case Study
 
EpiServer find Macaw
EpiServer find MacawEpiServer find Macaw
EpiServer find Macaw
 
AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software Engineering
 
Agile development of data science projects | Part 1
Agile development of data science projects | Part 1 Agile development of data science projects | Part 1
Agile development of data science projects | Part 1
 
Protect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesProtect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying Techniques
 
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python”
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python” Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python”
Borys Rybak “Azure Machine Learning Studio & Azure Workbench & R + Python”
 
ICPC06.ppt
ICPC06.pptICPC06.ppt
ICPC06.ppt
 
A whirlwind tour of graph databases
A whirlwind tour of graph databasesA whirlwind tour of graph databases
A whirlwind tour of graph databases
 
2016 03-16 digital energy luncheon
2016 03-16 digital energy luncheon2016 03-16 digital energy luncheon
2016 03-16 digital energy luncheon
 
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malwareDefcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
 

Recently uploaded

Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Intelisync
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 

Recently uploaded (20)

Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 

BigCloneBench

  • 1. TOWARDS A BIG DATA CURATED BENCHMARK OF INTER-PROJECT CODE CLONES Jeff Svajlenko, Judith F. Islam, Iman Keivanloo*, Chanchal Roy, Mohammad Mamun Mia Computer Science, University of Saskatchewan *Electrical and Computer Engineering, Queen’s University
  • 2. Overview • BigCloneBench • Big Data Clone Benchmark • 6 million clone pairs of 10 functionalities. • To Evaluate Clone Detection Performance • Emerging Big Data Clone Detection Tools 2ICSME 2014
  • 4. Background – Code Clone Code Snippets A continuous region of code: (srcfile, startline, endline). Clone Pair A pair of code snippets that are similar. Clone Class A set of code snippets that are similar. Clone Type Describes the similarity. 4ICSME 2014
  • 5. Background – Type 1 if (a>=b) { c = d+b; // Comment 1 d = d+1; } else c = d-a; // Comment 2 if (a >= b) { c = d + b; // MyComment 1 d = d + 1; } else c = d - a; // MyComment 2 Syntactically identical code snippets, except for differences in white space, layout and comments.
  • 6. Background – Type 2 if (a>=b) { c = d+b; // Comment 1 d = d+1; } else c = d-a; // Comment 2 if (a >= y) { x = d + y; // MyComment1 d = d + 10; } else x = d - a; // MyComment2 Syntactically identical code snippets, except for differences in identifier names, literal values, white space, layout and comments.
  • 7. Background – Type 3 if (a>=b) { c = d+b; // Comment 1 d = d+1; } else c = d-a; // Comment 2 if (a >= y) { x = d + y; // MyComment1 } else { x = d - a - 10; // MyComment2 } Syntactically similar code snippets that differ at the statement level.
  • 8. Background – Type 4 int i, j=1; for (i=1; i<=VALUE; i++) j=j*i; int factorial(int n) { if (n == 0) return 1; else return n * factorial(n-1); } Syntactically dissimilar code snippets that implement the same functionality.
  • 9. Background – Clone Detection • Clone Detection Tool • Locates clones within a software system or source repository. • Types: Classical, Semantic, Big Data, Search • Performance Evaluation • Recall • Precision • Benchmarks Focus on Recall
  • 10. Motivation • Emerging Big Data Clone Detection and Search • Applications: • Building inter-project clone corpora. • Mine new APIs • Detect License Violation • Find Code Examples • … • Problem: • No Benchmark for Big Data Clone Detection! • Existing benchmarks are small or outdated.
  • 11. Methodology • Mine Clones in Big Data • Target: IJaDataset 2.0 • 25,000 open-source systems crawled • 2.4 million files, 365MLOC • Challenge • 250 million candidate function clone pairs. • Need • To reduce and optimize search space.
  • 12. Methodology – General Procedure Solution: Mine Clones of Specific Functionalities 1. Select a Functionality. 2. Create a Search Heuristic. 3. Identify candidate snippets using heuristic. 4. Manually tag each snippet as true or false positive. 5. Populate the benchmark with true/false clones.
  • 13. Methodology – Mine & Tag Snippets [Workflow diagram: Select Functionality → Identify Possible Implementations → Create Specification → Create Sample Snippets → Create Search Heuristic → Build Candidate Set (from IJaDataset) → Tag (by judges) → True Positives / False Positives]
  • 14. Methodology – Mine & Tag Snippets [Same workflow diagram] e.g., Bubble Sort
  • 15. Methodology – Mine & Judge Snippets [Same workflow diagram]
  • 16. Methodology – Mine & Judge Snippets [Same workflow diagram] e.g., “Sort any linear collection of data using the bubble sort algorithm.”
  • 17. Methodology – Mine & Judge Snippets [Same workflow diagram]
  • 18. Methodology – Mine & Judge Snippets [Same workflow diagram] e.g., arr[i] = arr[i+1]
  • 19. Methodology – Mine & Judge Snippets [Same workflow diagram]
  • 20. Methodology – Mine & Judge Snippets [Same workflow diagram]
  • 21. Methodology – True Clones $\#\,\text{clone pairs} = \frac{n(n-1)}{2} = O(n^2)$, where $n = \#\,\text{true positive snippets} + \#\,\text{sample snippets}$ Metadata: • Clone Type • Syntactical Similarity [Diagram: True Positives + Samples → Clone Class → Clone Pairs → Typify → Benchmark]
  • 22. Methodology – False Clones $\#\,\text{false clone pairs} = s \cdot f$, where $s = \#\,\text{sample snippets}$ and $f = \#\,\text{false positive snippets}$ [Diagram: Samples × False Positives → False Clone Pairs → Benchmark]
  • 23. Data Summary • 10 Functionalities • Web Download • Secure Cryptographic Hashing • Copy File • Decompress Zip File • FTP Authenticated Login • Bubble Sort • Initialize Scrolling Graphical Viewer with Model • Setup Scrolling Graphical Viewer Event Handler • Create Java Project (Eclipse API) • Database Update and Rollback (SQL) • Three Judges Tagged 60 Thousand Snippets
  • 24. Data Summary • 6.2 Million True Clones • 259 Thousand False Clones • Clone pairs by clone type (syntactical similarity range): • Type-1: 16,168 • Type-2: 3,733 • Strong Type-3, [70-100)%: 11,286 • Moderate Type-3, [50-70)%: 53,880 • Weak Type-3 & Type-4, [0-50)%: 6,079,886
  • 25. Evaluating Clone Detectors $\text{recall} = \frac{|D \cap B_{tc}|}{|B_{tc}|}$, $\text{precision} = \frac{|D \cap B_{tc}|}{|D \cap (B_{tc} \cup B_{fc})|}$, where $D$ = Detected Candidate Clones, $B_{tc}$ = True Clone Pairs in Benchmark, $B_{fc}$ = False Clone Pairs in Benchmark
  • 26. Evaluating Clone Detectors - Usage • Big Data Clone Detectors • Classical Clone Detection Tools • Semantic Clone Detectors • Clone Search
  • 27. Future Work • More functionalities. • More judges. • More Meta Data • Better Type-3/Type-4 Separation
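
The recall and precision definitions on slide 25 can be illustrated with a minimal Python sketch. It assumes each clone pair is represented as an unordered pair of snippet identifiers and that a detected pair matches a benchmark pair only when the identifiers are equal; the identifiers, the `pair` helper, and the function names are hypothetical and are not part of BigCloneBench's tooling.

```python
from typing import FrozenSet, Optional, Set

# A clone pair is an unordered pair of snippet identifiers (hypothetical IDs).
ClonePair = FrozenSet[str]


def pair(a: str, b: str) -> ClonePair:
    """Build an unordered clone pair, so (a, b) equals (b, a)."""
    return frozenset((a, b))


def recall(detected: Set[ClonePair], true_clones: Set[ClonePair]) -> float:
    """|D ∩ Btc| / |Btc|: share of benchmark true clones the tool found."""
    if not true_clones:
        raise ValueError("benchmark contains no true clone pairs")
    return len(detected & true_clones) / len(true_clones)


def precision(
    detected: Set[ClonePair],
    true_clones: Set[ClonePair],
    false_clones: Set[ClonePair],
) -> Optional[float]:
    """|D ∩ Btc| / |D ∩ (Btc ∪ Bfc)|: only detected pairs known to the
    benchmark are counted; unknown pairs are ignored."""
    known = detected & (true_clones | false_clones)
    if not known:
        return None  # nothing the tool reported is known to the benchmark
    return len(detected & true_clones) / len(known)


# Toy data: three true clone pairs, two false clone pairs.
b_tc = {pair("A", "B"), pair("A", "C"), pair("B", "C")}
b_fc = {pair("A", "X"), pair("B", "X")}
# The tool reports two true clones, one false clone, and one unknown pair.
d = {pair("A", "B"), pair("C", "A"), pair("A", "X"), pair("Y", "Z")}

r = recall(d, b_tc)
p = precision(d, b_tc, b_fc)
```

Because the unknown pair (`Y`, `Z`) is excluded from the denominator, the precision value only reflects pairs the benchmark has an opinion about, which is why it gives a hint rather than a standard precision measurement.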

Editor's Notes

  1. Hello, my name is Jeff Svajlenko, and today I am presenting the paper Towards a Big Data Curated Benchmark of Inter-Project Code Clones. Contributors to this project include myself, Judith Islam, Iman Keivanloo, Chanchal Roy, and Mohammad Mamun Mia.
  2. Today I will be talking about BigCloneBench, which is a big data clone benchmark containing 6 million clone pairs of 10 distinct functionalities. We created this benchmark for evaluating clone detection performance, especially that of emerging big data clone detection algorithms.
  3. I will begin by discussing some background details, before proceeding into the motivation of our work. I will then discuss our methodology for building this benchmark, and a summary of its contents. I will discuss how the benchmark can be used to evaluate the clone detectors, before proceeding into future work.
  4. I will begin the background with some code clone definitions. A code snippet is a continuous region of code, specified by the source file, and start and end lines. A clone pair is a pair of code snippets that are similar. Clone pairs are sometimes summarized as a clone class, which is a set of code snippets that are similar. Clones are assigned a clone type, which describes this similarity. The community agrees upon four fundamental clone types.
  5. A Type-1 clone contains code snippets that are syntactically identical, except for differences in white-space, layout and comments.
  6. A Type-2 clone extends this definition to also allow differences in variable names and literal values.
  7. A Type-3 clone includes snippets that are syntactically similar, but contain differences at the statement level. Snippets may have statements added, removed or modified with respect to each other. There is no agreement on the required similarity of a type-3 clone, although studies typically require them to be 70-90% similar by syntax, using some similarity metric.
  8. A Type-4 clone includes code snippets that are syntactically dissimilar, but implement the same functionality. For example, here we have two very syntactically dissimilar implementations of factorial.
  9. Clone detection tools locate clones within a software system or source repository. There are multiple types of clone detection tools. Classical clone detection tools locate syntactically similar clones, the first three clone types, within a single subject system or between a handful of them. Semantic clone detectors aim to expand detection to type-4 clones, by using methods such as program dependency graphs. Big data clone detection is concerned with the detection of clones between thousands of subject systems. Search algorithms search for clones of a target code snippet, often within big data repositories. All of these clone detectors are evaluated using the information retrieval metrics recall and precision. Recall is the ratio of the clones within a subject system or repository that a clone detector is able to detect, while precision is the ratio of the clones reported by the clone detector that are in fact true clones, not false positives. Benchmarks typically focus on enabling the measurement of recall. Tool developers can measure their tool’s precision by validating its output for a variety of subject systems. However, it is impossible for them to measure recall without prior knowledge of the clones that exist within a subject system.
  10. The motivation for our work is emerging big data clone detection and search algorithms, which have numerous applications, such as: building inter-project clone corpora for research studies and as a data set for some tools, mining for new APIs, detecting licensing violations, finding code examples, and so on. However, there are currently no benchmarks for big data clone tools. Additionally, the current benchmarks that target the classical clone detection tools are either too small for big data, or are old. For example, the most common benchmark was published in 2007, and was built using tools circa 2002.
  11. We built our benchmark by mining clones in big data. Our target was the inter-project Java dataset IJaDataset 2.0, which contains source code crawled from 25,000 open-source systems, including 2.4 million files and 365 million lines of code. The challenge is that even if we only consider function clones, there are 250 million candidate function clone pairs to examine. Looking at any significant number of these candidates is impractical, and a random sample of them is also not efficient due to the rarity of inter-project clones. We therefore needed to reduce and optimize the search space for locating clones.
  12. Our solution was to mine clones of specific functionalities. This is an overview of our procedure. We begin by selecting a functionality. We then create a search heuristic that can identify if a snippet *might* implement the functionality. We use this heuristic to identify the candidate snippets in the dataset. The candidates are tagged by expert judges as true or false positives of the functionality. We then use these snippets to populate the benchmark with true and false clones. This process is repeated for any number of functionalities. Next I will go into the details of these stages.
  13. This diagram shows the snippet mining and tagging steps.
  14. We begin by selecting a functionality that we believe will appear many times in the dataset. For example, we chose the functionality Bubble Sort.
  15. Next, we research the different ways this functionality can be implemented in Java. We investigate official and 3rd party libraries, instructional material such as textbooks and online tutorials, as well as online discussions such as Stack Overflow. For bubble sort, we found many possible implementations, with variations in data types, data structures, loop structures, and data comparison logic.
  16. Next, considering the previous research, we create a formal specification for the functionality. This is the minimum set of features or steps a snippet must realize to be a true positive of this functionality. For bubble sort, this was “Sort any linear collection of data using the bubble sort algorithm”.
  17. As part of our research, we collect sample snippets that implement the functionality. These are minimum working examples that meet the specification. These are added to the dataset as additional crawled source code, and play a role in populating the benchmark.
  18. We create a search heuristic, which is a logical combination of keywords and source code patterns which are intrinsic to the identified implementations of the functionality. The search heuristic can therefore identify if a snippet might implement the functionality. For example, for bubble sort, a source code pattern such as arr[i] = arr[i+1].
  19. The search heuristic is executed for each snippet in the dataset, producing a set of candidates that might implement the functionality. For this version of the benchmark we have chosen the function snippet granularity, as functions best encapsulate whole functionalities, and it’s a granularity that pretty much all clone detectors can detect at.
  20. The candidate snippets are then given to expert judges, who tag them as true or false positives using the formal specification. A snippet is tagged as a true positive so long as it meets the specification, even if it performs additional related or unrelated steps. Otherwise it is tagged as a false positive. The specification helps reduce individual bias in the tagging process.
  21. From the previous tasks, we have identified a set of code snippets that implement a functionality. This includes the sample snippets of the functionality, and the candidate snippets tagged as true positives. Together they form a clone class of snippets that are similar by functionality. If this clone class has n such snippets, we have found order n squared clone pairs. These clones are then automatically typified using a TXL-based clone typifier. This perfectly typifies type 1 and type 2 clones, but the separation of type 3 and type 4 clones is a bit more tricky, which I will discuss in a few slides.
  22. We have also identified many false positives of the functionality. We know that the sample snippets exactly meet the specification of the functionality, and the false positives do not. So each pair of sample snippet and false positive snippet is a false clone. While they may share some syntactical similarity, this similarity is coincidental.
  23. We executed this procedure for ten distinct functionalities. In total 60 thousand snippets were tagged by three graduate student judges.
  24. In total we have identified 6.2 million clone pairs. Each of these clone pairs shares semantics, specifically the functionality they implement. Here we summarize the syntactical similarity by dividing them by their clone type. Type 1 and type 2 are easy, but it is difficult to separate the type 3 and type 4 clones, due to the lack of agreement on when type 3 ends and type 4 begins. Instead we separate them by their syntactical similarity. Syntactical clone detectors are typically configured to locate clones that share 70% or more of their syntax, which we consider strong Type-3 clones. We consider the clones that share at least half their syntax, but less than 70%, moderate Type-3 clones. We then consider the clones which share less than half their syntax to be subjectively either a weak type-3 clone, or a type 4 clone. We also identified 259 thousand false clones.
  25. The benchmark can evaluate the clone detectors by measuring recall and precision. We measure recall as the ratio of the true clones in the benchmark that the detector was able to detect. While our focus was recall, we provide a limited precision measure. Precision is measured as the ratio of the known clones detected by the tool that are true clones. It ignores any detected clones that are unknown in the benchmark. The dataset contains many orders of magnitude more false clones than true clones, so our sampling of the false clones is small and less rigorous. So this is not a replacement for a standard measurement of precision, but does provide some hints.
  26. The benchmark was designed for the big data tools, but it may also be used to evaluate the other classifications of tools. The big data tools can be executed for the entire dataset. Recall can then be measured for the full benchmark, but also per clone type, per functionality, or even per ranges of syntactical similarity. Classical clone detection tools won’t be scalable to the full benchmark. However, they could be evaluated for subsets of the benchmark. Confidence can be achieved by evaluating the tool for many subsets. Subsets could be randomly selected, or even be the intra-project clones within one of the subject systems in the dataset. Semantic clone detectors will also need to be evaluated for subsets due to potential scalability issues. Good subsets would be the clones of a functionality. Clone search algorithms can be evaluated by using one of the sample snippets as a target, and evaluating which of the true positive snippets the tool locates.
  27. As future work, we plan to expand the benchmark in a number of ways. We will grow the benchmark by adding new functionalities, which will increase the number and variety of clones in the benchmark. We plan to increase the number of judges, including using multiple judges per functionality. This will improve our data confidence, and allow us to measure the tagging accuracy. We also plan to investigate and add more clone meta-data (such as additional similarity metrics) to better describe our clones. We also plan to investigate how we can better separate our type 3 and type 4 clones.
  28. The benchmark is available at the following URL. Please feel free to ask me any questions.
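
The benchmark population steps described in notes 21, 22 and 24 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: snippets are plain string identifiers, the similarity value is assumed to come from some external syntactical similarity metric scaled to the range 0 to 1, and all function names are hypothetical. The 50% and 70% thresholds are the ones stated in note 24.

```python
from itertools import combinations, product
from typing import FrozenSet, List, Sequence, Tuple


def true_clone_pairs(
    samples: Sequence[str], true_positives: Sequence[str]
) -> List[FrozenSet[str]]:
    """Pair every two members of the clone class (samples + true positives).

    A class of n snippets yields n * (n - 1) / 2 pairs.
    """
    clone_class = list(samples) + list(true_positives)
    return [frozenset(p) for p in combinations(clone_class, 2)]


def false_clone_pairs(
    samples: Sequence[str], false_positives: Sequence[str]
) -> List[Tuple[str, str]]:
    """Pair each sample snippet with each false positive: s * f pairs."""
    return list(product(samples, false_positives))


def similarity_category(similarity: float) -> str:
    """Bucket a pair that is neither Type-1 nor Type-2 by its similarity.

    `similarity` is assumed to lie in [0, 1); the metric itself is not
    specified here and is a placeholder.
    """
    if not 0.0 <= similarity < 1.0:
        raise ValueError("similarity must be in [0, 1)")
    if similarity >= 0.7:
        return "Strong Type-3"
    if similarity >= 0.5:
        return "Moderate Type-3"
    return "Weak Type-3 / Type-4"


# Toy functionality: 2 sample snippets, 3 true positives, 4 false positives.
samples = ["sample1", "sample2"]
tagged_true = ["tp1", "tp2", "tp3"]
tagged_false = ["fp1", "fp2", "fp3", "fp4"]

true_pairs = true_clone_pairs(samples, tagged_true)
false_pairs = false_clone_pairs(samples, tagged_false)
```

The quadratic growth of `true_clone_pairs` is why tagging 60 thousand snippets can yield millions of true clone pairs, while the false clone count grows only with the number of samples times the number of false positives.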