SlideShare a Scribd company logo
A Tree Kernel Based
Approach for
Clone Detection
1) University of Naples Federico II
2) University of Basilicata
Anna Corazza1
, Sergio Di Martino1
,
Valerio Maggio1
, Giuseppe Scanniello2
Outline
►Background
○ Clone detection definition
○ State of the Art Techniques Taxonomy
►Our Abstract Syntax Tree based Proposal
○ A Tree Kernel based approach for clone detection
►A preliminary evaluation
Code Clones
► Two code fragments form a clone if they are similar enough
according to a given measure of similarity (I.D. Baxter, 1998)
3. R. Tiarks, R. Koschke, and R. Falke,
An assessment of type-3 clones as detected by state-of-the-art tools
1
Code Clones
► Two code fragments form a clone if they are similar enough
according to a given measure of similarity (I.D. Baxter, 1998)
► Similarity based on Program Text or on “Semantics”
3. R. Tiarks, R. Koschke, and R. Falke,
An assessment of type-3 clones as detected by state-of-the-art tools
1
Code Clones
► Two code fragments form a clone if they are similar enough
according to a given measure of similarity (I.D. Baxter, 1998)
► Similarity based on Program Text or on “Semantics”
► Program Text can be further distinguished by their degree of similarity1
○ Type 1 Clone: Exact Copy
○ Type 2 Clone: Parameter Substituted Clone
○ Type 3 Clone: Modified/Structure Substituted Clone
1. R. Tiarks, R. Koschke, and R. Falke,
An assessment of type-3 clones as detected by state-of-the-art tools
1
State of the Art Techniques
► Classified in terms of Program Text representation2
○ String, token, syntax tree, control structures, metric vectors
► String/Token based Techniques
► Abstract Syntax Tree (AST) Techniques
► ...
2
2. Roy, Cordy, Koschke Comparison and Evaluation of Clone Detection Tools and Technique 2009
State of the Art Techniques
► String/Token based Techniques
► Abstract Syntax Tree (AST) Techniques
► ...
► Combined Techniques (a.k.a. Hybrid)
○Combine different representations
○Combine different techniques
○Combine different sources of information
●Tree Kernel based approach (Our approach :)
2
The Proposed Approach
The Goal
► Define an AST based technique able to detect up to Type 3
Clones
3
The Goal
► Define an AST based technique able to detect up to Type 3
Clones
► The Key Ideas:
○ Improve the amount of information carried by ASTs by adding (also)
lexical information
○ Define a proper measure to compute similarities among (sub)trees,
exploiting such information
3
The Goal
► Define an AST based technique able to detect up to Type 3
Clones
► The Key Ideas:
○ Improve the amount of information carried by ASTs by adding (also)
lexical information
○ Define a proper measure to compute similarities among (sub)trees,
exploiting such information
► As a measure we propose the use of a
(Tree) Kernel Function
3
Kernels for Structured Data
► Kernels are a class of functions with many appealing features:
○ Are based on the idea that a complex object can be described in terms of
its constituent parts
○ Can be easily tailored to a specific domain
► There exist different classes of Kernels:
○ String Kernels
○ Graph Kernels
○ …
○ Tree Kernels
● Applied to NLP Parse Trees (Collins and Duffy 2004)
4
Defining a new Tree Kernel
► The definition of a new Tree Kernel requires the
specification of:
(1) A set of features to annotate nodes of
compared trees
5
Defining a new Tree Kernel
► The definition of a new Tree Kernel requires the
specification of:
(1) A set of features to annotate nodes of compared
trees
(2) A (primitive) Kernel Function to measure the
similarity of each pair of nodes
5
Defining a new Tree Kernel
► The definition of a new Tree Kernel requires the
specification of:
(1) A set of features to annotate nodes of compared
trees
(2) A (primitive) Kernel Function to measure the
similarity of each pair of nodes
(3) A proper Kernel Function to compare subparts of
trees
5
(1) The defined features
► We annotate each node of AST by 4 features:
6
(1) The defined features
► We annotate each node of AST by 4 features:
○ Instruction Class
● i.e. LOOP, CONDITIONAL CONTROL, CONTROL FLOW
CONTROL,...
6
(1) The defined features
► We annotate each node of AST by 4 features:
○ Instruction Class
● i.e. LOOP, CONDITIONAL CONTROL, CONTROL FLOW
CONTROL,...
○ Instruction
● i.e. FOR, WHILE, IF, RETURN, CONTINUE,...
6
(1) The defined features
► We annotate each node of AST by 4 features:
○ Instruction Class
● i.e. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL,...
○ Instruction
● i.e. FOR, WHILE, IF, RETURN, CONTINUE,...
○ Context
● Instruction class of statement in which node is
enclosed
6
(1) The defined features
► We annotate each node of AST by 4 features:
○ Instruction Class
● i.e. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL,...
○ Instruction
● i.e. FOR, WHILE, IF, RETURN, CONTINUE,...
○ Context
● Instruction class of statement in which node is enclosed
○ Lexemes
● Lexical information within the code
6
Context Feature
► Rationale: two nodes are more similar if they appear in the same
Instruction class
for (int i=0; i<10; i++)
    x += i+2;
if (i<10)
    x += i+2;
while (i<10)
    x += i+2;
7
Context Feature
► Rationale: two nodes are more similar if they appear in the same
Instruction class
for (int i=0; i<10; i++)
    x += i+2;
if (i<10)
    x += i+2;
while (i<10)
    x += i+2;
7
Context Feature
► Rationale: two nodes are more similar if they appear in the same
Instruction class
for (int i=0; i<10; i++)
    x += i+2;
if (i<10)
    x += i+2;
while (i<10)
    x += i+2;
7
Context Feature
► Rationale: two nodes are more similar if they appear in the same
Instruction class
for (int i=0; i<10; i++)
    x += i+2;
if (i<10)
    x += i+2;
while (i<10)
    x += i+2;
7
Context Feature
► Rationale: two nodes are more similar if they appear in the same
Instruction class
for (int i=0; i<10; i++)
    x += i+2;
if (i<10)
    x += i+2;
while (i<10)
    x += i+2;
7
Lexemes Feature
► For leaf nodes:
○ It is the lexeme associated to the node
► For internal nodes:
○ It is the set of lexemes that recursively comes from
subtrees with minimum height
8
Lexemes Propagation
x
<
0
return
yblock
%=
x y
block
while
9
Lexemes Propagation
x
<
0
return
yblock
%=
x y
block
while
x 0
x y
y
9
Lexemes Propagation
x
<
0
return
yblock
%=
x y
block
while
x
x y
y
0
x, 0
9
Lexemes Propagation
x
<
0
return
yblock
%=
x y
block
while
x
x y
y
0
x, 0
x, y
x, y
9
Lexemes Propagation
x
<
0
return
yblock
%=
x y
block
while
x
x y
y
0
x, 0
x, y
x, y
x, 0, while
y, return
9
Lexemes Propagation
x
<
0
return
yblock
%=
x y
block
while
x
x y
y
0
x, 0
x, y
x, y
x, 0, while
y, return
y, return
9
(2) Applying features in a Kernel
We exploits these features to compute similarity among pairs of
nodes, as follows:
► Instruction Class filters comparable nodes
○ We compare only nodes with the same Instruction Class
► Instruction, Context and Lexemes are used to define a value of
similarity between compared nodes
10
(Primitive) Kernel Function between nodes
1.0 If two nodes have the same values of
features
0.8 If two nodes differ in lexemes
(same instruction and context)
0.7 If two nodes share lexemes and are
the same instruction
0.5 If two nodes share lexemes and are
enclosed in the same context
0.25 If two nodes have at least one feature
in common
0.0 no match
s(n1,n2)=
11
(3) Tree Kernel: Kernel on entire Tree Structures
►We apply nodes comparison recursively to compute
similarity between subtrees
►We aim to identify the maximum isomorphic
tree/subtree
12
Overall Process
1. Preprocessing 2. Extraction
3. Match Detection 4. Aggregation
13
A Preliminary evaluation
Evaluation Description
► We considered a small Java software system
○ We choose to identify clones at method level
► We checked system against the presence of up to Type 3 clones
○ Removed all detected clones through refactoring operations
► We manually and randomly injected a set of artificially created clones
○ One set for each type of clones
► We applied our prototype and CloneDigger* to mutated systems
► We evaluated performances in terms of Precision, Recall and F1
*http://clonedigger.sourceforge.net/
14
Results (1)
► Type 1 and Type 2 Clones:
○ We were able to detect all clones without any false
positive
○ This was obtained also by CloneDigger
○ Both tools expressed the potential of AST-based
approaches
15
Results (2)
► Type 3 clones:
○ We classified results as “true Type 3 clones” according to
different thresholds on similarity values
○ We measured performance on different thresholds
We get best results with
threshold equals to 0.70
16
Conclusions and Future Works
► Measure performance on real systems and projects
○ Bellon's Benchmark
○ Investigate best results with 0.7 as threshold
○ Measure Time Performances
► Improve the scalability of the approach
○ Avoid to compare all pairs
► Improve similarity computation
○ Avoid manual weighting features
► Extend Supported Languages
○ Now we support Java, C, Python
17
Thank you for listening.
Questions?
18

More Related Content

What's hot

Chapter ii(oop)
Chapter ii(oop)Chapter ii(oop)
Chapter ii(oop)
Chhom Karath
 
Data structures in scala
Data structures in scalaData structures in scala
Data structures in scalaMeetu Maltiar
 
Java Unit 2(Part 1)
Java Unit 2(Part 1)Java Unit 2(Part 1)
Java Unit 2(Part 1)
SURBHI SAROHA
 
Java Unit 2(part 3)
Java Unit 2(part 3)Java Unit 2(part 3)
Java Unit 2(part 3)
SURBHI SAROHA
 
Java Unit 2 (Part 2)
Java Unit 2 (Part 2)Java Unit 2 (Part 2)
Java Unit 2 (Part 2)
SURBHI SAROHA
 
JAVA CONCEPTS AND PRACTICES
JAVA CONCEPTS AND PRACTICESJAVA CONCEPTS AND PRACTICES
JAVA CONCEPTS AND PRACTICES
Nikunj Parekh
 
Jist of Java
Jist of JavaJist of Java
Jist of Java
Nikunj Parekh
 
Java Serialization Deep Dive
Java Serialization Deep DiveJava Serialization Deep Dive
Java Serialization Deep Dive
Martijn Dashorst
 
Class
ClassClass
Iterator Design Pattern
Iterator Design PatternIterator Design Pattern
Iterator Design Pattern
Varun Arora
 
Collections - Array List
Collections - Array List Collections - Array List
Collections - Array List
Hitesh-Java
 
Bin Sorting And Bubble Sort By Luisito G. Trinidad
Bin Sorting And Bubble Sort By Luisito G. TrinidadBin Sorting And Bubble Sort By Luisito G. Trinidad
Bin Sorting And Bubble Sort By Luisito G. Trinidad
LUISITO TRINIDAD
 
Serialization/deserialization
Serialization/deserializationSerialization/deserialization
Serialization/deserialization
Young Alista
 

What's hot (17)

Chapter ii(oop)
Chapter ii(oop)Chapter ii(oop)
Chapter ii(oop)
 
Data structures in scala
Data structures in scalaData structures in scala
Data structures in scala
 
Euclideus_Language
Euclideus_LanguageEuclideus_Language
Euclideus_Language
 
Java Unit 2(Part 1)
Java Unit 2(Part 1)Java Unit 2(Part 1)
Java Unit 2(Part 1)
 
Java Unit 2(part 3)
Java Unit 2(part 3)Java Unit 2(part 3)
Java Unit 2(part 3)
 
Java Unit 2 (Part 2)
Java Unit 2 (Part 2)Java Unit 2 (Part 2)
Java Unit 2 (Part 2)
 
Vectors in Java
Vectors in JavaVectors in Java
Vectors in Java
 
JAVA CONCEPTS AND PRACTICES
JAVA CONCEPTS AND PRACTICESJAVA CONCEPTS AND PRACTICES
JAVA CONCEPTS AND PRACTICES
 
Jist of Java
Jist of JavaJist of Java
Jist of Java
 
What Do You Mean By NUnit
What Do You Mean By NUnitWhat Do You Mean By NUnit
What Do You Mean By NUnit
 
Java Serialization Deep Dive
Java Serialization Deep DiveJava Serialization Deep Dive
Java Serialization Deep Dive
 
Class
ClassClass
Class
 
Iterator Design Pattern
Iterator Design PatternIterator Design Pattern
Iterator Design Pattern
 
LectureNotes-05-DSA
LectureNotes-05-DSALectureNotes-05-DSA
LectureNotes-05-DSA
 
Collections - Array List
Collections - Array List Collections - Array List
Collections - Array List
 
Bin Sorting And Bubble Sort By Luisito G. Trinidad
Bin Sorting And Bubble Sort By Luisito G. TrinidadBin Sorting And Bubble Sort By Luisito G. Trinidad
Bin Sorting And Bubble Sort By Luisito G. Trinidad
 
Serialization/deserialization
Serialization/deserializationSerialization/deserialization
Serialization/deserialization
 

Viewers also liked

Number Crunching in Python
Number Crunching in PythonNumber Crunching in Python
Number Crunching in Python
Valerio Maggio
 
Web frameworks
Web frameworksWeb frameworks
Web frameworks
Valerio Maggio
 
Junit in action
Junit in actionJunit in action
Junit in action
Valerio Maggio
 
Design patterns and Refactoring
Design patterns and RefactoringDesign patterns and Refactoring
Design patterns and Refactoring
Valerio Maggio
 
Machine Learning for Software Maintainability
Machine Learning for Software MaintainabilityMachine Learning for Software Maintainability
Machine Learning for Software Maintainability
Valerio Maggio
 
Test Driven Development
Test Driven DevelopmentTest Driven Development
Test Driven Development
Valerio Maggio
 
LINSEN an efficient approach to split identifiers and expand abbreviations
LINSEN an efficient approach to split identifiers and expand abbreviationsLINSEN an efficient approach to split identifiers and expand abbreviations
LINSEN an efficient approach to split identifiers and expand abbreviations
Valerio Maggio
 
Unsupervised Machine Learning for clone detection
Unsupervised Machine Learning for clone detectionUnsupervised Machine Learning for clone detection
Unsupervised Machine Learning for clone detection
Valerio Maggio
 
Scaffolding with JMock
Scaffolding with JMockScaffolding with JMock
Scaffolding with JMock
Valerio Maggio
 
Unit testing and scaffolding
Unit testing and scaffoldingUnit testing and scaffolding
Unit testing and scaffolding
Valerio Maggio
 
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniquesImproving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Valerio Maggio
 
Refactoring: Improve the design of existing code
Refactoring: Improve the design of existing codeRefactoring: Improve the design of existing code
Refactoring: Improve the design of existing code
Valerio Maggio
 
Unit testing with Junit
Unit testing with JunitUnit testing with Junit
Unit testing with Junit
Valerio Maggio
 

Viewers also liked (13)

Number Crunching in Python
Number Crunching in PythonNumber Crunching in Python
Number Crunching in Python
 
Web frameworks
Web frameworksWeb frameworks
Web frameworks
 
Junit in action
Junit in actionJunit in action
Junit in action
 
Design patterns and Refactoring
Design patterns and RefactoringDesign patterns and Refactoring
Design patterns and Refactoring
 
Machine Learning for Software Maintainability
Machine Learning for Software MaintainabilityMachine Learning for Software Maintainability
Machine Learning for Software Maintainability
 
Test Driven Development
Test Driven DevelopmentTest Driven Development
Test Driven Development
 
LINSEN an efficient approach to split identifiers and expand abbreviations
LINSEN an efficient approach to split identifiers and expand abbreviationsLINSEN an efficient approach to split identifiers and expand abbreviations
LINSEN an efficient approach to split identifiers and expand abbreviations
 
Unsupervised Machine Learning for clone detection
Unsupervised Machine Learning for clone detectionUnsupervised Machine Learning for clone detection
Unsupervised Machine Learning for clone detection
 
Scaffolding with JMock
Scaffolding with JMockScaffolding with JMock
Scaffolding with JMock
 
Unit testing and scaffolding
Unit testing and scaffoldingUnit testing and scaffolding
Unit testing and scaffolding
 
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniquesImproving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
 
Refactoring: Improve the design of existing code
Refactoring: Improve the design of existing codeRefactoring: Improve the design of existing code
Refactoring: Improve the design of existing code
 
Unit testing with Junit
Unit testing with JunitUnit testing with Junit
Unit testing with Junit
 

Similar to A Tree Kernel based approach for clone detection

Clustering.pdf
Clustering.pdfClustering.pdf
Clustering.pdf
nadimhossain24
 
Text clustering
Text clusteringText clustering
Text clusteringKU Leuven
 
Poggi analytics - clustering - 1
Poggi   analytics - clustering - 1Poggi   analytics - clustering - 1
Poggi analytics - clustering - 1
Gaston Liberman
 
Multi-Armed Bandits:
 Intro, examples and tricks
Multi-Armed Bandits:
 Intro, examples and tricksMulti-Armed Bandits:
 Intro, examples and tricks
Multi-Armed Bandits:
 Intro, examples and tricks
Ilias Flaounas
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed Coordination
Luis Galárraga
 
Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019 Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019
Rakibul Hasan Pranto
 
cluster analysis
cluster analysiscluster analysis
cluster analysis
sudesh regmi
 
Lecture 7: Recurrent Neural Networks
Lecture 7: Recurrent Neural NetworksLecture 7: Recurrent Neural Networks
Lecture 7: Recurrent Neural Networks
Sang Jun Lee
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorization
midi
 
tutorial.ppt
tutorial.ppttutorial.ppt
tutorial.ppt
Vara Prasad
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
Vahid Mirjalili
 
Neural Nets Deconstructed
Neural Nets DeconstructedNeural Nets Deconstructed
Neural Nets Deconstructed
Paul Sterk
 
Document clustering and classification
Document clustering and classification Document clustering and classification
Document clustering and classification
Mahmoud Alfarra
 
Manipulating string data with a pattern in R
Manipulating string data with  a pattern in RManipulating string data with  a pattern in R
Manipulating string data with a pattern in R
Lun-Hsien Chang
 
002.decision trees
002.decision trees002.decision trees
002.decision trees
hoangminhdong
 
Siggraph2016 - The Devil is in the Details: idTech 666
Siggraph2016 - The Devil is in the Details: idTech 666Siggraph2016 - The Devil is in the Details: idTech 666
Siggraph2016 - The Devil is in the Details: idTech 666
Tiago Sousa
 
A Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
A Comparative Analysis of Feature Selection Methods for Clustering DNA SequencesA Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
A Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
CSCJournals
 
dm_clustering2.ppt
dm_clustering2.pptdm_clustering2.ppt
dm_clustering2.ppt
Bhuvanya Raghunathan
 
ai7.ppt
ai7.pptai7.ppt
ai7.ppt
qwerty432737
 

Similar to A Tree Kernel based approach for clone detection (20)

Clustering.pdf
Clustering.pdfClustering.pdf
Clustering.pdf
 
Text clustering
Text clusteringText clustering
Text clustering
 
Poggi analytics - clustering - 1
Poggi   analytics - clustering - 1Poggi   analytics - clustering - 1
Poggi analytics - clustering - 1
 
Multi-Armed Bandits:
 Intro, examples and tricks
Multi-Armed Bandits:
 Intro, examples and tricksMulti-Armed Bandits:
 Intro, examples and tricks
Multi-Armed Bandits:
 Intro, examples and tricks
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed Coordination
 
Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019 Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019
 
Q
QQ
Q
 
cluster analysis
cluster analysiscluster analysis
cluster analysis
 
Lecture 7: Recurrent Neural Networks
Lecture 7: Recurrent Neural NetworksLecture 7: Recurrent Neural Networks
Lecture 7: Recurrent Neural Networks
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorization
 
tutorial.ppt
tutorial.ppttutorial.ppt
tutorial.ppt
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
 
Neural Nets Deconstructed
Neural Nets DeconstructedNeural Nets Deconstructed
Neural Nets Deconstructed
 
Document clustering and classification
Document clustering and classification Document clustering and classification
Document clustering and classification
 
Manipulating string data with a pattern in R
Manipulating string data with  a pattern in RManipulating string data with  a pattern in R
Manipulating string data with a pattern in R
 
002.decision trees
002.decision trees002.decision trees
002.decision trees
 
Siggraph2016 - The Devil is in the Details: idTech 666
Siggraph2016 - The Devil is in the Details: idTech 666Siggraph2016 - The Devil is in the Details: idTech 666
Siggraph2016 - The Devil is in the Details: idTech 666
 
A Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
A Comparative Analysis of Feature Selection Methods for Clustering DNA SequencesA Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
A Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
 
dm_clustering2.ppt
dm_clustering2.pptdm_clustering2.ppt
dm_clustering2.ppt
 
ai7.ppt
ai7.pptai7.ppt
ai7.ppt
 

Recently uploaded

Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 

Recently uploaded (20)

Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 

A Tree Kernel based approach for clone detection

  • 1. A Tree Kernel Based Approach for Clone Detection 1) University of Naples Federico II 2) University of Basilicata Anna Corazza1 , Sergio Di Martino1 , Valerio Maggio1 , Giuseppe Scanniello2
  • 2. Outline ►Background ○ Clone detection definition ○ State of the Art Techniques Taxonomy ►Our Abstract Syntax Tree based Proposal ○ A Tree Kernel based approach for clone detection ►A preliminary evaluation
  • 3. Code Clones ► Two code fragments form a clone if they are similar enough according to a given measure of similarity (I.D. Baxter, 1998) 3. R. Tiarks, R. Koschke, and R. Falke, An assessment of type-3 clones as detected by state-of-the-art tools 1
  • 4. Code Clones ► Two code fragments form a clone if they are similar enough according to a given measure of similarity (I.D. Baxter, 1998) ► Similarity based on Program Text or on “Semantics” 3. R. Tiarks, R. Koschke, and R. Falke, An assessment of type-3 clones as detected by state-of-the-art tools 1
  • 5. Code Clones ► Two code fragments form a clone if they are similar enough according to a given measure of similarity (I.D. Baxter, 1998) ► Similarity based on Program Text or on “Semantics” ► Program Text can be further distinguished by their degree of similarity1 ○ Type 1 Clone: Exact Copy ○ Type 2 Clone: Parameter Substituted Clone ○ Type 3 Clone: Modified/Structure Substituted Clone 1. R. Tiarks, R. Koschke, and R. Falke, An assessment of type-3 clones as detected by state-of-the-art tools 1
  • 6. State of the Art Techniques ► Classified in terms of Program Text representation2 ○ String, token, syntax tree, control structures, metric vectors ► String/Token based Techniques ► Abstract Syntax Tree (AST) Techniques ► ... 2 2. Roy, Cordy, Koschke Comparison and Evaluation of Clone Detection Tools and Technique 2009
  • 7. State of the Art Techniques ► String/Token based Techniques ► Abstract Syntax Tree (AST) Techniques ► ... ► Combined Techniques (a.k.a. Hybrid) ○Combine different representations ○Combine different techniques ○Combine different sources of information ●Tree Kernel based approach (Our approach :) 2
  • 9. The Goal ► Define an AST based technique able to detect up to Type 3 Clones 3
  • 10. The Goal ► Define an AST based technique able to detect up to Type 3 Clones ► The Key Ideas: ○ Improve the amount of information carried by ASTs by adding (also) lexical information ○ Define a proper measure to compute similarities among (sub)trees, exploiting such information 3
  • 11. The Goal ► Define an AST based technique able to detect up to Type 3 Clones ► The Key Ideas: ○ Improve the amount of information carried by ASTs by adding (also) lexical information ○ Define a proper measure to compute similarities among (sub)trees, exploiting such information ► As a measure we propose the use of a (Tree) Kernel Function 3
  • 12. Kernels for Structured Data ► Kernels are a class of functions with many appealing features: ○ Are based on the idea that a complex object can be described in terms of its constituent parts ○ Can be easily tailored to a specific domain ► There exist different classes of Kernels: ○ String Kernels ○ Graph Kernels ○ … ○ Tree Kernels ● Applied to NLP Parse Trees (Collins and Duffy 2004) 4
  • 13. Defining a new Tree Kernel ► The definition of a new Tree Kernel requires the specification of: (1) A set of features to annotate nodes of compared trees 5
  • 14. Defining a new Tree Kernel ► The definition of a new Tree Kernel requires the specification of: (1) A set of features to annotate nodes of compared trees (2) A (primitive) Kernel Function to measure the similarity of each pair of nodes 5
  • 15. Defining a new Tree Kernel ► The definition of a new Tree Kernel requires the specification of: (1) A set of features to annotate nodes of compared trees (2) A (primitive) Kernel Function to measure the similarity of each pair of nodes (3) A proper Kernel Function to compare subparts of trees 5
  • 16. (1) The defined features ► We annotate each node of AST by 4 features: 6
  • 17. (1) The defined features ► We annotate each node of AST by 4 features: ○ Instruction Class ● i.e. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL,... 6
  • 18. (1) The defined features ► We annotate each node of AST by 4 features: ○ Instruction Class ● i.e. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL,... ○ Instruction ● i.e. FOR, WHILE, IF, RETURN, CONTINUE,... 6
  • 19. (1) The defined features ► We annotate each node of AST by 4 features: ○ Instruction Class ● i.e. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL,... ○ Instruction ● i.e. FOR, WHILE, IF, RETURN, CONTINUE,... ○ Context ● Instruction class of statement in which node is enclosed 6
  • 20. (1) The defined features ► We annotate each node of AST by 4 features: ○ Instruction Class ● i.e. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL,... ○ Instruction ● i.e. FOR, WHILE, IF, RETURN, CONTINUE,... ○ Context ● Instruction class of statement in which node is enclosed ○ Lexemes ● Lexical information within the code 6
  • 21. Context Feature ► Rationale: two nodes are more similar if they appear in the same Instruction class for (int i=0; i<10; i++)     x += i+2; if (i<10)     x += i+2; while (i<10)     x += i+2; 7
  • 22. Context Feature ► Rationale: two nodes are more similar if they appear in the same Instruction class for (int i=0; i<10; i++)     x += i+2; if (i<10)     x += i+2; while (i<10)     x += i+2; 7
  • 23. Context Feature ► Rationale: two nodes are more similar if they appear in the same Instruction class for (int i=0; i<10; i++)     x += i+2; if (i<10)     x += i+2; while (i<10)     x += i+2; 7
  • 24. Context Feature ► Rationale: two nodes are more similar if they appear in the same Instruction class for (int i=0; i<10; i++)     x += i+2; if (i<10)     x += i+2; while (i<10)     x += i+2; 7
  • 25. Context Feature ► Rationale: two nodes are more similar if they appear in the same Instruction class for (int i=0; i<10; i++)     x += i+2; if (i<10)     x += i+2; while (i<10)     x += i+2; 7
  • 26. Lexemes Feature ► For leaf nodes: ○ It is the lexeme associated to the node ► For internal nodes: ○ It is the set of lexemes that recursively comes from subtrees with minimum height 8
  • 31. Lexemes Propagation x < 0 return yblock %= x y block while x x y y 0 x, 0 x, y x, y x, 0, while y, return 9
  • 32. Lexemes Propagation x < 0 return yblock %= x y block while x x y y 0 x, 0 x, y x, y x, 0, while y, return y, return 9
  • 33. (2) Applying features in a Kernel We exploits these features to compute similarity among pairs of nodes, as follows: ► Instruction Class filters comparable nodes ○ We compare only nodes with the same Instruction Class ► Instruction, Context and Lexemes are used to define a value of similarity between compared nodes 10
  • 34. (Primitive) Kernel Function between nodes 1.0 If two nodes have the same values of features 0.8 If two nodes differ in lexemes (same instruction and context) 0.7 If two nodes share lexemes and are the same instruction 0.5 If two nodes share lexemes and are enclosed in the same context 0.25 If two nodes have at least one feature in common 0.0 no match s(n1,n2)= 11
  • 35. (3) Tree Kernel: Kernel on entire Tree Structures ►We apply nodes comparison recursively to compute similarity between subtrees ►We aim to identify the maximum isomorphic tree/subtree 12
  • 36. Overall Process 1. Preprocessing 2. Extraction 3. Match Detection 4. Aggregation 13
  • 38. Evaluation Description ► We considered a small Java software system ○ We choose to identify clones at method level ► We checked system against the presence of up to Type 3 clones ○ Removed all detected clones through refactoring operations ► We manually and randomly injected a set of artificially created clones ○ One set for each type of clones ► We applied our prototype and CloneDigger* to mutated systems ► We evaluated performances in terms of Precision, Recall and F1 *http://clonedigger.sourceforge.net/ 14
  • 39. Results (1) ► Type 1 and Type 2 Clones: ○ We were able to detect all clones without any false positive ○ This was obtained also by CloneDigger ○ Both tools expressed the potential of AST-based approaches 15
  • 40. Results (2) ► Type 3 clones: ○ We classified results as “true Type 3 clones” according to different thresholds on similarity values ○ We measured performance on different thresholds We get best results with threshold equals to 0.70 16
  • 41. Conclusions and Future Works ► Measure performance on real systems and projects ○ Bellon's Benchmark ○ Investigate best results with 0.7 as threshold ○ Measure Time Performances ► Improve the scalability of the approach ○ Avoid to compare all pairs ► Improve similarity computation ○ Avoid manual weighting features ► Extend Supported Languages ○ Now we support Java, C, Python 17
  • 42. Thank you for listening. Questions? 18