UNSUPERVISED MACHINE LEARNING FOR CLONE DETECTION
Valerio Maggio, Ph.D.
June 25, 2013
valerio.maggio@unina.it
"Unsupervised Machine Learning for clone detection" presents the main ideas behind applying unsupervised machine learning techniques (kernel methods and data clustering) to the code clone detection task.

Published in: Technology, Education
Transcript of "Unsupervised Machine Learning for clone detection"

  1. UNSUPERVISED MACHINE LEARNING FOR CLONE DETECTION. Valerio Maggio, Ph.D. June 25, 2013. valerio.maggio@unina.it
  2. General Disclaimer: All the maths appearing in the next slides is only intended to better introduce the considered case studies. Speakers are not responsible for any possible disease or “brain consumption” caused by too many formulas. So BEWARE; use this information at your own risk! Its intention is solely educational. We would strongly encourage you to use this information in cooperation with a medical or health professional. (AwfulMaths)
  3. Number one in the stink parade is duplicated code. If you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them.
  4. JHOTDRAW: ImageMapOutputFormat.java, SVGOutputFormat.java
  5. CPYTHON 2.5.1
  6. PYTHON (NLTK)
  7-11. PROBLEM STATEMENT: CLONE DETECTION
       Software clones are fragments of code that are similar according to some predefined measure of similarity (I.D. Baxter, 1998).
       • Clones: textual similarity
       • Clones: functional similarity
       • Clones affect the reliability of the system! (Sneaky Bug!)
  12. DIFFERENT TYPES OF CLONES
  13. THE ORIGINAL ONE
       # Original Fragment
       def do_something_cool_in_Python(filepath, marker='---end---'):
           lines = list()
           with open(filepath) as report:
               for l in report:
                   if l.endswith(marker):
                       lines.append(l)  # Stores only lines that end with "marker"
           return lines  # Return the list of different lines
  14. TYPE 1: Exact Copy
       • Identical code segments except for differences in layout, whitespace, and comments
  15. TYPE 1: Exact Copy (clone of the original fragment in slide 13)
       def do_something_cool_in_Python (filepath, marker='---end---'):
           lines = list()  # This list is initially empty
           with open(filepath) as report:
               for l in report:  # It goes through the lines of the file
                   if l.endswith(marker):
                       lines.append(l)
           return lines
  16. TYPE 2: Parameter Substituted
       • Structurally identical segments except for differences in identifiers, literals, layout, whitespace, and comments
  17. TYPE 2: Parameter Substituted (clone of the original fragment in slide 13)
       # Type 2 Clone
       def do_something_cool_in_Python(path, end='---end---'):
           targets = list()
           with open(path) as data_file:
               for t in data_file:
                   if t.endswith(end):
                       targets.append(t)  # Stores only lines that end with "end"
           # Return the list of different lines
           return targets
  18. TYPE 3: Structure Substituted
       • Similar segments with further modifications such as changed, added (or deleted) statements, in addition to variations in identifiers, literals, layout and comments
  19-22. TYPE 3: Structure Substituted (clone of the original fragment in slide 13)
       import os
       def do_something_with(path, marker='---end---'):
           # Check if the input path corresponds to a file
           if not os.path.isfile(path):
               return None
           bad_ones = list()
           good_ones = list()
           with open(path) as report:
               for line in report:
                   line = line.strip()
                   if line.endswith(marker):
                       good_ones.append(line)
                   else:
                       bad_ones.append(line)
           # Return the lists of different lines
           return good_ones, bad_ones
  23. TYPE 4: “Functional” Copies
       • Semantically equivalent segments that perform the same computation but are implemented by different syntactic variants
  24. TYPE 4: “Functional” Copies (clone of the original fragment in slide 13)
       def do_always_the_same_stuff(filepath, marker='---end---'):
           report = open(filepath)
           file_lines = report.readlines()
           report.close()
           # Filters only the lines ending with marker
           return filter(lambda l: len(l) and l.endswith(marker), file_lines)
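The difference between Type 1 and Type 2 clones can be made concrete with a small token-level comparison. The sketch below is only an illustration and not the method presented in the talk: it uses Python's standard `tokenize` module, and the helper `normalized_tokens` and the three sample fragments are hypothetical names introduced here. Comments and blank lines are dropped (enough for Type 1), and identifiers and literals are optionally replaced by placeholders (enough for Type 2).

```python
import io
import keyword
import tokenize

# Tokens that only carry comments or blank lines are dropped.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}

def normalized_tokens(source, abstract_names=False):
    """Return the token strings of `source` without comments and layout.

    With abstract_names=True every identifier becomes "ID" and every
    number or string literal becomes "LIT".
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE):
            text = tokenize.tok_name[tok.type]  # keep structure, ignore width
        elif (abstract_names and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            text = "ID"
        elif abstract_names and tok.type in (tokenize.NUMBER, tokenize.STRING):
            text = "LIT"
        else:
            text = tok.string
        tokens.append(text)
    return tokens

ORIGINAL = """
def keep(filepath, marker='---end---'):
    lines = list()
    for l in open(filepath):
        if l.endswith(marker):
            lines.append(l)  # keep matching lines
    return lines
"""

TYPE_1 = """
def keep (filepath, marker='---end---'):
    lines = list()  # initially empty

    for l in open(filepath):
        if l.endswith(marker):
            lines.append(l)
    return lines
"""

TYPE_2 = """
def keep(path, end='---end---'):
    targets = list()
    for t in open(path):
        if t.endswith(end):
            targets.append(t)
    return targets
"""

same_type_1 = normalized_tokens(ORIGINAL) == normalized_tokens(TYPE_1)
same_type_2_raw = normalized_tokens(ORIGINAL) == normalized_tokens(TYPE_2)
same_type_2_abstract = (normalized_tokens(ORIGINAL, abstract_names=True)
                        == normalized_tokens(TYPE_2, abstract_names=True))
```

Exact equality of normalized token sequences cannot capture Type 3 or Type 4 clones, which is one reason the talk moves on to similarity measures over richer structures.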
  25. HTTPD 2.2.14: TYPE 1
  26. HTTPD 2.2.14: TYPE 2
  27. HTTPD 2.2.14: TYPE 3
  28-29. SOURCE CODE INFORMATION
  30. SOURCE CODE INFORMATION [Figure: syntax tree of the function parser_compare(node *left, node *right). Node labels: FUNCTION, PARAMS, PARAM, BODY, IF-STMT, COND, OR, ==, CALL-STMT (parser_compare_node), STRUCT-OP (st_node), RETURN-STMT.]
  31. SOURCE CODE INFORMATION [Figure: dependence graph. Node labels: ENTRY, EXIT, FORMAL-IN, FORMAL-OUT, ACTUAL-IN, ACTUAL-OUT, BODY, CONTROL-POINT, CALL-SITE, EXPR, RETURN.]
  32-36. STATE-OF-THE-ART TOOLS
       Duplix, Scorpio, PMD, CCFinder, Dup, CPD, Shinobi, Clone Detective, Gemini, iClones, KClone, ConQAT, Deckard, Clone Digger, JCCD, CloneDr, SimScan, CLICS, NiCAD, Simian, Duploc, Dude, SDD
       • Text Based Tools: text is compared line by line
       • Token Based Tools: token sequences are compared to each other
       • Syntax Based Tools: syntax subtrees are compared to each other
       • Graph Based Tools: (sub)graphs are compared to each other
  37-39. STATE-OF-THE-ART TECHNIQUES
       • String/Token based Techniques:
         • Pros: Run very fast
         • Cons: Too many false clones
       • Syntax based (AST) Techniques:
         • Pros: Well suited to detect structural similarities
         • Cons: Not properly suited to detect Type 3 Clones
       • Graph based Techniques:
         • Pros: The only ones able to deal with Type 4 Clones
         • Cons: Performance issues
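To make the syntax-based family more tangible, the following sketch groups statement subtrees by an exact fingerprint of their AST. It is a minimal illustration built on Python's standard `ast` module, not a reproduction of any tool listed above; `subtree_fingerprints`, `min_nodes` and `SAMPLE` are names assumed here for the example.

```python
import ast
from collections import defaultdict

def subtree_fingerprints(source, min_nodes=5):
    """Map each repeated statement subtree to the lines where it occurs.

    Two statements land in the same group only when their AST dumps are
    identical, so comments and spacing are ignored but any changed,
    added or removed statement breaks the match.
    """
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.stmt):
            continue
        size = sum(1 for _ in ast.walk(node))  # number of nodes in the subtree
        if size >= min_nodes:
            groups[ast.dump(node)].append(node.lineno)
    return {dump: lines for dump, lines in groups.items() if len(lines) > 1}

SAMPLE = """
def f(a):
    total = 0
    for x in a:
        total = total + x * 2
    return total

def g(a):
    # same loop, different comment and spacing
    total = 0
    for x in a:
        total = total +  x * 2
    print(total)
"""

clone_lines = sorted(subtree_fingerprints(SAMPLE).values())
```

Here the `for` loops and their inner assignments are reported as duplicated subtrees, while the two enclosing functions are not, because their last statements differ. This is the limitation noted in the list above: exact subtree matching handles structural copies well but not Type 3 clones.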
  40-45. USE MACHINE LEARNING, LUKE
       • Provides computationally effective solutions to analyze large data sets
       • Provides solutions that can be tailored to different tasks/domains
       • Requires considerable effort in:
         • the definition of the relevant information best suited for the specific task/domain
         • the application of the learning algorithms to the considered data
  46-47. UNSUPERVISED LEARNING
       • Supervised Learning:
         • Learn from labelled samples
       • Unsupervised Learning:
         • Learn (directly) from the data
       Learn by examples
       (+) No cost of labeling samples
       (-) Trade-off imposed on the quality of the data
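As a rough illustration of learning directly from the data, the sketch below groups code fragments using only a pairwise similarity and a threshold, with no labelled examples. It is an assumption-laden toy: the Jaccard similarity over token sets, the single-linkage grouping and the threshold value are choices made here for brevity, not the clustering used in the talk.

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity of two token collections: shared tokens / all tokens."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster(items, similarity, threshold):
    """Single-linkage grouping: link items whose similarity >= threshold."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if similarity(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

fragments = [
    "for x in a : total += x".split(),
    "for y in a : total += y".split(),
    "return open ( path ) . read ( )".split(),
]
clusters = cluster(fragments, jaccard, threshold=0.7)
```

The first two fragments end up in one group and the third stays alone. The quality of such groups depends entirely on the similarity measure, which is where the kernels discussed next come in.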
  48-49. CODE STRUCTURES: KERNELS FOR STRUCTURES
       Computation of the dot product between (graph) structures: K( · , · )
       • Abstract Syntax Tree (AST): tree structure representing the syntactic structure of the different instructions of a program (function)
       • Program Dependence Graph (PDG): (directed) graph structure representing the relationships among the different statements of a program
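The idea of a kernel as a dot product between structures can be sketched with a deliberately simple feature map: each code fragment is represented by the counts of its AST node types, and the kernel is the dot product of those count vectors. This bag-of-node-types kernel is a simplification chosen for illustration, not the tree kernel of the talk; `node_type_counts`, `kernel` and `normalized_kernel` are hypothetical helpers.

```python
import ast
import math
from collections import Counter

def node_type_counts(source):
    """Feature map: how many times each AST node type occurs in `source`."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

def kernel(x, y):
    """Dot product of two sparse count vectors."""
    return sum(count * y[label] for label, count in x.items())

def normalized_kernel(x, y):
    """Scale the kernel to [0, 1]: K(x, y) / sqrt(K(x, x) * K(y, y))."""
    return kernel(x, y) / math.sqrt(kernel(x, x) * kernel(y, y))

increment_x = node_type_counts("x = x + 1")
increment_y = node_type_counts("y = y + 1")
loop = node_type_counts("while x < y:\n    x = x + 1")

renamed_similarity = normalized_kernel(increment_x, increment_y)
loop_similarity = normalized_kernel(increment_x, loop)
```

The two renamed assignments get the maximum similarity because node types ignore identifiers, while the loop scores lower. A real structural kernel would also compare how the nodes are arranged, not only how often they occur.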
  50. KERNEL FOR CLONES: CODE
  51. KERNEL FOR CLONES: CODE, AST [Figure: two code fragments and their ASTs. The first AST is a while loop with condition x < y whose block holds the assignments x = x + 1 and y = y - 1. The second AST is an if statement with condition b > 0 whose block holds the assignment c = 3 and a while loop with condition b > a, whose block holds a = a + 1 and b = b - 1.]
  52. KERNEL FOR CLONES: CODE, AST, AST KERNEL [Figure: the subtrees of the two ASTs (while/block, assignments, comparisons) that are compared by the AST kernel.]
  53-58. KERNELS FOR CODE STRUCTURES: AST KERNEL FEATURES
       • Instruction Class (IC): e.g., LOOP, CALL, CONDITIONAL_STATEMENT
       • Instruction (I): e.g., FOR, IF, WHILE, RETURN
       • Context (C): i.e., Instruction Class of the closest statement node
       • Lexemes (Ls): lexical information gathered (recursively) from leaves
       Example on the AST of a while loop with condition x < y:
       • while node: IC = Loop, I = while-loop, C = Function-Body, Ls = [x, y]
       • < node: IC = Conditional-Expr, I = Less-operator, C = Loop, Ls = [x, y]
       • block node: IC = Block, I = while-body, C = Loop, Ls = [x]
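The four node features listed above (IC, I, C, Ls) can be approximated on Python's own AST. The sketch below is only a guess at how such an annotation could look: the `INSTRUCTION_CLASS` table, the use of Python node type names as instructions and the `"MODULE"` root context are all assumptions introduced for this example, not the feature definitions used by the author's tool.

```python
import ast

# Hypothetical mapping from Python AST node types to instruction classes.
INSTRUCTION_CLASS = {
    ast.For: "LOOP",
    ast.While: "LOOP",
    ast.If: "CONDITIONAL_STATEMENT",
    ast.Call: "CALL",
    ast.Compare: "CONDITIONAL_EXPR",
    ast.FunctionDef: "FUNCTION_BODY",
}

def lexemes(node):
    """Identifiers and constants gathered recursively below `node`."""
    found = []
    for child in ast.walk(node):
        if isinstance(child, ast.Name):
            found.append(child.id)
        elif isinstance(child, ast.Constant):
            found.append(repr(child.value))
    return found

def node_features(tree):
    """Yield (IC, I, C, Ls) for every node with a known instruction class."""
    def visit(node, context):
        ic = INSTRUCTION_CLASS.get(type(node))
        if ic is not None:
            yield ic, type(node).__name__, context, lexemes(node)
        # A classified statement becomes the context of its nested nodes.
        inner = ic if isinstance(node, ast.stmt) and ic else context
        for child in ast.iter_child_nodes(node):
            yield from visit(child, inner)
    yield from visit(tree, "MODULE")

SOURCE = """
def f(x, y):
    while x < y:
        x = x + 1
"""
features = list(node_features(ast.parse(SOURCE)))
```

For the comparison `x < y` this yields the class `CONDITIONAL_EXPR` in a `LOOP` context with lexemes `x` and `y`, which mirrors the annotation of the `<` node in the example above.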
  59. CLONE DETECTION RESULTS
       • Comparison with another (pure) AST-based clone detector
       • Comparison on a system with randomly seeded clones
       [Figure: bar chart of Precision, Recall and F-measure (scale 0 to 1) for CloneDigger and the Tree Kernel Tool.]
       Results refer to clones where code fragments have been modified by adding/removing or changing code statements
  60. Precision, Recall and F-Measure
       [Figure: Precision, Recall and F1 curves (y-axis 0 to 1) plotted over x-axis values from 0.6 to 0.98.]
       • Precision: How accurate are the obtained results? (Alternatively: how many errors do they contain?)
       • Recall: How complete are the obtained results? (Alternatively: how many clones have been retrieved with respect to the total number of clones?)
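The three measures can be computed from the set of clone pairs a detector reports and a reference set of known clones. The sketch below uses the standard definitions; the function name and the sample pairs are made up for illustration and are not data from the talk.

```python
def precision_recall_f1(reported, reference):
    """Return (precision, recall, F1) of reported clones against a reference."""
    reported, reference = set(reported), set(reference)
    true_positives = len(reported & reference)
    # Precision: fraction of reported clones that are real clones.
    precision = true_positives / len(reported) if reported else 0.0
    # Recall: fraction of real clones that were reported.
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical clone pairs, identified by fragment names.
reported_pairs = {("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")}
reference_pairs = {("a", "b"), ("c", "d")}

precision, recall, f1 = precision_recall_f1(reported_pairs, reference_pairs)
```

In this made-up case every real clone is found (recall 1.0) but half of the reported pairs are wrong (precision 0.5), which is the typical trade-off such curves visualize.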
  61-63. CODE STRUCTURES: PDG NODES AND EDGES
       [Figure: PDG fragment with the nodes while, call-site, arg and expr.]
       • Two Types of Nodes
         • Control Nodes (dashed ones): e.g., if, for, while, function calls...
         • Data Nodes: e.g., expressions, parameters...
       • Two Types of Edges (i.e., dependencies)
         • Control edges (dashed ones)
         • Data edges
  64-65. KERNELS FOR CODE STRUCTURES: PDG (GRAPH KERNELS FOR PDG)
       • Features of nodes:
         • Node Label: e.g., WHILE, CALL-SITE, EXPR, ...
         • Node Type: i.e., Data Node or Control Node
       • Features of edges:
         • Edge Type: i.e., Data Edge or Control Edge
       Example: for the while node, Node Label = WHILE and Node Type = Control Node.
       [Figure: PDG fragment with the nodes while, call-site, arg, expr and expr, connected by control edges and data edges.]
  66-69. GRAPH KERNELS FOR PDG
       [Figure: two PDG fragments being compared. The first has the nodes while, call-site, arg, expr and expr; the second has the nodes while, call-site, arg, expr and call-site.]
       • Goal: Identify common subgraphs
       • Selectors: Compare nodes to each other and explore the subgraphs of only “compatible” nodes (i.e., nodes of the same type)
       • Context: The subgraph of a node (with paths whose lengths are at most L to avoid loops)
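The selector and context ideas can be sketched as follows. This is a toy under stated assumptions, not the graph kernel of the talk: a PDG is a pair `(nodes, edges)` of plain dictionaries, the context of a node is the multiset of labelled paths of at most `max_len` edges leaving it, and two contexts are compared by counting shared paths. The edges of the two sample graphs are invented, since the slide only shows the node labels.

```python
from collections import Counter

def contexts(graph, start, max_len):
    """Count the labelled paths of at most `max_len` edges leaving `start`."""
    nodes, edges = graph
    paths = Counter()
    stack = [(start, (nodes[start][0],), 0)]
    while stack:
        node, path, length = stack.pop()
        paths[path] += 1
        if length < max_len:  # the bound keeps cyclic graphs finite
            for target, edge_type in edges.get(node, ()):
                step = path + (edge_type, nodes[target][0])
                stack.append((target, step, length + 1))
    return paths

def pdg_kernel(g1, g2, max_len=2):
    """Sum, over compatible node pairs, the number of shared context paths."""
    total = 0
    for n1, (_, type1) in g1[0].items():
        c1 = contexts(g1, n1, max_len)
        for n2, (_, type2) in g2[0].items():
            if type1 != type2:  # selector: only nodes of the same type
                continue
            c2 = contexts(g2, n2, max_len)
            total += sum(min(count, c2[path]) for path, count in c1.items())
    return total

# Hypothetical edges: node id -> [(target id, edge type)].
EDGES = {0: [(1, "control"), (3, "control")], 1: [(2, "data")], 3: [(4, "data")]}
G1 = ({0: ("WHILE", "control"), 1: ("CALL-SITE", "control"), 2: ("ARG", "data"),
       3: ("EXPR", "data"), 4: ("EXPR", "data")}, EDGES)
G2 = ({0: ("WHILE", "control"), 1: ("CALL-SITE", "control"), 2: ("ARG", "data"),
       3: ("EXPR", "data"), 4: ("CALL-SITE", "control")}, EDGES)

self_similarity = pdg_kernel(G1, G1)
cross_similarity = pdg_kernel(G1, G2)
```

With these sample graphs the score of a graph against itself is higher than the score across the two graphs, because the last node differs in both label and type. Bounding the path length is what keeps the exploration cheap and terminating on graphs with loops.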
  70. SCENARIO-BASED EVALUATION
  71. FUTURE RESEARCH DIRECTIONS
  72. PROBLEM STATEMENT: (MODEL) CLONE DETECTION
       • Models: models are typically represented visually, as box-and-arrow diagrams, and the clones we are searching for are similar subgraphs of these diagrams.
       • Model Granularity: models can be represented at different levels of granularity (as with source code), corresponding to different syntactic (and semantic) units.
       • Model clones are categorized into three different types.
  73. REFERENCE EXAMPLE
  74. TYPE 1 CLONES: (MODEL) CLONE DETECTION
       • Type 1 (exact) model clones: Identical model fragments except for variations in visual presentation, layout and formatting.
  75. TYPE 2 CLONES: (MODEL) CLONE DETECTION
       • Type 2 (renamed) model clones: Structurally identical model fragments except for variations in labels, values, types, visual presentation, layout and formatting.
       Example: model@Friction Mode Logic/Break Apart Detection and model@Friction Mode Logic/Lockup Detection/Required Friction for Lockup
  76. TYPE 3 CLONES: (MODEL) CLONE DETECTION
       • Type 3 (near-miss) model clones: Model fragments with further modifications, such as changes in position or connection with respect to other model fragments and small additions or removals of blocks or lines, in addition to variations in labels, values, types, visual presentation, layout and formatting.
       Example: model@Speed.speed_estimation and model@Throttle.throttle_estimation
  77. MODELS AS SOURCE CODE
  78. THANK YOU. Valerio Maggio, Ph.D., University of Naples “Federico II”. valerio.maggio@unina.it