This is not the original version presented at the WCRE 2012 conference (no animation, etc.)




Detecting Clones across Microsoft .NET Programming Languages

Farouq Al-omari
Iman Keivanloo
Chanchal K. Roy
Juergen Rilling
Contact: keivanloo@ieee.org

Working Conference on Reverse Engineering, Kingston, Canada, 18 October 2012, I. Keivanloo
[Figure: Mergesort and Merge examples, with the clones (Mergesort) highlighted]
Clone Detection across Languages
General Solution
 •   C#
 •   VB.NET
 •   J#
 •   F#
 •   COBOL (.NET)
 •   Java

Intermediate Language (IL) (low level): the solution is to use this (instead of dealing with several languages)
Clone Detection across Languages using IL
Is there any chance this works?

                          Input Data Type
            CIL                                Source Code
Dataset     # Clone Class   # Clone Fragment   # Clone Class   # Clone Fragment
ASXGUI      9               393                69              261
Mono        37              4373               369             1523

• Up to 3 times more cloned fragments detected using IL
  (Mono: 4373 vs. 1523 fragments, roughly 2.9 times; ASXGUI: 393 vs. 261, roughly 1.5 times)
Clone Detection across Languages using IL
Observed Challenges (using an example)

VB.NET
Sub Main()
   Dim x As Integer
   x = 10
   If x < 0 Then
      x += 1
   Else
      Console.WriteLine("Positive number")
   End If
End Sub

C#
static void Main(string[] args){
   int x=10;
   if(x<0)
      x++;
   else
      Console.WriteLine("Positive number");
}
Clone Detection across Languages using IL
Observed Challenges (using an example)
[Figure: the same VB.NET and C# example compared as VB, IL from VB, C#, and IL from C#]
Clone Detection across Languages using IL
Observed Challenges
1- Larger, unpredictable size at IL level
2- Higher dissimilarity at IL level
[Keivanloo IWSC’12]
Observed Challenges #2: High Dissimilarity
Noise
Major noise types:
•   Line numbers
•   Pointers to line number
•   Push, Pop …
•   Detailed Data Type data

• Sample IL
IL_000c: ldloc.0
IL_000d: ldc.i4.1
IL_000e: add.ovf
IL_000f: stloc.0
IL_0010: br.s     IL_0024
IL_0012: nop
IL_0013: ldstr "Positive number"
IL_0018: call    void [mscorlib]System.Console::WriteLine(string)
Clone Detection across Languages using IL
 The Core Solution
 • The Challenge: Noise
 • Solution: Data cleansing (filtering out noise)
 • Why? (Answer: to increase recall)

[Figure: Source Code, then IL + noise, then IL - noise]
Filters for noise reduction

Our Filter Set   Before Filtering               After Filtering   Example Description
Filter 1         IL_0003: stloc.0               stloc.0           IL_0003 (instruction address)
Filter 2         brtrue.s IL_0015               brtrue.s          IL_0015 is the address of the branch destination
Filter 3         ldarg 3                        ldarg             The values 3 and 1 represent the argument number
                 starg 1                        starg
Filter 4         ldc.i4.s 10                    ldc.i4.s          10 is the number (pushed to the stack)
Filter 5         ldstr "Positive number"        ldstr             "Positive number" is the printed string constant
Filter 6         stloc 7                        stloc             7 represents the variable index
Filter 7         ldc.i4.s 10                    ldc               i4 represents the int32 data type in CIL and s stands for short
Filter 8         IL_0011: add                   add               Note that Filter 8 is just a nickname. Refer to the
                 IL_0012: stloc.0               stloc             Filter 8 description section for more details.
                 IL_0013: br.s     IL_0020      br
                 IL_001a: call     void         call
                 [mscorlib]System.Console::WriteLine (string)
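The combined effect of the filters in the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression, the function names `filter_il_line` and `filter_il`, and the idea of applying all filters in one pass are hypothetical and not taken from the slides, which do not show the authors' implementation. The sketch reproduces the "After Filtering" column of the Filter 8 row (bare opcode families only).

```python
import re

# Hypothetical pattern for the instruction address prefix (Filter 1),
# e.g. "IL_000c: ". Not the authors' actual implementation.
LABEL = re.compile(r"^\s*IL_[0-9a-fA-F]+:\s*")


def filter_il_line(line: str) -> str:
    """Reduce one ildasm-style text line to its bare opcode family."""
    line = LABEL.sub("", line).strip()      # drop the instruction address
    if not line:
        return ""
    opcode = line.split()[0]                # drop operands: branch targets,
                                            # argument/variable numbers,
                                            # numeric and string constants
    return opcode.split(".")[0]             # drop suffixes such as .i4, .s, .0, .ovf


def filter_il(lines):
    """Apply the combined filter to a sequence of IL lines, skipping empty ones."""
    filtered = (filter_il_line(line) for line in lines)
    return [op for op in filtered if op]


if __name__ == "__main__":
    sample = [
        'IL_000c: ldloc.0',
        'IL_000d: ldc.i4.1',
        'IL_000e: add.ovf',
        'IL_000f: stloc.0',
        'IL_0010: br.s     IL_0024',
        'IL_0012: nop',
        'IL_0013: ldstr "Positive number"',
        'IL_0018: call    void [mscorlib]System.Console::WriteLine(string)',
    ]
    print(filter_il(sample))
```

Applied to the sample IL shown earlier, the sketch keeps only the opcode sequence ldloc, ldc, add, stloc, br, nop, ldstr, call. Applying every filter at once is the most aggressive setting; a real tool would presumably let each filter be switched on or off separately.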
Clone Detection across Languages using IL
     Filtering Advantage: Recall Improvement

For the VB.NET and C# example:
Before filtering noise: ~50% similarity
After filtering noise: ~90% similarity
Disadvantage of Noise Reduction
Danger!
• Data loss
• What if we remove important data during data cleansing?
• Might mislead the detection by making non-cloned pairs identical
  (possible negative effect on precision)

[Figure: Filtering Color Data]
RQ: Are They (Filters) Dangerous?
Evaluation Preparation
1. Filter Contribution Formula:




2. Dataset preparation:
  – Controlled dataset (iText.NET J#), 25 pairs * 3 languages
    1. The Cloned Dataset (VB-C#, VB-J#, and C#-J#)
    2. The Noncloned Dataset (VB-C#, VB-J#, and C#-J#)
RQ: Are They (Filters) Dangerous?
      Filter Contribution - Study #1
      • Are they harmful? (The answer is NO: based on the following graphs, the filters do not remove a
        similar amount of data from actual clones and from noncloned code fragments)

[Figure: filter contribution for the Cloned Dataset (marked value 0.3) and the NonCloned Dataset (marked value 0.2): a strong threshold for the Judge to decide]
RQ: Are They (Filters) Dangerous?
Filter Contribution - Study #2
• Are they useful?
(The answer is YES: based on the given figure, our filters help to discriminate between actual clones
and noncloned fragments; therefore, it is possible to separate them with high confidence using the
chosen threshold)
RQ: Are They (Filters) Dangerous?
Filter Contribution - Study #3
• Does filtering make actual clone pairs and noncloned pairs similar?
  (We used Chernoff faces, a kind of glyph, to see whether the filters make noncloned pairs similar to
  cloned code. Each face represents a pair. As you can see, the faces in Group A differ from those in
  Group B in most cases.)

Final Conclusion:
Filters help to discriminate between cloned and noncloned fragments
An Interesting Unexpected Discovery
Language dependency!

Corresponding faces in each group are not similar, even though all of them are extracted from a single
language (IL). Look especially at the C#-J# faces: all of them differ from the other groups. This is an
interesting discovery: the original high-level programming languages affect similarity at the IL level.
Clone Detection across Languages using IL
    Our Clone Detection Framework

Input: .NET Code: Source Code; MS .NET EXE & DLL; IlDasm.exe; CIL (plain text)
CIL Manipulation for Clone Detection: Proposed Filtering Mechanism
Clone Detection Algorithms: LCS-based (from NiCad); SimHash-based (from SimCad); Levenshtein Distance-based
Clone Analysis: Clone Clusters; Merging; Source Code Mapping
Reporting: Report (CIL); Report (Src Code)
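As a rough illustration of the LCS-based comparison named in the framework, the following Python sketch computes a longest-common-subsequence similarity between two filtered IL opcode sequences. The function names and the normalization (LCS length divided by the length of the longer sequence) are assumptions made for this example; the slides do not specify how NiCad's comparison is normalized inside the framework.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Assumed normalization: LCS length over the longer sequence length."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # Hypothetical filtered opcode sequences for two fragments.
    first = ["add", "stloc", "br", "call"]
    second = ["add", "stloc", "nop", "br", "call"]
    print(lcs_similarity(first, second))
```

Because the comparison works on opcode sequences rather than raw text, it only makes sense after the noise filters have been applied to both fragments.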
The Selected Datasets for Performance Evaluation

     Dataset      Language   Files   LOC      Methods
     ASXGUI 2.5   VB.NET     47      32,594   303
     ASXGUI 3.0   C#         19      2,088    78

     Dataset      Language   Files   LOC      Methods
     Mono 2.10    VB.NET     375     -        -
     Mono 2.10    C#         57      -        -
     Total                   432     -        4,998

     Dataset      Language   Files   LOC      Methods
     iText        C#         -       -        -
     iText.NET    J#         -       -        -
     Total                   2.5 K   600 K

    4th Dataset: iText.NET dataset from 1st case study
Clone Detection across Languages using IL
Our Clone Detection Framework Performance
[Figure: framework performance; pay attention to changes within 0.6 … 0.8]
Clone Detection across Languages using IL
Our Clone Detection Framework
• 2K clone pairs manually investigated

Thresholds: 0.6 Normal, 0.7 High, 0.8 Extreme
TP = {E and S}

Precision
The optimum, considering the trade-off between precision and recall, was achieved using
Levenshtein Distance-based comparison with the High threshold (80% TP)

Recall
(iText.NET API) 76% using the High threshold between three languages (C#, J#, and VB.NET).
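The Levenshtein Distance-based comparison and the three threshold levels can be sketched as follows. The threshold names and values (0.6 Normal, 0.7 High, 0.8 Extreme) come from the slide; everything else is an assumption for illustration: the slides do not say how distance is turned into a similarity score, so this sketch uses 1 minus the edit distance divided by the longer sequence length, and treats a pair as a clone candidate when that score reaches the threshold. The function names are hypothetical.

```python
# Threshold levels as named on the slide.
THRESHOLDS = {"Normal": 0.6, "High": 0.7, "Extreme": 0.8}


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_clone(a, b, level="High"):
    """Assumed decision rule: similarity at or above the chosen threshold."""
    return similarity(a, b) >= THRESHOLDS[level]


if __name__ == "__main__":
    # Hypothetical filtered opcode sequences for two fragments.
    first = ["ldloc", "ldc", "add", "stloc", "br", "nop", "ldstr", "call"]
    second = ["ldloc", "ldc", "add", "stloc", "br", "ldstr", "call"]
    print(similarity(first, second), is_clone(first, second, "Extreme"))
```

Raising the threshold from Normal to Extreme trades recall for precision, which is the trade-off the slide reports.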
An Interesting Clone Detected by Our Approach

C#
          private static string filename_nodir(string name)
          {
             int slash = -1, len = name.Length;
             for (int i = 0; i < len; i++)
             {
                string sub = name.Substring(i, 1);
                if (sub == "" || sub == "/")
                   slash = i;
             }
             slash++;
             return name.Substring(slash, len - slash);
          }

VB.NET
          Function Filename_Nodir() As String
             Dim intFileName As Integer, intSlash As Integer, strFilename As String
             strFileName = editvid.video
             For intFilename = 1 To len(strFileName)
                If mid(strfilename, intfilename, 1) = "" Or mid(strfilename, intfilename, 1) = "/" Then
                   intslash = intFilename
                End If
             Next
             Return mid(strFileName, intSlash + 1, len(strFilename) - intSlash)
          End Function

*The matching algorithm was limited to the content available within the boxes (it was NOT aware of the identical method names)
Summary
• The first comprehensive research focusing on
    (1) .NET clone detection,
    (2) across programming languages,
    and (3) using Intermediate Language
• Identified challenges in cross-language clone detection using IL




Related Publication
Iman Keivanloo, Chanchal K. Roy, Juergen Rilling, “Java Bytecode Clone Detection via Relaxation on Code Fingerprint and Semantic Web Reasoning,” 6th International Workshop on Software Clones (IWSC), 2012.
Contact: keivanloo@ieee.org

Any questions?

The History of Stoke Newington Street Names
 
How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
 
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem studentsRHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
 
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
 
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
 
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
 
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skillsspot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
 
How to deliver Powerpoint Presentations.pptx
How to deliver Powerpoint  Presentations.pptxHow to deliver Powerpoint  Presentations.pptx
How to deliver Powerpoint Presentations.pptx
 
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdfREASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
 
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
 

Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

  • 1. This is not the original version given in the WCRE 2012 conference (no animation etc.) Detecting Clones across Microsoft .NET Programming Languages Farouq Al-omari Iman Keivanloo Chanchal K. Roy Juergen Rilling Contact: keivanloo@ieee.org Working Conference on Reverse Engineering, Canada, Kingston 18 October 2012 – I. Keivanloo
  • 2. Mergesort Merge Mergesort Clones (Mergesort) Mergesort
  • 3. Clone Detection across Languages: General Solution. Source languages: C#, VB.NET, J#, F#, COBOL (.NET), Java. Common target: Intermediate Language (IL) (low level). The solution is to use this (instead of dealing with several languages).
  • 4. Clone Detection across Languages using IL: Is there any chance to work?
      Input data type:  CIL                                  Source code
      Dataset           # Clone class   # Clone fragment    # Clone class   # Clone fragment
      ASXGUI            9               393                  69              261
      Mono              37              4373                 369             1523
      • Up to 3 times more cloned fragments detected using IL
  • 5. Clone Detection across Languages using IL: Observed Challenges (using an example)
      VB.NET:
          Sub Main()
              Dim x As Integer
              x = 10
              If x < 0 Then
                  x += 1
              Else
                  Console.WriteLine("Positive number")
              End If
          End Sub
      C#:
          static void Main(string[] args){
              int x=10;
              if(x<0)
                  x++;
              else
                  Console.WriteLine("Positive number");
          }
  • 6. Clone Detection across Languages using IL: Observed Challenges (using an example). Four columns side by side: VB, IL from VB, C#, IL from C# (the same VB.NET and C# fragments as on slide 5).
  • 7. Clone Detection across Languages using IL: Observed Challenges (the same VB.NET and C# fragments as on slide 5). 1- Larger unpredictable size at IL level [Keivanloo IWSC’12] 2- Higher dissimilarity at IL level
  • 8. Observed Challenges #2: High Dissimilarity (Noise)
      Major noise types:
      • Line numbers
      • Pointers to line number
      • Push, Pop …
      • Detailed data type data
      Sample IL:
          IL_000c: ldloc.0
          IL_000d: ldc.i4.1
          IL_000e: add.ovf
          IL_000f: stloc.0
          IL_0010: br.s IL_0024
          IL_0012: nop
          IL_0013: ldstr "Positive number"
          IL_0018: call void [mscorlib]System.Console::WriteLine(string)
  • 9. Clone Detection across Languages using IL: The Core Solution • The Challenge: Noise • Solution: Data cleansing (filtering noises) • Why? (Answer: to increase recall) Figure: Source Code, then IL + noise, then IL - noise
  • 10. Our Filter Set: filters for noise reduction (example before filtering | after filtering | description)
      Filter 1: IL_0003: stloc.0 | stloc.0 | IL_0003 is the instruction address
      Filter 2: brtrue.s IL_0015 | brtrue.s | IL_0015 is the address of the branch destination
      Filter 3: ldarg 3, starg 1 | ldarg, starg | The values 3 and 1 represent the argument number
      Filter 4: ldc.i4.s 10 | ldc.i4.s | 10 is the number (pushed to the stack)
      Filter 5: ldstr "Positive number" | ldstr | "Positive number" is the printed string constant
      Filter 6: stloc 7 | stloc | 7 represents the variable index
      Filter 7: ldc.i4.s 10 | ldc | i4 represents the int32 data type in CIL and s stands for short
      Filter 8: IL_0011: add, IL_0012: stloc.0, IL_0013: br.s IL_0020, IL_001a: call void [mscorlib]System.Console::WriteLine (string) | add, stloc, br, call | Note that Filter 8 is just a nickname. Refer to the Filter 8 description section for more details.
  • 11. Clone Detection across Languages using IL: Filtering Advantage: Recall Improvement. Before filtering noises: ~50% similarity between the VB.NET and C# versions. After: ~90% similarity. (Example: the same VB.NET and C# fragments as on slide 5.)
  • 12. Disadvantage of Noise Reduction: Danger! • Data loss • What if we remove important data during data cleansing? • It might mislead the detection by making non-cloned pairs identical, with a possible negative effect on precision. Figure: filtering color data
  • 13. RQ: Are They (Filters) Dangerous? Evaluation Preparation 1. Filter Contribution formula 2. Dataset preparation: controlled dataset (iText.NET J#), 25 pairs * 3 languages: 1. The Cloned Dataset (VB-C#, VB-J#, and C#-J#) 2. The Noncloned Dataset (VB-C#, VB-J#, and C#-J#)
  • 14. RQ: Are They (Filters) Dangerous? Filter Contribution - Study #1 • Are they harmful? (The answer is NO: based on the following graphs, filters do not remove a similar amount of data from actual clones vs. noncloned code fragments.) Figure panels: Cloned Dataset (0.3) and NonCloned Dataset (0.2). A strong threshold for the Judge to decide.
  • 15. RQ: Are They (Filters) Dangerous? Filter Contribution - Study #2 • Are they useful? (The answer is YES: based on the given figure, our filters help to discriminate between actual clones and noncloned fragments; therefore it is possible to separate them with high confidence with the chosen threshold.)
  • 16. RQ: Are They (Filters) Dangerous? Filter Contribution - Study #3 • Does filtering make actual clone pairs and noncloned pairs similar? (We used Chernoff faces, i.e. glyphs, to see if filters make noncloned pairs similar to cloned code. Each face represents a pair. As you can see, faces in Group A are different from Group B in most cases.) Final conclusion: filters contribute to discriminating between cloned and noncloned fragments.
  • 17. An Interesting Unexpected Discovery: Language dependency! Corresponding faces in each group are not similar, although all of them are extracted from a single language (IL). Look especially at the C#-J# faces: all of them are different from the other groups. This is an interesting discovery: the original high-level programming languages affect similarity at the IL level.
  • 18. Clone Detection across Languages using IL: Our Clone Detection Framework. Diagram stages: Input: .NET Code (Source Code; MS .NET EXE & DLL) | CIL Manipulation for Clone Detection (IlDasm.exe; CIL (plain text); Proposed Filtering Mechanism) | Clone Detection Algorithms (LCS-based (from NiCad); SimHash-based (from SimCad); Levenshtein Distance-based) | Clone Analysis (Clone Clusters; Merging; Source Code Mapping) | Reporting (Report (CIL); Report (Src Code))
  • 19. The Selected Datasets for Performance Evaluation
      Dataset      Language  File   LOC     Method
      ASXGUI 2.5   VB.NET    47     32,594  303
      ASXGUI 3.0   C#        19     2088    78
      Mono 2.10    VB.NET    375    -       -
      Mono 2.10    C#        57     -       -
      Total                  432    -       4998
      iText        C#        -      -       -
      iText.NET    J#        -      -       -
      Total                  2.5 K  600 K
      4th dataset: iText.NET dataset from the 1st case study
  • 20. Clone Detection across Languages using IL Our Clone Detection Framework Performance Pay attention to changes within 0.6 … 0.8 20
  • 21. Clone Detection across Languages using IL: Our Clone Detection Framework • 2K clone-pairs manually investigated • Precision: the optimum, considering the trade-off between precision and recall, was achieved using Levenshtein Distance-based comparison with the High threshold (80% TP); TP = {E and S} • Recall (iText.NET API): 76% using the High threshold between three languages (C#, J#, and VB.NET) • Threshold labels: 0.6 Normal, 0.7 High, 0.8 Extreme
  • 22. An Interesting Clone Detected by Our Approach
      C#:
          private static string filename_nodir(string name)
          {
              int slash = -1, len = name.Length;
              for (int i = 0; i < len; i++)
              {
                  string sub = name.Substring(i, 1);
                  if (sub == "" || sub == "/")
                      slash = i;
              }
              slash++;
              return name.Substring(slash, len - slash);
          }
      VB.NET:
          Function Filename_Nodir() As String
              Dim intFileName As Integer, intSlash As Integer, strFilename As String
              strFileName = editvid.video
              For intFilename = 1 To len(strFileName)
                  If mid(strfilename, intfilename, 1) = "" Or mid(strfilename, intfilename, 1) = "/" Then
                      intslash = intFilename
                  End If
              Next
              Return mid(strFileName, intSlash + 1, len(strFilename) - intSlash)
          End Function
      *The matching algorithm was limited to the content available within the boxes (it was NOT aware of the identical method names)
  • 23. Summary • The first comprehensive research focusing on (1) .NET clone detection, (2) across programming languages, and (3) using Intermediate Language • Identified challenges in cross-language clone detection + IL. Figure: the clone detection framework (as on slide 18)
  • 24. Related Publication Iman Keivanloo, Chanchal K. Roy, Juergen Rilling, “Java Bytecode Clone Detection via Relaxation on Code Fingerprint and Semantic Web Reasoning,” 6th International Workshop on Software Clones (IWSC), 2012.
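
The filter set on slide 10 lends itself to a small text-processing sketch. The Python below is only an illustration under stated assumptions, not the authors' implementation: the function names and regular expressions are hypothetical, and the combined filter simply keeps the opcode family (the part of the mnemonic before the first dot), which reproduces the "after filtering" column shown for Filter 8 on that slide.

```python
import re

# Assumed textual shape of ildasm output: "IL_xxxx:" labels with four hex digits.
ADDRESS = re.compile(r"^\s*IL_[0-9a-fA-F]{4}:\s*")     # leading "IL_0003:" label
BRANCH_TARGET = re.compile(r"\s+IL_[0-9a-fA-F]{4}$")   # trailing "IL_0015" operand


def strip_address(line: str) -> str:
    """Filter 1 (sketch): drop the instruction address at the start of a line."""
    return ADDRESS.sub("", line).strip()


def strip_branch_target(line: str) -> str:
    """Filter 2 (sketch): drop the destination address of a branch instruction."""
    return BRANCH_TARGET.sub("", line.strip())


def strip_string_literal(line: str) -> str:
    """Filter 5 (sketch): drop the string constant loaded by ldstr."""
    return re.sub(r'^(ldstr)\s+".*"$', r"\1", line.strip())


def base_opcode(line: str) -> str:
    """Combined filter (sketch): keep only the opcode family, e.g. 'stloc.0' -> 'stloc'."""
    body = strip_address(line)
    if not body:
        return ""
    mnemonic = body.split()[0]
    return mnemonic.split(".")[0]


def filter_method(lines):
    """Apply the combined filter to every non-empty CIL line of a method body."""
    return [op for op in (base_opcode(line) for line in lines) if op]


if __name__ == "__main__":
    sample = [
        "IL_0011: add",
        "IL_0012: stloc.0",
        "IL_0013: br.s IL_0020",
        "IL_001a: call void [mscorlib]System.Console::WriteLine(string)",
    ]
    print(filter_method(sample))  # ['add', 'stloc', 'br', 'call']
```

Reducing every instruction to its opcode family is the most aggressive choice; applying the single-purpose functions one at a time keeps more information and is closer to evaluating Filters 1 to 7 separately.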

Editor's Notes

  1. In this paper we answered some of the very basic research questions related to this topic. A general clone detection framework.
  2. This talk is about source code clones, and I am going to use sheep to represent clones. Suppose that there is a sheep which is doing Mergesort, and the other sheep are also doing Mergesort. I can detect them as clone groups since they are identical. So far there is no problem; however, it becomes challenging when we want to find sheep from other planets which are doing Mergesort as well.
     Two code fragments that share some degree of similarity are typically considered a clone pair. Based on their actual similarity, clone pairs can be categorized [5, 8] as Type-1, Type-2, Type-3, and Type-4 clones. Type-1 clones are exact copies of each other, except for possible differences in whitespace, layout and comments. Type-2 clones are syntactically identical fragments except for variations in identifiers, literals, data types, whitespace, layout and comments. Copied fragments (e.g., Type-1 and Type-2 clones) with further modifications such as additions, deletions and changes of statements are called Type-3 clones. Type-2 and Type-3 clones are also known as near-miss clones. Code fragments that perform the same computation (i.e., are semantically similar) but are implemented through different syntactic variations are called Type-4 clones. Note that all of these definitions were originally introduced for clone pairs implemented in the same programming language. In our cross-language clone research these definitions are no longer applicable as-is, and have to be refined to meet our research context. For example, the VB and C# fragments in Fig. 1 would be considered Type-1 clones in cross-language clone detection since they are essentially performing the same task implemented in different programming languages.
     -----
     To the best of our knowledge, C2D2 [10] is the only tool capable of detecting cross-language clones. It uses the NRefactory Library to generate the Unified CodeDOM graph for both C# and VB.NET. A string is generated by traversing this graph and targeted to string matching algorithm (focusing on single language clone detection, mostly Java). One of the first studies on Intermediate Language clone detection is by Baker [9]. After some preprocessing (e.g., remapping offsets), she uses three comparison techniques (e.g., Diff [22]) to find similar fragments. Davis and Godfrey [23] use the disassembler for both Java and C/C++ to detect clones in single language. Selim et al. use "Jimple" [24]. Juricic [26] uses Intermediate Language code to detect plagiarism and similarities. The approach is based on Levenshtein Distance as the similarity measure to compare disassembled C# binary, and applies some primitive preprocessing techniques which are comparable to two of our filters. There are also some formal approaches, such as by Cuomo et al. [27], that transform Java bytecode to mathematical models for clone detection.
  3. .NET targets multi-language development vs. Java's multi-platform direction. Now the problem is changed to single-language clone detection, so the problem is solved (it seems easy). Actually, it is possible but it is not easy, and we show in this paper why this is the case. .NET: Contrary to Java, which targets application development using one language on several platforms, .NET aims for multi-language development on a single platform. It provides language interoperability, with each program module being able to use code written in the other languages.
  4. As long as it finds something, it is worth a try; it is tempting to give it a try.
  5. In this research we have observed interesting challenges in this problem, and I am going to show some of them using an example. Even in this simple example, we can clearly see serious challenges to be addressed. Challenges: 1- Larger unpredictable size [sebyte] (being a lower level language causes 5 LOC to become 20 LOC). 2- High dissimilarity at bytecode level, even in cases of semantically identical source code fragments.
     -----Additional info-----
     Being a lower level representation, CIL code size tends to be much larger than traditional high-level source code. Fig. 1 (the first two columns) shows a comparison between a VB code fragment (a small VB method) and its corresponding CIL representation. In this example the method body with five lines of code has been transformed to more than twenty lines of code in CIL. This creates an additional challenge, making clone detection on binary rather different from source code. Nevertheless, this common representation of code fragments written in different programming languages provides the ability to use CIL for clone detection across .NET languages. However, a key challenge is the fact that it is possible to have some dissimilarity at CIL level, even in cases of semantically identical source code fragments (written in different .NET languages). The first four columns of Fig. 1 (the Raw Data section) provide an example of such dissimilarities. Both the VB and C# methods implement the same program following a similar coding pattern and structure as much as possible. However, when we compare the CIL pairs, there are three key sections clearly distinguishable: (1) identical CIL content, which is marked by the first dashed area, (2) the first point of dissimilarity, which is flagged by the italic font style, and (3) the rest of the content, marked by the second dashed box, that covers CIL content with considerable dissimilarity. In general, this example highlights the key challenge in binary clone detection: the possibility of facing dissimilarity by exploiting .NET Intermediate Language even for semantically (and almost syntactically) identical fragments in a cross-language context.
  8. Filter 1: Removal of the instruction address (IL_xxxx:) at the beginning of each CIL instruction, eliminating dissimilarities due to application/environment specific variations.
     Filter 2: Removal of the instruction address (IL_xxxx) for branching statements. As part of this filtering step we cover all 33 branching statements (e.g. beq, beq.s, bge).
     Filter 3: Removal of integer values that represent the argument number in CIL, e.g. ldarg 3 is interpreted in CIL as load argument number 3 onto the stack. Instructions included in this filter are: starg, starg.s, ldarg, ldarg.s, ldarga, and ldarga.s.
     Filter 4: This filter eliminates constants in the CIL code, e.g. "ldc.i4 num", which pushes num of type int32 onto the stack as int32. Instructions covered by this filter are ldc.i4, ldc.i8, ldc.r4, ldc.r8, and ldc.i4.s.
     Filter 5: This filter removes all print literals in the CIL code, which are identified through ldstr statements.
     Filter 6: This filter removes all variable indexes, as in "stloc index", which corresponds to popping a value from the stack into a local variable. Among the instructions handled by this filter are: ldloc, ldloc.s, ldloca.s, stloc and stloc.s.
     Filter 7: This filter removes some additional data types and constant integers, such as i4 from "ldc.i4.1". The complete command pushes 1 as an int32 onto the stack.
     Filter 8: This is not actually a new filter; it combines all seven filtering techniques mentioned above, including the preprocessing tasks, in one single filter.
  9. Before filtering (50% similarity); after filtering, almost similar. We address this challenge by creating a set of cleaning and filtering steps for CIL to improve the performance of Type-1, Type-2, Type-3 and Type-4 clone detection in the CIL code. The filters are designed to improve the detection rate (i.e., recall) since the CIL data contains a significant amount of noise (e.g., reference numbers to string tables, which are compilation context dependent). Due to such noise in the CIL files, two semantically identical source code fragments might no longer be considered as highly similar at the CIL level (e.g., content-similar VB and C# methods might have less than 50% similarity at the CIL level, see Fig. 1).
  10. Filters increase recall but might decrease precision drastically. A major threat to any filter-based approach is the loss of precision by filtering out essential data. As a result, excessive or improper loss of data (due to filtering) can lead to a situation where non-answers and actual answers become similar to the decision making algorithm, which eventually leads to an increase in the false positive ratio.
  11. Filter Contribution measures the effectiveness of each filter, that is, how much it increases the content similarity after filtering compared to before. Dataset: iText.NET (J#) -> 25 feature code samples (C#, VB.NET, J#) -> 75 code fragments, from which we mutually created three true positive clone-pair sets (VB-C#, VB-J#, and C#-J#). The second dataset (a.k.a. the Noncloned Fragments Dataset) contains 25 non-clone classes and, as well, 75 false positive clone-pair candidates created in the same manner as the clone classes.
     ---Additional Info:
     To answer this question, we defined a metric called Filter Contribution that measures the effectiveness of each filter. The underlying idea is to measure the similarity degree of candidate clone pairs before and after applying different filters. The measure will indicate how much a particular filter increases the similarity value between two fragments. Note that in the ideal case, we expect that a filter would increase the similarity values of true positive cases significantly more than the ones for false positive cases. Otherwise, a particular filter would not be useful to discriminate (with high confidence) against false positives. The Filter Contribution (FltrCntrb) function is defined in Eq. 2 and is based on LCS-based similarity; its arguments are the participant fragments in the clone pair under investigation and the filter function, with x being the filter number.
  12. It has no negative effect. In most cases, the filters increased the similarity by up to ~0.2 (max) for non-cloned pairs while improving the similarity of cloned pairs by at least ~0.3. F8: for non-cloned pairs less than 0.5, while for the majority of cloned pairs the similarity increases between 0.5 and 0.8. This result supports our research hypothesis that filtering increases the similarity values for true positive cases (the cloned dataset) with a higher ratio than the false positive cases (the non-cloned dataset).
  13. Not only does it have no negative effect, but it also contributes to discriminating between them. To support our claim, we conducted another case study on the same dataset to determine if our filters can be used to identify an appropriate similarity threshold. Fig. 3 summarizes the findings, showing that before applying our filters, there was no clear distinction between similarity values of actual clone pairs (true positives) and false positives. Therefore it is impossible to determine an adequate threshold that allows separating actual clones from false positives. In contrast, Fig. 3 shows that filters address this problem by increasing the distance between the two groups (tagged on the right side of Fig. 3). For example, using our filters, a threshold from 0.4 to 0.55 can separate true positives from false positives with high confidence.
  14. Chernoff faces, invented by Herman Chernoff, display multivariate data in the shape of a human face. The individual parts, such as eyes, ears, mouth and nose, represent values of the variables by their shape, size, placement and orientation. The idea behind using faces is that humans easily recognize faces and notice small changes without difficulty. Glyphs: we produced seven face features for each pair by calculating Filter Contribution on all seven filters separately. That is, each pair can be modeled using a vector in a multi-dimensional space (in our case, seven dimensions).
     ----
     Filter 1, 2, and 5, since they are mapped to: (1) the face size, (2) the distance between forehead and jaw, and (3) the distance between the eyes, respectively. Therefore, it is also possible to intuitively observe that Filter 1, 2, and 5 (including Filter 7, observed in Fig. 5) play the major role in the characterization of true positives.
  15. The participant source code language affects the similarity at the IL level. A new interesting discovery: "Is filtering neutral to the participating programming languages of clone pairs (in the cross-language clone detection context)?" That is, most of the faces are not round shaped compared to the two other groups.
  16. Using three edit distance methods (LCS, LEV, SimHash) to avoid comparison function dependency in further case studies.
  17. The noticeable difference in project metrics (e.g., LOC) can be attributed to (1) the dissimilarities in the programming languages, and (2) re-engineering and refactoring tasks. PDF library called iText and iText.NET. While their project names are similar, both projects are completely independent from each other. We created our third dataset from the iText (C# branch) and iText.NET (J#) source code.
  18. It is possible to detect numerous candidate clone pairs even for the cross-language case, regardless of the underlying algorithm.
     -------------Additional Info
     (2) No candidate clone pair is detected for cross-language using 1.0 as the Similarity Factor (i.e., the decision making threshold), which would only report clone pairs with completely identical content. Therefore, even using filtering on highly similar cross-language clone pairs (e.g., Fig. 1), some dissimilarities will have to be handled by the clone detection approach. However, this is not the case for single language clone detection (shown in Fig. 7). (3) For all datasets, we can observe a major decrease in the number of candidates when the threshold value is set to a range between 0.6 and 0.8 (marked by ovals).
  19. Quality evaluation is inherently challenging in our research since there is no clear agreement on what constitutes true positives (TP) and the various clone type definitions. Therefore, we applied the following approach in our qualitative evaluation: (1) since it is possible to easily locate false positives with confidence among candidate clone pairs, we first tag all false positives; (2) we assume the rest to be true positives. However, in order to provide a more in-depth quality assessment, we also analyze the quality of the reported true positives.
     --------
     Fig. 9 reviews the findings of our quality evaluation from manually assessing ~2K candidate clone pairs (answering RQ4). In general, using the Normal threshold, all candidate clone pairs that were reported are true positives (100% TP). The quality decreases with less restrictive thresholds. For example, using SimHash and the Extreme threshold, the reported TP reduces to ~40%. The optimum, considering the trade-off between precision and recall, was achieved using Levenshtein Distance-based comparison with the High threshold (80% TP). Nevertheless, this result is not 100% precise.
  20. Why we need such a topic in general, from an industry point of view; this constitutes our motivation. Applications being developed in different languages (customer/contract: iText, iText.NET; legal issues; community: Hibernate > NHibernate).
  21. Using three edit-distance methods (LCS, LEV, SimHash) to avoid comparison-function dependency in further case studies.
  22. Comparison -> Judge -> threshold -> yes/no
  23. A major threat to any filter-based approach is the loss of precision by filtering out essential data. As a result, excessive or improper loss of data (due to filtering) can lead to a situation where non-answers and actual answers become similar to the decision-making algorithm, which eventually leads to an increase in the false positive ratio.
  24. It is a detailed study of the challenges, possible solutions, evaluation, and final results. Not only a clone detection approach, but also an important study that gives insight for future research.
  25. To the best of our knowledge, C2D2 [10] is the only tool capable of detecting cross-language clones. It uses the NRefactory Library to generate the Unified CodeDOM graph for both C# and VB.NET. A string is generated by traversing this graph and passed to a string matching algorithm (focusing on single-language clone detection, mostly Java). One of the first studies on Intermediate Language clone detection is by Baker [9]. After some preprocessing (e.g., remapping offsets), she uses three comparison techniques (e.g., Diff [22]) to find similar fragments. Davis and Godfrey [23] use the disassembler for both Java and C/C++ to detect clones in a single language. Selim et al. use “Jimple” [24]. Juricic [26] uses Intermediate Language code to detect plagiarism and similarities. The approach is based on Levenshtein Distance as the similarity measure to compare disassembled C# binaries, and applies some primitive preprocessing techniques which are comparable to two of our filters. There are also some formal approaches, such as the one by Cuomo et al. [27], that transform Java bytecode to mathematical models for clone detection.
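
The pipeline in notes 21 and 22 (compare two fragments, then judge the score against a threshold) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation. It assumes that a fragment is a list of IL opcode strings and that the Similarity Factor is the Levenshtein distance normalized by the longer fragment's length. The notes define neither, so both are assumptions. The function names, the sample opcode sequences, and the 0.8 default are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_candidate_clone(a, b, threshold=0.8):
    """Judge step: report a candidate clone-pair when the score reaches the threshold."""
    return similarity(a, b) >= threshold


# Hypothetical IL fragments: the same method compiled from two languages,
# where one compiler emits an extra branch instruction.
frag_cs = ["ldarg.0", "ldarg.1", "add", "stloc.0", "ldloc.0", "ret"]
frag_vb = ["ldarg.0", "ldarg.1", "add", "stloc.0", "br.s", "ldloc.0", "ret"]

print(round(similarity(frag_cs, frag_vb), 3))        # one insertion over 7 opcodes
print(is_candidate_clone(frag_cs, frag_vb, 0.8))     # accepted at a relaxed threshold
print(is_candidate_clone(frag_cs, frag_vb, 1.0))     # rejected when identity is required
```

With these made-up fragments, a single extra opcode is enough to push the score below 1.0. This matches the observation in note 18 that a threshold of 1.0 reports no cross-language candidates.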