This is not the original version given in the WCRE 2012 conference (no animation etc.)




               Detecting Clones across
        Microsoft .NET Programming Languages
                                                                                                                   Farouq Al-omari
                                                                                                                    Iman Keivanloo
                                                                                                                   Chanchal K. Roy
                                                                                                                     Juergen Rilling
                                                                                                                       Contact:
                                                                                                                       keivanloo@ieee.org




                                                              Working Conference on Reverse Engineering, Canada, Kingston 18 October 2012 – I. Keivanloo
[Figure: Mergesort and Merge code examples; clones (Mergesort)]
Clone Detection across Languages
General Solution
 •   C#
 •   VB.NET
 •   J#
 •   F#
 •   COBOL (.NET)
 •   Java
Intermediate Language (IL) (low level): the solution is to use this (instead of dealing with several languages)
Clone Detection across Languages using IL
Is there any chance this works?
                          Input Data Type
             CIL                                Source Code
Dataset      # Clone Class   # Clone Fragment   # Clone Class   # Clone Fragment
ASXGUI       9               393                69              261
Mono         37              4373               369             1523
• Up to 3 times more cloned fragments detected using IL
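The "up to 3 times" figure can be checked directly from the fragment counts in the table. A minimal Python sketch, using only the four fragment counts reported on the slide (the dictionary layout and variable names are ours, for illustration only):

```python
# Clone fragment counts copied from the table: (detected on CIL, detected on source code).
fragments = {
    "ASXGUI": (393, 261),
    "Mono": (4373, 1523),
}

# Ratio of fragments detected on CIL to fragments detected on source code.
ratios = {name: cil / src for name, (cil, src) in fragments.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} times as many cloned fragments on CIL")
```

Mono gives a ratio of roughly 2.9 and ASXGUI roughly 1.5, which is where "up to 3 times" comes from.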
Clone Detection across Languages using IL
Observed Challenges (using an example)


VB.NET                                         C#
Sub Main()                                     static void Main(string[] args){
   Dim x As Integer = 10                          int x = 10;
   If x < 0 Then                                  if (x < 0)
      x += 1                                         x++;
   Else                                           else
      Console.WriteLine("Positive number")           Console.WriteLine("Positive number");
   End If                                      }
End Sub
Clone Detection across Languages using IL
Observed Challenges (using an example)
[Figure: the VB.NET and C# snippets from the previous slide with their IL: VB, IL from VB, C#, IL from C#]
Clone Detection across Languages using IL
  Observed Challenges
1- Larger unpredictable size at IL level
2- Higher dissimilarity at IL level
[Keivanloo IWSC’12]
Observed Challenges #2: High Dissimilarity
Noise
• Major noise types:
   •   Line numbers
   •   Pointers to line number
   •   Push, Pop …
   •   Detailed Data Type data
• Sample IL
IL_000c: ldloc.0
IL_000d: ldc.i4.1
IL_000e: add.ovf
IL_000f: stloc.0
IL_0010: br.s     IL_0024
IL_0012: nop
IL_0013: ldstr "Positive number"
IL_0018: call    void [mscorlib]System.Console::WriteLine(string)
Clone Detection across Languages using IL
 The Core Solution
 • The Challenge: Noise
 • Solution: Data cleansing (filtering noise)
 • Why? (Answer: to increase recall)
[Figure: Source Code → IL + noise → IL - noise]
Filters for noise reduction
Our Filter Set   Before Filtering            After Filtering   Example Description
Filter 1         IL_0003: stloc.0            stloc.0           IL_0003 (instruction address)
Filter 2         brtrue.s IL_0015            brtrue.s          IL_0015 is the address of the branch destination
Filter 3         ldarg 3                     ldarg             The values 3 and 1 represent the argument number
                 starg 1                     starg
Filter 4         ldc.i4.s 10                 ldc.i4.s          10 is the number (pushed to the stack)
Filter 5         ldstr "Positive number"     ldstr             "Positive number" is the printed string constant
Filter 6         stloc 7                     stloc             7 represents the variable index
Filter 7         ldc.i4.s 10                 ldc               i4 represents the int32 data type in CIL and s stands for short
Filter 8         IL_0011: add                add               Note that Filter 8 is just a nickname. Refer to the
                 IL_0012: stloc.0            stloc             Filter 8 description section for more details.
                 IL_0013: br.s     IL_0020   br
                 IL_001a: call     void      call
                 [mscorlib]System.Console::WriteLine (string)
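The filter table can be read as a small text-normalization pipeline. The following is a minimal Python sketch of that idea, not the authors' implementation: the function names and the grouping of filters into three steps are our assumptions, and the combined result only roughly mirrors the "After Filtering" column of the last row. The exact filter definitions are in the paper and may differ.

```python
import re

# Leading instruction address such as "IL_0003:" (what Filter 1 removes).
LABEL = re.compile(r"^\s*IL_[0-9a-fA-F]{4}:\s*")

def strip_label(line):
    """Drop the leading instruction address, if any."""
    return LABEL.sub("", line).strip()

def strip_operands(instr):
    """Keep only the opcode: drops branch targets, argument and variable
    indexes, numeric constants, and string constants (assumed to cover
    Filters 2 to 6)."""
    return instr.split(None, 1)[0] if instr else instr

def strip_suffixes(opcode):
    """Keep the base mnemonic before the first dot, e.g. ldc.i4.s -> ldc
    (assumed to cover Filter 7)."""
    return opcode.split(".", 1)[0]

def filter_il(lines):
    """Apply all three steps to each non-empty IL line."""
    cleaned = []
    for line in lines:
        instr = strip_label(line)
        if instr:
            cleaned.append(strip_suffixes(strip_operands(instr)))
    return cleaned

# Sample IL taken from the "Noise" slide.
sample = [
    "IL_000c: ldloc.0",
    "IL_000d: ldc.i4.1",
    "IL_000e: add.ovf",
    "IL_000f: stloc.0",
    "IL_0010: br.s     IL_0024",
    "IL_0012: nop",
    'IL_0013: ldstr "Positive number"',
    "IL_0018: call    void [mscorlib]System.Console::WriteLine(string)",
]
print(filter_il(sample))
```

Applying every step at once is the most aggressive setting. A real tool would probably let each filter be switched on or off, since the later slides evaluate the filters individually.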
Clone Detection across Languages using IL
     Filtering Advantage: Recall Improvement
VB.NET vs. C# (the example from the earlier slides)
Before filtering noise: ~50% similarity
After filtering noise:  ~90% similarity
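To see why removing noise raises the similarity of two IL fragments, here is a minimal Python sketch. The two IL fragments are hypothetical (written by us, not compiler output from the slides), the similarity measure is a normalized Levenshtein distance over instruction lines, and `base_opcode` is our own stand-in for the filters. The toy numbers are not meant to reproduce the ~50% and ~90% reported on the slide.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """1.0 for identical sequences, 0.0 when nothing lines up."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def base_opcode(line):
    """Drop the IL_xxxx: label, the operands, and the opcode suffixes."""
    instr = line.split(":", 1)[1].strip()
    return instr.split()[0].split(".")[0]

# Hypothetical IL for the same logic compiled from two languages.
from_vb = ["IL_0000: ldc.i4.s 10", "IL_0002: stloc.0", "IL_0003: ldloc.0",
           "IL_0004: ldc.i4.0", "IL_0005: bge.s IL_000d"]
from_cs = ["IL_0001: ldc.i4.s 10", "IL_0003: stloc.0", "IL_0004: ldloc.0",
           "IL_0005: ldc.i4.0", "IL_0006: clt"]

before = similarity(from_vb, from_cs)
after = similarity([base_opcode(l) for l in from_vb],
                   [base_opcode(l) for l in from_cs])
print(f"before filtering: {before:.0%}, after filtering: {after:.0%}")
```

In this toy input, the shifted instruction addresses alone make every raw line differ. Once labels and operands are gone, only one instruction out of five is different.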
Disadvantage of Noise Reduction
Danger!
• Data loss
• What if we remove important data during data cleansing?
• It might mislead the detection by making noncloned pairs identical
  (possible negative effect on precision)
[Figure: Filtering Color Data]
RQ: Are They (Filters) Dangerous?
Evaluation Preparation
1. Filter Contribution Formula:
   [Formula shown on the original slide]
2. Dataset preparation:
  – Controlled dataset (iText.NET J#), 25 pairs × 3 languages
    1. The Cloned Dataset (VB-C#, VB-J#, and C#-J#)
    2. The Noncloned Dataset (VB-C#, VB-J#, and C#-J#)
RQ: Are They (Filters) Dangerous?
      Filter Contribution - Study #1
      • Are they harmful? (The answer is NO: based on the following graphs, the filters do not remove
        a similar amount of data from actual clones and from noncloned code fragments.)
[Figure: two graphs, Cloned Dataset (marked value 0.3) and NonCloned Dataset (marked value 0.2)]
A strong threshold for the judge to decide
RQ: Are They (Filters) Dangerous?
Filter Contribution - Study #2
• Are they useful?
(The answer is YES: based on the given figure, our filters help to
    discriminate between actual clones and noncloned fragments,
    so it is possible to separate them with high confidence
    using the chosen threshold.)
RQ: Are They (Filters) Dangerous?
Filter Contribution - Study #3
• Does filtering make actual clone pairs and noncloned
  pairs similar? (We used Chernoff faces, a kind of glyph, to see whether the filters make noncloned
    pairs similar to cloned code. Each face represents a pair. As you can see, the faces in Group A differ
    from those in Group B in most cases.)

Final Conclusion:
 Filters help to discriminate between cloned and noncloned fragments
An Interesting Unexpected Discovery
                            Language-dependency!!!

Corresponding faces in each group are not similar, even though all of them are
extracted from a single language (IL). Look especially at the C#-J# faces: all of
them differ from the other groups. This is an interesting discovery: the original
high-level programming language affects similarity at the IL level.
Clone Detection across Languages using IL
    Our Clone Detection Framework
• Input: .NET Code
    – Source Code
    – MS .NET EXE & DLL
    – IlDasm.exe
    – CIL (plain text)
• CIL Manipulation for Clone Detection
    – Proposed Filtering Mechanism
• Clone Detection Algorithms
    – LCS-based (from NiCad)
    – SimHash-based (from SimCad)
    – Levenshtein Distance-based
• Clone Analysis
    – Clone Clusters
    – Merging
    – Source Code Mapping
• Reporting
    – Report (CIL)
    – Report (Src Code)
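As an illustration of the comparison stage, here is a minimal Python sketch of an LCS-based similarity over filtered instruction sequences. It is not NiCad's code and not the authors' implementation: the normalization by the longer sequence, the function names, and the default threshold of 0.7 are our assumptions for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)          # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # carry the best so far
        prev = curr
    return prev[-1]

def lcs_similarity(a, b):
    """Share of the longer sequence covered by the common subsequence."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest

def is_clone_pair(a, b, threshold=0.7):
    """Report a pair as a clone candidate when similarity reaches the threshold."""
    return lcs_similarity(a, b) >= threshold

# Hypothetical filtered opcode sequences for three methods.
m1 = ["ldc", "stloc", "ldloc", "ldc", "bge", "ldloc", "ldc", "add", "stloc"]
m2 = ["ldc", "stloc", "ldloc", "ldc", "clt", "ldloc", "ldc", "add", "stloc"]
m3 = ["ldstr", "call", "ret"]

print(is_clone_pair(m1, m2))  # one differing instruction out of nine
print(is_clone_pair(m1, m3))  # no instructions in common
```

The same candidate-pair loop could swap in a SimHash or Levenshtein comparison. Only the similarity function changes.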
The Selected Datasets for Performance Evaluation
     Dataset      Language   File    LOC        Method
     ASXGUI 2.5   VB.NET     47      32,594     303
     ASXGUI 3.0   C#         19      2,088      78

     Dataset      Language   File    LOC        Method
     Mono 2.10    VB.NET     375     -          -
     Mono 2.10    C#         57      -          -
     Total                   432     -          4998

     Dataset      Language   File    LOC        Method
     iText        C#         -       -          -
     iText.NET    J#         -       -          -
     Total                   2.5 K   600 K

    4th Dataset: iText.NET dataset from 1st case study
Clone Detection across Languages using IL
Our Clone Detection Framework Performance
[Figure: performance results. Pay attention to changes within 0.6 … 0.8]
Clone Detection across Languages using IL
Our Clone Detection Framework
• 2K clone pairs manually investigated
• Thresholds: 0.6 Normal, 0.7 High, 0.8 Extreme
• TP = {E and S}
Precision
   The optimum, considering the trade-off between precision and recall,
   was achieved using Levenshtein Distance-based comparison with the
   High threshold (80% TP)
Recall
   (iText.NET API) 76% using the High threshold between three
   languages (C#, J#, and VB.NET).
An Interesting Clone
Detected by Our Approach
C#
          private static string filename_nodir(string name)
           {
              int slash = -1, len = name.Length;
              for (int i = 0; i < len; i++)
              {
                 string sub = name.Substring(i, 1);
                 if (sub == "" || sub == "/")
                    slash = i;
              }
              slash++;
              return name.Substring(slash, len - slash);
           }
VB.NET
         Function Filename_Nodir() As String
           Dim intFileName As Integer, intSlash As Integer, strFilename As String
           strFileName = editvid.video
           For intFilename = 1 To len(strFileName)
              If mid(strfilename, intfilename, 1) = "" Or mid(strfilename, intfilename, 1) = "/" Then
                 intslash = intFilename
              End If
           Next
           Return mid(strFileName, intSlash + 1, len(strFilename) - intSlash)
         End Function
*The matching algorithm was limited to the content available
 within the boxes (it was NOT aware of same method names)
Summary
• The first comprehensive research focusing on
    (1) .NET clone detection,
    (2) across programming languages,
    and (3) using Intermediate Language
• Identified challenges in cross-language clone detection + IL




Related Publication
Iman Keivanloo, Chanchal K. Roy, Juergen Rilling,
“Java Bytecode Clone Detection via Relaxation on Code
  Fingerprint and Semantic Web Reasoning,”
6th International Workshop on Software Clones (IWSC), 2012.
Contact: keivanloo@ieee.org

ANY QUESTIONS?




 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 

Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

  • 1. Detecting Clones across Microsoft .NET Programming Languages. Farouq Al-omari, Iman Keivanloo, Chanchal K. Roy, Juergen Rilling. Contact: keivanloo@ieee.org. Working Conference on Reverse Engineering, Kingston, Canada, 18 October 2012, I. Keivanloo. This is not the original version given at the WCRE 2012 conference (no animation etc.).
  • 2. [Figure: fragments labeled Mergesort and Merge, with the Mergesort fragments grouped as Clones (Mergesort)]
  • 3. Clone Detection across Languages: General Solution
      Languages: C#, VB.NET, J#, F#, COBOL (.NET), Java
      Intermediate Language (IL) (low level): the solution is to use this (instead of dealing with several languages)
  • 4. Clone Detection across Languages using IL: Is there any chance to work?
                 Input: CIL                          Input: Source Code
      Dataset    # Clone Class   # Clone Fragment    # Clone Class   # Clone Fragment
      ASXGUI     9               393                 69              261
      Mono       37              4373                369             1523
      • Up to 3 times more cloned fragments detected using IL
  • 5. Clone Detection across Languages using IL: Observed Challenges (using an example)
      VB.NET:
        Sub Main()
            Dim x As Integer
            x = 10
            If x < 0 Then
                x += 1
            Else
                Console.WriteLine("Positive number")
            End If
        End Sub
      C#:
        static void Main(string[] args){
            int x=10;
            if(x<0)
                x++;
            else
                Console.WriteLine("Positive number");
        }
  • 6. Clone Detection across Languages using IL: Observed Challenges (using an example). Four columns: VB, IL from VB, C#, IL from C#, built on the VB.NET and C# Main example from slide 5.
  • 7. Clone Detection across Languages using IL: Observed Challenges (same VB.NET and C# Main example as slide 5)
      1- Larger unpredictable size at IL level [Keivanloo IWSC’12]
      2- Higher dissimilarity at IL level
  • 8. Observed Challenges #2: High Dissimilarity (Noise)
      Major noise types:
      • Line numbers
      • Pointers to line number
      • Push, Pop …
      • Detailed Data Type data
      Sample IL:
        IL_000c: ldloc.0
        IL_000d: ldc.i4.1
        IL_000e: add.ovf
        IL_000f: stloc.0
        IL_0010: br.s IL_0024
        IL_0012: nop
        IL_0013: ldstr "Positive number"
        IL_0018: call void [mscorlib]System.Console::WriteLine(string)
  • 9. Clone Detection across Languages using IL: The Core Solution • The Challenge: Noise • Solution: Data cleansing (filtering noise) • Why? (Answer: to increase recall) • Source Code -> IL + noise -> IL - noise
  • 10. Our Filter Set: Filters for noise reduction (example before filtering; after filtering; description)
      Filter 1: before "IL_0003: stloc.0"; after "stloc.0"; IL_0003 is the instruction address
      Filter 2: before "brtrue.s IL_0015"; after "brtrue.s"; IL_0015 is the address of the branch destination
      Filter 3: before "ldarg 3" and "starg 1"; after "ldarg" and "starg"; the values 3 and 1 represent the argument number
      Filter 4: before "ldc.i4.s 10"; after "ldc.i4.s"; 10 is the number (pushed to the stack)
      Filter 5: before ldstr "Positive number"; after "ldstr"; "Positive number" is the printed string constant
      Filter 6: before "stloc 7"; after "stloc"; 7 represents the variable index
      Filter 7: before "ldc.i4.s 10"; after "ldc"; i4 represents the int32 data type in CIL and s stands for short
      Filter 8: before "IL_0011: add", "IL_0012: stloc.0", "IL_0013: br.s IL_0020", "IL_001a: call void [mscorlib]System.Console::WriteLine (string)"; after "add", "stloc", "br", "call"; note that Filter 8 is just a nickname. Refer to the Filter 8 description section for more details.
  • 11. Clone Detection across Languages using IL: Filtering Advantage: Recall Improvement. VB.NET vs. C# (the Sub Main() example from slide 5): before filtering noise, ~50% similarity; after, ~90% similarity.
  • 12. Disadvantage of Noise Reduction: Danger! • Data loss • What if we remove important data during data cleansing? • Might mislead the detection by making non-cloned pairs identical -> possible negative effect on precision [Figure: Filtering Color Data]
  • 13. RQ: Are They (Filters) Dangerous? Evaluation Preparation
      1. Filter Contribution formula [formula not preserved in this transcript]
      2. Dataset preparation: controlled dataset (iText.NET J#), 25 pairs * 3 languages
         1. The Cloned Dataset (VB-C#, VB-J#, and C#-J#)
         2. The Noncloned Dataset (VB-C#, VB-J#, and C#-J#)
  • 14. RQ: Are They (Filters) Dangerous? Filter Contribution - Study #1 • Are they harmful? (The answer is NO: based on the following graphs, filters do not remove a similar amount of data from actual clones vs. noncloned code fragments.) [Figure: Cloned Dataset vs. NonCloned Dataset, with values 0.3 and 0.2 marked: a strong threshold for the Judge to decide]
  • 15. RQ: Are They (Filters) Dangerous? Filter Contribution - Study #2 • Are they useful? (The answer is YES: based on the given figure, our filters help to discriminate between actual clones and noncloned fragments, so it is possible to separate them with high confidence using the chosen threshold.)
  • 16. RQ: Are They (Filters) Dangerous? Filter Contribution - Study #3 • Does filtering make actual clone pairs and noncloned pairs similar? (We used Chernoff faces, i.e. glyphs, to see if filters make noncloned pairs similar to cloned code. Each face represents a pair. As you can see, faces in Group A are different from Group B in most cases.) Final conclusion: filters contribute to discriminating between cloned and noncloned fragments.
  • 17. An Interesting Unexpected Discovery: Language dependency! Corresponding faces in each group are not similar, even though all of them are extracted from a single language (IL). Look especially at the C#-J# faces: all of them differ from the other groups. This is an interesting discovery: the original high-level programming languages affect similarity at the IL level.
  • 18. Clone Detection across Languages using IL: Our Clone Detection Framework
      Input (.NET Code): Source Code; MS .NET EXE & DLL
      CIL Manipulation for Clone Detection: IlDasm.exe; CIL (plain text); Proposed Filtering Mechanism
      Clone Detection Algorithms: LCS-based (from NiCad); SimHash-based (from SimCad); Levenshtein Distance-based
      Clone Analysis: Clone Clusters; Merging; Source Code Mapping
      Reporting: Report (CIL); Report (Src Code)
  • 19. The Selected Datasets for Performance Evaluation
      Dataset       Language   File    LOC      Method
      ASXGUI 2.5    VB.NET     47      32,594   303
      ASXGUI 3.0    C#         19      2088     78
      Mono 2.10     VB.NET     375     -        -
      Mono 2.10     C#         57      -        -
      Total                    432     -        4998
      iText         C#         -       -        -
      iText.NET     J#         -       -        -
      Total                    2.5 K   600 K
      4th Dataset: iText.NET dataset from 1st case study
  • 20. Clone Detection across Languages using IL: Our Clone Detection Framework Performance. Pay attention to changes within 0.6 … 0.8.
  • 21. Clone Detection across Languages using IL: Our Clone Detection Framework • 2K clone pairs manually investigated • Precision: the optimum, considering the trade-off between precision and recall, was achieved using Levenshtein Distance-based comparison with the High threshold (80% TP) • Recall (iText.NET API): 76% using the High threshold between three languages (C#, J#, and VB.NET) • Chart labels: 0.6, 0.7, 0.8; Normal, High, Extreme; TP = {E and S}
  • 22. An Interesting Clone Detected by Our Approach
      C#:
        private static string filename_nodir(string name) {
            int slash = -1, len = name.Length;
            for (int i = 0; i < len; i++) {
                string sub = name.Substring(i, 1);
                if (sub == "" || sub == "/")
                    slash = i;
            }
            slash++;
            return name.Substring(slash, len - slash);
        }
      VB.NET:
        Function Filename_Nodir() As String
            Dim intFileName As Integer, intSlash As Integer, strFilename As String
            strFileName = editvid.video
            For intFilename = 1 To len(strFileName)
                If mid(strfilename, intfilename, 1) = "" Or mid(strfilename, intfilename, 1) = "/" Then
                    intslash = intFilename
                End If
            Next
            Return mid(strFileName, intSlash + 1, len(strFilename) - intSlash)
        End Function
      *The matching algorithm was limited to the content available within the boxes (it was NOT aware of same method names)
  • 23. Summary • The first comprehensive research focusing on (1) .NET clone detection, (2) across programming languages, and (3) using Intermediate Language • Identified challenges in cross-language clone detection + IL • [Figure: the clone detection framework from slide 18 (Input: .NET Code; CIL Manipulation for Clone Detection; Clone Detection Algorithms; Clone Analysis; Reporting)]
  • 24. Related Publication Iman Keivanloo, Chanchal K. Roy, Juergen Rilling, “Java Bytecode Clone Detection via Relaxation on Code Fingerprint and Semantic Web Reasoning,” 6th International Workshop on Software Clones (IWSC), 2012.
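
The noise filters on slides 8 to 10 can be illustrated with a small sketch. This is not the authors' implementation: the function names (strip_label, strip_branch_target, base_opcode, normalize) are hypothetical, and the combined filter is approximated here by keeping only the opcode text before the first dot, which reproduces the "add", "stloc", "br", "call" results shown for Filter 8 on slide 10. The input is assumed to be plain-text CIL as printed on slide 8.

```python
import re

# Leading instruction address such as "IL_000c:" (Filter 1 on slide 10).
LABEL = re.compile(r"^\s*IL_[0-9A-Fa-f]{4}:\s*")
# Trailing branch destination such as "IL_0024" (Filter 2 on slide 10).
BRANCH_TARGET = re.compile(r"\s+IL_[0-9A-Fa-f]{4}\s*$")


def strip_label(line: str) -> str:
    """Drop the leading instruction address, keep the rest of the line."""
    return LABEL.sub("", line).strip()


def strip_branch_target(line: str) -> str:
    """Drop a trailing branch destination address, if there is one."""
    return BRANCH_TARGET.sub("", line)


def base_opcode(line: str) -> str:
    """Assumed approximation of the combined filter: keep only the
    opcode text before the first dot (e.g. "ldc.i4.1" becomes "ldc")."""
    body = strip_label(line)
    if not body:
        return ""
    mnemonic = body.split()[0]
    return mnemonic.split(".")[0]


def normalize(cil_text: str) -> list[str]:
    """Reduce a CIL listing to a list of base opcodes, skipping blank lines."""
    opcodes = (base_opcode(line) for line in cil_text.splitlines())
    return [op for op in opcodes if op]


# The sample IL from slide 8.
SAMPLE = """\
IL_000c: ldloc.0
IL_000d: ldc.i4.1
IL_000e: add.ovf
IL_000f: stloc.0
IL_0010: br.s IL_0024
IL_0012: nop
IL_0013: ldstr "Positive number"
IL_0018: call void [mscorlib]System.Console::WriteLine(string)
"""

if __name__ == "__main__":
    print(normalize(SAMPLE))
```

Keeping each filter as a separate function mirrors the slide's idea that filters can be applied and measured one at a time; a real filter set would need per-instruction handling (for example the ldarg, ldc and ldloc families) rather than this single shortcut.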

Editor's Notes

  1. In this paper we answered some of the very basic research questions related to this topic. A general clone detection framework.
  2. This talk is about source code clones, and I am going to use sheep to present clones. Suppose that there is a sheep which is doing Mergesort, and the other sheep are also doing Mergesort. I can detect them as clone groups since they are identical. So far there is no problem; however, it becomes challenging when we want to find sheep from other planets which are doing Mergesort as well.
     Two code fragments that share some degree of similarity are typically considered a clone pair. Based on their actual similarity, clone pairs can be categorized [5, 8] as Type-1, Type-2, Type-3, and Type-4 clones. Type-1 clones are exact copies of each other, except for possible differences in whitespace, layout and comments. Type-2 clones are syntactically identical fragments except for variations in identifiers, literals, data types, whitespace, layout and comments. Copied fragments (e.g., Type-1 and Type-2 clones) with further modifications such as additions, deletions and changes of statements are called Type-3 clones. Type-2 and Type-3 clones are also known as near-miss clones. Code fragments that perform the same computation (e.g., semantically similar) but are implemented through different syntactic variations are called Type-4 clones. Note that all of these definitions were originally introduced for clone pairs implemented in the same programming language. In our cross-language clone research these definitions are no longer applicable as-is, and have to be refined to meet our research context. For example, the VB and C# fragments in Fig. 1 would be considered Type-1 clones in cross-language clone detection since they are essentially performing the same task implemented in different programming languages.
     To the best of our knowledge, C2D2 [10] is the only tool capable of detecting cross-language clones. It uses the NRefactory Library to generate the Unified CodeDOM graph for both C# and VB.NET. A string is generated by traversing this graph and targeted to a string matching algorithm (focusing on single-language clone detection, mostly Java). One of the first studies on Intermediate Language clone detection is by Baker [9]. After some preprocessing (e.g., remapping offsets), she uses three comparison techniques (e.g., Diff [22]) to find similar fragments. Davis and Godfrey [23] use the disassembler for both Java and C/C++ to detect clones in a single language. Selim et al. use “Jimple” [24]. Juricic [26] uses Intermediate Language code to detect plagiarism and similarities. The approach is based on Levenshtein Distance as the similarity measure to compare disassembled C# binaries, and applies some primitive preprocessing techniques which are comparable to two of our filters. There are also some formal approaches, such as by Cuomo et al. [27], that transform Java bytecode to mathematical models for clone detection.
  3. .NET targets multi-language development vs. Java's multi-platform direction. Now the problem is changed to single-language clone detection, so the problem seems solved (it seems easy). Actually, it is possible but it is not easy, and this paper shows why this is the case. .NET: Contrary to Java, which targets application development using one language on several platforms, .NET aims for multi-language development on a single platform. It provides language interoperability, with each program module being able to use code written in the other languages.
  4. As long as it finds something, it is worth a try; it is tempting to give it a try.
  5. In this research we have observed interesting challenges in this problem; I am going to show some of them using an example. Even in this simple example, we can clearly see serious challenges to be addressed. Challenges: (1) larger unpredictable size [sebyte] (being a lower-level language, 5 LOC become 20 LOC); (2) high dissimilarity at the bytecode level, even in cases of semantically identical source code fragments.
     Additional info: Being a lower-level representation, CIL code size tends to be much larger than traditional high-level source code. Fig. 1 (the first two columns) shows a comparison between a VB code fragment (a small VB method) and its corresponding CIL representation. In this example the method body with five lines of code has been transformed to more than twenty lines of code in CIL. This creates an additional challenge, making clone detection on binaries rather different from source code. Nevertheless, this common representation of code fragments written in different programming languages provides the ability to use CIL for clone detection across .NET languages. However, a key challenge is the fact that it is possible to have some dissimilarity at the CIL level, even in cases of semantically identical source code fragments (written in different .NET languages). The first four columns of Fig. 1 (the Raw Data section) provide an example of such dissimilarities. Both the VB and C# methods implement the same program following similar coding patterns and structure as much as possible. However, when we compare the CIL pairs, there are three key sections clearly distinguishable: (1) identical CIL content, which is marked by the first dashed area, (2) the first point of dissimilarity, which is flagged by the italic font style, and (3) the rest of the content, marked by the second dashed box, which covers CIL content with considerable dissimilarity. In general, this example highlights the key challenge in binary clone detection: the possibility of facing dissimilarity when exploiting the .NET Intermediate Language, even for semantically (and almost syntactically) identical fragments in a cross-language context.
  8. Filter 1: Removal of the instruction address (IL_xxxx:) at the beginning of each CIL instruction, eliminating dissimilarities due to application/environment-specific variations.
     Filter 2: Removal of the instruction address (IL_xxxx) for branching statements. As part of this filtering step we cover all 33 branching statements (e.g. beq, beq.s, bge).
     Filter 3: Removal of integer values that represent the argument number in CIL, e.g. ldarg 3 is interpreted in CIL as "load argument number 3 onto the stack". Instructions included in this filter are: starg, starg.s, ldarg, ldarg.s, ldarga, and ldarga.s.
     Filter 4: This filter eliminates constants in the CIL code, e.g. “ldc.i4 num”, which corresponds to pushing num of type int32 onto the stack as int32. Instructions covered by this filter are ldc.i4, ldc.i8, ldc.r4, ldc.r8, and ldc.i4.s.
     Filter 5: This filter removes all print literals in the CIL code, which are identified through ldstr statements.
     Filter 6: This filter removes all variable indexes, as in stloc index, which corresponds to popping a value from the stack into a local variable. Among the instructions covered by this filter are: ldloc, ldloc.s, ldloca.s, stloc and stloc.s.
     Filter 7: This filter removes some additional data types and constant integers, such as i4 from “ldc.i4.1”. The complete command pushes 1 as an int32 onto the stack.
     Filter 8: This is not actually a new filter; it combines all seven filtering techniques mentioned above, including the preprocessing tasks, in one single filter.
  9. Before filtering (~50% similarity); after filtering, almost similar. We address this challenge by creating a set of cleaning and filtering steps for CIL to improve the performance of Type-1, Type-2, Type-3 and Type-4 clone detection in the CIL code. The filters are designed to improve the detection rate (i.e., recall) since the CIL data contains a significant amount of noise (e.g., reference numbers to string tables, which are compilation-context dependent). Due to such noise in the CIL files, two semantically identical source code fragments might no longer be considered highly similar at the CIL level (e.g., content-similar VB and C# methods might have less than 50% similarity at the CIL level, see Fig. 1).
  10. Filters increase recall but might decrease precision drastically. A major threat to any filter-based approach is the loss of precision by filtering out essential data. As a result, excessive or improper loss of data (due to filtering) can lead to a situation where non-answers and actual answers become similar to the decision-making algorithm, which eventually leads to an increase in the false positive ratio.
  11. Filter Contribution measures the effectiveness of each filter, that is, how much it increases the content similarity after filtering compared to before. iText.NET (J#): 25 feature code samples (C#, VB.NET, J#) -> 75 code fragments, which mutually created three true positive clone-pair sets (VB-C#, VB-J#, and C#-J#). The second dataset (a.k.a. the Noncloned Fragments Dataset) contains 25 non-clone classes, as well as 75 false positive clone-pair candidates created in the same manner as the clone classes.
     Additional info: To answer this question, we defined a metric called Filter Contribution that measures the effectiveness of each filter. The underlying idea is to measure the similarity degree of candidate clone pairs before and after applying different filters. The measure indicates how much a particular filter increases the similarity value between two fragments. Note that in the ideal case, we expect that a filter would increase the similarity values of true positive cases significantly more than those of false positive cases. Otherwise, a particular filter would not be useful to discriminate (with high confidence) against false positives. The Filter Contribution (FltrCntrb) function is defined in Eq. 2, which is based on LCS-based similarity; it takes the participant fragments of the clone pair under investigation and the filter function, with x being the filter number.
  12. It has no negative effect. In most cases, the filters increased the similarity by up to ~0.2 (max) for non-cloned pairs while improving the similarity of cloned pairs by at least ~0.3. F8: non-cloned pairs stay below 0.5, while for the majority of cloned pairs the similarity increases between 0.5 and 0.8. This result supports our research hypothesis that filtering increases the similarity values for true positive cases (the cloned dataset) at a higher ratio than for the false positive cases (the non-cloned dataset).
  13. Not only does it have no negative effect, but it also helps to discriminate between them. To support our claim, we conducted another case study on the same dataset to determine if our filters can be used to identify an appropriate similarity threshold. Fig. 3 summarizes the findings, showing that before applying our filters, there was no clear distinction between the similarity values of actual clone pairs (true positives) and false positives. Therefore it is impossible to determine an adequate threshold that allows separating actual clones from false positives. In contrast, Fig. 3 shows that the filters address this problem by increasing the distance between the two groups (tagged on the right side of Fig. 3). For example, using our filters, a threshold from 0.4 to 0.55 can separate true positives from false positives with high confidence.
  14. Chernoff faces, invented by Herman Chernoff, display multivariate data in the shape of a human face. The individual parts, such as eyes, ears, mouth and nose, represent values of the variables by their shape, size, placement and orientation. The idea behind using faces is that humans easily recognize faces and notice small changes without difficulty. Glyphs: we produced seven face features for each pair by calculating Filter Contribution on all seven filters separately. That is, each pair can be modeled using a vector in a multi-dimensional space (in our case, seven dimensions). Filters 1, 2, and 5 are mapped to (1) the face size, (2) the distance between forehead and jaw, and (3) the distance between the eyes, respectively. Therefore, it is also possible to intuitively observe that Filters 1, 2, and 5 (including Filter 7, observed in Fig. 5) play the major role in the characterization of true positives.
  15. The participating source code language affects the similarity at the IL level: a new, interesting discovery. “Is filtering neutral to the participating programming languages of clone pairs (in the cross-language clone detection context)?” That is, most of the faces are not round-shaped compared to the two other groups.
  16. Using three edit-distance methods (LCS, LEV, SimHash) to avoid dependency on a single comparison function in further case studies.
  17. The noticeable difference in project metrics (e.g., LOC) can be attributed to (1) dissimilarities in the programming languages, and (2) re-engineering and refactoring tasks. PDF library called iText and iText.NET: while their project names are similar, both projects are completely independent from each other. We created our third dataset from the iText (C# branch) and iText.NET (J#) source code.
  18. It is possible to detect numerous candidate clone pairs even for the cross-language case, regardless of the underlying algorithm.
      Additional info: (2) no candidate clone pair is detected for cross-language using 1.0 as the Similarity Factor (i.e., the decision-making threshold), which would only report clone pairs with completely identical content. Therefore, even using filtering on highly similar cross-language clone pairs (e.g., Fig. 1), some dissimilarities will have to be handled by the clone detection approach. However, this is not the case for single-language clone detection (shown in Fig. 7). (3) For all datasets, we can observe a major decrease in the number of candidates when the threshold value is set to a range between 0.6 and 0.8 (marked by ovals).
  19. Quality evaluation is inherently challenging in our research since there is no clear agreement on what constitutes true positives (TP) and on the various clone type definitions. Therefore, we applied the following approach in our qualitative evaluation: (1) since it is possible to easily locate false positives with confidence among candidate clone pairs, we first tag all false positives; (2) we assume the rest are true positives. However, in order to provide a more in-depth quality assessment, we also analyze the quality of the reported true positives.
      Fig. 9 reviews the findings of our quality evaluation from manually assessing ~2K candidate clone pairs (answering RQ4). In general, using the Normal threshold, all candidate clone pairs that were reported are true positives (100% TP). The quality decreases with less restrictive thresholds. For example, using SimHash and the Extreme threshold, the reported TP reduces to ~40%. The optimum, considering the trade-off between precision and recall, was achieved using Levenshtein Distance-based comparison with the High threshold (80% TP). Nevertheless, this result is not 100% precise.
  20. Why we need such a topic in general from an industry point of view; this constitutes our motivation. Applications being developed in different languages (customer/contract: iText, iText.NET; legal issues; community: Hibernate > NHibernate).
  22. Comparison -> Judge -> threshold -> yes/no
  24. It is a detailed study on challenges, possible solutions, evaluation, and final results. It is not only a clone detection approach but also an important study which gives insight for future research.
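
The Filter Contribution idea from the notes (compare similarity before and after filtering, then let a threshold-based judge decide) can be sketched as below. The exact formula (Eq. 2) is not preserved in this transcript, so everything here is an assumption for illustration: similarity is taken as 2 * LCS / (len(a) + len(b)) over lists of CIL lines, the contribution is taken as the similarity after filtering minus the similarity before, and the names lcs_length, similarity, filter_contribution and is_clone are hypothetical. The default threshold of 0.5 is only an example value inside the 0.4 to 0.55 range mentioned in the notes.

```python
from typing import Callable, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed LCS-based similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def filter_contribution(
    a: Sequence[str], b: Sequence[str], flt: Callable[[str], str]
) -> float:
    """Assumed definition: similarity gained by applying filter `flt`
    to every line of both fragments."""
    filtered_a = [flt(line) for line in a]
    filtered_b = [flt(line) for line in b]
    return similarity(filtered_a, filtered_b) - similarity(a, b)


def is_clone(a: Sequence[str], b: Sequence[str], threshold: float = 0.5) -> bool:
    """The 'judge': report a clone pair when similarity reaches the threshold."""
    return similarity(a, b) >= threshold


def drop_address(line: str) -> str:
    """Toy filter for the demo: drop everything up to the first ': '."""
    return line.split(": ", 1)[-1]


# Two toy fragments that differ only in their instruction addresses.
FROM_VB = ["IL_0001: ldc.i4.s 10", "IL_0003: stloc.0", "IL_0004: ldloc.0"]
FROM_CS = ["IL_0000: ldc.i4.s 10", "IL_0002: stloc.0", "IL_0003: ldloc.0"]

if __name__ == "__main__":
    print(similarity(FROM_VB, FROM_CS))
    print(filter_contribution(FROM_VB, FROM_CS, drop_address))
```

In this toy pair no raw line matches, so the whole similarity comes from the filter; on real data the notes report a gain of at least ~0.3 for cloned pairs and at most ~0.2 for non-cloned pairs, which is what makes a threshold usable.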