This presentation was given at the Working Conference on Reverse Engineering (WCRE 2012).
The paper title is: "Detecting Clones across Microsoft .NET Programming Languages"
Abstract:
The Microsoft .NET framework and its language
family focus on multi-language development to support
interoperability across several programming languages. The
framework allows for the development of similar applications
in different languages through the reuse of core libraries. As a
result of such a multi-language development, the identification
and traceability of similar code fragments (clones) becomes a
key challenge. In this paper, we present a clone detection
approach for the .NET language family. The approach is based
on the Common Intermediate Language, which is generated by
the .NET compiler for the different languages within the .NET
framework. In order to achieve an acceptable recall while
maintaining the precision of our detection approach, we define
a set of filtering processes to reduce noise in the raw data. We
show that these filters are essential for Intermediate Language-based
clone detection, without significantly affecting the
precision of the detection approach. Finally, we study the
quantitative and qualitative performance aspects of our clone
detection approach. We evaluate the number of reported
candidate clone-pairs, as well as the precision and recall (using
manual validation) for several open source cross-language
systems, to show the effectiveness of our proposed approach.
Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)
1. This is not the original version given in the WCRE 2012 conference (no animation etc.)
Detecting Clones across
Microsoft .NET Programming Languages
Farouq Al-omari
Iman Keivanloo
Chanchal K. Roy
Juergen Rilling
Contact:
keivanloo@ieee.org
Working Conference on Reverse Engineering, Canada, Kingston 18 October 2012 – I. Keivanloo
3. Clone Detection across Languages
General Solution
• C#
• VB.NET
• J#
• F#
• COBOL (.NET)
• Java
Intermediate Language (IL) (low level)
The solution is to use this (instead of dealing with several languages)
4. Clone Detection across Languages using IL
Is there any chance this will work?
Input Data Type: CIL vs. Source Code
Dataset | CIL: # Clone Class | CIL: # Clone Fragment | Source Code: # Clone Class | Source Code: # Clone Fragment
ASXGUI  | 9                  | 393                   | 69                         | 261
Mono    | 37                 | 4373                  | 369                        | 1523
• Up to 3 times more cloned fragments detected using IL
5. Clone Detection across Languages using IL
Observed Challenges (using an example)
VB.NET
    Sub Main()
        Dim x As Integer = 10
        If x < 0 Then
            x += 1
        Else
            Console.WriteLine("Positive number")
        End If
    End Sub
C#
    static void Main(string[] args){
        int x=10;
        if(x<0)
            x++;
        else
            Console.WriteLine ("Positive number");
    }
6. Clone Detection across Languages using IL
Observed Challenges (using an example)
[Figure: four columns, VB | IL from VB | C# | IL from C#, for the same VB.NET and C# Main() example as on the previous slide]
7. Clone Detection across Languages using IL
Observed Challenges (same VB.NET and C# example as on the previous slides)
1- Larger unpredictable size at IL level [Keivanloo IWSC’12]
2- Higher dissimilarity at IL level
8. Observed Challenges #2: High Dissimilarity
Noise
• Major noise types:
  • Line numbers
  • Pointers to line number
  • Push, Pop …
  • Detailed Data Type data
• Sample IL:
    IL_000c: ldloc.0
    IL_000d: ldc.i4.1
    IL_000e: add.ovf
    IL_000f: stloc.0
    IL_0010: br.s IL_0024
    IL_0012: nop
    IL_0013: ldstr "Positive number"
    IL_0018: call void [mscorlib]System.Console::WriteLine(string)
9. Clone Detection across Languages using IL
The Core Solution
• The Challenge: Noise
• Solution: Data cleansing (filtering noise)
• Why? (Answer: to increase recall)
Source Code -> IL + noise -> IL - noise
10. Our Filters for noise reduction
Filter Set | Before Filtering | After Filtering | Example Description
Filter 1 | IL_0003: stloc.0 | stloc.0 | IL_0003 (instruction address)
Filter 2 | brtrue.s IL_0015 | brtrue.s | IL_0015 is the address of the branch destination
Filter 3 | ldarg 3; starg 1 | ldarg; starg | The values 3 and 1 represent the argument number
Filter 4 | ldc.i4.s 10 | ldc.i4.s | 10 is the number (pushed to the stack)
Filter 5 | ldstr "Positive number" | ldstr | "Positive number" is the printed string constant
Filter 6 | stloc 7 | stloc | 7 represents the variable index
Filter 7 | ldc.i4.s 10 | ldc | i4 represents the int32 data type in CIL and s stands for Short
Filter 8 | IL_0011: add; IL_0012: stloc.0; IL_0013: br.s IL_0020; IL_001a: call void [mscorlib]System.Console::WriteLine (string) | add; stloc; br; call | Note that Filter 8 is just a nickname. Refer to the Filter 8 description section for more details.
11. Clone Detection across Languages using IL
Filtering Advantage: Recall Improvement
VB.NET vs. C# (the Main() example from the Observed Challenges slides)
Before Filtering Noises: ~50% similarity
After: ~90% similarity
12. Disadvantage of Noise reduction
Danger!
• Data Loss
• What if we remove important data during data cleansing?
• Might mislead the detection by making non-cloned pairs identical
Possible negative effect on Precision
[Image: Filtering Color Data]
13. RQ: Are They (Filters) Dangerous?
Evaluation Preparation
1. Filter Contribution Formula:
2. Dataset preparation:
– Controlled dataset (iText.NET J#) 25 pairs * 3 Lang.
1. The Cloned Dataset (VB-C#, VB-J#, and C#-J#)
2. The Noncloned Dataset (VB-C#, VB-J#, and C#-J#)
14. RQ: Are They (Filters) Dangerous?
Filter Contribution - Study #1
• Are they harmful? (The answer is NO. Based on the graphs, the filters do not remove a similar amount of data from actual clones and from non-cloned code fragments.)
[Figure: Filter Contribution for the Cloned Dataset vs. the NonCloned Dataset; annotated values 0.3 (cloned) and 0.2 (non-cloned)]
A strong threshold for the Judge to decide
15. RQ: Are They (Filters) Dangerous?
Filter Contribution - Study #2
• Are they useful? (The answer is YES. Based on the given figure, our filters help to discriminate between actual clones and non-cloned fragments; therefore it is possible to separate them with high confidence using the chosen threshold.)
16. RQ: Are They (Filters) Dangerous?
Filter Contribution - Study #3
• Does filtering make actual clone-pairs and non-cloned pairs similar? (We used Chernoff faces, i.e. glyphs, to see if the filters make non-cloned pairs similar to cloned code. Each face represents a pair. As you can see, the faces in Group A are different from those in Group B in most cases.)
Final Conclusion:
Filters contribute to discriminating between cloned and non-cloned fragments
17. An Interesting Unexpected Discovery
Language-dependency!!!
Corresponding faces in each group are not similar, although all of them are extracted from a single language (IL). In particular, look at the C#-J# faces: all of them are different from the other groups. This is an interesting discovery: the original high-level programming languages affect similarity at the IL level.
18. Clone Detection across Languages using IL
Our Clone Detection Framework
[Framework diagram, stage by stage]
• Input: .NET Code
  – Source Code
  – MS .NET EXE & DLL
• CIL Manipulation for Clone Detection
  – IlDasm.exe
  – CIL (plain text)
  – Proposed Filtering Mechanism
• Clone Detection Algorithms
  – LCS-based (from NiCad)
  – SimHash-based (from SimCad)
  – Levenshtein Distance-based
• Clone Analysis
  – Clone Clusters Merging
  – Source Code Mapping
• Reporting
  – Report (CIL)
  – Report (Src Code)
19. The Selected Datasets for Performance Evaluation
Dataset     | Language | File  | LOC    | Method
ASXGUI 2.5  | VB.NET   | 47    | 32,594 | 303
ASXGUI 3.0  | C#       | 19    | 2088   | 78

Dataset     | Language | File  | LOC    | Method
Mono 2.10   | VB.NET   | 375   | -      | -
Mono 2.10   | C#       | 57    | -      | -
Total       |          | 432   | -      | 4998

Dataset     | Language | File  | LOC    | Method
iText       | C#       | -     | -      | -
iText.NET   | J#       | -     | -      | -
Total       |          | 2.5 K | 600 K  |

4th Dataset: iText.NET dataset from 1st case study
20. Clone Detection across Languages using IL
Our Clone Detection Framework Performance
Pay attention to changes within 0.6 … 0.8
21. Clone Detection across Languages using IL
Our Clone Detection Framework
• 2K clone-pairs manually investigated
Precision
The optimum, considering the trade-off between precision and recall, was achieved using Levenshtein Distance-based comparison with the High threshold (80% TP)
Recall (iText.NET API)
76% using the High threshold between three languages (C#, J#, and VB.NET).
[Figure labels: 0.6, 0.7, 0.8; Normal, High, Extreme; TP = {E and S}]
22. An Interesting Clone Detected by Our Approach
C#
    private static string filename_nodir(string name)
    {
        int slash = -1, len = name.Length;
        for (int i = 0; i < len; i++)
        {
            string sub = name.Substring(i, 1);
            if (sub == "" || sub == "/")
                slash = i;
        }
        slash++;
        return name.Substring(slash, len - slash);
    }
VB.NET
    Function Filename_Nodir() As String
        Dim intFileName As Integer, intSlash As Integer, strFilename As String
        strFileName = editvid.video
        For intFilename = 1 To len(strFileName)
            If mid(strfilename, intfilename, 1) = "" Or mid(strfilename, intfilename, 1) = "/" Then
                intslash = intFilename
            End If
        Next
        Return mid(strFileName, intSlash + 1, len(strFilename) - intSlash)
    End Function
*The matching algorithm was limited to the content available within the boxes (it was NOT aware of same method names)
23. Summary
• The first comprehensive research focusing on,
(1) .NET clone detection,
(2) across programming languages,
and (3) using Intermediate Language
• Identified challenges in cross language clone detection + IL
[Figure: the clone detection framework from slide 18 (Input: .NET Code, CIL Manipulation for Clone Detection, Clone Detection Algorithms, Clone Analysis, Reporting)]
24. Related Publication
Iman Keivanloo, Chanchal K. Roy, Juergen Rilling,
“Java Bytecode Clone Detection via Relaxation on Code
Fingerprint and Semantic Web Reasoning,”
6th International Workshop on Software Clones (IWSC), 2012.
In this paper we answered some of the very basic research questions related to this topic. A general clone detection framework.
This talk is about source code clones, and I am going to use sheep to present clones. Suppose that there is a sheep which is doing merge sort, and the other sheep are also doing merge sort. I can detect them as clone groups since they are identical. So far there is no problem; however, it becomes challenging when we want to find sheep from other planets which are doing merge sort as well.
Two code fragments that share some degree of similarity are typically considered a clone pair. Based on their actual similarity, clone pairs can be categorized [5, 8] as Type-1, Type-2, Type-3, and Type-4 clones. Type-1 clones are exact copies of each other, except for possible differences in whitespace, layout and comments. Type-2 clones are syntactically identical fragments except for variations in identifiers, literals, data types, whitespace, layout and comments. Copied fragments (e.g., Type-1 and Type-2 clones) with further modifications such as additions, deletions and changes of statements are called Type-3 clones. Type-2 and Type-3 clones are also known as near-miss clones. Code fragments that perform the same computation (e.g., semantically similar) but are implemented through different syntactic variations are called Type-4 clones. Note that all of these definitions were originally introduced for clone-pairs implemented in the same programming language. In our cross-language clone research these definitions are no longer applicable as-is, and have to be refined to meet our research context. For example, the VB and C# fragments in Fig. 1 would be considered Type-1 clones in cross-language clone detection since they are essentially performing the same task implemented in different programming languages.
To the best of our knowledge, C2D2 [10] is the only tool capable of detecting cross-language clones. It uses the NRefactory Library to generate the Unified CodeDOM graph for both C# and VB.NET. A string is generated by traversing this graph and targeted to a string matching algorithm (focusing on single-language clone detection, mostly Java). One of the first studies on Intermediate Language clone detection is by Baker [9]. After some preprocessing (e.g., remapping offsets), she uses three comparison techniques (e.g., Diff [22]) to find similar fragments. Davis and Godfrey [23] use the disassembler for both Java and C/C++ to detect clones in a single language. Selim et al. use “Jimple” [24]. Juricic [26] uses Intermediate Language code to detect plagiarism and similarities. The approach is based on Levenshtein Distance as the similarity measure to compare disassembled C# binaries, and applies some primitive preprocessing techniques which are comparable to two of our filters. There are also some formal approaches, such as by Cuomo et al. [27], that transform Java bytecode to mathematical models for clone detection.
.NET targets multi-language development vs. Java's multi-platform direction.
Now the problem is changed to single-language clone detection, so the problem is solved (it seems easy). Actually, it is possible but it is not easy, and we show in this paper why this is the case.
.NET: Contrary to Java, which targets application development using one language on several platforms, .NET aims for multi-language development on a single platform. It provides language interoperability, with each program module being able to use code written in the other languages.
As long as it finds something, it is worth a try; it is tempting to give it a try.
In this research we have observed interesting challenges in this problem. I am going to show some of them using an example. Even in this simple example, we can clearly see serious challenges to be addressed.
Challenges:
1- Larger unpredictable size [sebyte] (being a lower-level language, 5 LOC become 20 LOC)
2- High dissimilarity at bytecode, even in cases of semantically identical source code fragments
Additional info:
Being a lower-level representation, CIL code size tends to be much larger than traditional high-level source code. Fig. 1 (the first two columns) shows a comparison between a VB code fragment (a small VB method) and its corresponding CIL representation. In this example the method body with five lines of code has been transformed to more than twenty lines of code in CIL. This creates an additional challenge, making clone detection on binary rather different from source code. Nevertheless, this common representation of code fragments written in different programming languages provides the ability to use CIL for clone detection across .NET languages. However, a key challenge is the fact that it is possible to have some dissimilarity at the CIL level, even in cases of semantically identical source code fragments (written in different .NET languages). The first four columns of Fig. 1 (the Raw Data section) provide an example of such dissimilarities. Both the VB and C# methods implement the same program, following a similar coding pattern and structure as much as possible. However, when we compare the CIL pairs, there are three key sections clearly distinguishable: (1) identical CIL content, which is marked by the first dashed area, (2) the first point of dissimilarity, which is flagged by the italic font style, and (3) the rest of the content, marked by the second dashed box, that covers CIL content with considerable dissimilarity. In general, this example highlights the key challenge in binary clone detection: the possibility of facing dissimilarity when exploiting the .NET Intermediate Language, even for semantically (and almost syntactically) identical fragments in a cross-language context.
Filter 1: Removal of the instruction address (IL_xxxx:) at the beginning of each CIL instruction, eliminating dissimilarities due to application/environment specific variations.
Filter 2: Removal of the instruction address (IL_xxxx) for branching statements. As part of this filtering step we cover all 33 branching statements (e.g. beq, beq.s, bge).
Filter 3: Removal of integer values that represent the argument number in CIL, e.g. ldarg 3 is interpreted in CIL as load argument number 3 onto the stack. Instructions included in this filter are: starg, starg.s, ldarg, ldarg.s, ldarga, and ldarga.s.
Filter 4: This filter eliminates constants in the CIL code, e.g. “ldc.i4 num”, which pushes num of type int32 onto the stack as int32. Instructions covered by this filter are ldc.i4, ldc.i8, ldc.r4, ldc.r8, and ldc.i4.s.
Filter 5: This filter removes all print literals in the CIL code, which are identified through ldstr statements.
Filter 6: This filter removes all variable indexes, as in stloc index, which corresponds to popping a value from the stack into a local variable. Among the instructions covered by this filter are: ldloc, ldloc.s, ldloca.s, stloc and stloc.s.
Filter 7: This filter removes some additional data types and constant integers, such as i4 from “ldc.i4.1”. The complete command pushes 1 as an int32 onto the stack.
Filter 8: Is not actually a new filter; it combines all seven filtering techniques mentioned above, including the preprocessing tasks, in one single filter.
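The filter descriptions above can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' implementation: the names filter_instruction and filter_method are hypothetical, the regular expressions are assumptions about how ildasm prints labels, dropping the operand of call is assumed from the Filter 8 row of the slide, and Filter 7 is approximated by keeping only the base mnemonic before the first dot.

```python
import re

# Assumption: labels look like "IL_000c:" (four hex digits), as in the sample IL.
ADDRESS_LABEL = re.compile(r"^IL_[0-9a-fA-F]{4}:\s*")
BRANCH_TARGET = re.compile(r"\s+IL_[0-9a-fA-F]{4}$")

# Opcodes whose operand is dropped, grouped by the filter that names them.
OPERAND_OPCODES = {
    "starg", "starg.s", "ldarg", "ldarg.s", "ldarga", "ldarga.s",  # Filter 3
    "ldc.i4", "ldc.i8", "ldc.r4", "ldc.r8", "ldc.i4.s",            # Filter 4
    "ldstr",                                                        # Filter 5
    "ldloc", "ldloc.s", "ldloca.s", "stloc", "stloc.s",            # Filter 6
    "call",  # assumption taken from the Filter 8 example row, not from a filter text
}


def filter_instruction(line: str) -> str:
    """Apply the combined filter (the 'Filter 8' idea) to one CIL line."""
    text = ADDRESS_LABEL.sub("", line.strip())   # Filter 1: instruction address
    text = BRANCH_TARGET.sub("", text)           # Filter 2: branch destination
    if not text:
        return ""
    opcode, _, operand = text.partition(" ")
    if opcode in OPERAND_OPCODES:                # Filters 3 to 6: drop the operand
        operand = ""
    base = opcode.split(".", 1)[0]               # Filter 7 (approximation)
    return (base + " " + operand.strip()).strip()


def filter_method(lines):
    """Filter every line of a method body and drop lines that become empty."""
    filtered = (filter_instruction(line) for line in lines)
    return [f for f in filtered if f]


sample = [
    "IL_000c: ldloc.0",
    "IL_000d: ldc.i4.1",
    "IL_000e: add.ovf",
    "IL_000f: stloc.0",
    "IL_0010: br.s IL_0024",
    "IL_0012: nop",
    'IL_0013: ldstr "Positive number"',
    "IL_0018: call void [mscorlib]System.Console::WriteLine(string)",
]
print(filter_method(sample))
```

Keeping each filter as a separate step in filter_instruction makes it straightforward to switch a single filter on or off when measuring its individual effect.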
Before filtering: ~50% similarity; after filtering: almost similar.
We address this challenge by creating a set of cleaning and filtering steps for CIL to improve the performance of Type-1, Type-2, Type-3 and Type-4 clone detection in the CIL code. The filters are designed to improve the detection rate (i.e., recall), since the CIL data contains a significant amount of noise (e.g., reference numbers to string tables, which are compilation context dependent). Due to such noise in the CIL files, two semantically identical source code fragments might no longer be considered as highly similar at the CIL level (e.g., content-similar VB and C# methods might have less than 50% similarity at the CIL level, see Fig. 1).
Filters increase RECALL but might decrease PRECISION drastically.
A major threat to any filter-based approach is the loss of precision by filtering out essential data. As a result, excessive or improper loss of data (due to filtering) can lead to a situation where non-answers and actual answers become similar to the decision making algorithm, which eventually leads to an increase in the false positive ratio.
Filter Contribution measures the effectiveness of each filter, that is, how much it increases the content similarity after filtering compared to before.
iText.NET (J#): 25 Feature Code Samples (C#, VB.NET, J#) -> 75 code fragments, which mutually created three true positive clone-pair sets (VB-C#, VB-J#, and C#-J#). The second dataset (a.k.a. the Noncloned Fragments Dataset) contains 25 non-clone classes and, as well, 75 false positive clone-pair candidates created in the same manner as the clone classes.
Additional info:
To answer this question, we defined a metric called Filter Contribution that measures the effectiveness of each filter. The underlying idea is to measure the similarity degree of candidate clone-pairs before and after applying different filters. The measure will indicate how much a particular filter increases the similarity value between two fragments. Note that in the ideal case, we expect that a filter would increase the similarity values of true positive cases significantly more than the ones for false positive cases. Otherwise, a particular filter would not be useful to discriminate (with high confidence) against false positives. The Filter Contribution (FltrCntrb) function is defined in Eq. 2, which is based on LCS-based similarity. denotes the participant fragments in the clone-pair under investigation and presents the filter function with x being the filter number.
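The idea behind Filter Contribution can be sketched as below. Eq. 2 itself is not reproduced in this transcript, so both definitions here are assumptions: similarity is taken as the LCS length divided by the length of the longer fragment, and the contribution is taken as the similarity after filtering minus the similarity before. The helper strip_address is a hypothetical stand-in for a single filter.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Assumed normalisation: LCS length over the longer fragment's length."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def filter_contribution(frag_a, frag_b, filter_fn):
    """Assumed form: similarity gain caused by applying filter_fn to both fragments."""
    before = lcs_similarity(frag_a, frag_b)
    after = lcs_similarity(filter_fn(frag_a), filter_fn(frag_b))
    return after - before


def strip_address(lines):
    """Hypothetical single filter: drop the 'IL_xxxx: ' prefix of each line."""
    return [line.split(": ", 1)[1] for line in lines]


# Two made-up fragments that differ only in their instruction addresses.
frag_a = ["IL_0001: ldloc.0", "IL_0002: add"]
frag_b = ["IL_0005: ldloc.0", "IL_0006: add"]
print(filter_contribution(frag_a, frag_b, strip_address))
```

Under these assumptions a useful filter would yield a clearly larger value for cloned pairs than for non-cloned pairs, which is the property the case studies check.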
It has no negative effect. In most cases, the filters increased the similarity by up to ~0.2 (max) for non-cloned pairs while improving the similarity of cloned pairs by at least ~0.3. F8: for non-cloned pairs the increase is less than 0.5, while for the majority of cloned pairs the similarity increases between 0.5 and 0.8. This result supports our research hypothesis that filtering increases the similarity values for true positive cases (the cloned dataset) with a higher ratio than for the false positive cases (the non-cloned dataset).
Not only does it have no negative effect, it also contributes to discriminating between them.
To support our claim, we conducted another case study on the same dataset to determine if our filters can be used to identify an appropriate similarity threshold. Fig. 3 summarizes the findings, showing that before applying our filters, there was no clear distinction between the similarity values of actual clone pairs (true positives) and false positives. Therefore it is impossible to determine an adequate threshold that allows separating actual clones from false positives. In contrast, Fig. 3 shows that the filters address this problem by increasing the distance between the two groups (tagged on the right side of Fig. 3). For example, using our filters, a threshold from 0.4 to 0.55 can separate true positives from false positives with high confidence.
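The threshold argument above can be illustrated with a small sketch. The function name separating_range is hypothetical and all similarity values below are made up for illustration; they are not the study's data. A separating threshold exists only when every non-cloned pair scores below every cloned pair.

```python
def separating_range(clone_sims, nonclone_sims):
    """Return (low, high) such that any threshold strictly between them
    separates the two groups, or None when the groups overlap."""
    low = max(nonclone_sims)   # highest similarity among non-cloned pairs
    high = min(clone_sims)     # lowest similarity among cloned pairs
    return (low, high) if low < high else None


# Made-up values: before filtering the groups overlap, after filtering they do not.
before_clones, before_nonclones = [0.45, 0.60], [0.40, 0.50]
after_clones, after_nonclones = [0.60, 0.80], [0.30, 0.38]

print(separating_range(before_clones, before_nonclones))
print(separating_range(after_clones, after_nonclones))
```

In this toy case no threshold works before filtering, while after filtering any value between the two returned bounds does.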
Chernoff faces, invented by Herman Chernoff, display multivariate data in the shape of a human face. The individual parts, such as eyes, ears, mouth and nose, represent values of the variables by their shape, size, placement and orientation. The idea behind using faces is that humans easily recognize faces and notice small changes without difficulty.
Glyphs: we produced seven face features for each pair by calculating Filter Contribution on all seven filters separately. That is, each pair can be modeled using a vector in a multi-dimensional space (in our case, seven dimensions).
Filter 1, 2, and 5 since they are mapped to: (1) the face size, (2) the distance between forehead and jaw, and (3) the distance between the eyes, respectively. Therefore, it is also possible to intuitively observe that Filters 1, 2, and 5 (including Filter 7, observed in Fig. 5) play the major role in the characterization of true positives.
The participating source language affects the similarity at the IL level. A new interesting discovery: “Is filtering neutral to the participating programming languages of clone-pairs (in cross-language clone detection context)?” That is, most of the faces are not round shaped compared to the two other groups.
Using three comparison methods (LCS, Levenshtein distance, SimHash) to avoid comparison function dependency in further case studies.
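As an illustration of one of the three comparison methods, the sketch below computes a Levenshtein-based similarity over sequences of filtered CIL instructions and compares it to a threshold. The normalisation (one minus distance over the longer length) and the names levenshtein_similarity and is_candidate are assumptions for this example, and the two instruction sequences are made up.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_candidate(a, b, threshold):
    """Report a pair as a candidate clone when similarity reaches the threshold."""
    return levenshtein_similarity(a, b) >= threshold


# Made-up filtered instruction sequences that differ by a single extra nop.
il_from_vb = ["ldloc", "ldc", "add", "stloc", "br"]
il_from_cs = ["ldloc", "ldc", "add", "stloc", "nop", "br"]
print(levenshtein_similarity(il_from_vb, il_from_cs))
print(is_candidate(il_from_vb, il_from_cs, 0.8))
```

Comparing instruction tokens rather than characters keeps the measure independent of mnemonic length; whether the original tool compares tokens or characters is not stated here.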
The noticeable difference in project metrics (e.g., LOC) can be attributed to (1) dissimilarities in the programming languages, and (2) re-engineering and refactoring tasks.
There are PDF libraries called iText and iText.NET. While their project names are similar, both projects are completely independent from each other. We created our third dataset from the iText (C# branch) and iText.NET (J#) source code.
(1) It is possible to detect numerous candidate clone-pairs even for the cross-language case, regardless of the underlying algorithm.
Additional info:
(2) No candidate clone-pair is detected for cross-language using 1.0 as the Similarity Factor (i.e., the decision making threshold), which would only report clone-pairs with completely identical content. Therefore, even using filtering on highly similar cross-language clone-pairs (e.g., Fig. 1), some dissimilarities will have to be handled by the clone detection approach. However, this is not the case for single-language clone detection (shown in Fig. 7). (3) For all datasets, we can observe a major decrease in the number of candidates when the threshold value is set to a range between 0.6 and 0.8 (marked by ovals).
Quality evaluation is inherently challenging in our research since there is no clear agreement on what constitutes true positives (TP) and on the various clone type definitions. Therefore, we applied the following approach in our qualitative evaluation: (1) since it is possible to easily locate false positives with confidence among candidate clone-pairs, we first tag all false positives; (2) we assume the rest to be true positives. However, in order to provide a more in-depth quality assessment, we also analyze the quality of the reported true positives.
Fig. 9 reviews the findings of our quality evaluation from manually assessing ~2K candidate clone-pairs (answering RQ4). In general, using the Normal threshold, all candidate clone-pairs that were reported are true positives (100% TP). The quality decreases with less restrictive thresholds. For example, using SimHash and the Extreme threshold, the reported TP reduces to ~40%. The optimum, considering the trade-off between precision and recall, was achieved using Levenshtein Distance-based comparison with the High threshold (80% TP). Nevertheless, this result is not 100% precise.
Why do we need such a topic in general, from an industry point of view? These points constitute our motivation.
Applications being developed in different languages (customer/contract: iText, iText.NET; legal issues; community: Hibernate > NHibernate).
It is a detailed study of the challenges, a possible solution, the evaluation, and the final result. It is not only a clone detection approach but also an important study which gives insight for future research.