Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

This is not the original version presented at the WCRE 2012 conference (no animations, etc.).

Detecting Clones across
Microsoft .NET Programming Languages
Farouq Al-omari
Iman Keivanloo
Chanchal K. Roy
Juergen Rilling
Contact:
keivanloo@ieee.org

Working Conference on Reverse Engineering, Kingston, Canada, 18 October 2012 (I. Keivanloo)

[Figure: Clones (Mergesort): Mergesort and Merge code examples]

Clone Detection across Languages
General Solution
• C#
• VB.NET
• J#
• F#
• COBOL (.NET)
• Java

Intermediate Language (IL) (low level): the solution is to use this instead of dealing with several languages.

Clone Detection across Languages using IL
Does it have any chance of working?
Input Data Type:  CIL                               Source Code
Dataset           # Clone Class  # Clone Fragment   # Clone Class  # Clone Fragment
ASXGUI            9              393                69             261
Mono              37             4373               369            1523
• Up to 3 times more cloned fragments detected using IL
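
The "up to 3 times" figure can be checked against the fragment counts in the table. The short sketch below only redoes that arithmetic; the dictionary layout and the function name fragment_ratio are illustrative choices, not part of the original study.

```python
# Clone fragment counts copied from the table: (detected on CIL, detected on source code)
fragments = {"ASXGUI": (393, 261), "Mono": (4373, 1523)}


def fragment_ratio(cil_count, source_count):
    """How many times more clone fragments were found on CIL than on source code."""
    return cil_count / source_count


for dataset, (cil_count, source_count) in fragments.items():
    print(f"{dataset}: {fragment_ratio(cil_count, source_count):.2f}x")
# ASXGUI: 1.51x
# Mono: 2.87x
```

The Mono dataset gives the ratio closest to 3; ASXGUI is closer to 1.5.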


Observed Challenges (using an example)

VB.NET:
    Sub Main()
        Dim x As Integer
        x = 10
        If x < 0 Then
            x += 1
        Else
            Console.WriteLine("Positive number")
        End If
    End Sub

C#:
    static void Main(string[] args) {
        int x = 10;
        if (x < 0)
            x++;
        else
            Console.WriteLine("Positive number");
    }

Observed Challenges (using an example)
[Figure: the VB.NET and C# example shown next to its compiled IL, in four panels: VB, IL from VB, C#, IL from C#]

Observed Challenges

1- Larger unpredictable size at IL level [Keivanloo IWSC’12]
2- Higher dissimilarity at IL level

Observed Challenges #2: High Dissimilarity
Noise
• Sample IL
    IL_000c: ldloc.0
    IL_000d: ldc.i4.1
    IL_000e: add.ovf
    IL_000f: stloc.0
    IL_0010: br.s IL_0024
    IL_0012: nop
    IL_0013: ldstr "Positive number"
    IL_0018: call void [mscorlib]System.Console::WriteLine(string)
• Major noise types:
    • Line numbers
    • Pointers to line number
    • Push, Pop …
    • Detailed Data Type data

The Core Solution
• The Challenge: Noise
• Solution: Data cleansing (filtering out noise)
• Why? (Answer: to increase recall)

[Figure: Source Code, then IL + noise, then IL - noise]

Our Filter Set: filters for noise reduction

Filter    Before Filtering           After Filtering  Example Description
Filter 1  IL_0003: stloc.0           stloc.0          IL_0003 is the instruction address
Filter 2  brtrue.s IL_0015           brtrue.s         IL_0015 is the address of the branch destination
Filter 3  ldarg 3                    ldarg            The values 3 and 1 represent the argument number
          starg 1                    starg
Filter 4  ldc.i4.s 10                ldc.i4.s         10 is the number (pushed to the stack)
Filter 5  ldstr "Positive number"    ldstr            "Positive number" is the printed string constant
Filter 6  stloc 7                    stloc            7 represents the variable index
Filter 7  ldc.i4.s 10                ldc              i4 represents the int32 data type in CIL and s stands for short
Filter 8  IL_0011: add               add              Note that Filter 8 is just a nickname. Refer to the
          IL_0012: stloc.0           stloc            Filter 8 description section for more details.
          IL_0013: br.s IL_0020      br
          IL_001a: call void         call
            [mscorlib]System.Console::WriteLine(string)
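
The filter table can be read as a small text-normalization pass over IlDasm output. The following is a minimal sketch under that reading, not the authors' implementation: the function names (strip_label, opcode_only, base_opcode, filter_il) are hypothetical, and Filters 2 to 6 are collapsed into a single "drop every operand" step, which is a simplifying assumption.

```python
import re


def strip_label(line):
    """Filter 1 (as read from the table): drop the instruction address, e.g. 'IL_0003:'."""
    return re.sub(r"^\s*IL_[0-9a-fA-F]{4}:\s*", "", line).strip()


def opcode_only(instruction):
    """Filters 2-6, collapsed (assumption): drop every operand, i.e. branch
    targets, argument numbers, numeric constants, strings, variable indexes."""
    return instruction.split(None, 1)[0] if instruction else ""


def base_opcode(opcode):
    """Filter 7 and the short forms shown for Filter 8: keep only the text
    before the first '.', so 'ldc.i4.s' -> 'ldc' and 'stloc.0' -> 'stloc'."""
    return opcode.split(".", 1)[0]


def filter_il(lines):
    """Apply all steps to each non-empty IL line and return the opcode sequence."""
    result = []
    for line in lines:
        instruction = strip_label(line)
        if instruction:
            result.append(base_opcode(opcode_only(instruction)))
    return result


sample = [
    "IL_0011: add",
    "IL_0012: stloc.0",
    "IL_0013: br.s IL_0020",
    "IL_001a: call void [mscorlib]System.Console::WriteLine(string)",
]
print(filter_il(sample))  # ['add', 'stloc', 'br', 'call']
```

The sample input is the Filter 8 row of the table, and the output matches its "After Filtering" column. How aggressively to filter is the design question the following slides examine: every dropped operand removes noise but also removes information.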

Filtering Advantage: Recall Improvement

VB.NET vs. C# (the Sub Main() example):
Before filtering noise: ~50% similarity
After filtering noise: ~90% similarity
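
To illustrate why filtering raises measured similarity, the sketch below compares two short IL fragments before and after normalization. Everything in it is an assumption made for illustration: the C# fragment is hand-written rather than compiler output, the similarity measure (LCS length divided by the longer sequence length) is one simple choice, and the resulting toy numbers are not the ~50% and ~90% reported on the slide.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """LCS length normalized by the longer sequence (an assumed definition)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def normalize(line):
    """Drop the 'IL_xxxx:' label, the operands, and any '.suffix' of the opcode."""
    instruction = line.split(":", 1)[1] if line.startswith("IL_") else line
    return instruction.split()[0].split(".")[0]


# First fragment: taken from the sample IL earlier in the deck.
vb_il = ["IL_000c: ldloc.0", "IL_000d: ldc.i4.1", "IL_000e: add.ovf", "IL_000f: stloc.0"]
# Second fragment: hypothetical, written by hand for this example.
cs_il = ["IL_0009: ldloc.0", "IL_000a: ldc.i4.1", "IL_000b: add", "IL_000c: stloc.0"]

print(similarity(vb_il, cs_il))  # 0.0, no raw line is identical
print(similarity([normalize(l) for l in vb_il],
                 [normalize(l) for l in cs_il]))  # 1.0 after normalization
```

In the raw text every line differs because of its address, so a line-based comparison finds nothing in common. After normalization both fragments reduce to the same opcode sequence.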


Disadvantage of Noise reduction
Danger!
• Data Loss
• What if we remove important data during data cleansing?
• Might mislead the detection by making noncloned pairs identical
=> Possible negative effect on precision

[Figure: Filtering Color Data]

RQ: Are They (Filters) Dangerous?
Evaluation Preparation
1. Filter Contribution Formula:

2. Dataset preparation:
   – Controlled dataset (iText.NET J#): 25 pairs * 3 languages
     1. The Cloned Dataset (VB-C#, VB-J#, and C#-J#)
     2. The Noncloned Dataset (VB-C#, VB-J#, and C#-J#)

Filter Contribution - Study #1
• Are they harmful? The answer is NO. Based on the following graphs, the filters do not remove a similar amount of data from actual clones and from noncloned code fragments.

[Figure: filter contribution for the Cloned Dataset and the NonCloned Dataset, annotated "A strong threshold for the Judge to decide"; marked values: 0.3 and 0.2]

• Are they useful? The answer is YES. Based on the given figure, our filters help to discriminate between actual clones and noncloned fragments, so it is possible to separate them with high confidence using the chosen threshold.

• Does filtering make actual clone pairs and noncloned pairs similar? We used Chernoff faces (glyphs) to see whether the filters make noncloned pairs similar to cloned code. Each face represents a pair. As you can see, the faces in Group A differ from those in Group B in most cases.

Final Conclusion:

Filters help to discriminate between cloned and noncloned fragments

An Interesting Unexpected Discovery
Language dependency!

Corresponding faces in each group are not similar, even though all of them are extracted from a single language (IL). Look especially at the C#-J# faces: all of them differ from the other groups. This is an interesting discovery: the original high-level programming languages affect similarity at the IL level.

Our Clone Detection Framework

[Figure: framework pipeline, in five stages]
1. Input: .NET Code: Source Code; MS .NET EXE & DLL; IlDasm.exe; CIL (plain text)
2. CIL Manipulation for Clone Detection: Proposed Filtering Mechanism
3. Clone Detection Algorithms: LCS-based (from NiCad); SimHash-based (from SimCad); Levenshtein Distance-based
4. Clone Analysis: Clone Clusters; Merging; Source Code Mapping
5. Reporting: Report (CIL); Report (Src Code)
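
One of the detection options named in the framework is SimHash-based. The slides do not describe how SimCad computes its fingerprints, so the following is a generic SimHash sketch over filtered opcode tokens, not SimCad's implementation. The MD5-based token hash, the 64-bit width, and the function names are all assumptions.

```python
import hashlib


def token_hash(token, bits=64):
    """Stable hash of one token (assumption: first bits of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(tokens, bits=64):
    """Generic SimHash: each token votes +1 or -1 on every bit position."""
    votes = [0] * bits
    for token in tokens:
        h = token_hash(token, bits)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


a = ["ldloc", "ldc", "add", "stloc", "br"]
b = ["ldloc", "ldc", "add", "stloc", "br"]
print(hamming(simhash(a), simhash(b)))  # 0 for identical token sequences
```

A small Hamming distance between two fingerprints suggests similar token content. Note that this plain variant ignores token order; hashing n-grams of opcodes instead of single opcodes is one common way to make it order-sensitive.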


The Selected Datasets for Performance Evaluation

Dataset      Language  File   LOC     Method
ASXGUI 2.5   VB.NET    47     32,594  303
ASXGUI 3.0   C#        19     2,088   78

Mono 2.10    VB.NET    375    -       -
Mono 2.10    C#        57     -       -
Total                  432    -       4,998

iText        C#        -      -       -
iText.NET    J#        -      -       -
Total                  2.5 K  600 K

4th Dataset: iText.NET dataset from 1st case study

Our Clone Detection Framework Performance

[Figure: framework performance plot; pay attention to changes within 0.6 … 0.8]

Our Clone Detection Framework
• 2K clone pairs manually investigated

Thresholds: 0.6 Normal, 0.7 High, 0.8 Extreme
TP = {E and S}

Precision: The optimum, considering the trade-off between precision and recall, was achieved using Levenshtein distance-based comparison with the High threshold (80% TP).

Recall (iText.NET API): 76% using the High threshold between three languages (C#, J#, and VB.NET).
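
The Levenshtein distance-based comparison with named thresholds could look roughly like the sketch below. It assumes that the thresholds 0.6, 0.7 and 0.8 are minimum similarity values and that similarity is one minus the edit distance divided by the longer sequence length; neither assumption is confirmed by the slides, and the function names are hypothetical.

```python
# Threshold names and values from the slide; treating them as minimum
# similarity values is an assumption.
THRESHOLDS = {"Normal": 0.6, "High": 0.7, "Extreme": 0.8}


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(a, b):
    """1 - distance / longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_clone_pair(a, b, level="High"):
    """Report a pair as a clone when similarity reaches the chosen threshold."""
    return similarity(a, b) >= THRESHOLDS[level]


first = ["ldloc", "ldc", "add", "stloc"]
second = ["ldloc", "ldc", "add", "stloc", "br"]
print(levenshtein(first, second))             # 1
print(is_clone_pair(first, second, "High"))   # True
```

Working on filtered opcode sequences rather than raw characters keeps the distance independent of operand text, which fits the noise-filtering step that precedes detection in the framework.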

An Interesting Clone Detected by Our Approach

C#:
    private static string filename_nodir(string name)
    {
        int slash = -1, len = name.Length;
        for (int i = 0; i < len; i++)
        {
            string sub = name.Substring(i, 1);
            if (sub == "" || sub == "/")
                slash = i;
        }
        slash++;
        return name.Substring(slash, len - slash);
    }

VB.NET:
    Function Filename_Nodir() As String
        Dim intFileName As Integer, intSlash As Integer, strFilename As String
        strFileName = editvid.video
        For intFilename = 1 To len(strFileName)
            If mid(strfilename, intfilename, 1) = "" Or mid(strfilename, intfilename, 1) = "/" Then
                intslash = intFilename
            End If
        Next
        Return mid(strFileName, intSlash + 1, len(strFilename) - intSlash)
    End Function

*The matching algorithm was limited to the content available within the boxes (it was NOT aware of the identical method names)

Summary
• The first comprehensive research focusing on
(1) .NET clone detection,
(2) across programming languages,
and (3) using Intermediate Language
• Identified challenges in cross-language clone detection using IL


Related Publication
Iman Keivanloo, Chanchal K. Roy, Juergen Rilling,
“Java Bytecode Clone Detection via Relaxation on Code
Fingerprint and Semantic Web Reasoning,”
6th International Workshop on Software Clones (IWSC), 2012.

Contact: keivanloo@ieee.org

Any questions?
