1. Guide: Ms Sangeetha Jamal, Dept of Computer Science
Presented by: Merin Paul, Mtech CS-IS S1
9/25/2012 1
2. Contents
Introduction
Types of Source-code Plagiarism
Textual Similarity
Functional Similarity
Source Code Detection Algorithms
Detecting Techniques
Tools used for code based plagiarism
Conclusion
3. Introduction
Plagiarism in source-code files occurs when source-code
is copied and edited without proper acknowledgment of
the original author.
Techniques for plagiarism: Lexical changes and structural
changes.
Lexical changes: changes that can be done to the source-
code without affecting the parsing of the program
4. Introduction
Structural changes: changes made to the source code that
will affect the parsing of the code and involve program
debugging.
Reasons for code copying:
Code reusing.
Programmer limitation
Coincidental implementation using the same logic
6. Textual Similarity
Two individual source codes look similar based on their
textual content.
Textual content means the words, letters, variable
names, etc.
Three types: Type I, Type II, Type III.
7. Type I
The copied code fragment is the same as the original one,
with no modification except to white space, comments
and line layout.
int a; // counter
// count five times
for(a = 0; a < 5; a++)
{
    printf("a = %d", a); // print value of a
}
return 0;
8. Type I
int a;
/* Loop increasing of a and print a value of it */
for(a = 0; a < 5; a++){
    printf("a = %d", a);
}
return 0;
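One way to check for a Type I copy is to strip comments and white space from both fragments and compare what is left. The following is a minimal sketch, not a production normalizer: the names strip_noise and is_type1_clone and the 1024-byte buffers are assumptions made for illustration, only C-style comments are handled, and string literals are not treated specially.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy src into out, dropping comments and all white space.
   Simplification: string literals are not treated specially,
   and output longer than cap - 1 characters is truncated. */
void strip_noise(const char *src, char *out, size_t cap)
{
    size_t n = 0;
    if (cap == 0) return;
    while (*src != '\0' && n + 1 < cap) {
        if (src[0] == '/' && src[1] == '/') {          /* line comment */
            while (*src != '\0' && *src != '\n') src++;
        } else if (src[0] == '/' && src[1] == '*') {   /* block comment */
            src += 2;
            while (*src != '\0' && !(src[0] == '*' && src[1] == '/')) src++;
            if (*src != '\0') src += 2;
        } else if (isspace((unsigned char)*src)) {
            src++;
        } else {
            out[n++] = *src++;
        }
    }
    out[n] = '\0';
}

/* Returns 1 when the two fragments differ only in comments and white space. */
int is_type1_clone(const char *a, const char *b)
{
    char na[1024], nb[1024];
    strip_noise(a, na, sizeof na);
    strip_noise(b, nb, sizeof nb);
    return strcmp(na, nb) == 0;
}
```

Passing the two Type I fragments from these slides as strings to is_type1_clone returns 1, because both reduce to the same character sequence once comments and layout are removed.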
9. Type II
Same as Type I, with additional modifications to variable
names, function names and other user-defined identifiers.
if(a > b)
{
    a = a - 1;
    b = b * a; // comment 1
}
else
{
    b = a; // comment 2
    a = 0;
}
11. Type III
The copied code fragment is further modified by inserting
or removing statements.
if(a > b)
{
    a = a - 1;
    b = b * a;
}
else
{
    b = a;
    a = 0;
}
12. Type III
if(a > b)
{
    a = a - 1;
    c = 0; // this statement is added
    b = b * a;
}
else
{
    b = a;
    a = 0;
}
13. Functional similarity
It refers to code fragments that have the same semantics or
functionality.
fragment 1:
int i, j = 1;
for(i = 1; i <= VALUE; i++)
    j = j * i;
fragment 2:
int factorial(int n)
{
    if(n == 0) return 1;
    else return factorial(n - 1)*n;
}
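The two fragments share almost no text, yet they compute the same value. A rough way to see this is to wrap each one in a function and compare results over a range of inputs. This is only an illustrative sketch: the function names, the use of long, and the idea of treating VALUE as a parameter are assumptions, and agreement on sample inputs suggests functional similarity but does not prove it.

```c
/* Fragment 1 wrapped in a function; VALUE becomes the parameter n. */
long factorial_iterative(int n)
{
    long j = 1;
    int i;
    for (i = 1; i <= n; i++)
        j = j * i;
    return j;
}

/* Fragment 2, unchanged apart from the name and the wider return type. */
long factorial_recursive(int n)
{
    if (n == 0) return 1;
    else return factorial_recursive(n - 1) * n;
}

/* Returns 1 if both versions agree on every input in 0..max_n.
   Keep max_n at 12 or below so the result fits in a 32-bit long. */
int same_behaviour(int max_n)
{
    int n;
    for (n = 0; n <= max_n; n++)
        if (factorial_iterative(n) != factorial_recursive(n)) return 0;
    return 1;
}
```

Calling same_behaviour(12) returns 1, which is the kind of evidence a text-based comparison of the two fragments would miss.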
15. CONTD..
Text based
Find textual matches between two source codes.
Simple and fast.
Token based
Using a lexer to convert the program into tokens.
Find a match in token sequences.
More robust to simple text replacements.
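The token-based idea can be sketched with a very small lexer that keeps keywords and punctuation but replaces every identifier with ID and every number with NUM. Everything here is an assumption made for illustration: the function name tokenize, the short keyword list, the ID and NUM placeholders, and the decision to skip only // comments. A real tool would use a full lexer for the language.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A small, incomplete keyword list; a real lexer covers the whole language. */
static const char *const KEYWORDS[] = { "if", "else", "for", "while", "return", "int" };

static int is_keyword(const char *word)
{
    size_t i;
    for (i = 0; i < sizeof KEYWORDS / sizeof KEYWORDS[0]; i++)
        if (strcmp(word, KEYWORDS[i]) == 0) return 1;
    return 0;
}

/* Append text to out, truncating silently if cap would be exceeded. */
static void append(char *out, size_t cap, const char *text)
{
    size_t used = strlen(out);
    if (used + 1 < cap) strncat(out, text, cap - used - 1);
}

/* Write a space-separated token string for src into out. */
void tokenize(const char *src, char *out, size_t cap)
{
    if (cap == 0) return;
    out[0] = '\0';
    while (*src != '\0') {
        unsigned char c = (unsigned char)*src;
        if (isspace(c)) {
            src++;
        } else if (src[0] == '/' && src[1] == '/') {   /* skip line comment */
            while (*src != '\0' && *src != '\n') src++;
        } else if (isalpha(c) || c == '_') {           /* keyword or identifier */
            char word[64];
            size_t n = 0;
            while (isalnum((unsigned char)*src) || *src == '_') {
                if (n + 1 < sizeof word) word[n++] = *src;
                src++;
            }
            word[n] = '\0';
            append(out, cap, is_keyword(word) ? word : "ID");
            append(out, cap, " ");
        } else if (isdigit(c)) {                       /* integer literal */
            while (isdigit((unsigned char)*src)) src++;
            append(out, cap, "NUM ");
        } else {                                       /* one punctuation char */
            char punct[3];
            punct[0] = *src;
            punct[1] = ' ';
            punct[2] = '\0';
            append(out, cap, punct);
            src++;
        }
    }
}
```

With this sketch, "if(a > b) { a = a - 1; }" and "if(x > y) { x = x - 1; }" produce the same token string, so renaming variables no longer hides the match.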
16. CONTD…
Parse Trees
Build and compare parse trees.
Contain the complete information about the
source code.
Tree comparison can normalize conditional
statements.
Program Dependency Graphs (PDGs)
Capture the actual flow of control in a program.
Allow higher-level equivalences to be located.
More complex.
17. CONTD…
Metrics
Capture 'scores' of code segments according to
certain criteria.
Metrics are simple to calculate.
Can lead to false positives.
Hybrid
Combination of two or more of the previous
techniques.
18. Detecting Techniques
Detection via Lexical Similarities
The process of lexical analysis takes source code and
converts it into a stream of lexical tokens.
Source code undergoes a series of transformations.
Identification of reserved words, identifiers, and
numbers is beneficial for plagiarism detection.
22. Detection via Metrics
Calculate and compare attribute counts.
Programs with similar attribute counts are potentially
similar programs.
Counts of operators and operands are typically used to
construct attribute counts.
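A rough sketch of attribute counting follows. The struct layout, the function names, the small operator set, and the distance measure are all assumptions chosen for illustration. Each operator character is counted on its own, so <= counts as two, and every word or number counts as an operand, keywords included.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct counts {
    int operators;   /* characters from a small operator set */
    int operands;    /* words and numeric literals (keywords included) */
};

/* Count operator characters and operand words in src. */
struct counts count_attributes(const char *src)
{
    struct counts c = { 0, 0 };
    while (*src != '\0') {
        unsigned char ch = (unsigned char)*src;
        if (isalnum(ch) || ch == '_') {
            c.operands++;
            while (isalnum((unsigned char)*src) || *src == '_') src++;
        } else {
            if (strchr("+-*/%=<>!&|", ch) != NULL) c.operators++;
            src++;
        }
    }
    return c;
}

/* Sum of absolute differences; 0 means identical attribute counts. */
int attribute_distance(struct counts a, struct counts b)
{
    return abs(a.operators - b.operators) + abs(a.operands - b.operands);
}
```

For "a = a - 1; b = b * a;" and the renamed "x = x - 1; y = y * x;" the distance is 0. Two unrelated programs can also happen to share the same counts, which is how false positives arise with metrics.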
23. Tools used for code based plagiarism
JPlag
Finds similarities among multiple sets of source code files.
JPlag operates in two phases.
First phase: All programs to be compared are parsed and
converted into token strings.
Second phase: Token strings are compared in pairs for
determining the similarity of each pair.
Because it compares tokens, it is more robust to simple text
replacements. It supports Java, C#, C, C++ and natural
language text.
24. CONTD..
MOSS (Measure Of Software Similarity)
Developed in 1994 by Alex Aiken.
It analyzes code written in languages such as C, C++,
Python, Visual Basic, JavaScript, FORTRAN, Lisp and Ada.
It is provided as an Internet service that takes a list of
source files as input.
25. CONTD…
YAP (Yet Another Plague)
Token-based system.
YAP works in two phases.
The first phase generates a token file for each submission.
The second phase compares pairs of token files using the
token matching algorithm Running-Karp-Rabin
Greedy-String-Tiling (RKRGST).
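The core of Greedy String Tiling can be sketched without the Karp-Rabin hashing that RKRGST uses to speed up the search. This is a simplified illustration, not YAP's implementation: the function name, the MAX_TOKENS limit, the use of one char per token, and marking a single tile per pass are all assumptions.

```c
#include <string.h>

#define MAX_TOKENS 256

/* Greedy String Tiling without the Karp-Rabin hashing speed-up.
   Each char of a and b stands for one token. Returns the number of
   tokens of a covered by tiles of at least min_match tokens,
   or -1 if the input is too long or min_match is below 1. */
int greedy_string_tiling(const char *a, const char *b, int min_match)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    char marked_a[MAX_TOKENS] = { 0 }, marked_b[MAX_TOKENS] = { 0 };
    int covered = 0;

    if (la > MAX_TOKENS || lb > MAX_TOKENS || min_match < 1) return -1;

    for (;;) {
        int best = 0, best_i = 0, best_j = 0, i, j, k;
        /* Find the longest match made only of unmarked tokens. */
        for (i = 0; i < la; i++) {
            for (j = 0; j < lb; j++) {
                k = 0;
                while (i + k < la && j + k < lb && a[i + k] == b[j + k]
                       && !marked_a[i + k] && !marked_b[j + k])
                    k++;
                if (k > best) { best = k; best_i = i; best_j = j; }
            }
        }
        if (best < min_match) break;
        /* Mark the match as a tile so its tokens are not reused. */
        for (k = 0; k < best; k++) {
            marked_a[best_i + k] = 1;
            marked_b[best_j + k] = 1;
        }
        covered += best;
    }
    return covered;
}
```

For the token strings "ABCDEFG" and "EFGXABCD" with min_match set to 2, the sketch finds the tiles ABCD and EFG and returns 7, so reordering blocks of code does not hide the overlap.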
26. Conclusion
Plagiarism in programming assignments is an inevitable
issue for most academics teaching programming.
Plagiarism detection systems are built for only a few
languages.
Most detection software checks submissions against a
repository held within an organization.
As the number of digital copies is going up, the
repository must be large and the plagiarism
detection software must be able to handle it.
27. Conclusion
The most popular plagiarism detection algorithms create
token-string representations of programs and compare
them using string matching.
The tokens of each document are compared on a pair-wise
basis to determine similar source-code segments between
the files.
String-matching systems are language-dependent: they
support only the programming languages handled by
their parsers.