2. 244 S. Srivastava et al.
Plagiarism in coding is not a new phenomenon. Researchers have studied it before
to gauge the seriousness of the problem [1, 2]. Plagiarism in a programming
assignment covers not only the copying of source code: copied comments and input
data also count as plagiarism. Students plagiarize for many reasons; for example,
they are sometimes too lazy to write their own code. Plagiarism in coding is easy to
commit but hard to detect, since similar code is naturally used for the same
application. Students copy all or part of a program from one or more sources and
submit the copy as their own work. This includes students who work as a team and
submit similar work. Such plagiarism is believed to be common, even though the true
level of similarity is hard to assess. When a teacher in a programming course sets a
common problem, all students have to work on the same problem. Some students write
the source code on their own, while others simply obtain the code and change the
variable names, the order of statements, or the functions and variables of a class.
Such modifications of source code are difficult to catch. There are two categories of
source code modification: lexical changes and structural changes. Lexical changes
can be made without any programming knowledge. Structural changes require
knowledge of the programming language. Changing the number of iterations, the
conditional statements, or the order of statements, converting a procedure to a
function and vice versa, and adding comments are structural changes.
For the code in Fig. 1, a student can arrive at the same logic without ever having
seen this code. This is certainly not plagiarism. Such a scenario can be handled by
placing a constraint on the size of the matching code: for example, two codes are
treated as copied only if n consecutive lines are similar. We need a system that
calculates the percentage of similarity between two Java files. We propose a
plagiarism detection system, based on a novel normalization process, that assesses
the originality of a student's code by comparing the input code with the original
code. Teachers may use it to detect whether a student has committed plagiarism. The
plagiarism is estimated for a pair of Java files. If the percentage of plagiarism is
below a specified threshold, the input code is acceptable; otherwise it is not.
Fig. 1 Sample code
3. A Tool to Detect Plagiarism in Java Source Code 245
The rest of the paper is organized as follows. Section 2 reviews previous work on
plagiarism detection. Section 3 presents the proposed work. The results are discussed
in Sect. 4. Section 5 concludes the paper.
2 Related Work
Many researchers have proposed methods for plagiarism detection in text and
programming code [3, 4, 5, 6], while others have compared different plagiarism
detection tools [1, 2, 7]. Nurhayati and Busman [8] applied the Levenshtein Distance
(LD) algorithm to plagiarism detection in documents and developed software for
Android smartphones. The result of the LD algorithm is a string metric, which is one
way to measure the distance between two strings. In [9], the authors created an
application that uses the LD algorithm to identify similarity in Java codes. A
semantics-based technique for uncovering plagiarism between C++ and Java codes has
been proposed in [10]. It is a multimedia-based e-learning and smart assessment
method. The input code is transformed into tokens, which are compared semantically
token by token; the semantic similarity is then estimated for the whole input code.
Many similarity detection algorithms exist in the literature. Based on these
algorithms, researchers developed a similarity detection system referred to as SCSDS
[11]. SCSDS was slower than existing methods, and the fusion of various similarity
detection algorithms made its speed and performance even worse, so SCSDS required
improvement in both. In [12], the plagiarism detection system treated programs only
as text documents and gave no consideration to the syntactical structure of a formal
programming language. The authors normalized commonly used identifiers to detect
pairs of programs that have the same objective, and reported that removing these
normalized operations improves the system.
3 Proposed Method
The proposed system aims to estimate the plagiarism percentage of a given input
code. First, the user supplies the input code that is to be checked for plagiarism.
The already available codes, against which it is compared, are called original codes
here. The two codes are stored in separate variables. The code in each variable is
then converted to a form that is convenient for detecting plagiarism. This is done in
the normalization step. The following steps are performed to normalize the code:
• Removing white spaces
• Removing comments
• Removing all the keywords
• Removing all the operators
• Replacing all the identifiers with **identifier**
• Sorting.
Removing white spaces
White space is generally placed before and after operators to improve readability.
When code is copied from an online platform, users often change this spacing so that
the code does not look copied. The extra spaces are not needed, and they only
increase the length of the string. Any increase in length is reflected in the cost of
the LD algorithm, whose complexity is $O(n^2)$.
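A minimal sketch of this step, assuming it is implemented as a single
regular-expression replacement. The class and method names are illustrative and do
not come from the paper.

```java
final class WhitespaceStep {
    /** Deletes every whitespace character: spaces, tabs and line breaks. */
    static String removeWhitespace(String code) {
        return code.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String spaced = "int sum = a + b ;\n";
        String compact = removeWhitespace(spaced);
        // Shows the compacted string and how many characters were dropped.
        System.out.println(compact + " (" + (spaced.length() - compact.length())
                + " characters shorter)");
    }
}
```

Because the LD computation fills a table with one cell per pair of characters, every
character removed here saves a whole row or column of that table.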
Removing comments
Comments do not affect the functioning of the code; they are there only to help
readers understand long and complex code. We remove comments because someone can
add an extra comment or edit a copied one. Since the LD algorithm compares strings
character by character, such edits would affect the result of our plagiarism
detection tool. The following regular expression is used to detect the comments.
replaceAll("(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)", "")
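The sketch below applies the comment pattern to a small sample. It assumes the Java
string-literal form of the expression, in which each literal asterisk is escaped with
a doubled backslash; the class and method names are illustrative.

```java
final class CommentStep {
    // First alternative: block comments (/* ... */). Second: line comments (// ...).
    // Assumption: literal asterisks are escaped as \\* in the Java string.
    private static final String COMMENT_REGEX =
            "(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)";

    /** Returns the code with block and line comments deleted. */
    static String removeComments(String code) {
        return code.replaceAll(COMMENT_REGEX, "");
    }

    public static void main(String[] args) {
        String code = "int a = 1; // counter\n/* block\n comment */int b = 2;";
        System.out.println(removeComments(code));
    }
}
```

A purely textual pattern like this one also strips `//` sequences that occur inside
string literals, such as URLs, so it is an approximation rather than a parser.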
Removing all the keywords
This is the most significant step. It involves removing all the keywords that belong
to the language. Our proposal checks plagiarism only in Java code, so we remove all
Java keywords. We remove keywords because codes for the same program generally use
the same kinds of data types and built-in functions. These only increase the length
of the string, which again affects the $O(n^2)$ cost of the LD algorithm. So, to save
time and space, we remove the keywords. Sometimes a user who has understood the
copied code edits it to use different data types and functions, hoping to avoid
detection, although the code is still copied. Removing all the keywords helps in
obtaining the genuine similarity index.
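One possible way to carry out this step is sketched below. It assumes keywords are
matched as whole words, which requires white space to still be present when this step
runs; the paper does not state how keywords are matched. The keyword list is a
partial, illustrative subset of the Java reserved words, and all names are
hypothetical.

```java
import java.util.regex.Pattern;

final class KeywordStep {
    // Partial list for illustration only; the full set of Java keywords is longer.
    private static final Pattern KEYWORDS = Pattern.compile(
            "\\b(?:public|private|static|class|void|int|double|boolean"
            + "|if|else|for|while|return|new)\\b");

    /** Deletes every whole-word occurrence of a listed keyword. */
    static String removeKeywords(String code) {
        return KEYWORDS.matcher(code).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeKeywords("int total = 0;"));
        // The word boundary keeps names that merely start with a keyword.
        System.out.println(removeKeywords("interval = 5;"));
    }
}
```

The word-boundary anchors matter: without them, a name such as `interval` would lose
its leading `int`.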
Removing all the operators
Codes for the same program generally use the same types and the same number of
operators even when they are not copied. Operators only increase the time and space
complexity of the comparison. To avoid this, we remove all the operators.
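A minimal sketch of operator removal, assuming operators are deleted character by
character with one character class. Which symbols the paper counts as operators is
not stated; here separators such as parentheses, braces, semicolons and commas are
left in place, and the names are illustrative.

```java
final class OperatorStep {
    // Arithmetic, assignment, comparison, logical, bitwise and ternary symbols.
    // Assumption: separators ( ) { } [ ] ; , . are not treated as operators.
    private static final String OPERATOR_CHARS = "[-+*/%=<>!&|^~?:]";

    /** Deletes every operator character from the code. */
    static String removeOperators(String code) {
        return code.replaceAll(OPERATOR_CHARS, "");
    }

    public static void main(String[] args) {
        System.out.println(removeOperators("a+=b*c;"));
    }
}
```

This step should run after comment removal, since it would otherwise delete the
slashes and asterisks that delimit comments and leave the comment text behind.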
Replacing all the identifiers with **identifier**
Users generally change the names of the identifiers in a code to dodge plagiarism
detection. So, in both codes, that is, the original code and the code to be checked,
we replace every identifier with “**identifier**”.
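The sketch below illustrates the effect of this step. It assumes keywords have
already been removed, so that any remaining name-shaped token is an identifier, and
it uses the plain word `identifier` as the placeholder. The pattern and the names are
illustrative, not taken from the paper.

```java
import java.util.regex.Pattern;

final class IdentifierStep {
    static final String PLACEHOLDER = "identifier";

    // A letter or underscore followed by letters, digits or underscores.
    // The lookbehind skips suffixes of numeric literals such as the L in 10L.
    private static final Pattern NAME =
            Pattern.compile("(?<![A-Za-z0-9_])[A-Za-z_][A-Za-z0-9_]*");

    /** Replaces every identifier-shaped token with the placeholder. */
    static String replaceIdentifiers(String code) {
        return NAME.matcher(code).replaceAll(PLACEHOLDER);
    }

    public static void main(String[] args) {
        // Two renamed variants of the same statement normalize to the same string.
        System.out.println(replaceIdentifiers("sum = a + b;"));
        System.out.println(replaceIdentifiers("total = x + y;"));
    }
}
```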
Sorting
Both strings, the one containing the original code and the one containing the code to
be checked, are sorted alphabetically. A user can change the position of copied code
(a function, a class, etc.) and sometimes also the position of individual statements.
Sorting both strings allows plagiarism to be detected even when the copied code has
been moved. The sorted result is stored separately for the original code and for the
code to be checked. This completes the normalization step.
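The granularity of the sort is not specified. The sketch below assumes the characters
of each normalized string are sorted, which is the natural reading once white space
and line breaks have been removed; the names are illustrative.

```java
import java.util.Arrays;

final class SortStep {
    /** Returns the characters of the normalized string in ascending order. */
    static String sortCharacters(String normalized) {
        char[] chars = normalized.toCharArray();
        Arrays.sort(chars); // orders by UTF-16 code unit value
        return new String(chars);
    }

    public static void main(String[] args) {
        // Reordered statements produce the same sorted string.
        System.out.println(sortCharacters("a;b;").equals(sortCharacters("b;a;")));
    }
}
```

Under this assumption, any two strings built from the same multiset of characters
become identical, so the step trades ordering information for robustness against
rearranged code.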
After these steps, we have normalized code, which is again stored in a variable. We
then apply the LD algorithm [8] and store its result in a variable. Finally, we
calculate the plagiarized value from the result of the LD algorithm.
Levenshtein Algorithm
The LD algorithm [8] finds a distance that measures the dissimilarity between two
sequences. This distance is referred to as the Levenshtein distance or edit distance;
the term edit distance may also denote a larger family of distance metrics. Between
two words, it is the minimum number of single-character edits (insertions, deletions,
or substitutions) needed to change one word into the other.
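For reference, the distance can be computed with the standard dynamic-programming
scheme. The sketch below is a textbook two-row version, not the authors' code, and
the names are illustrative.

```java
final class Levenshtein {
    /** Minimum number of single-character insertions, deletions or substitutions. */
    static int distance(String source, String target) {
        int[] previous = new int[target.length() + 1];
        int[] current = new int[target.length() + 1];
        // Turning the empty prefix into the first j characters takes j insertions.
        for (int j = 0; j <= target.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= source.length(); i++) {
            current[0] = i; // i deletions reach the empty prefix
            for (int j = 1; j <= target.length(); j++) {
                int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.min(Math.min(deletion, insertion), substitution);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[target.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

The nested loops visit one cell per pair of characters, which is where the quadratic
cost comes from; keeping only two rows reduces the memory to linear in the target
length.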
Calculating Plagiarism
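After normalization, we have the normalized codes, in the form of strings, for both
the original code and the code to be checked. The string of the original code is
referred to as the source string (δ), and the string of the code to be checked as the
target string (ε). We feed these two strings to the LD algorithm. It returns a
numeric value that corresponds to the difference between the two strings. This is
called the LD distance ($\bar{d}$) and is defined as:

$$
\bar{d}_{\delta,\varepsilon}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0,\\[4pt]
\min
\begin{cases}
\bar{d}_{\delta,\varepsilon}(i-1,j) + 1\\
\bar{d}_{\delta,\varepsilon}(i,j-1) + 1\\
\bar{d}_{\delta,\varepsilon}(i-1,j-1) + \mathbb{1}_{(\delta_i \neq \varepsilon_j)}
\end{cases}
& \text{otherwise,}
\end{cases}
\tag{1}
$$

with $\bar{d} = \bar{d}_{\delta,\varepsilon}(|\delta|, |\varepsilon|)$, where
$\delta_i$ and $\varepsilon_j$ are the $i$-th and $j$-th characters of the two
strings. Now, using the plagiarized value formula, we can calculate the plagiarism
between these two strings. The plagiarized value ($p$) can be calculated as:

$$
p = \left(1 - \frac{\bar{d}}{\max(|\delta|, |\varepsilon|)}\right) \times 100
\tag{2}
$$

where $\bar{d}$ is the LD distance, δ represents the original code, ε is the code to
be checked for plagiarism, and $\max(|\delta|, |\varepsilon|)$ is the larger of the
lengths of δ and ε. Figure 2 shows the working of the proposed plagiarism detection
system.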
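As a hedged illustration, the sketch below turns an LD distance into a percentage,
assuming the plagiarized value is the normalized similarity (1 - d / max) x 100,
which is consistent with the symbols defined above. The threshold check mirrors the
acceptance rule described in the introduction; the threshold value and all names are
hypothetical.

```java
final class PlagiarizedValue {
    /**
     * Assumed form: (1 - distance / max(sourceLength, targetLength)) * 100.
     * The distance is expected to come from the LD algorithm.
     */
    static double percent(int distance, int sourceLength, int targetLength) {
        int longest = Math.max(sourceLength, targetLength);
        if (longest == 0) {
            return 100.0; // two empty strings are identical
        }
        return (1.0 - (double) distance / longest) * 100.0;
    }

    /** The input code is acceptable when its value is below the threshold. */
    static boolean isAcceptable(double percent, double threshold) {
        return percent < threshold;
    }

    public static void main(String[] args) {
        // Distance 3 between strings of length 6 and 7 (e.g. kitten and sitting).
        double value = percent(3, 6, 7);
        System.out.printf("%.2f%%%n", value);
        System.out.println(isAcceptable(value, 60.0)); // 60.0 is a made-up threshold
    }
}
```

Dividing by the longer length keeps the value between 0 and 100, because the LD
distance can never exceed the length of the longer string.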
Fig. 2 Framework of the proposed plagiarism detection system
4 Results and Findings
To estimate the plagiarism percentage of a given input code, the user first supplies
the input code to be checked together with the original code. Figures 3 and 4 show
samples of the original code and the code to be checked, respectively. Both codes are
passed through the normalization step, which yields normalized code. The LD algorithm
[8] is then applied to the normalized code, and the plagiarized value is estimated
from its result. Figure 5 shows the user interface of the proposed system. Figure 6
shows the interface after the codes have been entered in the specified areas. Figure 7
shows the plagiarism estimated after clicking the check fraud button. From Fig. 8, it
can be observed that standard plagiarism detection software is not suitable for
assessing the originality of Java code. Programmers writing in the same language use
the same common keywords, so merely detecting identical words is not the correct
criterion for investigating the originality of source code. As can be seen from
Figs. 7 and 8, the standard software (Turnitin) gives a similarity index of 78%,
whereas the proposed system gives a similarity index of 51% for the same code.
Table 1 compares the similarity indexes calculated by the proposed method and the
standard software; the same comparison is shown in Fig. 9. Thus, it can be stated
that the proposed system is more suitable than other software for originality
detection of Java source code.
Fig. 3 Sample original code
Fig. 4 Sample code to be checked
Fig. 5 User interface
Fig. 6 After filling both the text areas accordingly
Fig. 7 After clicking on check fraud
Fig. 8 Plagiarism report of a standard plagiarism detection software
Table 1 Comparison of similarity indexes of proposed system and existing software
Input     Similarity index (proposed system) (%)     Similarity index (existing software) (%)
Code 1    51.85                                      78
Code 2    54.76                                      80
Code 3    57.29                                      83
Code 4    53.26                                      81
Fig. 9 Comparison of similarity indexes of proposed system and existing software
5 Conclusion
We have proposed a tool that can be used efficiently to check whether an input Java
code is plagiarized. To carry out plagiarism detection, the code is first preprocessed
through normalization. Normalization consists of several steps: removing white
spaces, removing comments, removing all the keywords, removing all the operators,
replacing all the identifiers with **identifier**, and sorting. The normalized code
is then fed into the LD algorithm to obtain the LD distance, and the value returned
by the LD algorithm is used to calculate the plagiarized value. The proposed tool
currently works only on Java source code; it could be extended to other programming
languages. The plagiarized value has been calculated for four codes with both the
proposed system and the existing system. From the results, it can be concluded that
the proposed system is more suitable than the existing system for originality
detection of Java source code.
References
1. Foltýnek T, Meuschke N, Gipp B (2019) Academic plagiarism detection: a
systematic literature review. ACM Comput Surv (CSUR) 52(6):1–42
2. Naik RR, Landge MB, Mahender CN (2015) A review on plagiarism detection tools. Int J
Comput Appl 125(11)
3. Ghanem B, Arafeh L, Rosso P, Sánchez-Vega F (2018) HYPLAG: hybrid Arabic text
plagiarism detection system. In: International conference on applications of natural
language to information systems. Springer, Cham, pp 315–323
4. Jadalla A, Elnagar A (2008) PDE4Java: plagiarism detection engine for java, source
code: a clustering approach. IJBIDM 3(2):121–135
5. Alzahrani SM, Salim N, Abraham A (2011) Understanding plagiarism linguistic patterns,
textual features, and detection methods. IEEE Trans Syst Man Cybern Part C (Appl Rev)
42(2):133–149
6. Sulistiani L, Karnalim O (2019) ES-Plag: efficient and sensitive source code plagiarism
detection tool for academic environment. Comput Appl Eng Educ 27(1):166–182
7. Ali AM, Abdulla HM, Snasel V (2011) Overview and comparison of plagiarism detection
tools. In: DATESO, pp 161–172
8. Nurhayati B, Busman B (2017) Development of document plagiarism detection software using
levensthein distance algorithm on Android smartphone. In: 2017 5th International conference
on cyber and IT service management (CITSM), pp 1–6
9. Liaqat AG, Ahmad A (2011) Plagiarism detection in java code
10. Ullah F, Wang J, Farhan M, Jabbar S, Wu Z, Khalid S (2018) Plagiarism detection in students’
programming assignments based on semantics: multimedia e-learning based smart assessment
methodology. In: Multimedia tools and applications, pp 1–18
11. Ðurić Z, Gašević D (2013) A source code similarity system for plagiarism detection.
Comput J 56(1):70–86
12. Heblikar S, Sharma P, Munnangi M, Bankapur C (2015) Normalization based stop-word
approach to source code plagiarism detection. In: FIRE workshops, pp 6–9