Plagiarism introduction
Presentation Transcript

  • Guide: Ms Sangeetha Jamal, Dept of Computer Science. Presented by: Merin Paul, MTech CS-IS S1. 9/25/2012
  • Contents
    - Introduction
    - Types of Source-code Plagiarism
        - Textual Similarity
        - Functional Similarity
    - Source Code Detection Algorithms
    - Detecting Techniques
    - Tools used for code-based plagiarism
    - Conclusion
  • Introduction Plagiarism in source-code files occurs when source-code is copied and edited without proper acknowledgment of the original author. Techniques for plagiarism: Lexical changes and structural changes. Lexical changes: changes that can be done to the source- code without affecting the parsing of the program9/25/2012 3
  • Introduction Structural changes: changes made to the source code that will affect the parsing of the code and involve program debugging. Reasons for code copying: Code reusing. Programmer limitation Coincidentally implement using the same logic9/25/2012 4
  • Types of Source Code Plagiarism
    - Textual Similarity
    - Functional Similarity
  • Textual Similarity
    - Two individual source codes look similar based on their textual content.
    - Textual content means the words, letters, variable names, etc.
    - Three types: Type I, Type II, Type III.
  • Type I
    The copied code fragment is the same as the original, with no modifications other than changes to white space, comments and line layout.
        int a; // counter
        // count five times
        for(a = 0; a < 5; a++) {
            printf("a = %d", a); // print value of a
        }
        return 0;
  • Type I (contd.)
        int a;
        /* Loop increasing of a and print a value of it */
        for(a = 0; a < 5; a++){
            printf("a = %d", a);
        }
        return 0;
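
The two Type I fragments differ only in comments and layout, so removing both should make them identical. A minimal sketch of that idea, assuming C-style comments and ignoring the corner case of comment markers inside string literals; the helper name `normalize_type1` is hypothetical and not taken from any tool named in these slides.

```python
import re

def normalize_type1(source):
    """Remove C-style comments and all whitespace (illustrative only)."""
    # Drop /* ... */ block comments, including multi-line ones.
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    # Drop // line comments up to the end of the line.
    source = re.sub(r"//[^\n]*", "", source)
    # Crude: delete every whitespace character so layout no longer matters.
    return re.sub(r"\s+", "", source)

original = '''
int a; // counter
// count five times
for(a = 0; a < 5; a++) {
    printf("a = %d", a); // print value of a
}
return 0;
'''

copy = '''
int a;
/* Loop increasing of a and print a value of it */
for(a = 0; a < 5; a++){
    printf("a = %d", a);
}
return 0;
'''

# The fragments compare equal once comments and layout are removed.
print(normalize_type1(original) == normalize_type1(copy))
```

Deleting all whitespace also merges tokens such as `int a` into `inta`; that is acceptable for an equality check but a real tool would tokenize instead.
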
  • Type II
    Same as Type I, but also with modifications to variable names, function names and other user-defined identifiers.
        if(a > b)
        {
            a = a - 1;
            b = b * a; // comment 1
        }
        else
        {
            b = a; // comment 2
            a = 0;
        }
  • Type II (contd.)
        if(m > n)
        {m=m - 5;
        n=n*m; //my comment 1
        }
        else
        {n=m; //my comment 2
        m=0;
        }
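
Type II copies survive comment and whitespace removal, but they collapse onto the original once identifiers are renamed consistently. The sketch below is one way to do that, under stated assumptions: identifiers are replaced by placeholders in order of first appearance, integer literals by `NUM` (the two fragments above also differ in a literal), and the keyword set is a small illustrative subset of C. The name `normalize_type2` is hypothetical.

```python
import re

# Illustrative subset of C keywords; a real tool would use the full list.
C_KEYWORDS = {"if", "else", "for", "while", "return", "int"}
# Identifiers/keywords, integer literals, or any other single character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def normalize_type2(source):
    """Return the token list with identifiers and numbers abstracted away."""
    source = re.sub(r"//[^\n]*", "", source)  # strip line comments first
    names = {}   # identifier -> placeholder, in order of first appearance
    out = []
    for tok in TOKEN_RE.findall(source):
        if tok in C_KEYWORDS:
            out.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            names.setdefault(tok, "id%d" % len(names))
            out.append(names[tok])
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return out

original = """
if(a > b) {
    a = a - 1;
    b = b * a; // comment 1
} else {
    b = a; // comment 2
    a = 0;
}
"""

copy = """
if(m > n)
{m=m - 5;
n=n*m; //my comment 1
}
else
{n=m; //my comment 2
m=0;
}
"""

print(normalize_type2(original) == normalize_type2(copy))
```

Because placeholders follow first appearance, a copy that swaps the roles of two variables inconsistently would no longer match, which is the intended behaviour for a renaming check.
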
  • Type III
    The copied code fragment is further modified by inserting or removing statements.
        if(a > b)
        {
            a = a - 1;
            b = b * a;
        }
        else
        {
            b = a;
            a = 0;
        }
  • Type III (contd.)
        if(a > b)
        {
            a = a - 1;
            c = 0; // this statement is added
            b = b * a;
        }
        else
        {
            b = a;
            a = 0;
        }
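
For Type III copies an exact comparison fails, so a similarity score over statements is more useful than equality. A minimal sketch using Python's standard `difflib.SequenceMatcher` on comment-free lines; this is an illustrative stand-in, not the method of any tool named in these slides, and the helper `statements` is hypothetical.

```python
import difflib
import re

def statements(source):
    """Split a snippet into stripped, comment-free, non-empty lines."""
    lines = []
    for line in source.splitlines():
        line = re.sub(r"//.*", "", line).strip()
        if line:
            lines.append(line)
    return lines

original = """
if(a > b)
{
    a = a - 1;
    b = b * a;
}
else
{
    b = a;
    a = 0;
}
"""

copy = """
if(a > b)
{
    a = a - 1;
    c = 0; // this statement is added
    b = b * a;
}
else
{
    b = a;
    a = 0;
}
"""

old_lines, new_lines = statements(original), statements(copy)
matcher = difflib.SequenceMatcher(None, old_lines, new_lines)

# Collect the lines that exist only in the copy.
added = []
for tag, _, _, j1, j2 in matcher.get_opcodes():
    if tag in ("insert", "replace"):
        added.extend(new_lines[j1:j2])

print(added)            # statements present only in the copy
print(matcher.ratio())  # similarity in [0, 1]; close to 1 for near-copies
```
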
  • Functional similarity
    It refers to code fragments that have the same semantics or functionality.
    Fragment 1:
        int i, j = 1;
        for(i = 1; i <= VALUE; i++)
            j = j * i;
    Fragment 2:
        int factorial(int n)
        {
            if(n == 0) return 1;
            else return factorial(n - 1)*n;
        }
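
The two fragments share almost no text, yet both compute a factorial. One hedged way to illustrate functional similarity is to run both on sample inputs and compare the results. The Python ports below and the helper `same_behaviour` are assumptions made for illustration; the slides do not prescribe this check.

```python
def factorial_iterative(value):
    # Port of fragment 1: multiply j by every i from 1 to VALUE.
    j = 1
    for i in range(1, value + 1):
        j = j * i
    return j

def factorial_recursive(n):
    # Port of fragment 2: recursive definition with base case n == 0.
    if n == 0:
        return 1
    return factorial_recursive(n - 1) * n

def same_behaviour(f, g, inputs):
    """True if f and g agree on every sample input."""
    return all(f(x) == g(x) for x in inputs)

print(same_behaviour(factorial_iterative, factorial_recursive, range(10)))
```

Agreement on sample inputs is evidence of functional similarity, not proof of it.
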
  • Source Code Detection Algorithms
    - Text-based
    - Token-based
    - Parse tree-based
    - PDG-based
    - Metrics-based
    - Hybrid approaches
  • Source Code Detection Algorithms (contd.)
    - Text-based: finds textual matches between two source codes. Simple and fast.
    - Token-based: uses a lexer to convert the program into tokens, then finds matches between the token sequences. More robust to simple text replacements.
  • Source Code Detection Algorithms (contd.)
    - Parse trees: build and compare parse trees. A parse tree contains the complete information about the source code. Tree comparison can normalize conditional statements.
    - Program Dependence Graphs (PDGs): capture the control flow and data dependences in a program. They allow higher-level equivalences to be located, but are more complex.
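
A small sketch of parse-tree comparison. Assumption, stated loudly: the code being compared here is Python rather than C or Java, because Python's standard `ast` module gives a parser with no third-party dependency. Names are blanked out before comparing, so only the tree shape matters. The class `_Anonymize` and the function `tree_shape` are hypothetical names.

```python
import ast

class _Anonymize(ast.NodeTransformer):
    """Replace every user-chosen name with '_' so only structure remains."""

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)  # keep descending into the body
        return node

def tree_shape(source):
    """Dump the anonymized syntax tree as a string (no line numbers)."""
    tree = _Anonymize().visit(ast.parse(source))
    return ast.dump(tree)

first = """
def bump(values):
    for i in range(len(values)):
        values[i] = values[i] + 1
"""

second = """
def inc(items):
    for j in range(len(items)):
        items[j] = items[j] + 1
"""

different = """
def inc(items):
    j = 0
    while j < len(items):
        items[j] += 1
        j += 1
"""

print(tree_shape(first) == tree_shape(second))     # same structure
print(tree_shape(first) == tree_shape(different))  # different structure
```
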
  • Source Code Detection Algorithms (contd.)
    - Metrics: capture scores of code segments according to certain criteria. Metrics are simple to calculate, but can lead to false positives.
    - Hybrid: a combination of two or more of the previous techniques.
  • Detecting Techniques: Detection via Lexical Similarities
    - Lexical analysis takes source code and converts it into a stream of lexical tokens.
    - The source code undergoes a series of transformations.
    - Identifying reserved words, identifiers and numbers is beneficial for plagiarism detection.
  • Detection via Lexical Similarities (contd.)
    Fragment 1:
        int[] A = {1,2,3,4};
        for(int i = 0; i < A.length; i++) {
            A[i] = A[i] + 1;
        }
    Fragment 2:
        int[] B = {1, 2, 3, 4};
        for(int j = 0; j < B.length; j++) {
            B[j] = B[j] + 1;
        }
  • Detection via Lexical Similarities (contd.)
        LITERAL_int LBRACK RBRACK IDENT ASSIGN LCURLY NUM_INT COMMA NUM_INT COMMA NUM_INT COMMA NUM_INT RCURLY SEMI
        LITERAL_for LPAREN LITERAL_int IDENT ASSIGN NUM_INT SEMI IDENT LT IDENT DOT IDENT SEMI IDENT INC RPAREN LCURLY
        NUM_INT SEMI RCURLY
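
The lexical step can be sketched with a small regex-based lexer that maps the two Java fragments to token names. The token names follow the slide where it shows them; `PLUS` is an assumed name, since the slide's stream does not show a token for `+`. The lexer covers only the symbols these fragments use and silently skips anything else.

```python
import re

# (token name, pattern); earlier entries win, so INC is tried before PLUS.
TOKEN_SPEC = [
    ("NUM_INT", r"\d+"),
    ("WORD", r"[A-Za-z_]\w*"),
    ("INC", r"\+\+"),
    ("PLUS", r"\+"),      # assumed name, not shown on the slide
    ("ASSIGN", r"="),
    ("LT", r"<"),
    ("LBRACK", r"\["), ("RBRACK", r"\]"),
    ("LCURLY", r"\{"), ("RCURLY", r"\}"),
    ("LPAREN", r"\("), ("RPAREN", r"\)"),
    ("COMMA", r","), ("SEMI", r";"), ("DOT", r"\."),
    ("SKIP", r"\s+"),
]
LEXER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))
KEYWORDS = {"int", "for"}  # only the keywords these fragments use

def tokenize(source):
    """Return the list of token names for a source fragment."""
    tokens = []
    for match in LEXER.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "WORD":
            word = match.group()
            kind = ("LITERAL_" + word) if word in KEYWORDS else "IDENT"
        tokens.append(kind)
    return tokens

fragment_1 = "int[] A = {1,2,3,4}; for(int i = 0; i < A.length; i++) { A[i] = A[i] + 1; }"
fragment_2 = "int[] B = {1, 2, 3, 4}; for(int j = 0; j < B.length; j++) { B[j] = B[j] + 1; }"

print(" ".join(tokenize(fragment_1)))
print(tokenize(fragment_1) == tokenize(fragment_2))
```

Since identifiers and literals are reduced to their token class, renaming `A` to `B` or `i` to `j` leaves the stream unchanged.
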
  • Detection via Parse Tree Similarities
  • Detection via Metrics
    - Calculate and compare attribute counts.
    - Programs with similar attribute counts are potentially similar programs.
    - Counts of operators and operands are typically used to construct attribute counts.
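
A minimal sketch of attribute counting, assuming four counts per program: total and distinct operators, and total and distinct operands. The operator set and the function name `attribute_counts` are illustrative choices, not the definition used by any particular tool.

```python
import re
from collections import Counter

KEYWORDS = {"if", "else", "for", "while", "return", "int"}
# Identifiers, integer literals, then two-character and one-character operators.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|[<>=!]=|[-+*/%<>=]")

def attribute_counts(source):
    """Return (total operators, distinct operators, total operands, distinct operands)."""
    source = re.sub(r"//[^\n]*", "", source)  # comments must not be counted
    operators, operands = Counter(), Counter()
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            continue
        if tok[0].isalnum() or tok[0] == "_":
            operands[tok] += 1
        else:
            operators[tok] += 1
    return (sum(operators.values()), len(operators),
            sum(operands.values()), len(operands))

original = "if(a > b) { a = a - 1; b = b * a; } else { b = a; a = 0; }"
renamed = "if(m > n) {m=m - 5; n=n*m; } else {n=m; m=0; }"

print(attribute_counts(original))
print(attribute_counts(original) == attribute_counts(renamed))
```

Equal counts only flag a pair for closer inspection: unrelated programs can share the same counts, which is how metrics produce false positives.
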
  • Tools used for code-based plagiarism: JPlag
    - Finds similarities among multiple sets of source code files.
    - JPlag operates in two phases.
    - First phase: all programs to be compared are parsed and converted into token strings.
    - Second phase: token strings are compared in pairs to determine the similarity of each pair.
    - Being token-based, it is robust to simple text replacements.
    - It supports Java, C#, C, C++ and natural language text.
  • CONTD..MOSS (Measure Of Software Similarity) Measure Of Software Similarity was developed in 1994 by Alex Aiken. It analyzes code written in languages like C, C++, Python, Visual Basic, Javascript, FORTRAN, Lisp, Ada etc. Provided as an internet service and given a list of source files.9/25/2012 24
  • Tools (contd.): YAP (Yet Another Plague)
    - Token-based system.
    - YAP works in two phases.
    - The first phase generates a token file for each submission.
    - The second phase compares pairs of token files using a token matching algorithm, the Running-Karp-Rabin Greedy-String-Tiling algorithm (RKRGST).
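
The pairwise comparison phase can be sketched with a plain Greedy String Tiling loop. This is a simplified version written for illustration: it omits the Karp-Rabin hashing that RKRGST uses to speed up the search, the single letters stand in for tokens, and the `min_match` default and the similarity formula (twice the covered tokens over the combined length) are assumptions rather than settings taken from YAP or JPlag.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Return tiles (start_a, start_b, length) of non-overlapping common runs."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        best = min_match - 1   # only runs of at least min_match tokens count
        matches = []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                # Extend the run while tokens agree and are not yet tiled.
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > best:
                    best, matches = k, [(i, j, k)]
                elif k == best and matches:
                    matches.append((i, j, k))
        if not matches:
            break
        for i, j, k in matches:
            # Skip a match that overlaps a tile placed earlier in this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for t in range(k):
                marked_a[i + t] = marked_b[j + t] = True
            tiles.append((i, j, k))
    return tiles

def similarity(a, b, tiles):
    """Share of tokens covered by tiles, counted over both sequences."""
    covered = sum(length for _, _, length in tiles)
    return 2 * covered / (len(a) + len(b))

tokens_a = list("ABCDEFG")
tokens_b = list("XABCYEFG")
tiles = greedy_string_tiling(tokens_a, tokens_b)
print(tiles)
print(similarity(tokens_a, tokens_b, tiles))
```

Raising `min_match` suppresses short accidental matches at the cost of missing small copied runs.
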
  • Conclusion
    - Plagiarism in programming assignments is an inevitable issue for most academics teaching programming.
    - Plagiarism detection systems are built to support only a few languages.
    - Most detection software checks submissions against a repository maintained within an organization.
    - As the number of digital copies goes up, the repository must be large, and the plagiarism detection software must be able to handle it.
  • Conclusion (contd.)
    - Most popular plagiarism detection algorithms convert programs into token strings and compare them using string matching.
    - The tokens of each document are compared on a pair-wise basis to determine similar source-code segments between the files.
    - String-matching systems are language-dependent: they handle only the programming languages supported by their parsers.
  • Thank you!