
An approach to source code plagiarism

Seminar on the IEEE paper titled "An Approach to Source Code Plagiarism Detection and Investigation Using Latent Semantic Analysis"

Published in: Education

  1. AN APPROACH TO SOURCE CODE PLAGIARISM DETECTION AND INVESTIGATION USING LATENT SEMANTIC ANALYSIS. Authors: Georgina Cosma and Mike Joy. Presented by Varsha Bhat K (1DS09CS105).
  2. INTRODUCTION. Source code plagiarism is the reuse of source code authored by someone else without adequately acknowledging the fact. It may occur intentionally or unintentionally. Detection tools are used by academics in higher education.
  3. CHALLENGES INVOLVED. Detecting similar file pairs. Investigating the similar source code fragments within the detected files. Determining whether the similarity is suspicious or innocent. Burden of proof: “Not only do we need to detect instances of plagiarism, we must also be able to demonstrate beyond reasonable doubt that those instances are not chance similarities.”
  4. EXISTING TOOLS. Categories of tools, as identified by Mozgovoy: fingerprint-based systems, string-matching systems, and parameterized matching systems.
  5. FINGERPRINT-BASED SYSTEMS. A fingerprint is created for each file. The fingerprint contains statistical information. Various metrics are used for detecting plagiarism, for example Halstead's metrics. An example of such a system is ITPAD.
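As an illustration of what a metric-based fingerprint might contain, the sketch below computes three standard Halstead measures (vocabulary, length, volume) from token lists. The function name `halstead_fingerprint` and the assumption that tokens arrive already split into operators and operands are hypothetical; this is not a description of how ITPAD builds its fingerprints.

```python
import math

def halstead_fingerprint(operators, operands):
    """Return a small fingerprint of Halstead measures.

    operators, operands: lists of token strings, already classified
    by the caller (an assumption made to keep the sketch short).
    """
    n1, n2 = len(set(operators)), len(set(operands))  # distinct counts
    N1, N2 = len(operators), len(operands)            # total counts
    vocabulary = n1 + n2
    length = N1 + N2
    # Halstead volume: V = N * log2(n); defined as 0 for empty input here.
    volume = length * math.log2(vocabulary) if vocabulary else 0.0
    return {"vocabulary": vocabulary, "length": length, "volume": volume}
```

Two files with near-identical fingerprints would then be flagged for closer inspection; how fingerprints are compared is left open here.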
  6. STRING MATCHING APPROACH. The steps involved are: first, the source code is tokenized; the source code is then written as a series of token strings; finally, the token strings are compared to check for similarity. Example tools are MOSS, YAP3, and JPlag.
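The tokenize-then-compare idea can be sketched as follows. The keyword set, the token names (`ID`, `NUM`), and the use of `difflib.SequenceMatcher` are assumptions made for illustration; MOSS, YAP3, and JPlag each use their own lexers and matching algorithms, which are not reproduced here.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny keyword set; a real tool uses a full lexer.
KEYWORDS = {"int", "for", "while", "if", "else", "return"}
# Identifiers/keywords, integer literals, or any single other symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")

def tokenize(source):
    """Map source text to a list of abstract token names."""
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme in KEYWORDS:
            tokens.append(lexeme.upper())
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")      # every identifier becomes the same token
        elif lexeme[0].isdigit():
            tokens.append("NUM")     # every number becomes the same token
        else:
            tokens.append(lexeme)    # punctuation and operators kept as-is
    return tokens

def token_similarity(source_a, source_b):
    """Similarity in [0, 1] between the token strings of two sources."""
    return SequenceMatcher(None, tokenize(source_a), tokenize(source_b)).ratio()
```

Because identifiers collapse to `ID`, renaming variables does not change the token string, which is why this family of tools is robust to simple renaming.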
  7. PARAMETERIZED MATCHING SYSTEMS. Detect identical and near-duplicate sections of source code. This is achieved by matching source code sections whose identifiers have been systematically substituted. Example: the DUP tool. INFORMATION RETRIEVAL METHODS. Represent a program as an indexed set of keywords, compute the frequency of these keywords, then compute the pairwise similarity. Example: PDetect.
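One simple way to picture parameterized matching is to rename each identifier by the order in which it first appears, so that two sections match only when identifiers were substituted consistently. The sketch below does this under stated assumptions: the keyword set, the placeholder names `p0`, `p1`, ..., and the functions `normalize` and `p_match` are hypothetical and do not describe DUP's actual implementation.

```python
import re

KEYWORDS = {"int", "for", "while", "if", "else", "return"}  # illustrative only
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")

def normalize(source):
    """Replace each identifier with a placeholder based on first appearance."""
    mapping = {}
    out = []
    for lexeme in TOKEN_RE.findall(source):
        is_word = lexeme[0].isalpha() or lexeme[0] == "_"
        if is_word and lexeme not in KEYWORDS:
            # The first new identifier becomes p0, the next p1, and so on.
            mapping.setdefault(lexeme, f"p{len(mapping)}")
            out.append(mapping[lexeme])
        else:
            out.append(lexeme)
    return out

def p_match(source_a, source_b):
    """True if the sections are identical up to consistent identifier renaming."""
    return normalize(source_a) == normalize(source_b)
```

For example, `x = y + x;` and `a = b + a;` match, while `a = b + b;` does not, because its identifiers were not substituted systematically.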
  8. PLAGATE. Detects similar source code files and investigates the similar code fragments within them. The aim of the investigation is to gather evidence for proving plagiarism by indicating the contribution levels of fragments. This enhances the detection performance of existing algorithms. PlaGate uses Latent Semantic Analysis to achieve this.
  9. LATENT SEMANTIC ANALYSIS. It is an information retrieval technique. The text collection is preprocessed and represented as a term-by-file matrix (entries a11, a12; a21, a22). A matrix transformation is applied. Singular value decomposition (SVD) is performed, which uncovers latent relationships. LSA derives the meaning of terms by approximating the structure of term usage among documents using SVD.
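The term-by-file matrix that LSA starts from can be sketched with raw term counts. The function name `term_by_file_matrix` and the input shape (a dict from file name to an already preprocessed list of terms) are assumptions for illustration; weighting and SVD are not applied in this sketch.

```python
from collections import Counter

def term_by_file_matrix(files):
    """Build a raw-count term-by-file matrix.

    files: dict mapping file name -> list of terms (already preprocessed).
    Returns (terms, names, matrix) where matrix[i][j] is the number of
    times terms[i] occurs in the file names[j].
    """
    names = sorted(files)
    counts = {name: Counter(files[name]) for name in names}
    terms = sorted(set().union(*(counts[name] for name in names)))
    matrix = [[counts[name][term] for name in names] for term in terms]
    return terms, names, matrix
```

Rows correspond to terms and columns to files, matching the a11, a12 / a21, a22 layout on the slide.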
  10. ADVANTAGES. Can detect transitive relationships, unlike traditional text retrieval systems. Helps reduce noise in the data. Overcomes problems of synonymy and polysemy. Changes to document structure do not affect detection. Language independent. DISADVANTAGES. Gives relatively high similarity values for non-copied programs as well.
  11. SIMILARITY IN SOURCE CODE FILES. Key factors for judging similarity between files are: the nature of the programming language and the problem; variance in solutions; supporting source code already given; assignment requirements. Fragments under investigation must not be short, simple, standard, trivial, of limited functionality, or frequently published.
  12. SIMILARITY CATEGORIES. Source code fragments contribute to varying degrees to the evidence for plagiarism, hence the need for a criterion for identifying contributions. Contribution levels: level 0, no contribution; level 1, low contribution; level 2, high contribution. Similarity levels: level 0, innocent; level 1, suspicious.
  13. PLAGATE SYSTEM. Aim: enhance the process of plagiarism detection and investigation. It is integrated with external detection tools as an enhancer. Components: the PlaGate Detection Tool (PGDT) and the PlaGate Query Tool (PGQT).
  14. FUNCTIONALITY
  15. SYSTEM REPRESENTATION. File corpus C, where C = {F1, F2, …, Fn}. Source code fragment s from a source code file F, where F ∈ C and F = {s1, s2, …, sp}. Set of source code fragments S. File length lf and source code fragment length ls.
  16. LSA PROCESS IN PLAGATE. The files are preprocessed. The corpus of files is transformed into an m x n matrix A = [a_ij]. Term-weighting algorithms are applied to set the value of each term in each file. SVD is performed on the weighted matrix A. The dimensionality is then reduced.
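The steps above correspond to the standard LSA formulation, written out below as a sketch. The decomposition of a weight into a local factor l_ij, a global factor g_i, and a file-length normalization factor n_j is a common convention and an assumption here; the slide's own weighting formula did not survive in this transcript.

```latex
\[
\begin{aligned}
A &= [a_{ij}] \in \mathbb{R}^{m \times n},
  & a_{ij} &= l_{ij}\, g_i\, n_j
  && \text{(weighted value of term } i \text{ in file } j\text{)} \\
A &= U \Sigma V^{T}
  &&&& \text{(singular value decomposition)} \\
A_k &= U_k \Sigma_k V_k^{T},
  & k &\ll \min(m, n)
  && \text{(rank-}k\text{ approximation after dimensionality reduction)}
\end{aligned}
\]
```

Here U_k and V_k keep only the first k columns of U and V, and Σ_k keeps the k largest singular values.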
  17. DETECTION AND CLASSIFICATION PROCESS IN PLAGATE. The PGQT component transforms the input file or fragment into a query vector q. The vector q is then projected onto the k-dimensional space, giving the projected query vector Q. The similarity between Q and every source code file in the corpus is then measured using a similarity measure; the cosine similarity measure is the most popular.
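A minimal sketch of the cosine comparison follows. It assumes the query and the files have already been projected into the reduced space (in standard LSA the projection is usually written Q = qᵀ U_k Σ_k⁻¹, which is an assumption here since the slide's formula is missing). The names `cosine_similarity` and `rank_files` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def rank_files(query_vec, file_vecs):
    """Rank files by similarity to the query, most similar first.

    file_vecs: dict mapping file name -> vector in the reduced space.
    """
    scores = {name: cosine_similarity(query_vec, vec)
              for name, vec in file_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Cosine similarity depends only on the direction of the vectors, so a long file and a short file with the same term profile score as highly similar.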
  18. EXPERIMENTATION. Four corpora consisting of Java source code files were used. The corpora were produced by undergraduate students at the University of Warwick. Students were given simple skeleton code to start with. The Data Sets
  19. PERFORMANCE EVALUATION MEASURES. sim(Fa, Fb) gives the similarity of two files and is computed using a similarity measure. Recall and Precision are the two most commonly used measures for information retrieval systems. A threshold Ø is selected; file pairs with sim(Fa, Fb) ≥ Ø are detected.
  20. Overall performance is evaluated by combining both measures into a single value F. The closer the value of F is to 1.00, the better the detection performance.
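The threshold-based evaluation can be sketched as below. It assumes F is the usual harmonic mean of Precision and Recall, since the slide's formula is missing from this transcript; the function name `evaluate` and the representation of file pairs as `frozenset` objects are likewise illustrative choices.

```python
def evaluate(similarities, relevant_pairs, threshold):
    """Compute (precision, recall, F) for a given similarity threshold.

    similarities: dict mapping frozenset({file_a, file_b}) -> sim value.
    relevant_pairs: set of frozensets known to be similar file pairs.
    """
    # A pair is "detected" when its similarity reaches the threshold.
    detected = {pair for pair, sim in similarities.items() if sim >= threshold}
    hits = detected & relevant_pairs
    precision = len(hits) / len(detected) if detected else 0.0
    recall = len(hits) / len(relevant_pairs) if relevant_pairs else 0.0
    # Assumed combination: harmonic mean of precision and recall.
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value
```

Lowering the threshold tends to detect more pairs, which can raise Recall while lowering Precision; F summarizes that trade-off in one number.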
  21. PLAGATE VS JPLAG AND SHERLOCK. Performance is evaluated when the tools function alone and when they are integrated with PlaGate. Lists of suspicious files are created.
  22. Results. Recall increases after integration with PGDT. This consistent increase indicates that PGDT and the external tools complement each other. A further increase is seen when both PGDT and PGQT are integrated, but at the cost of Precision.
  23. JPlag alone had high Precision and low Recall in all data sets. Sherlock and JPlag, both string-matching algorithms, vary significantly in detection performance.
  24. Similarity often occurs in groups containing more than two files. JPlag and Sherlock fail to parse some suspicious files due to local confusion. Because they are string-matching algorithms, they are affected by local confusion, which occurs when code segments shorter than the minimum match length have been shuffled within the files. PlaGate does not suffer from this kind of local confusion, as it does not depend on the structure of the code.
  25. Example
  26. CONCLUSION. An LSA-based technique for plagiarism detection and investigation that acts as an enhancer. It detects source code files missed by current plagiarism detection tools. Integration with PlaGate increases Recall at the cost of Precision. PlaGate classifies similarity into contribution levels. PlaGate is language independent. Unlike other tools, which find the similarity of two files, PlaGate finds the relative similarity.
  27. FUTURE WORK. Automating dimensionality reduction is still a problem. Source code fragments can be misclassified. PlaGate's behavior is not as stable as that of the string-matching algorithms.
