
Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations


Presentation of our full paper
"Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations"
at the Joint Conference on Digital Libraries (JCDL) 2019, held in Urbana-Champaign, IL, USA, June 3-5, 2019.

Pre-print of the paper: https://bit.ly/2wAyakz
Data, Code, System Prototype: https://purl.org/hybridPD


  1. Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations
     Norman Meuschke, Vincent Stange, Moritz Schubotz, Michael Kramer, Bela Gipp
  2. Outline
     1. Problem: Detecting academic plagiarism in math-heavy STEM disciplines
     2. Methodology: Combined analysis of math-based and citation-based features for confirmed cases of plagiarism, and exploratory search for unknown cases in arXiv documents
     3. Results: Math-based and citation-based methods are a valuable complement to text-based methods and can identify so far undiscovered cases of academic plagiarism in math-heavy STEM documents
  3. Problem
  4. Academic Plagiarism
     "The use of ideas, concepts, words, or structures without appropriately acknowledging the source to benefit in a setting where originality is expected."
     Source: Teddi Fishman. 2009. "'We know it when we see it'? is not good enough: Toward a standard definition of plagiarism that transcends theft, fraud, and copyright." In Proc. Asia Pacific Conf. on Educational Integrity.
  5. Problem Summary
     • Current plagiarism detection systems:
       – Perform sophisticated text analysis
       – Find copy & paste plagiarism typical for students
       – Miss disguised plagiarism frequent among researchers
     • Our prior research:
       – Analyzing citation patterns [1]
       – Analyzing image similarity [2]
     [Figure: comparison of in-text citation patterns between two documents (Doc A, Doc B) and analysis of image similarity using perceptual hashing.]
     [1] B. Gipp and N. Meuschke, "Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence," in Proc. ACM Symp. on Document Engin. (DocEng), 2011.
     [2] N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, "An Adaptive Image-based Plagiarism Detection Approach," in Proc. Joint Conf. on Digital Libraries (JCDL), 2018.
  6. Problem Summary Cont.
     • Non-textual feature analysis (in-text citations, images) achieves good detection effectiveness for disguised plagiarism
     • Papers in math-heavy disciplines:
       – Mix natural language and mathematical content
       – Cite comparatively fewer sources than other STEM disciplines
       – Use figures sparsely
     • Text-based, citation-based, and image-based methods are less effective for math-heavy disciplines
     [Figure: plagiarized engineering paper (original: left, plagiarized: right; matching mathematical content: yellow, matching text of 10+ characters: blue).]
  7. Methodology
  8. Approach
     • Combined analysis of math-based and citation-based similarity
     • Pilot study [3]:
       – All-to-all comparison of documents and document partitions (Mathosphere framework)
       – Computation: relative distance of frequency histograms of feature occurrences (identifiers c_i, numbers c_n, operators c_o, and a feature combination)
       – Formulae from: 10 plagiarized documents, 10 source documents, 105,120 arXiv documents
     [Figure: feature frequency histograms of two example documents compared per feature class.]
     [3] N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, "Analyzing Mathematical Content to Detect Academic Plagiarism," in Proc. Int. Conf. on Information and Knowledge Management (CIKM), 2017.
  9. Dataset
     • Ten confirmed cases (plagiarized document and its source) embedded in the NTCIR-11 MathIR Task Dataset (105,120 arXiv docs., 60 M formulae)
     [Figure: compilation of the test collection. Confirmed cases of plagiarism retrieved from Retraction Watch and the VroniPlag Wiki, file conversion & cleaning (InftyReader, LaTeXML), and expert inspection to create the ground truth; arXiv documents selected from the NTCIR-11 MathIR Task Dataset (ntcir-math.nii.ac.jp), file conversion & cleaning with LaTeXML to (X)HTML5, and provision for research.]
  10. Identifier Frequency Histograms (Histo)
      • Order-agnostic bag of identifiers
      • Similarity = relative difference in occurrence frequency; Δf = 0 → s = 1 (see the sketch after this slide)
      [Figure: identifier frequency histograms of two short example formulae (Doc 1, Doc 2) sharing the identifiers x and r.]
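To make the Histo comparison concrete, here is a minimal Python sketch of an order-agnostic histogram similarity. The per-identifier agreement formula and the function name are illustrative assumptions; the paper's implementation may normalize differently.

```python
from collections import Counter

def histo_similarity(ids_a, ids_b):
    """Illustrative Histo-style measure: compare order-agnostic bags of
    identifiers via the relative difference of their occurrence frequencies.
    Identical histograms give s = 1 (i.e., delta_f = 0 -> s = 1)."""
    hist_a, hist_b = Counter(ids_a), Counter(ids_b)
    identifiers = set(hist_a) | set(hist_b)
    if not identifiers:
        return 0.0
    agreement = 0.0
    for ident in identifiers:
        f_a, f_b = hist_a[ident], hist_b[ident]
        agreement += 1.0 - abs(f_a - f_b) / max(f_a, f_b)
    return agreement / len(identifiers)

# Two formulae with identical identifier histograms score 1.0
print(histo_similarity(["x", "x", "r"], ["x", "x", "r"]))  # 1.0
```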
  11. Approach
      • Combined analysis of math-based and citation-based similarity
      • Pilot study [3]:
        – MRR = 0.86 (identifiers, document level)
        – MRR = 0.70 (combined features, partition level)
      [Figure: same feature-histogram comparison as on slide 8.]
      [3] N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, "Analyzing Mathematical Content to Detect Academic Plagiarism," in Proc. Int. Conf. on Information and Knowledge Management (CIKM), 2017.
  12. Contributions of this Study
      1. Two-step retrieval process
         – Candidate retrieval (CR)
         – Detailed analysis (DA)
      2. New math-based similarity measures
         – Candidate retrieval: adaptation of Lucene's scoring function to math features
         – Detailed analysis: order-considering similarity measures
      3. Combined analysis with citation-based methods
         – Performed well for disguised plagiarism in our prior research
      4. Exploratory study in 102,524 arXiv documents
         – Search for undiscovered cases of plagiarism
  13. Candidate Retrieval
      • Dataset same as in the pilot study
        – Ten confirmed cases (plagiarized document and its source) embedded in the NTCIR-11 MathIR Task Dataset (105,120 arXiv docs., 60 M formulae)
      • Lucene's Practical Scoring Function
        – Combination of tf-idf vector space and Boolean retrieval model
      • Features:
        – Mathematical identifiers (boost = # of occurrences in the document)
        – In-text citations
        – Terms (default)
      • Retrieve the 100 top-ranked documents for each query as candidates (see the sketch after this slide)
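The prototype uses Lucene (a Java library); the Python snippet below is only a rough sketch of the tf-idf-with-boosting idea behind candidate retrieval, restricted to mathematical identifiers. The function name, the corpus layout, and the simplified scoring formula are assumptions, not the prototype's actual code or Lucene's exact formula.

```python
import math
from collections import Counter

def retrieve_candidates(query_identifiers, corpus, k=100):
    """Toy tf-idf ranking in the spirit of Lucene's practical scoring
    function: each identifier of the query document is boosted by its
    number of occurrences there; candidates are ranked by the sum of
    boosted tf-idf contributions and the top-k document ids are returned."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for identifiers in corpus.values():
        doc_freq.update(set(identifiers))
    boosts = Counter(query_identifiers)  # boost = # occurrences in the query document
    scores = {}
    for doc_id, identifiers in corpus.items():
        term_freq = Counter(identifiers)
        score = 0.0
        for ident, boost in boosts.items():
            if term_freq[ident]:
                idf = 1.0 + math.log(n_docs / (1.0 + doc_freq[ident]))
                score += boost * math.sqrt(term_freq[ident]) * idf ** 2
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```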
  14. Detailed Analysis - Mathematical Features
      • Identifier Frequency Histograms (Histo) – same as in the pilot study
      • Greedy Identifier Tiles (GIT) – contiguous identifiers in the same order (minimum length of 5)
      • Longest Common Identifier Subsequence (LCIS) – identifiers in the same order, but not necessarily contiguous
      Example (see the sketch after this slide): Doc 1 contains β(x) = (L_{g_j} L_f^{p−1} h_i(x)), Doc 2 contains β_{i,j}(x) = L_{g,j} L_f^{k_i−1} h_i(x); the longest identifier tile has length l_GIT = 6 and the longest common identifier subsequence has length l_LCIS = 10.
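As a minimal illustration of the order-considering measures, the Python sketch below computes LCIS as a standard longest-common-subsequence over identifier sequences (GIT additionally requires matches to be contiguous tiles of length ≥ 5 and is not shown). The tokenization of the two example formulae is an assumption; with it, the sketch reproduces l_LCIS = 10 from the slide.

```python
def lcis_length(ids_a, ids_b):
    """Longest Common Identifier Subsequence: identifiers in the same
    order but not necessarily contiguous (standard LCS dynamic program)."""
    m, n = len(ids_a), len(ids_b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if ids_a[i] == ids_b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    return table[m][n]

# Illustrative identifier sequences of the two example formulae on the slide
doc1 = ["beta", "x", "L", "g", "j", "L", "f", "p", "h", "i", "x"]
doc2 = ["beta", "i", "j", "x", "L", "g", "j", "L", "f", "k", "i", "h", "i", "x"]
print(lcis_length(doc1, doc2))  # 10, matching l_LCIS on the slide
```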
  15. Detailed Analysis - Citation & Text Features
      • Bibliographic Coupling (BC) – order-agnostic bag of references
      • Greedy Citation Tiles (GCT) – contiguous in-text citations in the same order (minimum length of 2)
      • Longest Common Citation Sequence (LCCS) – citations in the same order, but not necessarily contiguous
      • Text-based: Encoplot – efficiency-optimized character-16-gram comparison
      (minimal sketches of BC and a character-n-gram overlap follow after this slide)
      [Figure: citation tiles I (1,5,3), II (6,1,2), III (9,12,1) and LCCS 1,2,3 for two example citation sequences (Doc A, Doc B); bibliographic coupling example with s_BC = 3 for two documents citing the same three references.]
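The following Python sketch shows two of these features in their simplest possible form: a bibliographic-coupling count and a naive character-16-gram overlap as a stand-in for Encoplot (whose efficiency-optimized algorithm is not reproduced here). The function names and the normalization are assumptions.

```python
def bibliographic_coupling(refs_a, refs_b):
    """Order-agnostic bibliographic coupling: number of references
    shared by the two documents (s_BC = 3 in the slide's example)."""
    return len(set(refs_a) & set(refs_b))

def char_ngram_overlap(text_a, text_b, n=16):
    """Naive character-16-gram comparison: fraction of text_a's
    16-grams that also occur in text_b. This only illustrates the
    feature; Encoplot itself pairs n-grams far more efficiently."""
    grams_a = {text_a[i:i + n] for i in range(len(text_a) - n + 1)}
    grams_b = {text_b[i:i + n] for i in range(len(text_b) - n + 1)}
    if not grams_a:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a)
```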
  16. Results
  17. Candidate Retrieval
               C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  R (recall)
      Math      +   +   +   –   –   –   +   +   +   +   0.7
      Cit.      +   +   –   +   +   +   +   +   +   +   0.9
      Text      +   +   +   +   +   +   –   +   +   +   0.9
      • Citation-based and text-based approaches perform better than the math-based analysis (potential for improvement)
      • A basic union of the result sets of any two of the approaches achieves 100% recall
  18. Determining Significance Thresholds for Scores
      • Goal: derive an approximation of the maximum similarity achievable by chance
      • Analysis of the score distribution for 1M (hopefully) unrelated document pairs (no common authors, documents do not cite each other)
      • Threshold = score of the highest-ranked document pair without noticeable topical relatedness (see the sketch after this slide)
      Table 3: Significance thresholds for similarity measures.
            Histo  LCIS  GIT   BC    LCCS  GCT   Enco
      s     ≥.56   ≥.76  ≥.15  ≥.13  ≥.22  ≥.10  ≥.06
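A minimal Python sketch of the scoring step behind this threshold selection, assuming the sample of presumably unrelated pairs has already been drawn; the final judgment of topical relatedness remains manual, as in the study, and all names are illustrative.

```python
def rank_unrelated_pairs(pairs, similarity):
    """Score presumably unrelated document pairs (no common authors,
    no mutual citations) with one similarity measure and sort them by
    descending score. An analyst then walks down this ranking and sets
    the significance threshold to the score of the first pair without
    noticeable topical relatedness."""
    scored = [(similarity(doc_a, doc_b), id_a, id_b)
              for (id_a, doc_a), (id_b, doc_b) in pairs]
    return sorted(scored, reverse=True)
```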
  19. Retrieval Effectiveness for Confirmed Plagiarism Cases
      • Text-based approach performs best for the complete retrieval process (a known deficiency of the test collection)
      • Only 6 / 10 cases are clearly suspicious (s ≥ 0.20)
  20. Retrieval Effectiveness for Confirmed Plagiarism Cases
      • The best math-based approach (GIT) achieves the same MRR as the text-based approach (Enco) for the detailed analysis
      • This result is achievable using multi-feature candidate retrieval
  21. Retrieval Effectiveness for Confirmed Plagiarism Cases
      • Non-textual detection approaches provide valuable indicators of suspicious similarity in cases of low textual similarity
      [Figure: excerpt from the paper showing Table 3 (significance thresholds, cf. slide 18) and the distribution of similarity scores for the random sample of 1M document pairs, with the median score marked per measure.]
  22. Retrieval Effectiveness for Confirmed Plagiarism Cases
      • Combined feature analysis, e.g., a basic set union, achieves suspicious scores for 9 / 10 test cases (see the underlined values)
      • Only case C7 could not be identified
  23. Exploratory Study
      • Retrieve a candidate set (100 docs.) using the best-performing math-based, citation-based, and text-based approach for each of the 102,524 documents
      • Form the union of the candidate sets
      • Detailed analysis of each document against its candidate set
      • Manual investigation of the top-10 results
      (a sketch of this procedure follows after this slide)
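A hedged Python sketch of this four-step procedure; the parameters cr_math, cr_cit, cr_text, and detailed_analysis are hypothetical placeholders for the three best-performing candidate-retrieval approaches and the detailed-analysis score, not actual APIs of the prototype.

```python
def exploratory_study(documents, cr_math, cr_cit, cr_text, detailed_analysis, k=10):
    """Exploratory search sketch: per document, union the candidate sets of
    the three retrieval approaches, score every document/candidate pair with
    the detailed analysis, and return the top-k pairs for manual review."""
    results = []
    for doc_id, doc in documents.items():
        candidates = set(cr_math(doc)) | set(cr_cit(doc)) | set(cr_text(doc))
        for cand_id in candidates - {doc_id}:
            results.append((detailed_analysis(doc_id, cand_id), doc_id, cand_id))
    return sorted(results, reverse=True)[:k]
```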
  24. Results of the Exploratory Study
      • Two known cases of plagiarism (Plag.)
      • One so far undiscovered case, confirmed as plagiarism by the author of the source document (Susp.)
      • Five cases of content reuse (CR) – all duly cited, but the citations were not recognized
      • Two false positives (FP)
      Table 5: Top-ranked documents in the exploratory study.
      Rank    1      2      3    4    5      6    7    8    9    10
      Case    C3     C11    C12  C13  C10    C14  C15  C16  C17  C18
      Rating  Plag.  Susp.  CR   FP   Plag.  FP   CR   CR   CR   CR
  25. Newly Discovered Suspicious Case
      [Figure: excerpts of the source documents (S1, S2) side by side with the suspicious document.]
  26. Conclusion & Future Work
      • Math-based and citation-based detection methods complement text-based approaches
        – Improve recall in the candidate retrieval stage
        – Perform equally well as text-based methods for the detailed analysis in many cases
        – Provide indicators of suspicious similarity in cases with low textual similarity
        – Can identify so far undiscovered cases
      • Extraction of citation data from challenging STEM documents must be improved
        – Citation-based methods currently do not achieve their full potential
      • Improve math-based methods
        – Include positional information in the candidate retrieval stage
        – Include structural and semantic information in the detailed analysis
  27. Contact: Norman Meuschke | n@meuschke.org | @normeu
      Paper, Data, Code, Prototype: purl.org/hybridPD
      Other Projects & Publications: dke.uni-wuppertal.de
      Many thanks to the DAAD (German Academic Exchange Service) for travel support!
