Presentation of our full paper
"Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations"
at the Joint Conference on Digital Libraries (JCDL) 2019, held in Urbana-Champaign, IL, USA, June 3-5, 2019.
Pre-print of the paper: https://bit.ly/2wAyakz
Data, Code, System Prototype: https://purl.org/hybridPD
1. Improving Academic Plagiarism Detection for STEM Documents
by Analyzing Mathematical Content and Citations
Norman Meuschke, Vincent Stange, Moritz Schubotz, Michael Kramer, Bela Gipp
2. Outline
1. Problem
Detecting academic plagiarism in math-heavy STEM disciplines
2. Methodology
Combined analysis of math-based and citation-based features
for confirmed cases of plagiarism and
exploratory search for unknown cases in arXiv documents
3. Results
Math-based and citation-based methods are a valuable complement to
text-based methods and can identify so far undiscovered cases of
academic plagiarism in math-heavy STEM documents
4. Academic Plagiarism
“The use of ideas, concepts, words, or structures
without appropriately acknowledging the source
to benefit in a setting where originality is expected.”
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: Toward a standard definition of plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
5. Problem Summary
• Current Plagiarism Detection Systems:
– Perform sophisticated text analysis
– Find copy & paste plagiarism typical for students
– Miss disguised plagiarism frequent among researchers
• Our prior research:
– Analyzing citation patterns [1]
– Analyzing image similarity [2]
[Figure: Two example documents, Doc A and Doc B, containing in-text citations [1]–[3] to documents C, D, and E. The citation pattern of each document is extracted, and the two patterns are compared to reveal matching citation sequences.]
Analysis of in-text citation patterns [1] and of image similarity using Perceptual Hashing [2].
[1] B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on Document Engin. (DocEng), 2011.
[2] N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based Plagiarism Detection Approach,” in Proc. Joint Conf. on Digital Libraries (JCDL), 2018.
6. Problem Summary Cont.
• Non-textual feature analysis (in-text citations, images)
achieves good detection effectiveness for disguised plagiarism
• Papers in math-heavy disciplines:
– Mix natural language and mathematical content
– Cite comparatively fewer sources than
other STEM disciplines
– Use figures sparsely
• Text-based, citation-based, and image-based methods
are less effective for math-heavy disciplines
Plagiarized engineering paper
(original: left, plagiarized: right;
matching mathematical content: yellow,
matching text of 10+ characters: blue).
8. Approach
• Combined analysis of math-based and citation-based similarity
• Pilot study [3]:
[3] N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect
Academic Plagiarism,” in Proc. Int. Conf. on Information and Knowledge Management (CIKM), 2017.
[Figure: Pilot-study pipeline. Formulae from 10 plagiarized documents, their 10 source documents, and 105,120 arXiv documents are processed with the Mathosphere framework. All-to-all comparison of documents and document partitions; similarity is computed as the relative distance of frequency histograms of feature occurrences for identifiers (ci), numbers (cn), operators (co), and their combination.]
9. Dataset
• Ten confirmed cases (plagiarized document and its source) embedded
in NTCIR-11 MathIR Task Dataset (105,120 arXiv docs., 60 M formulae)
[Figure: Compilation of the test collection. Confirmed cases of plagiarism were retrieved from Retraction Watch and the VroniPlag Wiki, inspected by experts to create the ground truth, and converted and cleaned using InftyReader and LaTeXML. The arXiv documents of the NTCIR-11 MathIR Task Dataset (ntcir-math.nii.ac.jp) were converted to (X)HTML5 with LaTeXML. Both parts form the test collection, which is provided for research.]
10. Identifier Frequency Histograms (Histo)
• Order-agnostic bag of identifiers
• Similarity = relative difference in occurrence frequency (Δf = 0 → s = 1)
[Figure: Identifier frequency histograms of two example formulae from Doc 1 and Doc 2, compared via the relative distance of their identifier frequencies.]
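The Histo measure described above can be sketched as follows. This is a minimal illustration of an order-agnostic frequency comparison in which Δf = 0 yields s = 1, not necessarily the paper's exact normalization; the function name is illustrative:

```python
from collections import Counter

def histo_similarity(ids_a, ids_b):
    """Order-agnostic similarity of two identifier bags.

    Per identifier, the score is 1 - |f_a - f_b| / max(f_a, f_b),
    so identical frequencies (delta f = 0) give s = 1; the scores
    are averaged over the union of identifiers.
    """
    ha, hb = Counter(ids_a), Counter(ids_b)
    all_ids = set(ha) | set(hb)
    if not all_ids:
        return 0.0
    total = 0.0
    for ident in all_ids:
        fa, fb = ha[ident], hb[ident]
        total += 1.0 - abs(fa - fb) / max(fa, fb)
    return total / len(all_ids)
```

Two documents with identical identifier bags score 1.0; documents sharing no identifiers score 0.0.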
11. Approach
• Combined analysis of math-based and citation-based similarity
• Pilot study [3]:
– MRR = 0.86 (identifiers, whole documents)
– MRR = 0.70 (combined features, document partitions)
[3] N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect
Academic Plagiarism,” in Proc. Int. Conf. on Information and Knowledge Management (CIKM), 2017.
[Figure: Pilot-study pipeline (same figure as on slide 8).]
12. Contributions of this Study
1. Two-step retrieval process
– Candidate retrieval (CR)
– Detailed analysis (DA)
2. New math-based similarity measures
– Candidate Retrieval: Adaptation of Lucene’s scoring function to math features
– Detailed Analysis: Order-considering similarity measures
3. Combined analysis with citation-based methods
– Performed well for disguised plagiarism in our prior research
4. Exploratory study in 102,524 arXiv documents
– Search for undiscovered cases of plagiarism
13. Candidate Retrieval
• Dataset same as in pilot study
– Ten confirmed cases (plagiarized document and its source) embedded in NTCIR-11
MathIR Task Dataset (105,120 arXiv docs., 60 M formulae)
• Lucene’s Practical Scoring Function
– Combination of tf/idf vector space and Boolean retrieval model
• Features:
– Mathematical Identifiers (boost = # of occurrences in document)
– In-text citations
– Terms (default)
• Retrieve 100 top-ranked documents for each query as candidates
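As a rough illustration of how identifier occurrences can drive a tf-idf-style candidate score: the sketch below boosts each query identifier by its number of occurrences in the query document, as described above. It is a simplified stand-in, not Lucene's actual Practical Scoring Function, and all names are illustrative:

```python
import math
from collections import Counter

def candidate_score(query_ids, doc_ids, corpus):
    """Simplified tf-idf score of one document for a bag of query
    identifiers; `corpus` is a list of identifier bags (one per doc).
    """
    n = len(corpus)
    tf = Counter(doc_ids)
    boost = Counter(query_ids)  # boost = occurrences in the query document
    s = 0.0
    for ident, b in boost.items():
        df = sum(1 for d in corpus if ident in d)  # document frequency
        if df == 0 or tf[ident] == 0:
            continue
        idf = math.log(n / df) + 1.0
        s += b * math.sqrt(tf[ident]) * idf ** 2
    return s
```

Documents sharing no query identifiers score 0 and are effectively filtered out, mirroring the Boolean component of the combined model.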
14. Detailed Analysis - Mathematical Features
• Identifier Frequency Histograms (Histo) – same as in pilot study
• Greedy Identifier Tiles (GIT)
– Contiguous identifiers in same order (minimum length of 5)
• Longest Common Identifier Subsequence (LCIS)
– Identifiers in same order but not necessarily contiguous
[Figure: Identifier sequences of two example formulae, β(x) from Doc 1 and β_{i,j}(x) from Doc 2. GIT matches contiguous identifier runs (l_GIT = 6); LCIS additionally allows gaps (l_LCIS = 10).]
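LCIS is the classic longest common subsequence computed over identifier sequences (same order, gaps allowed); a minimal sketch:

```python
def lcis_length(ids_a, ids_b):
    """Length of the Longest Common Identifier Subsequence:
    identifiers in the same order, not necessarily contiguous
    (standard LCS dynamic program)."""
    m, n = len(ids_a), len(ids_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if ids_a[i] == ids_b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]
```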
15. Detailed Analysis - Citation & Text Features
• Bibliographic Coupling (BC)
– Order-agnostic bag of references
• Greedy Citation Tiles (GCT)
– Contiguous in-text citations
in same order
– minimum length of 2
• Longest Common Citation Sequence (LCCS)
– Citations in same order
but not necessarily contiguous
• Text-based: Encoplot
– Efficiency-optimized character 16-gram comparison
[Figure: Examples for the citation-based measures.
GCT – Doc A: xxx6x54xx321, Doc B: 6xxxx321xx54; Tiles: I (1,5,3), II (6,1,2), III (9,12,1).
LCCS – Doc A: 6543xx2xx1xx, Doc B: x34xxx256x1x; LCCS: 1,2,3.
BC – Doc A and Doc B both cite references [1], [2], and [3], hence s_BC = 3.]
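Greedy tiling can be sketched as repeatedly extracting the longest still-uncovered common contiguous run of citations. The code below is a simplified illustration of that idea, not the exact algorithm of [1]; names are illustrative:

```python
def greedy_tiles(seq_a, seq_b, min_len=2):
    """Greedy tiling of two citation sequences: repeatedly take the
    longest common contiguous run whose positions are still uncovered
    in both sequences; keep tiles of at least min_len.
    Returns (start_a, start_b, length) triples."""
    used_a = [False] * len(seq_a)
    used_b = [False] * len(seq_b)
    tiles = []
    while True:
        best = (0, 0, 0)  # (start_a, start_b, length)
        for i in range(len(seq_a)):
            for j in range(len(seq_b)):
                k = 0
                while (i + k < len(seq_a) and j + k < len(seq_b)
                       and not used_a[i + k] and not used_b[j + k]
                       and seq_a[i + k] == seq_b[j + k]):
                    k += 1
                if k > best[2]:
                    best = (i, j, k)
        if best[2] < min_len:
            break
        i, j, k = best
        tiles.append(best)
        for t in range(k):  # mark the tiled positions as covered
            used_a[i + t] = used_b[j + t] = True
    return tiles
```

For [1, 2, 3, 9, 4, 5] vs. [7, 1, 2, 3, 4, 5] this yields the tiles (0, 1, 3) and (4, 4, 2); isolated single matches fall below the minimum tile length and are discarded.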
17. Candidate Retrieval
Case C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 Recall
Math + + + – – – + + + + 0.7
Cit. + + – + + + + + + + 0.9
Text + + + + + + – + + + 0.9
• Citation-based and text-based approaches perform better than
math-based analysis (potential for improvement)
• Basic union of the result sets of any two of the approaches achieves
100% recall
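The recall values and the union observation can be reproduced directly from the per-case table above; a small sketch:

```python
# Per-case retrieval success, transcribed from the table ('+' = case found).
math_found = set("C1 C2 C3 C7 C8 C9 C10".split())
cit_found = set("C1 C2 C4 C5 C6 C7 C8 C9 C10".split())
text_found = set("C1 C2 C3 C4 C5 C6 C8 C9 C10".split())
all_cases = {f"C{i}" for i in range(1, 11)}

def recall(found):
    return len(found & all_cases) / len(all_cases)

print(recall(math_found))  # 0.7
print(recall(cit_found))   # 0.9
print(recall(text_found))  # 0.9
# Every pairwise union covers all ten cases:
pairs = [(math_found, cit_found), (math_found, text_found), (cit_found, text_found)]
print(all(recall(a | b) == 1.0 for a, b in pairs))  # True
```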
18. Determining Significance Thresholds for Scores
• Goal: Derive approximation for maximum similarity by chance
• Analysis of score distribution for 1M (hopefully) unrelated document
pairs (no common authors, do not cite each other)
• Threshold = score of highest ranked document pair without
noticeable topical relatedness
Table 3: Significance thresholds for similarity measures.
Measure Histo LCIS GIT BC LCCS GCT Enco
s ≥.56 ≥.76 ≥.15 ≥.13 ≥.22 ≥.10 ≥.06
19. Retrieval Effectiveness for Confirmed Plagiarism Cases
• Text-based approach performs best for complete retrieval process
(known deficiency of test collection)
• Only 6 / 10 cases clearly suspicious (𝑠 ≥ 0.20)
20. Retrieval Effectiveness for Confirmed Plagiarism Cases
• Best math-based approach (GIT) achieves same MRR as
text-based approach (Enco) for detailed analysis
• Result achievable using multi-feature candidate retrieval
21. Retrieval Effectiveness for Confirmed Plagiarism Cases
• Non-textual detection approaches provide valuable indicators for
suspicious similarity in case of low textual similarity
[Paper excerpt: The selection criteria eliminate document pairs that exhibit high content similarity for likely legitimate reasons, i.e., reusing own work and referring to the work of others with due attribution. The goal was to estimate an upper bound for similarity scores resulting from random feature matches: for each similarity measure, the topical relatedness of the top-ranked document pairs in the random sample of 1M pairs was assessed manually, and the significance threshold was set at the first pair without identifiable topical relatedness (Table 3, see slide 18). Figure 2 shows the distribution of similarity scores for the random sample; blue bars indicate the median score.]
22. Retrieval Effectiveness for Confirmed Plagiarism Cases
• Combined feature analysis, e.g., basic set union, achieves suspicious
scores for 9 / 10 test cases (see underlined values)
• Only C7 could not be identified
23. Exploratory Study
• Retrieve candidate set (100 doc.) using best-performing math-based,
citation-based, and text-based approach for all 102,524 documents
• Form union of candidate sets
• Detailed analysis of each document against its candidate set
• Manual Investigation of top-10 results
24. Results Exploratory Study
• Two known cases of plagiarism (Plag)
• One so far undiscovered case confirmed as plagiarism by the author of
the source document (Susp.)
• Five cases of content reuse (CR)
– All duly cited but citation not recognized
• Two false positives (FP)
Table 5: Top-ranked documents in exploratory study.
Rank 1 2 3 4 5 6 7 8 9 10
Case C3 C11 C12 C13 C10 C14 C15 C16 C17 C18
Rating Plag. Susp. CR FP Plag. FP CR CR CR CR
26. Conclusion & Future Work
• Math-based and citation-based detection methods complement
text-based approaches
– Improve recall for candidate retrieval stage
– Perform equally well as text-based methods for detailed analysis in many cases
– Provide indicators for suspicious similarity in cases with low textual similarity
– Can identify so far undiscovered cases
• Extraction of citation data for challenging STEM documents
must be improved
– Citation-based methods currently do not achieve their full potential
• Improve math-based methods
– Include positional information for candidate retrieval stage
– Include structural and semantic information for detailed analysis
27. Contact:
Norman Meuschke
n@meuschke.org | @normeu
Paper, Data, Code, Prototype:
purl.org/hybridPD
Other Projects & Publications:
dke.uni-wuppertal.de
Many Thanks to DAAD
for Travel Support!
German Academic
Exchange Service