Improving Academic
Plagiarism Detection
for STEM Documents
by Analyzing Mathematical
Content and Citations
Norman Meuschke, Vincent Stange,
Moritz Schubotz, Michael Kramer,
Bela Gipp
Outline
1. Problem
Detecting academic plagiarism in math-heavy STEM disciplines
2. Methodology
Combined analysis of math-based and citation-based features
for confirmed cases of plagiarism and
exploratory search for unknown cases in arXiv documents
3. Results
Math-based and citation-based methods are a valuable complement to
text-based methods and can identify so far undiscovered cases of
academic plagiarism in math-heavy STEM documents
2
Problem
Academic Plagiarism
“The use of ideas, concepts, words, or structures
without appropriately acknowledging the source
to benefit in a setting where originality is expected.”
3
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: Toward a standard definition of
plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
Problem Summary
• Current Plagiarism Detection Systems:
– Perform sophisticated text analysis
– Find copy & paste plagiarism typical for students
– Miss disguised plagiarism frequent among researchers
• Our prior research:
– Analyzing citation patterns [1]
– Analyzing image similarity [2]
[Figure: Illustration of citation-based plagiarism detection. Two example documents (Document A, Document B) contain in-text citations [1], [2], [3] to the cited documents C, D, and E. The sequence of in-text citations in each document is extracted as its citation pattern (e.g., E D C vs. D E C D C), and the pattern comparison aligns the two patterns to reveal a shared citation order despite insertions.]
Analysis of in-text citation patterns:
[1] B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on Document Engineering (DocEng), 2011.
Analysis of image similarity using perceptual hashing:
[2] N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based Plagiarism Detection Approach,” in Proc. Joint Conf. on Digital Libraries (JCDL), 2018.
5
Problem Summary Cont.
• Non-textual feature analysis (in-text citations, images)
achieves good detection effectiveness for disguised plagiarism
• Papers in math-heavy disciplines:
– Mix natural language and mathematical content
– Cite comparably fewer sources than
other STEM disciplines
– Use figures sparsely
• Text-based, citation-based, and image-based methods
are less effective for math-heavy disciplines
[Figure: Plagiarized engineering paper (original: left, plagiarized: right; matching mathematical content: yellow; matching text of 10+ characters: blue).]
6
Methodology
Approach
• Combined analysis of math-based and citation-based similarity
• Pilot study [3]:
[3] N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect
Academic Plagiarism,” in Proc. Int. Conf. on Information and Knowledge Management (CIKM), 2017. 8
[Figure: Pilot-study pipeline in the Mathosphere framework. Formulae from the 10 plagiarized documents, their 10 source documents, and the 105,120 arXiv documents are converted into frequency histograms of mathematical identifiers (ci), numbers (cn), and operators (co). Similarity is computed as the relative distance of these frequency histograms in an all-to-all comparison of documents and document partitions; the per-feature scores are also combined into a single feature-combination score, illustrated for two example formulae (Doc 1, Doc 2).]
Dataset
• Ten confirmed cases (plagiarized document and its source) embedded
in NTCIR-11 MathIR Task Dataset (105,120 arXiv docs., 60 M formulae)
[Figure: Compilation of the test collection. Confirmed cases of plagiarism were retrieved from Retraction Watch and the VroniPlag Wiki (39 cases), converted and cleaned using InftyReader and LaTeXML, and inspected by experts to create the ground truth (formulae from the 10 plagiarized documents and their 10 source documents). In parallel, arXiv.org documents of the NTCIR-11 MathIR Task Dataset (ntcir-math.nii.ac.jp) were selected, converted to (X)HTML5 with LaTeXML, and cleaned (formulae from 105,120 arXiv documents). Both parts form the test collection, which is provided for research.]
9
Identifier Frequency Histograms (Histo)
• Order-agnostic bag of identifiers
• Similarity = relative difference in occurrence frequency (Δf = 0 → s = 1)
[Figure: Identifier frequency histograms for two example formulae (Doc 1, Doc 2); the per-identifier frequency differences Δ determine the similarity score. A minimal code sketch follows below.]
10
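The exact scoring function is defined in the pilot study [3]; the following is a minimal Python sketch of the idea, assuming similarity is one minus the normalized absolute difference of the two identifier histograms (so a frequency difference of zero yields s = 1). Identifier extraction from the formulae is assumed to have happened already.

```python
from collections import Counter

def histo_similarity(identifiers_doc1, identifiers_doc2):
    """Order-agnostic identifier histogram similarity (sketch).

    Inputs are lists of identifier tokens extracted from each document's
    formulae, e.g. ["x", "x", "r"]. Identical histograms yield 1.0.
    """
    h1, h2 = Counter(identifiers_doc1), Counter(identifiers_doc2)
    all_ids = set(h1) | set(h2)
    if not all_ids:
        return 0.0
    # Relative difference in occurrence frequencies, summed over all identifiers.
    diff = sum(abs(h1[i] - h2[i]) for i in all_ids)
    total = sum(h1.values()) + sum(h2.values())
    return 1.0 - diff / total

print(histo_similarity(["x", "x", "r"], ["x", "x", "r"]))  # 1.0 (Δf = 0 -> s = 1)
print(histo_similarity(["x", "x", "r"], ["x", "y"]))       # lower score
```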
Approach
• Combined analysis of math-based and citation-based similarity
• Pilot study [3]:
• MRR = 0.86 (identifier histograms, document level)
• MRR = 0.70 (combined features, partition level)
[3] N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect
Academic Plagiarism,” in Proc. Int. Conf. on Information and Knowledge Management (CIKM), 2017. 11
[Figure: Pilot-study pipeline in the Mathosphere framework (same as on the previous Approach slide), shown again alongside the reported results.]
Contributions of this Study
1. Two-step retrieval process
– Candidate retrieval (CR)
– Detailed analysis (DA)
2. New math-based similarity measures
– Candidate Retrieval: Adaptation of Lucene’s scoring function to math features
– Detailed Analysis: Order-considering similarity measures
3. Combined analysis with citation-based methods
– Performed well for disguised plagiarism in our prior research
4. Exploratory study in 102,524 arXiv documents
– Search for undiscovered cases of plagiarism
12
Candidate Retrieval
• Dataset same as in pilot study
– Ten confirmed cases (plagiarized document and its source) embedded in NTCIR-11
MathIR Task Dataset (105,120 arXiv docs., 60 M formulae)
• Lucene’s Practical Scoring Function
– Combination of tf/idf vector space and Boolean retrieval model
• Features:
– Mathematical Identifiers (boost = # of occurrences in document)
– In-text citations
– Terms (default)
• Retrieve the 100 top-ranked documents for each query as candidates (a simplified scoring sketch follows below)
13
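The actual system indexes the corpus with Apache Lucene and scores queries with Lucene's Practical Scoring Function; the sketch below is only a simplified, hypothetical tf-idf stand-in in plain Python to illustrate the feature setup (terms, in-text citations, and math identifiers boosted by their number of occurrences in the document). The input format and function names are assumptions, not the real implementation.

```python
import math
from collections import Counter

def build_index(docs):
    """docs: {doc_id: {"terms": [...], "identifiers": [...], "citations": [...]}}
    (hypothetical input format). Returns per-document feature bags and
    document frequencies for the simplified scoring below."""
    bags, df = {}, Counter()
    for doc_id, feats in docs.items():
        bag = Counter(feats["terms"]) + Counter(feats["citations"])
        # Math identifiers are boosted by their number of occurrences in the
        # document, mirroring the boost described on the slide (boost = tf).
        for ident, tf in Counter(feats["identifiers"]).items():
            bag[("math", ident)] = tf * tf
        bags[doc_id] = bag
        df.update(set(bag))
    return bags, df

def retrieve_candidates(query_id, bags, df, k=100):
    """Rank all other documents with a simplified tf-idf score.
    A stand-in for Lucene's Practical Scoring Function, not the real thing."""
    n_docs = len(bags)
    query = bags[query_id]
    scores = {}
    for doc_id, bag in bags.items():
        if doc_id == query_id:
            continue
        scores[doc_id] = sum(
            q_tf * bag[feat] * math.log(1 + n_docs / df[feat]) ** 2
            for feat, q_tf in query.items() if feat in bag
        )
    return sorted(scores, key=scores.get, reverse=True)[:k]
```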
Detailed Analysis - Mathematical Features
• Identifier Frequency Histograms (Histo) – same as in pilot study
• Greedy Identifier Tiles (GIT)
– Contiguous identifiers in same order (minimum length of 5)
• Longest Common Identifier Subsequence (LCIS)
– Identifiers in same order but not necessarily contiguous
14
Example (two formulae and their identifier sequences):
Doc 1: $\beta(x) = (L_{g_j} L_f^{p-1} h_i(x))$
Doc 2: $\beta_{i,j}(x) = L_{g,j} L_f^{k_i-1} h_i(x)$
GIT: longest tile length $l_{\mathrm{GIT}} = 6$; LCIS: $l_{\mathrm{LCIS}} = 10$
(A simplified sketch of both measures follows below.)
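A minimal sketch of the two order-considering measures, assuming the identifier sequences have already been extracted: LCIS is a standard longest-common-subsequence computation over identifier sequences, and GIT is approximated here by a simplified greedy tiling (repeatedly taking the longest remaining common contiguous run of at least five identifiers); the paper's exact GIT algorithm may differ.

```python
def lcis_length(ids_a, ids_b):
    """Longest Common Identifier Subsequence: identifiers in the same order,
    but not necessarily contiguous (standard LCS dynamic program)."""
    m, n = len(ids_a), len(ids_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ids_a[i] == ids_b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def greedy_identifier_tiles(ids_a, ids_b, min_len=5):
    """Simplified greedy tiling: repeatedly take the longest common contiguous
    run of still-unmarked identifiers with length >= min_len.
    Returns a list of (length, start_a, start_b) tiles."""
    marked_a, marked_b, tiles = set(), set(), []
    while True:
        best = None
        for i in range(len(ids_a)):
            for j in range(len(ids_b)):
                k = 0
                while (i + k < len(ids_a) and j + k < len(ids_b)
                       and ids_a[i + k] == ids_b[j + k]
                       and i + k not in marked_a and j + k not in marked_b):
                    k += 1
                if k >= min_len and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            return tiles
        length, i, j = best
        tiles.append(best)
        marked_a.update(range(i, i + length))
        marked_b.update(range(j, j + length))
```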
Detailed Analysis - Citation & Text Features
• Bibliographic Coupling (BC)
– Order-agnostic bag of references
• Greedy Citation Tiles (GCT)
– Contiguous in-text citations
in same order
– minimum length of 2
• Longest Common Citation Sequence (LCCS)
– Citations in same order
but not necessarily contiguous
• Text-based: Encoplot
– Efficiency-optimized character 16-gram comparison
15
Examples:
GCT: Doc A = xxx6x54xx321, Doc B = 6xxxx321xx54; tiles I (1,5,3), II (6,1,2), III (9,12,1)
LCCS: Doc A = 6543xx2xx1xx, Doc B = x34xxx256x1x; LCCS = 1,2,3
BC: Doc A and Doc B both cite references [1], [2], [3], so $s_{\mathrm{BC}} = 3$
(A simplified sketch of these measures follows below.)
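A hedged sketch of the citation-based and text-based measures: bibliographic coupling counts shared references, and the text comparison is shown as a plain character 16-gram set overlap, which is only a crude stand-in for Encoplot's efficiency-optimized comparison. LCCS and GCT can be computed exactly like LCIS and GIT on the previous slide, with the sequence of in-text citations in place of identifiers and a minimum tile length of 2.

```python
def bibliographic_coupling(refs_a, refs_b):
    """s_BC = number of references shared by both documents (order-agnostic)."""
    return len(set(refs_a) & set(refs_b))

def shared_char_ngrams(text_a, text_b, n=16):
    """Count distinct character 16-grams occurring in both documents.
    A crude stand-in for Encoplot's efficiency-optimized comparison."""
    grams_a = {text_a[i:i + n] for i in range(len(text_a) - n + 1)}
    grams_b = {text_b[i:i + n] for i in range(len(text_b) - n + 1)}
    return len(grams_a & grams_b)

# Example from the slide: Doc A and Doc B both cite [1], [2], [3], so s_BC = 3.
print(bibliographic_coupling(["[1]", "[2]", "[3]"], ["[3]", "[1]", "[2]"]))  # 3
```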
Results
Candidate Retrieval
Approach  C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  Recall
Math      +   +   +   –   –   –   +   +   +   +    0.7
Cit.      +   +   –   +   +   +   +   +   +   +    0.9
Text      +   +   +   +   +   +   –   +   +   +    0.9
17
• Citation-based and text-based approaches perform better than
math-based analysis (potential for improvement)
• A basic union of the result sets of any two of the approaches achieves
100% recall (verified in the sketch below)
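The recall figures and the union claim can be checked mechanically from the +/– entries in the table above; a small sketch with the detected cases copied from the table (variable names are illustrative):

```python
from itertools import combinations

# Detected cases per approach, copied from the table above (+ = source retrieved).
detected = {
    "Math": {"C1", "C2", "C3", "C7", "C8", "C9", "C10"},
    "Cit.": {"C1", "C2", "C4", "C5", "C6", "C7", "C8", "C9", "C10"},
    "Text": {"C1", "C2", "C3", "C4", "C5", "C6", "C8", "C9", "C10"},
}
all_cases = {f"C{i}" for i in range(1, 11)}

for name, hits in detected.items():
    print(name, "recall:", len(hits) / len(all_cases))   # 0.7, 0.9, 0.9

for a, b in combinations(detected, 2):
    recall = len(detected[a] | detected[b]) / len(all_cases)
    print(a, "+", b, "recall:", recall)                   # 1.0 for every pair
```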
Determining Significance Thresholds for Scores
• Goal: Derive approximation for maximum similarity by chance
• Analysis of score distribution for 1M (hopefully) unrelated document
pairs (no common authors, do not cite each other)
• Threshold = score of the highest-ranked document pair without
noticeable topical relatedness (procedure sketched below)
Table 3: Significance thresholds for similarity measures.
Measure  Histo  LCIS  GIT   BC    LCCS  GCT   Enco
s        ≥.56   ≥.76  ≥.15  ≥.13  ≥.22  ≥.10  ≥.06
18
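A sketch of the threshold-derivation procedure. The topical-relatedness judgment was manual in the study, so it appears here as a callback; the document fields ("authors", "cited_ids") and function names are illustrative assumptions.

```python
import random

def sample_unrelated_pairs(docs, n_pairs=1_000_000):
    """Randomly draw document pairs with no common authors that do not cite
    each other ('docs' with 'authors' and 'cited_ids' fields is an assumed format)."""
    ids, pairs = list(docs), []
    while len(pairs) < n_pairs:
        a, b = random.sample(ids, 2)
        if (not set(docs[a]["authors"]) & set(docs[b]["authors"])
                and b not in docs[a]["cited_ids"]
                and a not in docs[b]["cited_ids"]):
            pairs.append((a, b))
    return pairs

def significance_threshold(pairs, score, is_topically_related):
    """Walk down the ranking of scored pairs and return the score of the first
    (highest-ranked) pair judged NOT topically related; that judgment was a
    manual inspection in the study, represented here as a callback."""
    for pair in sorted(pairs, key=score, reverse=True):
        if not is_topically_related(pair):
            return score(pair)
    return 0.0
```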
Retrieval Effectiveness for Confirmed Plagiarism Cases
• Text-based approach performs best for complete retrieval process
(known deficiency of test collection)
• Only 6 / 10 cases clearly suspicious (𝑠 ≥ 0.20)
19
Retrieval Effectiveness for Confirmed Plagiarism Cases
• Best math-based approach (GIT) achieves same MRR as
text-based approach (Enco) for detailed analysis
• Result achievable using multi-feature candidate retrieval
20
Retrieval Effectiveness for Confirmed Plagiarism Cases
• Non-textual detection approaches provide valuable indicators for
suspicious similarity in case of low textual similarity
21
[Excerpt from the paper (the significance thresholds in Table 3 are shown on slide 18):
“The selection criteria ought to eliminate document pairs that exhibit high content similarity for likely legitimate reasons, i.e., reusing own work and referring to the work of others with due attribution. Our goal was to estimate an upper bound for similarity scores that likely result from random feature matches. To do so, we manually assessed the topical relatedness of the top-ranked document pairs within the random sample of 1M documents for each similarity measure. We picked as the significance threshold for a similarity measure the rank of the first document pair for which we could not identify a topical relatedness. Table 3 shows the significance scores we derived using this procedure. Figure 2 shows the distribution of the similarity scores s (vertical axis) computed using each similarity measure for the random sample of 1M documents. Large horizontal bars shaded in blue indicate the median score; small horizontal bars shaded in grey mark the […]”]
Retrieval Effectiveness for Confirmed Plagiarism Cases
• Combined feature analysis, e.g., basic set union, achieves suspicious
scores for 9 / 10 test cases (see underlined values)
• Only C7 could not be identified
22
Exploratory Study
• Retrieve candidate set (100 doc.) using best-performing math-based,
citation-based, and text-based approach for all 102,524 documents
• Form union of candidate sets
• Detailed Analysis of each document against its candidate set
• Manual investigation of the top-10 results (pipeline sketched below)
23
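A hedged sketch of the exploratory-study pipeline, reusing retrieval and scoring functions like those sketched on earlier slides; the function names and signatures are illustrative, not the actual implementation.

```python
def exploratory_study(doc_ids, retrieve, detailed_score, top_k=10):
    """For every document: union the 100-document candidate sets of the best
    math-, citation-, and text-based retrieval approach, score each candidate
    with the detailed analysis, and keep the globally top-scoring pairs for
    manual investigation."""
    results = []
    for doc in doc_ids:
        candidates = set()
        for approach in ("math", "citation", "text"):
            candidates |= set(retrieve(doc, approach, k=100))
        for cand in candidates:
            results.append((detailed_score(doc, cand), doc, cand))
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:top_k]  # handed to a human reviewer
```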
Results Exploratory Study
• Two known cases of plagiarism (Plag)
• One so far undiscovered case confirmed as plagiarism by the author of
the source document (Susp.)
• Five cases of content reuse (CR)
– All duly cited but citation not recognized
• Two false positives (FP)
24
Table 5: Top-ranked documents in the exploratory study.
Rank    1      2      3    4    5      6    7    8    9    10
Case    C3     C11    C12  C13  C10    C14  C15  C16  C17  C18
Rating  Plag.  Susp.  CR   FP   Plag.  FP   CR   CR   CR   CR
Newly Discovered Suspicious Case
Source Documents (S1, S2) Suspicious Document
25
Conclusion & Future Work
• Math-based and citation-based detection methods complement
text-based approaches
– Improve recall for candidate retrieval stage
– Perform equally well as text-based methods for detailed analysis in many cases
– Provide indicators for suspicious similarity in cases with low textual similarity
– Can identify so far undiscovered cases
• Extraction of citation data for challenging STEM documents
must be improved
– Citation-based methods currently do not achieve their full potential
• Improve math-based methods
– Include positional information for candidate retrieval stage
– Include structural and semantic information for detailed analysis
26
Contact:
Norman Meuschke
n@meuschke.org | @normeu
Paper, Data, Code, Prototype:
purl.org/hybridPD
Other Projects & Publications:
dke.uni-wuppertal.de
Many Thanks to DAAD
for Travel Support!
German Academic
Exchange Service