Improving Academic
Plagiarism Detection
for STEM Documents
by Analyzing Mathematical
Content and Citations
Norman Meuschke, Vincent Stange,
Moritz Schubotz, Michael Kramer,
Bela Gipp
Outline
1. Problem
Detecting academic plagiarism in math-heavy STEM disciplines
2. Methodology
Combined analysis of math-based and citation-based features
for confirmed cases of plagiarism and
exploratory search for unknown cases in arXiv documents
3. Results
Math-based and citation-based methods are a valuable complement to
text-based methods and can identify so far undiscovered cases of
academic plagiarism in math-heavy STEM documents
Problem
Academic Plagiarism
“The use of ideas, concepts, words, or structures
without appropriately acknowledging the source
to benefit in a setting where originality is expected.”
Source: Teddi Fishman. 2009. “We know it when we see it”? is not good enough: toward a standard definition of
plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
Problem Summary
• Current Plagiarism Detection Systems:
– Perform sophisticated text analysis
– Find copy & paste plagiarism typical for students
– Miss disguised plagiarism frequent among researchers
• Our prior research:
– Analyzing citation patterns [1]
– Analyzing image similarity [2]
[Figure: Two example documents (Doc A, Doc B) with in-text citations [1] to [3] to the shared sources Doc C, Doc D, and Doc E. The order of the in-text citations in each document forms a citation pattern, and the patterns of Doc A and Doc B are compared.]
[1] B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation
Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on Document Engin. (DocEng), 2011.
Analysis of in-text citation patterns.
Analysis of image similarity using Perceptual Hashing
[2] N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based Plagiarism
Detection Approach,” in Proc. Joint Conf. on Digital Libraries (JCDL), 2018.
Problem Summary Cont.
• Non-textual feature analysis (in-text citations, images)
achieves good detection effectiveness for disguised plagiarism
• Papers in math-heavy disciplines:
– Mix natural language and mathematical content
– Cite comparatively fewer sources than other STEM disciplines
– Use figures sparsely
• Text-based, citation-based, and image-based methods
are less effective for math-heavy disciplines
[Figure: Plagiarized engineering paper (original: left; plagiarized: right; matching mathematical content: yellow; matching text of 10+ characters: blue).]
Methodology
Approach
• Combined analysis of math-based and citation-based similarity
• Pilot study [3]:
– Formulae from 10 plagiarized doc., 10 source doc., and 105,120 arXiv doc. (Mathosphere Framework)
– Features: identifiers (ci), numbers (cn), operators (co), and the feature combination
– All-to-all comparison of documents and document partitions
– Computation: relative distance of frequency histograms of feature occurrences
[3] N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect
Academic Plagiarism,” in Proc. Int. Conf. on Information and Knowledge Management (CIKM), 2017.
Dataset
• Ten confirmed cases (plagiarized document and its source) embedded
in NTCIR-11 MathIR Task Dataset (105,120 arXiv docs., 60 M formulae)
[Figure: Compilation of the test collection]
Compilation of test cases: retrieval of confirmed cases of plagiarism (Retraction Watch, VroniPlag Wiki); file conversion & cleaning (InftyReader, LaTeXML); expert inspection to create ground truth. Result: formulae from 10 plagiarized doc. and 10 source doc.
NTCIR-11 MathIR Task Dataset (ntcir-math.nii.ac.jp): selection of arXiv documents (arXiv.org); file conversion & cleaning (LaTeXML, (X)HTML5); provision for research. Result: formulae from 105,120 arXiv doc.
[Figure: Identifier frequency histograms of two example formulae (Doc 1, Doc 2) over the identifiers r and x, with the per-identifier distance Δ. In the example, Δf = 0 → s = 1.]
Identifier Frequency Histograms (Histo)
• Order-agnostic bag of identifiers
• Similarity = relative difference in occurrence frequency
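A minimal sketch of a histogram-based similarity, assuming one plausible normalization: the summed absolute difference of identifier counts divided by the summed per-identifier maximum. The slides do not state the exact formula, so `histo_similarity` and its normalization are illustrative assumptions, not the authors' measure.

```python
from collections import Counter


def histo_similarity(ids_a, ids_b):
    """Order-agnostic similarity in [0, 1] from identifier occurrence counts.

    Assumed normalization (not taken from the slides):
    s = 1 - sum(|f_a - f_b|) / sum(max(f_a, f_b)).
    """
    freq_a, freq_b = Counter(ids_a), Counter(ids_b)
    identifiers = set(freq_a) | set(freq_b)
    if not identifiers:
        return 0.0  # no identifiers at all: nothing to compare
    diff = sum(abs(freq_a[i] - freq_b[i]) for i in identifiers)
    norm = sum(max(freq_a[i], freq_b[i]) for i in identifiers)
    return 1.0 - diff / norm


# Same identifiers with the same frequencies, in a different order.
same = histo_similarity(["r", "x", "x"], ["x", "r", "x"])
# Partly different identifiers.
partial = histo_similarity(["r", "x", "x"], ["r", "y"])
print(same, partial)
```

With equal frequencies the difference is zero and the score is 1, which agrees with the Δf = 0 → s = 1 case on the slide. Because only counts are compared, reordering a formula does not change the score.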
Approach Cont.
• Pilot study [3] results, measured as mean reciprocal rank (MRR):
– MRR = 0.86 (identifiers, whole documents)
– MRR = 0.70 (feature combination, document partitions)
Contributions of this Study
1. Two-step retrieval process
– Candidate retrieval (CR)
– Detailed analysis (DA)
2. New math-based similarity measures
– Candidate Retrieval: Adaptation of Lucene’s scoring function to math features
– Detailed Analysis: Order-considering similarity measures
3. Combined analysis with citation-based methods
– Performed well for disguised plagiarism in our prior research
4. Exploratory study in 102,524 arXiv documents
– Search for undiscovered cases of plagiarism
Candidate Retrieval
• Dataset same as in pilot study
– Ten confirmed cases (plagiarized document and its source) embedded in NTCIR-11
MathIR Task Dataset (105,120 arXiv docs., 60 M formulae)
• Lucene’s Practical Scoring Function
– Combination of tf/idf vector space and Boolean retrieval model
• Features:
– Mathematical Identifiers (boost = # of occurrences in document)
– In-text citations
– Terms (default)
• Retrieve 100 top-ranked documents for each query as candidates
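The candidate retrieval step can be sketched as a ranked tf/idf lookup over bags of features. The code below is a simplified stand-in, not Lucene's Practical Scoring Function: it omits Lucene's normalization factors, and the function names, the scoring formula, and the toy documents are assumptions made for illustration. The boost follows the slide in using the number of occurrences of an identifier in the query document.

```python
import math
from collections import Counter


def build_index(docs):
    """docs: {doc_id: list of feature tokens}. Returns per-document counts and idf."""
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    n = len(docs)
    # Smoothed idf; always positive, so every shared feature adds to the score.
    idf = {t: 1.0 + math.log(n / (df + 1)) for t, df in doc_freq.items()}
    return counts, idf


def retrieve(query_tokens, counts, idf, k=100, exclude=None):
    """Return the ids of the k best-scoring documents for a query document."""
    boosts = Counter(query_tokens)  # boost = occurrences in the query document
    scores = {}
    for doc_id, c in counts.items():
        if doc_id == exclude:
            continue
        score = sum(boost * math.sqrt(c[t]) * idf[t] ** 2
                    for t, boost in boosts.items() if t in c)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy collection of identifier bags (hypothetical data).
docs = {
    "src": ["x", "x", "r", "beta"],
    "other": ["a", "b", "x"],
    "unrelated": ["m", "n"],
}
counts, idf = build_index(docs)
ranked = retrieve(["x", "r", "x", "beta"], counts, idf, k=2)
print(ranked)
```

The same function could be run once per feature type (identifiers, in-text citations, terms), each call yielding its own candidate list of up to 100 documents.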
Detailed Analysis - Mathematical Features
• Identifier Frequency Histograms (Histo) – same as in pilot study
• Greedy Identifier Tiles (GIT)
– Contiguous identifiers in same order (minimum length of 5)
• Longest Common Identifier Subsequence (LCIS)
– Identifiers in same order but not necessarily contiguous
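A minimal sketch of the two order-considering measures, assuming each formula has already been reduced to a list of identifiers in reading order. The tiling routine is a simplified greedy variant that takes one longest unmarked common run per pass; the identifier lists are one possible reading of the slide's two example formulae, not data taken from the paper.

```python
def lcis_length(a, b):
    """Length of the longest common identifier subsequence (gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def greedy_tiles(a, b, min_len=5):
    """Greedy tiling: repeatedly mark the longest unmarked contiguous match.

    Returns tiles as (start_in_a, start_in_b, length) with 0-based starts.
    """
    used_a, used_b = [False] * len(a), [False] * len(b)
    tiles = []
    while True:
        best = (0, 0, 0)  # (length, start_a, start_b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not used_a[i + k] and not used_b[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length < max(min_len, 1):
            return tiles
        for k in range(length):
            used_a[i + k] = used_b[j + k] = True
        tiles.append((i, j, length))


# Assumed identifier sequences for the two example formulae.
doc1 = ["beta", "x", "L", "g", "j", "L", "f", "p", "h", "i", "x"]
doc2 = ["beta", "i", "j", "x", "L", "g", "j", "L", "f", "k", "i", "h", "i", "x"]

tiles = greedy_tiles(doc1, doc2, min_len=5)
print(sum(length for _, _, length in tiles), lcis_length(doc1, doc2))
```

Under this reading, the single tile has length 6 and the common subsequence has length 10, the values the slide reports for its example. GIT rewards only long unbroken runs, while LCIS tolerates inserted or renamed identifiers.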
[Figure: Two example formulae compared using GIT and LCIS]
Doc 1: $\beta(x) = (L_{g_j} L_f^{p-1} h_i(x))$
Doc 2: $\beta_{i,j}(x) = L_{g,j} L_f^{k_i-1} h_i(x)$
GIT: $l_{\mathrm{GIT}} = 6$
LCIS: $l_{\mathrm{LCIS}} = 10$
Detailed Analysis - Citation & Text Features
• Bibliographic Coupling (BC)
– Order-agnostic bag of references
• Greedy Citation Tiles (GCT)
– Contiguous in-text citations in same order (minimum length of 2)
• Longest Common Citation Sequence (LCCS)
– Citations in same order but not necessarily contiguous
• Text-based: Encoplot
– Efficiency-optimized character 16-gram comparison
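Two of the citation-based measures can be sketched in a few lines, assuming each document is given as a list of cited reference ids in order of appearance. The function names and the toy data are hypothetical. The citation sequences encode the same ordering as the LCCS example on the slide, as far as it can be read from the figure.

```python
from functools import lru_cache


def bibliographic_coupling(refs_a, refs_b):
    """Number of references that both documents cite (order-agnostic)."""
    return len(set(refs_a) & set(refs_b))


def lccs(cits_a, cits_b):
    """Longest common citation sequence: same order, gaps allowed."""
    # Citations of sources that only one document cites can never match.
    shared = set(cits_a) & set(cits_b)
    a = tuple(c for c in cits_a if c in shared)
    b = tuple(c for c in cits_b if c in shared)

    @lru_cache(maxsize=None)
    def solve(i, j):
        if i == len(a) or j == len(b):
            return ()
        if a[i] == b[j]:
            return (a[i],) + solve(i + 1, j + 1)
        skip_a, skip_b = solve(i + 1, j), solve(i, j + 1)
        return skip_b if len(skip_b) >= len(skip_a) else skip_a

    return list(solve(0, 0))


# "a*" / "b*" are sources cited by only one of the two documents.
doc_a = ["a1", "a2", "1", "a3", "a4", "2", "a5", "a6", "3", "4", "5", "6"]
doc_b = ["b1", "1", "b2", "6", "5", "2", "b3", "b4", "b5", "4", "3", "b6"]

coupling = bibliographic_coupling(["r1", "r2", "r3", "r7"], ["r1", "r2", "r3", "r9"])
sequence = lccs(doc_a, doc_b)
print(coupling, sequence)
```

The recursive formulation is fine for short citation sequences; for long documents an iterative dynamic-programming table would avoid deep recursion.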
[Figure: Examples for the citation-based measures]
BC: Doc A and Doc B both cite [1], [2], and [3]; s_BC = 3
GCT:
Doc A: 123xx45x6xxx
Doc B: 45xx123xxxx6
Tiles: I (1,5,3) II (6,1,2) III (9,12,1)
LCCS:
Doc A: xx1xx2xx3456
Doc B: x1x652xxx43x
LCCS: 1,2,3
Results
Candidate Retrieval
        C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  R
Math    +   +   +   –   –   –   +   +   +   +    0.7
Cit.    +   +   –   +   +   +   +   +   +   +    0.9
Text    +   +   +   +   +   +   –   +   +   +    0.9
(+ = source document among the retrieved candidates, – = not retrieved, R = recall)
• Citation-based and text-based approaches perform better than
math-based analysis (potential for improvement)
• Basic union of the result sets of any two of the approaches achieves
100% recall
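The recall figures and the union claim can be checked directly from the table. The sketch below encodes which of the ten cases each approach missed and recomputes recall for every approach and every pairwise union; the variable names are illustrative.

```python
from itertools import combinations

CASES = [f"C{i}" for i in range(1, 11)]

# Cases whose source was retrieved by each approach, read off the table.
retrieved = {
    "Math": set(CASES) - {"C4", "C5", "C6"},
    "Cit.": set(CASES) - {"C3"},
    "Text": set(CASES) - {"C7"},
}


def recall(found):
    """Fraction of the ten confirmed cases whose source was retrieved."""
    return len(found) / len(CASES)


single = {name: recall(found) for name, found in retrieved.items()}
pairwise = {
    (a, b): recall(retrieved[a] | retrieved[b])
    for a, b in combinations(retrieved, 2)
}
print(single)
print(pairwise)
```

No case is missed by more than one approach, which is why every pairwise union reaches full recall.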
Determining Significance Thresholds for Scores
• Goal: Derive approximation for maximum similarity by chance
• Analysis of score distribution for 1M (hopefully) unrelated document
pairs (no common authors, do not cite each other)
• Threshold = score of highest ranked document pair without
noticeable topical relatedness
Table 3: Significance thresholds for similarity measures.
     Histo  LCIS   GIT    BC     LCCS   GCT    Enco
s    ≥.56   ≥.76   ≥.15   ≥.13   ≥.22   ≥.10   ≥.06
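A short sketch of how the Table 3 thresholds could be applied to one document pair. The threshold values are those listed in the table; the function name, the example scores, and the rule that a single measure at or above its threshold is enough to flag a pair are assumptions for illustration.

```python
# Significance thresholds per similarity measure (Table 3).
THRESHOLDS = {
    "Histo": 0.56, "LCIS": 0.76, "GIT": 0.15,
    "BC": 0.13, "LCCS": 0.22, "GCT": 0.10, "Enco": 0.06,
}


def suspicious_measures(scores):
    """Return the measures whose score reaches the significance threshold."""
    return [m for m, s in scores.items()
            if m in THRESHOLDS and s >= THRESHOLDS[m]]


# Hypothetical scores for one document pair with low textual similarity.
pair_scores = {"Histo": 0.60, "GIT": 0.05, "BC": 0.13, "Enco": 0.02}
flagged = suspicious_measures(pair_scores)
print(flagged, bool(flagged))
```

Flagging on any single measure mirrors a basic set union of the per-measure results: a pair with almost no matching text can still surface through its math-based or citation-based scores.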
Retrieval Effectiveness for Confirmed Plagiarism Cases
• Text-based approach performs best for complete retrieval process
(known deficiency of test collection)
• Only 6 / 10 cases clearly suspicious (𝑠 ≥ 0.20)
Retrieval Effectiveness for Confirmed Plagiarism Cases
• Best math-based approach (GIT) achieves same MRR as
text-based approach (Enco) for detailed analysis
• Result achievable using multi-feature candidate retrieval
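For reference, mean reciprocal rank (MRR) averages the reciprocal of the rank at which the true source document appears for each query, counting a source that is not retrieved as zero. The ranks below are hypothetical and only illustrate the arithmetic; they are not results from the study.

```python
def mean_reciprocal_rank(ranks):
    """ranks: rank of the true source per query (1 = top), None if not retrieved."""
    if not ranks:
        raise ValueError("at least one query is required")
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks for four queries: two hits at rank 1, one at rank 2, one miss.
example_mrr = mean_reciprocal_rank([1, 1, 2, None])
print(example_mrr)
```

An MRR of 1.0 means every source was ranked first; a source that drops from rank 1 to rank 2 halves its contribution, so the metric is sensitive to the top of the ranking.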
Retrieval Effectiveness for Confirmed Plagiarism Cases
• Non-textual detection approaches provide valuable indicators for
suspicious similarity in case of low textual similarity
[Excerpt from the accompanying paper:] … pairs. The selection criteria ought to eliminate document pairs that
exhibit high content similarity for likely legitimate reasons, i.e., reusing own work and referring to the work
of others with due attribution. Our goal was to estimate an upper bound for similarity scores that likely result
from random feature matches. To do so, we manually assessed the topical relatedness of the top-ranked document
pairs within the random sample of 1M documents for each similarity measure. We picked as the significance
threshold for a similarity measure the rank of the first document pair for which we could not identify a topical
relatedness. Table 3 shows the significance scores we derived using this procedure.
Figure 2 shows the distribution of the similarity scores s (vertical axis) computed using each similarity measure
for the random sample of 1M documents. Large horizontal bars shaded in blue indicate the median score; small
horizontal bars shaded in grey mark the …
Retrieval Effectiveness for Confirmed Plagiarism Cases
• Combined feature analysis, e.g., basic set union, achieves suspicious
scores for 9 / 10 test cases (see underlined values)
• Only C7 could not be identified
Exploratory Study
• Retrieve candidate set (100 doc.) using best-performing math-based,
citation-based, and text-based approach for all 102,524 documents
• Form union of candidate sets
• Detailed analysis of each document against its candidate set
• Manual Investigation of top-10 results
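The four steps above can be sketched as a small pipeline. Everything here is a hypothetical outline: `retrievers` stands for the best-performing math-based, citation-based, and text-based candidate retrieval functions, `detailed_score` for whichever detailed-analysis measure is used, and the toy data only demonstrates the control flow.

```python
def exploratory_search(doc_ids, retrievers, detailed_score, k=100, top_n=10):
    """Union the candidate sets per document, score each pair, keep the top pairs."""
    best = {}
    for doc in doc_ids:
        candidates = set()
        for retrieve in retrievers:
            candidates.update(retrieve(doc)[:k])  # up to k candidates per approach
        candidates.discard(doc)  # never compare a document with itself
        for cand in candidates:
            pair = tuple(sorted((doc, cand)))  # (a, b) and (b, a) are one pair
            score = detailed_score(doc, cand)
            if score > best.get(pair, float("-inf")):
                best[pair] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]  # handed over for manual investigation


# Toy stand-ins for two retrieval approaches and one detailed-analysis score.
math_hits = {"d1": ["d2"], "d2": ["d1"], "d3": []}
cit_hits = {"d1": ["d3"], "d2": [], "d3": ["d1"]}
pair_scores = {("d1", "d2"): 0.9, ("d1", "d3"): 0.2}

top = exploratory_search(
    ["d1", "d2", "d3"],
    [lambda d: math_hits[d], lambda d: cit_hits[d]],
    lambda a, b: pair_scores[tuple(sorted((a, b)))],
)
print(top)
```

Restricting the detailed analysis to the union of the candidate sets keeps the number of pairwise comparisons far below an all-to-all comparison of the collection.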
Results Exploratory Study
• Two known cases of plagiarism (Plag.)
• One so far undiscovered case confirmed as plagiarism by the author of
the source document (Susp.)
• Five cases of content reuse (CR)
– All duly cited but citation not recognized
• Two false positives (FP)
Table 5: Top-ranked documents in exploratory study.
Rank    1      2      3    4    5      6    7    8    9    10
Case    C3     C11    C12  C13  C10    C14  C15  C16  C17  C18
Rating  Plag.  Susp.  CR   FP   Plag.  FP   CR   CR   CR   CR
Newly Discovered Suspicious Case
[Figure: Source documents (S1, S2) next to the suspicious document.]
Conclusion & Future Work
• Math-based and citation-based detection methods complement
text-based approaches
– Improve recall for candidate retrieval stage
– Perform as well as text-based methods for detailed analysis in many cases
– Provide indicators for suspicious similarity in cases with low textual similarity
– Can identify so far undiscovered cases
• Extraction of citation data for challenging STEM documents
must be improved
– Citation-based methods currently do not achieve their full potential
• Improve math-based methods
– Include positional information for candidate retrieval stage
– Include structural and semantic information for detailed analysis
Contact:
Norman Meuschke
n@meuschke.org | @normeu
Paper, Data, Code, Prototype:
purl.org/hybridPD
Other Projects & Publications:
dke.uni-wuppertal.de
Many Thanks to DAAD
for Travel Support!
German Academic
Exchange Service

More Related Content

What's hot

Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibEl Habib NFAOUI
 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval modelbaradhimarch81
 
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...iosrjce
 
Semantic tagging for documents using 'short text' information
Semantic tagging for documents using 'short text' informationSemantic tagging for documents using 'short text' information
Semantic tagging for documents using 'short text' informationcsandit
 
Workshop unpad2014 with ref
Workshop unpad2014 with refWorkshop unpad2014 with ref
Workshop unpad2014 with refLola Devung
 
Keyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeKeyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeIJMTST Journal
 
Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
Probabilistic Models of Novel Document Rankings for Faceted Topic RetrievalProbabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
Probabilistic Models of Novel Document Rankings for Faceted Topic RetrievalYI-JHEN LIN
 
Probabilistic Information Retrieval
Probabilistic Information RetrievalProbabilistic Information Retrieval
Probabilistic Information RetrievalHarsh Thakkar
 
Semantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionSemantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionTELKOMNIKA JOURNAL
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHIJDKP
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003Ajay Ohri
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationNinad Samel
 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document RankingBhaskar Mitra
 
Use text mining method to support criminal case judgment
Use text mining method to support criminal case judgmentUse text mining method to support criminal case judgment
Use text mining method to support criminal case judgmentZhongLI28
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
 

What's hot (19)

Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval model
 
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
 
A0210110
A0210110A0210110
A0210110
 
Semantic tagging for documents using 'short text' information
Semantic tagging for documents using 'short text' informationSemantic tagging for documents using 'short text' information
Semantic tagging for documents using 'short text' information
 
Workshop unpad2014 with ref
Workshop unpad2014 with refWorkshop unpad2014 with ref
Workshop unpad2014 with ref
 
Keyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeKeyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood Knowledge
 
Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
Probabilistic Models of Novel Document Rankings for Faceted Topic RetrievalProbabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
 
Probabilistic Information Retrieval
Probabilistic Information RetrievalProbabilistic Information Retrieval
Probabilistic Information Retrieval
 
Semantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionSemantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detection
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorization
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
 
Ontology learning
Ontology learningOntology learning
Ontology learning
 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document Ranking
 
Use text mining method to support criminal case judgment
Use text mining method to support criminal case judgmentUse text mining method to support criminal case judgment
Use text mining method to support criminal case judgment
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrieval
 

Similar to Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...onlmcq
 
Textual Document Categorization using Bigram Maximum Likelihood and KNN
Textual Document Categorization using Bigram Maximum Likelihood and KNNTextual Document Categorization using Bigram Maximum Likelihood and KNN
Textual Document Categorization using Bigram Maximum Likelihood and KNNRounak Dhaneriya
 
G04124041046
G04124041046G04124041046
G04124041046IOSR-JEN
 
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...ijdms
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) ijceronline
 
Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Innovation Quotient Pvt Ltd
 
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...cscpconf
 
Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...IJERA Editor
 
Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...IJERA Editor
 
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...ijcsitcejournal
 
Research on ontology based information retrieval techniques
Research on ontology based information retrieval techniquesResearch on ontology based information retrieval techniques
Research on ontology based information retrieval techniquesKausar Mukadam
 
Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspacePrakash Dubey
 

Similar to Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations (20)

F017243241
F017243241F017243241
F017243241
 
P33077080
P33077080P33077080
P33077080
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
A Semantically Enriched Recommendation & Visualization Approach for Academic ...
A Semantically Enriched Recommendation & Visualization Approach for Academic ...A Semantically Enriched Recommendation & Visualization Approach for Academic ...
A Semantically Enriched Recommendation & Visualization Approach for Academic ...
 
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
 
Textual Document Categorization using Bigram Maximum Likelihood and KNN
Textual Document Categorization using Bigram Maximum Likelihood and KNNTextual Document Categorization using Bigram Maximum Likelihood and KNN
Textual Document Categorization using Bigram Maximum Likelihood and KNN
 
Bl24409420
Bl24409420Bl24409420
Bl24409420
 
G04124041046
G04124041046G04124041046
G04124041046
 
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
A First Step Towards Content Protecting Plagiarism Detection
A First Step Towards Content Protecting Plagiarism Detection  A First Step Towards Content Protecting Plagiarism Detection
A First Step Towards Content Protecting Plagiarism Detection
 
Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...
 
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
 
Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...
 
Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...
 
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
 
Ijetcas14 446
Ijetcas14 446Ijetcas14 446
Ijetcas14 446
 
Research on ontology based information retrieval techniques
Research on ontology based information retrieval techniquesResearch on ontology based information retrieval techniques
Research on ontology based information retrieval techniques
 
Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspace
 
Ceis 1
Ceis 1Ceis 1
Ceis 1
 

Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations

  • 1. Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations Norman Meuschke, Vincent Stange, Moritz Schubotz, Michael Kramer, Bela Gipp
  • 2. Outline 1. Problem: Detecting academic plagiarism in math-heavy STEM disciplines 2. Methodology: Combined analysis of math-based and citation-based features for confirmed cases of plagiarism and exploratory search for unknown cases in arXiv documents 3. Results: Math-based and citation-based methods are a valuable complement to text-based methods and can identify so far undiscovered cases of academic plagiarism in math-heavy STEM documents
  • 4. Academic Plagiarism “The use of ideas, concepts, words, or structures without appropriately acknowledging the source to benefit in a setting where originality is expected.” Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
  • 5. Problem Summary • Current plagiarism detection systems: – Perform sophisticated text analysis – Find copy & paste plagiarism typical for students – Miss disguised plagiarism frequent among researchers • Our prior research: – Analyzing citation patterns [1] – Analyzing image similarity [2] [Figure: analysis of in-text citation patterns. The citation patterns of Document A and Document B are extracted from the order of their in-text citations and compared.] [Figure: analysis of image similarity using perceptual hashing.] [1] B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on Document Engin. (DocEng), 2011. [2] N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based Plagiarism Detection Approach,” in Proc. Joint Conf. on Digital Libraries (JCDL), 2018.
  • 6. Problem Summary Cont. • Non-textual feature analysis (in-text citations, images) achieves good detection effectiveness for disguised plagiarism • Papers in math-heavy disciplines: – Mix natural language and mathematical content – Cite comparatively fewer sources than papers in other STEM disciplines – Use figures sparsely • Text-based, citation-based, and image-based methods are less effective for math-heavy disciplines [Figure: plagiarized engineering paper (original: left; plagiarized: right; matching mathematical content: yellow; matching text of 10+ chars: blue).]
  • 8. Approach • Combined analysis of math-based and citation-based similarity • Pilot study [3]: – Formulae from 10 plagiarized documents, 10 source documents, and 105,120 arXiv documents, processed with the Mathosphere framework – Features: identifiers (ci), numbers (cn), operators (co), and their combination – All-to-all comparison of documents and document partitions – Computation: relative distance of frequency histograms of feature occurrences [Figure: example formulae of Doc1 and Doc2 with the frequency histograms and distances for each feature type.] [3] N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect Academic Plagiarism,” in Proc. Int. Conf. on Information and Knowledge Management (CIKM), 2017.
  • 9. Dataset • Ten confirmed cases (plagiarized document and its source) embedded in the NTCIR-11 MathIR Task Dataset (105,120 arXiv docs., 60 M formulae) [Figure: creation of the test collection. Compilation of test cases: retrieval of confirmed cases of plagiarism (Retraction Watch, VroniPlag Wiki), expert inspection to create ground truth, file conversion & cleaning (Infty Reader), yielding formulae from 10 plagiarized and 10 source documents. NTCIR-11 MathIR Task Dataset (ntcir-math.nii.ac.jp): selection of arXiv documents (arXiv.org), file conversion & cleaning (LaTeXML), provision for research, yielding formulae from 105,120 arXiv documents. Both parts form the (X)HTML5 test collection.]
  • 10. Identifier Frequency Histograms (Histo) • Order-agnostic bag of identifiers • Similarity = relative difference in occurrence frequency [Figure: identifier frequency histograms for the example formulae of Doc 1 and Doc 2 (identifiers r and x); a frequency difference of Δf = 0 yields s = 1.]
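The histogram comparison can be sketched as follows. The slide only states that similarity is the relative difference in occurrence frequency and that Δf = 0 gives s = 1. The normalization used here (summed absolute frequency differences divided by the summed per-identifier maxima) is an assumption made for illustration, and `histo_similarity` is a hypothetical name, not the authors' implementation.

```python
from collections import Counter


def histo_similarity(ids_a, ids_b):
    """Order-agnostic similarity of two identifier lists, in [0, 1]."""
    hist_a, hist_b = Counter(ids_a), Counter(ids_b)
    keys = set(hist_a) | set(hist_b)
    if not keys:
        return 1.0  # two empty histograms do not differ
    # Assumed normalization: total frequency difference relative to the
    # sum of the larger frequency of each identifier.
    diff = sum(abs(hist_a[k] - hist_b[k]) for k in keys)
    total = sum(max(hist_a[k], hist_b[k]) for k in keys)
    return 1.0 - diff / total
```

Because only frequencies are compared, reordering the identifiers of a formula does not change the score, which is the order-agnostic property named on the slide.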
  • 11. Approach • Combined analysis of math-based and citation-based similarity • Pilot study [3]: – MRR = 0.86 (identifiers, documents) – MRR = 0.70 (feature combination, document partitions) [Figure: overview of the pilot study, repeated from slide 8.]
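For readers unfamiliar with the metric, MRR (mean reciprocal rank) averages 1/rank of the correct source document over all queries. The following is a minimal sketch of the standard definition; the function name and the convention that an unretrieved source contributes 0 are assumptions for illustration.

```python
def mean_reciprocal_rank(ranks):
    """ranks: rank of the true source per query (1 = top), None if not retrieved."""
    if not ranks:
        raise ValueError("need at least one query")
    # A source that was not retrieved contributes a reciprocal rank of 0.
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


# Made-up ranks for four queries, purely for illustration.
example = mean_reciprocal_rank([1, 1, 2, None])
```

With the made-up ranks above, the sum of reciprocal ranks is 1 + 1 + 0.5 + 0 = 2.5, so the MRR is 0.625.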
  • 12. Contributions of this Study 1. Two-step retrieval process – Candidate retrieval (CR) – Detailed analysis (DA) 2. New math-based similarity measures – Candidate retrieval: Adaptation of Lucene’s scoring function to math features – Detailed analysis: Order-considering similarity measures 3. Combined analysis with citation-based methods – Performed well for disguised plagiarism in our prior research 4. Exploratory study in 102,524 arXiv documents – Search for undiscovered cases of plagiarism
  • 13. Candidate Retrieval • Dataset same as in pilot study – Ten confirmed cases (plagiarized document and its source) embedded in the NTCIR-11 MathIR Task Dataset (105,120 arXiv docs., 60 M formulae) • Lucene’s Practical Scoring Function – Combination of tf-idf vector space and Boolean retrieval model • Features: – Mathematical identifiers (boost = # of occurrences in document) – In-text citations – Terms (default) • Retrieve 100 top-ranked documents for each query as candidates
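The candidate retrieval idea can be illustrated with a simplified tf-idf ranking. This sketch does not call Lucene. It loosely follows the tf and idf definitions of Lucene's classic TF-IDF similarity (tf = sqrt(freq), idf = 1 + ln(N / (df + 1)), idf squared in the score) and omits the coordination factor, query normalization, and length norm. Using the number of occurrences in the query document as the boost is an assumption about how the boost on the slide is applied, and `rank_candidates` is a hypothetical name.

```python
import math
from collections import Counter


def rank_candidates(query_tokens, corpus, top_k=100):
    """corpus: dict doc_id -> list of feature tokens. Returns up to top_k doc ids."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus.values():
        doc_freq.update(set(tokens))  # count each feature once per document
    # Assumption: boost = number of occurrences in the query document.
    boosts = Counter(query_tokens)
    scores = {}
    for doc_id, tokens in corpus.items():
        tf = Counter(tokens)
        score = 0.0
        for term, boost in boosts.items():
            if tf[term] == 0:
                continue
            idf = 1.0 + math.log(n_docs / (doc_freq[term] + 1))
            score += math.sqrt(tf[term]) * idf ** 2 * boost
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

The tokens can be mathematical identifiers, in-text citation keys, or terms; the default `top_k=100` mirrors the 100 candidates retrieved per query.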
  • 14. Detailed Analysis - Mathematical Features • Identifier Frequency Histograms (Histo) – same as in pilot study • Greedy Identifier Tiles (GIT) – Contiguous identifiers in same order (minimum length of 5) • Longest Common Identifier Subsequence (LCIS) – Identifiers in same order but not necessarily contiguous Example: Doc 1: $\beta(x) = (L_{g_j} L_f^{p-1} h_i(x))$; Doc 2: $\beta_{i,j}(x) = L_{g,j} L_f^{k_i-1} h_i(x)$; GIT: $l_{\mathrm{GIT}} = 6$; LCIS: $l_{\mathrm{LCIS}} = 10$
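The two order-considering measures can be sketched on the slide's example. The identifier sequences below are read off the two example formulae (the reading of the garbled subscripts is an assumption). `lcs_length` is the textbook longest-common-subsequence recurrence, and `greedy_tiles` is a simplified greedy tiling that marks one longest unmarked match per round; both are illustrations, not the authors' code.

```python
def lcs_length(a, b):
    """Length of the longest common (not necessarily contiguous) subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def greedy_tiles(a, b, min_len=5):
    """Return tiles (start_a, start_b, length), 0-based, longest match first."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    used_a, used_b = [False] * len(a), [False] * len(b)
    tiles = []
    while True:
        best = (0, -1, -1)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                # Extend the match while items agree and are not yet tiled.
                while (i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]
                       and not used_a[i + k] and not used_b[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length < min_len:
            return tiles
        for k in range(length):
            used_a[i + k] = used_b[j + k] = True
        tiles.append((i, j, length))


doc1 = ["β", "x", "L", "g", "j", "L", "f", "p", "h", "i", "x"]
doc2 = ["β", "i", "j", "x", "L", "g", "j", "L", "f", "k", "i", "h", "i", "x"]
```

On these sequences, `lcs_length(doc1, doc2)` gives 10 and `greedy_tiles(doc1, doc2)` finds a single tile of length 6, which agrees with the values shown on the slide.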
  • 15. Detailed Analysis - Citation & Text Features • Bibliographic Coupling (BC) – Order-agnostic bag of references – Example: Doc A and Doc B both cite [1], [2], and [3], so s_BC = 3 • Greedy Citation Tiles (GCT) – Contiguous in-text citations in same order – Minimum length of 2 – Example: Doc A: 123xx45x6xxx, Doc B: 45xx123xxxx6 (digits are citations, x is text); tiles as (position in A, position in B, length): I (1,5,3), II (6,1,2), III (9,12,1) • Longest Common Citation Sequence (LCCS) – Citations in same order but not necessarily contiguous – Example: Doc A: xx1xx2xx3456, Doc B: x1x652xxx43x; LCCS: 1,2,3 • Text-based: Encoplot – Efficiency-optimized character 16-gram comparison
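Two of these measures are simple enough to sketch directly. Bibliographic coupling is the number of shared references. The second function counts distinct character n-grams shared by two texts; it is a plain set overlap chosen for illustration and is not Encoplot, whose efficiency-optimized matching is not reproduced here. Both function names are hypothetical.

```python
def bibliographic_coupling(refs_a, refs_b):
    """Number of references that both documents cite (order-agnostic)."""
    return len(set(refs_a) & set(refs_b))


def char_ngrams(text, n):
    """Set of all character n-grams of a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_char_ngrams(text_a, text_b, n=16):
    """Number of distinct character n-grams occurring in both texts."""
    return len(char_ngrams(text_a, n) & char_ngrams(text_b, n))
```

For the slide's example, `bibliographic_coupling(["[1]", "[2]", "[3]"], ["[3]", "[1]", "[2]"])` returns 3 regardless of citation order.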
  • 17. Candidate Retrieval
          C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  R
    Math  +   +   +   –   –   –   +   +   +   +    0.7
    Cit.  +   +   –   +   +   +   +   +   +   +    0.9
    Text  +   +   +   +   +   +   –   +   +   +    0.9
    (+ = source document retrieved as candidate, – = not retrieved, R = recall)
    • Citation-based and text-based approaches perform better than math-based analysis (potential for improvement)
    • Basic union of the result sets of any two of the approaches achieves 100% recall
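The recall values and the claim about unions can be checked directly from the table on this slide. The sketch below encodes each row as a string of hits and misses for C1 to C10 (with an ASCII "-" for a miss); the names `HITS` and `recall` are hypothetical.

```python
from itertools import combinations

# One character per test case C1..C10, copied from the slide's table.
HITS = {
    "Math": "+++---++++",
    "Cit.": "++-+++++++",
    "Text": "++++++-+++",
}


def recall(*approaches):
    """Recall of the union of the candidate sets of the given approaches."""
    found = [any(HITS[a][i] == "+" for a in approaches) for i in range(10)]
    return sum(found) / len(found)


# Recall for every pair of approaches.
pair_recall = {pair: recall(*pair) for pair in combinations(HITS, 2)}
```

Each single approach misses at least one case, but no case is missed by two approaches at once, which is why every pairwise union reaches a recall of 1.0.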
  • 18. Determining Significance Thresholds for Scores • Goal: Derive approximation for maximum similarity by chance • Analysis of score distribution for 1M presumably unrelated document pairs (no common authors, do not cite each other) • Threshold = score of highest-ranked document pair without noticeable topical relatedness
    Table 3: Significance thresholds for similarity measures.
         Histo  LCIS  GIT   BC    LCCS  GCT   Enco
    s    ≥.56   ≥.76  ≥.15  ≥.13  ≥.22  ≥.10  ≥.06
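Applying the thresholds amounts to a per-measure comparison. The sketch below copies the values from Table 3 and reports which measures flag a document pair; the function name and the example scores are made up for illustration.

```python
# Significance thresholds from Table 3 on the slide.
THRESHOLDS = {
    "Histo": 0.56, "LCIS": 0.76, "GIT": 0.15, "BC": 0.13,
    "LCCS": 0.22, "GCT": 0.10, "Enco": 0.06,
}


def suspicious_measures(scores):
    """Names of measures whose score reaches its significance threshold."""
    return sorted(
        name for name, score in scores.items()
        if name in THRESHOLDS and score >= THRESHOLDS[name]
    )


# Hypothetical scores for one document pair (not taken from the study).
flags = suspicious_measures({"Histo": 0.40, "GIT": 0.30, "Enco": 0.02})
```

Treating a pair as suspicious when at least one measure is flagged corresponds to a basic set union of the individual approaches.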
  • 19. Retrieval Effectiveness for Confirmed Plagiarism Cases • Text-based approach performs best for complete retrieval process (known deficiency of test collection) • Only 6 / 10 cases clearly suspicious (s ≥ 0.20)
  • 20. Retrieval Effectiveness for Confirmed Plagiarism Cases • Best math-based approach (GIT) achieves same MRR as text-based approach (Enco) for detailed analysis • Result achievable using multi-feature candidate retrieval
  • 21. Retrieval Effectiveness for Confirmed Plagiarism Cases • Non-textual detection approaches provide valuable indicators for suspicious similarity in case of low textual similarity Excerpt from the paper: “[...] The selection criteria ought to eliminate document pairs that exhibit high content similarity for likely legitimate reasons, i.e., reusing own work and referring to the work of others with due attribution. Our goal was to estimate an upper bound for similarity scores that likely result from random feature matches. To do so, we manually assessed the topical relatedness of the top-ranked document pairs within the random sample of 1M documents for each similarity measure. We picked as the significance threshold for a similarity measure the rank of the first document pair for which we could not identify a topical relatedness. Table 3 shows the significance scores we derived using this procedure. Figure 2 shows the distribution of the similarity scores s (vertical axis) computed using each similarity measure for the random sample of 1M documents. Large horizontal bars shaded in blue indicate the median score; [...]”
  • 22. Retrieval Effectiveness for Confirmed Plagiarism Cases • Combined feature analysis, e.g., basic set union, achieves suspicious scores for 9 / 10 test cases (see underlined values) • Only C7 could not be identified
  • 23. Exploratory Study • Retrieve candidate set (100 doc.) using the best-performing math-based, citation-based, and text-based approaches for all 102,524 documents • Form union of candidate sets • Detailed analysis of each document against its candidate set • Manual investigation of top-10 results
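The four steps above can be sketched as a small pipeline. The retrievers and the pairwise scoring function are passed in as callables because the slide does not specify their interfaces; `explore`, `retrievers`, and `score_pair` are hypothetical names, and the final manual investigation is outside the scope of the code.

```python
def explore(doc_ids, retrievers, score_pair, top_n=10, k=100):
    """Return the top_n (score, document, candidate) triples over all documents."""
    results = []
    for doc in doc_ids:
        # Steps 1 and 2: retrieve k candidates per approach, then form the union.
        candidates = set()
        for retrieve in retrievers:
            candidates.update(retrieve(doc)[:k])
        candidates.discard(doc)  # a document is not its own candidate
        # Step 3: detailed analysis of the document against each candidate.
        for cand in candidates:
            results.append((score_pair(doc, cand), doc, cand))
    # Step 4 input: highest-scoring pairs for manual inspection.
    # Note: a pair may appear in both directions; no deduplication is done.
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:top_n]
```

In the study's setting, `retrievers` would hold one function per approach (math-based, citation-based, text-based) and `top_n=10` matches the ten results that were inspected manually.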
  • 24. Results Exploratory Study • Two known cases of plagiarism (Plag.) • One so far undiscovered case confirmed as plagiarism by the author of the source document (Susp.) • Five cases of content reuse (CR) – All duly cited but citation not recognized • Two false positives (FP)
    Table 5: Top-ranked documents in exploratory study.
    Rank    1      2      3    4    5      6    7    8    9    10
    Case    C3     C11    C12  C13  C10    C14  C15  C16  C17  C18
    Rating  Plag.  Susp.  CR   FP   Plag.  FP   CR   CR   CR   CR
  • 25. Newly Discovered Suspicious Case [Figure: source documents (S1, S2) and the suspicious document.]
  • 26. Conclusion & Future Work • Math-based and citation-based detection methods complement text-based approaches – Improve recall for candidate retrieval stage – Perform equally well as text-based methods for detailed analysis in many cases – Provide indicators for suspicious similarity in cases with low textual similarity – Can identify so far undiscovered cases • Extraction of citation data for challenging STEM documents must be improved – Citation-based methods currently do not achieve their full potential • Improve math-based methods – Include positional information for candidate retrieval stage – Include structural and semantic information for detailed analysis
  • 27. Contact: Norman Meuschke n@meuschke.org | @normeu Paper, Data, Code, Prototype: purl.org/hybridPD Other Projects & Publications: dke.uni-wuppertal.de Many Thanks to DAAD for Travel Support! German Academic Exchange Service