Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations

Scientific Information Analytics Group, Prof. Gipp
Norman Meuschke, Vincent Stange,
Moritz Schubotz, Michael Kramer,
Bela Gipp
Outline
1. Problem
Detecting academic plagiarism in math-heavy STEM disciplines
2. Methodology
Combined analysis of math-based and citation-based features
for confirmed cases of plagiarism and
exploratory search for unknown cases in arXiv documents
3. Results
Math-based and citation-based methods are a valuable complement to
text-based methods and can identify so far undiscovered cases of
academic plagiarism in math-heavy STEM documents
Problem
Academic Plagiarism
“The use of ideas, concepts, words, or structures
without appropriately acknowledging the source
to benefit in a setting where originality is expected.”
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of
plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
Problem Summary
• Current Plagiarism Detection Systems:
– Perform sophisticated text analysis
– Find copy & paste plagiarism typical for students
– Miss disguised plagiarism frequent among researchers
• Our prior research:
– Analyzing citation patterns [1]
– Analyzing image similarity [2]
[Figure: Analysis of in-text citation patterns. The example documents Doc A and Doc B cite the same sources (Doc C, Doc D, Doc E); their in-text citation patterns (EDC and DECDC) are compared.]
[1] B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation
Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on Document Engin. (DocEng), 2011.
[Figure: Analysis of image similarity using Perceptual Hashing.]
[2] N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based Plagiarism
Detection Approach,” in Proc. Joint Conf. on Digital Libraries (JCDL), 2018.
Problem Summary Cont.
• Non-textual feature analysis (in-text citations, images)
achieves good detection effectiveness for disguised plagiarism
• Papers in math-heavy disciplines:
– Mix natural language and mathematical content
– Cite comparatively fewer sources than
other STEM disciplines
– Use figures sparsely
• Text-based, citation-based, and image-based methods
are less effective for math-heavy disciplines
[Figure: Plagiarized engineering paper (original: left; plagiarized: right; matching mathematical content: yellow; matching text of 10+ characters: blue).]
Methodology
Approach
• Combined analysis of math-based and citation-based similarity
• Pilot study [3]:
[3] N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect
Academic Plagiarism,” in Proc. Int. Conf. on Information and Knowledge Management (CIKM), 2017.
[Figure: Pilot study pipeline. Formulae from 10 plagiarized documents, 10 source documents, and 105,120 arXiv documents are processed with the Mathosphere Framework. Extracted features: identifiers (ci), numbers (cn), operators (co), and the feature combination. All-to-all comparison of documents and document partitions. Computation: relative distance of frequency histograms of feature occurrences.]
Dataset
• Ten confirmed cases (plagiarized document and its source) embedded
in NTCIR-11 MathIR Task Dataset (105,120 arXiv docs., 60 M formulae)
[Figure: Compilation of the test collection. Test cases: retrieval of confirmed cases of plagiarism (Retraction Watch, VroniPlag Wiki), expert inspection to create ground truth, file conversion and cleaning (InftyReader, LaTeXML), yielding formulae from 10 plagiarized and 10 source documents. NTCIR-11 MathIR Task Dataset (ntcir-math.nii.ac.jp): selection of arXiv documents from arXiv.org, file conversion and cleaning (LaTeXML, (X)HTML5), provision for research, yielding formulae from 105,120 arXiv documents.]
Identifier Frequency Histograms (Histo)
• Order-agnostic bag of identifiers
• Similarity = relative difference in occurrence frequency
[Figure: Identifier frequency histograms for two example formulae in Doc 1 and Doc 2 (identifiers r, x, Δ); ∆f = 0 → s = 1.]
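The histogram comparison can be sketched in a few lines. The slide only states that similarity is the relative difference in occurrence frequency, so the normalization used below (summed absolute differences divided by summed maxima) is an assumption, as are the function name and the toy identifier lists.

```python
from collections import Counter


def histo_similarity(ids_a, ids_b):
    """Order-agnostic similarity of two identifier bags.

    Assumed formula (not given on the slide):
    s = 1 - sum_i |f_a(i) - f_b(i)| / sum_i max(f_a(i), f_b(i))
    """
    freq_a, freq_b = Counter(ids_a), Counter(ids_b)
    identifiers = set(freq_a) | set(freq_b)
    if not identifiers:
        return 0.0  # neither document contains identifiers
    diff = sum(abs(freq_a[i] - freq_b[i]) for i in identifiers)
    total = sum(max(freq_a[i], freq_b[i]) for i in identifiers)
    return 1.0 - diff / total


# Toy identifier lists, invented for illustration.
same = histo_similarity(["r", "x", "x"], ["x", "r", "x"])  # same bag, other order
partial = histo_similarity(["r", "x", "x"], ["r", "x", "y"])
disjoint = histo_similarity(["a", "b"], ["c", "d"])
```

With this normalization, identical frequency histograms give s = 1 regardless of identifier order, which agrees with the ∆f = 0 → s = 1 case on the slide.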
Approach Cont.
• Results of the pilot study [3]:
– MRR = 0.86 (identifiers, whole documents)
– MRR = 0.70 (feature combination, document partitions)
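For readers unfamiliar with the metric: the mean reciprocal rank (MRR) averages 1/rank of the correct source document over all queries. A minimal sketch follows; the ranks in the example are hypothetical and are not the study's data.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the true source per query, None if not retrieved."""
    if not ranks:
        raise ValueError("need at least one query")
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks for five queries (illustration only).
example_ranks = [1, 1, 2, 4, None]
mrr = mean_reciprocal_rank(example_ranks)  # (1 + 1 + 0.5 + 0.25 + 0) / 5
```

An MRR of 1.0 would mean that the source document is ranked first for every query.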
Contributions of this Study
1. Two-step retrieval process
– Candidate retrieval (CR)
– Detailed analysis (DA)
2. New math-based similarity measures
– Candidate Retrieval: Adaptation of Lucene’s scoring function to math features
– Detailed Analysis: Order-considering similarity measures
3. Combined analysis with citation-based methods
– Performed well for disguised plagiarism in our prior research
4. Exploratory study in 102,524 arXiv documents
– Search for undiscovered cases of plagiarism
Candidate Retrieval
• Dataset same as in pilot study
– Ten confirmed cases (plagiarized document and its source) embedded in NTCIR-11
MathIR Task Dataset (105,120 arXiv docs., 60 M formulae)
• Lucene’s Practical Scoring Function
– Combination of tf/idf vector space and Boolean retrieval model
• Features:
– Mathematical Identifiers (boost = # of occurrences in document)
– In-text citations
– Terms (default)
• Retrieve 100 top-ranked documents for each query as candidates
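The identifier-based candidate retrieval can be approximated with a simplified tf-idf ranking. This is a sketch only: it is modeled loosely on Lucene's classic scoring (square-root term frequency, squared inverse document frequency, per-term boost) and omits the coordination factor, query normalization, and length normalization. The function name, the toy corpus, and the exact idf variant are assumptions.

```python
import math
from collections import Counter


def rank_candidates(query_ids, corpus, top_k=100):
    """Rank corpus documents by a simplified tf-idf score over identifiers."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for ids in corpus.values():
        doc_freq.update(set(ids))  # count each identifier once per document
    boosts = Counter(query_ids)  # boost = occurrences in the query document
    scores = {}
    for doc_id, ids in corpus.items():
        tf = Counter(ids)
        score = 0.0
        for term, boost in boosts.items():
            if tf[term] == 0:
                continue
            idf = 1.0 + math.log(n_docs / (doc_freq[term] + 1))  # assumed idf variant
            score += boost * math.sqrt(tf[term]) * idf ** 2
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy corpus of identifier lists, invented for illustration.
corpus = {
    "d1": ["x", "r", "x", "L", "f"],
    "d2": ["a", "b", "x"],
    "d3": ["k", "m"],
}
ranking = rank_candidates(["x", "x", "r", "L"], corpus)
```

Documents that share no identifier with the query are dropped, and top_k=100 mirrors the candidate set size named on the slide.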
Detailed Analysis - Mathematical Features
• Identifier Frequency Histograms (Histo) – same as in pilot study
• Greedy Identifier Tiles (GIT)
– Contiguous identifiers in same order (minimum length of 5)
• Longest Common Identifier Subsequence (LCIS)
– Identifiers in same order but not necessarily contiguous
[Figure: GIT and LCIS example for two formulae.
Doc 1: $\beta(x) = (L_{g_j} L_f^{p-1} h_i(x))$
Doc 2: $\beta_{i,j}(x) = L_{g,j} L_f^{k_i-1} h_i(x)$
GIT: $l_{\mathrm{GIT}} = 6$; LCIS: $l_{\mathrm{LCIS}} = 10$]
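The two order-considering measures can be sketched as follows. The identifier sequences are read off the two example formulae (β written as "beta"). The tiling routine is a plain greedy variant written for illustration and is not necessarily the authors' implementation. How the lengths are normalized into scores is not shown on the slide, so only lengths are computed.

```python
def lcis_length(a, b):
    """Length of the longest common subsequence (same order, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def greedy_tiles(a, b, min_len=5):
    """Greedy tiling: repeatedly take the longest unmarked contiguous match.

    Returns tiles as (start in a, start in b, length) with 0-based starts.
    """
    used_a, used_b = [False] * len(a), [False] * len(b)
    tiles = []
    while True:
        best = (0, 0, 0)  # (length, start in a, start in b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not used_a[i + k] and not used_b[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length < max(min_len, 1):
            return tiles
        for k in range(length):
            used_a[i + k] = used_b[j + k] = True
        tiles.append((i, j, length))


doc1 = "beta x L g j L f p h i x".split()
doc2 = "beta i j x L g j L f k i h i x".split()
l_lcis = lcis_length(doc1, doc2)
tiles = greedy_tiles(doc1, doc2)
l_git = sum(length for _, _, length in tiles)
```

For these sequences the sketch gives l_lcis = 10 and a single tile (x, L, g, j, L, f) with l_git = 6, matching the values on the slide. The trailing match h, i, x has length 3 and falls below the minimum tile length of 5.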
Detailed Analysis - Citation & Text Features
• Bibliographic Coupling (BC)
– Order-agnostic bag of references
• Greedy Citation Tiles (GCT)
– Contiguous in-text citations in same order
– Minimum length of 2
• Longest Common Citation Sequence (LCCS)
– Citations in same order but not necessarily contiguous
• Text-based: Encoplot
– Efficiency-optimized character 16-gram comparison
[Figure: Examples for the citation-based measures (x = citation of a source that is not shared).
BC: Doc A and Doc B both cite [1], [2], and [3]; s_BC = 3.
GCT: Doc A: 123xx45x6xxx, Doc B: 45xx123xxxx6; tiles: I (1,5,3), II (6,1,2), III (9,12,1).
LCCS: Doc A: xx1xx2xx3456, Doc B: x1x652xxx43x; LCCS: 1,2,3.]
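The three citation-based measures can be sketched on the example patterns from the figure, read left to right: GCT uses 123xx45x6xxx and 45xx123xxxx6, LCCS uses xx1xx2xx3456 and x1x652xxx43x. Two assumptions: "x" is treated as a citation of a non-shared source that never matches, and a tile is written as (start in Doc A, start in Doc B, length) with 1-based positions. The function names are illustrative and this is not the authors' code.

```python
def parse(pattern):
    """'12x3' -> [1, 2, None, 3]; 'x' is a citation of a non-shared source."""
    return [None if ch == "x" else int(ch) for ch in pattern]


def bibliographic_coupling(refs_a, refs_b):
    """Number of references both documents share (order ignored)."""
    return len(set(refs_a) & set(refs_b))


def citation_tiles(a, b, min_len=2):
    """Greedy tiles as (start in a, start in b, length), 1-based starts."""
    used_a, used_b = [False] * len(a), [False] * len(b)
    tiles = []
    while True:
        best = (0, 0, 0)  # (length, start in a, start in b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] is not None and a[i + k] == b[j + k]
                       and not used_a[i + k] and not used_b[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length < max(min_len, 1):
            return tiles
        for k in range(length):
            used_a[i + k] = used_b[j + k] = True
        tiles.append((i + 1, j + 1, length))


def lccs_length(a, b):
    """Length of the longest common citation sequence (gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            match = x is not None and x == y
            curr.append(prev[j - 1] + 1 if match else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


s_bc = bibliographic_coupling([1, 2, 3], [1, 2, 3])
all_tiles = citation_tiles(parse("123xx45x6xxx"), parse("45xx123xxxx6"), min_len=1)
long_tiles = citation_tiles(parse("123xx45x6xxx"), parse("45xx123xxxx6"))
l_lccs = lccs_length(parse("xx1xx2xx3456"), parse("x1x652xxx43x"))
```

With min_len=1 the sketch reproduces the three tiles I, II, and III from the figure. With the minimum length of 2 named in the list above, the single-citation tile III is discarded.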
Results
Candidate Retrieval
       C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  R
Math   +   +   +   –   –   –   +   +   +   +    0.7
Cit.   +   +   –   +   +   +   +   +   +   +    0.9
Text   +   +   +   +   +   +   –   +   +   +    0.9
(+ : source document retrieved as a candidate; – : not retrieved; R: recall)
• Citation-based and text-based approaches perform better than
math-based analysis (potential for improvement)
• Basic union of the result sets of any two of the approaches achieves
100% recall
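The recall figures and the claim about pairwise unions can be checked directly from the candidate retrieval table. The sketch below encodes each row as a string of + and - marks for C1 to C10; the variable names are illustrative.

```python
from itertools import combinations

CASES = [f"C{i}" for i in range(1, 11)]
RETRIEVED = {  # one mark per test case C1..C10, copied from the table
    "Math": "+++---++++",
    "Cit.": "++-+++++++",
    "Text": "++++++-+++",
}

# Set of retrieved test cases per approach.
hits = {
    name: {case for case, mark in zip(CASES, marks) if mark == "+"}
    for name, marks in RETRIEVED.items()
}
recall = {name: len(found) / len(CASES) for name, found in hits.items()}

# Recall of the union of the result sets for every pair of approaches.
union_recall = {
    (a, b): len(hits[a] | hits[b]) / len(CASES)
    for a, b in combinations(hits, 2)
}
```

Each approach misses different cases (Math: C4 to C6, Cit.: C3, Text: C7), which is why every pairwise union covers all ten cases.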
Determining Significance Thresholds for Scores
• Goal: Derive approximation for maximum similarity by chance
• Analysis of score distribution for 1M (hopefully) unrelated document
pairs (no common authors, do not cite each other)
• Threshold = score of highest ranked document pair without
noticeable topical relatedness
Table 3: Significance thresholds for similarity measures.
     Histo  LCIS   GIT    BC     LCCS   GCT    Enco
s    ≥.56   ≥.76   ≥.15   ≥.13   ≥.22   ≥.10   ≥.06
Retrieval Effectiveness for Confirmed Plagiarism Cases
• Text-based approach performs best for complete retrieval process
(known deficiency of test collection)
• Only 6 / 10 cases clearly suspicious (𝑠 ≥ 0.20)
Retrieval Effectiveness for Confirmed Plagiarism Cases
• Best math-based approach (GIT) achieves same MRR as
text-based approach (Enco) for detailed analysis
• Result achievable using multi-feature candidate retrieval
Retrieval Effectiveness for Confirmed Plagiarism Cases
• Non-textual detection approaches provide valuable indicators for
suspicious similarity in case of low textual similarity
[…] pairs. The selection criteria ought to eliminate document pairs that
exhibit high content similarity for likely legitimate reasons, i.e.,
reusing own work and referring to the work of others with due
attribution. Our goal was to estimate an upper bound for similarity
scores that likely result from random feature matches. To do so,
we manually assessed the topical relatedness of the top-ranked
document pairs within the random sample of 1M document pairs for
each similarity measure. We picked as the significance threshold
for a similarity measure the score of the first document pair for
which we could not identify a topical relatedness. Table 3 shows
the significance thresholds we derived using this procedure.
Figure 2 shows the distribution of the similarity scores s (vertical
axis) computed using each similarity measure for the random sample
of 1M document pairs. Large horizontal bars shaded in blue indicate
the median score; small horizontal bars shaded in grey mark the […]
Retrieval Effectiveness for Confirmed Plagiarism Cases
• Combined feature analysis, e.g., basic set union, achieves suspicious
scores for 9 / 10 test cases (see underlined values)
• Only C7 could not be identified
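The combination by set union can be sketched as follows: a document pair counts as suspicious if at least one measure reaches its significance threshold from Table 3. The thresholds are taken from the table; the scores of the example pair are hypothetical and only illustrate a case with low textual but high mathematical similarity.

```python
# Significance thresholds per similarity measure (Table 3).
THRESHOLDS = {
    "Histo": 0.56, "LCIS": 0.76, "GIT": 0.15, "BC": 0.13,
    "LCCS": 0.22, "GCT": 0.10, "Enco": 0.06,
}


def suspicious_measures(scores):
    """Return the measures whose score reaches its significance threshold."""
    return [m for m, s in scores.items() if m in THRESHOLDS and s >= THRESHOLDS[m]]


# Hypothetical scores for one document pair (invented for illustration).
pair_scores = {
    "Histo": 0.41, "LCIS": 0.80, "GIT": 0.05, "BC": 0.00,
    "LCCS": 0.00, "GCT": 0.00, "Enco": 0.01,
}
flagged = suspicious_measures(pair_scores)
is_suspicious = bool(flagged)  # union: one significant measure is enough
```

In this made-up example the text-based score (Enco) stays below its threshold, yet the pair is still flagged through LCIS.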
Exploratory Study
• Retrieve candidate set (100 doc.) using best-performing math-based,
citation-based, and text-based approach for all 102,524 documents
• Form union of candidate sets
• Detailed analysis of each document against its candidate set
• Manual Investigation of top-10 results
Results Exploratory Study
• Two known cases of plagiarism (Plag)
• One so far undiscovered case confirmed as plagiarism by the author of
the source document (Susp.)
• Five cases of content reuse (CR)
– All duly cited but citation not recognized
• Two false positives (FP)
Table 5: Top-ranked documents in exploratory study.
Rank    1      2      3    4    5      6    7    8    9    10
Case    C3     C11    C12  C13  C10    C14  C15  C16  C17  C18
Rating  Plag.  Susp.  CR   FP   Plag.  FP   CR   CR   CR   CR
Newly Discovered Suspicious Case
[Figure: Source documents (S1, S2) shown next to the suspicious document.]
Conclusion & Future Work
• Math-based and citation-based detection methods complement
text-based approaches
– Improve recall for candidate retrieval stage
– Perform as well as text-based methods for detailed analysis in many cases
– Provide indicators for suspicious similarity in cases with low textual similarity
– Can identify so far undiscovered cases
• Extraction of citation data for challenging STEM documents
must be improved
– Citation-based methods currently do not achieve their full potential
• Improve math-based methods
– Include positional information for candidate retrieval stage
– Include structural and semantic information for detailed analysis
Contact:
Norman Meuschke
n@meuschke.org | @normeu
Paper, Data, Code, Prototype:
purl.org/hybridPD
Other Projects & Publications:
dke.uni-wuppertal.de
Many Thanks to DAAD
for Travel Support!
German Academic
Exchange Service
A training, certification and marketing scheme for informal dairy vendors in ...
ILRI11 views

Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations

  • 1. Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations Norman Meuschke, Vincent Stange, Moritz Schubotz, Michael Kramer, Bela Gipp
  • 2. Outline 1. Problem: Detecting academic plagiarism in math-heavy STEM disciplines 2. Methodology: Combined analysis of math-based and citation-based features for confirmed cases of plagiarism and exploratory search for unknown cases in arXiv documents 3. Results: Math-based and citation-based methods are a valuable complement to text-based methods and can identify previously undiscovered cases of academic plagiarism in math-heavy STEM documents
  • 4. Academic Plagiarism “The use of ideas, concepts, words, or structures without appropriately acknowledging the source to benefit in a setting where originality is expected.” Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
  • 5. Problem Summary • Current plagiarism detection systems: – Perform sophisticated text analysis – Find copy & paste plagiarism typical for students – Miss disguised plagiarism frequent among researchers • Our prior research: – Analyzing citation patterns [1] – Analyzing image similarity [2] [Figure: analysis of in-text citation patterns. Document A and Document B contain in-text citations to the same references [1], [2], [3]; the citation pattern of each document is extracted and the two patterns are compared.] [Figure: analysis of image similarity using perceptual hashing.] [1] B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on Document Engin. (DocEng), 2011. [2] N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based Plagiarism Detection Approach,” in Proc. Joint Conf. on Digital Libraries (JCDL), 2018.
  • 6. Problem Summary Cont. • Non-textual feature analysis (in-text citations, images) achieves good detection effectiveness for disguised plagiarism • Papers in math-heavy disciplines: – Mix natural language and mathematical content – Cite comparatively fewer sources than other STEM disciplines – Use figures sparsely • Text-based, citation-based, and image-based methods are less effective for math-heavy disciplines [Figure: plagiarized engineering paper (original: left; plagiarized: right; matching mathematical content: yellow; matching text of 10+ chars: blue).]
  • 8. Approach • Combined analysis of math-based and citation-based similarity • Pilot study [3] [Figure: pilot-study pipeline. Formulae from 10 plagiarized documents, 10 source documents, and 105,120 arXiv documents are processed with the Mathosphere framework. Features: identifiers (ci), numbers (cn), operators (co), and their combination. All-to-all comparison of documents and document partitions. Computation: relative distance of frequency histograms of feature occurrences.] [3] N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect Academic Plagiarism,” in Proc. Int. Conf. on Information and Knowledge Management (CIKM), 2017.
  • 9. Dataset • Ten confirmed cases (plagiarized document and its source) embedded in NTCIR-11 MathIR Task Dataset (105,120 arXiv docs., 60 M formulae) [Figure: compilation of the test collection. Test cases: retrieval of confirmed cases of plagiarism (Retraction Watch, VroniPlag Wiki), expert inspection to create ground truth, file conversion & cleaning (Infty Reader, LaTeXML), yielding formulae from 10 plagiarized and 10 source documents. NTCIR-11 MathIR Task Dataset (ntcir-math.nii.ac.jp): selection of arXiv documents (arXiv.org), file conversion & cleaning (LaTeXML), provision for research, yielding formulae from 105,120 arXiv documents. Both parts form the (X)HTML5 test collection.]
  • 10. Identifier Frequency Histograms (Histo) • Order-agnostic bag of identifiers • Similarity = relative difference in occurrence frequency [Figure: identifier frequency histograms of Doc 1 and Doc 2 over the identifiers r and x; the frequencies are identical, so ∆𝑓 = 0 → 𝑠 = 1.]
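The histogram comparison can be illustrated with a minimal Python sketch. The slide only states that similarity is the relative difference in occurrence frequency; the normalization used here (absolute differences divided by the larger count per identifier) and the name `histo_similarity` are assumptions for illustration, not the authors' exact formula.

```python
from collections import Counter

def histo_similarity(ids_a, ids_b):
    """Order-agnostic similarity of two identifier bags (1.0 = identical histograms).

    Assumed normalization: summed absolute frequency differences divided by
    the summed per-identifier maximum count.
    """
    hist_a, hist_b = Counter(ids_a), Counter(ids_b)
    identifiers = set(hist_a) | set(hist_b)
    if not identifiers:
        return 1.0  # two empty bags are treated as identical
    diff = sum(abs(hist_a[i] - hist_b[i]) for i in identifiers)
    total = sum(max(hist_a[i], hist_b[i]) for i in identifiers)
    return 1.0 - diff / total

# Same identifiers with the same frequencies, different order: no difference.
doc1 = ["x", "x", "r"]
doc2 = ["r", "x", "x"]
# A hypothetical third document with a different identifier mix.
doc3 = ["x", "y"]

same = histo_similarity(doc1, doc2)
different = histo_similarity(doc1, doc3)
```

Because the measure ignores order, `doc1` and `doc2` are indistinguishable, which is the case the slide marks with ∆𝑓 = 0 → 𝑠 = 1.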
  • 11. Approach • Combined analysis of math-based and citation-based similarity • Pilot study [3]: – MRR = 0.86 (identifiers, whole documents) – MRR = 0.70 (feature combination, document partitions) [Figure: pilot-study pipeline. Formulae from 10 plagiarized documents, 10 source documents, and 105,120 arXiv documents are processed with the Mathosphere framework. Features: identifiers (ci), numbers (cn), operators (co), and their combination. All-to-all comparison of documents and document partitions. Computation: relative distance of frequency histograms of feature occurrences.] [3] N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect Academic Plagiarism,” in Proc. Int. Conf. on Information and Knowledge Management (CIKM), 2017.
  • 12. Contributions of this Study 1. Two-step retrieval process – Candidate retrieval (CR) – Detailed analysis (DA) 2. New math-based similarity measures – Candidate retrieval: Adaptation of Lucene’s scoring function to math features – Detailed analysis: Order-considering similarity measures 3. Combined analysis with citation-based methods – Performed well for disguised plagiarism in our prior research 4. Exploratory study in 102,524 arXiv documents – Search for undiscovered cases of plagiarism
  • 13. Candidate Retrieval • Dataset same as in pilot study – Ten confirmed cases (plagiarized document and its source) embedded in NTCIR-11 MathIR Task Dataset (105,120 arXiv docs., 60 M formulae) • Lucene’s Practical Scoring Function – Combination of tf/idf vector space and Boolean retrieval model • Features: – Mathematical identifiers (boost = # of occurrences in document) – In-text citations – Terms (default) • Retrieve 100 top-ranked documents for each query as candidates
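For orientation, Lucene's classic TF-IDF similarity documents its practical scoring function in roughly the form below. The slide does not state which Lucene version or similarity variant was used, so this is background rather than the study's exact configuration; the symbols $N$ (number of indexed documents) and $\mathrm{boost}(t)$ are notation chosen here. The identifier boost mentioned on the slide would enter through the boost factor.

```latex
\[
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)
\]
\[
\mathrm{tf}(t,d)=\sqrt{\mathrm{freq}(t,d)},\qquad
\mathrm{idf}(t)=1+\ln\frac{N}{\mathrm{docFreq}(t)+1}
\]
```

The Boolean part of the model decides which documents match the query at all; the sum then ranks the matching documents.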
  • 14. Detailed Analysis - Mathematical Features • Identifier Frequency Histograms (Histo) – Same as in pilot study • Greedy Identifier Tiles (GIT) – Contiguous identifiers in same order (minimum length of 5) • Longest Common Identifier Subsequence (LCIS) – Identifiers in same order but not necessarily contiguous • Example: Doc 1: $\beta(x) = (L_{g_j} L_f^{p-1} h_i(x))$; Doc 2: $\beta_{i,j}(x) = L_{g,j} L_f^{k_i-1} h_i(x)$; $l_{\mathrm{GIT}} = 6$, $l_{\mathrm{LCIS}} = 10$
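The two order-considering measures can be sketched in Python as follows. The identifier sequences are an assumed left-to-right reading of the two example formulae on the slide, and the tiling routine is a simplified quadratic variant of greedy tiling written for clarity, not the authors' implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (order kept, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def greedy_tiles(a, b, min_len=5):
    """Repeatedly take the longest unused contiguous match of at least min_len.

    Returns tiles as (start_in_a, start_in_b, length) with 0-based starts.
    """
    used_a, used_b = [False] * len(a), [False] * len(b)
    tiles = []
    while True:
        best = (0, 0, 0)  # (length, start_in_a, start_in_b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not used_a[i + k] and not used_b[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length < min_len:
            return tiles
        for k in range(length):
            used_a[i + k] = used_b[j + k] = True
        tiles.append((i, j, length))

# Assumed identifier order of the slide's example formulae (numbers and operators omitted).
doc1 = ["β", "x", "L", "g", "j", "L", "f", "p", "h", "i", "x"]
doc2 = ["β", "i", "j", "x", "L", "g", "j", "L", "f", "k", "i", "h", "i", "x"]

tiles = greedy_tiles(doc1, doc2)      # one tile of length 6: x L g j L f
lcis = lcs_length(doc1, doc2)         # 10: every identifier of doc1 except p
```

Under this reading the sketch reproduces the values shown on the slide: the longest tile has length 6 and the longest common identifier subsequence has length 10.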
  • 15. Detailed Analysis - Citation & Text Features • Bibliographic Coupling (BC) – Order-agnostic bag of references – Example: Doc A and Doc B both cite [1], [2], [3], so 𝑠BC = 3 • Greedy Citation Tiles (GCT) – Contiguous in-text citations in same order – Minimum length of 2 – Example: Doc A: 123xx45x6xxx, Doc B: 45xx123xxxx6; tiles (start in A, start in B, length): I (1,5,3), II (6,1,2), III (9,12,1) • Longest Common Citation Sequence (LCCS) – Citations in same order but not necessarily contiguous – Example: Doc A: xx1xx2xx3456, Doc B: x1x652xxx43x; LCCS: 1,2,3 • Text-based: Encoplot – Efficiency-optimized character 16-gram comparison
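A minimal Python sketch of the two order-agnostic ideas on this slide: bibliographic coupling as the number of shared references, and a character 16-gram overlap. The second function is a plain set intersection used as a stand-in; it is not the Encoplot algorithm, which is an efficiency-optimized pairwise matching procedure. Function names and the sample strings are hypothetical.

```python
def bibliographic_coupling(refs_a, refs_b):
    """Number of references that both documents cite (order is ignored)."""
    return len(set(refs_a) & set(refs_b))

def shared_char_ngrams(text_a, text_b, n=16):
    """Count distinct character n-grams occurring in both texts.

    Simplified stand-in for a character 16-gram comparison, not Encoplot itself.
    """
    grams_a = {text_a[i:i + n] for i in range(len(text_a) - n + 1)}
    grams_b = {text_b[i:i + n] for i in range(len(text_b) - n + 1)}
    return len(grams_a & grams_b)

# Slide example: both documents cite [1], [2], and [3].
s_bc = bibliographic_coupling(["[1]", "[2]", "[3]"], ["[1]", "[2]", "[3]"])

# Hypothetical strings sharing an 18-character run, i.e. three 16-grams.
overlap = shared_char_ngrams("abcdefghijklmnopqrst", "xxabcdefghijklmnopqryy")
```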
  • 17. Candidate Retrieval (+ = source retrieved among the candidates, – = not retrieved, R = recall)
          C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  R
    Math  +   +   +   –   –   –   +   +   +   +    0.7
    Cit.  +   +   –   +   +   +   +   +   +   +    0.9
    Text  +   +   +   +   +   +   –   +   +   +    0.9
    • Citation-based and text-based approaches perform better than math-based analysis (potential for improvement) • Basic union of the result sets of any two of the approaches achieves 100% recall
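The recall figures and the claim about pairwise unions can be checked directly from the +/– marks on this slide. The short Python sketch below encodes the three rows as strings; the variable and function names are chosen here for illustration.

```python
from itertools import combinations

CASES = [f"C{i}" for i in range(1, 11)]

# One mark per test case C1..C10, copied from the slide ("+" = source retrieved).
RETRIEVED = {
    "Math": "+++---++++",
    "Cit.": "++-+++++++",
    "Text": "++++++-+++",
}

def hits(marks):
    """Set of test cases whose source document was retrieved."""
    return {case for case, mark in zip(CASES, marks) if mark == "+"}

def recall(found):
    return len(found) / len(CASES)

single = {name: recall(hits(marks)) for name, marks in RETRIEVED.items()}
pairs = {
    (a, b): recall(hits(RETRIEVED[a]) | hits(RETRIEVED[b]))
    for a, b in combinations(RETRIEVED, 2)
}
```

Each approach misses different cases (math: C4 to C6, citations: C3, text: C7), which is why the union of any two result sets covers all ten.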
  • 18. Determining Significance Thresholds for Scores • Goal: Derive approximation for maximum similarity by chance • Analysis of score distribution for 1M (hopefully) unrelated document pairs (no common authors, do not cite each other) • Threshold = score of highest-ranked document pair without noticeable topical relatedness
    Table 3: Significance thresholds for similarity measures.
         Histo  LCIS   GIT    BC     LCCS   GCT    Enco
    s    ≥.56   ≥.76   ≥.15   ≥.13   ≥.22   ≥.10   ≥.06
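Applying the thresholds from Table 3 amounts to a per-measure comparison. The Python sketch below shows one way to flag the measures on which a document pair exceeds chance-level similarity; the function name and the example scores are hypothetical and not taken from the study.

```python
# Significance thresholds from Table 3 (a score at or above the value counts as significant).
THRESHOLDS = {
    "Histo": 0.56, "LCIS": 0.76, "GIT": 0.15,
    "BC": 0.13, "LCCS": 0.22, "GCT": 0.10, "Enco": 0.06,
}

def suspicious_measures(scores):
    """Return the measures whose score reaches the significance threshold.

    Measures missing from `scores` are treated as 0.0 (not significant).
    """
    return [m for m, t in THRESHOLDS.items() if scores.get(m, 0.0) >= t]

# Hypothetical scores for one document pair, for illustration only.
example_scores = {"Histo": 0.60, "LCIS": 0.40, "GIT": 0.15, "Enco": 0.02}
flagged = suspicious_measures(example_scores)
```

A pair like this one would be flagged by math-based measures even though its text-based score stays below the threshold.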
  • 19. Retrieval Effectiveness for Confirmed Plagiarism Cases • Text-based approach performs best for complete retrieval process (known deficiency of test collection) • Only 6 / 10 cases clearly suspicious (𝑠 ≥ 0.20)
  • 20. Retrieval Effectiveness for Confirmed Plagiarism Cases • Best math-based approach (GIT) achieves same MRR as text-based approach (Enco) for detailed analysis • Result achievable using multi-feature candidate retrieval
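MRR (mean reciprocal rank) averages, over all queries, the reciprocal of the rank at which the correct source document appears. A minimal Python sketch follows; the ranks in the example are hypothetical and not the study's results.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries.

    `ranks` holds the rank of the true source for each query document,
    or None when the source was not retrieved (contributes 0).
    """
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks for four query documents: two hits at rank 1,
# one at rank 2, and one source that was not retrieved.
example_mrr = mean_reciprocal_rank([1, 1, 2, None])
```

An MRR of 1.0 would mean that every source document is ranked first for its plagiarized counterpart.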
  • 21. Retrieval Effectiveness for Confirmed Plagiarism Cases • Non-textual detection approaches provide valuable indicators for suspicious similarity in case of low textual similarity [Excerpt from the paper:] “[…] The selection criteria ought to eliminate document pairs that exhibit high content similarity for likely legitimate reasons, i.e., reusing own work and referring to the work of others with due attribution. Our goal was to estimate an upper bound for similarity scores that likely result from random feature matches. To do so, we manually assessed the topical relatedness of the top-ranked document pairs within the random sample of 1M documents for each similarity measure. We picked as the significance threshold for a similarity measure the rank of the first document pair for which we could not identify a topical relatedness. Table 3 shows the significance scores we derived using this procedure. Figure 2 shows the distribution of the similarity scores s (vertical axis) computed using each similarity measure for the random sample of 1M documents. Large horizontal bars shaded in blue indicate the median score; small horizontal bars shaded in grey mark the […]”
  • 22. Retrieval Effectiveness for Confirmed Plagiarism Cases • Combined feature analysis, e.g., basic set union, achieves suspicious scores for 9 / 10 test cases (see underlined values) • Only C7 could not be identified
  • 23. Exploratory Study • Retrieve candidate set (100 doc.) using best-performing math-based, citation-based, and text-based approach for all 102,524 documents • Form union of candidate sets • Detailed analysis of each document against its candidate set • Manual investigation of top-10 results
  • 24. Results Exploratory Study • Two known cases of plagiarism (Plag.) • One previously undiscovered case confirmed as plagiarism by the author of the source document (Susp.) • Five cases of content reuse (CR) – All duly cited but citation not recognized • Two false positives (FP)
    Table 5: Top-ranked documents in exploratory study.
    Rank    1      2      3    4    5      6    7    8    9    10
    Case    C3     C11    C12  C13  C10    C14  C15  C16  C17  C18
    Rating  Plag.  Susp.  CR   FP   Plag.  FP   CR   CR   CR   CR
  • 25. Newly Discovered Suspicious Case [Figure: source documents (S1, S2) shown next to the suspicious document.]
  • 26. Conclusion & Future Work • Math-based and citation-based detection methods complement text-based approaches – Improve recall for candidate retrieval stage – Perform equally well as text-based methods for detailed analysis in many cases – Provide indicators for suspicious similarity in cases with low textual similarity – Can identify previously undiscovered cases • Extraction of citation data for challenging STEM documents must be improved – Citation-based methods currently do not achieve their full potential • Improve math-based methods – Include positional information for candidate retrieval stage – Include structural and semantic information for detailed analysis
  • 27. Contact: Norman Meuschke n@meuschke.org | @normeu Paper, Data, Code, Prototype: purl.org/hybridPD Other Projects & Publications: dke.uni-wuppertal.de Many Thanks to DAAD for Travel Support! German Academic Exchange Service