Analyzing Nontextual Content Features to Detect Academic Plagiarism

Norman Meuschke
Information Science Group
University of Konstanz
www.isg.uni.kn
Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 1

Outline
• Introduction
• Short Bio
• Overview of Research Group
• Overview of Academic Plagiarism Detection
• Research Approach: Analyzing Nontextual Content Features
• (Analyzing Academic Citations)
• Analyzing Mathematics
• Analyzing Images

Short Bio
• Univ. of Magdeburg: BA / MA Information Systems
• UC Berkeley, California: MA Thesis, Dept. of Statistics
• SciPlore Startup (2011 – 2014)
• National Institute of Informatics Tokyo: PhD (2014 – Jan 2015)
• Univ. of Konstanz: PhD (since Feb. 2015)

University of Konstanz

Research Group – www.isg.uni.kn
Prof. Bela Gipp, Dr. Moritz Schubotz, Corinna Breitinger, Philip Ehret,
André Greiner-Petter, Felix Hamborg, Thomas Hepp, Philipp Scharpf,
Alexander Schönhals, Malte Schwarzer, Vincent Stange, Patrick Wortner
visiting researcher
at NII

Research Areas
• Information Science
• applied, use-case driven and user-focused computer science
• We focus on three areas
• Semantic Document Analysis & Document Retrieval
• Mathematical Information Retrieval
• Blockchain Applications

Semantic Document Analysis & Document Retrieval
• Plagiarism Detection
• Literature Recommendation
• Research Papers
• Wikipedia
• Legal Documents
• News Analysis
• Media Bias Detection
Corinna Breitinger, Felix Hamborg, Philipp Scharpf, Malte Schwarzer,
Vincent Stange, Norman Meuschke, Gent Ymeri, Max Kutzner, Anastasia Zhukova

Mathematical Information Retrieval
• Conversion and Enrichment
• Search
• Recommendation
Philipp Scharpf, Dr. Moritz Schubotz, André Greiner-Petter, Felix Petersen

Blockchain Applications
• Trusted Timestamping
• Blockchain for Science
Vincent Stange, Thomas Hepp, Christopher Gondek, Daniel Muffler,
Jannik Bamberger, Philip Ehret, Alexander Schönhals, Patrick Wortner

Analyzing Nontextual Content Features
to Detect Academic Plagiarism

Academic Plagiarism
“The use of ideas, concepts, words, or structures
without appropriately acknowledging the source to
benefit in a setting where originality is expected.”
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that
transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.

Plagiarism Forms
Note: plagiarism forms are not mutually exclusive
• Copy & paste
  • taking content verbatim from another source
• Shake & paste
  • copy & paste of text segments with slight adjustments, e.g., synonym substitutions
• Technical disguise
  • techniques that exploit weaknesses of current detection methods
• Paraphrasing
  • intentional rewriting
  • no or insufficient reference to the source
• Structural and idea plagiarism
  • little or no verbatim text overlap
• Cross-language plagiarism
  • manual or automated conversion of text into another language to hide its origin
Level of obfuscation: weak → strong

Detection Capabilities
• Copy & paste, Shake & paste (solved)
  • n-gram fingerprinting
  • vector space models
  • text alignment
  • exhaustive string matching
• Technical disguise (solvable, no research needed)
  • encoding checks
  • checks for textual content
  • checks for large images
• Paraphrasing, Structural and idea plagiarism
  • synonym expansion (WordNet)
  • Semantic Role Labeling
  • Latent Semantic Analysis
  • POS-aware text matching
• Cross-language plagiarism
  • CL Character N-Gram Comp.
  • CL Explicit Semantic Analysis
  • CL Alignment-based Similarity Analysis
• Status for strongly obfuscated forms
  • intense research
  • methods limited by text-based candidate retrieval (R approx. 0.8 for moderate disguise)
Level of obfuscation: weak → strong

Research Approach
• Analyze similarity features that:
• contain a high degree of semantic information
• exhibit low variability in their representations
• are not easily substitutable
• Combine the analysis of nontextual and textual content features

Our Research
[Figure: hybrid detection process running from start to end. Stage labels: candidate retrieval, heuristics, detailed comparison, post processing, visualization, human inspection. Similarity features: full text similarity, text snippets, citation patterns, fuzzy citation patterns, mathematical formulae, image similarity, semantic similarity, cross-lingual similarity. Legend: completed research, current research, future research.]

Analyzing Academic Citations
B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection:
Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on
Document Engineering (DocEng ’11), 2011.
B. Gipp, N. Meuschke, C. Breitinger, M. Lipinski, and A. Nuernberger, “Demonstration of Citation Pattern
Analysis for Plagiarism Detection,” in Proc. ACM SIGIR Conf., 2013.
B. Gipp, N. Meuschke, and C. Breitinger, “Citation-based Plagiarism Detection: Practicability on a
Large-scale Scientific Corpus,” JASIST, vol. 65, iss. 8, pp. 1527-1540, 2014.
B. Gipp, Citation-based Plagiarism Detection – Detecting Disguised and Cross-language Plagiarism using
Citation Pattern Analysis, Springer Vieweg Research, 2014.

Analyzing Citation Patterns
[Figure: two example documents, Document A and Document B, whose in-text citations [1], [2], [3] refer to the shared sources Doc C, Doc D, and Doc E.]
Citation Pattern Doc A: E D C
Citation Pattern Doc B: D E C D C
Pattern Comparison
Doc A: Ins. E Ins. D C
Doc B: D    E C    D C
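
The pattern comparison can be illustrated with a longest common citation sequence, one of the measures named in the DocEng ’11 reference above. The following is a minimal sketch, assuming that citation patterns are given as strings of source labels; the function name is illustrative and not taken from the authors' code.

```python
def longest_common_citation_sequence(a, b):
    """Return one longest common subsequence of two citation patterns."""
    # table[i][j] holds the length of the longest common subsequence
    # of the prefixes a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back through the table to recover the shared citations in order.
    shared = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            shared.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(shared))


doc_a = "EDC"    # citation pattern of Doc A from the slide
doc_b = "DECDC"  # citation pattern of Doc B from the slide
print(longest_common_citation_sequence(doc_a, doc_b))
```

For the slide's example, all three citations of Doc A reappear in the same order in Doc B, which matches the alignment with two insertions.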

Conclusion: Citation-based Plagiarism Detection (CbPD) Evaluation
• Successes
• Significantly higher detection performance for disguised plagiarism
in biomedicine
• Decrease in user effort (time savings for examination)
• First approach to allow n:n comparisons for large collections
• Limitations
• Effectiveness varies depending on discipline
• Low for mathematics and physics
• Effectiveness is low for shorter cases of disguised plagiarism

Analyzing Mathematical Expressions
N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect
Academic Plagiarism”, in Proc. ACM CIKM, 2017.
N. Meuschke, M. Schubotz, M. Kramer, V. Stange, and B. Gipp, “Improving External Plagiarism Detection
for Academic Documents by Analyzing Mathematical Expressions”, in review at ACM CIKM, 2018.

Characteristics of Math-Heavy Texts
• Texts in math-heavy disciplines interweave natural
and symbolic language
… one is not understandable without the other

Characteristics of Math-Heavy Texts
• Mathematical expressions share many characteristics
of academic citations
• much semantic information
• language-independent
• hard to leave out or substitute (yet easy to obfuscate)

Initial MathPD Study
• Preliminary:
• Manual analysis of 39 confirmed cases of plagiarism
• Automated retrieval experiments:
• 10 real-world cases of mathematical plagiarism; source documents
embedded in NTCIR-11 Math Retrieval corpus
(105K arXiv documents, 60M mathematical expressions)

Math-based Feature Comparison
[Figure: Mathosphere Framework. Formulae from 10 plagiarized doc., 10 source doc., and 105,120 arXiv doc. are split into features: identifiers (ci), e.g., r, x; numbers (cn), e.g., 2, 3; operators (co), e.g., -; and the feature combination, e.g., r, x, 2, 3, -. For each feature type, histograms of Doc1 and Doc2 yield a distance Δ (values shown: .13, .50).]
• All-to-all comparison of documents and document partitions
• Computation: relative distance of frequency histograms of feature occurrences
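
The slide names the computation (relative distance of frequency histograms of feature occurrences) but does not give the formula. The sketch below is therefore only one plausible reading: it assumes the distance is the sum of absolute count differences divided by the sum of the larger counts, which yields values between 0 and 1. The function name and the normalization are assumptions, not the Mathosphere implementation.

```python
from collections import Counter


def relative_histogram_distance(features_1, features_2):
    """Relative distance of two feature-occurrence histograms.

    Assumed normalization (not given on the slide): the summed absolute
    count differences are divided by the summed maximum counts.
    """
    hist_1, hist_2 = Counter(features_1), Counter(features_2)
    keys = set(hist_1) | set(hist_2)
    if not keys:
        return 0.0  # two documents without features are treated as identical
    difference = sum(abs(hist_1[k] - hist_2[k]) for k in keys)
    total = sum(max(hist_1[k], hist_2[k]) for k in keys)
    return difference / total


# Identifiers extracted from two hypothetical documents.
doc1_identifiers = ["r", "x", "x"]
doc2_identifiers = ["r", "x"]
print(relative_histogram_distance(doc1_identifiers, doc2_identifiers))
```

The same function can be applied separately to identifiers, numbers, and operators, or to their combination, mirroring the feature types on the slide.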

Results Initial MathPD Study
Source retrieved at rank:

        full document                       partitions
Case    D        dci   dcn      dco         D        dci     dcn      dco
C1      3,606    1     27,857   30,784      1        1       85,418   99,201
C2      1        1     88,891   90,962      1        1       12,266   10,277
C3      11,628   2     28,415   3,144       1        16      34,966   5,757
C4      2,581    1     1,950    86          189      6       54,560   18,374
C5      1        1     5,790    22,408      1        6       92,951   16,180
C6      25,498   12    19,862   38,145      7,976    3       24,405   72,687
C7      1        1     4,690    1,627       19,900   1       67,614   14,758
C8      1        1     39,215   11,576      1        1       21,152   9,475
C9      1        1     13,591   35,393      1        1       11,519   32,687
C10     1        1     76,678   30,673      1        1,223   89,703   3,280
MRR     0.60     0.86  < 0.01   < 0.01      0.70     0.57    < 0.01   < 0.01
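
The MRR row can be recomputed from the ranks in the table. The sketch below uses the standard definition of the mean reciprocal rank (the mean of 1/rank over all cases); the variable names are illustrative.

```python
def mean_reciprocal_rank(ranks):
    """Mean of the reciprocal ranks at which the true source was retrieved."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Ranks for cases C1 to C10, copied from the results table.
full_document_d = [3606, 1, 11628, 2581, 1, 25498, 1, 1, 1, 1]
full_document_dci = [1, 1, 2, 1, 1, 12, 1, 1, 1, 1]
partitions_d = [1, 1, 1, 189, 1, 7976, 19900, 1, 1, 1]
partitions_dci = [1, 1, 16, 6, 6, 3, 1, 1, 1, 1223]

for name, ranks in [
    ("full document, D", full_document_d),
    ("full document, dci", full_document_dci),
    ("partitions, D", partitions_d),
    ("partitions, dci", partitions_dci),
]:
    print(f"{name}: MRR = {mean_reciprocal_rank(ranks):.2f}")
```

Rounded to two decimals, these values reproduce the 0.60, 0.86, 0.70, and 0.57 reported in the MRR row.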

Extension to Math-based Retrieval Process
• Candidate Retrieval using Elasticsearch
• Detailed Analysis using Pattern Analysis for Identifiers
[Figure: system architecture. Labels: user, input document(s), preprocessing, document full text, MathML formulae, citations & references, text fingerprints, math. identifiers (list & histogram), unified document, indexing, data storage, start analysis (e.g., Doc. ID=78), candidate retrieval, candidate documents, detailed analysis, similar documents, human inspection.]

Results
• Math-based approach
• candidate retrieval step reduced recall (R = 0.7)
• detailed pattern analysis increased MRR to 0.93
• Combined analysis (math, citation, text) could identify suspicious
similarity for all cases (R = 1, MRR = 1)

Analyzing Images
N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based
Plagiarism Detection Approach,” in Proc. ACM/IEEE-CS Joint Conf. on Digital Libraries (JCDL), 2018.
N. Meuschke, V. Stange, M. Schubotz, and B. Gipp, “HyPlag: A Hybrid Approach to Academic Plagiarism
Detection,” in Proc. ACM SIGIR Conf., 2018.

Idea of Image-based Plagiarism Detection
• Images in academic documents convey much semantic
information in compressed format independent of the text
• Much research on Content-based Image Retrieval (CBIR)
• Little adaptation of CBIR methods to plagiarism detection (PD)
• exact and cropped image copies
• affinely transformed images (scaling, rotation, projection)
• slight alterations of appearance (blurring, lower resolution, noise)

Research Gap
• Current image-based PD approaches are problematic for:
• compound images
• rearranged images
• images mostly containing text (typically tables inserted as figures)
• visually differing, semantically equivalent data visualizations
• Goal: image-based PD process that:
• combines established and new analysis methods to cover
heterogeneous images in academic documents
• adaptively applies suitable analysis steps
• flexibly quantifies suspiciousness
• is extensible in the future

Process
[Figure: input doc. → extract image → decompose image → classify image → perceptual hashing, ratio hashing, OCR with k-gram text matching and positional text matching → distance calculation against the reference DB (D_pHash, D_rHash, D_kTM, D_posTM) → outlier detection: s(D_m) > r → potential source images]

Perceptual Hashing
• Efficient CBIR method to reliably find near image copies
• Uses most apparent visual features in images
• Creates non-unique fingerprints that can be compared
• Fingerprints are invariant to:
• scaling
• aspect ratio changes
• changes to brightness, contrast and colors
• We use Discrete Cosine Transformation and Hamming Distance
Image Source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e
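
The combination of a Discrete Cosine Transform and the Hamming distance can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the image has already been converted to grayscale and resized to a small square matrix, keeps the 8×8 low-frequency block, drops the DC coefficient, and thresholds at the median. All function names are hypothetical.

```python
import math


def dct_1d(values):
    """Unnormalized DCT-II of a one-dimensional sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
            for x, v in enumerate(values))
        for u in range(n)
    ]


def dct_2d(matrix):
    """Separable two-dimensional DCT: rows first, then columns."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def perceptual_hash(gray, size=8):
    """Bit list derived from the low-frequency DCT coefficients."""
    coeffs = dct_2d(gray)
    low = [coeffs[u][v] for u in range(size) for v in range(size)]
    ac = low[1:]  # drop the DC term, which only encodes mean brightness
    median = sorted(ac)[len(ac) // 2]
    return [1 if c > median else 0 for c in ac]


def hamming_distance(hash_1, hash_2):
    """Number of positions at which two fingerprints differ."""
    return sum(b1 != b2 for b1, b2 in zip(hash_1, hash_2))


# Synthetic 16x16 grayscale "image" and a copy with doubled contrast.
image = [[(3 * x * x + 7 * y + 5 * x * y) % 256 for x in range(16)]
         for y in range(16)]
higher_contrast = [[2 * p for p in row] for row in image]

print(hamming_distance(perceptual_hash(image),
                       perceptual_hash(higher_contrast)))
```

Scaling all pixel values scales every coefficient by the same factor, so the comparison against the median is unchanged; a uniform brightness shift mainly affects the DC term, which this sketch discards.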

k-gram Text Matching
• To identify tables inserted as figures
and images with little visual similarity
• Text extracted using open source
OCR engine Tesseract
• Granularity:
• character 3-grams
• no chunk selection
• Similarity measure: $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$
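
A minimal sketch of the k-gram measure, assuming that K1 and K2 are the sets of character 3-grams of the two OCR texts and that ⊖ denotes the symmetric difference. Returning infinity when no 3-gram is shared is an assumption made here to avoid a division by zero.

```python
import math


def char_kgrams(text, k=3):
    """Set of all character k-grams of a text (no chunk selection)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def kgram_distance(text_1, text_2, k=3):
    """Size of the symmetric difference divided by the size of the intersection."""
    k1, k2 = char_kgrams(text_1, k), char_kgrams(text_2, k)
    shared = k1 & k2
    if not shared:
        return math.inf  # assumption: no overlap means maximal distance
    return len(k1 ^ k2) / len(shared)


print(kgram_distance("table", "tables"))
```

In this example the two texts share three 3-grams and differ in one, so the distance is one third; identical texts yield a distance of zero.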

Position-aware Text Matching
• To account for the typically small amount of text in images, a problem
aggravated by OCR errors
• Process:
• Scale images to same height (here: 800px)
• Define proximity region around identified text (here 50px circle)
• Project proximity regions of input image to potential source
• Only consider matching characters in projected proximity regions
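$s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$
[Figure: an input image and a reference image, both scaled to a height of 800px ($h_1 = h_2 = 800\,\mathrm{px}$, widths $w_1$ and $w_2$), with proximity regions of radius $r = 25\,\mathrm{px}$ around characters such as A, B, C, D, and X. Legend: positional character match, positional character mismatch.]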
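
The process above can be sketched as follows. The sketch assumes that OCR returns each character with the pixel coordinates of its center, that projection amounts to scaling both images to a height of 800px, and that each reference character can be matched at most once. These choices and all names are assumptions for illustration.

```python
import math


def scale_to_height(chars, image_height, target_height=800):
    """Rescale (char, x, y) tuples so the image height becomes target_height."""
    factor = target_height / image_height
    return [(c, x * factor, y * factor) for c, x, y in chars]


def positional_similarity(input_chars, input_height,
                          ref_chars, ref_height, radius=25.0):
    """Share of characters that match within the projected proximity region."""
    scaled_input = scale_to_height(input_chars, input_height)
    scaled_ref = scale_to_height(ref_chars, ref_height)
    if not scaled_input or not scaled_ref:
        return 0.0
    unused = list(scaled_ref)
    matches = 0
    for char, x, y in scaled_input:
        for candidate in unused:
            ref_char, ref_x, ref_y = candidate
            same_char = ref_char == char
            close = math.hypot(ref_x - x, ref_y - y) <= radius
            if same_char and close:
                matches += 1
                unused.remove(candidate)  # use each reference character once
                break
    return matches / max(len(scaled_input), len(scaled_ref))


# Hypothetical OCR output: (character, x, y) in pixels.
input_image = [("A", 10, 10), ("B", 100, 100)]           # image height 400px
reference_image = [("A", 25, 22), ("B", 400, 400), ("X", 30, 30)]  # 800px
print(positional_similarity(input_image, 400, reference_image, 800))
```

Here only "A" lies inside the proximity region after scaling, so one of at most three characters matches.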

Ratio Hashing
• First approach to target reuse of data (and its visualization)
• identifies equivalent, yet visually differing bar charts
[Figure: two visually differing bar charts (y-axis from 0 to 900) whose bars have the same relative heights: 1.00, 0.80, 0.61, 0.44, 0.30, 0.07.]
$d = |1.00 - 1.00| + |0.80 - 0.80| + |0.61 - 0.61| + |0.44 - 0.44| + |0.30 - 0.30| + |0.07 - 0.07| = 0.00$
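
The distance above can be sketched in a few lines, assuming that the bar heights have already been extracted from the chart image. Sorting the ratios in descending order and padding shorter charts with zeros are assumptions of this sketch, as are the function names.

```python
def ratio_hash(bar_heights):
    """Heights relative to the tallest bar, sorted in descending order."""
    tallest = max(bar_heights)
    return sorted((h / tallest for h in bar_heights), reverse=True)


def ratio_hash_distance(heights_1, heights_2):
    """Sum of absolute differences between two ratio hashes."""
    r1, r2 = ratio_hash(heights_1), ratio_hash(heights_2)
    length = max(len(r1), len(r2))
    r1 += [0.0] * (length - len(r1))  # assumption: pad the shorter chart
    r2 += [0.0] * (length - len(r2))
    return sum(abs(a - b) for a, b in zip(r1, r2))


# Same data drawn at a different scale and in a different bar order.
chart_1 = [900, 720, 549, 396, 270, 63]
chart_2 = [31.5, 135, 198, 274.5, 360, 450]
print(ratio_hash(chart_1))
print(ratio_hash_distance(chart_1, chart_2))
```

Because only relative heights enter the hash, rescaled or reordered bars leave the distance at zero, which is the behavior the slide's example shows.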

Outlier Detection
• To quantify suspiciousness of method-specific distance scores
• Two assumptions:
• image only suspicious if comparably high similarity (small distance)
to small set (c=9) of other images
• clear separation of distance scores of highly similar set of images
[Figure: absolute distance scores $D_m$ on an axis (0, 25, 33, 40, 80, 90) and the relative deltas of distance scores $D'_m$ (0.3, 0.2, 1, 0.1), e.g., $d'_i = (80 - 40) / 40 = 1$. The list is split into $D'_{m,1}$ (outliers considered as potential sources) and $D'_{m,2}$ (images considered as unrelated) where $d'_k \geq 1$. Condition for list split: $k \leq c$.]

Outlier Detection Continued
• Find outlier group:
• split list of relative distance deltas if a distance is at least twice as
large as its predecessor (3x as large for k-gram matching)
• Score suspicious (𝑠 ≥ 0.5 ) if least similar outlier has distance
margin to collection that is twice as large as outlier’s distance to the
input image
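
The list split can be sketched as follows. The sketch covers only the split rule (a distance at least `factor` times its predecessor, with at most c = 9 outliers); it does not compute the suspiciousness score s, because the slides do not state its formula. The function name and the handling of zero distances are assumptions.

```python
def split_outliers(distances, factor=2.0, c=9):
    """Split sorted distances into (outliers, unrelated).

    factor=2.0 follows the "at least twice as large" rule; the slide
    states 3x for k-gram matching, i.e. factor=3.0.
    """
    ordered = sorted(distances)
    for k in range(1, len(ordered)):
        if k > c:
            break  # an outlier group larger than c is not accepted
        previous, current = ordered[k - 1], ordered[k]
        if current > 0 and current >= factor * previous:
            return ordered[:k], ordered[k:]
    return [], ordered  # no clear separation: nothing is flagged


# Distance scores from the slide's example axis.
outliers, unrelated = split_outliers([25, 33, 40, 80, 90])
print(outliers, unrelated)
```

For the example values, the jump from 40 to 80 is the first one that doubles the predecessor, so the three closest images form the outlier group; with `factor=3.0` no split occurs.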

Evaluation
• Source for test images: VroniPlag collection
• crowd-sourced effort investigating plagiarism allegations
• 196 manually examined academic works (mostly PhD theses)
• most allegations confirmed by responsible universities
• Targeted crawl for all annotated ‘fragments’ containing images
• confirmed by at least two examiners
• Selection of 15 representative cases (mostly from life sciences)
• Cases embedded in 4,500 images obtained from PubMed Central

Example: Near Copies
Source Image Reused Image

Example: Weak Alteration

Example: Moderate Alteration

Example: Strong Alteration

Results
• Suspicious scores (𝑠 ≥ 0.5) for 11 of 15 cases computed by at
least one of the methods (𝑅 = 0.73)
• Outlier detection effective (𝑃 = 1):
• For all input images with 𝑠 ≥ 0.5, true source image at the top rank
• For all input images with 𝑠 < 0.5, no source image retrieved among
the top-ten most similar images, i.e. no false positives
• Perceptual hashing with sub-image extraction worked best for
near copies and weakly altered images (found 6 of 9 cases)
• Text analysis performed better than perceptual hashing for
moderately and strongly altered images
• if the image quality was high enough to perform OCR reliably and
sufficient text content was present

Results Continued
• Text analysis approaches identified 3 of 4 cases involving tables
• position-aware text matching more robust to low OCR quality
• k-gram matching identified more cases
• combination of approaches allows processing more images
• Dataset contained only one bar chart, for which ratio hashing
yielded a highly suspicious score (𝑠 = 0.92)

Discussion & Conclusion
• Image-based PD promising complement to other methods
• Small test collection, but restrictive outlier detection procedure
will prevent false positives also in larger collections
• if reduced precision is acceptable, threshold can be changed
interactively by user
• Approach well suited for scaling
• Preprocessing in parallel
• Options described to scale analysis methods
• Approach easily extensible with new methods
• New input scores to outlier detection
• Code: www.purl.org/imagepd

Future Work
• More detection methods tailored to specific data visualizations
• Scale the process
• parallelization of preprocessing
• candidate selection for feature descriptors
• Realize hybrid process
[Figure: hybrid detection process running from start to end. Stage labels: candidate retrieval, heuristics, detailed comparison, post processing, visualization, human inspection. Similarity features: full text similarity, text snippets, citation patterns, fuzzy citation patterns, mathematical formulae, image similarity, semantic similarity, cross-lingual similarity. Legend: completed research, current research, future research.]

Hybrid Detection System

Questions?
Norman Meuschke
n@meuschke.org | @normeu
• Slides for this talk (and other talks):
www.slideshare.net/GroupGipp
• Contact, publications, other projects:
www.isg.uni.kn
• Code: www.github.com/ag-gipp