Analyzing Nontextual Content Features
to Detect Academic Plagiarism
Norman Meuschke
Information Science Group
University of Konstanz
www.isg.uni.kn
Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 1
Outline
• Introduction
• Short Bio
• Overview of Research Group
• Overview of Academic Plagiarism Detection
• Research Approach: Analyzing Nontextual Content Features
• (Analyzing Academic Citations)
• Analyzing Mathematics
• Analyzing Images
Short Bio
• UC Berkeley, California: MA Thesis, Dept. of Statistics
• SciPlore Startup (2011 – 2014)
• National Institute of Informatics, Tokyo: PhD (2014 – Jan 2015)
• Univ. of Konstanz: PhD (since Feb. 2015)
• Univ. of Magdeburg: BA / MA, Information Systems
2011
Feb 2015 - Today
University of Konstanz
Research Group – www.isg.uni.kn
Prof. Bela Gipp Dr. Moritz Schubotz Corinna Breitinger Philip Ehret
André Greiner-Petter Felix Hamborg Thomas Hepp Philipp Scharpf
Alexander Schönhals Malte Schwarzer Vincent Stange Patrick Wortner
visiting researcher
at NII
Research Areas
• Information Science
• applied, use-case driven and user-focused computer science
• We focus on three areas
• Semantic Document Analysis & Document Retrieval
• Mathematical Information Retrieval
• Blockchain Applications
Semantic Document Analysis & Document Retrieval
• Plagiarism Detection
• Literature Recommendation
• Research Papers
• Wikipedia
• Legal Documents
• News Analysis
• Media Bias Detection
Corinna Breitinger
Felix Hamborg
Philipp Scharpf Malte Schwarzer
Vincent Stange Norman Meuschke Gent Ymeri
Max Kutzner Anastasia Zhukova
Mathematical Information Retrieval
• Conversion and Enrichment
• Search
• Recommendation
Philipp Scharpf
Dr. Moritz Schubotz André Greiner-Petter
Felix Petersen
Blockchain Applications
• Trusted Timestamping
• Blockchain for Science
Vincent Stange Thomas Hepp Christopher Gondek
Daniel Muffler Jannik Bamberger
Philip Ehret Alexander Schönhals Patrick Wortner
Analyzing Nontextual Content Features
to Detect Academic Plagiarism
Academic Plagiarism
“The use of ideas, concepts, words, or structures
without appropriately acknowledging the source to
benefit in a setting where originality is expected.”
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that
transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
Plagiarism Forms
Note: plagiarism forms are not mutually exclusive
Paraphrasing
• intentional rewriting
• no or insufficient reference to the source
Structural and idea plagiarism
• little or no verbatim text overlap
Cross-language plagiarism
• manual or automated conversion of text into another language to hide its origin
Copy & paste
• taking content verbatim from another source
Shake & paste
• copy & paste of text segments with slight adjustments, e.g., synonym substitutions
Technical disguise
• techniques that exploit weaknesses of current detection methods
(Axis: level of obfuscation, from weak to strong)
Detection Capabilities
Copy & paste, Shake & paste: solved
• n-gram fingerprinting
• vector space models
• text alignment
• exhaustive string matching
Technical disguise: solvable, no research needed
• encoding checks
• checks for textual content
• checks for large images
Paraphrasing, Structural and idea plagiarism
• synonym expansion (WordNet)
• Semantic Role Labeling
• Latent Semantic Analysis
• POS-aware text matching
Cross-language plagiarism
• CL Character N-Gram Comp.
• CL Explicit Semantic Analysis
• CL Alignment-based Similarity Analysis
Status for the strongly obfuscated forms:
• intense research
• methods limited by text-based candidate retrieval (R approx. 0.8 for moderate disguise)
(Axis: level of obfuscation, from weak to strong)
Research Approach
• Analyze similarity features that:
• contain a high degree of semantic information
• exhibit low variability in their representations
• are not easily substitutable
• Combine the analysis of nontextual and textual content features
Our Research
[Figure: process overview from start to end with the stages candidate retrieval, detailed comparison, post processing, and human inspection; further labels: heuristics, visualization. Similarity features shown: full text similarity, text snippets, citation patterns, fuzzy citation patterns, mathematical formulae, image similarity, semantic similarity, cross-lingual similarity. Legend: completed research, current research, future research.]
Analyzing Academic Citations
B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection:
Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on
Document Engineering (DocEng ’11), 2011.
B. Gipp, N. Meuschke, C. Breitinger, M. Lipinski, and A. Nuernberger, “Demonstration of Citation Pattern
Analysis for Plagiarism Detection,” in Proc. ACM SIGIR Conf., 2013.
B. Gipp, N. Meuschke, and C. Breitinger, “Citation-based Plagiarism Detection: Practicability on a Large-
scale Scientific Corpus,” JASIST, vol. 65, iss. 2, pp. 1527-1540, 2014.
B. Gipp, Citation-based Plagiarism Detection – Detecting Disguised and Cross-language Plagiarism using
Citation Pattern Analysis, Springer Vieweg Research, 2014.
Analyzing Citation Patterns
[Figure: two example documents, Document A and Document B, whose in-text citations [1], [2], [3] refer to Doc C, Doc D, and Doc E. Citation pattern of Doc A: EDC. Citation pattern of Doc B: DECDC. Pattern comparison: Doc A as (Ins.) E (Ins.) D C against Doc B as D E C D C.]
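The pattern comparison shown for Doc A and Doc B can be illustrated with a longest common subsequence computation, in the spirit of the Longest Common Citation Sequence named in the references of this section. This is a minimal sketch, not the authors' implementation: the function name `longest_common_citation_sequence` is hypothetical, and only the two citation patterns EDC and DECDC are taken from the slide.

```python
def longest_common_citation_sequence(a, b):
    """Return one longest common subsequence of two citation sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = length of the longest common subsequence of a[:i] and b[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back through the table to recover the shared citations.
    shared = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            shared.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return shared[::-1]


doc_a = list("EDC")    # citation pattern of Doc A from the slide
doc_b = list("DECDC")  # citation pattern of Doc B from the slide
shared = longest_common_citation_sequence(doc_a, doc_b)
print(shared)  # ['E', 'D', 'C']
```

All three citations of Doc A reappear in the same order in Doc B, which is the kind of order-preserving overlap that remains visible even when the surrounding text has been rewritten.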
Conclusion CbPD Evaluation
• Successes
• Significantly higher detection performance for disguised plagiarism
in biomedicine
• Decrease in user effort (time savings for examination)
• First approach to allow n:n comparisons for large collections
• Limitations
• Effectiveness varies depending on discipline
• Low for mathematics and physics
• Effectiveness is low for shorter cases of disguised plagiarism
Analyzing Mathematical Expressions
N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect
Academic Plagiarism”, in Proc. ACM CIKM, 2017.
N. Meuschke, M. Schubotz, M. Kramer, V. Stange, and B. Gipp, “Improving External Plagiarism Detection
for Academic Documents by Analyzing Mathematical Expressions”, in review at ACM CIKM, 2018.
Characteristics of Math-Heavy Texts
• Texts in math-heavy disciplines interweave natural
and symbolic language
… one is not understandable without the other
Characteristics of Math-Heavy Texts
• Mathematical expressions share many characteristics
of academic citations
• much semantic information
• language-independent
• hard to leave out or substitute (yet easy to obfuscate)
Initial MathPD Study
• Preliminary:
• Manual analysis of 39 confirmed cases of plagiarism
• Automated retrieval experiments:
• 10 real-world cases of mathematical plagiarism; source documents
embedded in NTCIR-11 Math Retrieval corpus
(105K arXiv documents, 60M mathematical expressions)
Math-based Feature Comparison
[Figure: formulae from 10 plagiarized doc., 10 source doc., and 105,120 arXiv doc. are processed with the Mathosphere Framework. For each pair Doc1 and Doc2, histograms are compared for Identifiers (ci), e.g., r and x; Numbers (cn), e.g., 2 and 3; Operators (co); and the Feature Combination. Example distances shown: .13 and .50.]
All-to-all comparison of documents and document partitions
Computation: relative distance of frequency histograms of feature occurrences
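The slide names the computation only as a "relative distance of frequency histograms of feature occurrences" and does not give the formula. The sketch below therefore uses one plausible definition as an explicit assumption: the summed absolute differences of feature counts, divided by the total number of feature occurrences in both documents. The function name and the example feature lists are hypothetical.

```python
from collections import Counter


def relative_histogram_distance(hist_a, hist_b):
    """Assumed measure: summed absolute count differences, normalized to [0, 1]."""
    features = set(hist_a) | set(hist_b)
    # Counter returns 0 for features that do not occur in a document.
    difference = sum(abs(hist_a[f] - hist_b[f]) for f in features)
    total = sum(hist_a.values()) + sum(hist_b.values())
    return difference / total if total else 0.0


# Hypothetical feature occurrences for two documents, one histogram per feature type.
doc1 = {
    "identifiers": Counter(["r", "x", "x"]),
    "numbers": Counter(["2", "3"]),
    "operators": Counter(["-"]),
}
doc2 = {
    "identifiers": Counter(["r", "x", "x"]),
    "numbers": Counter(["2", "3", "3"]),
    "operators": Counter(["+"]),
}

distances = {
    kind: relative_histogram_distance(doc1[kind], doc2[kind]) for kind in doc1
}
print(distances)
```

Keeping one histogram per feature type mirrors the separate columns for identifiers, numbers, and operators; a combined score could be obtained by merging the three counters before the comparison.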
Results Initial MathPD Study
Source retrieved at rank (ci: identifiers, cn: numbers, co: operators)

      | full document                   | partitions
Case  | D       dci   dcn     dco      | D       dci    dcn     dco
C1    | 3,606   1     27,857  30,784   | 1       1      85,418  99,201
C2    | 1       1     88,891  90,962   | 1       1      12,266  10,277
C3    | 11,628  2     28,415  3,144    | 1       16     34,966  5,757
C4    | 2,581   1     1,950   86       | 189     6      54,560  18,374
C5    | 1       1     5,790   22,408   | 1       6      92,951  16,180
C6    | 25,498  12    19,862  38,145   | 7,976   3      24,405  72,687
C7    | 1       1     4,690   1,627    | 19,900  1      67,614  14,758
C8    | 1       1     39,215  11,576   | 1       1      21,152  9,475
C9    | 1       1     13,591  35,393   | 1       1      11,519  32,687
C10   | 1       1     76,678  30,673   | 1       1,223  89,703  3,280
MRR   | 0.60    0.86  < 0.01  < 0.01   | 0.70    0.57   < 0.01  < 0.01
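The MRR row follows directly from the ranks in the table: the mean reciprocal rank is the average of 1/rank over all cases. The short sketch below recomputes it for the two full-document columns D and dci; the function and variable names are illustrative only.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test cases."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Ranks of the source document for cases C1 to C10, full-document comparison.
full_document_d = [3606, 1, 11628, 2581, 1, 25498, 1, 1, 1, 1]
full_document_dci = [1, 1, 2, 1, 1, 12, 1, 1, 1, 1]

print(round(mean_reciprocal_rank(full_document_d), 2))    # 0.6
print(round(mean_reciprocal_rank(full_document_dci), 2))  # 0.86
```

A source ranked first contributes 1 to the sum, while a source ranked in the thousands contributes almost nothing, which is why the columns for numbers and operators end up below 0.01.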
Extension to Math-based Retrieval Process
• Candidate Retrieval using Elasticsearch
• Detailed Analysis using Pattern Analysis for Identifiers
[Figure: retrieval process. The user submits input document(s) and starts the analysis (example: Doc. ID=78). Preprocessing produces full text, MathML formulae, citations & references, text fingerprints, and math. identifiers (list & histogram) as a unified document, followed by Indexing and Data Storage. Candidate Retrieval returns candidate documents, Detailed Analysis returns similar documents, and Human Inspection concludes the process.]
Results
• Math-based approach
• candidate retrieval step reduced recall (R=0.7)
• detailed pattern analysis increased the MRR (MRR=0.93)
• Combined analysis (math, citation, text) could identify suspicious
similarity for all cases (R =1, MRR=1)
Analyzing Images
N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based
Plagiarism Detection Approach,” in Proc. ACM/IEEE-CS Joint Conf. on Digital Libraries (JCDL), 2018.
N. Meuschke, V. Stange, M. Schubotz, and B. Gipp, “HyPlag: A Hybrid Approach to Academic Plagiarism
Detection,” in Proc. ACM SIGIR Conf., 2018.
Idea of Image-based Plagiarism Detection
• Images in academic documents convey much semantic
information in compressed format independent of the text
• Much research on Content-based Image Retrieval (CBIR)
• Little adaptation of CBIR methods to plagiarism detection (PD)
• exact and cropped image copies
• affinely transformed images (scaling, rotation, projection)
• slight alterations of appearance (blurring, lower resolution, noise)
Research Gap
• Current image-based PD approaches are problematic for:
• compound images
• rearranged images
• images mostly containing text (typically tables inserted as figures)
• visually differing, semantically equivalent data visualizations
• Goal: image-based PD process that:
• combines established and new analysis methods to cover
heterogeneous images in academic documents
• adaptively applies suitable analysis steps
• flexibly quantifies suspiciousness
• is extensible in the future
Process
[Figure: process for an input doc. Steps shown: extract image, decompose image, classify image, perceptual hashing, ratio hashing, OCR, k-gram text matching, positional text matching. A reference DB feeds the distance calculation (DpHash, DrHash, DkTM, DposTM), followed by outlier detection, s(Dm) > r, which returns potential source images.]
Perceptual Hashing
• Efficient CBIR method to reliably find near-copies of images
• Uses most apparent visual features in images
• Creates non-unique fingerprints that can be compared
• Fingerprints are invariant to:
• scaling
• aspect ratio changes
• changes to brightness, contrast and colors
• We use Discrete Cosine Transformation and Hamming Distance
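The combination of a Discrete Cosine Transform and the Hamming distance can be sketched as follows. This is a simplified, standard-library-only illustration rather than the implementation used in the study. It assumes the image has already been converted to a square grayscale matrix of fixed size (here 16 x 16), keeps an 8 x 8 low-frequency block, and thresholds the coefficients at their median. All function names and these sizes are assumptions.

```python
import math
import random
import statistics


def low_frequency_dct(pixels, keep=8):
    """Unnormalized 2D DCT-II of a square matrix; returns the keep x keep low-frequency block."""
    n = len(pixels)
    basis = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
             for u in range(keep)]
    # The 2D DCT is separable: transform the rows first, then the columns.
    row_pass = [[sum(row[x] * basis[u][x] for x in range(n)) for u in range(keep)]
                for row in pixels]
    return [[sum(row_pass[y][u] * basis[v][y] for y in range(n)) for u in range(keep)]
            for v in range(keep)]


def perceptual_hash(pixels, keep=8):
    """Bit list: 1 where a low-frequency coefficient exceeds the median."""
    block = low_frequency_dct(pixels, keep)
    # Skip the DC term block[0][0], which only reflects the overall brightness.
    coefficients = [block[v][u] for v in range(keep) for u in range(keep)
                    if (u, v) != (0, 0)]
    median = statistics.median(coefficients)
    return [1 if c > median else 0 for c in coefficients]


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two fingerprints differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


rng = random.Random(0)  # synthetic test data instead of real images
image = [[rng.randrange(256) for _ in range(16)] for _ in range(16)]
adjusted = [[0.5 * p + 40 for p in row] for row in image]  # lower contrast, brighter
unrelated = [[rng.randrange(256) for _ in range(16)] for _ in range(16)]

print(hamming_distance(perceptual_hash(image), perceptual_hash(adjusted)))
print(hamming_distance(perceptual_hash(image), perceptual_hash(unrelated)))
```

Because a brightness shift only changes the DC term and a contrast change scales all other coefficients by the same factor, the median comparison yields the same bits for `image` and `adjusted`, whereas the unrelated image produces a different fingerprint.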
Image Source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e
k-gram Text Matching
• To identify tables inserted as figures
and images with little visual similarity
• Text extracted using open source
OCR engine Tesseract
• Granularity:
• character 3-grams
• no chunk selection
• Similarity measure: $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$
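A minimal sketch of this measure on character 3-grams is given below. It reads the operator between K1 and K2 in the numerator as the symmetric difference of the two k-gram sets, which is an assumption, and it treats texts without any shared k-gram as infinitely distant, which the slide does not specify. The function names are hypothetical; in practice the input strings would come from the OCR step.

```python
import math


def character_kgrams(text, k=3):
    """Set of all overlapping character k-grams; no chunk selection is applied."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def kgram_distance(text_a, text_b, k=3):
    """Size of the symmetric difference divided by the size of the intersection."""
    grams_a = character_kgrams(text_a, k)
    grams_b = character_kgrams(text_b, k)
    shared = grams_a & grams_b
    if not shared:
        return math.inf  # assumption: no common k-gram means maximally distant
    return len(grams_a ^ grams_b) / len(shared)


print(kgram_distance("abcd", "bcde"))   # 2.0
print(kgram_distance("abcde", "abcde"))  # 0.0
```

In the first call the sets {abc, bcd} and {bcd, cde} share one 3-gram and differ in two, giving a distance of 2.0; identical texts give 0.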
Position-aware Text Matching
• To account for typically small amount of text in images aggravated
by OCR errors
• Process:
• Scale images to same height (here: 800px)
• Define proximity region around identified text (here 50px circle)
• Project proximity regions of input image to potential source
• Only consider matching characters in projected proximity regions
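$s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$
[Figure: input image and reference image scaled to the same height (h1 = 800px, h2 = 800px; widths w1, w2) with proximity radius r = 25px. Characters shown: A, B, C, D in the input image and A, B, D, X in the reference image. Legend: positional character match, positional character mismatch.]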
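The process steps and the score s can be sketched as follows, assuming OCR output is available as (character, x, y) tuples per image. The data layout, the function names, the greedy one-to-one matching, and the default radius of 25px (taken from the figure, which matches a 50px circle read as a diameter) are assumptions for illustration, not the published implementation.

```python
import math


def scale_to_height(chars, height, target=800):
    """Scale character positions so that the image height becomes `target` pixels."""
    factor = target / height
    return [(c, x * factor, y * factor) for c, x, y in chars]


def positional_similarity(input_chars, input_height, ref_chars, ref_height,
                          radius=25.0, target=800):
    """Share of characters that reappear within `radius` pixels after scaling."""
    a = scale_to_height(input_chars, input_height, target)
    b = scale_to_height(ref_chars, ref_height, target)
    used = set()  # reference characters that are already matched
    matches = 0
    for char, x, y in a:
        for j, (ref_char, rx, ry) in enumerate(b):
            if j not in used and char == ref_char and math.hypot(x - rx, y - ry) <= radius:
                used.add(j)
                matches += 1
                break
    largest = max(len(a), len(b))
    return matches / largest if largest else 0.0


# Hypothetical OCR results: the reference image is half the size of the input image.
input_chars = [("A", 100, 100), ("B", 200, 100), ("C", 100, 300), ("D", 300, 300)]
reference_chars = [("A", 52, 48), ("B", 100, 50), ("X", 50, 150), ("D", 150, 150)]

score = positional_similarity(input_chars, 800, reference_chars, 400)
print(score)  # 0.75
```

Three of the four characters are found at matching positions, while C is misread as X in the reference image, so the score is 3 / max(4, 4) = 0.75.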
Ratio Hashing
• First approach to target reuse of data (and its visualization)
• identifies equivalent, yet visually differing bar charts
[Figure: two visually differing bar charts (y-axis from 0 to 900) with the same bar-height ratios 1.00, 0.80, 0.61, 0.44, 0.30, 0.07.]
$d = |1.00 - 1.00| + |0.80 - 0.80| + |0.61 - 0.61| + |0.44 - 0.44| + |0.30 - 0.30| + |0.07 - 0.07| = 0.00$
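The example can be reproduced with a few lines of code, assuming the bar heights have already been extracted from the charts. The sketch expresses each bar relative to the tallest bar, sorts the ratios in descending order, and sums the absolute differences. The sorting, the zero padding for charts with different numbers of bars, and the function names are assumptions, and bar heights are expected to be positive.

```python
from itertools import zip_longest


def ratio_hash(bar_heights):
    """Bar heights relative to the tallest bar, sorted in descending order."""
    tallest = max(bar_heights)
    return sorted((height / tallest for height in bar_heights), reverse=True)


def ratio_distance(hash_a, hash_b):
    """Sum of absolute differences; missing bars count as ratio 0 (assumption)."""
    return sum(abs(a - b) for a, b in zip_longest(hash_a, hash_b, fillvalue=0.0))


chart_a = [900, 720, 549, 396, 270, 63]
chart_b = [31.5, 135, 450, 198, 360, 274.5]  # same data at half scale, bars reordered

print(ratio_hash(chart_a))  # [1.0, 0.8, 0.61, 0.44, 0.3, 0.07]
print(ratio_distance(ratio_hash(chart_a), ratio_hash(chart_b)))  # 0.0
```

Because only the ratios enter the comparison, rescaling the axis or reordering the bars does not change the distance, which is what allows visually different charts of the same data to match.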
Outlier Detection
• To quantify suspiciousness of method-specific distance scores
• Two assumptions:
• image only suspicious if it has comparatively high similarity (small distance)
to a small set (c=9) of other images
• clear separation of distance scores of highly similar set of images
[Figure: absolute distance scores D_m in ascending order (0, 25, 33, 40, 80, 90) and relative deltas of distance scores D'_m (0.3, 0.2, 0.1; example: d'_i = (80 - 40) / 40 = 1). Images before the split are outliers considered as potential sources; the remaining images are considered as unrelated. Condition for list split: d'_k ≥ 1 and k ≤ c.]
Outlier Detection Continued
• Find outlier group:
• split list of relative distance deltas if a distance is at least twice as
large as its predecessor (3x as large for k-gram matching)
• Score suspicious (𝑠 ≥ 0.5 ) if least similar outlier has distance
margin to collection that is twice as large as outlier’s distance to the
input image
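The list split can be sketched as follows, using the distance scores 25, 33, 40, 80, 90 from the figure. The sketch covers only the split into an outlier group and the remaining collection; the suspiciousness score s is not computed because its formula is not given on the slides. The function name and the return format are assumptions.

```python
def split_outliers(distances, c=9, factor=2.0):
    """Split sorted distances where a score is at least `factor` times its predecessor.

    Returns (outliers, rest). The outlier group may hold at most `c` images;
    if no qualifying split exists, the outlier group is empty.
    """
    ordered = sorted(distances)
    for k in range(1, min(c, len(ordered) - 1) + 1):
        previous, current = ordered[k - 1], ordered[k]
        # Equivalent to a relative delta (current - previous) / previous >= factor - 1.
        if current > previous and current >= factor * previous:
            return ordered[:k], ordered[k:]
    return [], ordered


scores = [90, 25, 40, 33, 80]

print(split_outliers(scores))              # ([25, 33, 40], [80, 90])
print(split_outliers(scores, factor=3.0))  # ([], [25, 33, 40, 80, 90])
```

With the default factor of 2, the jump from 40 to 80 separates three potential sources from the rest; with the stricter factor of 3 used for k-gram matching, the same scores yield no outlier group.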
Evaluation
• Source for test images: VroniPlag collection
• crowd-sourced effort investigating plagiarism allegations
• 196 manually examined academic works (mostly PhD theses)
• most allegations confirmed by responsible universities
• Targeted crawl for all annotated ‘fragments’ containing images
• confirmed by at least two examiners
• Selection of 15 representative cases (mostly from life sciences)
• Cases embedded in 4,500 images obtained from PubMed Central
Example: Near Copies
Source Image Reused Image
Example: Weak Alteration
Source Image Reused Image
Example: Moderate Alteration
Source Image Reused Image
Example: Strong Alteration
Source Image Reused Image
Results
• Suspicious scores (𝑠 ≥ 0.5) for 11 of 15 cases computed by at
least one of the methods (𝑅 = 0.73)
• Outlier detection effective (𝑃 = 1):
• For all input images with 𝑠 ≥ 0.5, true source image at the top rank
• For all input images with 𝑠 < 0.5, no source image retrieved among
the top-ten most similar images, i.e. no false positives
• Perceptual hashing with sub-image extraction worked best for
near copies and weakly altered images (found 6 of 9 cases)
• Text analysis performed better than perceptual hashing for
moderately and strongly altered images
• if the quality of the image was high enough to perform OCR reliably and
sufficient text content was present.
Results Continued
• Text analysis approaches identified 3 of 4 cases involving tables
• position-aware text matching more robust to low OCR quality
• k-gram matching identified more cases
• combination of approaches allows processing more images
• Dataset contained only one bar chart, for which ratio hashing
yielded extremely suspicious score (𝑠 = 0.92)
Source Image Reused Image
Discussion & Conclusion
• Image-based PD promising complement to other methods
• Small test collection, but restrictive outlier detection procedure
will prevent false positives also in larger collections
• if reduced precision is acceptable, threshold can be changed
interactively by user
• Approach well suited for scaling
• Preprocessing in parallel
• Options described to scale analysis methods
• Approach easily extensible with new methods
• New input scores to outlier detection
• Code: www.purl.org/imagepd
Future Work
• More detection methods tailored to specific data visualizations
• Scale the process
• parallelization of preprocessing
• candidate selection for feature descriptors
• Realize hybrid process
Hybrid Detection System
[Figure: process overview from start to end with the stages candidate retrieval, detailed comparison, post processing, and human inspection; further labels: heuristics, visualization. Similarity features shown: full text similarity, text snippets, citation patterns, fuzzy citation patterns, mathematical formulae, image similarity, semantic similarity, cross-lingual similarity. Legend: completed research, current research, future research.]
Questions?
Norman Meuschke
n@meuschke.org | @normeu
• Slides for this talk (and other talks):
www.slideshare.net/GroupGipp
• Contact, publications, other projects:
www.isg.uni.kn
• Code: www.github.com/ag-gipp
25

Analyzing Nontextual Content Features to Detect Academic Plagiarism

  • 1.
    Analyzing Nontextual ContentFeatures to Detect Academic Plagiarism Norman Meuschke Information Science Group University of Konstanz www.isg.uni.kn Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 1
  • 2.
    Outline • Introduction • ShortBio • Overview of Research Group • Overview of Academic Plagiarism Detection • Research Approach: Analyzing Nontextual Content Features • (Analyzing Academic Citations) • Analyzing Mathematics • Analyzing Images Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 2
  • 3.
    Short Bio Analyzing NontextualContent Features to Detect Academic Plagiarism - Norman Meuschke 3 UC Berkeley, California MA Thesis, Dept. of Statistics SciPlore Startup (2011 – 2014) National Institute of Informatics Tokyo PhD (2014 – Jan 2015) Univ. of Konstanz PhD (since Feb. 2015) Univ. of Magdeburg BA / MA Information Systems 2011 Feb 2015 - Today
  • 4.
    University of Konstanz AnAdaptive Image-based Plagiarism Detection Approach - Meuschke et al. 4 Map data ©2018 GeoBasis-DE/BKG (©2009), Google Map data ©2018 GeoBasis-DE/BKG (©2009), Google
  • 5.
    Research Group –www.isg.uni.kn Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 5 Prof. Bela Gipp Dr. Moritz Schubotz Corinna Breitinger Philip Ehret André Greiner-Petter Felix Hamborg Thomas Hepp Philipp Scharpf Alexander Schönhals Malte Schwarzer Vincent Stange Patrick Wortner visiting researcher at NII
  • 6.
    Research Areas • InformationScience • applied, use-case driven and user-focused computer science • We focus on three areas • Semantic Document Analysis & Document Retrieval • Mathematical Information Retrieval • Blockchain Applications Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 6
  • 7.
    Semantic Document Analysis& Document Retrieval • Plagiarism Detection • Literature Recommendation • Research Papers • Wikipedia • Legal Documents • News Analysis • Media Bias Detection Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 7 Corinna Breitinger Felix Hamborg Philipp ScharpfMalte Schwarzer Vincent StangeNorman Meuschke Gent Ymeri Max KutznerAnastasia Zhukova
  • 8.
    Mathematical Information Retrieval •Conversion and Enrichment • Search • Recommendation Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 8 Philipp Scharpf Dr. Moritz Schubotz André Greiner-Petter Felix Petersen
  • 9.
    Blockchain Applications • TrustedTimestamping • Blockchain for Science Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 9 Vincent StangeThomas Hepp Christopher Gondek Daniel Muffler Jannik Bamberger Philip EhretAlexander Schönhals Patrick Wortner
  • 10.
    Analyzing Nontextual ContentFeatures to Detect Academic Plagiarism Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 10
  • 11.
    Academic Plagiarism “The useof ideas, concepts, words, or structures without appropriately acknowledging the source to benefit in a setting where originality is expected.” Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke Source: Teddi Fishman. 2009. ”We know it when we see it”? is not good enough: toward a standard definition of plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity. 3
  • 12.
    Plagiarism Forms Note: plagiarismforms are not mutually exclusive Paraphrasing  intentional rewriting  no / insufficient reference the source Structural and idea plagiarism  little or no verbatim text overlap Cross-language plagiarism  manual/automated conversion of text into other language to hide its origin Copy & paste  taking content verbatim from other source Shake & paste  copy & paste of text segments with slight adjustments, e.g., synonym substitutions Technical disguise  techniques that exploit weaknesses of current detection methods Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 4 Weak Strong level of obfuscation
  • 13.
    - intense research -methods limited by text-based candidate retrieval (R approx. 0.8 for moderate disguise) - solvable, no research needed solvedCopy & paste Shake & paste  n-gram fingerprinting  vector space models  text alignment  exhaustive string matching Technical disguise  encoding checks  checks for textual content  checks for large images Detection Capabilities Paraphrasing Structural and idea plagiarism  synonym expansion (WordNet)  Semantic Role Labeling  Latent Semantic Analysis  POS-aware text matching Cross-language plagiarism  CL Character N-Gram Comp.  CL Explicit Semantic Analysis  CL Alignment-based Similarity Analysis Weak Strong Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke level of obfuscation 5
  • 14.
    Research Approach • Analyzesimilarity features that: • contain a high degree of semantic information • exhibit low variability in their representations • are not easily substitutable • Combine analysis non-textual and textual content features Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 14
  • 15.
  • 16.
    Analyzing Academic Citations B.Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on Document Engineering (DocEng ’11), 2011. B. Gipp, N. Meuschke, C. Breitinger, M. Lipinski, and A. Nuernberger, “Demonstration of Citation Pattern Analysis for Plagiarism Detection,” in Proc. ACM SIGIR Conf., 2013. B. Gipp, N. Meuschke, and C. Breitinger, “Citation-based Plagiarism Detection: Practicability on a Large- scale Scientific Corpus,” JASIST, vol. 65, iss. 2, pp. 1527-1540, 2014. B. Gipp, Citation-based Plagiarism Detection – Detecting Disguised and Cross-language Plagiarism using Citation Pattern Analysis, Springer Vieweg Research, 2014. Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 16
  • 17.
    Analyzing Citation Patterns AnalyzingNontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 17 Doc C Doc E Doc D Section 1 This is an exampl etext withreferences to different documents for illustratingtheusageof citation analysis for plagiari sm detection. This is an exampl etext withreferences to different documents for illustrati ng the usage of citationanalysis forplagiarism detection . This is ain-text citation [1].This is an exampl etext withreferences to different documents for illustrating the usage of citation analysis for plagiari sm detection . This is an exampl e text withreferenc es to differentdocuments fori llustratingthe usage ofci tation analysis for plagiarism detection. Section 2 Another in-text citation [2].tThis is anexample text with references todifferent documents for illustrati ng the usage of citationanalysis forplagiarism detection. This is an ex ampletext with references to different documents for illustrati ng the usageof citation anal ysis for plagiarism detection. This is arepeated in-text citation [1]. This is an exampl etext withreferences to different documents for illustratingtheusageof citation analysis for plagiari sm detection. This is an exampl etext withreferences to different documents for illustrati ng the usage of citationanalysis forplagiarism detection . Setion 3 A third in-text citation [3].This is an exampl etext withreferences to different documents for illustrating the usage of citation analysis for plagiari sm detection . This is an exampl e text withreferenc es to differentdocuments fori llustratingthe usage ofci tation analysis for plagiarism detection. a final i n-text-citation[2]. References [1] [2] [3] Document B This is an exampl etext withreferences to different documents for illustratingtheusage ofci tation analysis for plagi arism detection. 
This is ain-text citation [1].This is an ex ampletext with references to different documents for illustrati ng the usageof citation anal ysis for plagiarism detection. Another exampl efor ani n-text citation [2]. This is an exampl etext withreferences to different documents for illustratingtheusage ofci tation analysis for plagi arism detection. This is an exampl etext withreferences to different documents for illustratingtheusage ofci tation analysis for plagi arism detection. This is an exampl etext withreferences to different documents for illustrati ng the usage of citationanalysis forplagiarism detection. This is an exampl etext withreferences to different documents for illustrating the usage ofcitation analysi s for pl agiarism detection. This is an exampl etext withreferences to different documents for illustratingtheusage ofci tation analysis for plagi arism detection. This is an exampl etext withreferences to different documents for illustrati ng the usage of citationanalysis forplagiarism detection. Here s a third in-text citation [3].This is an exampl etext withreferences to different documents for illustrati ng the usage of citationanalysis forplagiarism detection. This is an exampl etext withreferences to different documents for illustratingtheusage ofci tation analysis for plagi arism detection. Document A References [1] [2] [3] EDC DECDC Citation Pattern Citation Pattern Doc A Doc B Ins.EIns.DC DECDC Pattern Comparison Doc A Doc B
  • 18.
    Conclusion CbPD Evaluation
    • Successes
      • Significantly higher detection performance for disguised plagiarism in biomedicine
      • Decrease in user effort (time savings for examination)
      • First approach to allow n:n comparisons for large collections
    • Limitations
      • Effectiveness varies depending on discipline
      • Low for mathematics and physics
      • Effectiveness is low for shorter cases of disguised plagiarism
  • 19.
    Analyzing Mathematical Expressions
    N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect Academic Plagiarism”, in Proc. ACM CIKM, 2017.
    N. Meuschke, M. Schubotz, M. Kramer, V. Stange, and B. Gipp, “Improving External Plagiarism Detection for Academic Documents by Analyzing Mathematical Expressions”, in review at ACM CIKM, 2018.
  • 20.
    Characteristics of Math-Heavy Texts
    • Texts in math-heavy disciplines interweave natural and symbolic language … one is not understandable without the other
  • 21.
    Characteristics of Math-Heavy Texts
    • Mathematical expressions share many characteristics of academic citations
      • much semantic information
      • language-independent
      • hard to leave out or substitute (yet easy to obfuscate)
  • 22.
    Initial MathPD Study
    • Preliminary:
      • Manual analysis of 39 confirmed cases of plagiarism
    • Automated retrieval experiments:
      • 10 real-world cases of mathematical plagiarism; source documents embedded in NTCIR-11 Math Retrieval corpus (105K arXiv documents, 60M mathematical expressions)
  • 23.
    Math-based Feature Comparison
    [Figure: Formulae from 10 plagiarized documents, 10 source documents, and 105,120 arXiv documents are extracted with the Mathosphere framework. Four feature types are compared for a document pair (Doc1, Doc2): identifiers (ci), numbers (cn), operators (co), and the feature combination. Example distance values shown: .13 and .50.]
    • All-to-all comparison of documents and document partitions
    • Computation: relative distance of frequency histograms of feature occurrences
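The slide only states that the comparison uses a "relative distance of frequency histograms". As an illustration, the sketch below divides the summed absolute count differences by the summed maximum counts. This normalization is an assumption, not the formula from the talk, and the function name and example identifiers are hypothetical.

```python
from collections import Counter


def relative_histogram_distance(features_a, features_b):
    """Relative distance between the frequency histograms of two feature lists.

    Assumed normalization: sum of absolute count differences divided by the
    sum of the larger count per feature (0 = identical, 1 = nothing shared).
    """
    hist_a, hist_b = Counter(features_a), Counter(features_b)
    keys = set(hist_a) | set(hist_b)
    difference = sum(abs(hist_a[k] - hist_b[k]) for k in keys)
    total = sum(max(hist_a[k], hist_b[k]) for k in keys)
    return difference / total if total else 0.0


# Hypothetical identifier occurrences extracted from two documents.
identifiers_doc1 = ["r", "x", "x"]
identifiers_doc2 = ["r", "x", "x", "y"]
print(relative_histogram_distance(identifiers_doc1, identifiers_doc2))  # 0.25
```

The same function could be applied separately to identifiers, numbers, and operators, and to a concatenation of all three for the combined feature.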
  • 24.
    Results Initial MathPD Study
    Rank at which the source document was retrieved (D: combination of all features; dci: identifiers; dcn: numbers; dco: operators)

    Case   Full document                           Partitions
           D        dci     dcn      dco           D        dci     dcn      dco
    C1     3,606    1       27,857   30,784        1        1       85,418   99,201
    C2     1        1       88,891   90,962        1        1       12,266   10,277
    C3     11,628   2       28,415   3,144         1        16      34,966   5,757
    C4     2,581    1       1,950    86            189      6       54,560   18,374
    C5     1        1       5,790    22,408        1        6       92,951   16,180
    C6     25,498   12      19,862   38,145        7,976    3       24,405   72,687
    C7     1        1       4,690    1,627         19,900   1       67,614   14,758
    C8     1        1       39,215   11,576        1        1       21,152   9,475
    C9     1        1       13,591   35,393        1        1       11,519   32,687
    C10    1        1       76,678   30,673        1        1,223   89,703   3,280
    MRR    0.60     0.86    < 0.01   < 0.01        0.70     0.57    < 0.01   < 0.01
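The MRR row can be reproduced from the ranks in the table: the mean reciprocal rank is the average of 1/rank over all cases. A short sketch for the identifier columns follows (the variable names are illustrative; the ranks are those listed on the slide).

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test cases."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Ranks of the source document for cases C1..C10, identifier feature (dci).
full_document_ranks = [1, 1, 2, 1, 1, 12, 1, 1, 1, 1]
partition_ranks = [1, 1, 16, 6, 6, 3, 1, 1, 1, 1223]

print(round(mean_reciprocal_rank(full_document_ranks), 2))  # 0.86
print(round(mean_reciprocal_rank(partition_ranks), 2))      # 0.57
```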
  • 25.
    Extension to Math-based Retrieval Process
    • Candidate Retrieval using Elasticsearch
    • Detailed Analysis using Pattern Analysis for Identifiers
    [Figure: process diagram with the stages preprocessing, indexing, candidate retrieval, detailed analysis, and human inspection. Labels: input document(s), user document, full text, MathML formulae, citations & references, text fingerprints, math. identifiers (list & histogram), unified document, data storage (example Doc. ID = 78), candidate documents, similar documents.]
  • 26.
    Results
    • Math-based approach
      • candidate retrieval step reduced recall (R = 0.7)
      • detailed pattern analysis increased MRR to 0.93
    • Combined analysis (math, citation, text) could identify suspicious similarity for all cases (R = 1, MRR = 1)
  • 27.
    Analyzing Images
    N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based Plagiarism Detection Approach,” in Proc. ACM/IEEE-CS Joint Conf. on Digital Libraries (JCDL), 2018.
    N. Meuschke, V. Stange, M. Schubotz, and B. Gipp, “HyPlag: A Hybrid Approach to Academic Plagiarism Detection,” in Proc. ACM SIGIR Conf., 2018.
  • 28.
    Idea of Image-based Plagiarism Detection
    • Images in academic documents convey much semantic information in compressed format independent of the text
    • Much research on Content-based Image Retrieval (CBIR)
    • Little adaptation of CBIR methods to plagiarism detection (PD)
      • exact and cropped image copies
      • affinely transformed images (scaling, rotation, projection)
      • slight alterations of appearance (blurring, lower resolution, noise)
  • 29.
    Research Gap
    • Current image-based PD approaches problematic for:
      • compound images
      • rearranged images
      • images mostly containing text (typically tables inserted as figures)
      • visually differing, semantically equivalent data visualizations
    • Goal: image-based PD process that:
      • combines established and new analysis methods to cover heterogeneous images in academic documents
      • adaptively applies suitable analysis steps
      • flexibly quantifies suspiciousness
      • is extensible in the future
  • 30.
    Process
    [Figure: detection process. Input doc. → extract image → decompose image → classify image → analysis methods (perceptual hashing, ratio hashing, OCR with k-gram text matching and positional text matching) → distance calculation against the reference DB (D_pHash, D_rHash, D_kTM, D_posTM) → outlier detection: s(D_m) > r → potential source images.]
  • 31.
    Perceptual Hashing
    • Efficient CBIR method to reliably find near image copies
    • Uses most apparent visual features in images
    • Creates non-unique fingerprints that can be compared
    • Fingerprints are invariant to:
      • scaling
      • aspect ratio changes
      • changes to brightness, contrast and colors
    • We use Discrete Cosine Transformation and Hamming Distance
    Image Source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e
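To make the combination of discrete cosine transform and Hamming distance concrete, here is a minimal pure-Python sketch of a DCT-based perceptual hash. It is not the implementation from the talk. It assumes the image has already been converted to grayscale and scaled to a small square (here 32 x 32), uses an unnormalized DCT-II, and thresholds the low-frequency coefficients at their median. All function names are hypothetical.

```python
import math
from statistics import median


def low_frequency_dct(image, size=8):
    """Unnormalized 2D DCT-II of a square grayscale image, keeping only the
    size x size block of lowest frequencies."""
    n = len(image)
    cos = [[math.cos((2 * x + 1) * u * math.pi / (2 * n)) for x in range(n)]
           for u in range(size)]
    # The DCT is separable: transform along rows first, then along columns.
    rows = [[sum(image[x][y] * cos[v][y] for y in range(n)) for v in range(size)]
            for x in range(n)]
    return [[sum(rows[x][v] * cos[u][x] for x in range(n)) for v in range(size)]
            for u in range(size)]


def perceptual_hash(image, size=8):
    """Bit fingerprint: one bit per low-frequency coefficient (DC term dropped),
    set if the coefficient exceeds the median of all kept coefficients."""
    coefficients = [c for row in low_frequency_dct(image, size) for c in row][1:]
    threshold = median(coefficients)
    return [c > threshold for c in coefficients]


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two fingerprints differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Synthetic 32 x 32 test image and a copy with doubled contrast.
image = [[(x * 7 + y * 13) % 256 for y in range(32)] for x in range(32)]
higher_contrast = [[2 * value for value in row] for row in image]
print(hamming_distance(perceptual_hash(image), perceptual_hash(higher_contrast)))  # 0
```

Because every coefficient and the median scale by the same factor, the contrast change leaves the fingerprint untouched. Stronger edits flip more bits and increase the Hamming distance.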
  • 32.
    k-gram Text Matching
    • To identify tables inserted as figures and images with little visual similarity
    • Text extracted using open source OCR engine Tesseract
    • Granularity:
      • character 3-grams
      • no chunk selection
    • Similarity measure: $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$
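A small sketch of the measure on this slide, read as the size of the symmetric difference of the two 3-gram sets divided by the size of their intersection. The function names are hypothetical. Returning infinity when no k-gram is shared is an assumption, since the slide does not say how an empty intersection is handled.

```python
def char_kgrams(text, k=3):
    """Set of all character k-grams in the text (no chunk selection)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def kgram_distance(text_a, text_b, k=3):
    """Symmetric difference of the k-gram sets relative to their intersection."""
    kgrams_a, kgrams_b = char_kgrams(text_a, k), char_kgrams(text_b, k)
    shared = kgrams_a & kgrams_b
    if not shared:
        return float("inf")  # assumption: no shared k-grams means maximal distance
    return len(kgrams_a ^ kgrams_b) / len(shared)


print(kgram_distance("abcd", "abcd"))  # 0.0
print(kgram_distance("abcd", "abce"))  # 2.0
```

In practice, the two input strings would be the OCR output for the input image and for a reference image.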
  • 33.
    Position-aware Text Matching
    • To account for the typically small amount of text in images, aggravated by OCR errors
    • Process:
      • Scale images to same height (here: 800px)
      • Define proximity region around identified text (here: 50px circle)
      • Project proximity regions of input image to potential source
      • Only consider matching characters in projected proximity regions
    • Similarity score: $s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$
    [Figure: input image and reference image, both scaled to a height of 800px ($h_1 = h_2 = 800$px) with widths $w_1$ and $w_2$; proximity regions with radius r = 25px. Legend: positional character match, positional character mismatch.]
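The process on this slide can be sketched as follows. The sketch assumes that OCR delivers each recognized character with the coordinates of its center, and it matches characters greedily. The data layout and both function names are hypothetical; the 800px height and the 25px radius are taken from the slide.

```python
from math import hypot


def scale_characters(characters, image_height, target_height=800):
    """Rescale character positions as if the image had the target height."""
    factor = target_height / image_height
    return [(char, x * factor, y * factor) for char, x, y in characters]


def positional_similarity(input_chars, reference_chars, radius=25.0):
    """Share of characters that reappear within the proximity region,
    normalized by the larger of the two character counts."""
    unused = list(reference_chars)
    matches = 0
    for char, x, y in input_chars:
        for index, (ref_char, ref_x, ref_y) in enumerate(unused):
            if ref_char == char and hypot(x - ref_x, y - ref_y) <= radius:
                matches += 1
                del unused[index]  # each reference character matches at most once
                break
    larger = max(len(input_chars), len(reference_chars))
    return matches / larger if larger else 0.0


# Hypothetical OCR output: (character, x, y) in pixels.
input_image = scale_characters([("A", 50, 50), ("B", 100, 100), ("X", 150, 150)],
                               image_height=400)
reference_image = [("A", 105, 100), ("B", 200, 210), ("D", 300, 300), ("C", 500, 500)]
print(positional_similarity(input_image, reference_image))  # 0.5
```

Two of the characters match in position; dividing by the larger character count (four) gives 0.5.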
  • 34.
    Ratio Hashing
    • First approach to target reuse of data (and its visualization)
    • identifies equivalent, yet visually differing bar charts
    [Figure: two visually differing bar charts (value axis from 0 to 900) with identical relative bar heights 1.00, 0.80, 0.61, 0.44, 0.30, 0.07.]
    $d = |1.00 - 1.00| + |0.80 - 0.80| + |0.61 - 0.61| + |0.44 - 0.44| + |0.30 - 0.30| + |0.07 - 0.07| = 0.00$
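A sketch of the ratio hash as described in the speaker notes: sort the bars by height in decreasing order, divide each height by the maximum, and sum the bar-wise absolute differences. Extracting the bar heights from the image is not covered. The function names and the absolute bar heights are hypothetical; the heights were chosen to reproduce the relative heights shown on the slide.

```python
def ratio_hash(bar_heights):
    """Relative bar heights h_i / h_max, sorted in decreasing order.
    Assumes at least one bar with positive height."""
    ordered = sorted(bar_heights, reverse=True)
    return [height / ordered[0] for height in ordered]


def ratio_hash_distance(heights_a, heights_b, min_bars=3):
    """Sum of bar-wise differences of relative heights; None if the charts are
    not comparable (different number of bars or fewer than min_bars)."""
    if len(heights_a) != len(heights_b) or len(heights_a) < min_bars:
        return None
    return sum(abs(a - b)
               for a, b in zip(ratio_hash(heights_a), ratio_hash(heights_b)))


# Hypothetical charts: same data, different bar order and half the scale.
chart_a = [900, 720, 549, 396, 270, 63]
chart_b = [31.5, 450, 135, 360, 198, 274.5]
print([round(r, 2) for r in ratio_hash(chart_a)])  # [1.0, 0.8, 0.61, 0.44, 0.3, 0.07]
print(ratio_hash_distance(chart_a, chart_b))       # 0.0
```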
  • 35.
    Outlier Detection
    • To quantify suspiciousness of method-specific distance scores
    • Two assumptions:
      • image only suspicious if comparably high similarity (small distance) to small set (c = 9) of other images
      • clear separation of distance scores of highly similar set of images
    [Figure: absolute distance scores $D_m$ = (0, 25, 33, 40, 80, 90) and relative deltas of distance scores $D'_m$ (0.3, 0.2, 0.1, and $d'_i = (80 - 40) / 40 = 1$). The list is split at $d'_i$ into $D'_{m,1}$ (outliers considered as potential sources) and $D'_{m,2}$ (images considered as unrelated). Condition for list split: $k \leq c$.]
  • 36.
    Outlier Detection Continued
    • Find outlier group:
      • split list of relative distance deltas if a delta is at least twice as large as its predecessor (3x as large for k-gram matching)
    • Score suspicious (s ≥ 0.5) if least similar outlier has distance margin to collection that is twice as large as outlier’s distance to the input image
    [Figure: same distance-score example as on the previous slide.]
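The list split can be sketched as follows, using the distance scores from the figure. The function name is hypothetical, and skipping deltas whose predecessor distance is zero is an assumption made to avoid a division by zero. The suspiciousness score s itself is not computed, because the slides do not give its formula.

```python
def find_outliers(distances, factor=2.0, max_outliers=9):
    """Return the group of smallest distances that is clearly separated from
    the rest, or an empty list if there is no such group."""
    ordered = sorted(distances)
    # Relative deltas between neighboring scores, stored as (left index, delta).
    deltas = []
    for i in range(len(ordered) - 1):
        if ordered[i] > 0:  # assumption: deltas relative to a zero distance are skipped
            deltas.append((i, (ordered[i + 1] - ordered[i]) / ordered[i]))
    # Split at the first delta that is at least `factor` times its predecessor.
    for (_, previous), (i, current) in zip(deltas, deltas[1:]):
        if previous > 0 and current >= factor * previous:
            outliers = ordered[:i + 1]
            return outliers if len(outliers) <= max_outliers else []
    return []


print(find_outliers([0, 25, 33, 40, 80, 90]))  # [0, 25, 33, 40]
```

With the slide's numbers, the delta (80 - 40) / 40 = 1 is more than twice its predecessor (about 0.21), so the first four images form the outlier group. For k-gram matching, `factor=3.0` would correspond to the stricter split condition.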
  • 37.
    Evaluation
    • Source for test images: VroniPlag collection
      • crowd-sourced effort investigating plagiarism allegations
      • 196 manually examined academic works (mostly PhD theses)
      • most allegations confirmed by responsible universities
    • Targeted crawl for all annotated ‘fragments’ containing images
      • confirmed by at least two examiners
    • Selection of 15 representative cases (mostly from life sciences)
    • Cases embedded in 4,500 images obtained from PubMed Central
  • 38.
    Example: Near Copies
    [Figure: source image and reused image]
  • 39.
    Example: Weak Alteration
    [Figure: source image and reused image]
  • 40.
    Example: Moderate Alteration
    [Figure: source image and reused image]
  • 41.
    Example: Strong Alteration
    [Figure: source image and reused image]
  • 42.
    Results
    • Suspicious scores (s ≥ 0.5) for 11 of 15 cases computed by at least one of the methods (R = 0.73)
    • Outlier detection effective (P = 1):
      • For all input images with s ≥ 0.5, true source image at the top rank
      • For all input images with s < 0.5, no source image retrieved among the top-ten most similar images, i.e. no false positives
    • Perceptual hashing with sub-image extraction worked best for near copies and weakly altered images (found 6 of 9 cases)
    • Text analysis performed better than perceptual hashing for moderately and strongly altered images
      • if the quality of the image was high enough to perform OCR reliably and sufficient text content was present
  • 43.
    Results Continued
    • Text analysis approaches identified 3 of 4 cases involving tables
      • position-aware text matching more robust to low OCR quality
      • k-gram matching identified more cases
      • combination of approaches allows processing more images
    • Dataset contained only one bar chart, for which ratio hashing yielded extremely suspicious score (s = 0.92)
    [Figure: source image and reused image]
  • 44.
    Discussion & Conclusion
    • Image-based PD promising complement to other methods
    • Small test collection, but restrictive outlier detection procedure will prevent false positives also in larger collections
      • if reduced precision is acceptable, threshold can be changed interactively by user
    • Approach well suited for scaling
      • Preprocessing in parallel
      • Options described to scale analysis methods
    • Approach easily extensible with new methods
      • New input scores to outlier detection
    • Code: www.purl.org/imagepd
  • 45.
    Future Work
    • More detection methods tailored to specific data visualizations
    • Scale the process
      • parallelization of preprocessing
      • candidate selection for feature descriptors
    • Realize hybrid process
    [Figure: hybrid detection process from start to end with the stages candidate retrieval, detailed comparison, post processing, and human inspection. Components: heuristics, full text similarity, text snippets, citation patterns, fuzzy citation patterns, mathematical formulae, image similarity, semantic similarity, cross-lingual similarity, visualization. Legend: completed research, current research, future research.]
  • 46.
    Hybrid Detection System
  • 47.
    Questions?
    Norman Meuschke
    n@meuschke.org | @normeu
    • Slides for this talk (and other talks): www.slideshare.net/GroupGipp
    • Contact, publications, other projects: www.isg.uni.kn
    • Code: www.github.com/ag-gipp

Editor's Notes

  • #6 Aside from me, 10 other PhD students and one postdoc
  • #12 The research we report on in the paper is on improving the detection of academic plagiarism, which we define as … not just copied text, but any substantial intellectual contribution
  • #13 Academic plagiarism occurs in a number of forms, which can be broadly categorized by their degree of obfuscation, but are not mutually exclusive
  • #14 Extensive research has been conducted on plagiarism detection methods, particularly for finding text plagiarism.
  • #15 specifically
  • #16 Given the inherent limitation of purely text-based analysis methods, our research focusses on complementing successful text-based analysis methods with methods that analyze nontextual content features.
  • #18 A while ago, Bela Gipp had the idea that academic citations fulfill all these criteria, so
  • #24 Features: identifiers (ci), numbers (cn), operators (co), feature combination. Descriptors: frequency histograms of feature occurrence. Granularity: i) entire document, ii) document partitions. Similarity measure: relative distance of feature occurrence frequencies for the individual features (d_ci, d_cn, d_co) and for the combination of all features (D).
  • #25 Analyzing identifiers worked best: 8 of 10 test cases at top rank, MRR = 0.86. Analyzing operators and numbers was generally noisy. However, for partitions, including operators and numbers in the aggregated distance performs better (7/10) than identifiers alone (5/10).
  • #31 With these requirements in mind, we devised the following image-based detection process. The process accepts input documents in PDF format, extracts images from the PDF, decomposes compound images (handling them is a weakness of prior works on IBPD), and reduces computational load by classifying images using a CNN. The core of our process currently consists of 4 analysis methods, which are applied independently. The methods compute method-specific feature descriptors and compare them to the stored descriptors for all images in the collection. The comparisons yield separate lists of distance scores for each method. The lists are input to the outlier detection process, which returns potential sources for an image.
  • #32 Also finds weakly altered images (= near copies for image sections). Hamming dist. = number of bits that differ.
  • #33 Typically word 3-5 grams, i.e. 15-30 characters. Here: finer granularity to account for the smaller amount of text and potential recognition errors. OCR turned out to be a real problem in our experiments.
  • #34 Idea: only consider text matches that occur in roughly the same regions of the pictures. Other shapes and dynamic sizing of the shape, e.g., dependent on the length of the text fragment, are also possible. The normalization of the sim score reflects the assumption that two images are less likely to be similar if their amount of textual content differs strongly.
  • #35 Process: determine bar heights (for details see paper); sort bars by height in decreasing order; compute relative bar heights, i.e., $h_i / h_{\max}$; ratio hash = bar-wise difference. Only consider charts with the same number of bars (min 3) to reduce computational effort; this can be changed.
  • #36 The first assumption is basically a heuristic filter for false positives, e.g., common images like logos that were not excluded in preprocessing (small distance for many images) or multiple versions of the same document in the collection. The second assumption assures an outlier group.
  • #37 Store absolute distance scores in ascending order of distance. Compute the list of relative distance deltas. Scan through the list of relative dist. deltas to find a delta at least twice as large as its predecessor. Check that the sublist does not contain more than 9 elements.
  • #39 Near copies share the large majority of their visual content and exhibit minor differences introduced by i) removing non-essential content (e.g., numeric labels or watermarks), ii) cropping or padding, iii) performing affine transformations (e.g., scaling or rotation), iv) changing the resolution, contrast or color space. Especially iii) and iv) can be introduced inadvertently by extracting and reusing images from a PDF or printed document.
  • #40 typically reuse parts of an original image as near copies
  • #41 typically reuse most or all of the visual components of the original image, yet rearrange the components. Here: gap between elements; three instead of two elements in the last row.
  • #42 typically redrawn versions of the source with changes made to the arrangement and/or visual appearance of image components. Here: shape of the curves in main and enlarged view; placement of the enlarged view; brackets used.