Slides for a talk given by Norman Meuschke at the National Institute of Informatics Tokyo in June 2018.
The talk gives an overview of the research of the Information Science Group and presents Norman’s recent work on analyzing nontextual content features to improve the detection of academic plagiarism. Identifying academic plagiarism is an important task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, reliably detecting instances of concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual content and ideas, is an open research problem. Norman’s research focuses on analyzing nontextual features of academic documents, such as citations, images, and mathematical expressions, to complement the well-performing text-based detection methods. His goal is to devise a hybrid detection approach that analyzes textual and nontextual content features in academic documents to cover the wide range of plagiarism forms as fully as possible. More information on the Information Science research group and Norman can be found at: www.isg.uni.kn and www.meuschke.org
Analyzing Nontextual Content Features to Detect Academic Plagiarism
1. Analyzing Nontextual Content Features
to Detect Academic Plagiarism
Norman Meuschke
Information Science Group
University of Konstanz
www.isg.uni.kn
Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 1
2. Outline
• Introduction
• Short Bio
• Overview of Research Group
• Overview of Academic Plagiarism Detection
• Research Approach: Analyzing Nontextual Content Features
• (Analyzing Academic Citations)
• Analyzing Mathematics
• Analyzing Images
3. Short Bio
UC Berkeley, California
MA Thesis, Dept. of Statistics
SciPlore Startup
(2011 – 2014)
National Institute of
Informatics Tokyo
PhD
(2014 – Jan 2015)
Univ. of Konstanz
PhD
(since Feb. 2015)
Univ. of Magdeburg
BA / MA
Information Systems
2011
Feb 2015 - Today
5. Research Group – www.isg.uni.kn
Prof. Bela Gipp Dr. Moritz Schubotz Corinna Breitinger Philip Ehret
André Greiner-Petter Felix Hamborg Thomas Hepp Philipp Scharpf
Alexander Schönhals Malte Schwarzer Vincent Stange Patrick Wortner
visiting researcher
at NII
6. Research Areas
• Information Science
• applied, use-case driven and user-focused computer science
• We focus on three areas
• Semantic Document Analysis & Document Retrieval
• Mathematical Information Retrieval
• Blockchain Applications
7. Semantic Document Analysis & Document Retrieval
• Plagiarism Detection
• Literature Recommendation
• Research Papers
• Wikipedia
• Legal Documents
• News Analysis
• Media Bias Detection
Corinna Breitinger
Felix Hamborg
Philipp Scharpf Malte Schwarzer
Vincent Stange Norman Meuschke Gent Ymeri
Max Kutzner Anastasia Zhukova
8. Mathematical Information Retrieval
• Conversion and Enrichment
• Search
• Recommendation
Philipp Scharpf
Dr. Moritz Schubotz André Greiner-Petter
Felix Petersen
9. Blockchain Applications
• Trusted Timestamping
• Blockchain for Science
Vincent Stange Thomas Hepp Christopher Gondek
Daniel Muffler Jannik Bamberger
Philip Ehret Alexander Schönhals Patrick Wortner
10. Analyzing Nontextual Content Features
to Detect Academic Plagiarism
11. Academic Plagiarism
“The use of ideas, concepts, words, or structures
without appropriately acknowledging the source to
benefit in a setting where originality is expected.”
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that
transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
12. Plagiarism Forms
Note: plagiarism forms are not mutually exclusive
Paraphrasing
intentional rewriting
no / insufficient reference to the source
Structural and idea plagiarism
little or no verbatim text overlap
Cross-language plagiarism
manual/automated conversion of text into
other language to hide its origin
Copy & paste
taking content verbatim from other source
Shake & paste
copy & paste of text segments with slight
adjustments, e.g., synonym substitutions
Technical disguise
techniques that exploit weaknesses of
current detection methods
Level of obfuscation: weak → strong
13. Detection Capabilities
• Copy & paste, shake & paste: solved
• n-gram fingerprinting
• vector space models
• text alignment
• exhaustive string matching
• Technical disguise: solvable, no research needed
• encoding checks
• checks for textual content
• checks for large images
• Paraphrasing, structural and idea plagiarism: intense research
• synonym expansion (WordNet)
• Semantic Role Labeling
• Latent Semantic Analysis
• POS-aware text matching
• Cross-language plagiarism: intense research
• CL Character N-Gram Comp.
• CL Explicit Semantic Analysis
• CL Alignment-based Similarity Analysis
• Methods limited by text-based candidate retrieval (R approx. 0.8 for moderate disguise)
Level of obfuscation: weak → strong
14. Research Approach
• Analyze similarity features that:
• contain a high degree of semantic information
• exhibit low variability in their representations
• are not easily substitutable
• Combine the analysis of nontextual and textual content features
16. Analyzing Academic Citations
B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection:
Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on
Document Engineering (DocEng ’11), 2011.
B. Gipp, N. Meuschke, C. Breitinger, M. Lipinski, and A. Nuernberger, “Demonstration of Citation Pattern
Analysis for Plagiarism Detection,” in Proc. ACM SIGIR Conf., 2013.
B. Gipp, N. Meuschke, and C. Breitinger, “Citation-based Plagiarism Detection: Practicability on a Large-
scale Scientific Corpus,” JASIST, vol. 65, iss. 2, pp. 1527-1540, 2014.
B. Gipp, Citation-based Plagiarism Detection – Detecting Disguised and Cross-language Plagiarism using
Citation Pattern Analysis, Springer Vieweg Research, 2014.
17. Analyzing Citation Patterns
[Figure: Two example documents, Doc A and Doc B, each divided into sections with in-text citations [1], [2], [3] that refer to Doc C, Doc D, and Doc E.]
Citation pattern of Doc A: E D C
Citation pattern of Doc B: D E C D C
Pattern comparison:
Doc A: Ins. E Ins. D C
Doc B: D E C D C
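One of the citation pattern matching algorithms named in the DocEng ’11 reference is the Longest Common Citation Sequence. The following is a minimal sketch, assuming it can be computed like a standard longest common subsequence over the ordered in-text citations. The function name and the single-letter encoding of cited documents are illustrative and not taken from the CbPD implementation.

```python
def longest_common_citation_sequence(a, b):
    """Return one longest common subsequence of two citation sequences."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the length of the longest common subsequence
    # of the suffixes a[i:] and b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover one longest sequence.
    result, i, j = [], 0, 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


# Citation patterns from the slide: each letter is one cited document.
doc_a = list("EDC")
doc_b = list("DECDC")
shared = longest_common_citation_sequence(doc_a, doc_b)
print(shared)
```

For the two patterns on the slide, all three citations of Doc A occur in the same order in Doc B, with two additional citations inserted in between.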
18. Conclusion CbPD Evaluation
• Successes
• Significantly higher detection performance for disguised plagiarism
in biomedicine
• Decrease in user effort (time savings for examination)
• First approach to allow n:n comparisons for large collections
• Limitations
• Effectiveness varies depending on discipline
• Low for mathematics and physics
• Effectiveness is low for shorter cases of disguised plagiarism
19. Analyzing Mathematical Expressions
N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect
Academic Plagiarism”, in Proc. ACM CIKM, 2017.
N. Meuschke, M. Schubotz, M. Kramer, V. Stange, and B. Gipp, “Improving External Plagiarism Detection
for Academic Documents by Analyzing Mathematical Expressions”, in review at ACM CIKM, 2018.
20. Characteristics of Math-Heavy Texts
• Texts in math-heavy disciplines interweave natural
and symbolic language
… one is not understandable without the other
21. Characteristics of Math-Heavy Texts
• Mathematical expressions share many characteristics
of academic citations
• much semantic information
• language-independent
• hard to leave out or substitute (yet easy to obfuscate)
22. Initial MathPD Study
• Preliminary:
• Manual analysis of 39 confirmed cases of plagiarism
• Automated retrieval experiments:
• 10 real-world cases of mathematical plagiarism; source documents
embedded in NTCIR-11 Math Retrieval corpus
(105K arXiv documents, 60M mathematical expressions)
23. Math-based Feature Comparison
[Figure: Math-based feature comparison. Formulae from 10 plagiarized documents, 10 source documents, and 105,120 arXiv documents are processed with the Mathosphere framework. Panels plot the distance between Doc1 and Doc2 (axis 0 to 2) for identifiers (ci; e.g., r, x), numbers (cn; e.g., 2, 3), operators (co; e.g., -), and the feature combination. Example distances shown: .13 and .50.]
• Computation: relative distance of frequency histograms of feature occurrences
• All-to-all comparison of documents and document partitions
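The slide names the computation as a relative distance of frequency histograms but does not give the formula. The sketch below is therefore an assumption: it divides the summed absolute count differences by the summed maximum counts, which yields 0 for identical histograms and 1 for disjoint ones. The function name is hypothetical.

```python
from collections import Counter


def relative_histogram_distance(features_1, features_2):
    """Assumed relative distance between two feature frequency histograms."""
    h1, h2 = Counter(features_1), Counter(features_2)
    keys = set(h1) | set(h2)
    # Normalizer: the larger count per feature, summed over all features.
    total = sum(max(h1[k], h2[k]) for k in keys)
    if total == 0:
        return 0.0  # two empty documents are treated as identical
    difference = sum(abs(h1[k] - h2[k]) for k in keys)
    return difference / total


# Identifier occurrences in two documents (illustrative values).
identifiers_doc1 = ["r", "x", "x"]
identifiers_doc2 = ["r", "y"]
print(relative_histogram_distance(identifiers_doc1, identifiers_doc2))
```

The same function can be applied separately to identifiers, numbers, and operators, or to the concatenation of all three feature lists for a combined score.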
25. Extension to Math-based Retrieval Process
• Candidate Retrieval using Elasticsearch
• Detailed Analysis using Pattern Analysis for Identifiers
[Figure: Retrieval process. The user submits input document(s). Preprocessing extracts full text, MathML formulae, and citations & references, and derives text fingerprints and math. identifiers (list & histogram). Indexing writes a unified document (e.g., Doc. ID=78) to the data storage. Candidate Retrieval yields candidate documents, Detailed Analysis yields similar documents, followed by Human Inspection.]
26. Results
• Math-based approach
• candidate retrieval step reduced recall (R=0.7)
• detailed pattern analysis increased MRR to 0.93
• Combined analysis (math, citation, text) could identify suspicious
similarity for all cases (R=1, MRR=1)
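Recall (R) and Mean Reciprocal Rank (MRR) as reported above can be computed from the rank at which the true source of each test case is retrieved. The following is a small sketch with invented ranks, not the study’s data; a rank of None is an assumption used here to mark a source that was not retrieved.

```python
def recall(ranks):
    """Share of test cases whose source document was retrieved at all."""
    return sum(rank is not None for rank in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all cases; missed cases contribute 0."""
    return sum(1.0 / rank for rank in ranks if rank is not None) / len(ranks)


# Hypothetical ranks of the true source for four test cases.
example_ranks = [1, 1, 2, None]
print(recall(example_ranks), mean_reciprocal_rank(example_ranks))
```

R=1 and MRR=1 together mean that every source was retrieved and every source was ranked first.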
27. Analyzing Images
N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based
Plagiarism Detection Approach,” in Proc. ACM/IEEE-CS Joint Conf. on Digital Libraries (JCDL), 2018.
N. Meuschke, V. Stange, M. Schubotz, and B. Gipp, “HyPlag: A Hybrid Approach to Academic Plagiarism
Detection,” in Proc. ACM SIGIR Conf., 2018.
28. Idea of Image-based Plagiarism Detection
• Images in academic documents convey much semantic
information in compressed format independent of the text
• Much research on Content-based Image Retrieval (CBIR)
• Little adaptation of CBIR methods to plagiarism detection (PD)
• exact and cropped image copies
• affinely transformed images (scaling, rotation, projection)
• slight alterations of appearance (blurring, lower resolution, noise)
29. Research Gap
• Current image-based PD approaches problematic for:
• compound images
• rearranged images
• images mostly containing text (typically tables inserted as figures)
• visually differing, semantically equivalent data visualizations
• Goal: image-based PD process that:
• combines established and new analysis methods to cover
heterogeneous images in academic documents
• adaptively applies suitable analysis steps
• flexibly quantifies suspiciousness
• is extensible in the future
30. Process
[Figure: Image-based detection process. Input doc. → extract image → decompose image → classify image → analysis methods (perceptual hashing, ratio hashing, and OCR followed by k-gram text matching and positional text matching) → distance calculation against the reference DB (DpHash, DrHash, DkTM, DposTM) → outlier detection: s(Dm)>r → potential source images.]
31. Perceptual Hashing
• Efficient CBIR method to reliably find near image copies
• Uses most apparent visual features in images
• Creates non-unique fingerprints that can be compared
• Fingerprints are invariant to:
• scaling
• aspect ratio changes
• changes to brightness, contrast and colors
• We use the Discrete Cosine Transform and Hamming distance
Image Source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e
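The combination of a Discrete Cosine Transform and the Hamming distance can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the talk: the image is assumed to be an already downscaled grayscale matrix, only a small block of low-frequency coefficients is kept, and all function names are hypothetical.

```python
import math


def dct_1d(values):
    """Unnormalized DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_2d(pixels):
    """Separable 2D DCT-II: transform the rows first, then the columns."""
    rows = [dct_1d(row) for row in pixels]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def perceptual_hash(pixels, keep=4):
    """Bit list: low-frequency coefficients compared against their median."""
    coeffs = dct_2d(pixels)
    # Keep the top-left keep x keep block and drop the DC term (index 0),
    # which only encodes the average brightness.
    low = [coeffs[y][x] for y in range(keep) for x in range(keep)][1:]
    median = sorted(low)[len(low) // 2]
    return [1 if c > median else 0 for c in low]


def hamming_distance(hash_1, hash_2):
    """Number of bit positions in which two hashes differ."""
    return sum(b1 != b2 for b1, b2 in zip(hash_1, hash_2))


# An 8x8 grayscale test pattern and a copy with doubled contrast.
image = [[(x * y + 3 * x) % 17 for x in range(8)] for y in range(8)]
doubled = [[2 * p for p in row] for row in image]
print(hamming_distance(perceptual_hash(image), perceptual_hash(doubled)))
```

Doubling all pixel values scales every coefficient and the median by the same factor, so the two fingerprints are identical. Resizing the image to a fixed size before hashing, which provides the invariance to scaling listed above, is omitted here.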
32. k-gram Text Matching
• To identify tables inserted as figures
and images with little visual similarity
• Text extracted using open source
OCR engine Tesseract
• Granularity:
• character 3-grams
• no chunk selection
• Similarity measure $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$
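The character 3-gram comparison can be sketched with plain set operations. The sketch assumes that ⊖ denotes the symmetric difference of the two k-gram sets and that a pair without any shared k-gram gets an infinite distance; both choices and the function names are assumptions made for illustration.

```python
def char_kgrams(text, k=3):
    """Set of all character k-grams of a string (no chunk selection)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def kgram_distance(text_1, text_2, k=3):
    """Size of the symmetric difference divided by the size of the intersection."""
    k1, k2 = char_kgrams(text_1, k), char_kgrams(text_2, k)
    shared = k1 & k2
    if not shared:
        return float("inf")  # assumption: no overlap means maximal distance
    return len(k1 ^ k2) / len(shared)


# OCR output of two images (illustrative strings).
print(kgram_distance("abcd", "abce"))
```

In this example the strings share one 3-gram and differ in two, so the distance is 2.0. Identical strings yield a distance of 0.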
33. Position-aware Text Matching
• To account for typically small amount of text in images aggravated
by OCR errors
• Process:
• Scale images to same height (here: 800px)
• Define proximity region around identified text (here 50px circle)
• Project proximity regions of input image to potential source
• Only consider matching characters in projected proximity regions
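$s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$
[Figure: Input image (width $w_1$) and reference image (width $w_2$), both scaled to the same height ($h_1 = h_2 = 800$px), with proximity regions of radius $r = 25$px around the recognized characters. Legend: positional character match, positional character mismatch.]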
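The process above can be sketched for characters that OCR has already located. The sketch assumes each character is given as a (character, x, y) tuple, uses a greedy one-to-one assignment of matches, and treats the 50px circle as a radius of 25px; the tuple format, the greedy matching, and all names are assumptions rather than details from the talk.

```python
import math


def scale_to_height(chars, image_height, target_height=800):
    """Rescale character positions as if the image had the target height."""
    factor = target_height / image_height
    return [(c, x * factor, y * factor) for c, x, y in chars]


def positional_similarity(chars_1, height_1, chars_2, height_2, radius=25):
    """Share of characters matched inside a proximity region, relative to
    the larger of the two character counts."""
    a = scale_to_height(chars_1, height_1)
    b = scale_to_height(chars_2, height_2)
    unused = list(b)
    matches = 0
    for c, x, y in a:
        for candidate in unused:
            c2, x2, y2 = candidate
            # Same character and within the projected proximity region.
            if c == c2 and math.hypot(x - x2, y - y2) <= radius:
                matches += 1
                unused.remove(candidate)  # each character matches at most once
                break
    longest = max(len(a), len(b))
    return matches / longest if longest else 0.0


# Illustrative OCR results: (character, x, y) in pixels.
input_chars = [("A", 100, 100), ("B", 300, 120), ("D", 200, 500)]
reference_chars = [("A", 52, 50), ("X", 150, 60), ("D", 100, 400)]
score = positional_similarity(input_chars, 800, reference_chars, 400)
print(score)
```

Here only "A" lands in the same region of both images. "D" occurs in both images but at different positions, so it counts as a positional mismatch and the score is 1/3.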
34. Ratio Hashing
• First approach to target reuse of data (and its visualization)
• identifies equivalent, yet visually differing bar charts
[Figure: Two visually differing bar charts (y-axis from 0 to 900) with identical relative bar heights 1.00, 0.80, 0.61, 0.44, 0.30, 0.07.]
$d = |1.00 - 1.00| + |0.80 - 0.80| + |0.61 - 0.61| + |0.44 - 0.44| + |0.30 - 0.30| + |0.07 - 0.07| = 0.00$
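The ratio hash comparison can be sketched as follows, assuming the bar heights have already been determined and are positive. Bars are sorted by decreasing height, divided by the tallest bar, and compared bar by bar. The function names are illustrative.

```python
def ratio_hash(bar_heights):
    """Relative bar heights, sorted in decreasing order."""
    ordered = sorted(bar_heights, reverse=True)
    tallest = ordered[0]
    return [height / tallest for height in ordered]


def ratio_hash_distance(heights_1, heights_2):
    """Sum of bar-wise absolute differences of the relative heights."""
    if len(heights_1) != len(heights_2):
        raise ValueError("charts must have the same number of bars")
    hash_1, hash_2 = ratio_hash(heights_1), ratio_hash(heights_2)
    return sum(abs(a - b) for a, b in zip(hash_1, hash_2))


# The same data drawn at half the scale and in a different bar order.
chart_1 = [900, 720, 549, 396, 270, 63]
chart_2 = [31.5, 135, 450, 198, 360, 274.5]
print(ratio_hash(chart_1))
print(ratio_hash_distance(chart_1, chart_2))
```

Because only relative heights are compared, rescaling the axis or reordering the bars does not change the distance.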
35. Outlier Detection
• To quantify suspiciousness of method-specific distance scores
• Two assumptions:
• image only suspicious if comparably high similarity (small distance)
to small set (c=9) of other images
• clear separation of distance scores of highly similar set of images
[Figure: Absolute distance scores $D_m$ in ascending order: 0, 25, 33, 40, 80, 90. Relative deltas of distance scores $D'_m$: 0.3, 0.2, 1, 0.1, e.g., $d'_i = (80 - 40) / 40 = 1$. The list is split at $d'_k \geq 1$ into $D'_{m,1}$ (outliers considered as potential sources) and $D'_{m,2}$ (images considered as unrelated). Condition for list split: $k \leq c$.]
36. Outlier Detection Continued
• Find outlier group:
• split list of relative distance deltas if a distance is at least twice as
large as its predecessor (3x as large for k-gram matching)
• Score is suspicious (𝑠 ≥ 0.5) if the least similar outlier has a distance
margin to the rest of the collection that is twice as large as the outlier’s
distance to the input image
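Finding the outlier group can be sketched as a scan over the sorted distance scores. The sketch assumes that a predecessor distance of 0 never triggers a split and that an oversized group is rejected by returning an empty list. It does not compute the suspiciousness score s, whose formula is not given on the slides, and the function name is hypothetical.

```python
def find_outlier_group(distances, factor=2.0, c=9):
    """Return the group of smallest distances that is clearly separated
    from the rest, or an empty list if no valid group exists."""
    ordered = sorted(distances)
    for k in range(1, len(ordered)):
        previous, current = ordered[k - 1], ordered[k]
        # Split where a distance is at least `factor` times its predecessor
        # (factor 2 by default, 3 for k-gram matching).
        if previous > 0 and current >= factor * previous:
            # Accept the group only if it has at most c members.
            return ordered[:k] if k <= c else []
    return []


# Absolute distance scores from the slide.
scores = [90, 0, 40, 25, 80, 33]
print(find_outlier_group(scores))
```

For the scores on the slide, the jump from 40 to 80 is the first one that doubles its predecessor, so the four closest images form the outlier group.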
37. Evaluation
• Source for test images: VroniPlag collection
• crowd-sourced effort investigating plagiarism allegations
• 196 manually examined academic works (mostly PhD theses)
• most allegations confirmed by responsible universities
• Targeted crawl for all annotated ‘fragments’ containing images
• confirmed by at least two examiners
• Selection of 15 representative cases (mostly from life sciences)
• Cases embedded in 4,500 images obtained from PubMed Central
38. Example: Near Copies
Source Image Reused Image
39. Example: Weak Alteration
Source Image Reused Image
42. Results
• Suspicious scores (𝑠 ≥ 0.5) for 11 of 15 cases computed by at
least one of the methods (𝑅 = 0.73)
• Outlier detection effective (𝑃 = 1):
• For all input images with 𝑠 ≥ 0.5, true source image at the top rank
• For all input images with 𝑠 < 0.5, no source image retrieved among
the top-ten most similar images, i.e. no false positives
• Perceptual hashing with sub-image extraction worked best for
near copies and weakly altered images (found 6 of 9 cases)
• Text analysis performed better than perceptual hashing for
moderately and strongly altered images
• if quality of the image was high enough to perform OCR reliably and
sufficient text content is present.
43. Results Continued
• Text analysis approaches identified 3 of 4 cases involving tables
• position-aware text matching more robust to low OCR quality
• k-gram matching identified more cases
• combination of approaches allows processing more images
• Dataset contained only one bar chart, for which ratio hashing
yielded extremely suspicious score (𝑠 = 0.92)
Source Image Reused Image
44. Discussion & Conclusion
• Image-based PD promising complement to other methods
• Small test collection, but restrictive outlier detection procedure
will prevent false positives also in larger collections
• if reduced precision is acceptable, threshold can be changed
interactively by user
• Approach well suited for scaling
• Preprocessing in parallel
• Options described to scale analysis methods
• Approach easily extensible with new methods
• New input scores to outlier detection
• Code: www.purl.org/imagepd
45. Future Work
• More detection methods tailored to specific data visualizations
• Scale the process
• parallelization of preprocessing
• candidate selection for feature descriptors
• Realize hybrid process
[Figure: Hybrid detection process from start to end with the stages candidate retrieval (heuristics), detailed comparison, post processing, visualization, and human inspection. Components shown: text snippets, fuzzy citation patterns, mathematical formulae, image similarity, full text similarity, citation patterns, semantic similarity, cross-lingual similarity. Legend: completed research, current research, future research.]
47. Questions?
Norman Meuschke
n@meuschke.org | @normeu
• Slides for this talk (and other talks):
www.slideshare.net/GroupGipp
• Contact, publications, other projects:
www.isg.uni.kn
• Code: www.github.com/ag-gipp
Editor's Notes
aside from me, 10 other PhD students and one postdoc
The research we report on in the paper is on improving the detection of academic plagiarism, which we define as …
not just copied text, but any substantial intellectual contribution
Academic plagiarism occurs in a number of forms, which can be broadly categorized by their degree of obfuscation, but are not mutually exclusive
Extensive research has been conducted on plagiarism detection methods, particularly for finding text plagiarism.
specifically
Given the inherent limitation of purely text-based analysis methods, our research focusses on
complementing successful text-based analysis methods with methods that analyze nontextual content features.
A while ago, Bela Gipp had the idea that academic citations fulfill all these criteria, so
Features: identifiers (ci), numbers (cn), operators (co), feature combination
Descriptors: frequency histograms of feature occurrence
Granularity: i) entire document, ii) document partitions
Similarity measure: relative distance of feature occurrence frequencies for individual features ($d_{ci}$, $d_{cn}$, $d_{co}$) and combination of all features ($D$)
Analyzing identifiers worked best
8 of 10 test cases at top rank MRR = 0.86
Analyzing operators and numbers generally noisy
However, for partitions including operators and numbers in aggregated distance performs better (7/10) than identifiers (5/10)
With these requirements in mind, we devised the following image-based detection process.
accepts input documents in PDF format
extracts images from the PDF
decomposes compound images, which is a weakness of prior works on IBPD
reduces computational load by classifying images using CNN
The core of our process are currently 4 analysis methods, applied independently
- methods compute method-specific feature descriptors and compare them to the stored descriptors for all images in the collection
- comparisons yield separate lists of distance scores for each method
- lists are input to outlier detection process
- returns potential sources for an image
also weakly altered = near copies for image sections
Hamming dist. = number of bits that differ
typically word 3-5 grams, i.e. 15-30 characters
here: finer granularity to account for
smaller amount of text
potential recognition errors
OCR turned out to be a real problem in our experiments
Idea: only consider text matches that occur in roughly the same regions of the pictures
other shapes and dynamic sizing of the shape, e.g., dependent on the length of the text fragment, are also possible
normalization of sim score reflects the assumption that two images are less likely to be similar if their amount of textual content differs strongly.
Process:
determine bar heights (details see paper)
sort bars by height in decreasing order
compute relative bar heights, i.e., $h_i / h_{\max}$
ratio hash = bar-wise difference
Only consider charts with same number of bars (min 3) to reduce computational effort, can be changed
First assumption basically a heuristic filter for false positives, e.g.
common images like logos that were not excluded in preprocessing (small distance for many images)
multiple versions of same document in collection
Second assumption assures outlier group
Store absolute distance scores in ascending order of distance
Compute list of relative distance deltas
Scan through list of relative dist. deltas to find distance at least twice as large as its predecessor
Check whether sublist does not contain more than 9 elements
share large majority of their visual content
exhibit minor differences introduced by
i) removing non-essential content (e.g., numeric labels or watermarks)
ii) cropping or padding
iii) performing affine transformations (e.g., scaling or rotation)
iv) changing the resolution, contrast or color space.
Especially iii) and iv) can be introduced inadvertently by
extracting and reusing images from a PDF or printed document.
typically reuse parts of an original image as near copies
typically reuse most or all the visual components of the original image, yet rearrange the components
Gap between elements
Three instead of two elements in last row
typically redrawn versions of the source with changes made to the arrangement
and/or visual appearance of image components
shape of the curves in main and enlarged view
placement of the enlarged view
brackets used