An Adaptive Image-based Plagiarism Detection Approach

N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, B. Gipp
Norman Meuschke
Information Science Group
University of Konstanz
www.isg.uni.kn
An Adaptive Image-based Plagiarism Detection Approach - Meuschke et al.

University of Konstanz

Outline
• Overview of Research on Academic Plagiarism Detection
• Image-based Plagiarism Detection Approach
• Evaluation Results

Academic Plagiarism
“The use of ideas, concepts, words, or structures
without appropriately acknowledging the source to
benefit in a setting where originality is expected.”
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that
transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.

Plagiarism Forms
Note: plagiarism forms are not mutually exclusive
Paraphrasing
▪ intentional rewriting
▪ no or insufficient reference to the source
Structural and idea plagiarism
▪ little or no verbatim text overlap
Cross-language plagiarism
▪ manual/automated conversion of text into
other language to hide its origin
Copy & paste
▪ taking content verbatim from other source
Shake & paste
▪ copy & paste of text segments with slight
adjustments, e.g., synonym substitutions
Technical disguise
▪ techniques that exploit weaknesses of
current detection methods
Level of obfuscation: weak to strong

Detection Capabilities
Copy & paste, Shake & paste (solved)
▪ n-gram fingerprinting
▪ vector space models
▪ text alignment
▪ exhaustive string matching
Technical disguise (solvable, no research needed)
▪ encoding checks
▪ checks for textual content
▪ checks for large images
Paraphrasing, Structural and idea plagiarism
▪ synonym expansion (WordNet)
▪ Semantic Role Labeling
▪ Latent Semantic Analysis
▪ POS-aware text matching
Cross-language plagiarism
▪ CL Character N-Gram Comp.
▪ CL Explicit Semantic Analysis
▪ CL Alignment-based Similarity Analysis
Paraphrasing, structural and idea, and cross-language plagiarism:
- intense research
- methods limited by text-based candidate retrieval (R approx. 0.8 for
moderate disguise)
Level of obfuscation: weak to strong

Our Research
• Combine analysis of textual and non-textual content features
[Figure: detection process from start to end with the stages candidate retrieval, detailed comparison, post-processing, and human inspection; further labels: heuristics, visualization. Features shown: full-text similarity, text snippets, semantic similarity, cross-lingual similarity, citation patterns, fuzzy citation patterns, mathematical formulae, image similarity. Legend: completed, current, and future research.]

Idea of Image-based Plagiarism Detection
• Images in academic documents convey much semantic
information in compressed format independent of the text
• Much research on Content-based Image Retrieval (CBIR)
• Little adaptation of CBIR methods to plagiarism detection (PD); existing approaches handle:
• exact and cropped image copies
• affinely transformed images (scaling, rotation, projection)
• slight alterations of appearance (blurring, lower resolution, noise)

Research Gap
• Current image-based PD approaches problematic for:
• compound images
• rearranged images
• images mostly containing text (typically tables inserted as figures)
• visually differing, semantically equivalent data visualizations
• Goal: image-based PD process that:
• combines established and new analysis methods to cover
heterogeneous images in academic documents
• adaptively applies suitable analysis steps
• flexibly quantifies suspiciousness
• is extensible in the future

Process
• Input document → extract images → decompose images → classify images
• Analysis methods: perceptual hashing, ratio hashing, and OCR followed by k-gram text matching and positional text matching
• Distance calculation against the reference DB: D_pHash, D_rHash, D_kTM, D_posTM
• Outlier detection: s(D_m) > r
• Output: potential source images

Perceptual Hashing
• Efficient CBIR method to reliably find near image copies
• Uses most apparent visual features in images
• Creates non-unique fingerprints that can be compared
• Fingerprints are invariant to:
• scaling
• aspect ratio changes
• changes to brightness, contrast and colors
• We use Discrete Cosine Transform and Hamming distance
Image Source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e

k-gram Text Matching
• To identify tables inserted as figures
and images with little visual similarity
• Text extracted using open source
OCR engine Tesseract
• Granularity:
• character 3-grams
• no chunk selection
• Similarity measure (distance): $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$
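The k-gram distance can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that K1 and K2 are plain sets (not multisets) of character 3-grams, that the OCR text is already available as a string, and that an empty intersection maps to an infinite distance. The function names `char_kgrams` and `kgram_distance` are hypothetical.

```python
def char_kgrams(text, k=3):
    """Return the set of all character k-grams of `text` (no chunk selection)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def kgram_distance(text_a, text_b, k=3):
    """Size of the symmetric difference divided by the size of the intersection."""
    a, b = char_kgrams(text_a, k), char_kgrams(text_b, k)
    shared = a & b
    if not shared:
        # Assumption: images without any common k-gram are maximally distant.
        return float("inf")
    return len(a ^ b) / len(shared)


if __name__ == "__main__":
    print(kgram_distance("abcd", "abce"))
```

Identical strings yield a distance of 0, and the value grows as fewer k-grams are shared.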

Position-aware Text Matching
• To account for typically small amount of text in images aggravated
by OCR errors
• Process:
• Scale images to same height (here: 800px)
• Define proximity region around identified text (here 50px circle)
• Project proximity regions of input image to potential source
• Only consider matching characters in projected proximity regions
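• Similarity score: $s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$
[Figure: characters of the input image projected onto the reference image (example characters A, B, C, D, X); legend: positional character match, positional character mismatch; image widths w_1 and w_2, heights h_1 = h_2 = 800 px, proximity radius r = 25 px]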
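The position-aware matching can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each OCR result is assumed to be a list of `(character, x, y)` tuples in pixel coordinates, each reference character may be matched at most once, and the function name `positional_similarity` is hypothetical. The target height of 800 px and the radius of 25 px follow the values given on the slide.

```python
from math import hypot


def positional_similarity(chars_a, height_a, chars_b, height_b,
                          target_height=800.0, radius=25.0):
    """Share of characters that reappear at roughly the same scaled position."""
    def scaled(chars, height):
        factor = target_height / height  # scale both images to the same height
        return [(c, x * factor, y * factor) for c, x, y in chars]

    a, b = scaled(chars_a, height_a), scaled(chars_b, height_b)
    if not a or not b:
        return 0.0
    unused = list(b)  # assumption: a reference character is matched only once
    matches = 0
    for c, x, y in a:
        for i, (c2, x2, y2) in enumerate(unused):
            # Match only identical characters inside the proximity region.
            if c == c2 and hypot(x - x2, y - y2) <= radius:
                matches += 1
                del unused[i]
                break
    return matches / max(len(a), len(b))


if __name__ == "__main__":
    input_chars = [("A", 100, 100), ("B", 200, 100), ("C", 100, 300), ("D", 300, 300)]
    reference_chars = [("A", 50, 50), ("D", 150, 150), ("X", 50, 150), ("B", 300, 50)]
    print(positional_similarity(input_chars, 800, reference_chars, 400))
```

In the usage example, A and D match after scaling, C meets a different character (X), and B lies outside its proximity region, so two of four characters match.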

Ratio Hashing
• First approach to target reuse of data (and its visualization)
• identifies equivalent, yet visually differing bar charts
[Figure: two visually differing bar charts (y-axes from 0 to 900) showing the same data; the relative bar heights in both charts are 1.00, 0.80, 0.61, 0.44, 0.30, 0.07]
$d = |1.00 - 1.00| + |0.80 - 0.80| + |0.61 - 0.61| + |0.44 - 0.44| + |0.30 - 0.30| + |0.07 - 0.07| = 0.00$
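The ratio hash from the example can be reproduced with a short sketch. It rests on assumptions inferred from the figure rather than stated on the slide: bar heights are sorted in descending order and divided by the tallest bar, the distance is the sum of absolute differences, and hashes of unequal length are padded with zeros. The names `ratio_hash` and `ratio_distance` are hypothetical.

```python
def ratio_hash(bar_heights):
    """Relative bar heights, tallest first (assumes at least one positive height)."""
    tallest = max(bar_heights)
    return [h / tallest for h in sorted(bar_heights, reverse=True)]


def ratio_distance(hash_a, hash_b):
    """Sum of absolute differences; the shorter hash is zero-padded (assumption)."""
    length = max(len(hash_a), len(hash_b))
    a = hash_a + [0.0] * (length - len(hash_a))
    b = hash_b + [0.0] * (length - len(hash_b))
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    chart_1 = [900, 720, 549, 396, 270, 63]        # pixel heights in one chart
    chart_2 = [31.5, 450, 135, 274.5, 198, 360]    # same data, other scale and order
    print([round(r, 2) for r in ratio_hash(chart_1)])
    print(ratio_distance(ratio_hash(chart_1), ratio_hash(chart_2)))
```

Because only ratios are compared, rescaling or reordering the bars does not change the distance.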

Outlier Detection
• To quantify suspiciousness of method-specific distance scores
• Two assumptions:
• image only suspicious if comparably high similarity (small distance)
to small set (c=9) of other images
• clear separation of distance scores of highly similar set of images
[Figure: absolute distance scores D_m on an axis (0, 25, 33, 40, 80, 90) and relative deltas of distance scores D'_m (0.3, 0.2, 0.1), e.g., d'_i = (80 − 40) / 40 = 1; the list is split into D'_{m,1} (outliers considered as potential sources) and D'_{m,2} (images considered as unrelated) where d'_k ≥ 1; condition for list split: k ≤ c]

Outlier Detection Continued
• Find outlier group:
• split list of relative distance deltas if a distance is at least twice as
large as its predecessor (3x as large for k-gram matching)
• Score suspicious (𝑠 ≥ 0.5 ) if least similar outlier has distance
margin to collection that is twice as large as outlier’s distance to the
input image
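The list split and the suspiciousness condition can be sketched as below. This is an interpretation of the two slides, not the authors' code: the slides do not give the formula for the score s, so the sketch only checks the stated threshold condition (margin at least twice the outlier's distance). The function names and the handling of a zero distance are assumptions.

```python
def split_outliers(distances, c=9, factor=2.0):
    """Split sorted distances at the first jump of at least `factor`.

    Returns (outliers, rest). The outlier group may hold at most `c` images;
    `factor` would be 3.0 for k-gram matching according to the slide.
    """
    ranked = sorted(distances)
    for k in range(1, min(c, len(ranked) - 1) + 1):
        prev, nxt = ranked[k - 1], ranked[k]
        # Equivalent to a relative delta (nxt - prev) / prev >= factor - 1,
        # written without division so that prev == 0 is handled as well.
        if nxt > prev and nxt >= factor * prev:
            return ranked[:k], ranked[k:]
    return [], ranked


def is_suspicious(outliers, rest):
    """True if the margin to the collection is at least twice the distance
    of the least similar outlier to the input image."""
    if not outliers or not rest:
        return False
    margin = rest[0] - outliers[-1]
    return margin >= 2 * outliers[-1]


if __name__ == "__main__":
    outliers, rest = split_outliers([25, 33, 40, 80, 90])
    print(outliers, rest, is_suspicious(outliers, rest))
```

For the distances from the figure, the split falls between 40 and 80, but the margin of 40 is smaller than twice the outlier distance, so this group would not reach the threshold.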

Evaluation
• Source for test images: VroniPlag collection
• crowd-sourced effort investigating plagiarism allegations
• 196 manually examined academic works (mostly PhD theses)
• most allegations confirmed by responsible universities
• Targeted crawl for all annotated ‘fragments’ containing images
• confirmed by at least two examiners
• Selection of 15 representative cases (mostly from life sciences)
• Cases embedded in 4,500 images obtained from PubMed Central

Example: Near Copies
Source Image Reused Image
Source: http://de.vroniplag.wikia.com/wiki/Dsa/014

Example: Weak Alteration
Source: http://de.vroniplag.wikia.com/wiki/Ry/073

Example: Moderate Alteration
Source: http://de.vroniplag.wikia.com/wiki/Ab/017

Example: Strong Alteration
Source: http://de.vroniplag.wikia.com/wiki/Ad/068

Results
• Suspicious scores (𝑠 ≥ 0.5) for 11 of 15 cases computed by at
least one of the methods (𝑅 = 0.73)
• Outlier detection effective (𝑃 = 1):
• For all input images with 𝑠 ≥ 0.5, true source image at the top rank
• For all input images with 𝑠 < 0.5, no source image retrieved among
the top-ten most similar images, i.e. no false positives
• Perceptual hashing with sub-image extraction worked best for
near copies and weakly altered images (found 6 of 9 cases)
• Text analysis performed better than perceptual hashing for
moderately and strongly altered images
• if quality of the image was high enough to perform OCR reliably and
sufficient text content is present.

Results Continued
• Text analysis approaches identified 3 of 4 cases involving tables
• position-aware text matching more robust to low OCR quality
• k-gram matching identified more cases
• combination of approaches allows processing more images
• Dataset contained only one bar chart, for which ratio hashing
yielded a highly suspicious score (𝑠 = 0.92)
Source: http://de.vroniplag.wikia.com/wiki/Cz/047

Discussion & Conclusion
• Image-based PD promising complement to other methods
• Small test collection, but restrictive outlier detection procedure
will prevent false positives also in larger collections
• if reduced precision is acceptable, threshold can be changed
interactively by user
• Approach well suited for scaling
• Preprocessing in parallel
• Options described to scale analysis methods
• Approach easily extensible with new methods
• New input scores to outlier detection
• Code: www.purl.org/imagepd

Future Work
• More detection methods tailored to specific data visualizations
• Scale the process
• parallelization of preprocessing
• candidate selection for feature descriptors
• Realize hybrid process
[Figure: detection process from start to end with the stages candidate retrieval, detailed comparison, post-processing, and human inspection; features include full-text similarity, citation patterns, mathematical formulae, and image similarity; legend: completed, current, and future research]

Questions?
Norman Meuschke
n@meuschke.org
• Code:
www.purl.org/imagepd
• Contact, publications, other projects:
www.isg.uni.kn

Image Extraction & Decomposition
• Extraction:
• poppler framework
• convert to JPEG
• discard images smaller than 7.5 KB (typically logos)
• Decomposition:
• assume white pixels separate sub-images
• assume rectangular sub-images aligned horizontally or vertically
• tradeoff (images remain analyzable if decomposition fails)
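The two decomposition assumptions (white pixels separate sub-images, sub-images are axis-aligned rectangles) can be illustrated with a much simpler stand-in than the actual procedure. The sketch below splits an image along fully white rows and columns; it is a projection-based simplification, not the flood-fill and contour-based process used in the approach. The input format (rows of grayscale integers), the white threshold of 250, and the function names are assumptions.

```python
def split_runs(is_blank):
    """Return (start, end) index pairs of consecutive non-blank entries."""
    runs, start = [], None
    for i, blank in enumerate(is_blank):
        if not blank and start is None:
            start = i
        elif blank and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(is_blank)))
    return runs


def decompose(gray, white=250):
    """Return sub-image boxes as (top, bottom, left, right), end-exclusive."""
    boxes = []
    blank_rows = [all(p >= white for p in row) for row in gray]
    for top, bottom in split_runs(blank_rows):
        band = gray[top:bottom]
        # Within each horizontal band, split along fully white columns.
        blank_cols = [all(row[x] >= white for row in band)
                      for x in range(len(gray[0]))]
        for left, right in split_runs(blank_cols):
            boxes.append((top, bottom, left, right))
    return boxes


if __name__ == "__main__":
    w = 255
    image = [
        [w, w, w, w, w, w, w],
        [w, 0, 0, w, 0, 0, w],
        [w, 0, 0, w, 0, 0, w],
        [w, w, w, w, w, w, w],
        [w, 0, 0, 0, 0, 0, w],
    ]
    print(decompose(image))
```

If no white gap exists, the whole image is returned as a single box, which mirrors the stated tradeoff that images remain analyzable when decomposition fails.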

Decomposition
• Process:
• conversion to grayscale to reduce runtime
• padding with white pixels to remove a potential border
• binarization using adaptive thresholding to obtain a b/w image
• dilation to ensure black pixels are connected
• flood fill of white areas with black pixels
• subtract original image
• invert image
• blob detection using the algorithm of Suzuki and Abe [1]
• estimate bounding box by looking for large contours aligned along
the image axes
• crop and store the identified sub-images
[1] Satoshi Suzuki and Keiichi Abe. 1985. Topological Structural Analysis of Digitized Binary Images by
Border Following. CVGIP 30, 1 (1985).

Image Classification
• Deep CNN realized using Caffe and AlexNet architecture [2]
• CNN classifies images into:
• photographs (pHash only)
• bar charts (ratio hashing only)
• other image types (pHash and OCR text matching)
• Manual checks of 100 classified images
• Accuracy 0.92 for photographs and 1.00 for bar charts
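The adaptive selection of analysis steps per image class amounts to a lookup from the predicted class to the methods to run. The sketch below only illustrates that mapping; the class labels and method identifiers are hypothetical names, and falling back to the "other" class for unknown labels is an assumption.

```python
# Hypothetical identifiers for image classes and analysis methods.
METHODS_BY_CLASS = {
    "photograph": ("phash",),
    "bar_chart": ("ratio_hash",),
    "other": ("phash", "kgram_text_matching", "positional_text_matching"),
}


def select_methods(image_class):
    """Return the analysis methods to apply for a predicted image class."""
    # Assumption: unknown labels are treated like the generic "other" class.
    return METHODS_BY_CLASS.get(image_class, METHODS_BY_CLASS["other"])


if __name__ == "__main__":
    for label in ("photograph", "bar_chart", "table"):
        print(label, select_methods(label))
```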

Perceptual Hashing
• Process:
• Reduce size to 32x32 pixels
• Convert to grayscale
• Compute 32x32 Discrete Cosine Transform (DCT)
• Reduce DCT to 8x8 for lowest frequencies
• Compute average DCT value
• Binarize 64 DCT values (8x8) to a 64-bit integer depending on mean DCT
value (1 = above mean, 0 = below mean)
• Similarity measure: Hamming distance
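The steps above can be sketched in pure Python. This is a minimal illustration under assumptions: the image is already downscaled to 32x32 and converted to grayscale (given as rows of numbers), the DCT is the orthonormal DCT-II, and the mean is taken over all 64 retained coefficients, as the slide states. The function names are hypothetical.

```python
import math

N = 32  # side length of the downscaled grayscale image
K = 8   # side length of the low-frequency DCT block that is kept


def dct_low_freq(gray):
    """Return the K x K lowest-frequency coefficients of the 2D DCT-II."""
    basis = [[math.cos((2 * x + 1) * u * math.pi / (2 * N)) for x in range(N)]
             for u in range(K)]
    scale = [math.sqrt(1 / N)] + [math.sqrt(2 / N)] * (K - 1)
    coeffs = []
    for v in range(K):          # vertical frequency
        row = []
        for u in range(K):      # horizontal frequency
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += gray[y][x] * basis[u][x] * basis[v][y]
            row.append(scale[u] * scale[v] * total)
        coeffs.append(row)
    return coeffs


def phash(gray):
    """64-bit fingerprint: bit is 1 if a coefficient lies above the mean."""
    flat = [c for row in dct_low_freq(gray) for c in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for c in flat:
        bits = (bits << 1) | (1 if c > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    image = [[(7 * x + 13 * y) % 256 for x in range(N)] for y in range(N)]
    darker = [[p * 0.5 for p in row] for row in image]
    print(hamming(phash(image), phash(darker)))
```

Since the DCT is linear, uniformly scaling all pixel values scales every coefficient and the mean by the same factor, so the fingerprint of the darker image in the usage example is unchanged.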

Extraction of Bar Heights
• Process:
• convert to grayscale
• binarize using global threshold to obtain b/w image (sharp contours)
• pad image with white pixels to ensure bars can be filled
• clean artifacts of black pixels using a threshold on the relative area
covered by the pixels
• remove image border
• flood fill with black pixels and invert
• find candidates for bars by determining the lengths of all vertical lines
of black pixels
• determine bars by clustering vertical lines
• remove noise from whiskers, labels, and legend entries
• assume the average height of the lines in a cluster as the bar height
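The last steps (vertical line lengths, clustering, averaging) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes an already binarized and cleaned image given as rows of 0/1 values (1 = black), clusters directly adjacent columns, and drops clusters narrower than `min_width` as a crude stand-in for removing whiskers and labels. The function name and `min_width` are hypothetical.

```python
def bar_heights(binary, min_width=2):
    """Estimate bar heights from a binary image (rows of 0/1, 1 = black)."""
    height, width = len(binary), len(binary[0])

    # Length of the longest vertical run of black pixels in each column.
    column_runs = []
    for x in range(width):
        best = run = 0
        for y in range(height):
            run = run + 1 if binary[y][x] else 0
            best = max(best, run)
        column_runs.append(best)

    # Cluster adjacent columns that contain black pixels; the trailing 0
    # acts as a sentinel that closes the last cluster.
    bars, cluster = [], []
    for length in column_runs + [0]:
        if length > 0:
            cluster.append(length)
        else:
            if len(cluster) >= min_width:
                bars.append(sum(cluster) / len(cluster))  # average line length
            cluster = []
    return bars


if __name__ == "__main__":
    image = [
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0, 0, 0, 1],
        [0, 1, 1, 0, 1, 1, 0, 1],
        [0, 1, 1, 0, 1, 1, 0, 1],
    ]
    print(bar_heights(image))
```

In the usage example, the one-pixel-wide line in the last column is discarded, and the two remaining clusters give the bar heights.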
