An Adaptive Image-based Plagiarism Detection Approach
1.
N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, B. Gipp
An Adaptive Image-based
Plagiarism Detection Approach
Norman Meuschke
Information Science Group
University of Konstanz
www.isg.uni.kn
An Adaptive Image-based Plagiarism Detection Approach - Meuschke et al.
3.
Outline
• Overview of Research on Academic Plagiarism Detection
• Image-based Plagiarism Detection Approach
• Evaluation Results
4.
Academic Plagiarism
“The use of ideas, concepts, words, or structures
without appropriately acknowledging the source to
benefit in a setting where originality is expected.”
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that
transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
5.
Plagiarism Forms
Note: plagiarism forms are not mutually exclusive
Paraphrasing
▪ intentional rewriting
▪ no / insufficient reference to the source
Structural and idea plagiarism
▪ little or no verbatim text overlap
Cross-language plagiarism
▪ manual/automated conversion of text into
other language to hide its origin
Copy & paste
▪ taking content verbatim from another source
Shake & paste
▪ copy & paste of text segments with slight
adjustments, e.g., synonym substitutions
Technical disguise
▪ techniques that exploit weaknesses of
current detection methods
[Axis: level of obfuscation, from weak to strong]
6.
Detection Capabilities
Copy & paste, Shake & paste (status: solved)
▪ n-gram fingerprinting
▪ vector space models
▪ text alignment
▪ exhaustive string matching
Technical disguise (status: solvable, no research needed)
▪ encoding checks
▪ checks for textual content
▪ checks for large images
Paraphrasing, Structural and idea plagiarism
▪ synonym expansion (WordNet)
▪ Semantic Role Labeling
▪ Latent Semantic Analysis
▪ POS-aware text matching
Cross-language plagiarism
▪ CL Character N-Gram Comp.
▪ CL Explicit Semantic Analysis
▪ CL Alignment-based Similarity Analysis
Status for paraphrasing, structural and idea, and cross-language plagiarism:
▪ intense research
▪ methods limited by text-based candidate retrieval (R approx. 0.8 for moderate disguise)
[Axis: level of obfuscation, from weak to strong]
7.
Our Research
• Combine analysis of textual and non-textual content features
[Figure: overview of the research on the detection process. Labels: start, heuristics, candidate retrieval, detailed comparison, post processing, visualization, human inspection, end; full text similarity, text snippets, citation patterns, fuzzy citation patterns, semantic similarity, cross-lingual similarity, mathematical formulae, image similarity. Legend: completed research, current research, future research.]
8.
Idea of Image-based Plagiarism Detection
• Images in academic documents convey much semantic
information in compressed format independent of the text
• Much research on Content-based Image Retrieval (CBIR)
• Little adaptation of CBIR methods to plagiarism detection (PD)
• exact and cropped image copies
• affinely transformed images (scaling, rotation, projection)
• slight alterations of appearance (blurring, lower resolution, noise)
9.
Research Gap
• Current image-based PD approaches are problematic for:
• compound images
• rearranged images
• images mostly containing text (typically tables inserted as figures)
• visually differing, semantically equivalent data visualizations
• Goal: image-based PD process that:
• combines established and new analysis methods to cover
heterogeneous images in academic documents
• adaptively applies suitable analysis steps
• flexibly quantifies suspiciousness
• is extensible in the future
10.
Process
[Figure: process overview. Input doc. → extract image → decompose image → classify image → perceptual hashing, ratio hashing, and OCR with k-gram text matching and positional text matching → comparison against reference DB → distance calculation ($D_{pHash}$, $D_{rHash}$, $D_{kTM}$, $D_{posTM}$) → outlier detection: $s(D_m) > r$ → potential source images]
11.
Perceptual Hashing
• Efficient CBIR method to reliably find near-copies of images
• Uses most apparent visual features in images
• Creates non-unique fingerprints that can be compared
• Fingerprints are invariant to:
• scaling
• aspect ratio changes
• changes to brightness, contrast and colors
• We use the Discrete Cosine Transform (DCT) and the Hamming distance
Image Source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e
12.
k-gram Text Matching
• To identify tables inserted as figures
and images with little visual similarity
• Text extracted using open source
OCR engine Tesseract
• Granularity:
• character 3-grams
• no chunk selection
• Similarity measure: $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$
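The k-gram comparison on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes that K1 and K2 are the sets of character 3-grams of the OCR text of two images, that ⊖ denotes the symmetric difference, and that text is lowercased and whitespace-normalized before comparison. The function names and the handling of an empty intersection are hypothetical.

```python
import math


def char_kgrams(text, k=3):
    """Return the set of character k-grams of a text (no chunk selection)."""
    # Assumption: case folding and whitespace normalization reduce OCR noise.
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def kgram_distance(text_a, text_b, k=3):
    """Distance d = |K1 symmetric-difference K2| / |K1 intersection K2|."""
    grams_a = char_kgrams(text_a, k)
    grams_b = char_kgrams(text_b, k)
    shared = grams_a & grams_b
    if not shared:
        # Assumption: no shared k-gram means maximal distance.
        return math.inf
    return len(grams_a ^ grams_b) / len(shared)
```

Identical texts yield a distance of 0, and the distance grows as fewer k-grams are shared.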
13.
Position-aware Text Matching
• To account for the typically small amount of text in images, a problem
aggravated by OCR errors
• Process:
• Scale images to same height (here: 800px)
• Define proximity region around identified text (here: 50px circle, i.e., radius r = 25px)
• Project proximity regions of input image to potential source
• Only consider matching characters in projected proximity regions
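$s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$
[Figure: characters in an input image and a reference image, both scaled to height $h_1 = h_2 = 800$px with widths $w_1$ and $w_2$; proximity regions with radius $r = 25$px. Legend: positional character match, positional character mismatch.]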
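The position-aware matching described on this slide could look roughly as follows. This is a sketch under stated assumptions, not the authors' code: OCR is assumed to deliver each character with pixel coordinates, the `Glyph` type and function names are hypothetical, and the greedy one-to-one pairing of characters is a simplification chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """A recognized character and its position in pixels (hypothetical type)."""
    char: str
    x: float
    y: float


def scale_glyphs(glyphs, image_height, target_height=800.0):
    """Rescale positions as if the image had been scaled to the target height."""
    factor = target_height / image_height
    return [Glyph(g.char, g.x * factor, g.y * factor) for g in glyphs]


def positional_similarity(input_glyphs, input_height,
                          ref_glyphs, ref_height, radius=25.0):
    """s = matches / max(|K1|, |K2|), counting only characters that reappear
    within the proximity radius after both images are scaled to 800px height."""
    scaled_input = scale_glyphs(input_glyphs, input_height)
    unused = scale_glyphs(ref_glyphs, ref_height)
    if not scaled_input or not unused:
        return 0.0
    total_ref = len(unused)
    matches = 0
    for glyph in scaled_input:
        for candidate in unused:
            close = math.hypot(candidate.x - glyph.x,
                               candidate.y - glyph.y) <= radius
            if candidate.char == glyph.char and close:
                matches += 1
                unused.remove(candidate)  # assumption: one-to-one matching
                break
    return matches / max(len(scaled_input), total_ref)
```

Restricting matches to the proximity region keeps a few shared characters elsewhere in the image from inflating the score.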
14.
Ratio Hashing
• First approach to target reuse of data (and its visualization)
• identifies equivalent, yet visually differing bar charts
[Figure: two visually differing bar charts (value axis from 0 to 900) whose bars both have the relative heights 1.00, 0.80, 0.61, 0.44, 0.30, 0.07]
$d = |1.00 - 1.00| + |0.80 - 0.80| + |0.61 - 0.61| + |0.44 - 0.44| + |0.30 - 0.30| + |0.07 - 0.07| = 0.00$
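The example on this slide suggests a simple computation, sketched below. The sketch assumes that each bar height is divided by the tallest bar, that the ratios are sorted in descending order, and that the distance is the sum of absolute differences of corresponding ratios. Padding the shorter list with zeros for charts with different bar counts is a hypothetical choice, as are the function names.

```python
def ratio_hash(bar_heights):
    """Relative bar heights (tallest bar = 1.0), sorted in descending order."""
    tallest = max(bar_heights)
    if tallest <= 0:
        raise ValueError("at least one bar must have a positive height")
    return sorted((h / tallest for h in bar_heights), reverse=True)


def ratio_distance(heights_a, heights_b):
    """Sum of absolute differences between corresponding height ratios."""
    ratios_a = ratio_hash(heights_a)
    ratios_b = ratio_hash(heights_b)
    # Assumption: missing bars count as ratio 0.0.
    length = max(len(ratios_a), len(ratios_b))
    ratios_a += [0.0] * (length - len(ratios_a))
    ratios_b += [0.0] * (length - len(ratios_b))
    return sum(abs(a - b) for a, b in zip(ratios_a, ratios_b))
```

Because only ratios are compared, rescaling the value axis or reordering the bars leaves the distance unchanged.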
15.
Outlier Detection
• To quantify suspiciousness of method-specific distance scores
• Two assumptions:
• image only suspicious if comparably high similarity (small distance)
to small set (c=9) of other images
• clear separation of the distance scores of the highly similar set of images
[Figure: absolute distance scores $D_m$ = (0, 25, 33, 40, 80, 90) and relative deltas of distance scores $D'_m$ (values shown: 0.3, 0.2, 0.1), e.g., $d'_i = (80 - 40) / 40 = 1$. The list is split into $D'_{m,1}$ and $D'_{m,2}$; labels: outliers considered as potential sources, images considered as unrelated. Condition for list split: $d'_k \geq 1$, $k \leq c$.]
16.
Outlier Detection Continued
• Find outlier group:
• split list of relative distance deltas if a distance is at least twice as
large as its predecessor (3x as large for k-gram matching)
• Score is suspicious (𝑠 ≥ 0.5) if the least similar outlier has a distance
margin to the rest of the collection that is at least twice as large as the
outlier's distance to the input image
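The list split and the suspiciousness condition can be illustrated with a short sketch. It is a literal reading of the bullets on these slides, not the authors' scoring formula, which the slides do not spell out. The function names are hypothetical, and skipping predecessors with distance 0 (where the relative delta is undefined) is an assumption.

```python
def split_outliers(distances, factor=2.0, max_outliers=9):
    """Split sorted distances at the first jump by at least `factor`.

    Returns (outliers, rest). `max_outliers` corresponds to c = 9 and
    `factor` to 2 (3 for k-gram matching).
    """
    ordered = sorted(distances)
    for k in range(1, min(max_outliers, len(ordered) - 1) + 1):
        previous, current = ordered[k - 1], ordered[k]
        # Assumption: a predecessor of 0 gives no usable relative delta.
        if previous > 0 and current >= factor * previous:
            return ordered[:k], ordered[k:]
    return [], ordered


def is_suspicious(outliers, rest):
    """True if the margin between the least similar outlier and the rest of
    the collection is at least twice that outlier's distance to the input."""
    if not outliers or not rest:
        return False
    margin = rest[0] - outliers[-1]
    return margin >= 2 * outliers[-1]
```

For the distances 0, 25, 33, 40, 80, 90 shown in the figure, this sketch splits the list between 40 and 80.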
17.
Evaluation
• Source for test images: VroniPlag collection
• crowd-sourced effort investigating plagiarism allegations
• 196 manually examined academic works (mostly PhD theses)
• most allegations confirmed by responsible universities
• Targeted crawl for all annotated ‘fragments’ containing images
• confirmed by at least two examiners
• Selection of 15 representative cases (mostly from life sciences)
• Cases embedded in 4,500 images obtained from PubMed Central
18.
Example: Near Copies
[Figure: source image and reused image]
Source: http://de.vroniplag.wikia.com/wiki/Dsa/014
22.
Results
• Suspicious scores (𝑠 ≥ 0.5) for 11 of 15 cases computed by at
least one of the methods (𝑅 = 0.73)
• Outlier detection effective (𝑃 = 1):
• For all input images with 𝑠 ≥ 0.5, true source image at the top rank
• For all input images with 𝑠 < 0.5, no source image retrieved among
the top-ten most similar images, i.e. no false positives
• Perceptual hashing with sub-image extraction worked best for
near copies and weakly altered images (found 6 of 9 cases)
• Text analysis performed better than perceptual hashing for
moderately and strongly altered images
• if the quality of the image is high enough to perform OCR reliably and
sufficient text content is present
23.
Results Continued
• Text analysis approaches identified 3 of 4 cases involving tables
• position-aware text matching more robust to low OCR quality
• k-gram matching identified more cases
• combination of approaches allows processing more images
• Dataset contained only one bar chart, for which ratio hashing
yielded extremely suspicious score (𝑠 = 0.92)
[Figure: source image and reused image]
Source: http://de.vroniplag.wikia.com/wiki/Cz/047
24.
Discussion & Conclusion
• Image-based PD is a promising complement to other methods
• Small test collection, but the restrictive outlier detection procedure
will prevent false positives also in larger collections
• if reduced precision is acceptable, threshold can be changed
interactively by user
• Approach well suited for scaling
• Preprocessing in parallel
• Options described to scale analysis methods
• Approach easily extensible with new methods
• New input scores to outlier detection
• Code: www.purl.org/imagepd
25.
Future Work
• More detection methods tailored to specific data visualizations
• Scale the process
• parallelization of preprocessing
• candidate selection for feature descriptors
• Realize hybrid process
[Figure: research overview diagram, repeated from the "Our Research" slide]
26.
Questions?
Norman Meuschke
n@meuschke.org
• Code:
www.purl.org/imagepd
• Contact, publications, other projects:
www.isg.uni.kn
27.
Image Extraction & Decomposition
• Extraction:
• poppler framework
• convert to JPEG
• discard images smaller than 7.5 KB (typically logos)
• Decomposition:
• assume white pixels separate sub-images
• assume rectangular sub-images aligned horizontally or vertically
• tradeoff (images remain analyzable if decomposition fails)
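The two decomposition assumptions (white pixels separate sub-images; sub-images are rectangular and axis-aligned) can be illustrated with a deliberately simplified sketch. It swaps in a plain white-gap projection split on a grayscale pixel grid instead of a contour-based procedure, so it shows the assumptions rather than the authors' method. All names are hypothetical.

```python
def content_spans(is_blank):
    """Return (start, end) spans of consecutive non-blank rows or columns."""
    spans, start = [], None
    for index, blank in enumerate(is_blank):
        if not blank and start is None:
            start = index
        elif blank and start is not None:
            spans.append((start, index))
            start = None
    if start is not None:
        spans.append((start, len(is_blank)))
    return spans


def decompose(image, white=255):
    """Split a grayscale image (list of pixel rows) into sub-image boxes.

    Rows that are entirely white separate horizontal bands; within each
    band, entirely white columns separate the sub-images.
    Returns boxes as (top, left, bottom, right), end-exclusive.
    """
    boxes = []
    width = len(image[0])
    blank_rows = [all(pixel >= white for pixel in row) for row in image]
    for top, bottom in content_spans(blank_rows):
        band = image[top:bottom]
        blank_cols = [all(row[x] >= white for row in band)
                      for x in range(width)]
        for left, right in content_spans(blank_cols):
            boxes.append((top, left, bottom, right))
    return boxes
```

If no white gap is found, the sketch returns a single box covering the content, so the image stays analyzable as a whole.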
28.
Decomposition
• Process:
• conversion to grayscale to reduce runtime
• padding with white pixels to remove a potential border
• binarization using adaptive thresholding to obtain a b/w image
• dilation to ensure black pixels are connected
• flood fill of white areas with black pixels
• subtract original image
• invert image
• blob detection using the algorithm of Suzuki and Abe [1]
• estimate bounding box by looking for large contours aligned along
the image axes
• crop and store the identified sub-images
[1] Satoshi Suzuki and Keiichi Abe. 1985. Topological Structural Analysis of Digitized Binary Images by
Border Following. CVGIP 30, 1 (1985).
29.
Image Classification
• Deep CNN realized using Caffe and AlexNet architecture [2]
• CNN classifies images into:
• photographs (pHash only)
• bar charts (ratio hashing only)
• other image types (pHash and OCR text matching)
• Manual checks of 100 classified images
• Accuracy 0.92 for photographs and 1.00 for bar charts
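The adaptive routing listed on this slide amounts to a lookup from image class to analysis methods, sketched below. The dictionary keys, the function name, and the fallback to the "other" class for unknown labels are assumptions made for illustration only.

```python
# Analysis methods per image class, as listed on the slide.
ANALYSES_BY_CLASS = {
    "photograph": ("pHash",),
    "bar chart": ("ratio hashing",),
    "other": ("pHash", "OCR text matching"),
}


def analyses_for(image_class):
    """Return the analysis methods to run for a classified image."""
    # Assumption: unknown labels are treated like the "other" class.
    return ANALYSES_BY_CLASS.get(image_class, ANALYSES_BY_CLASS["other"])
```

Keeping the routing in a table means a new image class or analysis method only adds an entry.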
30.
Perceptual Hashing
• Process:
• Reduce size to 32x32 pixels
• Convert to grayscale
• Compute 32x32 Discrete Cosine Transform (DCT)
• Reduce DCT to 8x8 for lowest frequencies
• Compute average DCT value
• Binarize the 64 DCT values (8x8) into a 64-bit integer depending on the
mean DCT value (1: above mean, 0: below mean)
• Similarity measure: Hamming distance
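The hashing steps above can be sketched in pure Python. The sketch assumes the image has already been resized to 32x32 pixels and converted to grayscale (a list of 32 rows of 32 intensity values), uses an orthonormal DCT-II, and includes the DC coefficient in the mean, following the slide literally. Function names are hypothetical.

```python
import math


def dct_low_frequencies(pixels, keep=8):
    """Return the keep x keep lowest-frequency 2D DCT-II coefficients."""
    n = len(pixels)
    cos_table = [[math.cos((2 * i + 1) * f * math.pi / (2 * n))
                  for i in range(n)] for f in range(keep)]

    def scale(f):
        # Orthonormal DCT-II scaling factor.
        return math.sqrt((1 if f == 0 else 2) / n)

    coefficients = []
    for u in range(keep):
        for v in range(keep):
            total = 0.0
            for x in range(n):
                row, cos_u = pixels[x], cos_table[u][x]
                for y in range(n):
                    total += row[y] * cos_u * cos_table[v][y]
            coefficients.append(scale(u) * scale(v) * total)
    return coefficients


def perceptual_hash(pixels):
    """64-bit fingerprint: 1 where a DCT value is above the mean, else 0."""
    coefficients = dct_low_frequencies(pixels)
    mean = sum(coefficients) / len(coefficients)
    bits = 0
    for value in coefficients:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two fingerprints."""
    return bin(hash_a ^ hash_b).count("1")
```

Two images count as near-copies when the Hamming distance between their fingerprints is small; a suitable threshold would have to be chosen empirically.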
31.
Extraction of Bar Heights
• Process:
• convert to grayscale
• binarize using global threshold to obtain b/w image (sharp contours)
• pad image with white pixels to ensure bars can be filled
• clean artifacts of black pixels using a threshold on the relative area
covered by the pixels
• remove image border
• flood fill with black pixels and invert
• find candidates for bars by determining the lengths of all vertical lines
of black pixels
• determine bars by clustering vertical lines
• remove noise from whiskers, labels, and legend entries
• assume the average height of the lines in a cluster as the bar height
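The last steps of this process (vertical line lengths, clustering, averaging) can be sketched on an already binarized and filled image. This is an illustrative simplification: the image is a grid of booleans (True = black), neighboring columns are clustered when their line lengths differ by at most a tolerance, and clusters narrower than a minimum width are dropped as noise such as whiskers. The names and both thresholds are hypothetical.

```python
def column_run_lengths(binary):
    """Length of the longest vertical run of black pixels in each column."""
    height, width = len(binary), len(binary[0])
    lengths = []
    for x in range(width):
        best = run = 0
        for y in range(height):
            run = run + 1 if binary[y][x] else 0
            best = max(best, run)
        lengths.append(best)
    return lengths


def bar_heights(binary, tolerance=1, min_width=2):
    """Cluster adjacent columns of similar line length and average each cluster."""
    clusters, current = [], []
    for length in column_run_lengths(binary):
        if length > 0 and (not current or abs(length - current[-1]) <= tolerance):
            current.append(length)
            continue
        # The current cluster ends here; keep it only if it is wide enough.
        if len(current) >= min_width:
            clusters.append(current)
        current = [length] if length > 0 else []
    if len(current) >= min_width:
        clusters.append(current)
    return [sum(cluster) / len(cluster) for cluster in clusters]
```

The resulting heights can then be turned into ratios relative to the tallest bar for comparison with other charts.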