N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, B. Gipp
An Adaptive Image-based
Plagiarism Detection Approach
Norman Meuschke
Information Science Group
University of Konstanz
www.isg.uni.kn
An Adaptive Image-based Plagiarism Detection Approach - Meuschke et al.
University of Konstanz
Map data ©2018 GeoBasis-DE/BKG(©2009), Google
Outline
• Overview of Research on Academic Plagiarism Detection
• Image-based Plagiarism Detection Approach
• Evaluation Results
Academic Plagiarism
“The use of ideas, concepts, words, or structures
without appropriately acknowledging the source to
benefit in a setting where originality is expected.”
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that
transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
Plagiarism Forms
Note: plagiarism forms are not mutually exclusive
Paraphrasing
▪ intentional rewriting
▪ no or insufficient reference to the source
Structural and idea plagiarism
▪ little or no verbatim text overlap
Cross-language plagiarism
▪ manual/automated conversion of text into
other language to hide its origin
Copy & paste
▪ taking content verbatim from other source
Shake & paste
▪ copy & paste of text segments with slight
adjustments, e.g., synonym substitutions
Technical disguise
▪ techniques that exploit weaknesses of
current detection methods
[Axis: level of obfuscation, from weak to strong]
Detection Capabilities
Weak obfuscation
Copy & paste, Shake & paste: solved
▪ n-gram fingerprinting
▪ vector space models
▪ text alignment
▪ exhaustive string matching
Technical disguise: solvable, no research needed
▪ encoding checks
▪ checks for textual content
▪ checks for large images
Strong obfuscation
- intense research
- methods limited by text-based candidate retrieval (R approx. 0.8 for moderate disguise)
Paraphrasing
Structural and idea plagiarism
▪ synonym expansion (WordNet)
▪ Semantic Role Labeling
▪ Latent Semantic Analysis
▪ POS-aware text matching
Cross-language plagiarism
▪ CL Character N-Gram Comp.
▪ CL Explicit Semantic Analysis
▪ CL Alignment-based Similarity Analysis
Our Research
• Combine analysis of textual and non-textual content features
[Figure: plagiarism detection process from start to end with the stages candidate retrieval, detailed comparison, post processing, and human inspection. Labels in the figure: heuristics, full text similarity, text snippets, citation patterns, fuzzy citation patterns, semantic similarity, cross-lingual similarity, mathematical formulae, image similarity, visualization. Legend: completed research, current research, future research.]
Idea of Image-based Plagiarism Detection
• Images in academic documents convey much semantic
information in compressed format independent of the text
• Much research on Content-based Image Retrieval (CBIR)
• Little adaptation of CBIR methods to plagiarism detection (PD)
• exact and cropped image copies
• affinely transformed images (scaling, rotation, projection)
• slight alterations of appearance (blurring, lower resolution, noise)
Research Gap
• Current image-based PD approaches problematic for:
• compound images
• rearranged images
• images mostly containing text (typically tables inserted as figures)
• visually differing, semantically equivalent data visualizations
• Goal: image-based PD process that:
• combines established and new analysis methods to cover
heterogeneous images in academic documents
• adaptively applies suitable analysis steps
• flexibly quantifies suspiciousness
• is extensible in the future
Process
[Figure: process pipeline. Input doc. → extract image → decompose image → classify image → perceptual hashing, ratio hashing, and OCR (followed by k-gram text matching and positional text matching) → distance calculation against the reference DB (D_pHash, D_rHash, D_kTM, D_posTM) → outlier detection: s(D_m) > r → potential source images.]
Perceptual Hashing
• Efficient CBIR method to reliably find near image copies
• Uses most apparent visual features in images
• Creates non-unique fingerprints that can be compared
• Fingerprints are invariant to:
• scaling
• aspect ratio changes
• changes to brightness, contrast and colors
• We use the Discrete Cosine Transform and Hamming distance
Image Source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e
k-gram Text Matching
• To identify tables inserted as figures
and images with little visual similarity
• Text extracted using open source
OCR engine Tesseract
• Granularity:
• character 3-grams
• no chunk selection
• Similarity measure: $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$
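The k-gram comparison above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes K1 and K2 are the sets of character 3-grams of the two OCR outputs and that ⊖ denotes the symmetric difference. The helper names, the whitespace and case normalization, and the handling of an empty intersection are assumptions made for this example.

```python
def char_kgrams(text: str, k: int = 3) -> set[str]:
    """Return the set of character k-grams of a text (no chunk selection)."""
    # Assumption: whitespace is collapsed and case is ignored before chunking.
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def kgram_distance(text_a: str, text_b: str, k: int = 3) -> float:
    """d = |K1 symmetric-difference K2| / |K1 intersection K2|."""
    k1, k2 = char_kgrams(text_a, k), char_kgrams(text_b, k)
    shared = k1 & k2
    if not shared:
        # Assumption: texts without any common k-gram are infinitely distant.
        return float("inf")
    return len(k1 ^ k2) / len(shared)
```

Identical texts yield d = 0, and d grows as fewer k-grams are shared, so smaller values indicate more similar images.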
Position-aware Text Matching
• To account for the typically small amount of text in images, aggravated
by OCR errors
• Process:
• Scale images to same height (here: 800px)
• Define proximity region around identified text (here: 50px circle, i.e. radius r = 25px)
• Project proximity regions of input image to potential source
• Only consider matching characters in projected proximity regions
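$s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$
[Figure: input image (width w_1, height h_1 = 800px) and reference image (width w_2, height h_2 = 800px) with characters and their proximity regions of radius r = 25px. Legend: positional character match, positional character mismatch.]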
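A minimal sketch of the position-aware matching, under stated assumptions: OCR output is represented as a list of characters with pixel coordinates (the `Glyph` type is hypothetical), both images are scaled to a height of 800px, and each reference character may be matched at most once. The one-to-one matching rule is an assumption of this example.

```python
import math
from typing import NamedTuple


class Glyph(NamedTuple):
    """One OCR character with its position in pixels (hypothetical type)."""
    char: str
    x: float
    y: float


def scale_to_height(glyphs, image_height, target_height=800.0):
    """Rescale character positions as if the image had the target height."""
    factor = target_height / image_height
    return [Glyph(g.char, g.x * factor, g.y * factor) for g in glyphs]


def positional_similarity(input_glyphs, reference_glyphs, radius=25.0):
    """s = |matches| / max(|K1|, |K2|), counting only matches within radius."""
    if not input_glyphs or not reference_glyphs:
        return 0.0
    unused = list(reference_glyphs)
    matches = 0
    for g in input_glyphs:
        for i, ref in enumerate(unused):
            close = math.hypot(ref.x - g.x, ref.y - g.y) <= radius
            if ref.char == g.char and close:
                matches += 1
                del unused[i]  # each reference character matches at most once
                break
    return matches / max(len(input_glyphs), len(reference_glyphs))
```

Both glyph lists should be passed through `scale_to_height` first so that the radius refers to the same scale in both images.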
Ratio Hashing
• First approach to target reuse of data (and its visualization)
• identifies equivalent, yet visually differing bar charts
[Figure: two visually differing bar charts (y-axis from 0 to 900) with identical relative bar heights 1.00, 0.80, 0.61, 0.44, 0.30, 0.07.]
$d = |1.00 - 1.00| + |0.80 - 0.80| + |0.61 - 0.61| + |0.44 - 0.44| + |0.30 - 0.30| + |0.07 - 0.07| = 0.00$
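The example above can be reproduced with a short sketch. It assumes, based on the values shown in the example, that bar heights are normalized by the tallest bar and sorted in descending order, and that the distance is the sum of absolute differences. Padding charts with different numbers of bars with zeros is an assumption of this example only.

```python
def ratio_hash(bar_heights):
    """Bar heights relative to the tallest bar, sorted in descending order."""
    if not bar_heights or max(bar_heights) <= 0:
        raise ValueError("need at least one bar with positive height")
    tallest = max(bar_heights)
    return sorted((h / tallest for h in bar_heights), reverse=True)


def ratio_hash_distance(hash_a, hash_b):
    """Sum of absolute differences between two ratio hashes."""
    # Assumption: charts with fewer bars are padded with zero ratios.
    length = max(len(hash_a), len(hash_b))
    a = list(hash_a) + [0.0] * (length - len(hash_a))
    b = list(hash_b) + [0.0] * (length - len(hash_b))
    return sum(abs(x - y) for x, y in zip(a, b))
```

Because only ratios are compared, rescaling the axis or reordering the bars leaves the distance at zero.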
Outlier Detection
• To quantify suspiciousness of method-specific distance scores
• Two assumptions:
• image only suspicious if comparably high similarity (small distance)
to small set (c=9) of other images
• clear separation of distance scores of highly similar set of images
[Figure: absolute distance scores D_m (0, 25, 33, 40, 80, 90) and relative deltas of distance scores D'_m (0.3, 0.2, 0.1, and at the gap d'_i = (80 − 40) / 40 = 1). The list is split into D'_{m,1} (outliers considered as potential sources) and D'_{m,2} (images considered as unrelated). Condition for list split: d'_k ≥ 1, k ≤ c.]
Outlier Detection Continued
• Find outlier group:
• split list of relative distance deltas if a distance is at least twice as
large as its predecessor (3x as large for k-gram matching)
• Score suspicious (𝑠 ≥ 0.5 ) if least similar outlier has distance
margin to collection that is twice as large as outlier’s distance to the
input image
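The list split can be sketched as below. This is an illustration under assumptions, not the authors' code: distances are sorted ascending, the relative delta is (current − previous) / previous, and the first gap with a delta of at least 1 (distance at least doubles) among the first c positions defines the outlier group. The function and parameter names are hypothetical, and the computation of the suspiciousness score s is not reproduced here.

```python
def split_outliers(distances, max_outliers=9, min_relative_delta=1.0):
    """Split sorted distances into (outliers, unrelated) at the first large gap.

    min_relative_delta=1.0 means "at least twice as large as its predecessor";
    use 2.0 for the 3x rule stated for k-gram matching.
    """
    ordered = sorted(distances)
    for k in range(1, min(max_outliers, len(ordered) - 1) + 1):
        previous, current = ordered[k - 1], ordered[k]
        if previous > 0:
            delta = (current - previous) / previous
        else:
            # Assumption: any increase from a zero distance counts as a gap.
            delta = float("inf") if current > 0 else 0.0
        if delta >= min_relative_delta:
            return ordered[:k], ordered[k:]
    return [], ordered  # no clear separation: nothing is flagged
```

With the distance scores 25, 33, 40, 80, 90 from the figure, the split falls between 40 and 80, so the three closest images form the outlier group.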
Evaluation
• Source for test images: VroniPlag collection
• crowd-sourced effort investigating plagiarism allegations
• 196 manually examined academic works (mostly PhD theses)
• most allegations confirmed by responsible universities
• Targeted crawl for all annotated ‘fragments’ containing images
• confirmed by at least two examiners
• Selection of 15 representative cases (mostly from life sciences)
• Cases embedded in 4,500 images obtained from PubMed Central
Example: Near Copies
[Figure: source image and reused image]
Source: http://de.vroniplag.wikia.com/wiki/Dsa/014
Example: Weak Alteration
[Figure: source image and reused image]
Source: http://de.vroniplag.wikia.com/wiki/Ry/073
Example: Moderate Alteration
[Figure: source image and reused image]
Source: http://de.vroniplag.wikia.com/wiki/Ab/017
Example: Strong Alteration
[Figure: source image and reused image]
Source: http://de.vroniplag.wikia.com/wiki/Ad/068
Results
• Suspicious scores (𝑠 ≥ 0.5) for 11 of 15 cases computed by at
least one of the methods (𝑅 = 0.73)
• Outlier detection effective (𝑃 = 1):
• For all input images with 𝑠 ≥ 0.5, true source image at the top rank
• For all input images with 𝑠 < 0.5, no source image retrieved among
the top-ten most similar images, i.e. no false positives
• Perceptual hashing with sub-image extraction worked best for
near copies and weakly altered images (found 6 of 9 cases)
• Text analysis performed better than perceptual hashing for
moderately and strongly altered images
• if the quality of the image was high enough to perform OCR reliably and
sufficient text content was present
Results Continued
• Text analysis approaches identified 3 of 4 cases involving tables
• position-aware text matching more robust to low OCR quality
• k-gram matching identified more cases
• combination of approaches allows processing more images
• Dataset contained only one bar chart, for which ratio hashing
yielded extremely suspicious score (𝑠 = 0.92)
[Figure: source image and reused image]
Source: http://de.vroniplag.wikia.com/wiki/Cz/047
Discussion & Conclusion
• Image-based PD promising complement to other methods
• Small test collection, but restrictive outlier detection procedure
will prevent false positives also in larger collections
• if reduced precision is acceptable, threshold can be changed
interactively by user
• Approach well suited for scaling
• Preprocessing in parallel
• Options described to scale analysis methods
• Approach easily extensible with new methods
• New input scores to outlier detection
• Code: www.purl.org/imagepd
Future Work
• More detection methods tailored to specific data visualizations
• Scale the process
• parallelization of preprocessing
• candidate selection for feature descriptors
• Realize hybrid process
Questions?
Norman Meuschke
n@meuschke.org
• Code:
www.purl.org/imagepd
• Contact, publications, other projects:
www.isg.uni.kn
Image Extraction & Decomposition
• Extraction:
• poppler framework
• convert to JPEG
• discard images smaller than 7.5 KB (typically logos)
• Decomposition:
• assume white pixels separate sub-images
• assume rectangular sub-images aligned horizontally or vertically
• tradeoff (images remain analyzable if decomposition fails)
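The two decomposition assumptions (white pixels separate sub-images, and sub-images are rectangles aligned with the image axes) can be illustrated with a simplified projection-based split. This is a swapped-in, simpler technique for illustration and not the contour-based procedure used in the approach. The image is assumed to be a list of rows of grayscale values, and the `white` threshold is a hypothetical parameter.

```python
def _content_runs(is_blank):
    """Return (start, end) index pairs of consecutive non-blank entries."""
    runs, start = [], None
    for i, blank in enumerate(is_blank):
        if not blank and start is None:
            start = i
        elif blank and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(is_blank)))
    return runs


def split_sub_images(pixels, white=250):
    """Return bounding boxes (left, top, right, bottom) of sub-images.

    Columns that are entirely white separate sub-images horizontally;
    within each column band, entirely white rows separate them vertically.
    """
    if not pixels or not pixels[0]:
        return []
    height, width = len(pixels), len(pixels[0])
    blank_cols = [all(pixels[y][x] >= white for y in range(height))
                  for x in range(width)]
    boxes = []
    for left, right in _content_runs(blank_cols):
        blank_rows = [all(v >= white for v in pixels[y][left:right])
                      for y in range(height)]
        for top, bottom in _content_runs(blank_rows):
            boxes.append((left, top, right, bottom))
    return boxes
```

If no white gap exists, the function returns a single box covering the content, which mirrors the tradeoff noted above: the undecomposed image remains analyzable.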
Decomposition
• Process:
• conversion to grayscale to reduce runtime
• padding with white pixels to remove a potential border
• binarization using adaptive thresholding to obtain a b/w image
• dilation to ensure black pixels are connected
• floodfill of white areas with black pixels
• subtract original image
• invert image
• blob detection using the algorithm of Suzuki and Abe [1]
• estimate bounding box by looking for large contours aligned along
the image axes
• crop and store the identified sub-images
[1] Satoshi Suzuki and Keiichi Abe. 1985. Topological Structural Analysis of Digitized Binary Images by
Border Following. CVGIP 30, 1 (1985).
Image Classification
• Deep CNN realized using Caffe and AlexNet architecture [2]
• CNN classifies images into:
• photographs (pHash only)
• bar charts (ratio hashing only)
• other image types (pHash and OCR text matching)
• Manual checks of 100 classified images
• Accuracy 0.92 for photographs and 1.00 for bar charts
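The adaptive routing from image class to analysis methods can be written as a small lookup. The class labels and method names below are hypothetical identifiers chosen for this sketch. "OCR text matching" is taken to mean both k-gram and positional text matching, and the fallback for unknown labels is an assumption.

```python
# Mapping from predicted image class to the analysis methods applied to it.
METHODS_BY_CLASS = {
    "photograph": ["pHash"],
    "bar_chart": ["ratio_hashing"],
    "other": ["pHash", "kgram_text_matching", "positional_text_matching"],
}


def select_methods(image_class):
    """Return the analysis methods to run for a classified image."""
    # Assumption: unknown labels are treated like the "other" class.
    return METHODS_BY_CLASS.get(image_class, METHODS_BY_CLASS["other"])
```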
Perceptual Hashing
• Process:
• Reduce size to 32x32 pixels
• Convert to grayscale
• Compute 32x32 Discrete Cosine Transform (DCT)
• Reduce DCT to 8x8 for lowest frequencies
• Compute average DCT value
• Binarize the 64 DCT coefficients (8x8) into a 64-bit integer depending on the mean DCT
value (1: above mean, 0: below mean)
• Similarity measure: Hamming distance
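The steps above can be sketched in plain Python. The sketch assumes the image has already been resized to 32x32 and converted to grayscale (a list of 32 rows with 32 values each). It uses a naive, unnormalized DCT-II and includes all 64 low-frequency coefficients in the mean, as the steps state. Function names are chosen for this example.

```python
import math


def dct_2d(block):
    """Naive, unnormalized 2D DCT-II of a square block (list of rows)."""
    n = len(block)
    cos = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
           for u in range(n)]
    # The 2D DCT is separable: transform every row, then every column.
    rows = [[sum(row[x] * cos[u][x] for x in range(n)) for u in range(n)]
            for row in block]
    return [[sum(rows[y][u] * cos[v][y] for y in range(n)) for u in range(n)]
            for v in range(n)]


def perceptual_hash(gray_32x32):
    """Return a 64-bit integer fingerprint of a 32x32 grayscale image."""
    coefficients = dct_2d(gray_32x32)
    # Keep the 8x8 block of lowest frequencies (top-left corner).
    low = [coefficients[v][u] for v in range(8) for u in range(8)]
    mean = sum(low) / len(low)
    bits = 0
    for value in low:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two fingerprints."""
    return bin(hash_a ^ hash_b).count("1")
```

Two images are then compared with `hamming_distance(perceptual_hash(a), perceptual_hash(b))`, where a small value indicates a likely near copy.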
Extraction of Bar Heights
• Process:
• convert to grayscale
• binarize using global threshold to obtain b/w image (sharp contours)
• pad image with white pixels to ensure bars can be filled
• clean artifacts of black pixels using a threshold on the relative area
covered by the pixels
• remove image border
• floodfill with black pixels and invert
• find candidates for bars by determining the lengths of all vertical lines
of black pixels
• determine bars by clustering vertical lines
• remove noise from whiskers, labels, and legend entries
• assume the average height of the lines in a cluster as the bar height
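The last steps (vertical line lengths, clustering, average height per cluster) can be sketched as follows. The sketch assumes a preprocessed binary image given as a list of rows with 1 for black and 0 for white. The clustering rule (adjacent columns whose run lengths differ by at most `tolerance`) and the `min_width` filter for thin lines such as whiskers are assumptions of this example.

```python
def column_run_lengths(binary):
    """Length of the longest vertical run of black pixels in every column."""
    height, width = len(binary), len(binary[0])
    lengths = []
    for x in range(width):
        best = current = 0
        for y in range(height):
            current = current + 1 if binary[y][x] else 0
            best = max(best, current)
        lengths.append(best)
    return lengths


def bar_heights(binary, min_width=2, tolerance=1):
    """Cluster adjacent columns of similar run length and average each cluster."""
    clusters, current = [], []
    for length in column_run_lengths(binary):
        if length > 0 and (not current or abs(length - current[-1]) <= tolerance):
            current.append(length)
        else:
            if len(current) >= min_width:
                clusters.append(current)
            current = [length] if length > 0 else []
    if len(current) >= min_width:
        clusters.append(current)
    return [sum(c) / len(c) for c in clusters]
```

The resulting heights can then be turned into relative bar heights for ratio hashing.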
