An Adaptive Image-based Plagiarism Detection Approach

JCDL 2018 slides for the full paper "An Adaptive Image-based Plagiarism Detection Approach". Research presented by Norman Meuschke.

Find the associated paper here: https://www.gipp.com/wp-content/papercite-data/pdf/meuschke2018.pdf

  1. N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, B. Gipp
     An Adaptive Image-based Plagiarism Detection Approach
     Norman Meuschke, Information Science Group, University of Konstanz, www.isg.uni.kn
  2. University of Konstanz
     [Figure: two maps. Map data ©2018 GeoBasis-DE/BKG (©2009), Google]
  3. Outline
     • Overview of Research on Academic Plagiarism Detection
     • Image-based Plagiarism Detection Approach
     • Evaluation Results
  4. Academic Plagiarism
     “The use of ideas, concepts, words, or structures without appropriately acknowledging the source to benefit in a setting where originality is expected.”
     Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
  5. Plagiarism Forms
     Note: plagiarism forms are not mutually exclusive
     Paraphrasing
     ▪ intentional rewriting
     ▪ no / insufficient reference to the source
     Structural and idea plagiarism
     ▪ little or no verbatim text overlap
     Cross-language plagiarism
     ▪ manual/automated conversion of text into another language to hide its origin
     Copy & paste
     ▪ taking content verbatim from another source
     Shake & paste
     ▪ copy & paste of text segments with slight adjustments, e.g., synonym substitutions
     Technical disguise
     ▪ techniques that exploit weaknesses of current detection methods
     [Axis: level of obfuscation, weak to strong]
  6. Detection Capabilities
     Status labels:
     - intense research
     - methods limited by text-based candidate retrieval (R approx. 0.8 for moderate disguise)
     - solvable, no research needed
     - solved
     Copy & paste, Shake & paste
     ▪ n-gram fingerprinting
     ▪ vector space models
     ▪ text alignment
     ▪ exhaustive string matching
     Technical disguise
     ▪ encoding checks
     ▪ checks for textual content
     ▪ checks for large images
     Paraphrasing, Structural and idea plagiarism
     ▪ synonym expansion (WordNet)
     ▪ Semantic Role Labeling
     ▪ Latent Semantic Analysis
     ▪ POS-aware text matching
     Cross-language plagiarism
     ▪ CL Character N-Gram Comp.
     ▪ CL Explicit Semantic Analysis
     ▪ CL Alignment-based Similarity Analysis
     [Axis: level of obfuscation, weak to strong]
  7. Our Research
     • Combine analysis of textual and non-textual content features
     [Figure: detection process diagram from start to end. Labels: candidate retrieval, heuristics, detailed comparison, post processing, human inspection, visualization, full text similarity, text snippets, citation patterns, fuzzy citation patterns, semantic similarity, cross-lingual similarity, mathematical formulae, image similarity. Legend: future research, current research, completed research.]
  8. Idea of Image-based Plagiarism Detection
     • Images in academic documents convey much semantic information in compressed format independent of the text
     • Much research on Content-based Image Retrieval (CBIR)
     • Little adaptation of CBIR methods to plagiarism detection (PD)
       • exact and cropped image copies
       • affinely transformed images (scaling, rotation, projection)
       • slight alterations of appearance (blurring, lower resolution, noise)
  9. Research Gap
     • Current image-based PD approaches are problematic for:
       • compound images
       • rearranged images
       • images mostly containing text (typically tables inserted as figures)
       • visually differing, semantically equivalent data visualizations
     • Goal: image-based PD process that:
       • combines established and new analysis methods to cover heterogeneous images in academic documents
       • adaptively applies suitable analysis steps
       • flexibly quantifies suspiciousness
       • is extensible in the future
  10. Process
      [Figure: process diagram. Labels: input doc., extract image, decompose image, classify image, perceptual hashing, OCR, ratio hashing, positional text matching, k-gram text matching, reference DB, distance calculation ($D_{pHash}$, $D_{rHash}$, $D_{kTM}$, $D_{posTM}$), outlier detection: $s(D_m) > r$, potential source images.]
  11. Perceptual Hashing
      • Efficient CBIR method to reliably find near image copies
      • Uses most apparent visual features in images
      • Creates non-unique fingerprints that can be compared
      • Fingerprints are invariant to:
        • scaling
        • aspect ratio changes
        • changes to brightness, contrast and colors
      • We use Discrete Cosine Transformation and Hamming Distance
      Image Source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e
  12. k-gram Text Matching
      • To identify tables inserted as figures and images with little visual similarity
      • Text extracted using the open source OCR engine Tesseract
      • Granularity:
        • character 3-grams
        • no chunk selection
      • Similarity measure: $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$
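
The k-gram measure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it reads ⊖ as the symmetric difference of the two 3-gram sets (an assumption), and the function names `char_kgrams` and `kgram_distance` are hypothetical. Note that $d$ behaves like a distance: it is 0 for identical 3-gram sets and grows as the sets diverge.

```python
import math
from typing import Set


def char_kgrams(text: str, k: int = 3) -> Set[str]:
    """Return the set of all character k-grams of the text (no chunk selection)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def kgram_distance(text_a: str, text_b: str, k: int = 3) -> float:
    """Size of the symmetric difference divided by size of the intersection.

    Assumption: when the two sets share no k-gram the ratio is undefined,
    so this sketch returns infinity.
    """
    k1, k2 = char_kgrams(text_a, k), char_kgrams(text_b, k)
    common = k1 & k2
    if not common:
        return math.inf
    return len(k1 ^ k2) / len(common)
```

In practice the two input strings would be the OCR output for the input image and for a reference image.
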
  13. Position-aware Text Matching
      • To account for the typically small amount of text in images, aggravated by OCR errors
      • Process:
        • Scale images to the same height (here: 800px)
        • Define a proximity region around identified text (here: 50px circle)
        • Project proximity regions of the input image to the potential source
        • Only consider matching characters in projected proximity regions
      $s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$
      [Figure: input image (width $w_1$, height $h_1 = 800$px) and reference image (width $w_2$, height $h_2 = 800$px) with proximity radius $r = 25$px. Legend: positional character match, positional character mismatch.]
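
A rough Python sketch of the position-aware score follows. It is only one possible reading of the slide: characters are assumed to arrive as hypothetical `(char, x, y)` tuples from an OCR step, the 50px circle is taken as a radius of 25px, and each reference character may be matched at most once (a greedy choice that the slide does not specify).

```python
from math import hypot
from typing import List, Tuple

Char = Tuple[str, float, float]  # (character, x, y); a hypothetical OCR output format


def scale_chars(chars: List[Char], height: float, target: float = 800.0) -> List[Char]:
    """Rescale character positions as if the image had the target height."""
    factor = target / height
    return [(c, x * factor, y * factor) for c, x, y in chars]


def positional_similarity(input_chars: List[Char], ref_chars: List[Char],
                          radius: float = 25.0) -> float:
    """Matched characters divided by the larger character count."""
    unused = list(ref_chars)
    matches = 0
    for ch, x, y in input_chars:
        for i, (rch, rx, ry) in enumerate(unused):
            # A match needs the same character inside the projected proximity region.
            if ch == rch and hypot(x - rx, y - ry) <= radius:
                matches += 1
                del unused[i]  # greedy one-to-one matching (assumption)
                break
    larger = max(len(input_chars), len(ref_chars))
    return matches / larger if larger else 0.0
```
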
  14. Ratio Hashing
      • First approach to target reuse of data (and its visualization)
        • identifies equivalent, yet visually differing bar charts
      [Figure: two bar charts, each with a y-axis from 0 to 900 and relative bar heights 1.00, 0.80, 0.61, 0.44, 0.30, 0.07]
      $d = |1.00 - 1.00| + |0.80 - 0.80| + |0.61 - 0.61| + |0.44 - 0.44| + |0.30 - 0.30| + |0.07 - 0.07| = 0.00$
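
The ratio hashing example above can be reproduced with a short Python sketch. Several details are assumptions rather than statements from the slides: bar heights are normalized by the tallest bar, sorted in descending order, and charts with different numbers of bars are padded with zeros. The names `ratio_hash` and `ratio_distance` are hypothetical.

```python
from typing import List, Sequence


def ratio_hash(bar_heights: Sequence[float]) -> List[float]:
    """Bar heights relative to the tallest bar, in descending order."""
    tallest = max(bar_heights)
    if tallest <= 0:
        raise ValueError("at least one bar must have a positive height")
    return sorted((h / tallest for h in bar_heights), reverse=True)


def ratio_distance(bars_a: Sequence[float], bars_b: Sequence[float]) -> float:
    """Sum of absolute differences between the two ratio hashes."""
    ra, rb = ratio_hash(bars_a), ratio_hash(bars_b)
    length = max(len(ra), len(rb))
    # Pad the shorter hash with zeros so both have the same length (assumption).
    ra += [0.0] * (length - len(ra))
    rb += [0.0] * (length - len(rb))
    return sum(abs(x - y) for x, y in zip(ra, rb))
```

Because only ratios are compared, two charts that plot the same data at different absolute scales yield a distance of zero.
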
  15. Outlier Detection
      • To quantify suspiciousness of method-specific distance scores
      • Two assumptions:
        • image only suspicious if comparably high similarity (small distance) to small set (c=9) of other images
        • clear separation of distance scores of highly similar set of images
      [Figure: absolute distance scores $D_m$ on an axis marked 0, 25, 33, 40, 80, 90, with relative deltas of distance scores $D'_m$ of 0.3, 0.2, 1 and 0.1, e.g. $d'_i = (80 - 40) / 40 = 1$. The list is split into $D'_{m,1}$ (outliers considered as potential sources) and $D'_{m,2}$ (images considered as unrelated). Legend: absolute distance scores, relative deltas of distance scores, condition for list split (in terms of $d'_k$, $k$ and $c$).]
  16. Outlier Detection Continued
      • Find outlier group:
        • split list of relative distance deltas if a distance is at least twice as large as its predecessor (3x as large for k-gram matching)
      • Score is suspicious ($s \geq 0.5$) if the least similar outlier has a distance margin to the collection that is twice as large as the outlier's distance to the input image
      [Figure: same distance score example as on the previous slide]
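
The list split described on the two outlier detection slides can be sketched as follows. This covers only the grouping step, not the suspiciousness score $s$, whose exact formula is not given on the slides. The function name `split_outliers` and the parameter `min_delta` are hypothetical; `min_delta=1.0` corresponds to "at least twice as large as its predecessor" and `min_delta=2.0` to the 3x rule for k-gram matching. Limiting the outlier group to at most `c` images is this sketch's reading of the "small set (c=9)" assumption.

```python
from typing import List, Tuple


def split_outliers(distances: List[float], c: int = 9,
                   min_delta: float = 1.0) -> Tuple[List[float], List[float]]:
    """Split ranked distances into (potential sources, unrelated images).

    The split happens at the first position k <= c where the relative delta
    (d_k - d_{k-1}) / d_{k-1} reaches min_delta. Without such a gap, no image
    is treated as an outlier.
    """
    ranked = sorted(distances)
    for k in range(1, min(c, len(ranked) - 1) + 1):
        prev, curr = ranked[k - 1], ranked[k]
        # Equivalent to the relative delta test, but safe for prev == 0.
        if curr > prev and curr >= (1.0 + min_delta) * prev:
            return ranked[:k], ranked[k:]
    return [], ranked
```

With the distances 25, 33, 40, 80, 90 from the figure, the relative deltas are roughly 0.3, 0.2, 1 and 0.1, so the list splits between 40 and 80.
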
  17. Evaluation
      • Source for test images: VroniPlag collection
        • crowd-sourced effort investigating plagiarism allegations
        • 196 manually examined academic works (mostly PhD theses)
        • most allegations confirmed by responsible universities
      • Targeted crawl for all annotated ‘fragments’ containing images
        • confirmed by at least two examiners
      • Selection of 15 representative cases (mostly from life sciences)
      • Cases embedded in 4,500 images obtained from PubMed Central
  18. Example: Near Copies
      [Figure: source image and reused image]
      Source: http://de.vroniplag.wikia.com/wiki/Dsa/014
  19. Example: Weak Alteration
      [Figure: source image and reused image]
      Source: http://de.vroniplag.wikia.com/wiki/Ry/073
  20. Example: Moderate Alteration
      [Figure: source image and reused image]
      Source: http://de.vroniplag.wikia.com/wiki/Ab/017
  21. Example: Strong Alteration
      [Figure: source image and reused image]
      Source: http://de.vroniplag.wikia.com/wiki/Ad/068
  22. Results
      • Suspicious scores (𝑠 ≥ 0.5) for 11 of 15 cases computed by at least one of the methods (𝑅 = 0.73)
      • Outlier detection effective (𝑃 = 1):
        • For all input images with 𝑠 ≥ 0.5, true source image at the top rank
        • For all input images with 𝑠 < 0.5, no source image retrieved among the top-ten most similar images, i.e. no false positives
      • Perceptual hashing with sub-image extraction worked best for near copies and weakly altered images (found 6 of 9 cases)
      • Text analysis performed better than perceptual hashing for moderately and strongly altered images
        • if the quality of the image was high enough to perform OCR reliably and sufficient text content was present
  23. Results Continued
      • Text analysis approaches identified 3 of 4 cases involving tables
        • position-aware text matching more robust to low OCR quality
        • k-gram matching identified more cases
        • combination of approaches allows processing more images
      • Dataset contained only one bar chart, for which ratio hashing yielded an extremely suspicious score (𝑠 = 0.92)
      [Figure: source image and reused image]
      Source: http://de.vroniplag.wikia.com/wiki/Cz/047
  24. Discussion & Conclusion
      • Image-based PD is a promising complement to other methods
      • Small test collection, but restrictive outlier detection procedure will prevent false positives also in larger collections
        • if reduced precision is acceptable, threshold can be changed interactively by user
      • Approach well suited for scaling
        • Preprocessing in parallel
        • Options described to scale analysis methods
      • Approach easily extensible with new methods
        • New input scores to outlier detection
      • Code: www.purl.org/imagepd
  25. Future Work
      • More detection methods tailored to specific data visualizations
      • Scale the process
        • parallelization of preprocessing
        • candidate selection for feature descriptors
      • Realize hybrid process
      [Figure: detection process diagram from start to end. Labels: candidate retrieval, heuristics, detailed comparison, post processing, human inspection, visualization, full text similarity, text snippets, citation patterns, fuzzy citation patterns, semantic similarity, cross-lingual similarity, mathematical formulae, image similarity. Legend: future research, current research, completed research.]
  26. Questions?
      Norman Meuschke, n@meuschke.org
      • Code: www.purl.org/imagepd
      • Contact, publications, other projects: www.isg.uni.kn
  27. Image Extraction & Decomposition
      • Extraction:
        • poppler framework
        • convert to JPEG
        • discard images smaller than 7.5 KB (typically logos)
      • Decomposition:
        • assume white pixels separate sub-images
        • assume rectangular sub-images aligned horizontally or vertically
        • tradeoff (images remain analyzable if decomposition fails)
  28. Decomposition
      • Process:
        • conversion to grayscale to reduce runtime
        • padding with white pixels to remove a potential border
        • binarization using adaptive thresholding to obtain a b/w image
        • dilation to ensure black pixels are connected
        • floodfill of white areas with black pixels
        • subtract original image
        • invert image
        • blob detection using the algorithm of Suzuki and Abe [1]
        • estimate bounding box by looking for large contours aligned along the image axes
        • crop and store the identified sub-images
      [1] Satoshi Suzuki and Keiichi Abe. 1985. Topological Structural Analysis of Digitized Binary Images by Border Following. CVGIP 30, 1 (1985).
  29. Image Classification
      • Deep CNN realized using Caffe and AlexNet architecture [2]
      • CNN classifies images into:
        • photographs (pHash only)
        • bar charts (ratio hashing only)
        • other image types (pHash and OCR text matching)
      • Manual checks of 100 classified images
        • Accuracy 0.92 for photographs and 1.00 for bar charts
  30. Perceptual Hashing
      • Process:
        • Reduce size to 32x32 pixels
        • Convert to grayscale
        • Compute 32x32 Discrete Cosine Transform (DCT)
        • Reduce DCT to 8x8 for lowest frequencies
        • Compute average DCT value
        • Binarize 64 pixels (8x8) to a 64-bit integer depending on the mean DCT value (1 if above the mean, 0 if below)
      • Similarity measure: Hamming distance
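
The hashing steps on this slide can be sketched in pure Python. This is an illustrative sketch under several assumptions, not the authors' implementation: the input is expected to be a 32x32 grayscale matrix already (resizing and color conversion are left out), the DCT-II is computed without normalization factors, and all 64 low-frequency coefficients including the DC term enter the mean. The names `phash` and `hamming_distance` are hypothetical.

```python
import math
from typing import List, Sequence

N = 32   # side length of the downscaled grayscale image
LOW = 8  # side length of the low-frequency DCT block that is kept

# DCT-II basis for the LOW lowest frequencies: BASIS[u][x] = cos(pi * (2x + 1) * u / (2N))
BASIS = [[math.cos(math.pi * (2 * x + 1) * u / (2 * N)) for x in range(N)]
         for u in range(LOW)]


def phash(pixels: Sequence[Sequence[float]]) -> int:
    """Compute a 64-bit perceptual hash of a 32x32 grayscale matrix."""
    # 1D DCT along each row, keeping only the LOW lowest horizontal frequencies.
    rows = [[sum(b * p for b, p in zip(BASIS[v], row)) for v in range(LOW)]
            for row in pixels]
    # 1D DCT along each column of the intermediate result gives the 8x8 block.
    block = [[sum(BASIS[u][y] * rows[y][v] for y in range(N)) for v in range(LOW)]
             for u in range(LOW)]
    values = [coeff for block_row in block for coeff in block_row]
    mean = sum(values) / len(values)
    fingerprint = 0
    for coeff in values:
        # One bit per coefficient: 1 if above the mean, 0 otherwise.
        fingerprint = (fingerprint << 1) | (1 if coeff > mean else 0)
    return fingerprint


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(hash_a ^ hash_b).count("1")
```

Scaling all pixel values by a constant factor scales every DCT coefficient and the mean by the same factor, so the fingerprint stays the same; this is one way to see the brightness invariance mentioned in the talk.
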
  31. Extraction of Bar Heights
      • Process:
        • convert to grayscale
        • binarize using global threshold to obtain b/w image (sharp contours)
        • pad image with white pixels to ensure bars can be filled
        • clean artifacts of black pixels using a threshold on the relative area covered by the pixels
        • remove image border
        • floodfill with black pixels and invert
        • find candidates for bars by determining the lengths of all vertical lines of black pixels
        • determine bars by clustering vertical lines
        • remove noise from whiskers, labels, and legend entries
        • assume the average height of the lines in a cluster as the bar height
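
The last steps of this process (vertical line lengths, clustering, average height per cluster) can be sketched on an already binarized image. This is a simplified illustration under stated assumptions: the input is a matrix where 1 marks a black pixel, neighbouring columns are clustered when their line lengths differ by at most a hypothetical `tolerance`, and the earlier cleaning steps (padding, artifact removal, floodfill, noise removal) are not shown.

```python
from typing import List, Sequence


def column_heights(binary: Sequence[Sequence[int]]) -> List[int]:
    """Length of the longest vertical run of black pixels in each column."""
    heights = []
    for x in range(len(binary[0])):
        best = run = 0
        for row in binary:
            run = run + 1 if row[x] else 0
            best = max(best, run)
        heights.append(best)
    return heights


def bar_heights(binary: Sequence[Sequence[int]], tolerance: int = 1) -> List[float]:
    """Cluster adjacent columns of similar line length; one average per bar."""
    bars: List[float] = []
    cluster: List[int] = []
    for h in column_heights(binary) + [0]:  # trailing 0 flushes the last cluster
        if h > 0 and (not cluster or abs(h - cluster[-1]) <= tolerance):
            cluster.append(h)
        else:
            if cluster:
                bars.append(sum(cluster) / len(cluster))
            cluster = [h] if h > 0 else []
    return bars
```

The resulting list of bar heights is the kind of input that a ratio-based comparison of two charts would need.
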
