An Adaptive Image-based Plagiarism Detection Approach

JCDL 2018 slides for the full paper "An Adaptive Image-based Plagiarism Detection Approach". Research presented by Norman Meuschke.

Find the associated paper here: https://www.gipp.com/wp-content/papercite-data/pdf/meuschke2018.pdf

Published in: Data & Analytics

  1. N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, B. Gipp: An Adaptive Image-based Plagiarism Detection Approach
     Norman Meuschke, Information Science Group, University of Konstanz, www.isg.uni.kn
  2. University of Konstanz
     [Map of the university location. Map data ©2018 GeoBasis-DE/BKG (©2009), Google]
  3. Outline
     • Overview of Research on Academic Plagiarism Detection
     • Image-based Plagiarism Detection Approach
     • Evaluation Results
  4. Academic Plagiarism
     “The use of ideas, concepts, words, or structures without appropriately acknowledging the source to benefit in a setting where originality is expected.”
     Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
  5. Plagiarism Forms
     Note: plagiarism forms are not mutually exclusive
     • Paraphrasing
       ▪ intentional rewriting
       ▪ no or insufficient reference to the source
     • Structural and idea plagiarism
       ▪ little or no verbatim text overlap
     • Cross-language plagiarism
       ▪ manual or automated conversion of text into another language to hide its origin
     • Copy & paste
       ▪ taking content verbatim from another source
     • Shake & paste
       ▪ copy & paste of text segments with slight adjustments, e.g., synonym substitutions
     • Technical disguise
       ▪ techniques that exploit weaknesses of current detection methods
     [Figure axis: level of obfuscation, from weak to strong]
  6. Detection Capabilities
     • Copy & paste, Shake & paste: solved
       ▪ n-gram fingerprinting
       ▪ vector space models
       ▪ text alignment
       ▪ exhaustive string matching
     • Technical disguise: solvable, no research needed
       ▪ encoding checks
       ▪ checks for textual content
       ▪ checks for large images
     • Paraphrasing, Structural and idea plagiarism, Cross-language plagiarism: intense research; methods limited by text-based candidate retrieval (R approx. 0.8 for moderate disguise)
       ▪ synonym expansion (WordNet)
       ▪ Semantic Role Labeling
       ▪ Latent Semantic Analysis
       ▪ POS-aware text matching
       ▪ CL Character N-Gram Comparison
       ▪ CL Explicit Semantic Analysis
       ▪ CL Alignment-based Similarity Analysis
     [Figure axis: level of obfuscation, from weak to strong]
  7. Our Research
     • Combine analysis of textual and non-textual content features
     [Figure: detection process from start through candidate retrieval (heuristics), detailed comparison, post processing, and human inspection (visualization) to end. Analyzed features: full text similarity, text snippets, citation patterns, fuzzy citation patterns, semantic similarity, cross-lingual similarity, mathematical formulae, image similarity. Legend: completed research, current research, future research]
  8. Idea of Image-based Plagiarism Detection
     • Images in academic documents convey much semantic information in a compressed format, independent of the text
     • Much research on Content-based Image Retrieval (CBIR)
     • Little adaptation of CBIR methods to plagiarism detection (PD)
       ▪ exact and cropped image copies
       ▪ affinely transformed images (scaling, rotation, projection)
       ▪ slight alterations of appearance (blurring, lower resolution, noise)
  9. Research Gap
     • Current image-based PD approaches are problematic for:
       ▪ compound images
       ▪ rearranged images
       ▪ images mostly containing text (typically tables inserted as figures)
       ▪ visually differing, semantically equivalent data visualizations
     • Goal: an image-based PD process that:
       ▪ combines established and new analysis methods to cover heterogeneous images in academic documents
       ▪ adaptively applies suitable analysis steps
       ▪ flexibly quantifies suspiciousness
       ▪ is extensible in the future
  10. Process
      [Figure: processing pipeline. Input doc. → extract image → decompose image → classify image → analysis methods (perceptual hashing, ratio hashing, OCR followed by k-gram text matching and positional text matching) → distance calculation against the reference DB (D_pHash, D_rHash, D_kTM, D_posTM) → outlier detection: s(D_m) > r → potential source images]
  11. Perceptual Hashing
      • Efficient CBIR method to reliably find near image copies
      • Uses the most apparent visual features in images
      • Creates non-unique fingerprints that can be compared
      • Fingerprints are invariant to:
        ▪ scaling
        ▪ aspect ratio changes
        ▪ changes to brightness, contrast, and colors
      • We use Discrete Cosine Transformation and Hamming Distance
      Image Source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e
  12. k-gram Text Matching
      • To identify tables inserted as figures and images with little visual similarity
      • Text extracted using the open source OCR engine Tesseract
      • Granularity:
        ▪ character 3-grams
        ▪ no chunk selection
      • Similarity measure (distance score): $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$
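
The k-gram measure on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that K1 and K2 are the sets of character 3-grams of the two OCR texts, reads ⊖ as the symmetric set difference, and returns infinity when no k-gram is shared. The function names `char_kgrams` and `kgram_distance` are hypothetical.

```python
def char_kgrams(text: str, k: int = 3) -> set:
    """Return the set of all character k-grams of a text (no chunk selection)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def kgram_distance(text_a: str, text_b: str, k: int = 3) -> float:
    """Size of the symmetric difference divided by the size of the intersection."""
    grams_a, grams_b = char_kgrams(text_a, k), char_kgrams(text_b, k)
    shared = grams_a & grams_b
    if not shared:
        # Assumption: texts without any shared k-gram get the maximal distance.
        return float("inf")
    return len(grams_a ^ grams_b) / len(shared)


print(kgram_distance("abcd", "abcd"))  # identical texts
print(kgram_distance("abcd", "abce"))  # one differing character
```

Identical texts yield a distance of 0, and the score grows as fewer k-grams are shared, so smaller values indicate more similar image text.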
  13. Position-aware Text Matching
      • To account for the typically small amount of text in images, aggravated by OCR errors
      • Process:
        ▪ Scale images to the same height (here: 800px)
        ▪ Define a proximity region around identified text (here: 50px circle)
        ▪ Project the proximity regions of the input image onto the potential source
        ▪ Only consider matching characters in projected proximity regions
      • Similarity measure: $s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$
      [Figure: input image and reference image (widths w1, w2) scaled to the same height (h1 = h2 = 800px), with proximity regions of radius r = 25px. Legend: positional character match, positional character mismatch]
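
The process on this slide could look roughly as follows. The sketch assumes that OCR yields each character with the (x, y) position of its center, that a match requires the same character within the proximity radius after scaling both images to 800px height, and that each reference character is matched at most once (a greedy simplification). All names and the input format are hypothetical.

```python
import math


def scale_positions(chars, image_height, target_height=800.0):
    """Scale (char, x, y) tuples so that the image height becomes target_height."""
    factor = target_height / image_height
    return [(ch, x * factor, y * factor) for ch, x, y in chars]


def positional_similarity(input_chars, input_height, ref_chars, ref_height,
                          radius=25.0):
    """s = matches / max(|K1|, |K2|), counting only position-aware matches."""
    scaled_in = scale_positions(input_chars, input_height)
    scaled_ref = scale_positions(ref_chars, ref_height)
    if not scaled_in or not scaled_ref:
        return 0.0
    used = set()  # indices of reference characters that are already matched
    matches = 0
    for ch, x, y in scaled_in:
        for j, (ref_ch, ref_x, ref_y) in enumerate(scaled_ref):
            close = math.hypot(x - ref_x, y - ref_y) <= radius
            if j not in used and ch == ref_ch and close:
                used.add(j)
                matches += 1
                break
    return matches / max(len(scaled_in), len(scaled_ref))


input_image = [("A", 100, 100), ("B", 200, 100), ("C", 300, 300)]  # 800px high
reference = [("A", 50, 50), ("B", 100, 50), ("X", 150, 150)]       # 400px high
print(positional_similarity(input_image, 800, reference, 400))
```

In this usage example two of the three characters match at their projected positions, while "C" and "X" count as a positional mismatch.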
  14. Ratio Hashing
      • First approach to target the reuse of data (and its visualization)
      • Identifies equivalent, yet visually differing bar charts
      [Figure: two bar charts (value axis from 0 to 900), both with relative bar heights 1.00, 0.80, 0.61, 0.44, 0.30, 0.07]
      • Distance for the example: $d = |1.00-1.00| + |0.80-0.80| + |0.61-0.61| + |0.44-0.44| + |0.30-0.30| + |0.07-0.07| = 0.00$
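
The distance in this example can be reproduced with a short sketch. It assumes that the ratio hash is the list of bar heights divided by the tallest bar and sorted in descending order, and that charts with different numbers of bars are padded with zeros; both choices are assumptions made for illustration, as are the function names.

```python
def ratio_hash(bar_heights):
    """Relative bar heights (tallest bar = 1.0), sorted in descending order."""
    tallest = max(bar_heights)
    if tallest <= 0:
        raise ValueError("at least one bar must have a positive height")
    return sorted((h / tallest for h in bar_heights), reverse=True)


def ratio_distance(heights_a, heights_b):
    """Sum of absolute differences between two ratio hashes."""
    hash_a, hash_b = ratio_hash(heights_a), ratio_hash(heights_b)
    # Assumption: pad the shorter hash with zeros if the bar counts differ.
    length = max(len(hash_a), len(hash_b))
    hash_a += [0.0] * (length - len(hash_a))
    hash_b += [0.0] * (length - len(hash_b))
    return sum(abs(a - b) for a, b in zip(hash_a, hash_b))


chart_a = [900, 720, 549, 396, 270, 63]      # hypothetical bar heights
chart_b = [27, 90, 6.3, 72, 39.6, 54.9]      # same data, rescaled and reordered
print([round(r, 2) for r in ratio_hash(chart_a)])
print(round(ratio_distance(chart_a, chart_b), 6))
```

Because only height ratios enter the hash, rescaling the value axis or reordering the bars leaves the distance at (numerically) zero.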
  15. Outlier Detection
      • To quantify the suspiciousness of method-specific distance scores
      • Two assumptions:
        ▪ an image is only suspicious if it has comparably high similarity (small distance) to a small set (c = 9) of other images
        ▪ clear separation of the distance scores of the highly similar set of images
      [Figure: absolute distance scores D_m (axis values 0, 25, 33, 40, 80, 90) and relative deltas of distance scores D'_m (0.3, 0.2, 0.1), with the example d'_i = (80 − 40) / 40 = 1. A list split condition involving d'_k, k, and c separates outliers considered as potential sources from images considered as unrelated]
  16. Outlier Detection Continued
      • Find the outlier group:
        ▪ split the list of relative distance deltas if a distance is at least twice as large as its predecessor (3x as large for k-gram matching)
      • Score is suspicious (s ≥ 0.5) if the least similar outlier has a distance margin to the collection that is twice as large as the outlier's distance to the input image
      [Figure: same distance score illustration as on slide 15]
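
The list split can be illustrated as follows. This sketch covers only the search for the outlier group, not the suspiciousness score s, whose formula the slides do not state. It assumes that the relative delta is (next − previous) / previous over the sorted distances, that "at least twice as large" corresponds to a relative delta of at least 1, and that the outlier group may contain at most c = 9 images. The function and parameter names are hypothetical.

```python
def split_outliers(distances, max_outliers=9, min_relative_delta=1.0):
    """Split sorted distances into (outliers, rest) at the first large relative jump.

    Returns an empty outlier list if no jump of at least min_relative_delta
    occurs within the first max_outliers positions.
    """
    ranked = sorted(distances)
    for k in range(1, min(max_outliers, len(ranked) - 1) + 1):
        previous, current = ranked[k - 1], ranked[k]
        if previous > 0:
            delta = (current - previous) / previous
        else:
            # Assumption: any increase from a zero distance counts as a jump.
            delta = float("inf") if current > 0 else 0.0
        if delta >= min_relative_delta:
            return ranked[:k], ranked[k:]
    return [], ranked


# Distance values taken from the slide's illustration.
print(split_outliers([25, 33, 40, 80, 90]))
# Stricter split for k-gram matching (3x as large, i.e. relative delta >= 2).
print(split_outliers([25, 33, 40, 80, 90], min_relative_delta=2.0))
```

With the slide's values, the jump from 40 to 80 has a relative delta of (80 − 40) / 40 = 1, so the three closest images form the outlier group under the default setting and no group is found under the stricter k-gram setting.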
  17. Evaluation
      • Source for test images: VroniPlag collection
        ▪ crowd-sourced effort investigating plagiarism allegations
        ▪ 196 manually examined academic works (mostly PhD theses)
        ▪ most allegations confirmed by the responsible universities
      • Targeted crawl for all annotated ‘fragments’ containing images
        ▪ confirmed by at least two examiners
      • Selection of 15 representative cases (mostly from life sciences)
      • Cases embedded in 4,500 images obtained from PubMed Central
  18. Example: Near Copies
      [Figure: source image and reused image] Source: http://de.vroniplag.wikia.com/wiki/Dsa/014
  19. Example: Weak Alteration
      [Figure: source image and reused image] Source: http://de.vroniplag.wikia.com/wiki/Ry/073
  20. Example: Moderate Alteration
      [Figure: source image and reused image] Source: http://de.vroniplag.wikia.com/wiki/Ab/017
  21. Example: Strong Alteration
      [Figure: source image and reused image] Source: http://de.vroniplag.wikia.com/wiki/Ad/068
  22. Results
      • Suspicious scores (s ≥ 0.5) computed by at least one of the methods for 11 of 15 cases (R = 0.73)
      • Outlier detection effective (P = 1):
        ▪ For all input images with s ≥ 0.5, the true source image is at the top rank
        ▪ For all input images with s < 0.5, no source image is retrieved among the top-ten most similar images, i.e., no false positives
      • Perceptual hashing with sub-image extraction worked best for near copies and weakly altered images (found 6 of 9 cases)
      • Text analysis performed better than perceptual hashing for moderately and strongly altered images
        ▪ if the quality of the image was high enough to perform OCR reliably and sufficient text content was present
  23. Results Continued
      • Text analysis approaches identified 3 of 4 cases involving tables
        ▪ position-aware text matching is more robust to low OCR quality
        ▪ k-gram matching identified more cases
        ▪ combination of approaches allows processing more images
      • Dataset contained only one bar chart, for which ratio hashing yielded an extremely suspicious score (s = 0.92)
      [Figure: source image and reused image] Source: http://de.vroniplag.wikia.com/wiki/Cz/047
  24. Discussion & Conclusion
      • Image-based PD is a promising complement to other methods
      • Small test collection, but the restrictive outlier detection procedure will prevent false positives also in larger collections
        ▪ if reduced precision is acceptable, the threshold can be changed interactively by the user
      • Approach well suited for scaling
        ▪ Preprocessing in parallel
        ▪ Options described to scale analysis methods
      • Approach easily extensible with new methods
        ▪ New input scores to outlier detection
      • Code: www.purl.org/imagepd
  25. Future Work
      • More detection methods tailored to specific data visualizations
      • Scale the process
        ▪ parallelization of preprocessing
        ▪ candidate selection for feature descriptors
      • Realize hybrid process
      [Figure: detection process from start through candidate retrieval (heuristics), detailed comparison, post processing, and human inspection (visualization) to end. Analyzed features: full text similarity, text snippets, citation patterns, fuzzy citation patterns, semantic similarity, cross-lingual similarity, mathematical formulae, image similarity. Legend: completed research, current research, future research]
  26. Questions?
      Norman Meuschke, n@meuschke.org
      • Code: www.purl.org/imagepd
      • Contact, publications, other projects: www.isg.uni.kn
  27. Image Extraction & Decomposition
      • Extraction:
        ▪ poppler framework
        ▪ convert to JPEG
        ▪ discard images smaller than 7.5 KB (typically logos)
      • Decomposition:
        ▪ assume white pixels separate sub-images
        ▪ assume rectangular sub-images aligned horizontally or vertically
        ▪ tradeoff (images remain analyzable if decomposition fails)
  28. Decomposition
      • Process:
        ▪ conversion to grayscale to reduce runtime
        ▪ padding with white pixels to remove a potential border
        ▪ binarization using adaptive thresholding to obtain a b/w image
        ▪ dilation to ensure black pixels are connected
        ▪ flood fill of white areas with black pixels
        ▪ subtract original image
        ▪ invert image
        ▪ blob detection using the algorithm of Suzuki and Abe [1]
        ▪ estimate bounding box by looking for large contours aligned along the image axes
        ▪ crop and store the identified sub-images
      [1] Satoshi Suzuki and Keiichi Abe. 1985. Topological Structural Analysis of Digitized Binary Images by Border Following. CVGIP 30, 1 (1985).
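
The two decomposition assumptions (white pixels separate sub-images; sub-images are rectangular and axis-aligned) can be illustrated with a much simpler projection-based split. This is not the contour-based procedure listed on the slide, only a sketch of the underlying idea. It assumes a grayscale image given as a list of pixel rows with values from 0 to 255 and a hypothetical white threshold of 250.

```python
def content_runs(flags):
    """Return (start, end) index pairs for runs of consecutive True values."""
    runs, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(flags)))
    return runs


def decompose(image, white=250):
    """Return bounding boxes (x0, y0, x1, y1) of sub-images separated by white gaps."""
    height, width = len(image), len(image[0])
    column_has_ink = [any(image[y][x] < white for y in range(height))
                      for x in range(width)]
    boxes = []
    for x0, x1 in content_runs(column_has_ink):
        # Within each vertical strip, split again along all-white rows.
        row_has_ink = [any(image[y][x] < white for x in range(x0, x1))
                       for y in range(height)]
        for y0, y1 in content_runs(row_has_ink):
            boxes.append((x0, y0, x1, y1))
    return boxes


W = 255
compound = [
    [0, 0, W, W, W, W, W],
    [0, 0, W, W, W, W, W],
    [W, W, W, W, 0, 0, 0],
    [W, W, W, W, 0, 0, 0],
    [W, W, W, W, 0, 0, 0],
]
print(decompose(compound))
```

If no white gap is found, the function returns a single box covering the whole content, which mirrors the tradeoff noted on slide 27: the image remains analyzable when decomposition fails.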
  29. Image Classification
      • Deep CNN realized using Caffe and the AlexNet architecture [2]
      • CNN classifies images into:
        ▪ photographs (pHash only)
        ▪ bar charts (ratio hashing only)
        ▪ other image types (pHash and OCR text matching)
      • Manual checks of 100 classified images
      • Accuracy 0.92 for photographs and 1.00 for bar charts
  30. Perceptual Hashing
      • Process:
        ▪ Reduce size to 32x32 pixels
        ▪ Convert to grayscale
        ▪ Compute 32x32 Discrete Cosine Transform (DCT)
        ▪ Reduce DCT to 8x8 for lowest frequencies
        ▪ Compute average DCT value
        ▪ Binarize the 64 values (8x8) to a 64-bit integer depending on the mean DCT value (1 if above the mean, 0 if below)
      • Similarity measure: Hamming distance
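
The hashing steps above can be sketched in plain Python. The sketch assumes the input is already a grayscale image given as a list of pixel rows, shrinks it by simple block averaging, and uses an unnormalized DCT-II; these are simplifications chosen for a dependency-free illustration, and the function names are hypothetical.

```python
import math


def downscale(gray, size=32):
    """Shrink a grayscale image (list of pixel rows) to size x size by block averaging."""
    height, width = len(gray), len(gray[0])
    small = []
    for r in range(size):
        y0 = r * height // size
        y1 = max((r + 1) * height // size, y0 + 1)
        row = []
        for c in range(size):
            x0 = c * width // size
            x1 = max((c + 1) * width // size, x0 + 1)
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def low_frequency_dct(pixels, keep=8):
    """Unnormalized 2D DCT-II, computing only the keep x keep lowest frequencies."""
    n = len(pixels)
    basis = [[math.cos(math.pi * (2 * i + 1) * f / (2 * n)) for i in range(n)]
             for f in range(keep)]
    rows = [[sum(p * b for p, b in zip(row, basis[v])) for v in range(keep)]
            for row in pixels]
    return [[sum(rows[y][v] * basis[u][y] for y in range(n)) for v in range(keep)]
            for u in range(keep)]


def phash(gray):
    """64-bit fingerprint: one bit per low-frequency DCT value (1 if above the mean)."""
    coefficients = [c for row in low_frequency_dct(downscale(gray)) for c in row]
    mean = sum(coefficients) / len(coefficients)
    bits = 0
    for c in coefficients:
        bits = (bits << 1) | (1 if c > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two fingerprints."""
    return bin(hash_a ^ hash_b).count("1")


ramp = [[8 * x for x in range(32)] for _ in range(32)]             # horizontal gradient
enlarged = [[ramp[y // 2][x // 2] for x in range(64)] for y in range(64)]
print(hamming_distance(phash(ramp), phash(enlarged)))
```

In the usage example the enlarged copy of the gradient produces the same fingerprint as the original, which illustrates the scaling invariance named on slide 11.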
  31. Extraction of Bar Heights
      • Process:
        ▪ convert to grayscale
        ▪ binarize using a global threshold to obtain a b/w image (sharp contours)
        ▪ pad image with white pixels to ensure bars can be filled
        ▪ clean artifacts of black pixels using a threshold on the relative area covered by the pixels
        ▪ remove image border
        ▪ flood fill with black pixels and invert
        ▪ find candidates for bars by determining the lengths of all vertical lines of black pixels
        ▪ determine bars by clustering vertical lines
        ▪ remove noise from whiskers, labels, and legend entries
        ▪ assume the average height of the lines in a cluster as the bar height
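
The last four steps (vertical line lengths, clustering, noise removal, averaging) can be sketched on an already binarized image. The sketch assumes a binary image given as a list of rows with 1 for black pixels, clusters neighboring columns whose line lengths differ by at most a tolerance, and discards clusters narrower than a minimum width as noise. The parameters `min_width` and `tolerance` and the function names are hypothetical.

```python
def column_run_lengths(binary):
    """Length of the longest vertical run of black pixels (1) in each column."""
    height, width = len(binary), len(binary[0])
    lengths = []
    for x in range(width):
        best = run = 0
        for y in range(height):
            run = run + 1 if binary[y][x] else 0
            best = max(best, run)
        lengths.append(best)
    return lengths


def bar_heights(binary, min_width=2, tolerance=1):
    """Cluster neighboring columns of similar line length and average each cluster."""
    bars, cluster = [], []
    for length in column_run_lengths(binary) + [0]:  # trailing 0 closes the last cluster
        if length > 0 and (not cluster or abs(length - cluster[-1]) <= tolerance):
            cluster.append(length)
            continue
        if len(cluster) >= min_width:  # thin clusters (e.g. whiskers) count as noise
            bars.append(sum(cluster) / len(cluster))
        cluster = [length] if length > 0 else []
    return bars


chart = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 1, 1, 1, 1],
]
print(bar_heights(chart))
```

In the usage example the one-pixel-wide line in the last column is dropped as noise, leaving two bars whose heights could then be turned into a ratio hash.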