Analyzing Nontextual Content Features
to Detect Academic Plagiarism
Norman Meuschke
Information Science Group
University of Konstanz
www.isg.uni.kn
Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 1
Outline
• Introduction
• Short Bio
• Overview of Research Group
• Overview of Academic Plagiarism Detection
• Research Approach: Analyzing Nontextual Content Features
• (Analyzing Academic Citations)
• Analyzing Mathematics
• Analyzing Images
Short Bio
• UC Berkeley, California: MA Thesis, Dept. of Statistics
• SciPlore Startup (2011 – 2014)
• National Institute of Informatics Tokyo: PhD (2014 – Jan 2015)
• Univ. of Konstanz: PhD (since Feb. 2015)
• Univ. of Magdeburg: BA / MA Information Systems
2011
Feb 2015 - Today
University of Konstanz
Research Group – www.isg.uni.kn
Prof. Bela Gipp Dr. Moritz Schubotz Corinna Breitinger Philip Ehret
André Greiner-Petter Felix Hamborg Thomas Hepp Philipp Scharpf
Alexander Schönhals Malte Schwarzer Vincent Stange Patrick Wortner
visiting researcher
at NII
Research Areas
• Information Science
• applied, use-case driven and user-focused computer science
• We focus on three areas
• Semantic Document Analysis & Document Retrieval
• Mathematical Information Retrieval
• Blockchain Applications
Semantic Document Analysis & Document Retrieval
• Plagiarism Detection
• Literature Recommendation
• Research Papers
• Wikipedia
• Legal Documents
• News Analysis
• Media Bias Detection
Corinna Breitinger
Felix Hamborg
Philipp Scharpf
Malte Schwarzer
Vincent Stange
Norman Meuschke
Gent Ymeri
Max Kutzner
Anastasia Zhukova
Mathematical Information Retrieval
• Conversion and Enrichment
• Search
• Recommendation
Philipp Scharpf
Dr. Moritz Schubotz André Greiner-Petter
Felix Petersen
Blockchain Applications
• Trusted Timestamping
• Blockchain for Science
Vincent Stange
Thomas Hepp
Christopher Gondek
Daniel Muffler
Jannik Bamberger
Philip Ehret
Alexander Schönhals
Patrick Wortner
Analyzing Nontextual Content Features
to Detect Academic Plagiarism
Academic Plagiarism
“The use of ideas, concepts, words, or structures
without appropriately acknowledging the source to
benefit in a setting where originality is expected.”
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that
transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
Plagiarism Forms
Note: plagiarism forms are not mutually exclusive
Paraphrasing
• intentional rewriting
• no / insufficient reference to the source
Structural and idea plagiarism
• little or no verbatim text overlap
Cross-language plagiarism
• manual/automated conversion of text into another language to hide its origin
Copy & paste
• taking content verbatim from another source
Shake & paste
• copy & paste of text segments with slight adjustments, e.g., synonym substitutions
Technical disguise
• techniques that exploit weaknesses of current detection methods
Detection Capabilities
Level of obfuscation: weak → strong
Copy & paste, Shake & paste
• n-gram fingerprinting
• vector space models
• text alignment
• exhaustive string matching
• Status: solved
Technical disguise
• encoding checks
• checks for textual content
• checks for large images
• Status: solvable, no research needed
Paraphrasing, Structural and idea plagiarism
• synonym expansion (WordNet)
• Semantic Role Labeling
• Latent Semantic Analysis
• POS-aware text matching
Cross-language plagiarism
• CL Character N-Gram Comp.
• CL Explicit Semantic Analysis
• CL Alignment-based Similarity Analysis
Status for the strongly obfuscated forms:
• intense research
• methods limited by text-based candidate retrieval (R approx. 0.8 for moderate disguise)
Research Approach
• Analyze similarity features that:
• contain a high degree of semantic information
• exhibit low variability in their representations
• are not easily substitutable
• Combine the analysis of nontextual and textual content features
Our Research
[Figure: detection process from start to end with the stages candidate retrieval, detailed comparison, post processing, and human inspection. Labeled components: heuristics, full text similarity, text snippets, citation patterns, fuzzy citation patterns, mathematical formulae, image similarity, semantic similarity, cross-lingual similarity, visualization. Legend: completed research, current research, future research.]
Analyzing Academic Citations
B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection:
Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on
Document Engineering (DocEng ’11), 2011.
B. Gipp, N. Meuschke, C. Breitinger, M. Lipinski, and A. Nuernberger, “Demonstration of Citation Pattern
Analysis for Plagiarism Detection,” in Proc. ACM SIGIR Conf., 2013.
B. Gipp, N. Meuschke, and C. Breitinger, “Citation-based Plagiarism Detection: Practicability on a Large-
scale Scientific Corpus,” JASIST, vol. 65, iss. 2, pp. 1527-1540, 2014.
B. Gipp, Citation-based Plagiarism Detection – Detecting Disguised and Cross-language Plagiarism using
Citation Pattern Analysis, Springer Vieweg Research, 2014.
Analyzing Citation Patterns
[Figure: two example documents, Document A and Document B, whose in-text citations refer to Doc C, Doc D, and Doc E. Citation pattern of Doc A: EDC. Citation pattern of Doc B: DECDC. Pattern comparison: Doc A: Ins. E Ins. D C; Doc B: D E C D C.]
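One way to compare two citation patterns such as EDC and DECDC is the longest common citation sequence, which is named among the pattern matching algorithms in the DocEng ’11 reference. The following is a minimal sketch using standard dynamic programming for the longest common subsequence. The function name and the encoding of each cited document as a single letter are assumptions made for illustration, not the authors' implementation.

```python
def longest_common_citation_sequence(a, b):
    """Return one longest common subsequence of two citation sequences."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, cited_a in enumerate(a, 1):
        for j, cited_b in enumerate(b, 1):
            if cited_a == cited_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back through the table to recover the sequence itself.
    sequence = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            sequence.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return sequence[::-1]


doc_a = list("EDC")    # citation pattern of Doc A
doc_b = list("DECDC")  # citation pattern of Doc B
print(longest_common_citation_sequence(doc_a, doc_b))  # ['E', 'D', 'C']
```

Here all three citations of Doc A occur in the same order in Doc B, although no text overlap is required for this match.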
Conclusion CbPD Evaluation
• Successes
• Significantly higher detection performance for disguised plagiarism
in biomedicine
• Decrease in user effort (time savings for examination)
• First approach to allow n:n comparisons for large collections
• Limitations
• Effectiveness varies depending on discipline
• Low for mathematics and physics
• Effectiveness is low for shorter cases of disguised plagiarism
Analyzing Mathematical Expressions
N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect
Academic Plagiarism”, in Proc. ACM CIKM, 2017.
N. Meuschke, M. Schubotz, M. Kramer, V. Stange, and B. Gipp, “Improving External Plagiarism Detection
for Academic Documents by Analyzing Mathematical Expressions”, in review at ACM CIKM, 2018.
Characteristics of Math-Heavy Texts
• Texts in math-heavy disciplines interweave natural
and symbolic language
… one is not understandable without the other
Characteristics of Math-Heavy Texts
• Mathematical expressions share many characteristics
of academic citations
• much semantic information
• language-independent
• hard to leave out or substitute (yet easy to obfuscate)
Initial MathPD Study
• Preliminary:
• Manual analysis of 39 confirmed cases of plagiarism
• Automated retrieval experiments:
• 10 real-world cases of mathematical plagiarism; source documents
embedded in NTCIR-11 Math Retrieval corpus
(105K arXiv documents, 60M mathematical expressions)
Math-based Feature Comparison
• Formulae from: 10 plagiarized doc., 10 source doc., 105,120 arXiv doc.
• Mathosphere Framework
• Features: Identifiers (ci), Numbers (cn), Operators (co), Feature Combination
• All-to-all comparison of documents and document partitions
• Computation: relative distance of frequency histograms of feature occurrences
[Figure: frequency histograms of the identifiers (r, x), numbers (2, 3), and operators (-) occurring in two example formulae from Doc1 and Doc2, each with the resulting distance Δ. Example distances shown: .13 and .50.]
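The comparison of frequency histograms can be sketched as follows. The slide does not state the exact normalization, so this sketch assumes the distance is the sum of absolute frequency differences divided by the total number of feature occurrences in both documents, which yields 0 for identical histograms and 1 for disjoint ones. The function name and the example features are hypothetical.

```python
from collections import Counter


def histogram_distance(features1, features2):
    """Relative distance of two feature frequency histograms (assumed normalization)."""
    h1, h2 = Counter(features1), Counter(features2)
    total = sum(h1.values()) + sum(h2.values())
    if total == 0:
        return 0.0  # two documents without features are treated as identical
    # Counter returns 0 for features that are missing in one document.
    difference = sum(abs(h1[f] - h2[f]) for f in h1.keys() | h2.keys())
    return difference / total


# Identifiers and numbers extracted from two hypothetical formulae.
print(histogram_distance(["r", "x", "x"], ["x", "r", "x"]))  # 0.0
print(histogram_distance(["2"], ["2", "3"]))
```

Computing one such distance per feature type (identifiers, numbers, operators) and one for their combination gives the per-feature scores that are compared in the study.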
Results Initial MathPD Study
Rank at which the source was retrieved, for full documents and for document partitions:

Case | full document                     | partitions
     | D       dci   dcn      dco        | D       dci    dcn      dco
C1   | 3,606   1     27,857   30,784     | 1       1      85,418   99,201
C2   | 1       1     88,891   90,962     | 1       1      12,266   10,277
C3   | 11,628  2     28,415   3,144      | 1       16     34,966   5,757
C4   | 2,581   1     1,950    86         | 189     6      54,560   18,374
C5   | 1       1     5,790    22,408     | 1       6      92,951   16,180
C6   | 25,498  12    19,862   38,145     | 7,976   3      24,405   72,687
C7   | 1       1     4,690    1,627      | 19,900  1      67,614   14,758
C8   | 1       1     39,215   11,576     | 1       1      21,152   9,475
C9   | 1       1     13,591   35,393     | 1       1      11,519   32,687
C10  | 1       1     76,678   30,673     | 1       1,223  89,703   3,280
MRR  | 0.60    0.86  < 0.01   < 0.01     | 0.70    0.57   < 0.01   < 0.01
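The MRR row can be reproduced from the ranks in the table. The short sketch below computes the mean reciprocal rank for the two identifier columns (dci); the function and variable names are chosen here for illustration.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all cases; rank 1 means the source was the top hit."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Ranks of the true source for cases C1..C10, identifier feature (dci).
full_document_ranks = [1, 1, 2, 1, 1, 12, 1, 1, 1, 1]
partition_ranks = [1, 1, 16, 6, 6, 3, 1, 1, 1, 1223]

print(round(mean_reciprocal_rank(full_document_ranks), 2))  # 0.86
print(round(mean_reciprocal_rank(partition_ranks), 2))      # 0.57
```

Because each case contributes at most 1, a single source retrieved at a very low rank (such as 1,223 for C10) contributes almost nothing to the score.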
Extension to Math-based Retrieval Process
• Candidate Retrieval using Elasticsearch
• Detailed Analysis using Pattern Analysis for Identifiers
[Figure: retrieval process. The user submits input document(s). Preprocessing extracts the full text, MathML formulae, citations & references, text fingerprints, and math. identifiers (list & histogram) into a unified document (example: Doc. ID=78), which is indexed in the data storage. Candidate Retrieval returns candidate documents, Detailed Analysis returns similar documents, followed by Human Inspection.]
Results
• Math-based approach
• candidate retrieval step reduced recall (R = 0.7)
• detailed pattern analysis increased MRR to 0.93
• Combined analysis (math, citation, text) could identify suspicious
similarity for all cases (R = 1, MRR = 1)
Analyzing Images
N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based
Plagiarism Detection Approach,” in Proc. ACM/IEEE-CS Joint Conf. on Digital Libraries (JCDL), 2018.
N. Meuschke, V. Stange, M. Schubotz, and B. Gipp, “HyPlag: A Hybrid Approach to Academic Plagiarism
Detection,” in Proc. ACM SIGIR Conf., 2018.
Idea of Image-based Plagiarism Detection
• Images in academic documents convey much semantic
information in compressed format independent of the text
• Much research on Content-based Image Retrieval (CBIR)
• Little adaptation of CBIR methods to plagiarism detection (PD)
• exact and cropped image copies
• affinely transformed images (scaling, rotation, projection)
• slight alterations of appearance (blurring, lower resolution, noise)
Research Gap
• Current image-based PD approaches are problematic for:
• compound images
• rearranged images
• images mostly containing text (typically tables inserted as figures)
• visually differing, semantically equivalent data visualizations
• Goal: image-based PD process that:
• combines established and new analysis methods to cover
heterogeneous images in academic documents
• adaptively applies suitable analysis steps
• flexibly quantifies suspiciousness
• is extensible in the future
Process
[Figure: processing pipeline. Input doc. → extract image → decompose image → classify image → analysis methods (perceptual hashing, ratio hashing, OCR followed by k-gram text matching and positional text matching) → distance calculation against the reference DB (D_pHash, D_rHash, D_kTM, D_posTM) → outlier detection: s(D_m) > r → potential source images.]
Perceptual Hashing
• Efficient CBIR method to reliably find near image copies
• Uses most apparent visual features in images
• Creates non-unique fingerprints that can be compared
• Fingerprints are invariant to:
• scaling
• aspect ratio changes
• changes to brightness, contrast and colors
• We use the Discrete Cosine Transform and Hamming distance
Image Source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e
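A DCT-based perceptual hash compared with the Hamming distance can be sketched in pure Python as follows. This is an illustrative sketch under several assumptions that the slide does not specify: the image is already a 32×32 grayscale matrix, the fingerprint uses the 8×8 lowest-frequency coefficients without the DC term, and bits are set by comparison with the median. All function names are hypothetical.

```python
import math


def dct_1d(values):
    """Unnormalized DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_2d(matrix):
    """Separable 2D DCT-II: transform all rows, then all columns."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def perceptual_hash(gray, block=8):
    """Fingerprint bits from the low-frequency DCT coefficients of a grayscale image."""
    coeffs = dct_2d(gray)
    # Keep the top-left block and drop the DC term, which only encodes brightness.
    low = [coeffs[u][v] for u in range(block) for v in range(block)][1:]
    median = sorted(low)[len(low) // 2]
    return [1 if c > median else 0 for c in low]


def hamming_distance(hash1, hash2):
    """Number of differing fingerprint bits."""
    return sum(b1 != b2 for b1, b2 in zip(hash1, hash2))


# Synthetic test image and a copy with halved contrast.
image = [[(3 * x * y + x * x) % 256 for x in range(32)] for y in range(32)]
darker = [[0.5 * value for value in row] for row in image]
print(hamming_distance(perceptual_hash(image), perceptual_hash(darker)))  # 0
```

Scaling all pixel values scales every DCT coefficient by the same factor, so the comparison with the median and therefore the fingerprint stays unchanged. In practice a library would handle decoding, grayscale conversion, and downscaling of the image before this step.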
k-gram Text Matching
• To identify tables inserted as figures
and images with little visual similarity
• Text extracted using open source
OCR engine Tesseract
• Granularity:
• character 3-grams
• no chunk selection
• Similarity measure: $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$
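The k-gram comparison can be sketched as follows, with K1 and K2 as the sets of character 3-grams of the two OCR texts and the score computed as the size of their symmetric difference divided by the size of their intersection. The function names are hypothetical, and returning infinity for texts without any common 3-gram is an assumption of this sketch, since the slide does not cover that case.

```python
import math


def char_kgrams(text, k=3):
    """Set of all character k-grams of a text (no chunk selection)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def kgram_distance(text1, text2, k=3):
    """Symmetric difference of the k-gram sets relative to their intersection."""
    k1, k2 = char_kgrams(text1, k), char_kgrams(text2, k)
    common = k1 & k2
    if not common:
        return math.inf  # assumption: no shared k-gram means maximally distant
    return len(k1 ^ k2) / len(common)


print(kgram_distance("abcd", "abce"))  # 2.0
print(kgram_distance("Table 1: Results", "Tab1e 1: Results"))  # one OCR error
```

A single misrecognized character changes up to three 3-grams, which is why the score is sensitive to OCR quality when images contain little text.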
Position-aware Text Matching
• To account for the typically small amount of text in images, aggravated
by OCR errors
• Process:
• Scale images to same height (here: 800px)
• Define proximity region around identified text (here: 50px circle)
• Project proximity regions of input image to potential source
• Only consider matching characters in projected proximity regions
• Similarity measure: $s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$
[Figure: input image and reference image with widths w_1 and w_2, both scaled to the heights h_1 = 800px and h_2 = 800px, with proximity regions of radius r = 25px. Legend: positional character match, positional character mismatch.]
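The process above can be sketched as follows. The sketch assumes that OCR yields each character with the x and y coordinates of its center, that every reference character may be matched at most once, and that |K1 ∩ K2| is the number of characters matched within the proximity radius. These are assumptions for illustration, as are all names, and the example coordinates are made up.

```python
import math


def scale_characters(chars, image_height, target_height=800.0):
    """Scale character positions so that the image height equals target_height."""
    factor = target_height / image_height
    return [(char, x * factor, y * factor) for char, x, y in chars]


def positional_similarity(chars1, height1, chars2, height2, radius=25.0):
    """Share of characters that reappear at roughly the same position."""
    input_chars = scale_characters(chars1, height1)
    reference_chars = scale_characters(chars2, height2)
    if not input_chars or not reference_chars:
        return 0.0
    used = set()  # indices of reference characters that are already matched
    matches = 0
    for char, x, y in input_chars:
        for j, (ref_char, ref_x, ref_y) in enumerate(reference_chars):
            close = math.hypot(x - ref_x, y - ref_y) <= radius
            if j not in used and char == ref_char and close:
                used.add(j)
                matches += 1
                break
    return matches / max(len(input_chars), len(reference_chars))


# Input image (height 800px) and a reference image half that size.
input_image = [("A", 100, 100), ("B", 200, 100), ("D", 300, 300)]
reference_image = [("A", 52, 50), ("B", 100, 52), ("X", 150, 150)]
print(positional_similarity(input_image, 800, reference_image, 400))
```

In this example A and B match within the 25px radius after scaling, while D and X differ, so two of three characters count as positional matches.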
Ratio Hashing
• First approach to target reuse of data (and its visualization)
• identifies equivalent, yet visually differing bar charts
[Figure: two visually differing bar charts (value axis from 0 to 900) that show the same data. Both charts yield the ratios 1.00, 0.80, 0.61, 0.44, 0.30, 0.07.]
$d = |1.00 - 1.00| + |0.80 - 0.80| + |0.61 - 0.61| + |0.44 - 0.44| + |0.30 - 0.30| + |0.07 - 0.07| = 0.00$
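The idea can be sketched as follows, assuming that the bar heights have already been extracted from the image, that the ratios are the heights divided by the tallest bar and sorted in descending order, and that the distance is the sum of absolute ratio differences. The function names and the treatment of charts with different numbers of bars are assumptions of this sketch, and the bar heights are made up to match the ratios in the example.

```python
import math


def ratio_hash(bar_heights):
    """Bar heights relative to the tallest bar, sorted in descending order."""
    ordered = sorted(bar_heights, reverse=True)
    tallest = ordered[0]  # assumes at least one bar with positive height
    return [height / tallest for height in ordered]


def ratio_hash_distance(bars1, bars2):
    """Sum of absolute differences between the two ratio sequences."""
    if not bars1 or len(bars1) != len(bars2):
        return math.inf  # assumption: charts with differing bar counts are not compared
    return sum(abs(r1 - r2) for r1, r2 in zip(ratio_hash(bars1), ratio_hash(bars2)))


chart1 = [900, 720, 549, 396, 270, 63]
chart2 = [126, 1800, 540, 1440, 792, 1098]  # same data, rescaled and reordered

print([round(r, 2) for r in ratio_hash(chart1)])  # [1.0, 0.8, 0.61, 0.44, 0.3, 0.07]
print(ratio_hash_distance(chart1, chart2))        # 0.0
```

Because only ratios are compared, the distance ignores the absolute scale, the order of the bars, and any styling of the chart.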
Outlier Detection
• To quantify suspiciousness of method-specific distance scores
• Two assumptions:
• image only suspicious if comparably high similarity (small distance)
to small set (c=9) of other images
• clear separation of distance scores of highly similar set of images
[Figure: sorted absolute distance scores D_m = (0, 25, 33, 40, 80, 90) and the relative deltas of the distance scores D'_m = (0.3, 0.2, 1, 0.1), for example d'_i = (80 − 40) / 40 = 1. The list is split into D'_{m,1} and D'_{m,2} where d'_k ≥ 1; condition for list split: k ≤ c. Images before the split are outliers considered as potential sources, images after the split are considered as unrelated.]
Outlier Detection Continued
• Find outlier group:
• split list of relative distance deltas if a distance is at least twice as
large as its predecessor (3x as large for k-gram matching)
• Score suspicious (𝑠 ≥ 0.5 ) if least similar outlier has distance
margin to collection that is twice as large as outlier’s distance to the
input image
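The search for the outlier group can be sketched as follows. The sketch assumes the input is the list of distances between the input image and each candidate image, splits at the first position where a distance is at least `factor` times its predecessor, and accepts the split only if the outlier group has at most c members. The function name and the skipping of zero predecessors (for which the relative delta is undefined) are assumptions, and the computation of the score s is not covered here.

```python
def find_outlier_group(distances, c=9, factor=2.0):
    """Return the distances of the outlier group, or [] if no clear split exists."""
    ordered = sorted(distances)
    # k is the size of the candidate outlier group; it may not exceed c.
    for k in range(1, min(c, len(ordered) - 1) + 1):
        previous, current = ordered[k - 1], ordered[k]
        # Relative delta (current - previous) / previous >= factor - 1.
        if previous > 0 and current >= factor * previous:
            return ordered[:k]
    return []


scores = [25, 33, 40, 80, 90]
print(find_outlier_group(scores))              # [25, 33, 40]
print(find_outlier_group(scores, factor=3.0))  # []
```

With the distance scores from the example, the split occurs between 40 and 80 because (80 − 40) / 40 = 1. With the stricter factor of 3 used for k-gram matching, the same list contains no split.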
Evaluation
• Source for test images: VroniPlag collection
• crowd-sourced effort investigating plagiarism allegations
• 196 manually examined academic works (mostly PhD theses)
• most allegations confirmed by responsible universities
• Targeted crawl for all annotated ‘fragments’ containing images
• confirmed by at least two examiners
• Selection of 15 representative cases (mostly from life sciences)
• Cases embedded in 4,500 images obtained from PubMed Central
Example: Near Copies
Source Image Reused Image
Example: Weak Alteration
Source Image Reused Image
Example: Moderate Alteration
Source Image Reused Image
Example: Strong Alteration
Source Image Reused Image
Results
• Suspicious scores (𝑠 ≥ 0.5) for 11 of 15 cases computed by at
least one of the methods (𝑅 = 0.73)
• Outlier detection effective (𝑃 = 1):
• For all input images with 𝑠 ≥ 0.5, true source image at the top rank
• For all input images with 𝑠 < 0.5, no source image retrieved among
the top-ten most similar images, i.e. no false positives
• Perceptual hashing with sub-image extraction worked best for
near copies and weakly altered images (found 6 of 9 cases)
• Text analysis performed better than perceptual hashing for
moderately and strongly altered images
• if quality of the image was high enough to perform OCR reliably and
sufficient text content is present.
Results Continued
• Text analysis approaches identified 3 of 4 cases involving tables
• position-aware text matching more robust to low OCR quality
• k-gram matching identified more cases
• combination of approaches allows processing more images
• Dataset contained only one bar chart, for which ratio hashing
yielded extremely suspicious score (𝑠 = 0.92)
Source Image Reused Image
Discussion & Conclusion
• Image-based PD promising complement to other methods
• Small test collection, but restrictive outlier detection procedure
will prevent false positives also in larger collections
• if reduced precision is acceptable, threshold can be changed
interactively by user
• Approach well suited for scaling
• Preprocessing in parallel
• Options described to scale analysis methods
• Approach easily extensible with new methods
• New input scores to outlier detection
• Code: www.purl.org/imagepd
Future Work
• More detection methods tailored to specific data visualizations
• Scale the process
• parallelization of preprocessing
• candidate selection for feature descriptors
• Realize hybrid process
Hybrid Detection System
[Figure: hybrid detection process from start to end with the stages candidate retrieval, detailed comparison, post processing, and human inspection. Labeled components: heuristics, full text similarity, text snippets, citation patterns, fuzzy citation patterns, mathematical formulae, image similarity, semantic similarity, cross-lingual similarity, visualization. Legend: completed research, current research, future research.]
Questions?
Norman Meuschke
n@meuschke.org | @normeu
• Slides for this talk (and other talks):
www.slideshare.net/GroupGipp
• Contact, publications, other projects:
www.isg.uni.kn
• Code: www.github.com/ag-gipp
25

More Related Content

What's hot

A framework for plagiarism
A framework for plagiarismA framework for plagiarism
A framework for plagiarismcsandit
 
Auto Mapping Texts for Human-Machine Analysis and Sensemaking
Auto Mapping Texts for Human-Machine Analysis and SensemakingAuto Mapping Texts for Human-Machine Analysis and Sensemaking
Auto Mapping Texts for Human-Machine Analysis and SensemakingShalin Hai-Jew
 
Introduction to Text Analysis
Introduction to Text AnalysisIntroduction to Text Analysis
Introduction to Text AnalysisLauren Klein
 
Machine reading for cancer biology
Machine reading for cancer biologyMachine reading for cancer biology
Machine reading for cancer biologyLaura Berry
 
Semantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionSemantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionTELKOMNIKA JOURNAL
 
Letting the Machine Code Qualitative and Mixed Methods Data in NVivo 10
Letting the Machine Code Qualitative and Mixed Methods Data in NVivo 10Letting the Machine Code Qualitative and Mixed Methods Data in NVivo 10
Letting the Machine Code Qualitative and Mixed Methods Data in NVivo 10Shalin Hai-Jew
 
Intruder adaptability
Intruder adaptabilityIntruder adaptability
Intruder adaptabilityugilad
 
Plagiarism 9 th and 10th February
Plagiarism 9 th and 10th FebruaryPlagiarism 9 th and 10th February
Plagiarism 9 th and 10th Februarythirumaraikkadu
 
Issues of plagiarism in classroom
Issues of plagiarism in classroomIssues of plagiarism in classroom
Issues of plagiarism in classroomSaurav Aryal
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningIOSR Journals
 
Dpp404plagiarism
Dpp404plagiarismDpp404plagiarism
Dpp404plagiarismMuhd Sayuty
 
Keyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeKeyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeIJMTST Journal
 
Semantic Grounding Strategies for Tagbased Recommender Systems
Semantic Grounding Strategies for Tagbased Recommender Systems  Semantic Grounding Strategies for Tagbased Recommender Systems
Semantic Grounding Strategies for Tagbased Recommender Systems dannyijwest
 

What's hot (19)

[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
 
A framework for plagiarism
A framework for plagiarismA framework for plagiarism
A framework for plagiarism
 
Entropy of Fingerprints
Entropy of FingerprintsEntropy of Fingerprints
Entropy of Fingerprints
 
Auto Mapping Texts for Human-Machine Analysis and Sensemaking
Auto Mapping Texts for Human-Machine Analysis and SensemakingAuto Mapping Texts for Human-Machine Analysis and Sensemaking
Auto Mapping Texts for Human-Machine Analysis and Sensemaking
 
Introduction to Text Analysis
Introduction to Text AnalysisIntroduction to Text Analysis
Introduction to Text Analysis
 
Plag detection
Plag detectionPlag detection
Plag detection
 
Machine reading for cancer biology
Machine reading for cancer biologyMachine reading for cancer biology
Machine reading for cancer biology
 
Semantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionSemantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detection
 
Letting the Machine Code Qualitative and Mixed Methods Data in NVivo 10
Letting the Machine Code Qualitative and Mixed Methods Data in NVivo 10Letting the Machine Code Qualitative and Mixed Methods Data in NVivo 10
Letting the Machine Code Qualitative and Mixed Methods Data in NVivo 10
 
Intruder adaptability
Intruder adaptabilityIntruder adaptability
Intruder adaptability
 
Plagiarism 9 th and 10th February
Plagiarism 9 th and 10th FebruaryPlagiarism 9 th and 10th February
Plagiarism 9 th and 10th February
 
Issues of plagiarism in classroom
Issues of plagiarism in classroomIssues of plagiarism in classroom
Issues of plagiarism in classroom
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
Plagiarism
PlagiarismPlagiarism
Plagiarism
 
Dpp404plagiarism
Dpp404plagiarismDpp404plagiarism
Dpp404plagiarism
 
Keyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeKeyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood Knowledge
 
ACL-IJCNLP 2015
ACL-IJCNLP 2015ACL-IJCNLP 2015
ACL-IJCNLP 2015
 
Semantic Grounding Strategies for Tagbased Recommender Systems
Semantic Grounding Strategies for Tagbased Recommender Systems  Semantic Grounding Strategies for Tagbased Recommender Systems
Semantic Grounding Strategies for Tagbased Recommender Systems
 
De Waard Carusi
De Waard CarusiDe Waard Carusi
De Waard Carusi
 

Similar to Analyzing Nontextual Content Features to Detect Academic Plagiarism

International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) ijceronline
 
A STUDY ON PLAGIARISM CHECKING WITH APPROPRIATE ALGORITHM IN DATAMINING
A STUDY ON PLAGIARISM CHECKING WITH APPROPRIATE ALGORITHM IN DATAMININGA STUDY ON PLAGIARISM CHECKING WITH APPROPRIATE ALGORITHM IN DATAMINING
A STUDY ON PLAGIARISM CHECKING WITH APPROPRIATE ALGORITHM IN DATAMININGAllison Thompson
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewINFOGAIN PUBLICATION
 
Review of plagiarism detection and control & copyrights in India
Review of plagiarism detection and control & copyrights in IndiaReview of plagiarism detection and control & copyrights in India
Review of plagiarism detection and control & copyrights in Indiaijiert bestjournal
 
Content analysis
Content analysisContent analysis
Content analysisAtul Thakur
 
Content analysis
Content analysisContent analysis
Content analysisAtul Thakur
 
Analyzing Nontextual Content Features to Detect Academic Plagiarism

  • 1. Analyzing Nontextual Content Features to Detect Academic Plagiarism Norman Meuschke Information Science Group University of Konstanz www.isg.uni.kn Analyzing Nontextual Content Features to Detect Academic Plagiarism - Norman Meuschke 1
  • 2. Outline
      • Introduction
      • Short Bio
      • Overview of Research Group
      • Overview of Academic Plagiarism Detection
      • Research Approach: Analyzing Nontextual Content Features
      • (Analyzing Academic Citations)
      • Analyzing Mathematics
      • Analyzing Images
  • 3. Short Bio
      • UC Berkeley, California: MA Thesis, Dept. of Statistics
      • SciPlore Startup (2011 – 2014)
      • National Institute of Informatics Tokyo: PhD (2014 – Jan 2015)
      • Univ. of Konstanz: PhD (since Feb. 2015)
      • Univ. of Magdeburg: BA / MA Information Systems
      • Timeline: 2011; Feb 2015 - Today
  • 4. University of Konstanz (map; map data ©2018 GeoBasis-DE/BKG (©2009), Google)
  • 5. Research Group – www.isg.uni.kn
      Prof. Bela Gipp, Dr. Moritz Schubotz, Corinna Breitinger, Philip Ehret, André Greiner-Petter, Felix Hamborg, Thomas Hepp, Philipp Scharpf, Alexander Schönhals, Malte Schwarzer, Vincent Stange, Patrick Wortner
      visiting researcher at NII
  • 6. Research Areas
      • Information Science
      • applied, use-case driven and user-focused computer science
      • We focus on three areas
      • Semantic Document Analysis & Document Retrieval
      • Mathematical Information Retrieval
      • Blockchain Applications
  • 7. Semantic Document Analysis & Document Retrieval
      • Plagiarism Detection
      • Literature Recommendation
      • Research Papers
      • Wikipedia
      • Legal Documents
      • News Analysis
      • Media Bias Detection
      Corinna Breitinger, Felix Hamborg, Philipp Scharpf, Malte Schwarzer, Vincent Stange, Norman Meuschke, Gent Ymeri, Max Kutzner, Anastasia Zhukova
  • 8. Mathematical Information Retrieval
      • Conversion and Enrichment
      • Search
      • Recommendation
      Philipp Scharpf, Dr. Moritz Schubotz, André Greiner-Petter, Felix Petersen
  • 9. Blockchain Applications
      • Trusted Timestamping
      • Blockchain for Science
      Vincent Stange, Thomas Hepp, Christopher Gondek, Daniel Muffler, Jannik Bamberger, Philip Ehret, Alexander Schönhals, Patrick Wortner
  • 10. Analyzing Nontextual Content Features to Detect Academic Plagiarism
  • 11. Academic Plagiarism
      “The use of ideas, concepts, words, or structures without appropriately acknowledging the source to benefit in a setting where originality is expected.”
      Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
  • 12. Plagiarism Forms
      Note: plagiarism forms are not mutually exclusive. The slide arranges them along an axis "level of obfuscation" from weak to strong.
      • Paraphrasing: intentional rewriting; no / insufficient reference to the source
      • Structural and idea plagiarism: little or no verbatim text overlap
      • Cross-language plagiarism: manual/automated conversion of text into another language to hide its origin
      • Copy & paste: taking content verbatim from another source
      • Shake & paste: copy & paste of text segments with slight adjustments, e.g., synonym substitutions
      • Technical disguise: techniques that exploit weaknesses of current detection methods
  • 13. Detection Capabilities (by level of obfuscation, weak to strong)
      • Copy & paste, Shake & paste (solved): n-gram fingerprinting, vector space models, text alignment, exhaustive string matching
      • Technical disguise (solvable, no research needed): encoding checks, checks for textual content, checks for large images
      • Paraphrasing, Structural and idea plagiarism: synonym expansion (WordNet), Semantic Role Labeling, Latent Semantic Analysis, POS-aware text matching
      • Cross-language plagiarism: CL Character N-Gram Comp., CL Explicit Semantic Analysis, CL Alignment-based Similarity Analysis
      • Status for the strongly obfuscated forms: intense research; methods limited by text-based candidate retrieval (R approx. 0.8 for moderate disguise)
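To make the n-gram fingerprinting named on slide 13 concrete, here is a minimal sketch that compares the word 3-gram sets of two texts with the Jaccard coefficient. The function names, the choice of n = 3, and the use of Jaccard are assumptions made for illustration. They are not taken from the slides, and real systems usually hash and subsample the n-grams.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of word n-grams of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical example texts (not from the slides).
source = "Plagiarism is the use of ideas without acknowledging the source."
copied = "Plagiarism is the use of ideas without acknowledging the source."
reworded = "Using concepts while failing to credit their origin counts as plagiarism."

print(jaccard(word_ngrams(source), word_ngrams(copied)))    # verbatim copy
print(jaccard(word_ngrams(source), word_ngrams(reworded)))  # paraphrase
```

The verbatim copy shares every 3-gram with the source, while the paraphrase shares none. This is why purely text-based fingerprints lose their signal as the level of obfuscation rises.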
  • 14. Research Approach
      • Analyze similarity features that:
      • contain a high degree of semantic information
      • exhibit low variability in their representations
      • are not easily substitutable
      • Combine the analysis of nontextual and textual content features
  • 16. Analyzing Academic Citations
      • B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on Document Engineering (DocEng ’11), 2011.
      • B. Gipp, N. Meuschke, C. Breitinger, M. Lipinski, and A. Nuernberger, “Demonstration of Citation Pattern Analysis for Plagiarism Detection,” in Proc. ACM SIGIR Conf., 2013.
      • B. Gipp, N. Meuschke, and C. Breitinger, “Citation-based Plagiarism Detection: Practicability on a Large-scale Scientific Corpus,” JASIST, vol. 65, iss. 2, pp. 1527-1540, 2014.
      • B. Gipp, Citation-based Plagiarism Detection – Detecting Disguised and Cross-language Plagiarism using Citation Pattern Analysis, Springer Vieweg Research, 2014.
  • 17. Analyzing Citation Patterns
      [Figure: two example documents filled with placeholder text. Document A contains the in-text citations [1], [2], [3]; Document B (Sections 1 to 3) contains [1], [2], [1], [3], [2]. Both reference lists point to Doc C, Doc D and Doc E. The order of the citations in each document forms its citation pattern, and the pattern comparison of Doc A and Doc B marks the citations in Doc B without a counterpart in Doc A as insertions (Ins.).]
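The pattern comparison on slide 17 can be sketched with a longest-common-subsequence computation over the two citation sequences, in the spirit of the Longest Common Citation Sequence named in the DocEng ’11 reference. This is a textbook LCS, not the authors' implementation. The function name and the encoding of citations as reference numbers are assumptions.

```python
def longest_common_citation_sequence(a, b):
    """Standard dynamic-programming LCS over two citation sequences."""
    rows, cols = len(a), len(b)
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back through the table to recover one longest common sequence.
    sequence = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            sequence.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return sequence[::-1]


# In-text citation order of the two example documents on the slide.
doc_a = [1, 2, 3]
doc_b = [1, 2, 1, 3, 2]

print(longest_common_citation_sequence(doc_a, doc_b))  # [1, 2, 3]
```

All three citations of Document A reappear in the same order in Document B, so the two extra citations in Document B show up as insertions.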
  • 18. Conclusion CbPD Evaluation
      • Successes
      • Significantly higher detection performance for disguised plagiarism in biomedicine
      • Decrease in user effort (time savings for examination)
      • First approach to allow n:n comparisons for large collections
      • Limitations
      • Effectiveness varies depending on discipline
      • Low for mathematics and physics
      • Effectiveness is low for shorter cases of disguised plagiarism
  • 19. Analyzing Mathematical Expressions
      • N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp, “Analyzing Mathematical Content to Detect Academic Plagiarism,” in Proc. ACM CIKM, 2017.
      • N. Meuschke, M. Schubotz, M. Kramer, V. Stange, and B. Gipp, “Improving External Plagiarism Detection for Academic Documents by Analyzing Mathematical Expressions,” in review at ACM CIKM, 2018.
  • 20. Characteristics of Math-Heavy Texts
      • Texts in math-heavy disciplines interweave natural and symbolic language … one is not understandable without the other
  • 21. Characteristics of Math-Heavy Texts
      • Mathematical expressions share many characteristics of academic citations
      • much semantic information
      • language-independent
      • hard to leave out or substitute (yet easy to obfuscate)
  • 22. Initial MathPD Study
      • Preliminary:
      • Manual analysis of 39 confirmed cases of plagiarism
      • Automated retrieval experiments:
      • 10 real-world cases of mathematical plagiarism; source documents embedded in NTCIR-11 Math Retrieval corpus (105K arXiv documents, 60M mathematical expressions)
  • 23. Math-based Feature Comparison
      [Figure: formulae from 10 plagiarized doc., 10 source doc. and 105,120 arXiv doc. are processed with the Mathosphere Framework. For two example documents (Doc1, Doc2), frequency histograms and their distance (Δ) are shown for Identifiers (ci), Numbers (cn), Operators (co) and the Feature Combination.]
      • All-to-all comparison of documents and document partitions
      • Computation: relative distance of frequency histograms of feature occurrences
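The slide names the computation only as a "relative distance of frequency histograms of feature occurrences" and does not give the formula. The sketch below therefore assumes one plausible definition: the summed absolute count differences divided by the summed counts of both documents. The function name, this normalisation, and the example identifier lists are all assumptions, not the method of the study.

```python
from collections import Counter


def relative_histogram_distance(features_a, features_b):
    """Assumed definition: sum |h_a(f) - h_b(f)| / sum (h_a(f) + h_b(f)).

    Returns 0.0 for identical histograms and 1.0 for disjoint ones.
    """
    hist_a, hist_b = Counter(features_a), Counter(features_b)
    keys = set(hist_a) | set(hist_b)
    total = sum(hist_a[k] + hist_b[k] for k in keys)
    if total == 0:
        return 0.0  # both documents contain no features of this type
    difference = sum(abs(hist_a[k] - hist_b[k]) for k in keys)
    return difference / total


# Hypothetical identifier occurrences extracted from two documents.
doc1_identifiers = ["r", "x", "x"]
doc2_identifiers = ["r", "x"]

print(relative_histogram_distance(doc1_identifiers, doc2_identifiers))  # 0.2
```

The same function could be applied separately to identifiers, numbers and operators, which mirrors the per-feature distances on the slide. How the study combines them is not stated here.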
  • 24. Results Initial MathPD Study
      Rank at which the source was retrieved, for full documents (left block) and partitions (right block):
      Case |      D     dci     dcn     dco |      D     dci     dcn     dco
      C1   |  3,606       1  27,857  30,784 |      1       1  85,418  99,201
      C2   |      1       1  88,891  90,962 |      1       1  12,266  10,277
      C3   | 11,628       2  28,415   3,144 |      1      16  34,966   5,757
      C4   |  2,581       1   1,950      86 |    189       6  54,560  18,374
      C5   |      1       1   5,790  22,408 |      1       6  92,951  16,180
      C6   | 25,498      12  19,862  38,145 |  7,976       3  24,405  72,687
      C7   |      1       1   4,690   1,627 | 19,900       1  67,614  14,758
      C8   |      1       1  39,215  11,576 |      1       1  21,152   9,475
      C9   |      1       1  13,591  35,393 |      1       1  11,519  32,687
      C10  |      1       1  76,678  30,673 |      1   1,223  89,703   3,280
      MRR  |   0.60    0.86  < 0.01  < 0.01 |   0.70    0.57  < 0.01  < 0.01
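The MRR row of slide 24 can be reproduced from the reported ranks. The sketch below applies the standard definition of mean reciprocal rank (the mean of 1/rank over all cases) to the identifier-based ranks (dci) for full documents and for partitions. The variable and function names are illustrative.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries (here: the ten plagiarism cases)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Ranks of the source document for cases C1..C10, copied from the dci columns.
dci_full_document = [1, 1, 2, 1, 1, 12, 1, 1, 1, 1]
dci_partitions = [1, 1, 16, 6, 6, 3, 1, 1, 1, 1223]

print(round(mean_reciprocal_rank(dci_full_document), 2))  # 0.86
print(round(mean_reciprocal_rank(dci_partitions), 2))     # 0.57
```

Because each case contributes 1/rank, a source retrieved at rank 1,223 adds almost nothing to the score. That is why the number and operator features, with ranks mostly in the thousands, end up below 0.01.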
  • 25. Extension to Math-based Retrieval Process
      • Candidate Retrieval using Elasticsearch
      • Detailed Analysis using Pattern Analysis for Identifiers
      [Figure: retrieval process. Labels: input document(s) / user document; Pre-processing (full text, MathML formulae, citations & references, text fingerprints, math. identifiers as list & histogram); Indexing (unified document); Data Storage (e.g., Doc. ID=78); Start analysis; Candidate Retrieval (candidate documents); Detailed Analysis (similar documents); Human Inspection.]
  • 26. Results
      • Math-based approach
      • candidate retrieval step reduced recall (R = 0.7)
      • detailed pattern analysis increased MRR (MRR = 0.93)
      • Combined analysis (math, citation, text) could identify suspicious similarity for all cases (R = 1, MRR = 1)
  • 27. Analyzing Images
      • N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based Plagiarism Detection Approach,” in Proc. ACM/IEEE-CS Joint Conf. on Digital Libraries (JCDL), 2018.
      • N. Meuschke, V. Stange, M. Schubotz, and B. Gipp, “HyPlag: A Hybrid Approach to Academic Plagiarism Detection,” in Proc. ACM SIGIR Conf., 2018.
  • 28. Idea of Image-based Plagiarism Detection • Images in academic documents convey much semantic information in compressed format independent of the text • Much research on Content-based Image Retrieval (CBIR) • Little adaptation of CBIR methods to plagiarism detection (PD) • exact and cropped image copies • affinely transformed images (scaling, rotation, projection) • slight alterations of appearance (blurring, lower resolution, noise)
  • 29. Research Gap • Current image-based PD approaches problematic for: • compound images • rearranged images • images mostly containing text (typically tables inserted as figures) • visually differing, semantically equivalent data visualizations • Goal: image-based PD process that: • combines established and new analysis methods to cover heterogeneous images in academic documents • adaptively applies suitable analysis steps • flexibly quantifies suspiciousness • is extensible in the future
  • 30. Process (Figure: process flow. Input doc. → extract image → decompose image → classify image → analysis methods (perceptual hashing, ratio hashing, OCR followed by k-gram text matching and positional text matching) → distance calculation against the reference DB, yielding $D_{pHash}$, $D_{rHash}$, $D_{kTM}$, $D_{posTM}$ → outlier detection, $s(D_m) > r$ → potential source images.)
  • 31. Perceptual Hashing • Efficient CBIR method to reliably find near copies of images • Uses most apparent visual features in images • Creates non-unique fingerprints that can be compared • Fingerprints are invariant to: • scaling • aspect ratio changes • changes to brightness, contrast and colors • We use Discrete Cosine Transform and Hamming Distance Image Source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e
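A DCT-based perceptual hash compared by Hamming distance can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the talk: the image is assumed to be already converted to a square grayscale matrix (here 32x32), the hash keeps the 63 lowest-frequency coefficients without the DC term, and all function names are hypothetical.

```python
import math
import random
import statistics


def dct_low_frequencies(pixels, size=8):
    """Top-left size x size block of the orthonormal 2D DCT-II of a square image."""
    n = len(pixels)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    # basis[k][i] = cos((2i + 1) * k * pi / (2n)), computed once and reused.
    basis = [[math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n)]
             for k in range(size)]
    block = []
    for v in range(size):
        row = []
        for u in range(size):
            total = sum(pixels[y][x] * basis[u][x] * basis[v][y]
                        for y in range(n) for x in range(n))
            row.append(alpha(u) * alpha(v) * total)
        block.append(row)
    return block


def perceptual_hash(pixels):
    """63-bit fingerprint: one bit per low-frequency coefficient above the median."""
    block = dct_low_frequencies(pixels)
    coefficients = [c for row in block for c in row][1:]  # drop the DC term
    median = statistics.median(coefficients)
    bits = 0
    for c in coefficients:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which the two fingerprints differ."""
    return bin(hash_a ^ hash_b).count("1")


# Hypothetical test data: a pseudo-random image and a uniformly brighter copy.
rng = random.Random(0)
image = [[rng.randrange(256) for _ in range(32)] for _ in range(32)]
brighter = [[p + 10 for p in row] for row in image]
print(hamming_distance(perceptual_hash(image), perceptual_hash(brighter)))
```

A uniform brightness shift only changes the DC coefficient, which the sketch discards, so the two fingerprints agree. Resizing every image to the same small matrix before hashing is what would give the invariance to scaling and aspect ratio listed on the slide; that step is left out here.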
  • 32. k-gram Text Matching • To identify tables inserted as figures and images with little visual similarity • Text extracted using open source OCR engine Tesseract • Granularity: • character 3-grams • no chunk selection • Similarity measure: $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$
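The k-gram measure can be sketched in a few lines. The sketch assumes that K1 and K2 are the sets of character 3-grams of the two OCR texts and that ⊖ denotes the symmetric difference; returning infinity when no k-gram is shared is a choice made for this illustration, and the function names are hypothetical.

```python
def char_kgrams(text, k=3):
    """Set of all character k-grams of the text (no chunk selection)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def kgram_distance(text_a, text_b, k=3):
    """Size of the symmetric difference divided by the size of the intersection."""
    grams_a = char_kgrams(text_a, k)
    grams_b = char_kgrams(text_b, k)
    shared = grams_a & grams_b
    if not shared:
        return float("inf")  # assumption: no overlap means maximal distance
    return len(grams_a ^ grams_b) / len(shared)


print(kgram_distance("abcd", "abcd"))
print(kgram_distance("abcd", "bcde"))
```

Character-level k-grams keep a single OCR error local: one misrecognized character affects at most three 3-grams, whereas it would invalidate an entire word-level chunk.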
  • 33. Position-aware Text Matching • To account for typically small amount of text in images aggravated by OCR errors • Process: • Scale images to same height (here: 800px) • Define proximity region around identified text (here: 50px circle) • Project proximity regions of input image to potential source • Only consider matching characters in projected proximity regions • $s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$ (Figure: input image and reference image of widths $w_1$ and $w_2$, both scaled to heights $h_1 = h_2 = 800$ px, with proximity radius $r = 25$ px; the legend marks positional character matches and mismatches.)
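One way to read the position-aware matching is sketched below. Several details are assumptions for illustration only: OCR output is modelled as a list of (character, x, y) tuples, each reference character can be matched at most once, and matching is greedy. The slide specifies only the scaling to 800 px, the 50 px circle (radius 25 px), and the normalization by the larger character count.

```python
from math import hypot


def scale_to_height(chars, image_height, target_height=800.0):
    """Rescale character positions as if the image were target_height pixels tall."""
    factor = target_height / image_height
    return [(c, x * factor, y * factor) for c, x, y in chars]


def positional_similarity(input_chars, reference_chars, radius=25.0):
    """Matched characters within the proximity radius / larger character count."""
    unused = list(reference_chars)
    matches = 0
    for c, x, y in input_chars:
        for index, (rc, rx, ry) in enumerate(unused):
            if rc == c and hypot(x - rx, y - ry) <= radius:
                matches += 1
                del unused[index]  # each reference character matches only once
                break
    larger = max(len(input_chars), len(reference_chars))
    return matches / larger if larger else 0.0


# Hypothetical OCR results, already scaled to the same height.
input_image = [("A", 100, 100), ("B", 400, 400)]
reference_image = [("A", 110, 105), ("B", 700, 400), ("D", 50, 50)]
print(positional_similarity(input_image, reference_image))
```

In this example only "A" lies within 25 px of its counterpart, so one match is divided by the larger character count of three.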
  • 34. Ratio Hashing • First approach to target reuse of data (and its visualization) • identifies equivalent, yet visually differing bar charts (Figure: two bar charts with value axes from 0 to 900 and identical relative bar heights 1.00, 0.80, 0.61, 0.44, 0.30, 0.07.) $d = |1.00 - 1.00| + |0.80 - 0.80| + |0.61 - 0.61| + |0.44 - 0.44| + |0.30 - 0.30| + |0.07 - 0.07| = 0.00$
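The ratio hash comparison can be illustrated with a short sketch. It assumes that the bar heights have already been extracted from the chart and are positive numbers, and that the distance sums the absolute bar-wise differences; the function names and the `None` return value for incomparable charts are choices made for this example.

```python
def ratio_hash(bar_heights):
    """Bar heights sorted in decreasing order and divided by the tallest bar."""
    ordered = sorted(bar_heights, reverse=True)
    tallest = ordered[0]  # assumes at least one bar with positive height
    return [height / tallest for height in ordered]


def ratio_hash_distance(bars_a, bars_b, min_bars=3):
    """Sum of bar-wise differences, or None if the charts are not comparable."""
    if len(bars_a) != len(bars_b) or len(bars_a) < min_bars:
        return None
    pairs = zip(ratio_hash(bars_a), ratio_hash(bars_b))
    return sum(abs(a - b) for a, b in pairs)


# Same data drawn at a different scale and in a different bar order.
print(ratio_hash_distance([900, 450, 270], [27, 90, 45]))
```

Sorting and normalizing make the descriptor independent of axis scale and bar order, which is why two visually different renderings of the same data yield a distance of zero.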
  • 35. Outlier Detection • To quantify suspiciousness of method-specific distance scores • Two assumptions: • image only suspicious if comparatively high similarity (small distance) to small set (c = 9) of other images • clear separation of distance scores of highly similar set of images (Figure: absolute distance scores $D_m$ on an axis marked 0, 25, 33, 40, 80, 90 and the relative deltas of distance scores $D'_m$ = 0.3, 0.2, 1, 0.1, e.g., $d'_i = (80 - 40) / 40 = 1$. The list is split at the delta of 1 into $D'_{m,1}$, the outliers considered as potential sources, and $D'_{m,2}$, the images considered as unrelated. Condition for list split: $d'_k \geq 1$ and $k \leq c$.)
  • 36. Outlier Detection Continued • Find outlier group: • split list of relative distance deltas if a distance is at least twice as large as its predecessor (3x as large for k-gram matching) • Score suspicious ($s \geq 0.5$) if least similar outlier has distance margin to collection that is twice as large as outlier’s distance to the input image
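The list split used to find the outlier group can be sketched as follows. The sketch covers only the split, not the suspiciousness score s, whose formula is not given on the slides. Taking the first qualifying delta and treating a predecessor of zero as an infinite delta are assumptions, and the function names are hypothetical.

```python
def relative_deltas(ordered):
    """Relative increase of each distance over its predecessor."""
    deltas = []
    for previous, current in zip(ordered, ordered[1:]):
        if previous > 0:
            deltas.append((current - previous) / previous)
        else:
            # assumption: any increase over a zero distance counts as a split
            deltas.append(float("inf") if current > 0 else 0.0)
    return deltas


def find_outlier_group(distances, factor=2.0, max_outliers=9):
    """Distances before the first jump by at least `factor`, if at most max_outliers."""
    ordered = sorted(distances)
    for k, delta in enumerate(relative_deltas(ordered), start=1):
        if delta >= factor - 1:  # distance is at least `factor` times its predecessor
            return ordered[:k] if k <= max_outliers else []
    return []


# Distance scores from the slide's example axis.
print(find_outlier_group([90, 25, 80, 33, 40]))
```

For k-gram matching the slide uses a stricter split, which corresponds to calling `find_outlier_group(distances, factor=3.0)` in this sketch.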
  • 37. Evaluation • Source for test images: VroniPlag collection • crowd-sourced effort investigating plagiarism allegations • 196 manually examined academic works (mostly PhD theses) • most allegations confirmed by responsible universities • Targeted crawl for all annotated ‘fragments’ containing images • confirmed by at least two examiners • Selection of 15 representative cases (mostly from life sciences) • Cases embedded in 4,500 images obtained from PubMed Central
  • 38. Example: Near Copies (Figure: source image and reused image.)
  • 39. Example: Weak Alteration (Figure: source image and reused image.)
  • 40. Example: Moderate Alteration (Figure: source image and reused image.)
  • 41. Example: Strong Alteration (Figure: source image and reused image.)
  • 42. Results • Suspicious scores ($s \geq 0.5$) for 11 of 15 cases computed by at least one of the methods ($R = 0.73$) • Outlier detection effective ($P = 1$): • For all input images with $s \geq 0.5$, true source image at the top rank • For all input images with $s < 0.5$, no source image retrieved among the top-ten most similar images, i.e. no false positives • Perceptual hashing with sub-image extraction worked best for near copies and weakly altered images (found 6 of 9 cases) • Text analysis performed better than perceptual hashing for moderately and strongly altered images • if quality of the image was high enough to perform OCR reliably and sufficient text content was present
  • 43. Results Continued • Text analysis approaches identified 3 of 4 cases involving tables • position-aware text matching more robust to low OCR quality • k-gram matching identified more cases • combination of approaches allows processing more images • Dataset contained only one bar chart, for which ratio hashing yielded a highly suspicious score ($s = 0.92$) (Figure: source image and reused image.)
  • 44. Discussion & Conclusion • Image-based PD is a promising complement to other methods • Small test collection, but the restrictive outlier detection procedure will also prevent false positives in larger collections • if reduced precision is acceptable, the threshold can be changed interactively by the user • Approach well suited for scaling • Preprocessing in parallel • Options described to scale analysis methods • Approach easily extensible with new methods • New input scores to outlier detection • Code: www.purl.org/imagepd
  • 45. Future Work • More detection methods tailored to specific data visualizations • Scale the process • parallelization of preprocessing • candidate selection for feature descriptors • Realize hybrid process (Figure: hybrid detection process from start to end with the stages candidate retrieval, detailed comparison, post-processing, human inspection, and visualization; components include heuristics, full text similarity, text snippets, citation patterns, fuzzy citation patterns, mathematical formulae, image similarity, semantic similarity, and cross-lingual similarity; the legend distinguishes completed, current, and future research.)
  • 46. Hybrid Detection System
  • 47. Questions? Norman Meuschke, n@meuschke.org | @normeu • Slides for this talk (and other talks): www.slideshare.net/GroupGipp • Contact, publications, other projects: www.isg.uni.kn • Code: www.github.com/ag-gipp

Editor's Notes

  1. Aside from me, 10 other PhD students and one postdoc.
  2. The research we report on in the paper is on improving the detection of academic plagiarism, which we define as … not just copied text, but any substantial intellectual contribution
  3. Academic plagiarism occurs in a number of forms, which can be broadly categorized by their degree of obfuscation, but are not mutually exclusive
  4. Extensive research has been conducted on plagiarism detection methods, particularly for finding text plagiarism.
  5. specifically
  6. Given the inherent limitation of purely text-based analysis methods, our research focusses on complementing successful text-based analysis methods with methods that analyze nontextual content features.
  7. A while ago, Bela Gipp had the idea that academic citations fulfill all these criteria, so
  8. Features: identifiers (ci), numbers (cn), operators (co), and the combination of all features. Descriptors: frequency histograms of feature occurrence. Granularity: i) entire document, ii) document partitions. Similarity measure: relative distance of feature occurrence frequencies for the individual features (d_ci, d_cn, d_co) and for the combination of all features (D).
  9. Analyzing identifiers worked best: 8 of 10 test cases at the top rank, MRR = 0.86. Analyzing operators and numbers was generally noisy. However, for partitions, including operators and numbers in the aggregated distance performs better (7/10) than identifiers alone (5/10).
  10. With these requirements in mind, we devised the following image-based detection process. It accepts input documents in PDF format, extracts images from the PDF, decomposes compound images (a weakness of prior works on IBPD), and reduces computational load by classifying images using a CNN. The core of our process currently consists of four analysis methods, applied independently. The methods compute method-specific feature descriptors and compare them to the stored descriptors for all images in the collection. The comparisons yield separate lists of distance scores for each method. The lists are input to the outlier detection process, which returns potential sources for an image.
  11. Also weakly altered = near copies for image sections. Hamming dist. = number of bits that differ.
  12. Typically word 3-5 grams, i.e., 15-30 characters. Here: finer granularity to account for the smaller amount of text and potential recognition errors. OCR turned out to be a real problem in our experiments.
  13. Idea: only consider text matches that occur in roughly the same regions of the pictures. Other shapes and dynamic sizing of the shape, e.g., dependent on the length of the text fragment, are also possible. The normalization of the similarity score reflects the assumption that two images are less likely to be similar if their amount of textual content differs strongly.
  14. Process: determine bar heights (for details see the paper); sort bars by height in decreasing order; compute relative bar heights, i.e., $h_i / h_{\max}$; ratio hash = bar-wise difference. Only consider charts with the same number of bars (min. 3) to reduce computational effort; this can be changed.
  15. The first assumption is basically a heuristic filter for false positives, e.g., common images like logos that preprocessing failed to exclude (small distance to many images) or multiple versions of the same document in the collection. The second assumption ensures a distinct outlier group.
  16. Store absolute distance scores in ascending order of distance. Compute the list of relative distance deltas. Scan through the list of relative distance deltas to find a distance at least twice as large as its predecessor. Check that the sublist contains no more than 9 elements.
  17. Share the large majority of their visual content; exhibit minor differences introduced by i) removing non-essential content (e.g., numeric labels or watermarks), ii) cropping or padding, iii) performing affine transformations (e.g., scaling or rotation), iv) changing the resolution, contrast or color space. Especially iii) and iv) can be introduced inadvertently by extracting and reusing images from a PDF or printed document.
  18. typically reuse parts of an original image as near copies
  19. Typically reuse most or all of the visual components of the original image, yet rearrange the components. Gap between elements. Three instead of two elements in the last row.
  20. Typically redrawn versions of the source with changes made to the arrangement and/or visual appearance of image components: shape of the curves in main and enlarged view, placement of the enlarged view, brackets used.