The document discusses plagiarism and methods for detecting it. It defines plagiarism as passing off another's work as one's own and lists several types, including directly copying text, paraphrasing from one or multiple sources without proper citation, and borrowing from one's own previous work. It then describes an algorithmic approach to detecting plagiarism by comparing documents and text segments at the document, paragraph, and sentence levels using thresholds and word similarity techniques. WordNet and the Lesk algorithm are also referenced for analyzing word meanings and signatures to identify copied text. The document concludes by listing members of an NLP team and mentioning a demo.
2. What is Plagiarism?
• To steal and pass off (the ideas or words of another) as one's own
• To use (another's production) without crediting the source
• To commit literary theft
• To present as new and original an idea or product derived from an existing source
• Not just copying or borrowing
3. Types of Plagiarism
CLONE
Submitting another’s work, word-for-word, as one’s own
CTRL-C
Contains significant portions of text from a single source without alterations
FIND - REPLACE
Changing key words and phrases but retaining the essential content of the source
REMIX
Paraphrases from multiple sources, made to fit together
RECYCLE
Borrows generously from the writer’s previous work without citation
HYBRID
Combines perfectly cited sources with copied passages without citation
MASHUP
Mixes copied material from multiple sources
404 ERROR
Includes citations to non-existent or inaccurate information about
AGGREGATOR
Includes proper citation to sources but the paper contains almost no
RE-TWEET
Includes proper citation, but relies too closely on the text’s original wording
6. How to do it practically
Document 1
• A document is a written, drawn,
presented or recorded representation
of thoughts. Originating from the
Latin Documentum meaning lesson -
the verb doceō means to teach, and is
pronounced similarly, in the past it
was usually used as a term for a
written proof used as evidence. In
the computer age, a document is
usually used to describe a primarily
textual file, along with its structure
and design, such as fonts, colors and
additional images.
Document 2
• A document is a written, drawn,
presented or recorded representation
of thoughts. Originating from the
Latin Documentum meaning lesson -
the verb doceō means to teach, and is
pronounced similarly, in the past it
was usually used as a term for a
written proof used as evidence. In
the computer age, a document is
usually used to describe a primarily
textual file, along with its structure
and design, such as fonts, colors and
additional images.
Threshold
8. Two input documents
• Input : DocA, DocB // Two input documents
• Output: similarity
• Begin
• DocMinSize = min (|DocA|, |DocB|)
• DocIntersectionSize = |DocA ∩ DocB|
• If (DocIntersectionSize >= DocMinSize*DocThreshold)
• Then
• //Possible similarity
• //Check similarity at paragraph level
• similarity = true
• Else
• similarity = false
• End
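The document-level check above can be sketched in Python. This is a minimal illustration, not the presenters' implementation: it assumes each document is reduced to a set of lowercased word tokens, and the names `tokens`, `possibly_similar` and the default `threshold=0.5` are hypothetical choices made for this example.

```python
import re

def tokens(text):
    """Reduce a text to a set of lowercased word tokens (an assumption of this sketch)."""
    return set(re.findall(r"\w+", text.lower()))

def possibly_similar(text_a, text_b, threshold=0.5):
    """Return True when the shared vocabulary covers at least
    `threshold` of the smaller text's vocabulary."""
    set_a, set_b = tokens(text_a), tokens(text_b)
    min_size = min(len(set_a), len(set_b))      # DocMinSize
    if min_size == 0:
        return False                            # an empty text cannot match anything
    intersection_size = len(set_a & set_b)      # DocIntersectionSize
    return intersection_size >= min_size * threshold

print(possibly_similar("A document is a written record",
                       "A document is a drawn record"))
```

A `True` result only means the pair deserves a closer look at paragraph level; it is a filter, not a verdict.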
9. Two input paragraphs
• Input : ParA, ParB // Two input paragraphs
• Output: similarity
• Begin
• ParMinSize = min (|ParA|, |ParB|)
• ParIntersectionSize = |ParA ∩ ParB|
• If (ParIntersectionSize >= ParMinSize*ParThreshold)
• Then
• //Possible similarity
• //Check similarity at sentence level
• similarity = true
• Else
• similarity = false
• End
10. Sentence level
• Algorithm 3: Sentence level heuristic
• Input : SenA, SenB
• Output: similarity, similar substrings in SenA and SenB
• Begin
• SenMinSize = min(|SenA|, |SenB|)
• SenIntersectionSize = |SenA ∩ SenB|
• If (SenIntersectionSize >= SenMinSize*SenThreshold)
• Then
• //Similarity detected
• //Determine similar substrings
• similarity = true
• Else
• similarity = false
• End
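The three heuristics chain naturally: a pair of documents is examined at paragraph level only if it passes the document check, and at sentence level only if the paragraphs pass. The sketch below shows one way to wire them together. It assumes word-set overlap at every level, paragraphs separated by blank lines, and sentences ending in `.`, `!` or `?`. All function names and threshold defaults are hypothetical, and it returns matching sentence pairs rather than the similar substrings mentioned in Algorithm 3.

```python
import re

def words(text):
    """Set of lowercased word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def overlaps(a, b, threshold):
    """Shared heuristic: |A ∩ B| >= min(|A|, |B|) * threshold."""
    set_a, set_b = words(a), words(b)
    smallest = min(len(set_a), len(set_b))
    return smallest > 0 and len(set_a & set_b) >= smallest * threshold

def split_paragraphs(doc):
    # Assumption: paragraphs are separated by blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]

def split_sentences(par):
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", par) if s.strip()]

def find_similar_sentences(doc_a, doc_b, doc_t=0.3, par_t=0.5, sen_t=0.8):
    """Cascade document -> paragraph -> sentence and collect matching sentence pairs."""
    matches = []
    if not overlaps(doc_a, doc_b, doc_t):
        return matches                      # documents too different, stop early
    for par_a in split_paragraphs(doc_a):
        for par_b in split_paragraphs(doc_b):
            if not overlaps(par_a, par_b, par_t):
                continue                    # skip unrelated paragraph pairs
            for sen_a in split_sentences(par_a):
                for sen_b in split_sentences(par_b):
                    if overlaps(sen_a, sen_b, sen_t):
                        matches.append((sen_a, sen_b))
    return matches
```

The early exits are the point of the cascade: the expensive sentence-by-sentence comparison runs only on paragraph pairs that already share enough vocabulary.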
11. WordNet
• A very large lexical database of English:
– 117K nouns, 11K verbs, 22K adjectives, 4.5K adverbs
• Word senses grouped into synonym sets (“synsets”) linked into a conceptual-semantic hierarchy
– 82K noun synsets, 13K verb synsets, 18K adjective synsets, 3.6K adverb synsets
– Avg. # of senses: 1.23/noun, 2.16/verb, 1.41/adj, 1.24/adverb
• Conceptual-semantic relations
– hypernym/hyponym
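Synsets matter for plagiarism detection because a FIND - REPLACE rewrite swaps words for synonyms, which defeats exact word matching. The sketch below illustrates the idea with a tiny hand-made lexicon. `TOY_SYNSETS` is a hypothetical stand-in for WordNet, not real WordNet data, and the function names are invented for this example. A real system would query WordNet itself, for instance through NLTK's WordNet corpus reader.

```python
# Hypothetical toy lexicon: each entry is one synset (a set of interchangeable words).
TOY_SYNSETS = [
    {"car", "automobile", "auto"},
    {"big", "large"},
    {"buy", "purchase"},
]

def synset_ids(word):
    """Indices of every toy synset that contains the word."""
    return {i for i, synset in enumerate(TOY_SYNSETS) if word in synset}

def same_concept(a, b):
    """Two words match if they are identical or share at least one synset."""
    return a == b or bool(synset_ids(a) & synset_ids(b))

def concept_overlap(words_a, words_b):
    """Count the words in words_a that match some word in words_b by concept."""
    return sum(any(same_concept(a, b) for b in words_b) for a in words_a)

original = ["buy", "large", "car"]
rewritten = ["purchase", "big", "automobile"]
print(concept_overlap(original, rewritten))  # exact matching would find no shared words
```

Replacing the plain set intersection in the threshold heuristics with a concept-level count like this is one possible way to catch synonym substitution.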
12. Lesk algorithm
Compare the context with the dictionary definition of the sense
– Construct the signature of a word in context by the signatures of its senses in the dictionary
• Signature = set of context words (in examples/gloss or in context)
– Assign the dictionary sense whose gloss and examples are the most similar to the context in which the word occurs
• Similarity = size of intersection of context signature and sense signature
13. Sense signatures
------- bank1
Gloss: a financial institution that accepts deposits and channels the money into lending activities
Examples: “he cashed the check at the bank”, “that bank holds the mortgage on my home”
------- bank2
Gloss: sloping land (especially the slope beside a body of water)
Examples: “they pulled the canoe up on the bank”, “he sat on the bank of the river and watched the current”
Signature(bank1) = {financial, institution, accept, deposit, channel, money, lend, activity, cash, check, hold, mortgage, home}
Signature(bank2) = {slope, land, body, water, pull, canoe, sit, river, watch, current}
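The sense-selection step can be tried directly on the two "bank" signatures. The following is a minimal sketch of a simplified Lesk procedure: it scores each sense by the size of the intersection between the context words and the sense signature. It assumes the context has already been tokenized and lemmatized (so "sat" arrives as "sit"). The name `simplified_lesk` and the tie-breaking rule (first sense listed wins) are choices made for this example.

```python
# Sense signatures copied from the slide (content words, lemmatized).
SIGNATURES = {
    "bank1": {"financial", "institution", "accept", "deposit", "channel", "money",
              "lend", "activity", "cash", "check", "hold", "mortgage", "home"},
    "bank2": {"slope", "land", "body", "water", "pull", "canoe", "sit",
              "river", "watch", "current"},
}

def simplified_lesk(context_words, signatures):
    """Pick the sense whose signature shares the most words with the context.

    Returns (best_sense, scores). Ties go to the sense listed first.
    """
    context = set(context_words)
    scores = {sense: len(context & signature)
              for sense, signature in signatures.items()}
    best_sense = max(scores, key=scores.get)
    return best_sense, scores

# Context assumed to be lemmatized already: "he sat on the bank of the river"
sense, scores = simplified_lesk(
    ["he", "sit", "on", "the", "bank", "of", "the", "river"], SIGNATURES)
print(sense, scores)  # bank2 {'bank1': 0, 'bank2': 2}
```

Here "sit" and "river" both appear in the bank2 signature and nothing overlaps with bank1, so the river-side sense is selected.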