How many people know who Sergey Brin and Larry Page are? For those of you who didn’t raise your hands, they are the founders of Google. Did you know that some of Sergey’s original research was on a plagiarism detection system he wrote with his collaborators? This so-called ‘cat and mouse game’ is commonplace. Rules are meant to be broken. For instance, people who like to drive fast will buy radar detectors instead of abiding by the speed limit. At Turnitin.com we find the same to be true of plagiarism detection: students will find or develop new ways to circumvent the system. This talk explores some of the new counter-detection methods being used and what we are doing to counter them. The details are very technical, so I’m going to stay away from them so as not to bore you to death.
First I would like to give a quick survey of the different methods being employed today to avoid detection. None of these are new per se. However, new digital tools are accelerating their use just as the digital authoring of documents, email and the many other modes of document sharing, and, most importantly, the internet made plagiarism a pandemic. Then I’ll switch gears and get a little more technical on you to discuss the finer points of paraphrasing and its comparison to textual entailment.
I believe the most common method of plagiarizing remains copying text from one or more sources, with the ‘author’ editing the text to sew the pieces into their paper. Along these lines, you might not be surprised that Wikipedia is the number one source of internet matches we find in the Turnitin service, whereas peer collusion accounts for the largest number of matches overall.
Translated plagiarism, or the repurposing of content translated from a foreign language, isn’t a new phenomenon.
However, growing anecdotal evidence suggests that students and researchers are using this method of plagiarizing to avoid the current detection technologies. In the US, the growing population of foreign students and the wide availability of machine translation technologies, e.g., Google Translate, are thought to be contributing to the rise of this phenomenon.
Tools that paraphrase for you have grown in abundance and sophistication. A simple search for ‘article spinner’ turns up 1.3 million results from Google and a large number of ads for companies promoting their services. This sort of service also goes by other names, such as synonymizers.
Typically these services are aimed at online marketers who want to produce many versions of a document that search engines will treat as unique, artificially promoting their site by increasing the number of backlinks to it. However, they are equally effective at rewriting a student paper.
Having computers understand how similar two sentences are to one another is a rich area of academic and corporate research. The utility of this technology is widespread: everything from a question-and-answer system like Wolfram Alpha being able to respond to a query the same way despite the multitude of ways you can phrase your question, to Google being able to understand the relatedness of web pages at the phrase or sentence level instead of just at a ‘bag of words’ level.
Now I would like to outline the myriad ways that an author can paraphrase. Although the details of each method aren’t so important, on the whole they demonstrate how rich languages are and, to a lot of people’s surprise, how unique writing is. There are many, many ways to deliver the same information through writing. Initial research is focused on English, but the algorithmic framework is being generalized to work with all languages.
Hopefully, by now you can see how hard a natural language processing problem detecting paraphrases is.
I won’t go into detail regarding the algorithms and methods we are exploring, but I will highlight some of the tools of the trade. I would also like to point out how different creating production-quality code that can deal with enormous scale is from prototyping a solution. Most solutions simply can’t scale to processing hundreds of thousands of documents against a collection of tens of billions of documents. Microsoft was gracious enough to produce a paraphrase test corpus consisting of 5800 sentence pairs, of which 3900 were considered “semantically equivalent” by two human raters. What is interesting about this is that the two raters agreed on 83% of the sentence pairs, while the remaining 17% required a third rater to break the stalemate. This elucidates another issue with paraphrase detection: there is a certain level of subjectivity in ascertaining whether two sentences are equivalent or not. The same situation holds in plagiarism detection. When it comes to small matches, one person’s plagiarism is another person’s noise.
The textbook definition of lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. The difference between lemmatization and stemming is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. http://en.wikipedia.org/wiki/Lemmatisation
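To make the distinction concrete, here is a minimal sketch. The suffix list and the lemma table are tiny, made-up stand-ins (a real system would use a full stemmer and a lexical database), and the function names are hypothetical. The point is only that the stemmer sees a bare word, while the lemmatizer also receives a part of speech.

```python
# Toy illustration only: the suffix list and lemma table below are assumptions,
# not the rules of any real stemmer or lexical database.

SUFFIXES = ("ing", "ed", "s")  # deliberately tiny suffix list


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, with no knowledge of context."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma table keyed by (word, part of speech).
LEMMAS = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("is", "VERB"): "be",
    ("are", "VERB"): "be",
    ("cars", "NOUN"): "car",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the dictionary form for a word given its part of speech."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)
```

With these toy rules, `toy_stem("meeting")` always gives `meet`, whereas `toy_lemmatize("meeting", "NOUN")` keeps `meeting` and `toy_lemmatize("meeting", "VERB")` gives `meet`.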
Turnitin currently does a certain level of paraphrase and textual entailment matching. To that end we’ve spent a lot of time adjusting the algorithms so that they are effective at matching text while not producing ‘noisy’ or spurious matches. This is one of the hard problems that we are trying to solve. If done incorrectly, paraphrased or textually entailed matches, which allow for a much higher degree of change, could swamp a report with spurious matches.
One way to visualize this is to compare the sentences of a document against the document itself. A document typically discusses a particular topic and has a consistent voice. For this presentation, I took a news article about Google’s new Buzz messaging service and compared its sentences against one another. In this example you can see that the sentences are somewhat semantically related but wouldn’t be deemed paraphrases of each other.
So what if I changed each sentence so that they became more similar? At what point do they become semantically equivalent enough to go from a false positive to a true positive? Although maybe not the best example, I think it illustrates the pitfalls of developing such a system. I believe the answer has to do with context, parts of speech, and the importance of the entailed words. Generalizing this type of AI model is very difficult because it is easy to overfit the model to the training data.
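A rough sketch of that kind of self-comparison follows. It assumes a plain word-set (Jaccard) similarity, which is a simplification chosen for illustration and not the measure used in the talk; the function names and the threshold parameter are hypothetical.

```python
from __future__ import annotations

import re


def tokens(sentence: str) -> set[str]:
    """Lower-case a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared words divided by all distinct words in the two sentences."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def self_similarity(sentences: list[str]) -> list[list[float]]:
    """Compare every sentence of one document against every other sentence."""
    sets = [tokens(s) for s in sentences]
    return [[jaccard(x, y) for y in sets] for x in sets]


def related_pairs(sentences: list[str], threshold: float) -> list[tuple[int, int, float]]:
    """List sentence pairs (i < j) whose similarity reaches the threshold."""
    matrix = self_similarity(sentences)
    return [
        (i, j, matrix[i][j])
        for i in range(len(sentences))
        for j in range(i + 1, len(sentences))
        if matrix[i][j] >= threshold
    ]
```

Sentences about the same topic will share vocabulary and score above zero without being paraphrases, which is why the choice of threshold matters so much.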
Translated plagiarism is a particular type of cross-lingual information retrieval aimed at finding similar documents across languages.
The first step in offering a translated plagiarism detection service is to find a partner that offers machine translation. Language Weaver currently offers 37 language pairs. Although Google and others offer free translation services, their use is contractually limited and not bound by service level agreements. Furthermore, we feel the quality of Language Weaver’s technology is at the moment superior to the ‘free’ services.
Inter- and intra-language matches will be displayed together. We are considering offering a confidence level of the inter-language match.
Translated and Paraphrased Plagiarism
The cat and mouse game continues...
One man’s rigor is another man’s mortis. - CF Bohren and DR Huffman, 1983
Overview
• The ever-changing counter-detection landscape
• Paraphrasing versus textual entailment
• Ways to paraphrase
• Tools of the trade
Many Roads to Plagiarism
The ‘old fashioned’ way
Many Roads to Plagiarism
Paraphrased plagiarism
Jones, Michael. “Back-translation: the latest form of plagiarism.” University of Wollongong, Australia. 4th Asia Pacific Conference on Educational Integrity (4APCEI), 28–30 September 2009.
English original: Paraphrased plagiarism is not new either. However, there are new tools to aid in automatically paraphrasing text which are accelerating this form of detection avoidance.
French translation: Paraphrase plagiat n’est pas nouveau non plus. Toutefois, il existe de nouveaux outils pour l’aide dans le texte paraphrase automatiquement qui sont l’accélération de cette forme d’évasion de détection.
Back-translation into English: Paraphrase plagiarism is not new either. However, there are new tools to help in paraphrasing the text automatically, which are accelerating this form of escape detection.
Paraphrasing vs Textual Entailment
Two sentences are paraphrased if they “mean the same thing”:
1) Similarity: they share a substantial amount of information
2) Dissimilarities are extraneous: if extra information in the sentences exists, the effect of its removal is not significant.
Paraphrasing vs Textual Entailment
A paraphrase is a special case of textual entailment. A paraphrase is reflexive whereas textual entailment indicates that two sentences overlap to a degree with one sentence being subsumed by the other.
Ways to Paraphrase
Lexical substitution/synonymy
• Hyponym/synonym/hypernym replacement: article, paper or red, crimson
• Acronym replacement: Mr., mister
• Contractions: do not, don’t
• Compounding/decompounding: ballgame, ball game
• Numeric/alphabetic numbers: 11, eleven; 12/1/2010, December first two-thousand-ten
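Several of these substitutions can be neutralized by mapping each variant to one canonical form before comparing text. The sketch below assumes small hand-written lookup tables, all of them hypothetical; real coverage would need far larger resources, and dates such as 12/1/2010 are not handled here.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "isn't": "is not"}
ABBREVIATIONS = {"mr.": "mister", "dr.": "doctor"}
COMPOUNDS = {"ball game": "ballgame"}
NUMBER_WORDS = {"three": "3", "eleven": "11", "twelve": "12"}


def normalize(text: str) -> str:
    """Map surface variants to one canonical form so they compare as equal."""
    text = text.lower().replace("\u2019", "'")  # unify curly apostrophes
    # Phrase-level replacements run first, while punctuation is still present.
    for table in (CONTRACTIONS, ABBREVIATIONS, COMPOUNDS):
        for variant, canonical in table.items():
            text = text.replace(variant, canonical)
    # Tokenize, drop punctuation, and convert spelled-out numbers to digits.
    words = re.findall(r"[a-z0-9]+", text)
    return " ".join(NUMBER_WORDS.get(word, word) for word in words)
```

Under these tables, "Mr. Jones and I don’t miss a ballgame; we saw 11." and "Mister Jones and I do not miss a ball game; we saw eleven." normalize to the same string.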
Ways to Paraphrase
• Active and passive exchange
  The gangster killed 3 innocent people. vs Three innocent people are killed by the gangster.
• Re-ordering of sentence components
  Tuesday they met vs They met Tuesday
Zhou, Ming and Niu, Cheng. “Principled Approach to Paraphrasing.” U.S. Patent 11,934,010. 1 Nov 2007.
Ways to Paraphrase
• Realization in different syntactic components
  Palestinian leader Arafat vs Arafat, Palestinian leader
• Prepositional phrase attachment
  The Alabama plant vs A plant in Alabama
Ways to Paraphrase
• Change into different sentence types
  Who drew this picture? vs Tell me who drew this picture.
• Morphological derivation
  He is a good teacher. vs He teaches well. vs He is good at teaching.
Ways to Paraphrase
• Light verb construction
  The film impressed him. vs The film made an impression on him.
• Comparatives vs. superlatives
  He is smarter than everyone else. vs He is the smartest one.
Ways to Paraphrase
• Converse word substitution
  John is Mary’s husband. vs Mary is John’s wife.
• Verb nominalization
  He wrote the book. vs He was the author of the book.
Ways to Paraphrase
• Substitution using words with overlapping meanings
  Bob excels at mathematics. vs Bob studies mathematics well.
• Inference
  He died of cancer. vs Cancer killed him.
Ways to Paraphrase
• Different semantic role realization
  He enjoyed the game. vs The game pleased him.
• Subordinate clauses vs separate sentences linked by anaphoric pronouns
  The tree healed its wounds by growing new bark. vs The tree healed its wounds. It grew new bark.
Tools of the Trade
• Microsoft paraphrase corpus: used to test algorithms
• WordNet: English only :( Synonyms, hypernyms, hyponyms, and antonyms.
• Algorithms: Finite State Transducers (FSTs) and/or iterative Longest Common Subsequence (LCS) on sets.
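For readers unfamiliar with LCS, the following is a minimal sketch of the textbook longest common subsequence computation applied to word tokens. It is the standard dynamic-programming version, not the iterative, set-based variant the slide mentions, whose details are not given here.

```python
from __future__ import annotations


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    # prev[j] holds the LCS length of the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For example, "the film impressed him" and "the film made an impression on him" share the ordered words the, film, him, so the LCS length is 3. Because LCS respects word order, a pure re-ordering such as "tuesday they met" vs "they met tuesday" scores only 2 out of 3.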
Tools of the Trade
Stemming or lemmatization
am, are, is → be
car, cars, car’s, cars’ → car
Word Alignment Examples
According to the MS paraphrase corpus:
• This is a paraphrase: 12/14 = 86%, 12/16 = 75%
• Not paraphrased: 18/19 = 95%, 18/26 = 69% (However, the first sentence is textually entailed by the second. Turnitin would currently match this.)
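Each pair of percentages is the number of aligned words divided by the length of each sentence in turn. The sketch below reproduces that arithmetic and shows one way the two ratios could separate a paraphrase from an entailment. Both cut-off values are assumptions picked for illustration, not values used by Turnitin, and choosing them is exactly the hard part.

```python
from __future__ import annotations

# Assumed cut-offs, for illustration only.
PARAPHRASE_MIN = 0.70  # both sentences must be covered at least this much
ENTAILMENT_MIN = 0.90  # one sentence must be almost fully covered


def coverage(aligned: int, len_a: int, len_b: int) -> tuple[float, float]:
    """Fraction of each sentence's words that were aligned to the other."""
    return aligned / len_a, aligned / len_b


def label(aligned: int, len_a: int, len_b: int) -> str:
    """Classify a sentence pair from its two coverage ratios."""
    cov_a, cov_b = coverage(aligned, len_a, len_b)
    if min(cov_a, cov_b) >= PARAPHRASE_MIN:
        return "paraphrase"  # high coverage in both directions
    if max(cov_a, cov_b) >= ENTAILMENT_MIN:
        return "entailment"  # one sentence is subsumed by the other
    return "no match"
```

With these assumed cut-offs, the first example (12 aligned words, sentence lengths 14 and 16) comes out as a paraphrase, and the second (18 aligned words, lengths 19 and 26) as an entailment, in line with the slide.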
Slippery Slope
When does textual entailment become arbitrary/noise?
14/48 = 29%, 14/34 = 41%
Slippery Slope
When does textual entailment become arbitrary/noise?
13/24 = 54%, 13/21 = 62%
Translated Plagiarism
Non-English markets, in particular, are concerned about their English-as-a-second-language students submitting English documents that have been translated to their native language.
Translated Plagiarism
Initial approach:
• Non-English documents searched as they are now
• Additional search performed: translate document to English, search English documents, and then display English matches with translations (or vice versa)
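The two-pass flow described on this slide can be sketched as follows. Everything here is a stand-in: the word-overlap search, the in-memory indexes, and the word-for-word glossary "translator" are hypothetical placeholders for a real search backend and a real machine translation service.

```python
from __future__ import annotations

import re
from typing import Callable


def words(text: str) -> set[str]:
    """Lower-cased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def search(text: str, index: dict[str, str], min_overlap: float) -> list[str]:
    """Return ids of indexed documents containing enough of the query's words."""
    query = words(text)
    if not query:
        return []
    return [
        doc_id
        for doc_id, body in index.items()
        if len(query & words(body)) / len(query) >= min_overlap
    ]


def find_matches(
    document: str,
    translate: Callable[[str], str],
    native_index: dict[str, str],
    english_index: dict[str, str],
    min_overlap: float = 0.5,
) -> dict[str, list[str]]:
    """Search the document as submitted, then again after translating it."""
    return {
        "intra_language": search(document, native_index, min_overlap),
        "inter_language": search(translate(document), english_index, min_overlap),
    }


# Toy word-for-word glossary standing in for a machine translation service.
GLOSSARY = {"das": "the", "deutsche": "german", "wort": "word",
            "ist": "is", "ein": "a", "lehnwort": "loanword"}


def toy_translate(text: str) -> str:
    """Replace each known word with its glossary entry; keep unknown words."""
    return " ".join(GLOSSARY.get(w, w) for w in re.findall(r"\w+", text.lower()))
```

Keeping the two result lists separate makes it straightforward to display inter- and intra-language matches together, and to attach a separate confidence to the translated ones later.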
Translated Plagiarism: Need for Paraphrasing?
Machines and humans translate text in many different ways. Paraphrase detection allows us to match the variations.
Google translate: The zeitgeist is thinking and feeling one age. The term describes the characteristics of a particular period, or an attempt to remind us it. The German word Zeitgeist is transferred through English as a loanword into numerous other languages been.
Bing translate: Zeitgeist is thinking and feeling how an age. Is the nature of a particular era or trying to understand them. The German word Zeitgeist is taken from English as a loanword in many other languages.
http://de.wikipedia.org/wiki/Zeitgeist