ALLEA KWAN symposium Amsterdam 2011-12-14

Slides from the ALLEA Kwan symposium, Amsterdam 2011-12-14, about technical possibilities of detecting plagiarism - Comparative analysis of detection tools.

Published in: Education, Technology

  1. Technical possibilities of detecting plagiarism - Comparative analysis of detection tools. Katrin Köhler (B.Sc.). Plagiarism - legal, moral and educational aspects, Amsterdam, 2011-12-14. Slides based on Debora Weber-Wulff, edited by Katrin Köhler. Photo: flickr, cc by-nc, jobadge, 2011
  2. About me • Research assistant of Prof. Dr. Weber-Wulff since 2007 • Software test in 2008 and 2010 • Master's thesis about “Cryptographic Watermarking for Texts”
  3. Contents • Plagiarism Detection Test 2010 • Doctoral thesis of Karl-Theodor zu Guttenberg • Discovering plagiarism
  4. Teachers and administrations want a simple solution. Photo: Flickr cc-by-nc-sa: xtrarant, 2008. Art installation: Jamie Pawlus, Indianapolis, Indiana, 2003
  5. Many software companies are glad to help
  6. Plagiarism detection software • Can be extremely expensive! • Teachers want to have all papers marked original or plagiarism before they start reading them. • Students are afraid of wrongly being labeled plagiarists. • Only a teacher can decide if it is indeed plagiarism! Software cannot be used to solve social problems. • Prof. Dr. Weber-Wulff has tested plagiarism detection software 4.5 times: 2004, 2007, 2008, 2010 and zu Guttenberg’s thesis
  7. Test process 2010 • 9 months of work by 2 people • 42 test cases in English, German and Japanese • Different types of plagiarism, a few originals • Market survey • Access to the systems • 48 systems found, 26 could be completely evaluated
  8. Evaluation metric: Effectiveness • Plagiarism or not: What was found? • Total • Without the first 10 tests (Google accident) • English cases • Japanese cases as additional challenge ➡ No winner: results spread continuously between 55% and 64%. Photo: Flickr, cc-by, arthit, 2005
  9. Evaluation metric: Usability • Design, language consistency, navigation, labelling, print quality of the reports, fit with university processes • Support by email: Speed, good answers • Top: PlagScan, followed by PlagiarismFinder, Ephorus, PlagAware and TurnItIn. Photo: Flickr, cc-by, Quapan, 2008
  10. Evaluation metric: Professionalism • Street address with town, telephone number, name of a person • Domain registration in own name • No parallel offers of term papers or pornography or advertising for such services • German-speaking availability by telephone during German working hours • No installation of viruses ➡ PlagiarismFinder, followed by PlagAware, Strike Plagiarism, TurnItIn, Docoloc, PlagScan, Blackboard. Photo: Flickr, cc-by-sa, sludgegulper, 2008
  11. Problems: Effectiveness • Nothing found from books - not even if they are in Google Books! • We had one 100% plagiarism from Google Books register at less than 25% • Translations are not found
  12. Problems: Effectiveness • Umlauts cause problems, although less so than in earlier tests • Redacted texts are found less often • Many systems are very difficult to use • Not all companies are trustworthy • Some keep copies - and award themselves rights to use the text!
  13. Problems: Usability • Language mix • Workflow problems • The reports are generally not useful
  14. Problems: Professionalism • No info, no names • The address listed is a parking lot • Support questions not answered, nobody answers the telephone • Offer term papers or pornography in parallel, all rights given to the company
  15. How to rank? • No system was best in all of the metrics • We set up a ranking for each of the five criteria (three effectiveness, one usability, one professionalism) • Calculated the average ranking
  16. Results: Useful • There were no systems in this category - only humans are able to reach this level of effectiveness. Photo: Flickr, cc-by-nc, dianejp, 2009
  17. Results: Partially useful systems
  18. Partially useful: PlagAware • German system • Good documentation • Average effectiveness: 61% • But: each file must be submitted by itself (5 clicks!), this does not fit with the workflow • Looks for plagiarism in online texts
  19. PlagAware
  20. Partially useful: turnitin • Best results for material that is stored in their database • Translation problems • Umlaut problems • Returns Wikipedia copies with ads for porn • The source URLs reported are often no longer valid • Just adds up the percent values for the “originality” report • Only system to deal with Japanese properly
  21. turnitin Originality Report
  22. turnitin: How colorful!
  24. turnitin stores texts
  25. turnitin remembers for a long time
  27. Partially useful: Ephorus • Dutch system • Direct mail-in using Hand-In-Code • Reports by e-mail • Stores texts aggressively • Problems with umlauts
  28. Ephorus: Umlauts
  30. Partially useful: PlagScan • Newcomer from Germany • One purchases “PlagPoints” • Useful: Subaccounts for teachers • First place in usability • Three kinds of report, none of which is a side-by-side report • Only 60% in effectiveness
  31. PlagScan
  32. PlagScan - Report
  33. Partially useful: Urkund • Swedish system • Second in overall effectiveness • 13th in usability and professionalism • Language problems • Complex navigation • Catastrophic layout • Unusable reports • Cryptic error messages • Test cases from 2008 were still stored
  35. Urkund: Report
  36. Barely useful systems • They find something, but miss a lot • They are not really easy to use • They have professionalism problems • Docoloc, Copyscape, Blackboard/Safe Assign, Plagiarism Finder, Plagiarisma, Compilatio, StrikePlagiarism, The Plagiarism Checker
  37. Strange tales • checkforplagiarism.net • Viper. Photo: cc-by-sa, D. Weber-Wulff, 2009
  38. checkforplagiarism.net • In 2007 it was called iPlagiarismcheck.com • Was a plagiarism of turnitin, but they said: These are the sources! • Charges 15 € for 5 tests; students are the target group • turnitin set up a honeypot
  40. Viper • Is installed on a PC • In the terms of use: You give us irrevocable rights to use your text as we see fit • Also runs a paper mill • Complicated reports • Only 24% effectiveness - better to toss a coin! • Advertises in the UK by power cleaning the sidewalks
  41. Viper
  42. GuttenPlag: Collaborative documentation of plagiarism
  43. The extent of the plagiarism • 135 sources • 94% of pages • 63% of lines
  44. Test results • 38 of the 131 sources known at the time of the test were found by at least one of the systems • Many of these sources are no longer online • Sources found out of all known sources: iThenticate 30 (23%), PlagScan 19 (15%), Urkund 16 (12%), PlagAware 7 (5%), Ephorus 6 (5%)
  45. We tested these systems on zu Guttenberg's thesis • The usability for such large works was extremely poor • The numbers appear to be random • Many sources throw a 404 “file not found” error with iThenticate • Nothing from books (or the Bundestag) was found
  46. The major problem is: • They don’t find plagiarism! Just (marginally changed) copies of text - even properly referenced! Photo: Flickr, cc-by-nc, Leeks, 2006
  47. So let’s have a look ourselves ... • But doesn’t the thesis have to be available digitally? • And the thesis is so long? • And the Internet is extremely large? Photo: Flickr, cc-by-nc-nd, t_buchtele, 2009
  48. Suspicion • Upon careful reading you find it nicely written, but ... • The style is too polished, the vocabulary not that of your students. • There is some strange formatting • Interesting spelling errors • Lurching breaks in style. Photo: Flickr, cc-by, redcctshirt, 2009
  49. Searching with Google & Co • Phrase in "..." • 3-5 nouns • The typo • Check the second page of hits • Set a time limit. Photo: Flickr, cc-by-nc-nd, Athena1970, 2008
  50. Three words suffice!
  51. Really!
  52. Thank you! • Portal Plagiarism: http://plagiat.htw-berlin.de • Plagiarism blog: http://copy-shake-paste.blogspot.com/ • Homepage: http://www.f4.htw-berlin.de/~weberwu/ • Contact: katrin.koehler@student.htw-berlin.de. Photo: c. 2011, Axel Völcker, DerWedding.de
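
The ranking procedure on slide 15 (one ranking per criterion, then the average of those rankings) can be illustrated with a short Python sketch. This is not the test team's actual tooling: the system names A, B and C, all scores, the criterion labels, and the tie handling (tied systems share the better rank) are assumptions made only for illustration.

```python
from statistics import mean


def rank_systems(scores):
    """Rank systems for one criterion; the highest score gets rank 1.

    Tied systems share the best rank of their group (an assumption,
    the slides do not say how ties were handled).
    """
    ordered = sorted(scores.values(), reverse=True)
    return {name: ordered.index(value) + 1 for name, value in scores.items()}


def average_ranking(criteria):
    """Average each system's rank over all criteria, best system first."""
    per_criterion = [rank_systems(scores) for scores in criteria.values()]
    systems = list(per_criterion[0])
    averages = {
        name: mean([ranks[name] for ranks in per_criterion]) for name in systems
    }
    return sorted(averages.items(), key=lambda item: item[1])


# Made-up scores for three hypothetical systems; higher is better.
# Five criteria: three for effectiveness, one usability, one professionalism.
criteria = {
    "effectiveness_total": {"A": 61, "B": 64, "C": 55},
    "effectiveness_without_first_10": {"A": 60, "B": 62, "C": 58},
    "effectiveness_english": {"A": 63, "B": 59, "C": 57},
    "usability": {"A": 70, "B": 50, "C": 90},
    "professionalism": {"A": 80, "B": 40, "C": 60},
}

for name, avg in average_ranking(criteria):
    print(f"{name}: average rank {avg}")
```

In this made-up data, system B leads on two effectiveness criteria but still ends up behind system A on average rank, which mirrors the slide's point that no system was best in all of the metrics.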
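
The manual search tip from slides 49 and 50 (put a short phrase of three to five words in quotation marks) can be sketched as a small Python helper that builds an exact-phrase search URL. The function name `phrase_query`, the strict three-to-five-word check, and the example phrase are assumptions for illustration; the slides describe typing such phrases into a search engine by hand, not any automated tool.

```python
from urllib.parse import urlencode


def phrase_query(words, base_url="https://www.google.com/search"):
    """Build a search URL for an exact phrase of three to five words.

    The phrase is wrapped in double quotes so the search engine
    looks for the words in exactly this order.
    """
    if not 3 <= len(words) <= 5:
        raise ValueError("use three to five words from the suspicious passage")
    phrase = " ".join(words)
    return base_url + "?" + urlencode({"q": f'"{phrase}"'})


# Hypothetical phrase picked from a suspicious passage.
print(phrase_query(["lurching", "breaks", "style"]))
```

The other tips on the slide (searching for the typo, checking the second page of hits, setting a time limit) remain manual steps and are not covered by this sketch.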