SlideShare a Scribd company logo
1 of 20
Anti-plagiarism tools for our
repositories
Jan Mach
University of Economics, Prague
Charles University in Prague

23. 10. 2013 Seminar on providing access to grey literature
1.

clone – presenting another’s work, word-for-word, as one’s own

2.

CTRL-C – presenting another’s work as one’s own, with minimum changes

3.

find/replace – changing key words and phrases but retaining the essential
content of the source

4.

remix – paraphrasing from several sources into a single text

6.

hybrid – mixing perfectly cited sources with non-cited ones

7.

mashup – combining several non-cited sources into a text

8.

error 404 – citations to non-existing sources or incorrect information about
a source

9.

aggregator – correct citation of foreign sources, but practically without any
personal input by the author

10.

re-tweet – correct citation, but using the original text/structure without significant
changes

What is plagiarism? without citation
5.
recycle – using an author’s previous texts,

The Plagiarism Spectrum: Tagging 10 Types of Unoriginal Work
10 types of source from which
students copy
a total of 50 documents, a
sentence and paragraph from
each

300 records – fragments of
text using various changes to
the copied sentences
Transformations used
• a sentence with two words transposed,
• a sentence with diacritics removed,
• a sentence with a single word replaced with another
with a similar meaning - paraphrasing a word,
• a sentence with several words replaced with others
with similar meanings - paraphrasing a sentence,
• a sentence machine-translated into Czech/English
Hypotheses checked
1.

The application is able to detect a single sentence copied from a source document.

2.

Hypotéza Thesis
Turnitin Ephorus GooglePl. Průměr
The application is able to detect a single paragraph copied from a source document. The application
is not affected by potential line breaks, indexes etc. in the source or tested document. 28%
1
12%
40%
2%
56%

3.

Successful detection is not impacted if the plagiarist adds/removes a word in the copied sentence.

4.
5.
6.
7.
8.
9.
10.
11.

2
14%
42%
6%
46%
27%
The application can detect Czech texts irrespective of the use of diacritics.
3
100%
100%
0%
0%
50%
Successful detection is not impacted if the plagiarist paraphrases a single word in a sentence.
4
100%
100%
0%
80%
70%
Successful detection is not impacted if the plagiarist paraphrases a whole sentence.
5
67%
100%
0%
4%
43%
Successful detection is not impacted if the plagiarist translates text from/to Czech.
6
0%
88%
na
0%
29%
The Theses.cz system should achieve the best results in the detection of
7
0%
0%
0%
0%
plagiarism in Czech theses and dissertations. 0%
A low percentage of the total number of similarities will be detected from the Anopress source
8
10%
50%
10%
30%
25%
compared to sources freely available on the internet.
9
0%
0%
0%
0%
0%
Better results with EIR and Open Access sources are achieved by foreign tools rather than Czech
10
0%
40%
0%
70%
28%
ones.
50%
0%
80%
38%
Very good results for11 sources20%be achieved by systems using web search services.
web
will
TURNITIN
ABOUT THE APPLICATION

• 15 language variants
without Czech
• large database of texts
• price based on number of
students, hundreds of
thousands of crowns
• integration with MOODLE
system and others, no API
• GradeMark and PeerMark
modules

EVALUATION OF SIMILARITIES

• processing within 30 s
• configurable size of
searched similarities,
possibility to exclude
citations
• very clear and functional
interface with similarities,
association of sources
EPHORUS
ABOUT THE APPLICATION

• the application is used by
over 3,000 schools and
universities, 4 schools in
the CR (Faculty of Business
Administration at the
University of Economics,
Prague)
• interface in Czech
• operator claims a database
with billions of web pages,
submitted works, journal
texts etc.

EVALUATION OF SIMILARITIES

• possible to define a min.
similarity percentage
• results sent by e-mail,
attachments in PDF
• basic web interface
• no source de-duplication
MUNI SYSTEMS
ABOUT THE APPLICATION

EVALUATION OF SIMILARITIES

• theses.cz, odevzdej.cz
and repozitar.cz
• over 30 public and private
schools in the CR and SR
• price per number of
students
• extensive database of
Czech theses and
dissertations, study
materials and selected web
pages
• API for connection

• processing takes some hours
• duplicate documents
• comparing pairs of
documents
→ two lists of similarities
• no overall percentage

of detected similarities
• similarities displayed
only from 5 % of the length

of one of the compared
document in a pair
The first list contains documents with
similarity of
min. 5 % of the inspected file.
bachelor paper with 40 pages: 2 pages

The second list complements the
previous list with other documents,
but only with the length of similarility
min. 5 % of the found file.
GooglePlagiarism
ABOUT THE APPLICATION

• my own desktop
application for personal
computers running
Windows
• intended for personal
analysis of documents
by an individual
• searching for whole
sentences in the Google
search engine

EVALUATION OF SIMILARITIES

• limited number of
searches→ processing
takes several hours
• HTML output without
retaining formatting
• highlighted detected
sentences and the first
corresponding source
Without retention of size and line
breaks, navigation in a text during
checks is much more difficult.
Evaluation of system control and
function
Evaluation
processing time
clarity of results
display of overall similarity
minimum similarity
price
integration with school IR's
de-duplications of sources

Thesis

Turnitin

Ephorus

GooglePl.

The Thesis.cz system stands
out thanks to its low price
and possibility of repository
integration.

The Turnitin application
excels through its user
interface and available
functions, but is expensive
and not easy to integrate.
The Ephorus system would
be a good compromise
between Thesis and Turnitin,
yet …
Number of documents detected
according to source
Category
wikipedia.cz
wikipedia.org (en)
ETDs (cz)
ETDs (en)
NDLTD
Anopress
Arxive.org
Google.cz (cz)
Google.com (en)
el. inf. Resources
total

Corpus
5
5
5
5
5
5
5
5
5
5
50

Thesis
3
1
1
0
0
0
0
2
0
0
7

Turnitin
5
3
2
3
0
0
1
3
2
3
22

Ephorus
2
2
1
0
0
0
0
0
0
0
5

GooglePl.
5
5
1
2
1
0
3
5
3
4
29

average
3,75
2,75
1,25
1,25
0,25
0
1
2,5
1,25
1,75
15,75

Category
wikipedia.cz
wikipedia.org (en)
ETDs (cz)
ETDs (en)
NDLTD
Anopress
Arxive.org
Google.cz (cz)
Google.com (en)
el. inf. Resources
average

Corpus
100%
100%
100%
100%
100%
100%
100%
100%
100%
100%
100%

Thesis
60%
20%
20%
0%
0%
0%
0%
40%
0%
0%
14%

Turnitin
100%
60%
40%
60%
0%
0%
20%
60%
40%
60%
44%

Ephorus
40%
40%
20%
0%
0%
0%
0%
0%
0%
0%
10%

GooglePl.
100%
100%
20%
40%
20%
0%
60%
100%
60%
80%
58%

average
75%
55%
25%
25%
5%
0%
20%
50%
25%
35%
32%

Low number of documents
found using the Ephorus
system.
Documents from Anopress
were not detected by any
system.

Most documents were
detected by the Turnitin and
GooglePlagiarism systems.
Number of documents detected
according to document language
Language
Czech
English
Slovak
total
Language
Czech
English
Slovak

Corpus
19
30
1
50
Corpus
100%
100%
100%

Thesis
6
1
0
7
Thesis
32%
3%
0%

Turnitin
10
12
0
22
Turnitin
53%
40%
0%

Ephorus
3
2
0
5
Ephorus
16%
7%
0%

GooglePl.
11
18
0
29
GooglePl.
58%
60%
0%

average
7,5
8,25
0
15,75
average
39%
28%
0%

The Theses.cz system
detected an average number of
Czech documents, but posted
the worst results for English
documents.
Still, however, more than
Ephorus overall. A reduction in
the 5% limit would significantly
enhance the success rate of
Theses.cz!
Number of records detected
according to type of change
– suspicion of plagiarism
Transformation
one sentence
one paragraph
swapping words
no diacritics
paraphrased sentence
paraphrased word
translation
total

Corpus
50
50
50
19
31
50
50
300

Thesis
6
7
6
5
0
4
0
28

Turnitin
20
21
20
9
10
20
0
100

Ephorus
1
3
1
1
0
1
1
8

GooglePl.
28
23
0
8
0
1
0
60

average
13,75
13,5
6,75
5,75
2,5
6,5
0,25
49,00

Transformation
one sentence
one paragraph
swapping words
no diacritics
paraphrased sentence
paraphrased word
translation
average

Corpus
100%
100%
100%
100%
100%
100%
100%
100%

Thesis
12%
14%
12%
26%
0%
8%
0%
10%

Turnitin
40%
42%
40%
47%
32%
40%
0%
35%

Ephorus
2%
6%
2%
5%
0%
2%
2%
3%

GooglePl.
56%
46%
0%
42%
0%
2%
0%
21%

average
28%
27%
14%
30%
8%
13%
1%
17%

Searching for whole sentences
in the GooglePlagiarism
application does not detect text
changes.
The Ephorus system detected
only 8 similar passages in the
text, however these were
mainly transcriptions of
abbreviations.
Number of detected records
according to type of change
– proof of plagiarism
Transformation
one sentence
one paragraph
swapping words
no diacritics
paraphrased sentence
paraphrased word
translation
total

Corpus
50
50
50
19
31
50
50
300

Thesis
5
6
1
4
0
3
0
19

Turnitin
8
10
7
6
2
8
0
41

Ephorus
0
1
0
0
0
0
0
1

GooglePl.
25
9
0
7
0
1
0
42

average
9,5
6,5
2
4,25
0,5
3
0
25,75

Transformation
one sentence
one paragraph
swapping words
no diacritics
paraphrased sentence
paraphrased word
translation
average

Corpus
100%
100%
100%
100%
100%
100%
100%
100%

Thesis
10%
12%
2%
21%
0%
6%
0%
7%

Turnitin
16%
20%
14%
32%
6%
16%
0%
15%

Ephorus
0%
2%
0%
0%
0%
0%
0%
0%

GooglePl.
50%
18%
0%
37%
0%
2%
0%
15%

average
19%
13%
4%
22%
2%
6%
0%
9%

The Ephorus system actually
detected only one document
clearly showing plagiarism.
As yet none of the systems
is able to search for a
translated text.
GooglePlagiarism best detects
sentences without changes,
while Turnitin best detects
sentences with changes.
Final summary
The Turnitin application achieves very good results,
but is very expensive.
The Ephorus application is inadequate at detecting
duplicates in the text corpus.
The Theses.cz application is a good compromise
between price and capability. Removing the 5% limit
on similarity detection would help.
Searching for sources online in GooglePlagiarism is
very effective at detecting copied texts.
You can find detailed test results in the proceedings of the
Seminar on providing access to grey literature 2013
http://nrgl.techlib.cz/index.php/Workshop

Jan Mach
machj@vse.cz

More Related Content

What's hot

Avoid plagiarism may 2011
Avoid plagiarism may 2011Avoid plagiarism may 2011
Avoid plagiarism may 2011weirnelson
 
Rhetorical Recycling: When Can You Use Your Ideas and Writing for More than O...
Rhetorical Recycling: When Can You Use Your Ideas and Writing for More than O...Rhetorical Recycling: When Can You Use Your Ideas and Writing for More than O...
Rhetorical Recycling: When Can You Use Your Ideas and Writing for More than O...Valarie Anthony
 
Recognizing Plagiarism
Recognizing PlagiarismRecognizing Plagiarism
Recognizing PlagiarismJanice Orcutt
 
Using Technology To Detect Plagiarism
Using Technology To Detect PlagiarismUsing Technology To Detect Plagiarism
Using Technology To Detect Plagiarismguestf17a2e
 
Plagarism + Turnitin Bus Induction Feb 2010
Plagarism + Turnitin   Bus Induction Feb 2010Plagarism + Turnitin   Bus Induction Feb 2010
Plagarism + Turnitin Bus Induction Feb 2010esyin
 
In-text Citations - APA 6th ed
In-text Citations - APA 6th edIn-text Citations - APA 6th ed
In-text Citations - APA 6th edJanice Orcutt
 
Plagiarism 9 th and 10th February
Plagiarism 9 th and 10th FebruaryPlagiarism 9 th and 10th February
Plagiarism 9 th and 10th Februarythirumaraikkadu
 
Dpp404plagiarism
Dpp404plagiarismDpp404plagiarism
Dpp404plagiarismMuhd Sayuty
 
Viper plagiarism scanner
Viper plagiarism scannerViper plagiarism scanner
Viper plagiarism scannerBhumi Patel
 
Issues of plagiarism in classroom
Issues of plagiarism in classroomIssues of plagiarism in classroom
Issues of plagiarism in classroomSaurav Aryal
 
Copyright and Plagiarism
Copyright and PlagiarismCopyright and Plagiarism
Copyright and Plagiarismguestd9519f
 
Plagiarism: What it is & How to Avoid it
Plagiarism: What it is & How to Avoid it Plagiarism: What it is & How to Avoid it
Plagiarism: What it is & How to Avoid it Joyner Library
 
Reference Management and Avoiding Plagiarism
Reference Management and Avoiding PlagiarismReference Management and Avoiding Plagiarism
Reference Management and Avoiding PlagiarismVenkitachalam Sriram
 
Plagiarism ppt 1 (2)
Plagiarism ppt 1 (2)Plagiarism ppt 1 (2)
Plagiarism ppt 1 (2)mstaubs
 

What's hot (20)

What is Plagiarism
What is PlagiarismWhat is Plagiarism
What is Plagiarism
 
Avoid plagiarism may 2011
Avoid plagiarism may 2011Avoid plagiarism may 2011
Avoid plagiarism may 2011
 
Rhetorical Recycling: When Can You Use Your Ideas and Writing for More than O...
Rhetorical Recycling: When Can You Use Your Ideas and Writing for More than O...Rhetorical Recycling: When Can You Use Your Ideas and Writing for More than O...
Rhetorical Recycling: When Can You Use Your Ideas and Writing for More than O...
 
Plagiarism
PlagiarismPlagiarism
Plagiarism
 
Recognizing Plagiarism
Recognizing PlagiarismRecognizing Plagiarism
Recognizing Plagiarism
 
Using Technology To Detect Plagiarism
Using Technology To Detect PlagiarismUsing Technology To Detect Plagiarism
Using Technology To Detect Plagiarism
 
Plagiarism
PlagiarismPlagiarism
Plagiarism
 
Plagarism + Turnitin Bus Induction Feb 2010
Plagarism + Turnitin   Bus Induction Feb 2010Plagarism + Turnitin   Bus Induction Feb 2010
Plagarism + Turnitin Bus Induction Feb 2010
 
Plagiarism
Plagiarism Plagiarism
Plagiarism
 
In-text Citations - APA 6th ed
In-text Citations - APA 6th edIn-text Citations - APA 6th ed
In-text Citations - APA 6th ed
 
Plagiarism 9 th and 10th February
Plagiarism 9 th and 10th FebruaryPlagiarism 9 th and 10th February
Plagiarism 9 th and 10th February
 
Dpp404plagiarism
Dpp404plagiarismDpp404plagiarism
Dpp404plagiarism
 
Viper plagiarism scanner
Viper plagiarism scannerViper plagiarism scanner
Viper plagiarism scanner
 
Plagiarism
PlagiarismPlagiarism
Plagiarism
 
Issues of plagiarism in classroom
Issues of plagiarism in classroomIssues of plagiarism in classroom
Issues of plagiarism in classroom
 
Copyright and Plagiarism
Copyright and PlagiarismCopyright and Plagiarism
Copyright and Plagiarism
 
Plagiarism
PlagiarismPlagiarism
Plagiarism
 
Plagiarism: What it is & How to Avoid it
Plagiarism: What it is & How to Avoid it Plagiarism: What it is & How to Avoid it
Plagiarism: What it is & How to Avoid it
 
Reference Management and Avoiding Plagiarism
Reference Management and Avoiding PlagiarismReference Management and Avoiding Plagiarism
Reference Management and Avoiding Plagiarism
 
Plagiarism ppt 1 (2)
Plagiarism ppt 1 (2)Plagiarism ppt 1 (2)
Plagiarism ppt 1 (2)
 

Similar to Anti-plagiarism tools for our repositories

Beyond Transparency: Success & Lessons From tambisBoston2003
Beyond Transparency: Success & Lessons From tambisBoston2003Beyond Transparency: Success & Lessons From tambisBoston2003
Beyond Transparency: Success & Lessons From tambisBoston2003robertstevens65
 
JournalTOCs - Introduction and Feedback
JournalTOCs - Introduction and FeedbackJournalTOCs - Introduction and Feedback
JournalTOCs - Introduction and Feedbackazami
 
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...Stuart Chalk
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Stuart Chalk
 
OOPS!: on-line ontology diagnosis by Maria Poveda
OOPS!: on-line ontology diagnosis by Maria PovedaOOPS!: on-line ontology diagnosis by Maria Poveda
OOPS!: on-line ontology diagnosis by Maria Povedasemanticsconference
 
OREChem Services and Workflows
OREChem Services and WorkflowsOREChem Services and Workflows
OREChem Services and Workflowsmarpierc
 
Jtelss presentation Paola Monachesi
Jtelss presentation Paola MonachesiJtelss presentation Paola Monachesi
Jtelss presentation Paola Monachesiguestff44453
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...cscpconf
 
Enhancing the Quality of ImmPort Data
Enhancing the Quality of ImmPort DataEnhancing the Quality of ImmPort Data
Enhancing the Quality of ImmPort DataBarry Smith
 
A scalable hybrid research paper recommender system for micro
A scalable hybrid research paper recommender system for microA scalable hybrid research paper recommender system for micro
A scalable hybrid research paper recommender system for microaman341480
 
BioSamples Database Linked Data, SWAT4LS Tutorial
BioSamples Database Linked Data, SWAT4LS TutorialBioSamples Database Linked Data, SWAT4LS Tutorial
BioSamples Database Linked Data, SWAT4LS TutorialRothamsted Research, UK
 
A Survey On Plagiarism Detection
A Survey On Plagiarism DetectionA Survey On Plagiarism Detection
A Survey On Plagiarism DetectionKarla Adamson
 
Linking chemistry: wider lessons for how we publish research
Linking chemistry: wider lessons for how we publish researchLinking chemistry: wider lessons for how we publish research
Linking chemistry: wider lessons for how we publish researchRoyal Society of Chemistry
 
Authors' and Publications' Citations knowledge base
Authors' and Publications' Citations knowledge base Authors' and Publications' Citations knowledge base
Authors' and Publications' Citations knowledge base Leila Zemmouchi-Ghomari
 
The role of annotation in reproducibility (Empirical 2014)
The role of annotation in reproducibility (Empirical 2014)The role of annotation in reproducibility (Empirical 2014)
The role of annotation in reproducibility (Empirical 2014)Oscar Corcho
 
How the Web can change social science research (including yours)
How the Web can change social science research (including yours)How the Web can change social science research (including yours)
How the Web can change social science research (including yours)Frank van Harmelen
 

Similar to Anti-plagiarism tools for our repositories (20)

Beyond Transparency: Success & Lessons From tambisBoston2003
Beyond Transparency: Success & Lessons From tambisBoston2003Beyond Transparency: Success & Lessons From tambisBoston2003
Beyond Transparency: Success & Lessons From tambisBoston2003
 
JournalTOCs - Introduction and Feedback
JournalTOCs - Introduction and FeedbackJournalTOCs - Introduction and Feedback
JournalTOCs - Introduction and Feedback
 
The Chemtools LaBLog
The Chemtools LaBLogThe Chemtools LaBLog
The Chemtools LaBLog
 
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
 
OOPS!: on-line ontology diagnosis by Maria Poveda
OOPS!: on-line ontology diagnosis by Maria PovedaOOPS!: on-line ontology diagnosis by Maria Poveda
OOPS!: on-line ontology diagnosis by Maria Poveda
 
OREChem Services and Workflows
OREChem Services and WorkflowsOREChem Services and Workflows
OREChem Services and Workflows
 
Jtelss presentation Paola Monachesi
Jtelss presentation Paola MonachesiJtelss presentation Paola Monachesi
Jtelss presentation Paola Monachesi
 
Transparent and scalable open url quality metrics
Transparent and scalable open url quality metricsTransparent and scalable open url quality metrics
Transparent and scalable open url quality metrics
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
 
Enhancing the Quality of ImmPort Data
Enhancing the Quality of ImmPort DataEnhancing the Quality of ImmPort Data
Enhancing the Quality of ImmPort Data
 
A scalable hybrid research paper recommender system for micro
A scalable hybrid research paper recommender system for microA scalable hybrid research paper recommender system for micro
A scalable hybrid research paper recommender system for micro
 
BioSamples Database Linked Data, SWAT4LS Tutorial
BioSamples Database Linked Data, SWAT4LS TutorialBioSamples Database Linked Data, SWAT4LS Tutorial
BioSamples Database Linked Data, SWAT4LS Tutorial
 
A Survey On Plagiarism Detection
A Survey On Plagiarism DetectionA Survey On Plagiarism Detection
A Survey On Plagiarism Detection
 
Linking chemistry: wider lessons for how we publish research
Linking chemistry: wider lessons for how we publish researchLinking chemistry: wider lessons for how we publish research
Linking chemistry: wider lessons for how we publish research
 
Authors' and Publications' Citations knowledge base
Authors' and Publications' Citations knowledge base Authors' and Publications' Citations knowledge base
Authors' and Publications' Citations knowledge base
 
The role of annotation in reproducibility (Empirical 2014)
The role of annotation in reproducibility (Empirical 2014)The role of annotation in reproducibility (Empirical 2014)
The role of annotation in reproducibility (Empirical 2014)
 
How the Web can change social science research (including yours)
How the Web can change social science research (including yours)How the Web can change social science research (including yours)
How the Web can change social science research (including yours)
 

Recently uploaded

Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Science lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lessonScience lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lessonJericReyAuditor
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 

Recently uploaded (20)

Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Science lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lessonScience lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lesson
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 

Anti-plagiarism tools for our repositories

  • 1. Anti-plagiarism tools for our repositories Jan Mach University of Economics, Prague Charles University in Prague 23. 10. 2013 Seminar on providing access to grey literature
  • 2. 1. clone – presenting another’s work, word-for-word, as one’s own 2. CTRL-C – presenting another’s work as one’s own, with minimum changes 3. find/replace – changing key words and phrases but retaining the essential content of the source 4. remix – paraphrasing from several sources into a single text 6. hybrid – mixing perfectly cited sources with non-cited ones 7. mashup – combining several non-cited sources into a text 8. error 404 – citations to non-existing sources or incorrect information about a source 9. aggregator – correct citation of foreign sources, but practically without any personal input by the author 10. re-tweet – correct citation, but using the original text/structure without significant changes What is plagiarism? without citation 5. recycle – using an author’s previous texts, The Plagiarism Spectrum: Tagging 10 Types of Unoriginal Work
  • 3. 10 types of source from which students copy a total of 50 documents, a sentence and paragraph from each 300 records – fragments of text using various changes to the copied sentences
  • 4. Transformations used • a sentence with two words transposed, • a sentence with diacritics removed, • a sentence with a single word replaced with another with a similar meaning - paraphrasing a word, • a sentence with several words replaced with others with similar meanings - paraphrasing a sentence, • a sentence machine-translated into Czech/English
  • 5. Hypotheses checked 1. The application is able to detect a single sentence copied from a source document. 2. Hypotéza Thesis Turnitin Ephorus GooglePl. Průměr The application is able to detect a single paragraph copied from a source document. The application is not affected by potential line breaks, indexes etc. in the source or tested document. 28% 1 12% 40% 2% 56% 3. Successful detection is not impacted if the plagiarist adds/removes a word in the copied sentence. 4. 5. 6. 7. 8. 9. 10. 11. 2 14% 42% 6% 46% 27% The application can detect Czech texts irrespective of the use of diacritics. 3 100% 100% 0% 0% 50% Successful detection is not impacted if the plagiarist paraphrases a single word in a sentence. 4 100% 100% 0% 80% 70% Successful detection is not impacted if the plagiarist paraphrases a whole sentence. 5 67% 100% 0% 4% 43% Successful detection is not impacted if the plagiarist translates text from/to Czech. 6 0% 88% na 0% 29% The Theses.cz system should achieve the best results in the detection of 7 0% 0% 0% 0% plagiarism in Czech theses and dissertations. 0% A low percentage of the total number of similarities will be detected from the Anopress source 8 10% 50% 10% 30% 25% compared to sources freely available on the internet. 9 0% 0% 0% 0% 0% Better results with EIR and Open Access sources are achieved by foreign tools rather than Czech 10 0% 40% 0% 70% 28% ones. 50% 0% 80% 38% Very good results for11 sources20%be achieved by systems using web search services. web will
  • 6. TURNITIN ABOUT THE APPLICATION • 15 language variants without Czech • large database of texts • price based on number of students, hundreds of thousands of crowns • integration with MOODLE system and others, no API • GradeMark and PeerMark modules EVALUATION OF SIMILARITIES • processing within 30 s • configurable size of searched similarities, possibility to exclude citations • very clear and functional interface with similarities, association of sources
  • 7.
  • 8. EPHORUS ABOUT THE APPLICATION • the application is used by over 3,000 schools and universities, 4 schools in the CR (Faculty of Business Administration at the University of Economics, Prague) • interface in Czech • operator claims a database with billions of web pages, submitted works, journal texts etc. EVALUATION OF SIMILARITIES • possible to define a min. similarity percentage • results sent by e-mail, attachments in PDF • basic web interface • no source de-duplication
  • 9.
  • 10. MUNI SYSTEMS ABOUT THE APPLICATION EVALUATION OF SIMILARITIES • theses.cz, odevzdej.cz and repozitar.cz • over 30 public and private schools in the CR and SR • price per number of students • extensive database of Czech theses and dissertations, study materials and selected web pages • API for connection • processing takes some hours • duplicate documents • comparing pairs of documents → two lists of similarities • no overall percentage of detected similarities • similarities displayed only from 5 % of the length of one of the compared document in a pair
  • 11. The first list contains documents with similarity of min. 5 % of the inspected file. bachelor paper with 40 pages: 2 pages The second list complements the previous list with other documents, but only with the length of similarility min. 5 % of the found file.
  • 12. GooglePlagiarism ABOUT THE APPLICATION • my own desktop application for personal computers running Windows • intended for personal analysis of documents by an individual • searching for whole sentences in the Google search engine EVALUATION OF SIMILARITIES • limited number of searches→ processing takes several hours • HTML output without retaining formatting • highlighted detected sentences and the first corresponding source
  • 13. Without retention of size and line breaks, navigation in a text during checks is much more difficult.
  • 14. Evaluation of system control and function Evaluation processing time clarity of results display of overall similarity minimum similarity price integration with school IR's de-duplications of sources Thesis Turnitin Ephorus GooglePl. The Thesis.cz system stands out thanks to its low price and possibility of repository integration. The Turnitin application excels through its user interface and available functions, but is expensive and not easy to integrate. The Ephorus system would be a good compromise between Thesis and Turnitin, yet …
  • 15. Number of documents detected according to source Category wikipedia.cz wikipedia.org (en) ETDs (cz) ETDs (en) NDLTD Anopress Arxive.org Google.cz (cz) Google.com (en) el. inf. Resources total Corpus 5 5 5 5 5 5 5 5 5 5 50 Thesis 3 1 1 0 0 0 0 2 0 0 7 Turnitin 5 3 2 3 0 0 1 3 2 3 22 Ephorus 2 2 1 0 0 0 0 0 0 0 5 GooglePl. 5 5 1 2 1 0 3 5 3 4 29 average 3,75 2,75 1,25 1,25 0,25 0 1 2,5 1,25 1,75 15,75 Category wikipedia.cz wikipedia.org (en) ETDs (cz) ETDs (en) NDLTD Anopress Arxive.org Google.cz (cz) Google.com (en) el. inf. Resources average Corpus 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% Thesis 60% 20% 20% 0% 0% 0% 0% 40% 0% 0% 14% Turnitin 100% 60% 40% 60% 0% 0% 20% 60% 40% 60% 44% Ephorus 40% 40% 20% 0% 0% 0% 0% 0% 0% 0% 10% GooglePl. 100% 100% 20% 40% 20% 0% 60% 100% 60% 80% 58% average 75% 55% 25% 25% 5% 0% 20% 50% 25% 35% 32% Low number of documents found using the Ephorus system. Documents from Anopress were not detected by any system. Most documents were detected by the Turnitin and GooglePlagiarism systems.
  • 16. Number of documents detected according to document language Language Czech English Slovak total Language Czech English Slovak Corpus 19 30 1 50 Corpus 100% 100% 100% Thesis 6 1 0 7 Thesis 32% 3% 0% Turnitin 10 12 0 22 Turnitin 53% 40% 0% Ephorus 3 2 0 5 Ephorus 16% 7% 0% GooglePl. 11 18 0 29 GooglePl. 58% 60% 0% average 7,5 8,25 0 15,75 average 39% 28% 0% The Theses.cz system detected an average number of Czech documents, but posted the worst results for English documents. Still, however, more than Ephorus overall. A reduction in the 5% limit would significantly enhance the success rate of Theses.cz!
  • 17. Number of records detected according to type of change – suspicion of plagiarism Transformation one sentence one paragraph swapping words no diacritics paraphrased sentence paraphrased word translation total Corpus 50 50 50 19 31 50 50 300 Thesis 6 7 6 5 0 4 0 28 Turnitin 20 21 20 9 10 20 0 100 Ephorus 1 3 1 1 0 1 1 8 GooglePl. 28 23 0 8 0 1 0 60 average 13,75 13,5 6,75 5,75 2,5 6,5 0,25 49,00 Transformation one sentence one paragraph swapping words no diacritics paraphrased sentence paraphrased word translation average Corpus 100% 100% 100% 100% 100% 100% 100% 100% Thesis 12% 14% 12% 26% 0% 8% 0% 10% Turnitin 40% 42% 40% 47% 32% 40% 0% 35% Ephorus 2% 6% 2% 5% 0% 2% 2% 3% GooglePl. 56% 46% 0% 42% 0% 2% 0% 21% average 28% 27% 14% 30% 8% 13% 1% 17% Searching for whole sentences in the GooglePlagiarism application does not detect text changes. The Ephorus system detected only 8 similar passages in the text, however these were mainly transcriptions of abbreviations.
  • 18. Number of detected records according to type of change – proof of plagiarism Transformation one sentence one paragraph swapping words no diacritics paraphrased sentence paraphrased word translation total Corpus 50 50 50 19 31 50 50 300 Thesis 5 6 1 4 0 3 0 19 Turnitin 8 10 7 6 2 8 0 41 Ephorus 0 1 0 0 0 0 0 1 GooglePl. 25 9 0 7 0 1 0 42 average 9,5 6,5 2 4,25 0,5 3 0 25,75 Transformation one sentence one paragraph swapping words no diacritics paraphrased sentence paraphrased word translation average Corpus 100% 100% 100% 100% 100% 100% 100% 100% Thesis 10% 12% 2% 21% 0% 6% 0% 7% Turnitin 16% 20% 14% 32% 6% 16% 0% 15% Ephorus 0% 2% 0% 0% 0% 0% 0% 0% GooglePl. 50% 18% 0% 37% 0% 2% 0% 15% average 19% 13% 4% 22% 2% 6% 0% 9% The Ephorus system actually detected only one document clearly showing plagiarism. As yet none of the systems is able to search for a translated text. GooglePlagiarism best detects sentences without changes, while Turnitin best detects sentences with changes.
  • 19. Final summary The Turnitin application achieves very good results, but is very expensive. The Ephorus application is inadequate at detecting duplicates in the text corpus. The Theses.cz application is a good compromise between price and capability. Removing the 5% limit on similarity detection would help. Searching for sources online in GooglePlagiarism is very effective at detecting copied texts.
  • 20. You can find detailed test results in the proceedings of the Seminar on providing access to grey literature 2013 http://nrgl.techlib.cz/index.php/Workshop Jan Mach machj@vse.cz

Editor's Notes

  1. [List of important URLs]