Antiplagiat Research
Rita Kuznetsova, Oleg Bakhteev, Alexey Romanov
12.11.2016 AINL FRUCT’16 1 / 29
Outline
Intro
Cross-Language Plagiarism Detection
Machine-Generated Text Detection
Intrinsic Plagiarism Detection
Collaboration
What’s Anti-Plagiat JSC?
Anti-Plagiat System
• Detects text reuse in any language and for any popular
file type
• Discovers cheating
A few numbers
• Over 500 universities
• 140 M sources in search databases
• 25 M texts checked per year
What’s Antiplagiat Research?
Antiplagiat Research tackles the most challenging problems
in the area of natural language processing and plagiarism
detection.
• Development of advanced technology
• Propagation of scientific thought
• Unity of young talents from leading institutions
— Moscow Phystech (MIPT)
— Computing Centre of RAS
— Moscow State University
History of the Project
• Oct ’14: launch of the project by Antiplagiat JSC
• Aug ’15: first conference participation
• Nov ’15: comprehensive study on machine-generated text
detection in real-world data
• Apr ’16: PAN 2016 participation (Top-1 in 2 tracks of
Author Diarization task)
• Jul ’16: development of cross-language plagiarism
detection tool powered by state-of-the-art techniques
... and great growth opportunities
Areas of Interest
• Cross-Language Plagiarism
• Paraphrase Detection
• Machine-Generated Text Detection
• Automatic Text Categorization
• Intelligent Search and Topic Search
• Author Diarization
• Smart Evaluation of Research Papers
Problems in Focus
Types of Text Reuse
Text reuse can be classified into several categories:
• copying text “as is”
• text reuse with paraphrasing
— Mr. Dursley always sat with his back to the window in his
office on the ninth floor.
— Mr. Dursley always propped his back on the glass window on
the ninth floor of the office.
• cross-language plagiarism
— A cat was sitting on the table.
— На столе сидела кошка.
Cross-Language Plagiarism Problem
The problem has ancient origins and still remains topical...
Cross-Language Plagiarism Problem
Problem
• A large proportion of texts contain reused fragments from
another language.
• Cross-lingual textual similarity for language pairs that
include Russian is poorly studied.
• Most methods that involve a machine translation
stage generate texts that differ too much from the
sources of plagiarism.
Our goal
Develop a method for cross-lingual (Russian and English) text
reuse detection that is based on the monolingual approach.
Cross-Language Plagiarism Detection Tool
• Explicit Semantic Analysis for Cross-Language Retrieval in Case of
Russian-English Translation — RuSSIR 2015
• A Monolingual Approach to Detection of Text Reuse in Russian-English
Collection — AINL-ISMW FRUCT 2015
• Candidate Document Retrieval for Cross-Lingual Plagiarism Detection — IDP
2016
Cross-Language Plagiarism Detection - main stages
• Given: English document collection and suspicious
Russian document
• The first stage:
— Find candidate documents, which possibly contain reused text
from the suspicious document, in the collection.
— Rank these documents according to their relevance values.
• The second stage:
— Split the suspicious document and candidate documents into
segments.
— Compare the segments with each other.
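The two stages can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: it assumes the suspicious Russian document has already been rendered into English (in line with the monolingual approach), and it uses a plain bag-of-words cosine similarity for both ranking and segment comparison. The function names, the toy collection, and the 0.5 threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a stand-in for real linguistic preprocessing."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_candidates(suspicious, collection, top_k=2):
    """Stage 1: rank collection documents by relevance to the suspicious one."""
    query = Counter(tokens(suspicious))
    scored = [(cosine(query, Counter(tokens(doc))), name)
              for name, doc in collection.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]


def segments(text):
    """Split a document into sentence-like segments."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def compare_segments(suspicious, candidate, threshold=0.5):
    """Stage 2: report segment pairs whose similarity reaches the threshold."""
    pairs = []
    for s in segments(suspicious):
        for c in segments(candidate):
            score = cosine(Counter(tokens(s)), Counter(tokens(c)))
            if score >= threshold:
                pairs.append((s, c, round(score, 2)))
    return pairs


# Toy English collection (hypothetical data).
collection = {
    "doc_cats": "A cat was sitting on the table. The dog slept outside.",
    "doc_math": "We prove a theorem about prime numbers.",
}
# Assumed to be the English rendering of the suspicious Russian document.
suspicious_en = "The weather was fine. A cat was sitting on the table."

candidates = retrieve_candidates(suspicious_en, collection)
matches = compare_segments(suspicious_en, collection[candidates[0]])
print(candidates, matches)
```

Keeping retrieval and segment comparison separate means the expensive pairwise comparison runs only on the few top-ranked candidates rather than on the whole collection.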
Machine-Generated Text Detection Problem
• The problem is not new: tools for paper generation have been
available for 10 years already
• Past research on generated papers discovered a hundred
of them in IEEE, Elsevier, and Springer journals (2009 and
later)
Task
Distinguish machine-generated papers from authentic
documents automatically.
Key assumption
Most papers are generated with one of several popular tools.
Machine-Generated Text Detection Problem
Today you can write a paper on a given topic with one click!
SCIgen - An Automatic CS Paper Generator
Mathgen: Randomly generated math papers
Machine-Generated Text Detection in Real-World Data
Automatic detection of gibberish papers should:
• deal with big data (millions of papers in real-world
collections),
• be applicable to the Russian language,
• capture texts prepared with various generation tools,
• also detect machine-translated text chunks containing
grammatical errors.
Our findings:
• A Study of the eLIBRARY.RU Collection for Artificial and Non-Scientific
Texts (Исследование коллекции eLIBRARY.RU на наличие искусственных и
ненаучных текстов) — SCIENCE ONLINE 2016
eLIBRARY.RU
• Search a collection of scientific papers of eLIBRARY.RU
for machine-generated and non-scientific papers
• Classification task
— Machine-generated vs. human-written texts
— Scientific papers vs. fiction texts
• Text features: syntactic and lexical
• Results
— We didn’t find any machine-generated texts like «Korchevatel»
in the collection of eLIBRARY.RU
— We found: anniversary congratulations, business news,
interviews, bibliographies, memorials, etc.
“Fly, pie, to the oven”. Non-scientific paper in a
scientific journal on baking bread
Machine-Translated Text Detection
• Recent advances in statistical machine
translation (SMT) have made SMT systems widely
available on the Web.
• Student reports, term papers and theses often lack proper
review by their tutors.
• It is very tempting to find relevant information in English,
automatically translate it into Russian, and paste it into
the paper “as is”!
• Machine-translated texts often contain grammatical errors
or inappropriate words:
— First individuals in the system take the maximum number of
contacts for any parameter combination.
— Первые лица в системе взять максимальное количество
контактов для любой комбинации параметров.
Solution design for MT detection
• Let’s estimate the likelihood that a sentence is
machine-translated, according to several language models
(LMs). . .
— Lexical 2,3-gram LMs trained on authentic texts
— Lexical 2,3-gram LMs trained on machine-translated texts
— POS tag 2,3-gram LMs trained on authentic texts
— POS tag 2,3-gram LMs trained on machine-translated texts
— word2vec (skip-gram and CBOW) models trained on
authentic texts
• . . . and use these estimates as features for a classification
task: 2 * 4 + 2 = 10 features in total (two n-gram orders for each
of the four n-gram LM types, plus the two word2vec models)
• The classifier is trained on a mixed labeled sample of
authentic and machine-translated sentences.
Our findings:
• Machine-Translated Text Detection in a Collection of Russian Scientific
Papers — Dialogue 2016
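The feature idea can be illustrated with a deliberately reduced sketch. It assumes add-one smoothed word bigram LMs only and produces two of the ten features (the average log-probability of a sentence under an "authentic" LM and under a "machine-translated" LM). The `BigramLM` class, the smoothing choice, and the English toy sentences are hypothetical; the POS-tag LMs, trigram LMs, word2vec features, and the trained classifier from the slide are left out.

```python
import math
from collections import Counter


class BigramLM:
    """Add-one smoothed bigram language model over word tokens."""

    def __init__(self, sentences):
        self.contexts = Counter()  # how often each word occurs as a left context
        self.bigrams = Counter()   # how often each (previous, word) pair occurs
        vocab = set()
        for sentence in sentences:
            words = ["<s>"] + sentence.lower().split()
            vocab.update(words)
            self.contexts.update(words[:-1])
            self.bigrams.update(zip(words, words[1:]))
        self.vocab_size = len(vocab) + 1  # +1 reserves mass for unseen words

    def avg_log_prob(self, sentence):
        """Average per-bigram log-probability of a sentence."""
        words = ["<s>"] + sentence.lower().split()
        pairs = list(zip(words, words[1:]))
        total = 0.0
        for prev, word in pairs:
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.contexts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total / len(pairs) if pairs else 0.0


def lm_features(sentence, authentic_lm, mt_lm):
    """Two of the LM-based features: scores under each training regime."""
    return [authentic_lm.avg_log_prob(sentence), mt_lm.avg_log_prob(sentence)]


# Invented toy training data; a real setup would use large corpora.
authentic = [
    "the first users take the maximum number of contacts",
    "the system takes any parameter combination",
]
machine_translated = [
    "first individuals in system take maximum number of contacts",
    "system take any parameter combination",
]

authentic_lm = BigramLM(authentic)
mt_lm = BigramLM(machine_translated)

feats = lm_features("individuals in system take maximum number",
                    authentic_lm, mt_lm)
print(feats)
```

A sentence that scores higher under the machine-translated LM than under the authentic one is a hint, not a verdict; in the setup described above, such scores are inputs to a classifier trained on labeled sentences.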
Intrinsic Plagiarism Detection Problem
IPD Task
Detecting the plagiarized parts of a given document by analyzing
the writing style.
Main Challenges
• No external collection
• No further possibilities to uncover plagiarism besides
detecting suspicious text parts which significantly differ
from the rest of the document
• Even if suspicious text parts are found, there is still no
guarantee that these parts are truly plagiarized
PAN @ CLEF 2016
PAN: Uncovering Plagiarism, Authorship and Social Software
Misuse
• Held since 2007
• Offers:
— Large-scale corpora for external (EPD) and intrinsic (IPD) plagiarism detection algorithms
— Performance measure scheme
PAN Tasks
1. Intrinsic plagiarism detection.
1.1 There exists one main author who wrote at least 70% of the
text.
1.2 Up to the other 30% may be written by other authors.
2. Diarization with a given number (n) of authors.
2.1 There are n authors and no main author.
2.2 Each author may have contributed to an arbitrary extent.
3. Diarization with an unknown number of authors.
3.1 No information about how many authors contributed to the
document.
Solving the Problem
Common scheme involves several stages:
• text segmentation (sentences, blocks, paragraphs etc.),
• map each segment to the feature space,
• outlier detection (or clustering for author diarization).
• Methods for Intrinsic Plagiarism Detection and Author Diarization—Notebook
for PAN at CLEF 2016. In CLEF 2016 Evaluation Labs and Workshop –
Working Notes Papers, 5-8 September, Évora, Portugal, September 2016.
CEUR-WS.org. ISSN 1613-0073.
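The common scheme (segmentation, feature mapping, outlier detection) can be sketched as below. This is only an illustration under stated assumptions, not the method from the PAN notebook: the sentence-level segmentation, the tiny stylometric feature set (average word length plus function-word frequencies), and the z-score threshold are all hypothetical choices.

```python
import math
import re
from collections import Counter

# A small, hypothetical set of function words used as style markers.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "is", "that"]


def split_segments(text):
    """Stage 1: split the document into sentence-level segments."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def style_features(segment):
    """Stage 2: map a segment to a small stylometric feature vector."""
    words = re.findall(r"\w+", segment.lower())
    counts = Counter(words)
    n = max(len(words), 1)
    avg_word_len = sum(len(w) for w in words) / n
    return [avg_word_len] + [counts[w] / n for w in FUNCTION_WORDS]


def find_outliers(vectors, z_threshold=1.5):
    """Stage 3: flag segments far from the document's mean style vector."""
    dims = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    dists = [math.dist(v, mean) for v in vectors]
    mu = sum(dists) / len(dists)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in dists) / len(dists))
    if sigma == 0:
        return []
    return [i for i, d in enumerate(dists) if (d - mu) / sigma > z_threshold]


# Toy document: one sentence is written in a visibly different style.
document = (
    "The cat sat on the mat. The dog ran to the park. "
    "A bird is in the tree. The sun is hot today. "
    "Notwithstanding epistemological considerations, heterogeneous "
    "methodologies predominate. The boy has a red ball."
)

segs = split_segments(document)
outliers = find_outliers([style_features(s) for s in segs])
print([segs[i] for i in outliers])
```

For author diarization, the outlier step would be replaced by clustering the same feature vectors, with the number of clusters either given or estimated.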
Collaboration Opportunities
Research Collaboration
Opportunities for research collaboration include:
• Joint non-profit studies
• Custom research
• Consulting and mentorship
• Joint laboratories (joint & grant financing)
• Internship opportunities
• Thesis research
Dialogue Evaluation’17 - Plagiarism Detection
The PlagEvalRus workshop
Focused on evaluation of Russian-specific plagiarism detection
algorithms. The workshop emphasizes external plagiarism
detection in scientific texts (academic plagiarism).
With support of:
• PAN
• Dialogue conference
• CyberLeninka
www.dialog-21.ru/evaluation/2017/plageval/
Thanks for your attention!
Questions / Comments?
12.11.2016 AINL FRUCT’16 29 / 29

More Related Content

What's hot

Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
Vsevolod Dyomkin
 
Digital repertoires of poetry metrics: towards a Linked Open Data ecosystem
Digital repertoires of poetry metrics: towards a Linked Open Data ecosystemDigital repertoires of poetry metrics: towards a Linked Open Data ecosystem
Digital repertoires of poetry metrics: towards a Linked Open Data ecosystem
Uned Laboratorio de Innovación en Humanidades
 
Linked open data: standardization, interoperability and multilingual challeng...
Linked open data: standardization, interoperability and multilingual challeng...Linked open data: standardization, interoperability and multilingual challeng...
Linked open data: standardization, interoperability and multilingual challeng...
Uned Laboratorio de Innovación en Humanidades
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language Processing
Vsevolod Dyomkin
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
Vsevolod Dyomkin
 
POSTDATA: Towards publishing European Poetry as Linked Open Data
POSTDATA: Towards publishing European Poetry as Linked Open DataPOSTDATA: Towards publishing European Poetry as Linked Open Data
POSTDATA: Towards publishing European Poetry as Linked Open Data
Uned Laboratorio de Innovación en Humanidades
 
Deep learning Type Inference for Dynamic Programming Languages
Deep learning Type Inference for Dynamic Programming Languages Deep learning Type Inference for Dynamic Programming Languages
Deep learning Type Inference for Dynamic Programming Languages
Amir M. Mir
 
NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
Vsevolod Dyomkin
 
Digital Medieval Data Curation
Digital Medieval Data CurationDigital Medieval Data Curation
Digital Medieval Data Curation
blalbritton
 
Mchristy-eMOP-workflows2-24x7
Mchristy-eMOP-workflows2-24x7Mchristy-eMOP-workflows2-24x7
Mchristy-eMOP-workflows2-24x7
Matt Christy
 
Data wrangling week 6
Data wrangling week 6Data wrangling week 6
Data wrangling week 6
Ferdin Joe John Joseph PhD
 
mchristy-DH2014-emop-bookhistory-tools
mchristy-DH2014-emop-bookhistory-toolsmchristy-DH2014-emop-bookhistory-tools
mchristy-DH2014-emop-bookhistory-tools
Matt Christy
 
What's Spain's Paris? Mining Analogical Libraries from Q&A Discussions
What's Spain's Paris? Mining Analogical Libraries from Q&A DiscussionsWhat's Spain's Paris? Mining Analogical Libraries from Q&A Discussions
What's Spain's Paris? Mining Analogical Libraries from Q&A Discussions
Chunyang Chen
 
Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?
Marina Santini
 
Neural Network Language Models for Candidate Scoring in Multi-System Machine...
 Neural Network Language Models for Candidate Scoring in Multi-System Machine... Neural Network Language Models for Candidate Scoring in Multi-System Machine...
Neural Network Language Models for Candidate Scoring in Multi-System Machine...
Matīss ‎‎‎‎‎‎‎  
 
Computational Rhetoric for Serbian - Resources and Implementation
Computational Rhetoric for Serbian - Resources and ImplementationComputational Rhetoric for Serbian - Resources and Implementation
Computational Rhetoric for Serbian - Resources and Implementation
Jelena Mitrovic
 
Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...
Lifeng (Aaron) Han
 
CoLing 2016
CoLing 2016CoLing 2016
Scanned texts as corpora - a case study
Scanned texts as corpora - a case study Scanned texts as corpora - a case study
Scanned texts as corpora - a case study
jsbien
 
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfSparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Harsh Thakkar
 

What's hot (20)

Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
 
Digital repertoires of poetry metrics: towards a Linked Open Data ecosystem
Digital repertoires of poetry metrics: towards a Linked Open Data ecosystemDigital repertoires of poetry metrics: towards a Linked Open Data ecosystem
Digital repertoires of poetry metrics: towards a Linked Open Data ecosystem
 
Linked open data: standardization, interoperability and multilingual challeng...
Linked open data: standardization, interoperability and multilingual challeng...Linked open data: standardization, interoperability and multilingual challeng...
Linked open data: standardization, interoperability and multilingual challeng...
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language Processing
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
 
POSTDATA: Towards publishing European Poetry as Linked Open Data
POSTDATA: Towards publishing European Poetry as Linked Open DataPOSTDATA: Towards publishing European Poetry as Linked Open Data
POSTDATA: Towards publishing European Poetry as Linked Open Data
 
Deep learning Type Inference for Dynamic Programming Languages
Deep learning Type Inference for Dynamic Programming Languages Deep learning Type Inference for Dynamic Programming Languages
Deep learning Type Inference for Dynamic Programming Languages
 
NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
 
Digital Medieval Data Curation
Digital Medieval Data CurationDigital Medieval Data Curation
Digital Medieval Data Curation
 
Mchristy-eMOP-workflows2-24x7
Mchristy-eMOP-workflows2-24x7Mchristy-eMOP-workflows2-24x7
Mchristy-eMOP-workflows2-24x7
 
Data wrangling week 6
Data wrangling week 6Data wrangling week 6
Data wrangling week 6
 
mchristy-DH2014-emop-bookhistory-tools
mchristy-DH2014-emop-bookhistory-toolsmchristy-DH2014-emop-bookhistory-tools
mchristy-DH2014-emop-bookhistory-tools
 
What's Spain's Paris? Mining Analogical Libraries from Q&A Discussions
What's Spain's Paris? Mining Analogical Libraries from Q&A DiscussionsWhat's Spain's Paris? Mining Analogical Libraries from Q&A Discussions
What's Spain's Paris? Mining Analogical Libraries from Q&A Discussions
 
Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?
 
Neural Network Language Models for Candidate Scoring in Multi-System Machine...
 Neural Network Language Models for Candidate Scoring in Multi-System Machine... Neural Network Language Models for Candidate Scoring in Multi-System Machine...
Neural Network Language Models for Candidate Scoring in Multi-System Machine...
 
Computational Rhetoric for Serbian - Resources and Implementation
Computational Rhetoric for Serbian - Resources and ImplementationComputational Rhetoric for Serbian - Resources and Implementation
Computational Rhetoric for Serbian - Resources and Implementation
 
Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...
 
CoLing 2016
CoLing 2016CoLing 2016
CoLing 2016
 
Scanned texts as corpora - a case study
Scanned texts as corpora - a case study Scanned texts as corpora - a case study
Scanned texts as corpora - a case study
 
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfSparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
 

Viewers also liked

AINL 2016: Panicheva, Ledovaya
AINL 2016: Panicheva, LedovayaAINL 2016: Panicheva, Ledovaya
AINL 2016: Panicheva, Ledovaya
Lidia Pivovarova
 
AINL 2016: Goncharov
AINL 2016: GoncharovAINL 2016: Goncharov
AINL 2016: Goncharov
Lidia Pivovarova
 
AINL 2016: Castro, Lopez, Cavalcante, Couto
AINL 2016: Castro, Lopez, Cavalcante, CoutoAINL 2016: Castro, Lopez, Cavalcante, Couto
AINL 2016: Castro, Lopez, Cavalcante, Couto
Lidia Pivovarova
 
AINL 2016: Fenogenova, Karpov, Kazorin
AINL 2016: Fenogenova, Karpov, KazorinAINL 2016: Fenogenova, Karpov, Kazorin
AINL 2016: Fenogenova, Karpov, Kazorin
Lidia Pivovarova
 
AINL 2016: Nikolenko
AINL 2016: NikolenkoAINL 2016: Nikolenko
AINL 2016: Nikolenko
Lidia Pivovarova
 
AINL 2016: Rykov, Nagornyy, Koltsova, Natta, Kremenets, Manovich, Cerrone, Cr...
AINL 2016: Rykov, Nagornyy, Koltsova, Natta, Kremenets, Manovich, Cerrone, Cr...AINL 2016: Rykov, Nagornyy, Koltsova, Natta, Kremenets, Manovich, Cerrone, Cr...
AINL 2016: Rykov, Nagornyy, Koltsova, Natta, Kremenets, Manovich, Cerrone, Cr...
Lidia Pivovarova
 
AINL 2016: Kravchenko
AINL 2016: KravchenkoAINL 2016: Kravchenko
AINL 2016: Kravchenko
Lidia Pivovarova
 
AINL 2016: Boldyreva
AINL 2016: BoldyrevaAINL 2016: Boldyreva
AINL 2016: Boldyreva
Lidia Pivovarova
 
AINL 2016: Proncheva
AINL 2016: PronchevaAINL 2016: Proncheva
AINL 2016: Proncheva
Lidia Pivovarova
 
AINL 2016: Bugaychenko
AINL 2016: BugaychenkoAINL 2016: Bugaychenko
AINL 2016: Bugaychenko
Lidia Pivovarova
 
AINL 2016: Eyecioglu
AINL 2016: EyeciogluAINL 2016: Eyecioglu
AINL 2016: Eyecioglu
Lidia Pivovarova
 
AINL 2016: Bodrunova, Blekanov, Maksimov
AINL 2016: Bodrunova, Blekanov, MaksimovAINL 2016: Bodrunova, Blekanov, Maksimov
AINL 2016: Bodrunova, Blekanov, Maksimov
Lidia Pivovarova
 
AINL 2016: Skornyakov
AINL 2016: SkornyakovAINL 2016: Skornyakov
AINL 2016: Skornyakov
Lidia Pivovarova
 
AINL 2016: Muravyov
AINL 2016: MuravyovAINL 2016: Muravyov
AINL 2016: Muravyov
Lidia Pivovarova
 
AINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, ZighedAINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, Zighed
Lidia Pivovarova
 
AINL 2016: Yagunova
AINL 2016: YagunovaAINL 2016: Yagunova
AINL 2016: Yagunova
Lidia Pivovarova
 
AINL 2016: Ustalov
AINL 2016: Ustalov AINL 2016: Ustalov
AINL 2016: Ustalov
Lidia Pivovarova
 
AINL 2016: Romanova, Nefedov
AINL 2016: Romanova, NefedovAINL 2016: Romanova, Nefedov
AINL 2016: Romanova, Nefedov
Lidia Pivovarova
 
AINL 2016: Moskvichev
AINL 2016: MoskvichevAINL 2016: Moskvichev
AINL 2016: Moskvichev
Lidia Pivovarova
 
AINL 2016: Strijov
AINL 2016: StrijovAINL 2016: Strijov
AINL 2016: Strijov
Lidia Pivovarova
 

Viewers also liked (20)

AINL 2016: Panicheva, Ledovaya
AINL 2016: Panicheva, LedovayaAINL 2016: Panicheva, Ledovaya
AINL 2016: Panicheva, Ledovaya
 
AINL 2016: Goncharov
AINL 2016: GoncharovAINL 2016: Goncharov
AINL 2016: Goncharov
 
AINL 2016: Castro, Lopez, Cavalcante, Couto
AINL 2016: Castro, Lopez, Cavalcante, CoutoAINL 2016: Castro, Lopez, Cavalcante, Couto
AINL 2016: Castro, Lopez, Cavalcante, Couto
 
AINL 2016: Fenogenova, Karpov, Kazorin
AINL 2016: Fenogenova, Karpov, KazorinAINL 2016: Fenogenova, Karpov, Kazorin
AINL 2016: Fenogenova, Karpov, Kazorin
 
AINL 2016: Nikolenko
AINL 2016: NikolenkoAINL 2016: Nikolenko
AINL 2016: Nikolenko
 
AINL 2016: Rykov, Nagornyy, Koltsova, Natta, Kremenets, Manovich, Cerrone, Cr...
AINL 2016: Rykov, Nagornyy, Koltsova, Natta, Kremenets, Manovich, Cerrone, Cr...AINL 2016: Rykov, Nagornyy, Koltsova, Natta, Kremenets, Manovich, Cerrone, Cr...
AINL 2016: Rykov, Nagornyy, Koltsova, Natta, Kremenets, Manovich, Cerrone, Cr...
 
AINL 2016: Kravchenko
AINL 2016: KravchenkoAINL 2016: Kravchenko
AINL 2016: Kravchenko
 
AINL 2016: Boldyreva
AINL 2016: BoldyrevaAINL 2016: Boldyreva
AINL 2016: Boldyreva
 
AINL 2016: Proncheva
AINL 2016: PronchevaAINL 2016: Proncheva
AINL 2016: Proncheva
 
AINL 2016: Bugaychenko
AINL 2016: BugaychenkoAINL 2016: Bugaychenko
AINL 2016: Bugaychenko
 
AINL 2016: Eyecioglu
AINL 2016: EyeciogluAINL 2016: Eyecioglu
AINL 2016: Eyecioglu
 
AINL 2016: Bodrunova, Blekanov, Maksimov
AINL 2016: Bodrunova, Blekanov, MaksimovAINL 2016: Bodrunova, Blekanov, Maksimov
AINL 2016: Bodrunova, Blekanov, Maksimov
 
AINL 2016: Skornyakov
AINL 2016: SkornyakovAINL 2016: Skornyakov
AINL 2016: Skornyakov
 
AINL 2016: Muravyov
AINL 2016: MuravyovAINL 2016: Muravyov
AINL 2016: Muravyov
 
AINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, ZighedAINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, Zighed
 
AINL 2016: Yagunova
AINL 2016: YagunovaAINL 2016: Yagunova
AINL 2016: Yagunova
 
AINL 2016: Ustalov
AINL 2016: Ustalov AINL 2016: Ustalov
AINL 2016: Ustalov
 
AINL 2016: Romanova, Nefedov
AINL 2016: Romanova, NefedovAINL 2016: Romanova, Nefedov
AINL 2016: Romanova, Nefedov
 
AINL 2016: Moskvichev
AINL 2016: MoskvichevAINL 2016: Moskvichev
AINL 2016: Moskvichev
 
AINL 2016: Strijov
AINL 2016: StrijovAINL 2016: Strijov
AINL 2016: Strijov
 

Similar to AINL 2016: Kuznetsova

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering StandardsNavigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Liz Grumbach
 
Plc part 1
Plc part 1Plc part 1
Plc part 1
Taymoor Nazmy
 
Laurel Stvan dh ant_conc 2/27/13
Laurel Stvan dh ant_conc 2/27/13Laurel Stvan dh ant_conc 2/27/13
Laurel Stvan dh ant_conc 2/27/13
Jessica C. Murphy
 
SynPhony2014
SynPhony2014SynPhony2014
SynPhony2014
langtech
 
Computational linguistics
Computational linguistics Computational linguistics
Computational linguistics
kashmasardar
 
Com ling
Com lingCom ling
Com ling
Mohammad Raza
 
Relation between Languages, Machines and Computations
Relation between Languages, Machines and ComputationsRelation between Languages, Machines and Computations
Relation between Languages, Machines and Computations
BHARATH KUMAR
 
Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre prof...
Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre prof...Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre prof...
Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre prof...
Francisco Manuel Rangel Pardo
 
22_ideals.ppt
22_ideals.ppt22_ideals.ppt
22_ideals.ppt
DanielPerez457035
 
22_ideals (1).ppt
22_ideals (1).ppt22_ideals (1).ppt
22_ideals (1).ppt
Jadna Almeida
 
OpenMinTeD - Repositories in the centre of new scientific knowledge
OpenMinTeD - Repositories in the centre of new scientific knowledgeOpenMinTeD - Repositories in the centre of new scientific knowledge
OpenMinTeD - Repositories in the centre of new scientific knowledge
openminted_eu
 
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic ModerationHate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Universitat Politècnica de Catalunya
 
The Two Cultures of Programming
The Two Cultures of ProgrammingThe Two Cultures of Programming
The Two Cultures of Programming
Joshua Ballanco
 
Collaborative Ontology Building Project
Collaborative Ontology Building Project  Collaborative Ontology Building Project
Collaborative Ontology Building Project
Jie Bao
 
Big Data: the weakest link
Big Data: the weakest linkBig Data: the weakest link
Big Data: the weakest link
CS, NcState
 
Research software susainability
Research software susainabilityResearch software susainability
Research software susainability
Daniel S. Katz
 
Analyzing Big Data's Weakest Link (hint: it might be you)
Analyzing Big Data's Weakest Link  (hint: it might be you)Analyzing Big Data's Weakest Link  (hint: it might be you)
Analyzing Big Data's Weakest Link (hint: it might be you)
HPCC Systems
 
1 Introduction.ppt
1 Introduction.ppt1 Introduction.ppt
1 Introduction.ppt
tanishamahajan11
 
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingAn-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
Theodore J. LaGrow
 
Can Deep Learning Techniques Improve Entity Linking?
Can Deep Learning Techniques Improve Entity Linking?Can Deep Learning Techniques Improve Entity Linking?
Can Deep Learning Techniques Improve Entity Linking?
Julien PLU
 

Similar to AINL 2016: Kuznetsova (20)

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering StandardsNavigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
 
Plc part 1
Plc part 1Plc part 1
Plc part 1
 
Laurel Stvan dh ant_conc 2/27/13
Laurel Stvan dh ant_conc 2/27/13Laurel Stvan dh ant_conc 2/27/13
Laurel Stvan dh ant_conc 2/27/13
 
SynPhony2014
SynPhony2014SynPhony2014
SynPhony2014
 
Computational linguistics
Computational linguistics Computational linguistics
Computational linguistics
 
Com ling
Com lingCom ling
Com ling
 
Relation between Languages, Machines and Computations
Relation between Languages, Machines and ComputationsRelation between Languages, Machines and Computations
Relation between Languages, Machines and Computations
 
Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre prof...
Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre prof...Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre prof...
Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre prof...
 
22_ideals.ppt
22_ideals.ppt22_ideals.ppt
22_ideals.ppt
 
22_ideals (1).ppt
22_ideals (1).ppt22_ideals (1).ppt
22_ideals (1).ppt
 
OpenMinTeD - Repositories in the centre of new scientific knowledge
OpenMinTeD - Repositories in the centre of new scientific knowledgeOpenMinTeD - Repositories in the centre of new scientific knowledge
OpenMinTeD - Repositories in the centre of new scientific knowledge
 
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic ModerationHate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
 
The Two Cultures of Programming
The Two Cultures of ProgrammingThe Two Cultures of Programming
The Two Cultures of Programming
 
Collaborative Ontology Building Project
Collaborative Ontology Building Project  Collaborative Ontology Building Project
Collaborative Ontology Building Project
 
Big Data: the weakest link
Big Data: the weakest linkBig Data: the weakest link
Big Data: the weakest link
 
Research software susainability
Research software susainabilityResearch software susainability
Research software susainability
 
Analyzing Big Data's Weakest Link (hint: it might be you)
Analyzing Big Data's Weakest Link  (hint: it might be you)Analyzing Big Data's Weakest Link  (hint: it might be you)
Analyzing Big Data's Weakest Link (hint: it might be you)
 
1 Introduction.ppt
1 Introduction.ppt1 Introduction.ppt
1 Introduction.ppt
 
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingAn-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
 
Can Deep Learning Techniques Improve Entity Linking?
Can Deep Learning Techniques Improve Entity Linking?Can Deep Learning Techniques Improve Entity Linking?
Can Deep Learning Techniques Improve Entity Linking?
 

AINL 2016: Kuznetsova

  • 1. Antiplagiat Research Rita Kuznetsova, Oleg Bakhteev, Alexey Romanov 12.11.2016 AINL FRUCT’16 1 / 29
  • 2. Outline: Intro; Cross-Language Plagiarism Detection; Machine-Generated Text Detection; Intrinsic Plagiarism Detection; Collaboration
  • 3. What’s Anti-Plagiat JSC Anti-Plagiat System • Detects text reuse in any language and for any popular file type • Discovers cheating A few numbers • Over 500 universities • 140 M sources in search databases • 25 M texts checked per year
  • 4. What’s Antiplagiat Research? Antiplagiat Research tackles the most challenging problems in the area of natural language processing and plagiarism detection. • Development of advanced technology • Propagation of scientific thought • Unity of young talents from leading institutions — Moscow Phystech (MIPT) — Computing Centre of RAS — Moscow State University
  • 5. History of the Project • Oct ’14: launch of the project by Antiplagiat JSC • Aug ’15: first conference participation • Nov ’15: comprehensive study on machine-generated text detection in real-world data • Apr ’16: PAN 2016 participation (Top-1 in 2 tracks of Author Diarization task) • Jul ’16: development of cross-language plagiarism detection tool powered by state-of-the-art techniques ... and great growth opportunities
  • 6. Areas of Interest • Cross-Language Plagiarism • Paraphrase Detection • Machine-Generated Text Detection • Automatic Text Categorization • Intelligent Search and Topic Search • Author Diarization • Smart Evaluation of Research Papers
  • 7. Problems in Focus
  • 8. Types of Text Reuse Text reuse can be classified into several categories: • copying text “as is” • text reuse with paraphrasing — Mr. Dursley always sat with his back to the window in his office on the ninth floor. — Mr. Dursley always propped his back on the glass window on the ninth floor of the office. • cross-language plagiarism — A cat was sitting on the table. — На столе сидела кошка.
  • 9. Cross-Language Plagiarism Problem The problem has ancient origins and still remains topical...
  • 10. Cross-Language Plagiarism Problem The problem has ancient origins and still remains topical...
  • 11. Cross-Language Plagiarism Problem Problem • A large proportion of texts contain fragments reused from another language. • Cross-lingual textual similarity is poorly studied for language pairs that include Russian. • Most methods that involve a machine translation stage generate texts that differ too much from the sources of plagiarism. Our goal Develop a method for cross-lingual (Russian and English) text reuse detection that is based on the monolingual approach.
  • 12. Cross-Language Plagiarism Detection Tool • Explicit Semantic Analysis for Cross-Language Retrieval in Case of Russian-English Translation — RuSSIR 2015 • A Monolingual Approach to Detection of Text Reuse in Russian-English Collection — AINL-ISMW FRUCT 2015 • Candidate Document Retrieval for Cross-Lingual Plagiarism Detection — IDP 2016
  • 13. Cross-Language Plagiarism Detection: main stages • Given: an English document collection and a suspicious Russian document • The first stage: — Find candidate documents in the collection that possibly contain text reused in the suspicious document. — Rank these documents according to their relevance values. • The second stage: — Split the suspicious document and the candidate documents into segments. — Compare the segments with each other.
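The two stages on slide 13 can be sketched roughly as follows. This is a minimal illustration and not the authors' tool: it assumes the suspicious Russian document has already been mapped into English by some step that is out of scope here, and it swaps in TF-IDF cosine similarity for candidate ranking and Jaccard word overlap for segment comparison. All function names and thresholds are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple English-only tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_candidates(suspicious_en, collection, top_k=2):
    """Stage 1: rank collection documents by relevance to the suspicious one."""
    docs = [tokenize(d) for d in collection]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    # Terms unseen in the collection get zero weight in the query vector.
    query = {t: tf * idf.get(t, 0.0)
             for t, tf in Counter(tokenize(suspicious_en)).items()}
    ranked = sorted(((cosine(query, v), i) for i, v in enumerate(vectors)),
                    reverse=True)
    return [(i, score) for score, i in ranked[:top_k] if score > 0]


def split_segments(text):
    """Split a document into sentence-like segments."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def jaccard(a, b):
    """Word-set overlap between two segments."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def compare_segments(suspicious_en, candidate, threshold=0.5):
    """Stage 2: pair up segments whose word overlap reaches the threshold."""
    return [(s, c)
            for s in split_segments(suspicious_en)
            for c in split_segments(candidate)
            if jaccard(s, c) >= threshold]


collection = [
    "A cat was sitting on the table. It was a sunny day.",
    "Stock markets fell sharply on Monday.",
    "The dog barked at the postman.",
]
suspicious = "The weather was nice. A cat was sitting on the table."
candidates = retrieve_candidates(suspicious, collection)
matches = compare_segments(suspicious, collection[candidates[0][0]])
```

Keeping the cheap ranking step separate from the pairwise segment comparison means the quadratic comparison only runs on a handful of candidates, which matters for a collection with millions of sources.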
  • 14. Machine-Generated Text Detection Problem • The problem is not new: tools for paper generation have been available for 10 years already • Past research on generated papers discovered a hundred of them in IEEE, Elsevier, Springer journals (2009 and later) Task Distinguish machine-generated papers from authentic documents automatically. Key assumption Most papers are generated with one of several popular tools.
  • 15. Machine-Generated Text Detection Problem Today you can write a paper on a given topic with one click! SCIgen - An Automatic CS Paper Generator
  • 16. Machine-Generated Text Detection Problem Today you can write a paper on a given topic with one click! Mathgen: Randomly generated math papers
  • 17. Machine-Generated Text Detection in Real-World Data Automatic detection of gibberish papers should: • deal with big data (millions of papers in real-world collections), • be applicable to the Russian language, • capture texts prepared with various generation tools, • also detect machine-translated text chunks containing grammatical errors. Our findings: • A Study of the eLIBRARY.RU Collection for Artificial and Non-Scientific Texts (original title: Исследование коллекции eLIBRARY.RU на наличие искусственных и ненаучных текстов) — SCIENCE ONLINE 2016
  • 18. eLIBRARY.RU • Search the eLIBRARY.RU collection of scientific papers for machine-generated and non-scientific papers • Classification task — Machine-generated vs. human-written texts — Scientific papers vs. fiction texts • Text features: syntactic and lexical • Results — We didn’t find any machine-generated texts like «Korchevatel» in the collection of eLIBRARY.RU — We found: anniversary congratulations, business news, interviews, bibliographies, memorials, etc.
  • 19. “Fly, pie, to the oven”. Non-scientific paper in a scientific journal on baking bread
  • 20. Machine-Translated Text Detection • Recent advances in the field of statistical machine translation (SMT) have led to high availability of SMT systems on the Web. • Student reports, term papers and theses lack proper analysis by their tutors. • It is very tempting to find relevant information in English, automatically translate it into Russian, and paste it into the paper “as is”! • Machine-translated texts often contain grammatical errors or inappropriate words: — First individuals in the system take the maximum number of contacts for any parameter combination. — Первые лица в системе взять максимальное количество контактов для любой комбинации параметров.
  • 21. Solution design for MT detection • Let’s estimate the likelihood that a sentence is machine-translated, according to several language models (LMs)... — Lexical 2,3-gram LMs trained on authentic texts — Lexical 2,3-gram LMs trained on machine-translated texts — POS tag 2,3-gram LMs trained on authentic texts — POS tag 2,3-gram LMs trained on machine-translated texts — word2vec (skip-gram and CBOW) models trained on authentic texts • ... and use these estimates as features for a classification task. 2 * 4 + 2 = 10 features in total • The classifier is trained on a mixed labeled sample of authentic and machine-translated sentences. Our findings: • Machine-Translated Text Detection in a Collection of Russian Scientific Papers — Dialogue 2016
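As a rough illustration of the feature design on slide 21, the sketch below computes four of the ten features: the lexical 2-gram and 3-gram scores under an LM trained on authentic text and under an LM trained on machine-translated text. The add-one smoothing, the `NgramLM` class and the toy corpora are assumptions made for the example; the slides do not say which smoothing or toolkit was used. The POS-tag LMs would give four more features in the same way, and the two word2vec models the remaining two (2 * 4 + 2 = 10).

```python
import math
from collections import Counter


class NgramLM:
    """Add-one smoothed n-gram language model over token sequences."""

    def __init__(self, n, sentences):
        self.n = n
        self.ngrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for tokens in sentences:
            padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
            self.vocab.update(padded)
            for i in range(len(padded) - n + 1):
                gram = tuple(padded[i:i + n])
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def avg_logprob(self, tokens):
        """Mean log-probability per predicted token of one sentence."""
        padded = ["<s>"] * (self.n - 1) + list(tokens) + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot for unseen tokens
        total, count = 0.0, 0
        for i in range(len(padded) - self.n + 1):
            gram = tuple(padded[i:i + self.n])
            p = (self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + v)
            total += math.log(p)
            count += 1
        return total / count


def lm_features(tokens, authentic, translated):
    """Lexical LM features in the order:
    [authentic 2-gram, authentic 3-gram, translated 2-gram, translated 3-gram]."""
    models = [NgramLM(n, corpus)
              for corpus in (authentic, translated)
              for n in (2, 3)]
    return [m.avg_logprob(tokens) for m in models]


# Toy corpora (hypothetical); real models would be trained on large samples.
authentic = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]
translated = [["cat", "the", "to", "sit", "mat", "on"],
              ["dog", "the", "to", "sit", "rug", "on"]]
feats = lm_features(["the", "cat", "sat", "on", "the", "rug"],
                    authentic, translated)
```

A fluent sentence scores higher under the authentic models than under the machine-translated ones, and it is this contrast that a downstream classifier can learn from.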
  • 22. Intrinsic Plagiarism Detection Problem IPD Task Detecting the plagiarized parts of a given document by analyzing the writing style. Main Challenges • No external collection • No further possibilities to uncover plagiarism besides detecting suspicious text parts which significantly differ from the rest of the document • Even if suspicious text parts are found, there is still no guarantee that these parts are truly plagiarized
  • 23. PAN @ CLEF 2016 PAN: Uncovering Plagiarism, Authorship and Social Software Misuse • Held since 2007 • Offers: — Large-scale corpora for external (EPD) and intrinsic (IPD) plagiarism detection algorithms — Performance measure scheme
  • 24. PAN Tasks 1. Intrinsic plagiarism detection. 1.1 There exists one main author who wrote at least 70% of the text. 1.2 Up to the other 30% may be written by other authors. 2. Diarization with a given number (n) of authors. 2.1 There are n authors and no main author. 2.2 Each author may have contributed to an arbitrary extent. 3. Diarization with an unknown number of authors. 3.1 No information about how many authors contributed to the document.
  • 25. Solving the Problem Common scheme involves several stages: • text segmentation (sentences, blocks, paragraphs etc.), • mapping each segment to the feature space, • outlier detection (or clustering for author diarization). • Methods for Intrinsic Plagiarism Detection and Author Diarization: Notebook for PAN at CLEF 2016. In CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers, 5-8 September, Évora, Portugal, September 2016. CEUR-WS.org. ISSN 1613-0073.
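The three-stage scheme on slide 25 can be sketched as below. This is only an illustration under stated assumptions: the three stylometric features, the `FUNCTION_WORDS` list and the z-score outlier rule are hypothetical stand-ins chosen for brevity, not the features or detector from the PAN 2016 notebook.

```python
import math
import re

# A tiny, hypothetical function-word list; real systems use far richer features.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "it"}


def segment(text, size=2):
    """Stage 1: group sentences into blocks of `size` sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


def features(block):
    """Stage 2: map a block to [mean word length, function-word ratio, type/token ratio]."""
    words = re.findall(r"[a-z']+", block.lower())
    if not words:
        return [0.0, 0.0, 0.0]
    avg_len = sum(len(w) for w in words) / len(words)
    func_ratio = sum(w in FUNCTION_WORDS for w in words) / len(words)
    type_token = len(set(words)) / len(words)
    return [avg_len, func_ratio, type_token]


def outliers(vectors, z_threshold=1.5):
    """Stage 3: flag blocks whose largest per-feature z-score reaches the threshold."""
    dims = len(vectors[0])
    n = len(vectors)
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    stds = [math.sqrt(sum((v[d] - means[d]) ** 2 for v in vectors) / n)
            for d in range(dims)]
    flagged = []
    for i, v in enumerate(vectors):
        z = max(abs(v[d] - means[d]) / stds[d] if stds[d] else 0.0
                for d in range(dims))
        if z >= z_threshold:
            flagged.append(i)
    return flagged


def suspicious_blocks(text, size=2, z_threshold=1.5):
    """Run all three stages and return the blocks that look stylistically different."""
    blocks = segment(text, size)
    flagged = outliers([features(b) for b in blocks], z_threshold)
    return [blocks[i] for i in flagged]
```

For author diarization, stage 3 would be replaced by clustering the same feature vectors, as the slide notes. Flagged blocks are only stylistic outliers and are not proof of plagiarism.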
  • 27. Research Collaboration Opportunities for research collaboration include: • Joint non-profit studies • Custom research • Consulting and mentorship • Joint laboratories (joint & grant financing) • Internship opportunities • Thesis research
  • 28. Dialogue Evaluation’17 - Plagiarism Detection The PlagEvalRus workshop Focused on evaluation of Russian-specific plagiarism detection algorithms. The workshop emphasizes external plagiarism detection in scientific texts (academic plagiarism). With support of: • PAN • Dialogue conference • CyberLeninka www.dialog-21.ru/evaluation/2017/plageval/
  • 29. Thanks for your attention! Questions / Comments?