A General Method Applicable to the Search
for Anglicisms in Russian Social Network
Texts
Alena Fenogenova, Ilia Karpov, Viktor Kazorin
National Research University Higher School of Economics,
Scientific Research Institute KVANT
AINL 2016
Problem
The phenomenon of anglicisms in the Russian language
Task
Growing number of loanwords in Russian:
● a problem for NLP tasks such as spell-checking, tagging, sentiment detection, etc.
● of interest for theoretical research on language contact processes
We propose a method for detecting English loanwords (anglicisms) in Russian
social network texts
● unsupervised
● fully automated
● can be applied to any domain-specific area
Methodology
Idea: simultaneous orthographic, phonetic and semantic similarity between the
original Latin-script word and its Cyrillic analogue
Hypotheses generation:
A set of transliteration, phonetic transcription and morphological analysis
methods to generate possible hypotheses
Hypotheses filtering:
Distributional semantic models to filter the hypotheses
General Architecture
Corpus collection
LiveJournal blog platform:
● English and Russian
● 20,000,000 texts and comments
● from 100,000 Russian and English top bloggers
● large variety of themes
● users of different ages
● less plagiarism
Hypotheses generation
Transliteration and transcription
Idea:
Language speakers tend to preserve phonetic and orthographic properties of the
borrowed word
1) Transliteration
2) Transcription
Transliteration
The speaker has an intuition about the writing system of the foreign language
Corpus: Dyakov dictionary of anglicisms + manually created set
Method: Statistical
1) split words into syllables
2) generate bigrams and trigrams of syllables
3) weight them
4) result: the highest-weighted combination of syllables
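A minimal sketch of step 4, picking the highest-weighted combination by dynamic programming. The `WEIGHTS` table and its values are invented for illustration. In the method above the weights would come from syllable bigram and trigram statistics over the training pairs; this sketch simplifies that to independent chunk weights.

```python
import math

# Hypothetical weights: Latin chunk -> [(Cyrillic rendering, weight)].
# The numbers are made up for illustration; they are not from the slides.
WEIGHTS = {
    "foot": [("фут", 0.9), ("фоот", 0.1)],
    "ball": [("бол", 0.7), ("балл", 0.3)],
    "fa": [("фа", 0.6), ("фэ", 0.4)],
    "shion": [("шион", 0.6), ("шн", 0.4)],
}


def best_transliteration(word, weights=WEIGHTS):
    """Return the highest-scoring Cyrillic rendering of `word`, or None."""
    n = len(word)
    # best[i] holds (log-score, rendering) for the prefix word[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, "")
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue
            for cyrillic, weight in weights.get(word[start:end], []):
                candidate = (best[start][0] + math.log(weight),
                             best[start][1] + cyrillic)
                if best[end] is None or candidate[0] > best[end][0]:
                    best[end] = candidate
    return best[n][1] if best[n] else None


print(best_transliteration("football"))
```

Words that cannot be segmented with the known chunks return `None`, so such words simply produce no transliteration hypothesis.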
Transcription
Idea: the speaker is assumed to preserve the word's pronunciation when writing an
English word in Cyrillic script
Source: In-vocabulary lexis with pre-defined transcriptions: Cambridge Advanced
Learner’s Dictionary, Cambridge Academic Content Dictionary and Cambridge
Business English Dictionary
Method: context-dependent grammar + additional rules based on the practical
transcription of English named entities proposed by Gilyarevsky.
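As a rough illustration of the transcription step, the sketch below maps a dictionary pronunciation to Cyrillic with a greedy longest-match lookup. The phoneme table is a toy assumption, not Gilyarevsky's rules, and it leaves out the context-dependent part of the grammar. The IPA strings are illustrative inputs.

```python
# Hypothetical phoneme-to-Cyrillic table (IPA symbol -> Cyrillic letters).
# It covers only the two example words and is not the rule set from the slides.
PHONEME_TO_CYRILLIC = {
    "ɔː": "о", "f": "ф", "æ": "э", "ʃ": "ш", "ə": "э", "n": "н",
    "t": "т", "b": "б", "ʊ": "у", "l": "л",
}


def transcribe(ipa, table=PHONEME_TO_CYRILLIC):
    """Greedy longest-match conversion of an IPA string to Cyrillic."""
    # Drop stress marks and syllable dots; they carry no letters.
    for mark in ("ˈ", "ˌ", "."):
        ipa = ipa.replace(mark, "")
    max_len = max(len(symbol) for symbol in table)
    out, i = [], 0
    while i < len(ipa):
        for size in range(max_len, 0, -1):
            chunk = ipa[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no rule for phoneme {ipa[i]!r}")
    return "".join(out)


print(transcribe("ˈfæʃ.ən"))
print(transcribe("ˈfʊt.bɔːl"))
```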
Hypotheses generation
Transliteration and transcription
Possible hypotheses for how English words could be written in Russian
EN word     EN-RU            TR-RU         RU-EN            RU word
football    футбол(0)        футбол(0)     footbol(2)       футбол
brainstorm  браинсторм(1.5)  брэйнстом(1)  bryeynshtorm(3)  брейншторм
fashion     фашион(2)        фэшэн(1)      feshn(3)         фэшн
Table 1. Transliteration/Transcription hypotheses with Levenshtein distance.
Hypotheses generation
Word root extraction
Idea: a frequent type of anglicism is a Russian word consisting of an English
root plus Russian affixes
Method: RNN-based root extraction
Model:
● trained on 97,000 pairs (normal form, root)
● roots were extracted from WikiDictionary
● non-vocabulary words with frequency >= 30
Hypotheses generation. Levenshtein comparison
Root candidates* + transliteration/transcription candidates
→ Levenshtein distance (modified) with thresholds
→ resulting hypotheses
*also for compound words (both roots of a compound were checked)
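The slides do not say how the Levenshtein distance was modified. The sketch below shows one possible variant, assuming cheaper substitutions between letters that often alternate in borrowings; the fractional value 1.5 in Table 1 suggests that some costs are fractional. The `CLOSE_PAIRS` set, the 0.5 cost and the threshold are all assumptions.

```python
# Hypothetical pairs of letters treated as "close": substituting one for the
# other costs 0.5 instead of 1. The real modification is not specified.
CLOSE_PAIRS = {frozenset("еэ"), frozenset("ие"), frozenset("сш"), frozenset("ий")}


def sub_cost(a, b):
    """Substitution cost: 0 for equal letters, 0.5 for close pairs, else 1."""
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in CLOSE_PAIRS else 1.0


def weighted_levenshtein(s, t):
    """Edit distance with unit insert/delete cost and weighted substitution."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0,               # delete a
                           cur[j - 1] + 1.0,            # insert b
                           prev[j - 1] + sub_cost(a, b)))
        prev = cur
    return prev[-1]


def matching_candidates(root, candidates, threshold=1.0):
    """Keep candidates whose distance to the root is within the threshold."""
    return [c for c in candidates if weighted_levenshtein(root, c) <= threshold]


print(matching_candidates("фэшн", ["фэшэн", "фашион"]))
```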
Hypotheses filtering
Idea: in social network texts, many anglicisms are used in the same contexts in
both their English and Russian spellings
Method: distributional semantics
Algorithm: SkipGram
● For each resulting Cyrillic hypothesis, take its top 100 most similar words in the model
● If the English source word is in that SkipGram set, the word is an anglicism!
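The check above can be sketched in plain Python. This assumes word vectors are already available as a dictionary; in practice they would come from a SkipGram model trained on the mixed English and Russian corpus. The toy vectors and the function names are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_similar(word, vectors, topn=100):
    """Return the `topn` words closest to `word` by cosine similarity."""
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:topn]]


def is_anglicism(cyrillic, english, vectors, topn=100):
    """True if the English word is among the neighbours of the Cyrillic one."""
    if cyrillic not in vectors or english not in vectors:
        return False
    return english in top_similar(cyrillic, vectors, topn)


# Toy 3-dimensional vectors, invented for the example.
toy_vectors = {
    "фэшн": [0.9, 0.1, 0.0],
    "fashion": [0.8, 0.2, 0.1],
    "мода": [0.7, 0.3, 0.0],
    "football": [0.0, 0.1, 0.9],
}
print(is_anglicism("фэшн", "fashion", toy_vectors, topn=2))
```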
Hypotheses filtering
Problem: many anglicisms are rarely used by Russian speakers in their original
(English) spelling
Solution: translation and context search with a CBOW model
For each hypothesis, all of its contexts (5 words to the left and 5 words to the
right of the hypothesis) are translated.
If the top 100 most relevant English words for a context contain the English
analogue of the hypothesis in more than 50% of the contexts, we consider the
hypothesis confirmed.
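The 50% rule can be written as a small voting function. This is a sketch that assumes the per-context lists of top English words have already been produced by the translation and CBOW steps; the function and parameter names are hypothetical.

```python
def is_confirmed(english_word, context_neighbours, min_share=0.5):
    """Vote over contexts.

    `context_neighbours` holds one collection of top-N English words per
    translated context (assumed to come from the CBOW model).
    """
    if not context_neighbours:
        return False
    hits = sum(1 for words in context_neighbours if english_word in words)
    # "More than 50%" is read as a strict inequality.
    return hits / len(context_neighbours) > min_share


contexts = [{"fashion", "style"}, {"trend", "fashion"}, {"music", "show"}]
print(is_confirmed("fashion", contexts))
```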
Evaluation. Comparison with test set
The dictionary of anglicisms by A. I. Dyakov:
● has been available online since 2014
● contains about 15,000 lexical items (1,000 collocations)
● covers a wide range of spheres of life (economics, IT, marketing, etc.)
● includes loanwords from spoken language, various kinds of slang, professional
jargon and profanity
Evaluation
Match the resulting list against a manually created test set
Only 4321 words from the Dyakov dictionary are available in the model (LiveJournal).
2417 words are available in the SkipGram model
All hypotheses with LD < 1 that are missing from the Dyakov dictionary were annotated manually
1146 anglicisms were found by the method (863 + 283)
F-measure  Precision  Recall
0.38       0.84       0.24
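For reference, a short sketch of how an F-measure is derived from precision and recall. It assumes the reported F-measure is the balanced F1 score, which the slides do not state. With the rounded values above it gives about 0.37; the small gap to the reported 0.38 could come from rounding of precision and recall on the slide.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


print(round(f_measure(0.84, 0.24), 3))
```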
Conclusion
1. The method is fully automated; it works and finds anglicisms!
2. About 1146 of the 4300 words from the manual dictionary were found
3. Language transfers. The method catches borrowings not only from English
to Russian, but vice versa as well (Ex.: pogrom, vodka)
4. The results are valuable for theoretical research and can be applied
in practical systems.
Future work
Improve transliteration/transcription. Reduce the number of hypotheses:
● Russian diphthongs are not the same as English ones. Some English cases cannot
be rendered in Russian this way.
● Morphological dependencies (song - сонг, but running - раннин)
Interesting cases of anglicisms:
● English loanwords borrowed from third languages (Ex.: Sheikh)
● Difficult cases such as homonymy (Ex.: пост, док)
Future work
● Comparative study of anglicisms in different languages (language contact
processes: German vs. Russian vs. French examples)
● Borrowing processes depending on the time of borrowing
(Ex.: борда or борд)
● Use of other corpora (Twitter, VKontakte)
● Online service with lists of new anglicisms
Thank you for your attention!
