Nakov S., Nakov P., Paskaleva E., Cognate or False Friend? Ask the Web!, Proceedings of the International Workshop "Acquisition and Management of Multilingual Lexicons", part of the International Conference RANLP 2007, pp. 55-62, ISBN 978-954-452-004-5, Borovets, Bulgaria, 30 September 2007
Automatic Identification of False Friends in Parallel Corpora: Statistical and Semantic Approach - Svetlin Nakov
Scientific article: False friends are pairs of words in two languages that are perceived as similar but have different meanings. We present an improved algorithm for acquiring false friends from a sentence-aligned parallel corpus, based on statistical observations of word occurrences and co-occurrences in the parallel sentences. The results are compared with an entirely semantic measure of cross-lingual similarity between words, which uses the Web as a corpus and analyzes the words' local contexts extracted from the text snippets returned by searching in Google. The statistical and semantic measures are further combined into an improved algorithm for identifying false friends that achieves almost twice the accuracy of previously known algorithms. The evaluation is performed on identifying cognates between Bulgarian and Russian, but the proposed methods can be adapted to other language pairs for which parallel corpora and bilingual glossaries are available.
A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words - Svetlin Nakov
Scientific article: We propose a novel knowledge-rich approach to measuring the similarity between a pair of words. The algorithm is tailored to Bulgarian and Russian and takes into account the orthographic and phonetic correspondences between the two Slavic languages: it combines lemmatization, hand-crafted transformation rules, and weighted Levenshtein distance. The experimental results show an 11-pt interpolated average precision of 90.58%, a sizeable improvement over two classic rival approaches.
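The combination described above (correspondence-aware edit costs normalized into a similarity score) can be sketched roughly as follows. The transformation rules and their weights here are illustrative assumptions, not the paper's actual rule set or parameters:

```python
# Sketch of a weighted Levenshtein distance with hand-crafted
# Bulgarian-Russian letter-correspondence rules.
# The rule set and the 0.5 weight below are invented for illustration.

# Substitutions between known BG/RU correspondences are made cheap.
CHEAP_SUBS = {("ъ", "о"), ("о", "ъ"), ("е", "ё"), ("ё", "е"), ("и", "ы"), ("ы", "и")}

def weighted_levenshtein(a: str, b: str) -> float:
    """Edit distance where known cross-language substitutions cost less."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = float(i)
    for j in range(m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            elif (a[i - 1], b[j - 1]) in CHEAP_SUBS:
                sub = 0.5   # known correspondence: half cost
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + 1.0,       # deletion
                          d[i][j - 1] + 1.0,       # insertion
                          d[i - 1][j - 1] + sub)   # substitution
    return d[n][m]

def similarity(a: str, b: str) -> float:
    """Normalize the distance into a [0, 1] similarity score."""
    return 1.0 - weighted_levenshtein(a, b) / max(len(a), len(b), 1)
```

In the real system, lemmatization would be applied before measuring the distance, and the rules would cover multi-character correspondences as well.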
Nakov S., Nakov P., Paskaleva E., Improved Word Alignments Using the Web as a Corpus - Svetlin Nakov
Nakov P., Nakov S., Paskaleva E., Improved Word Alignments Using the Web as a Corpus, Proceedings of the International Conference RANLP 2007 (Recent Advances in Natural Language Processing), pp. 400-405, ISBN 978-954-91743-7-3, Borovets, Bulgaria, 27-29 September 2007
SYNTACTIC ANALYSIS BASED ON MORPHOLOGICAL CHARACTERISTIC FEATURES OF THE ROMANIAN LANGUAGE - kevig
This paper addresses the syntactic analysis of phrases in Romanian, an important process in natural language processing. We suggest a real-time solution based on the idea of using certain words or groups of words that indicate a grammatical category, together with specific endings of some parts of the sentence. Our idea builds on characteristics of the Romanian language, where certain prepositions, adverbs, or specific endings can provide a lot of information about the structure of a complex sentence. Such characteristics can also be found in other languages, such as French. Using a special grammar, we developed a system (DIASEXP) that can carry on a dialogue in natural language with assertive and interrogative sentences about a “story” (a set of sentences describing events from real life).
Cognate or False Friend? Ask the Web! (Distinguish between Cognates and False Friends) - Svetlin Nakov
Scientific article: We propose a novel unsupervised semantic method for distinguishing cognates from false friends. The basic intuition is that if two words are cognates, then most of the words in their respective local contexts should be translations of each other. The idea is formalised using the Web as a corpus, a glossary of known word translations used as cross-linguistic “bridges”, and the vector space model. Unlike traditional orthographic similarity measures, our method can easily handle words with identical spelling. The evaluation on 200 Bulgarian-Russian word pairs shows this is a very promising approach.
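The contextual measure can be sketched roughly like this: count glossary "bridge" words occurring in each word's local contexts, then compare the resulting vectors with the cosine. The toy glossaries and contexts below are invented for illustration, not the paper's actual data:

```python
# Sketch of cross-lingual contextual similarity via glossary "bridges":
# context words of each target word are mapped through a bilingual
# glossary into a shared bridge space, then compared with the cosine.
from collections import Counter
from math import sqrt

def context_vector(contexts, glossary):
    """Count glossary bridge words occurring in the target word's contexts.
    `glossary` maps source-language words to a shared bridge id."""
    vec = Counter()
    for snippet in contexts:
        for token in snippet.lower().split():
            if token in glossary:
                vec[glossary[token]] += 1
    return vec

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical toy glossaries and Web snippets:
glossary_bg = {"празник": "holiday", "ден": "day"}
glossary_ru = {"праздник": "holiday", "день": "day"}
v1 = context_vector(["велик празник и почивен ден"], glossary_bg)
v2 = context_vector(["большой праздник и выходной день"], glossary_ru)
# v1 and v2 now live in the shared bridge space, so cosine(v1, v2) is high
```

In the real method, the contexts come from text snippets returned by a Web search engine, and the glossary covers thousands of known translation pairs.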
Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web as a Corpus - Svetlin Nakov
Scientific paper: False friends are pairs of words in two languages that are perceived as similar, but have different meanings, e.g., Gift in German means poison in English. In this paper, we present several unsupervised algorithms for acquiring such pairs from a sentence-aligned bi-text. First, we try different ways of exploiting simple statistics about monolingual word occurrences and cross-lingual word co-occurrences in the bi-text. Second, using methods from statistical machine translation, we induce word alignments in an unsupervised way, from which we estimate lexical translation probabilities, which we use to measure cross-lingual semantic similarity. Third, we experiment with a semantic similarity measure that uses the Web as a corpus to extract local contexts from text snippets returned by a search engine, and a bilingual glossary of known word translation pairs, used as “bridges”. Finally, all measures are combined and applied to the task of identifying likely false friends. The evaluation for Russian and Bulgarian shows a significant improvement over previously-proposed algorithms.
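One simple instance of the first family of measures above, cross-lingual similarity from sentence-level co-occurrence counts, is the Dice coefficient. This is a sketch of the general idea, one plausible statistic among the several the paper compares, not necessarily the exact one used:

```python
# Dice coefficient over sentence-level occurrences in a bi-text:
# how often w1 (source side) and w2 (target side) occur in
# aligned sentence pairs, relative to their individual frequencies.

def dice(bitext, w1, w2):
    """bitext: list of (src_sentence_tokens, tgt_sentence_tokens) pairs."""
    n1 = sum(1 for src, _ in bitext if w1 in src)           # w1 occurrences
    n2 = sum(1 for _, tgt in bitext if w2 in tgt)           # w2 occurrences
    n12 = sum(1 for src, tgt in bitext if w1 in src and w2 in tgt)  # co-occurrences
    return 2 * n12 / (n1 + n2) if n1 + n2 else 0.0
```

Words that consistently appear in aligned sentences score near 1 (likely translations, hence cognates if also spelled similarly); similarly spelled words that never co-occur are false-friend candidates.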
Improved Word Alignments Using the Web as a Corpus - Svetlin Nakov
Scientific article: We propose a novel method for improving word alignments in a parallel sentence-aligned bilingual corpus based on the idea that if two words are translations of each other then so should be many words in their local contexts. The idea is formalised using the Web as a corpus, a glossary of known word translations (dynamically augmented from the Web using bootstrapping), the vector space model, linguistically motivated weighted minimum edit distance, competitive linking, and the IBM models. Evaluation results on a Bulgarian-Russian corpus show a sizable improvement both in word alignment and in translation quality.
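Competitive linking, one component of the pipeline above, can be sketched as a greedy one-to-one matching over pair scores; the scoring input here is a hypothetical example for illustration:

```python
# Sketch of competitive linking: repeatedly link the highest-scoring
# word pair whose source and target positions are both still unlinked.

def competitive_linking(scores):
    """scores: dict mapping (src_index, tgt_index) -> similarity score.
    Returns a one-to-one alignment as a list of (src, tgt, score) links."""
    links, used_src, used_tgt = [], set(), set()
    for (i, j), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if i not in used_src and j not in used_tgt:
            links.append((i, j, s))
            used_src.add(i)
            used_tgt.add(j)
    return links
```

For example, with scores {(0, 0): 0.9, (0, 1): 0.8, (1, 1): 0.7}, the pair (0, 1) is skipped because source position 0 is already taken by the stronger link (0, 0).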
International Refereed Journal of Engineering and Science (IRJES) - irjes
The core vision of IRJES is to disseminate new knowledge and technology for the benefit of all, from academic research and professional communities to industry professionals, across a range of topics in computer science and engineering. It also provides a venue for high-caliber researchers, practitioners, and PhD students to present ongoing research and development in these areas.
Syntactic Analysis Based on Morphological Characteristic Features of the Romanian Language - kevig
KEYWORDS
Natural Language Processing, Syntactic Analysis, Morphology, Grammar, Romanian Language
http://airccse.org/journal/ijnlc/papers/1412ijnlc01.pdf
Labotka, D., Sabo, E., Bonais, R., Gelman, S. A., & Baptista, M. (2023). Testing the effects of congruence in adult multilingual acquisition with implications for creole genesis. Cognition, 235, 105387. https://doi.org/10.1016/j.cognition.2023.105387
The Most In-Demand Fields in IT for 2024 - Svetlin Nakov
What are the most in-demand fields in IT?
Dr. Svetlin Nakov, co-founder of SoftUni
Sofia, May 2024
What is happening on the labor market in the IT sector?
What are the forecasts for the IT sector going forward?
Why does it make sense to learn programming and IT in 2024?
What is the role of AI in the IT professions?
How do I start a job as a junior?
SoftUni's upskill programs
BG-IT-Edu: Open Educational Content for IT Teachers - Svetlin Nakov
Open educational content on programming and IT for teachers
Free training courses and resources for IT teachers
Courses developed in BG-IT-Edu as of 03/2024
https://github.com/BG-IT-Edu
High-quality training courses (educational content) for IT teachers: presentations + examples + exercises + projects + assessment tasks + a judge system + guidelines for teachers
Available for free, under the open CC-BY-NC-SA license
Developed by the SoftUni Foundation, at the initiative and under the supervision of Dr. Svetlin Nakov
Learn more here: https://nakov.com/blog/2024/03/27/bg-it-edu-open-education-content-for-it-teachers/
The World of Programming in 2024
Is the boom in tech professions continuing? Which professions will be in demand? How do I get started?
Dr. Svetlin Nakov's forecast for the future of the software professions in Bulgaria
Does it make sense to learn programming in 2024?
What is in demand on the labor market?
Will the demand for programmers continue into 2024?
Is it still the most sought-after profession in tech?
The role of AI in the world of software developers
What is happening on the labor market?
Is there a decline in the demand for programmers?
How do I get started with programming?
We have uploaded a video of the event to FB: https://fb.com/events/346653434644683
AI Tools for Business and Startups
Svetlin Nakov @ Innowave Summit 2023
Artificial Intelligence is already here!
AI Tools for Business: Where is AI Used?
ChatGPT and Bard in Daily Tasks
ChatGPT and Bard for Creativity
ChatGPT and Bard for Marketing
ChatGPT for Data Analysis
DALL-E for Image Generation
Learn more at: https://nakov.com/blog/2023/11/25/ai-for-business-and-startups-my-talk-at-innowave-summit-2023/
AI Tools for Scientists - Nakov (Oct 2023) - Svetlin Nakov
AI-powered tools to support researchers
Dr. Svetlin Nakov @ Anniversary Scientific Session dedicated to the 120th anniversary of the birth of John Atanasoff
Artificial Intelligence in Starting and Running a Business
A seminar at FinanceAcademy.bg
Dr. Svetlin Nakov
Artificial intelligence (AI) is already here!
Where is AI used?
ChatGPT – demo
Bard – demo
Claude – demo
Bing Chat – demo
Perplexity – demo
Bing Image Create – demo
Bulgarian Tech Industry - Nakov at Dev.BG All in One Conference 2023 - Svetlin Nakov
The IT industry in Bulgaria: key factors for its success and its future. Deep tech, science, innovation, and education: how can we achieve more as an industry?
Dr. Svetlin Nakov
Innovation and Inspiration Manager @ SoftUni
Contents:
How big is the IT industry in Bulgaria?
Number of software professionals in Bulgaria: according to historical data from BASSCOM
Share of the software industry in GDP
Why does Bulgaria have such a successful IT industry?
Education for the tech industry: school education in software professions and profiles (2022/2023)
Education for the tech industry: Students in university in IT specialties (2022/2023)
Education for the IT industry: Learners at SoftUni (2022/2023)
Evolution of the Bulgarian software industry
How much can the industry grow?
Trends in the IT industry: AI progress, the IT market in Bulgaria, deep tech, science, and innovation
AI in the software industry
How to achieve more as an industry: education, deep tech, science and innovation, entrepreneurship
Introduction
The IT industry in Bulgaria is one of the most successful in the country. It has grown rapidly in recent years and is now a major contributor to the economy. In this talk, Dr. Nakov explores the key factors behind the success of the Bulgarian IT industry, as well as its future prospects.
AI Tools for Business and Personal Life - Svetlin Nakov
A talk at LeaderClass.BG, Sofia, August 2023
by Svetlin Nakov, PhD
Artificial intelligence (AI) is here!
Where is AI used?
ChatGPT - demo
Bing Chat - demo
Bard - demo
Claude - demo
Bing Image Create - demo
Playground AI - demo
In this talk the speaker explains and demonstrates some AI tools for business and personal life:
ChatGPT: a large language model that can generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way.
Bing Chat: an Internet-connected AI chatbot that can search the Internet and answer questions.
Bard: a large language model from Google AI, trained on a massive dataset of text and code, similar to ChatGPT.
Claude: a large AI chatbot, similar to ChatGPT, powerful at document analysis.
Bing Image Create: a tool that can generate images based on text descriptions.
Playground AI: image generator and image editor, based on generative AI.
Thesis: Educational Content on OOP - Svetlin Nakov
A master's thesis on the topic
"Educational Content on Object-Oriented Programming in the Specialized Informatics Track"
Candidate: Dr. Svetlin Nakov
Program: Pedagogy of Informatics and Information Technology Education in Schools (POIITU)
Degree: Master's
Plovdiv University "Paisii Hilendarski"
Faculty of Mathematics and Informatics (FMI)
Department of Computer Technologies
This thesis aims to support Bulgarian IT teachers in the secondary-education system at specialized high schools and classes by providing them, free of charge, with well-developed curricula and high-quality educational content for teaching the first and most important module of the "Software and Hardware Sciences" profile, namely "Module 1. Object-Oriented Design and Programming".
By building high-quality curricula and teaching resources, and by transferring well-tested educational practices from the author of this thesis (Dr. Svetlin Nakov) to Bulgarian IT teachers, we aim to significantly support teachers in their educational cause and to raise the quality of programming education in specialized high schools with the "Software and Hardware Sciences" profile.
The results of this thesis have already been put into practice: the developed educational resources are actively used by hundreds of IT teachers in Bulgaria in their daily work. This is one of the main goals, and it was achieved even before the thesis defense.
Read more on Dr. Nakov's blog: https://nakov.com/blog/2023/07/08/free-learning-content-oop-nakov/
Thesis: Educational Content on OOP - Svetlin Nakov
A presentation for the defense of
a master's thesis on the topic
"Educational Content on Object-Oriented Programming in the Specialized Informatics Track"
Candidate: Dr. Svetlin Nakov
Plovdiv University "Paisii Hilendarski"
Faculty of Mathematics and Informatics (FMI)
Department of Computer Technologies
Defended on: 8 July 2023
Learn more on Dr. Nakov's blog: https://nakov.com/blog/2023/07/08/free-learning-content-oop-nakov
Free IT Educational Content for Programming and IT Teachers - Svetlin Nakov
In this session I talk about school education in programming and IT, the vocational and specialized high schools related to programming and IT, STEM classrooms, Bulgarian IT teachers and their training, the problems they face (including the lack of textbooks and learning materials on programming, IT, and technical subjects), and how we can help them through the "Free Educational Content on Programming and IT" project: https://github.com/BG-IT-Edu.
Open Fest 2021, 14 August, Sofia
Worldwide, there is a growing trend of establishing STEAM education as a driver of scientific and technological progress by developing interdisciplinary school-age skills in the natural sciences, mathematics, information technology, engineering, and the arts. With the mass construction of STEAM laboratories in Bulgarian schools, the shortage of well-prepared STEAM and IT teachers is becoming more acute.
Believing that the Bulgarian IT community can help solve this problem, in 2020, at the initiative of the SoftUni Foundation, a project was launched to create free educational content on programming and IT for teachers, in support of school technology education. The project is openly licensed on GitHub: https://github.com/BG-IT-Edu. Teachers receive, free of charge, a rich set of modern, high-quality learning materials: presentations, step-by-step guides, exercises and practical projects complete with guidelines, hints, and solutions, a free system for automated testing of solutions, and other learning resources, in Bulgarian and English.
A large part of the courses for the "Applied Programmer", "System Programmer", and "Programmer" professions in the vocational high schools have already been created. The project's ambition is to also create free learning materials for the "Software and Hardware Sciences" profile in the specialized high schools.
The goal of the "Free IT Educational Content for Teachers" project is to support Bulgarian IT teachers with quality learning materials, so that they can teach at a good level with modern technologies and tools, laying the foundations for training Bulgaria's future IT specialists and digital leaders.
A public talk "AI and the Professions of the Future", held on 29 April 2023 in Veliko Tarnovo by Svetlin Nakov. Main topics:
AI is here today --> pay attention to it!
- ChatGPT: revolution in language AI
- Playground AI – AI for image generation
AI and the future professions
- AI-replaceable professions
- AI-resistant professions
AI in Education
Ethics in AI
In this seminar Dr. Svetlin Nakov talks about the programming languages, their popularity, available jobs and trends for 2022-2023.
Modern software development uses dozens of programming languages, along with hundreds of technology frameworks, libraries, and software tools.
This talk will review the most popular programming languages on the labor market: JavaScript, Java, C#, Python, PHP, C++, Go, Swift. It will be briefly stated what each of them is, what it is used for and what is its demand in the IT industry.
Agenda:
The Most Used Programming Languages in 2022:
- Python, Java, JavaScript, C#, C++, PHP
Jobs by Programming Languages in 2022:
- Jobs Worldwide by Programming Language
- Jobs in Bulgaria by Programming Language
Programming Languages Trends for 2023
- Language Popularity Rankings from Stack Overflow, GitHub, PYPL, IEEE, TIOBE, Etc.
Become a Software Developer: How To Start?
IT Professions and How to Become a DeveloperSvetlin Nakov
IT Professions and Their Future
The landscape of IT professions in the tech industry: software developer, front-end, back-end, AI, cloud, DevOps, QA, Java, JavaScript, Python, C#, C++, digital marketing, SEO, SEM, SMM, project manager, business analyst, CRM / ERP consultant, design / UI / UX expert, Web designer, motion designer, etc.
Industry 4.0 and the future of manufacturing, smart cities and digitalization of everything.
What are the most in-demand professions on LinkedIn? Why the best jobs in the world are related to software development and IT?
How to learn coding and start a tech job?
Why anyone can be a software developer?
Dr. Svetlin Nakov
December 2022
GitHub Actions (Nakov at RuseConf, Sept 2022)Svetlin Nakov
Building a CI/CD System with GitHub Actions
Dr. Svetlin Nakov
September 2022
Intro to Continuous Integration (CI), Continuous Delivery (CD), Continuous Deployment (CD), and CI/CD Pipelines
Intro to GitHub Actions
Building CI workflow
Building CD workflow
Live Demo: Build CI System for JS and .NET Apps
How to Become a QA Engineer and Start a JobSvetlin Nakov
How to Become a QA Engineer and Start a Job
Svetlin Nakov, PhD
Sofia, Nov, 2022
1) Why Become а QA Engineer?
2) How to Become a Software Quality Assurance Engineer (QA)?
Learn the Fundamental QA Skills:
Manual QA skills – 30%
Software engineering skills – 20%
QA automation skills – 50%
3) Anyone Can Be a QA Engineer!
4) The QA Program @ SoftUni
https://softuni.bg/qa
Призвание и цели: моята рецепта
д-р Светлин Наков
Как да открия своето призвание в живота и да го направя своя професия?
Съдържание:
1) Светлин Наков - моята история
2) Дигиталните професии
3) Как да намеря своята професия?
Поход на вдъхновителите | Аз мога тук и сега @ Петрич (19 ноември 2022 г.)
What Mongolian IT Industry Can Learn from Bulgaria?Svetlin Nakov
In this talk the speaker Dr. Svetlin Nakov explains the growth of the Bulgarian software industry for the latest 25 years and the key events in its growth.
He gives rich statistical data about Bulgaria, its GDP, and its software industry, which generates nearly 5% of the GDP (in 2022), about the open developer positions (5K in Oct 2022) and the number of software developers in Bulgaria (54K in Oct 2022).
The growth of the number of software developers in Bulgaria, software industry's revenues and their share in the national GDP are traced back from 2022 to 2005.
Similar research about the Mongolian software industry is conducted and available data is compared to Bulgaria and USA.
An interesting parallel is given between Bulgaria and Mongolia in terms of their software engineering talent and industry growth potential.
Finally, Dr. Nakov gives his recommendations about how local Mongolian software companies can reach the global tech market and suggests to use the "outstaffing" business model, targeting the European tech industry and the Asian region.
The talk is given at the "The Future of IT" forum in Ulaanbaatar, Mongolia (October 2022).
How to Become a Software Developer - Nakov in Mongolia (Oct 2022)Svetlin Nakov
In this talk the speaker Svetlin Nakov explains the IT professions and their extremely high demand in the latest years, and gives a recipe how to become a software engineer.
He recommends to spend 1-2 years in studying and practicing software engineering, following a learning curriculum like this:
Basic Coding Course – calculations, data, conditions, loops, IDE
Fundamentals of Programming – arrays, lists, maps, nested structures, text processing, error handling, basic language APIs, problem solving
Object-Oriented Programming – classes, objects, inheritance, …
Databases and ORM – relational DB, SQL, ORM frameworks, XML, JSON
Back-End Development – HTTP, MVC, Web apps, REST
Front-End Development – HTML, CSS, JS, DOM, AJAX, JS Frameworks
Projects – Git, software engineering, teamwork
Example: https://softuni.org/learn
The talk is from the "The Future of IT" forum in Ulaanbaatar, Mongolia (October 2022).
Nakov S., Nakov P., Paskaleva E., Cognate or False Friend? Ask the Web!
Cognate or False Friend? Ask the Web!

Svetlin Nakov (Sofia University, 5 James Boucher Blvd., Sofia, Bulgaria; nakov@fmi.uni-sofia.bg)
Preslav Nakov (Univ. of Cal. Berkeley, EECS, CS division, Berkeley, CA 94720; nakov@cs.berkeley.edu)
Elena Paskaleva (Bulgarian Academy of Sciences, 25A Acad. G. Bonchev Str., Sofia, Bulgaria; hellen@lml.bas.bg)

Abstract

We propose a novel unsupervised semantic method for distinguishing cognates from false friends. The basic intuition is that if two words are cognates, then most of the words in their respective local contexts should be translations of each other. The idea is formalised using the Web as a corpus, a glossary of known word translations used as cross-linguistic “bridges”, and the vector space model. Unlike traditional orthographic similarity measures, our method can easily handle words with identical spelling. The evaluation on 200 Bulgarian-Russian word pairs shows this is a very promising approach.

Keywords

Cognates, false friends, semantic similarity, Web as a corpus.

1 Introduction

Linguists define cognates as words derived from a common root. For example, the Electronic Glossary of Linguistic Terms gives the following definition [5]:

    Two words (or other structures) in related languages are cognate if they come from the same original word (or other structure). Generally cognates will have similar, though often not identical, phonological and semantic structures (sounds and meanings). For instance, Latin tu, Spanish tú, Greek sú, German du, and English thou are all cognates; all mean ‘second person singular’, but they differ in form and in whether they mean specifically ‘familiar’ (non-honorific).

Following previous researchers in computational linguistics [4, 22, 25], we adopt a simplified definition, which ignores origin, defining cognates (or true friends) as words in different languages that are translations and have a similar orthography. Similarly, we define false friends as words in different languages with similar orthography that are not translations. Here are some identically-spelled examples of false friends:

• pozor (позор) means a disgrace in Bulgarian, but attention in Czech;
• mart (март) means March in Bulgarian, but a market in English;
• Gift means a poison in German, but a present in English;
• Prost means cheers in German, but stupid in Bulgarian.

And some examples with a different orthography:

• embaraçada means embarrassed in Portuguese, while embarazada means pregnant in Spanish;
• spenden means to donate in German, but to spend means to use up or to pay out in English;
• bachelier means a person who passed his bac exam in French, but in English bachelor means an unmarried man;
• babichka (бабичка) means an old woman in Bulgarian, but babochka (бабочка) is a butterfly in Russian;
• godina (година) means a year in Russian, but godzina is an hour in Polish.

In the present paper, we describe a novel semantic approach to distinguishing cognates from false friends. The paper is organised as follows: Section 2 explains the method, section 3 describes the resources, section 4 presents the data set, section 5 describes the experiments, section 6 discusses the results of the evaluation, and section 7 points to important related work. We conclude with directions for future work in section 8.

2 Method

2.1 Contextual Web Similarity

We propose an unsupervised algorithm which, given a Russian word wru and a Bulgarian word wbg to be compared, measures the semantic similarity between them using the Web as a corpus and a glossary G of known Russian-Bulgarian translation pairs, used as “bridges”. The basic idea is that if two words are translations, then the words in their respective local contexts should be translations as well. The idea is formalised using the Web as a corpus, a glossary of known word translations serving as cross-linguistic “bridges”, and the vector space model. We measure the semantic similarity between a Bulgarian and a Russian word, wbg and wru, by constructing corresponding contextual semantic vectors Vbg and Vru, translating Vru into Bulgarian, and comparing it to Vbg.
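The computation of Section 2.1 reduces to three steps: build context-frequency vectors, translate the Russian vector into Bulgarian through the glossary G, and take the cosine. Below is a minimal sketch in Python; the toy glossary and context frequencies are hypothetical stand-ins for the counts the paper actually extracts from Google snippets:

```python
from collections import Counter
from math import sqrt

def translate_vector(v_ru, glossary_ru2bg):
    """Translate a Russian context vector into Bulgarian via the glossary G.
    Multiple Bulgarian translations of one Russian word split its frequency
    equally; several Russian words mapping to the same Bulgarian word sum up."""
    v_bg = Counter()
    for word, freq in v_ru.items():
        translations = glossary_ru2bg.get(word, [])
        for t in translations:
            v_bg[t] += freq / len(translations)
    return v_bg

def cosine(v1, v2):
    """Cosine between two sparse frequency vectors."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy example (hypothetical context frequencies, not real Web counts):
glossary = {"вода": ["вода"], "пить": ["пия"]}
v_ru = Counter({"вода": 3, "пить": 2})   # local contexts of the Russian word
v_bg = Counter({"вода": 4, "пия": 1})    # local contexts of the Bulgarian word
similarity = cosine(v_bg, translate_vector(v_ru, glossary))
```

The equal-split/sum conventions for multiple translations match the description in Section 2.1; everything else (word choices, counts) is illustrative only.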
The process of building Vbg starts with a query to Google limited to Bulgarian pages for the target word wbg. We collect the resulting text snippets (up to 1,000), and we remove all stop words – prepositions, pronouns, conjunctions, interjections and some adverbs. We then identify the occurrences of wbg, and we extract three words on either side of it. We filter out the words that do not appear on the Bulgarian side of G. Finally, for each retained word, we calculate the number of times it has been extracted, thus producing a frequency vector Vbg. We repeat the procedure for wru to obtain a Russian frequency vector Vru, which is then “translated” into Bulgarian by replacing each Russian word with its translation(s) in G, retaining the co-occurrence frequencies. In case of multiple Bulgarian translations for some Russian word, we distribute the corresponding frequency equally among them, and in case of multiple Russian words with the same Bulgarian translation, we sum up the corresponding frequencies. As a result, we end up with a Bulgarian vector Vru→bg for the Russian word wru. Finally, we calculate the semantic similarity between wbg and wru as the cosine between their corresponding Bulgarian vectors, Vbg and Vru→bg.

2.2 Reverse Context Lookup

The reverse context lookup is a modification of the above algorithm. The original algorithm implicitly assumes that, given a word w, the words in the local context of w are semantically associated with it, which is often wrong due to Web-specific words like home, site, page, click, link, download, up, down, back, etc. Since their Bulgarian and Russian equivalents are in the glossary G, we can get very high similarity for unrelated words. For the same reason, we cannot judge such navigational words as true/false friends.

The reverse context lookup copes with the problem as follows: in order to consider w associated with a word wc from the local context of w, it requires that w appear in the local context of wc as well¹. More formally, let #(x, y) be the number of occurrences of x in the local context of y. The strength of association is calculated as p(w, wc) = min{#(w, wc), #(wc, w)} and is used in the vector coordinates instead of #(w, wc), which is used in the original algorithm.

2.3 Web Similarity Using Seed Words

For comparison purposes, we also experiment with the seed words algorithm of Fung&Yee’98 [12], which we adapt to use the Web. We prepare a small glossary of 300 Russian-Bulgarian word translation pairs, which is a subset of the glossary used for our contextual Web similarity algorithm². Given a Bulgarian word wbg and a Russian word wru to compare, we build two vectors, one Bulgarian (Vbg) and one Russian (Vru), both of size 300, where each coordinate corresponds to a particular glossary entry (gru, gbg). Therefore, we have a direct correspondence between the coordinates of Vbg and Vru. The coordinate value for gbg in Vbg is calculated as the total number of co-occurrences of wbg and gbg on the Web, where gbg immediately precedes or immediately follows wbg. This number is calculated using Google page hits as a proxy for bigram frequencies: we issue two exact phrase queries “wbg gbg” and “gbg wbg”, and we sum the corresponding numbers of page hits. We repeat the same procedure with wru and gru in order to obtain the values for the corresponding coordinates of the Russian vector Vru. Finally, we calculate the semantic similarity between wbg and wru as the cosine between Vbg and Vru.

3 Resources

3.1 Grammatical Resources

We use two monolingual dictionaries for lemmatisation. For Bulgarian, we have a large morphological dictionary, containing about 1,000,000 wordforms and 70,000 lemmata [29], created at the Linguistic Modeling Department, Institute for Parallel Processing, Bulgarian Academy of Sciences. Each dictionary entry consists of a wordform, a corresponding lemma, followed by morphological and grammatical information. There can be multiple entries for the same wordform, in case of multiple homographs. We also use a large grammatical dictionary of Russian in the same format, consisting of 1,500,000 wordforms and 100,000 lemmata, based on the Grammatical Dictionary of A. Zaliznjak [35]. Its electronic version was supplied by the Computerised fund of Russian language, Institute of Russian language, Russian Academy of Sciences.

3.2 Bilingual Glossary

We built a bilingual glossary using an online Russian-Bulgarian dictionary³ with 3,982 entries in the following format: a Russian word, an optional grammatical marker, optional stylistic references, and a list of Bulgarian translation equivalents. First, we removed all multi-word expressions. Then we combined each Russian word with each of its Bulgarian translations – due to polysemy/homonymy some words had multiple translations. As a result, we obtained a glossary G of 4,563 word-word translation pairs (3,794 if we exclude the stop words).

3.3 Huge Bilingual Glossary

Similarly, we adapted a much larger Bulgarian-Russian electronic dictionary, transforming it into a bilingual glossary with 59,583 word-word translation pairs.

4 Data Set

4.1 Overview

Our evaluation data set consists of 200 Bulgarian-Russian pairs – 100 cognates and 100 false friends. It has been extracted from two large lists of cognates and false friends, manually assembled by a linguist from several monolingual and bilingual dictionaries. We limited the scope of our evaluation to nouns only.

¹ These contexts are collected using a separate query for wc.
² We chose those 300 words from the glossary that occur most frequently on the Web.
³ http://www.bgru.net/intr/dictionary/
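The reverse-context weighting of Section 2.2 amounts to replacing the raw count #(w, wc) by the minimum of the two directional counts. A small sketch, assuming a hypothetical pre-collected table of directional context counts (the paper actually issues a separate Web query for each wc):

```python
def association_strength(counts, w, wc):
    """p(w, wc) = min{#(w, wc), #(wc, w)}: wc counts as associated with w
    only to the extent that each word also appears in the other's local
    context. counts[(x, y)] holds the number of occurrences of x in the
    local context of y (a hypothetical pre-collected table)."""
    return min(counts.get((w, wc), 0), counts.get((wc, w), 0))

# A Web-navigation word like "site" occurs often in the contexts of the
# target "дом", but the target rarely occurs in the contexts of "site",
# so the reverse lookup suppresses it; a genuine association survives.
counts = {("дом", "site"): 0, ("site", "дом"): 25,
          ("вода", "пия"): 7, ("пия", "вода"): 5}
nav_weight = association_strength(counts, "дом", "site")   # suppressed to 0
real_weight = association_strength(counts, "вода", "пия")  # kept at min(7, 5)
```

All word pairs and counts above are illustrative; only the min{·,·} formula comes from the paper.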
As Table 1 shows, most of the words in our pairs constitute a perfect orthographic match: this is the case for 79 of the false friends and for 71 of the cognates. The remaining ones exhibit minor variations, e.g.:

• ы → и (r. рыба → b. риба, ‘a fish’);
• э → е (r. этаж → b. етаж, ‘a floor’);
• ь → ∅ (r. кость → b. кост, ‘a bone’);
• double consonant → single consonant (r. программа → b. програма, ‘a programme’);
• etc.

4.2 Discussion

There are two general approaches to testing a statistical hypothesis about a linguistic problem: (1) from the data to the rules, and (2) from the rules to the data. In the first approach, we need to collect a large number of instances of potential interest and then to filter out the bad ones using lexical and grammatical competence, linguistic rules as formulated in grammars and dictionaries, etc. This direction is from the data to the rules, and the final evaluation is made by a linguist. The second approach requires formulating the postulates of the method from a linguistic point of view, and then checking its consistency on a large volume of data. Again, the check is done by a linguist, but the direction is from the rules to the data.

We combined both approaches. We started with two large lists of cognates and false friends, manually assembled by a linguist from several monolingual and bilingual dictionaries: Bulgarian [2, 3, 8, 19, 20, 30], Russian [10], and Bulgarian-Russian [6, 28]. From these lists, we repeatedly extracted Russian-Bulgarian word pairs (nouns), cognates and false friends, which were further checked against our monolingual electronic dictionaries, described in section 3. The process was repeated until we were able to collect 100 cognates and 100 false friends.

Given an example pair exhibiting orthographic differences between Bulgarian and Russian, we tested the corresponding letter-sequence substitutions against our electronic dictionaries. While the four correspondences in section 4.1 have proven incontestable, others have been found inconsistent. For example, the correspondence between the Russian -оро- and the Bulgarian -ра- (e.g. r. горох → b. грах, ‘peas’), mentioned in many comparative Russian-Bulgarian studies, does not always hold, as it is formulated for root morphemes only. This correspondence fails in cases where these strings occur outside the root morpheme or at morpheme boundaries, e.g. r. плодородие → b. плодородие, ‘fruitfulness, fertility’. We excluded all examples exhibiting inconsistent orthographic alternations between Bulgarian and Russian.

The linguistic check against the grammatical dictionaries further revealed different interactions between orthography, part-of-speech (POS), grammatical function, and sense, suggesting the following degrees of falseness:

• Absolute falseness: same lemma, same POS, same number of senses, but different meanings for all senses, e.g. r. жесть (‘a tin’) and b. жест (‘a gesture’);

• Partial lemma falseness: same lemma, same POS, but different number of senses, and different meanings for some senses. For example, the r. бас (‘bass, voice’) is a cognate of the first sense of the b. бас, ‘bass, voice’, but a false friend of its second sense ‘a bet’ (the Russian for a bet is пари). On the other hand, the r. пари is a false friend of the first sense of the b. пари (‘money’); the Russian for money is деньги. In addition, the b. пари can also be the plural for the b. пара (‘a vapour’), which translates into Russian as испарения. This quite complex example shows that the falseness is not a symmetric cross-linguistic relation. It is shown schematically on Figure 1.

• Partial wordform falseness: the number of senses is not relevant, the POS can differ, the lemmata are different, but some wordform of the lemma in one language is identical to a wordform of the lemma in the second language. For example, the b. хотел (‘a hotel’) is the same as the inflected form r. хотел (past tense, singular, masculine of хотеть, ‘to want’).

Our list of true and false friends contains only absolute false friends and cognates, excluding any partial cognates. Note that we should not expect to have a full identity between all wordforms of the cognates for a given pair of lemmata, since the rules for inflections are different for Bulgarian and Russian.

Fig. 1: Falseness example: double lines link cognates, dotted lines link false friends, and solid lines link translations.
5 Experiments and Evaluation

In our experiments, we calculate the semantic (or orthographic) similarity between each pair of words from the data set. We then order the pairs in ascending order, so that the ones near the top are likely to be false friends, while those near the bottom are likely to be cognates. Following Bergsma&Kondrak’07 [4] and Kondrak&Sherif’06 [18], we measure the quality of the ranking using 11-point average precision. We experiment with the following similarity measures:

• Baseline

– baseline: random.

• Orthographic Similarity

– medr: minimum edit distance ratio, defined as medr(s1, s2) = 1 − MED(s1, s2) / max(|s1|, |s2|), where |s| is the length of the string s, and MED is the minimum edit distance or Levenshtein distance [21], calculated as the minimum number of edit operations – insert, replace, delete – needed to transform s1 into s2. For example, the MED between b. мляко (‘milk’) and r. молоко (‘milk’) is two: one replace operation (я → о) and one insert operation (of о). Therefore we obtain medr(мляко, молоко) = 1 − 2/6 ≈ 0.667;

– lcsr: longest common subsequence ratio [24], defined as lcsr(s1, s2) = |LCS(s1, s2)| / max(|s1|, |s2|), where LCS(s1, s2) is the longest common subsequence of s1 and s2. For example, LCS(мляко, молоко) = млко, and therefore lcsr(мляко, молоко) = 4/6 ≈ 0.667.

• Semantic Similarity

– seed: our implementation and adaptation of the seed words algorithm of Fung&Yee’98 [12];

– web3: the Web-based similarity algorithm with the default parameters: local context size of 3, the smaller bilingual glossary, stop words filtering, no lemmatisation, no reverse context lookup, no tf.idf-weighting;

– no-stop: web3 without stop words removal;

– web1: web3 with local context size of 1;

– web2: web3 with local context size of 2;

– web4: web3 with local context size of 4;

– web5: web3 with local context size of 5;

– web3+tf.idf: web3 with tf.idf-weighting;

– lemma: web3 with lemmatisation;

– lemma+tf.idf: web3 with lemmatisation and tf.idf-weighting;

– hugedict: web3 with the huge glossary;

– hugedict+tf.idf: web3 with the huge glossary and tf.idf-weighting;

– reverse: web3 with reverse context lookup;

– reverse+tf.idf: web3 with reverse context lookup and tf.idf-weighting;

– combined: web3 with lemmatisation, the huge glossary, and reverse context lookup;

– combined+tf.idf: combined with tf.idf-weighting.

Fig. 2: Evaluation, 11-point average precision. Comparing web3 with baseline and three old algorithms – lcsr, medr and seed.

First, we compare the two semantic similarity measures, web3 and seed, with the orthographic similarity measures, lcsr and medr, and with baseline; the results are shown on Figure 2. The baseline algorithm achieves 50% 11-point average precision, and outperforms the orthographic similarity measures lcsr and medr, which achieve 45.08% and 44.99%, respectively. This is not surprising, since most of our pairs consist of identically spelled words. The semantic similarity measures, seed and web3, perform much better, achieving 64.74% and 92.05% 11-point average precision. The huge absolute difference in performance (almost 30%) between seed and web3 suggests that building a dynamic set of textual contexts from which to extract words co-occurring with the target is a much better idea than using a fixed set of seed words and page hits as a proxy for Web frequencies.

Fig. 3: Evaluation, 11-point average precision. Different context sizes; keeping the stop words.
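The two orthographic baselines, medr and lcsr, are straightforward to implement. The following minimal Python sketch reproduces the мляко/молоко example from the definitions above:

```python
def med(s1, s2):
    """Levenshtein distance: minimum number of insert/replace/delete operations."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                  # delete
                            curr[j - 1] + 1,              # insert
                            prev[j - 1] + (c1 != c2)))    # replace / match
        prev = curr
    return prev[-1]

def lcs_len(s1, s2):
    """Length of the longest common subsequence of s1 and s2."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, 1):
            curr.append(prev[j - 1] + 1 if c1 == c2 else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def medr(s1, s2):
    return 1 - med(s1, s2) / max(len(s1), len(s2))

def lcsr(s1, s2):
    return lcs_len(s1, s2) / max(len(s1), len(s2))

# The paper's example: b. мляко vs. r. молоко (both 'milk')
print(round(medr("мляко", "молоко"), 3))   # 0.667  (1 - 2/6)
print(round(lcsr("мляко", "молоко"), 3))   # 0.667  (4/6)
```

Both measures coincide on this pair, which illustrates why orthographic similarity alone cannot separate cognates from false friends among identically or nearly identically spelled words.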
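Several of the variants discussed below apply tf.idf-weighting to the context vectors. One plausible formulation, treating each returned snippet as a document, can be sketched as follows; the exact weighting scheme used in the paper may differ:

```python
from collections import Counter
from math import log

def tfidf_weight(context_counts, snippets):
    """Re-weight raw context-word counts (tf) by inverse document frequency,
    treating each search-result snippet as a document. Words that occur in
    every snippet receive weight zero."""
    n = len(snippets)
    df = Counter()
    for snippet in snippets:
        df.update(set(snippet.lower().split()))
    return {w: tf * log(n / df[w]) for w, tf in context_counts.items() if w in df}
```

The intuition is that words appearing in almost every snippet carry little discriminative information about the target word, so their contribution to the similarity score is damped.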
The remaining experiments try different variations of the contextual Web similarity algorithm web3. First, we tested the impact of stop words removal. Recall that web3 removes the stop words from the text snippets returned by Google; therefore, we tried a version of it, no-stop, which keeps them. As Figure 3 shows, this was a bad idea, yielding about 28% absolute loss in accuracy – from 92.05% to 63.94%. We also tried different context sizes: 1, 2, 3, 4 and 5. Context size 3 performed best, but the differences are small. We also experimented with different modifications of web3: using lemmatisation, tf.idf-weighting, reverse lookup, a bigger glossary, and a combination of them. The results are shown on Figure 5.

First, we tried lemmatising the words from the snippets using the monolingual grammatical dictionaries described in section 3.1. We tried both with and without tf.idf-weighting, achieving in either case an improvement of about 2% over web3: as Figure 5 shows, lemma and lemma+tf.idf yield an 11-point average precision of 94.00% and 94.17%, respectively.

In our next experiment, we used the thirteen times larger glossary described in section 3.3, which yielded an 11-point average precision of 94.37% – an absolute improvement of 2.3% compared to web3 (Figure 5, hugedict). Interestingly, when we tried using tf.idf-weighting together with this glossary, we achieved only 93.31% (Figure 5, hugedict+tf.idf).

We also tried the reverse context lookup described in section 2.2, which improved the results by 3.64% to 95.69%. Again, combining it with tf.idf-weighting performed worse: 94.58%.

Finally, we tried a combination of web3 with lemmatisation, reverse lookup and the huge glossary, achieving an 11-point average precision of 95.84%, which is our best result. Adding tf.idf-weighting to the combination yielded slightly worse results: 94.23%.

Figure 4 shows the precision-recall curves for lcsr, seed, web3, and combined. We can see that web3 and combined clearly outperform lcsr and seed.

Fig. 4: Precision-Recall Curve. Comparing web3 with lcsr, seed and combined.

6 Discussion

Ideally, our algorithm would rank first all false friends and only then the cognates. Indeed, in the ranking produced by combined, the top 75 pairs are false friends, while the last 48 are cognates; things get mixed in the middle.

The two lowest-ranked (misplaced) false friends for combined are vrata (‘a door’ in Bulgarian, but ‘an entrance’ in Russian) at rank 152, and abiturient (‘a person who just graduated from high school’ in Bulgarian, but ‘a person who just enrolled in a university’ in Russian) at rank 148. These pairs are problematic for our semantic similarity algorithms: while the senses differ in Bulgarian and Russian, they are related – a door is a kind of entrance, and the newly admitted freshmen in a university are very likely to have just graduated from high school.

The highest-ranked (misplaced) cognate for combined is the pair b. gordost / r. gordost (‘pride’) at position 76. On the Web, this word often appears in historical and/or cultural contexts, which are nation-specific. As a result, the word’s contexts appear misleadingly different in Bulgarian and Russian.

In addition, when querying Google, we only have access to at most 1,000 top-ranked results. Since Google’s ranking often prefers commercial sites, travel agencies, news portals, etc., over books, scientific articles, forum posts, etc., this introduces a bias on the kinds of contexts we extract.

7 Related Work

Many researchers have exploited the intuition that words in two different languages with similar or identical spelling are likely to be translations of each other.

Al-Onaizan&al.’99 [1] create improved Czech-English word alignments using probable cognates extracted with one of the variations of lcsr [24] described in [34]. Using a variation of that technique, Kondrak&al.’03 [17] demonstrate improved translation quality for nine European languages.

Koehn&Knight’02 [15] describe several techniques for inducing translation lexicons. Starting with unrelated German and English corpora, they look for (1) identical words, (2) cognates, (3) words with similar frequencies, (4) words with similar meanings, and (5) words with similar contexts. This is a bootstrapping process, where new translation pairs are added to the lexicon at each iteration.

Rapp’95 [31] describes a correlation between the co-occurrences of words that are translations of each other. In particular, he shows that if in a text in one language two words A and B co-occur more often than expected by chance, then in a text in another language the translations of A and B are also likely to co-occur frequently. Based on this observation, he proposes a model for finding the most accurate cross-linguistic mapping between German and English words using non-parallel corpora. His approach differs from ours in the similarity measure, the text source, and the addressed problem. In later work on the same problem, Rapp’99 [32] represents the context of the target word with four vectors: one for the words immediately preceding the target, another one for the words immediately following it, and two more for the words one more position before/after the target.

Fig. 5: Evaluation, 11-point average precision. Different improvements of web3.

Fung&Yee’98 [12] extract word-level translations from non-parallel corpora. They count the number of sentence-level co-occurrences of the target word with a fixed set of “seed” words in order to rank the candidates in a vector-space model using different similarity measures, after normalisation and tf.idf-weighting. The process starts with a small initial set of seed words, which are dynamically augmented as new translation pairs are identified. As we have seen above, an adaptation of this algorithm, seed, yielded significantly worse results than web3. Another problem of that algorithm is that for a glossary of size |G| it requires 4 × |G| queries, which makes it too expensive for practical use. We tried another adaptation: given the words A and B, instead of using the exact phrase queries “A B” and “B A” and then adding up the page hits, we used the page hits for A and B. This performed even worse.

Diab&Finch’00 [9] present a model for statistical word-level translation between comparable corpora. They count the co-occurrences for each pair of words in each corpus and assign a cross-linguistic mapping between the words in the corpora such that the co-occurrences between the words in the source language are preserved as closely as possible in the co-occurrences of their mappings in the target language.

Zhang&al.’05 [36] present an algorithm that uses a search engine to improve query translation for cross-lingual information retrieval. They issue a query for a word in language A, tell the search engine to return results only in language B, and expect the possible translations of the query terms from language A to language B to be found in the title and summary of the returned search results. They then look for the most frequently occurring word in the search results and apply tf.idf-weighting.

Finally, there is a lot of research on string similarity that has been applied to cognate identification. Ristad&Yianilos’98 [33] and Mann&Yarowsky’01 [22] learn the MED weights using a stochastic transducer. Tiedemann’99 [34] and Mulloni&Pekar’06 [26] learn spelling changes between two languages for lcsr and for medr, respectively. Kondrak’05 [16] proposes the longest common prefix ratio and a longest common subsequence formula that counters lcsr’s preference for short words. Klementiev&Roth’06 [14] and Bergsma&Kondrak’07 [4] propose discriminative frameworks for string similarity. Rappoport&Levent-Levi’06 [23] learn substring correspondences for cognates, using the string-level substitution method of Brill&Moore’00 [7]. Inkpen&al.’05 [13] compare several orthographic similarity measures. Frunza&Inkpen’06 [11] disambiguate partial cognates.

While these algorithms can successfully distinguish cognates from false friends based on orthographic similarity, they do not use semantics and therefore cannot distinguish between equally spelled cognates and false friends (with the notable exception of [4], which can do so in some cases).

Unlike the above-mentioned methods, our approach:

• uses a semantic similarity measure – not an orthographic or phonetic one;

• uses the Web, rather than pre-existing corpora, to extract the local context of the target word when collecting semantic information about it;

• is applied to a different problem: classification of (nearly) identically-spelled false/true friends.
8 Conclusions and Future Work

We have proposed a novel unsupervised semantic method for distinguishing cognates from false friends, based on the intuition that if two words are cognates, then the words in their local contexts should be translations of each other, and we have demonstrated that this is a very promising approach.

There are many ways in which we could improve the proposed algorithm. First, we would like to automatically expand the bilingual glossary with more word translation pairs using bootstrapping, as well as to combine the method with (language-specific) orthographic similarity measures, as done in [27]. We also plan to apply this approach to other language pairs and to other tasks, e.g. to improving word alignments.

Acknowledgments. We would like to thank the anonymous reviewers for their useful comments.

References

[1] Y. Al-Onaizan, J. Curin, M. Jahr, K. Knight, J. Lafferty, D. Melamed, F. Och, D. Purdy, N. Smith, and D. Yarowsky. Statistical machine translation. Technical report, 1999.

[2] L. Andreychin, L. Georgiev, S. Ilchev, N. Kostov, I. Lekov, S. Stoykov, and C. Todorov, editors. Explanatory Dictionary of the Bulgarian Language. 3rd edition, Nauka i Izkustvo, Sofia, 1973.

[3] Y. Baltova and M. Charoleeva, editors. Dictionary of the Bulgarian Language, volume 11. Prof. Marin Drinov Academic Publishing House, Sofia, 2002.

[4] S. Bergsma and G. Kondrak. Alignment-based discriminative string similarity. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 656–663, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

[5] J. A. Bickford and D. Tuggy. Electronic glossary of linguistic terms (with equivalent terms in Spanish). http://www.sil.org/mexico/ling/glosario/E005ai-Glossary.htm, April 2002. Version 0.6. Instituto Lingüístico de Verano (Mexico).

[6] D. Bozhkov, V. Velchev, S. Vlahov, H. E. Rot, et al. Russian-Bulgarian Dictionary. Nauka i Izkustvo, Sofia, 1985–1986.

[7] E. Brill and R. C. Moore. An improved error model for noisy channel spelling correction. In Proceedings of ACL, pages 286–293, 2000.

[8] K. Cholakova, editor. Dictionary of the Bulgarian Language, volumes 1–8. BAN Publishing House, Sofia, 1977–1995.

[9] M. Diab and S. Finch. A statistical word-level translation model for comparable corpora. In Proceedings of RIAO, 2000.

[10] A. Evgenyeva, editor. Dictionary of Russian in Four Volumes, volumes 1–4. Russky Yazyk, Moscow, 1981–1984.

[11] O. Frunza and D. Inkpen. Semi-supervised learning of partial cognates using bilingual bootstrapping. In Proceedings of ACL’06, pages 441–448, 2006.

[12] P. Fung and L. Y. Yee. An IR approach for translating from nonparallel, comparable texts. In Proceedings of ACL, volume 1, pages 414–420, 1998.

[13] D. Inkpen, O. Frunza, and G. Kondrak. Automatic identification of cognates and false friends in French and English. In Proceedings of RANLP’05, pages 251–257, 2005.

[14] A. Klementiev and D. Roth. Named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 82–88, New York City, USA, June 2006. Association for Computational Linguistics.

[15] P. Koehn and K. Knight. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL Workshop on Unsupervised Lexical Acquisition, pages 9–16, 2002.

[16] G. Kondrak. Cognates and word alignment in bitexts. In Proceedings of the 10th Machine Translation Summit, pages 305–312, Phuket, Thailand, September 2005.

[17] G. Kondrak, D. Marcu, and K. Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 (companion volume), pages 44–48, 2003.

[18] G. Kondrak and T. Sherif. Evaluation of several phonetic similarity algorithms on the task of cognate identification. In Proceedings of the Workshop on Linguistic Distances, pages 43–50, Sydney, Australia, July 2006. Association for Computational Linguistics.

[19] V. Kyuvlieva-Mishaykova and M. Charoleeva, editors. Dictionary of the Bulgarian Language, volume 9. Prof. Marin Drinov Academic Publishing House, Sofia, 1998.

[20] V. Kyuvlieva-Mishaykova and E. Pernishka, editors. Dictionary of the Bulgarian Language, volume 12. Prof. Marin Drinov Academic Publishing House, Sofia, 2004.

[21] V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, (10):707–710, 1966.

[22] G. Mann and D. Yarowsky. Multipath translation lexicon induction via bridge languages. In Proceedings of NAACL’01, pages 1–8, 2001.

[23] A. Rappoport and T. Levent-Levi. Induction of cross-language affix and letter sequence correspondence. In Proceedings of the EACL Workshop on Cross-Language Knowledge Induction, 2006.

[24] D. Melamed. Automatic evaluation and uniform filter cascades for inducing N-best translation lexicons. In Proceedings of the Third Workshop on Very Large Corpora, pages 184–198, 1995.

[25] D. Melamed. Bitext maps and alignment via pattern recognition. Computational Linguistics, 25(1):107–130, 1999.

[26] A. Mulloni and V. Pekar. Automatic detection of orthographic cues for cognate recognition. In Proceedings of LREC-06, pages 2387–2390, 2006.

[27] P. Nakov, S. Nakov, and E. Paskaleva. Improved word alignments using the web as a corpus. In Proceedings of RANLP’07, 2007.

[28] K. Panchev. Differential Russian-Bulgarian Dictionary. Nauka i Izkustvo, Sofia, 1963.

[29] E. Paskaleva. Compilation and validation of morphological resources. In Workshop on Balkan Language Resources and Tools (Balkan Conference on Informatics), pages 68–74, 2003.

[30] E. Pernishka and L. Krumova-Cvetkova, editors. Dictionary of the Bulgarian Language, volume 10. Prof. Marin Drinov Academic Publishing House, Sofia, 2000.

[31] R. Rapp. Identifying word translations in non-parallel texts. In Proceedings of ACL, pages 320–322, 1995.

[32] R. Rapp. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of ACL, pages 519–526, 1999.

[33] E. Ristad and P. Yianilos. Learning string-edit distance. IEEE Trans. Pattern Anal. Mach. Intell., 20(5):522–532, 1998.

[34] J. Tiedemann. Automatic construction of weighted string similarity measures. In Proceedings of EMNLP-VLC, pages 213–219, 1999.

[35] A. Zaliznyak. Grammatical Dictionary of Russian. Russky Yazyk, Moscow, 1977.

[36] J. Zhang, L. Sun, and J. Min. Using the web corpus to translate the queries in cross-lingual information retrieval. In IEEE NLP-KE, pages 414–420, 2005.
r Candidate (BG/RU) BG sense RU sense Sim. Cogn.? P@r R@r
1 mufta gratis muff 0.0085 no 100.00 1.00
2 bagrene / bagren e mottle gaff 0.0130 no 100.00 2.00
3 dobit k / dobytok livestock income 0.0143 no 100.00 3.00
4 mraz / mraz chill crud 0.0175 no 100.00 4.00
5 plet / plet hedge whip 0.0182 no 100.00 5.00
6 plitka plait tile 0.0272 no 100.00 6.00
7 kuqa doggish heap 0.0287 no 100.00 7.00
8 lepka bur modeling 0.0301 no 100.00 8.00
9 kaima minced meat selvage 0.0305 no 100.00 9.00
10 niz string bottom 0.0324 no 100.00 10.00
11 geran / geran draw-well geranium 0.0374 no 100.00 11.00
12 pequrka mushroom small stove 0.0379 no 100.00 12.00
13 vatman tram-driver whatman 0.0391 no 100.00 13.00
14 koreika korean bacon 0.0396 no 100.00 14.00
15 duma word thought 0.0398 no 100.00 15.00
16 tovar load commodity 0.0402 no 100.00 16.00
17 katran tar sea-kale 0.0420 no 100.00 17.00
... ... ... ... ... ... ... ...
76 generator generator generator 0.1621 yes 94.74 72.00
77 lodka boat boat 0.1672 yes 93.51 72.00
78 buket bouquet bouquet 0.1714 yes 92.31 72.00
79 prah / poroh dust gunpowder 0.1725 no 92.41 73.00
80 vrata door entrance 0.1743 no 92.50 74.00
81 kl ka gossip cammock 0.1754 no 92.59 75.00
... ... ... ... ... ... ... ...
97 dvi enie motion motion 0.2023 yes 83.51 81.00
98 komp t r / komp ter computer computer 0.2059 yes 82.65 81.00
99 vulkan volcano volcano 0.2099 yes 81.82 81.00
100 godina year time 0.2101 no 82.00 82.00
101 but leg rubble 0.2130 no 82.12 83.00
102 zapovednik despot reserve 0.2152 no 82.35 84.00
103 baba grandmother peasant woman 0.2154 no 82.52 85.00
... ... ... ... ... ... ... ...
154 most bridge bridge 0.3990 yes 62.99 97.00
155 zvezda star star 0.4034 yes 62.58 97.00
156 brat brother brother 0.4073 yes 62.18 97.00
157 meqta dream dream 0.4090 yes 61.78 97.00
158 dru estvo association friendship 0.4133 no 62.03 98.00
159 mlyako / moloko milk milk 0.4133 yes 61.64 98.00
160 klinika clinic clinic 0.4331 yes 61.25 98.00
161 glina clay clay 0.4361 yes 60.87 98.00
162 uqebnik textbook textbook 0.4458 yes 60.49 98.00
... ... ... ... ... ... ... ...
185 koren / koren root root 0.5498 yes 54.35 100.00
186 psihologi psychology psychology 0.5501 yes 54.05 100.00
187 qaika gull gull 0.5531 yes 53.76 100.00
188 spaln / spal n bedroom bedroom 0.5557 yes 53.48 100.00
189 masa / massa massage massage 0.5623 yes 53.19 100.00
190 benzin gasoline gasoline 0.6097 yes 52.91 100.00
191 pedagog pedagogue pedagogue 0.6459 yes 52.63 100.00
192 teori theory theory 0.6783 yes 52.36 100.00
193 br g / bereg shore shore 0.6862 yes 52.08 100.00
194 kontrast contrast contrast 0.7471 yes 51.81 100.00
195 sestra sister sister 0.7637 yes 51.55 100.00
196 finansi / finansy finances finances 0.8017 yes 51.28 100.00
197 srebro / serebro silver silver 0.8916 yes 50.76 100.00
198 nauka science science 0.9028 yes 50.51 100.00
199 flora flora flora 0.9171 yes 50.25 100.00
200 krasota beauty beauty 0.9684 yes 50.00 100.00
11-point average precision: 92.05
Table 1: Ranked examples from our data set for web3: Candidate is the candidate to be judged as being
cognate or not, Sim. is the Web similarity score, r is the rank, P@r and R@r are the precision and the recall
for the top r candidates. Cogn.? shows whether the words are cognates or not.
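The P@r, R@r, and 11-point average precision figures in Table 1 can be computed from the ranked yes/no judgments alone. A minimal sketch, where a "hit" is a false friend (Cogn.? = no) and 100 false friends are assumed in total, as in our data set:

```python
def precision_recall_at_r(is_false_friend, total_false_friends):
    """P@r and R@r (in percent) for each rank r, given the ranked binary
    judgments (True = false friend, i.e. 'Cogn.? = no' in Table 1)."""
    rows, hits = [], 0
    for r, hit in enumerate(is_false_friend, 1):
        hits += hit
        rows.append((r, 100 * hits / r, 100 * hits / total_false_friends))
    return rows

def eleven_point_ap(is_false_friend):
    """11-point interpolated average precision (in percent): the precision is
    interpolated at recall levels 0.0, 0.1, ..., 1.0 as the maximum precision
    achieved at any recall >= that level, then averaged."""
    total = sum(is_false_friend)
    hits, pr = 0, []
    for r, hit in enumerate(is_false_friend, 1):
        hits += hit
        pr.append((hits / total, hits / r))
    levels = [i / 10 for i in range(11)]
    return 100 * sum(max((p for rec, p in pr if rec >= lvl), default=0.0)
                     for lvl in levels) / 11

# A perfect ranking (all false friends first) scores 100.0; ranking them all
# last scores 50.0, which matches the random-baseline figure reported above.
print(eleven_point_ap([True, True, False, False]))    # 100.0
print(eleven_point_ap([False, False, True, True]))    # 50.0
```

For instance, the first 17 rows of Table 1 are all false friends, so P@17 = 100.00 and R@17 = 17.00, exactly as listed.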