Word Sense Alignment with Probabilistic Sense Distribution in a Multilingual and Monolingual Context ∗
Bhaskar Chatterjee
Grenoble INP, University Joseph Fourier
Grenoble, France
Bhaskar.Chatterjee@e.ujf-grenoble.fr
Supervised by: Gilles Sérasset, Andon Tchechmedjiev
I understand what plagiarism entails and I declare that this report
is my own, original work.
Name, date and signature: Bhaskar Chatterjee, 12/06/2015
Abstract
In this article we address the problem of aligning a source sense in one language to a target sense in another language by exploiting source-disambiguated translation links in Dbnary. We propose to use Word Sense Disambiguation (WSD) to annotate corpora in order to estimate sense distributions. In a first setting where we only have monolingual corpora, we estimate sense assignment distributions in both the source and the target language and align senses based on the assumption that, if the source and target languages are closely related, the relative sense distributions will be similar. We rerank senses on both sides and align senses that have the same rank. We then leverage the Europarl parallel corpus and disambiguate it to estimate the distribution of bilingual sense alignments and the most probable alignment target for each source sense. We tested both approaches on a subset of Europarl, first ignoring sentence alignments and then exploiting them to generate the bilingual sense alignment model. We validate the output of the alignment on a few significant examples.
Keywords: word sense disambiguation, multilingual natural language processing, lexical semantics, sense similarity
1 Introduction
Human language is highly ambiguous. There is a large number of languages, and each language contains many words with more than one meaning, so it is difficult to know which meaning of a word corresponds correctly to a word in another language. For instance, the English noun plant can mean green plant or factory; similarly, the French word feuille can mean leaf or paper. The correct sense of an ambiguous word can be selected based on the context where it occurs, and correspondingly the problem of word sense disambiguation is defined as the task of automatically assigning the most appropriate meaning to a polysemous word within a given context. There are multilingual lexical resources like Dbnary that contain translation links between top-level entries or between a sense and a top-level entry. Initially there were only translation links between top-level entries, and previous work [Tchechmedjiev et al., 2014] has aligned the translation links with specific source senses based on textual definitions that describe the source sense. However, the targets of these translation links remain top-level entries: there is no prior information that indicates which target sense should be preferred. We have to turn to external resources to extract information that will allow the alignment of the targets.

∗ These match the formatting instructions of IJCAI-07. The support of IJCAI, Inc. is acknowledged.
One solution is to exploit large parallel corpora that have been manually disambiguated and where the correct senses are assigned to each word. However, sense-annotated corpora exist in only a few languages (English, French), are not parallel, and are moreover relatively small.
Consequently, we turn to using word sense disambiguation to obtain annotated corpora that we can exploit to estimate sense distributions. The figure below pictorially describes the problem of sense alignment.
Fig 1.1 Sense-Word translation
In this article, we will focus on similarity-based methods. These methods assign scores to word senses through semantic similarity (between word senses), and globally find the sense combinations maximising the score over a text. In other words, a local measure is used to assign a similarity score between two lexical objects (senses, words, constituencies) and a global algorithm is used to propagate the local measures to a higher level.
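To make this idea concrete, the following is a minimal Python sketch of how a local similarity measure could be combined by a global algorithm over a short window of words. The function names and the brute-force search are illustrative assumptions only; the actual WSD system used in this work is not specified here.

from itertools import product

def disambiguate_window(words, senses_of, local_sim):
    """Choose one sense per word so that the sum of pairwise local
    similarities over the window is maximal (brute-force global search)."""
    candidate_lists = [senses_of(w) for w in words]
    best_combo, best_score = None, float("-inf")
    for combo in product(*candidate_lists):
        # global score: propagate the local measure by summing it
        # over every pair of chosen senses in the window
        score = sum(local_sim(a, b)
                    for i, a in enumerate(combo)
                    for b in combo[i + 1:])
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo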
For multilingual corpora where we do not have parallel translated texts, we extract the senses used for words and count the sense distribution of each word disambiguated by the existing WSD system; under the assumption that the source language and the target language have similar sense distributions, we assign translation weights from each source sense to the target senses in the target language. For parallel texts, since we already have the translations, we extract sense pairs from parallel sentences. On these pairs we check how much each sense depends on the corresponding sense by giving them a probabilistic weight.
2 State of the art
This section describes the existing systems upon which our work is based.
2.1 Training Data from Parallel Texts
In this section, we describe the parallel texts used in our experiments and the process of gathering training data from them. For our work we used the Europarl corpus [Koehn, 2005]. The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek. The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. Sentence boundaries are identified with a preprocessor, and Europarl is sentence-aligned by a tool based on the Church and Gale algorithm [Gale and Church, 1993].
Size of Corpus
The table below shows, for each language pair, the number of aligned sentences and the word counts on each side.

Parallel Corpus (L1-L2)    Sentences    L1 Words      English Words
Bulgarian-English            406,934           -          9,886,291
Czech-English                646,605   12,999,455        15,625,264
Danish-English             1,968,800   44,654,417        48,574,988
German-English             1,920,209   44,548,491        47,818,827
Greek-English              1,235,976           -         31,929,703
Spanish-English            1,965,734   51,575,748        49,093,806
Estonian-English             651,746   11,214,221        15,685,733
Finnish-English            1,924,942   32,266,343        47,460,063
French-English             2,007,723   51,388,643        50,196,035
Hungarian-English            624,934   12,420,276        15,096,358
Italian-English            1,909,115   47,402,927        49,666,692
Lithuanian-English           635,146   11,294,690        15,341,983
Latvian-English              637,599   11,928,716        15,411,980
Dutch-English              1,997,775   50,602,994        49,469,373
Polish-English               632,565   12,815,544        15,268,824
Portuguese-English         1,960,407   49,147,826        49,216,896
Romanian-English             399,375    9,628,010         9,710,331
Slovak-English               640,715   12,942,434        15,442,233
Slovene-English              623,490   12,525,644        15,021,497
Swedish-English            1,862,234   41,508,712         5,703,795

Fig 2.1.1 Europarl Corpus: sentence and word counts per language pair
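For reference, Europarl is distributed as sentence-aligned plain-text file pairs with one sentence per line on each side. The Python sketch below shows how such aligned pairs can be iterated over before disambiguation; the file names in the usage comment are examples from the public release, not paths used verbatim in this work.

def aligned_sentence_pairs(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from a sentence-aligned
    Europarl file pair (one sentence per line, identical line counts)."""
    with open(src_path, encoding=encoding) as src, \
         open(tgt_path, encoding=encoding) as tgt:
        for src_line, tgt_line in zip(src, tgt):
            yield src_line.strip(), tgt_line.strip()

# Example usage (hypothetical paths):
# for en, fr in aligned_sentence_pairs("europarl-v7.fr-en.en",
#                                      "europarl-v7.fr-en.fr"):
#     ...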
2.2 Dbnary
Dbnary [Sérasset, 2012] is the data extracted from Wiktionary as a lemon-based multilingual lexical resource. The extracted data is available as linked data. The main idea of Dbnary is to create a lexical resource that is structured as a set of monolingual dictionaries + bilingual translation information. This way, the structure of the extracted data follows the usual structure of Machine Readable Dictionaries (MRD).
Dbnary Lexical Structure Example
Fig 2.2 Dbnary lexical entry for cat (figures from [Sérasset, 2012])
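To make this structure concrete, here is a simplified sketch of a Dbnary-style entry. The field names are illustrative assumptions (the actual resource is lemon/RDF linked data); the point is that a lexical entry carries its word senses, while a translation link may be disambiguated on the source side but still points only to a top-level vocable on the target side.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class WordSense:
    sense_id: str   # identifier of one sense of the entry
    gloss: str      # textual definition describing the sense

@dataclass
class Translation:
    source_sense_id: Optional[str]  # known after source-side disambiguation
    target_language: str
    target_vocable: str             # top-level entry only: no target sense yet

@dataclass
class LexicalEntry:
    lemma: str
    language: str
    part_of_speech: str
    senses: List[WordSense] = field(default_factory=list)
    translations: List[Translation] = field(default_factory=list)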
3 Related Work
Despite a large body of work concerning word sense disambiguation (WSD), the use of WSD on parallel corpora is poorly studied; little has been done at the sense level for parallel texts in both the source and the target language. In Dbnary, sense tagging is done only in the source language and not in the target language. Previously, [Sérasset, 2012] extracted lexical entries for 10 languages from Wiktionary. Wiktionary contains translations that have glosses associated at the source side which identify what sense they belong to in the source language, and based on these glosses¹ previous work used an adaptation of various textual and semantic similarity techniques based on partial or fuzzy gloss overlaps to disambiguate the translation relations and then extract some of the sense number information present.
Presently, Dbnary provides translation links from senses in the source language towards top-level entries (Vocables). In this work we aim at improving Dbnary by also aligning these translation links to senses, rather than top-level entries, in the target language.
There are some inherent challenges with the system: for example, if the POS tagger fails to produce the right results or the sense disambiguation fails, there is a chance that our whole work produces incorrect results, because it is based on statistical data that are heavily dependent on these systems. Below is a pictorial representation of the flow of events.

Fig 3.1 Flow Of Events (figures from [Sérasset, 2012])

It is also very difficult to find errors, since the dataset is huge (roughly 2 million sentences) and we do not have manually annotated senses for the data in the Europarl corpus.

¹ Glosses are often associated with translations to make the information available for computer programmes that may in turn be directed towards helping users understand, whether through a textual definition or a target sense number.
4 Method
4.1 Monolingual Sense Frequencies
A large number of languages exist today, and for many language pairs it is hard to get a large translated corpus (comparable corpus) that is also aligned at the sentence level (parallel corpus). In this case, we can only use monolingual corpora, and finding the word in the target language that correctly translates a word from the source text is very difficult. We want to assign a translation relation to a particular sense, so we need information that tells us how often a word used in one sense is translated into a target word used in a particular sense. Obtaining parallel corpora is costly, so we are looking for another solution for language pairs where no parallel text is available. Under such conditions we can use a monolingual corpus on each side, but there is one very important condition for the hypothesis to hold: we must consider closely related languages and assume that the sense distributions (and thus the ordering of senses for particular words) are similar across the two languages. Closely related languages (especially if they are culturally close) tend to share similar senses and sense distributions. Assuming the languages are closely related, we sort senses by frequency on both sides and align them: the translation link from the source sense is assigned the sense with the same rank in the target language.
For instance, the word dead in English has many meanings (senses), one of which is No longer living, which translates to the word mort in French. This French word mort has its own list of senses. The real question is which meaning of the word mort should be taken, as shown in Fig 4.1.1 below, for the correct translation at sense level.
Fig 4.1.1 Translation sense to word
Fig 4.1.2 For dead and mort, senses are reordered according to their usage in the corresponding languages and then aligned by position
So our solution (see Fig 4.1.2) is to order the senses of words on both sides according to their usage in the respective languages. We then take the sense position of No longer living (ordered according to its usage in English), which after reordering is 2, and map it to the sense at the same position in the target language after reordering, i.e. sense 2 of mort, "Moment ou lieu où cet arrêt des fonctions vitales se produit" (the moment or place where this cessation of vital functions occurs), assuming the sense distributions of the words remain the same across the two languages. This technique has its own limitations: if the languages are culturally very different, for example English and Hindi, then it is likely that the sense distributions diverge.
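As a concrete illustration, the following is a minimal Python sketch of this rank-based alignment. It assumes sense-annotated occurrences of a word are already available from a monolingual corpus on each side; the variable names and the example sense labels are purely illustrative and are not part of Dbnary or of the actual system.

from collections import Counter

def rank_senses(annotated_occurrences):
    # Order the senses observed for one word by decreasing corpus frequency.
    counts = Counter(annotated_occurrences)
    return [sense for sense, _ in counts.most_common()]

def align_by_rank(source_occurrences, target_occurrences):
    # Pair the i-th most frequent source sense with the i-th most frequent target sense.
    source_ranked = rank_senses(source_occurrences)
    target_ranked = rank_senses(target_occurrences)
    return list(zip(source_ranked, target_ranked))

# Hypothetical sense-tagged occurrences of "dead" (English) and "mort" (French)
dead_occurrences = ["dead#no_longer_living", "dead#hated", "dead#no_longer_living"]
mort_occurrences = ["mort#arret_vital", "mort#grands_chagrins", "mort#arret_vital"]

print(align_by_rank(dead_occurrences, mort_occurrences))
# [('dead#no_longer_living', 'mort#arret_vital'), ('dead#hated', 'mort#grands_chagrins')]

Under the similar-distribution assumption, the pair returned for each rank is taken as the sense-level translation link; when the two ranked lists have different lengths, the extra senses on the longer side simply remain unaligned.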
4.2 Sense Frequencies in Parallel translated Text
Since we are using europarl corpus and parallel texts are
available, there is no need to make assumption of languages
with same sense frequencies in this case computation of
sense distribution is possible. Since sentence aligned texts
are available, we can compute for each sense assigned to a
word in English (any language can be taken) and take all
senses assigned to the corresponding sentence in French(here
as well other language can be taken) and take a cross
product of sense of the english word with all senses of the
french words. This way we will have a list of sense pair
in english-french. Now all unique sense pairs will have a
count which symbolizes how frequently these pairs occur
together. To find out the dependency of each english word
sense with the corresponding senses in french we are using a
probabilistic approach where we calculate a probable weight
of each sense in a language over the total sense pair count.
This way we can get for each sense pair what is the proba-
bility that a particular sense will be used that can be defined as
Fig 4.2.1 Probability of sense a
p(a) is the probability of sense a occurring in the sense pair (a, b).
Count(a, b) is the total number of occurrences of the sense pair (a, b) in the parallel text.
Count(a) is the total number of times sense a is used for the word W when translating to the corresponding target word in the target language, over the full corpus; e.g. how many times the sense "No longer living" translates to the word mort, irrespective of which sense of mort it translates to.
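The formula in Fig 4.2.1 is not reproduced here; one plausible reconstruction from the textual definitions above (an assumption on our part, not the exact formula in the figure) is

\[
p(a) = \frac{\mathrm{Count}(a,b)}{\mathrm{Count}(a)},
\qquad
p(b) = \frac{\mathrm{Count}(a,b)}{\mathrm{Count}(b)},
\]

where the left expression would correspond to the English-side weight (Fig 4.2.2) and the right one to the symmetric French-side weight (Fig 4.2.3).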
Explanation with an example: taking the example from Fig 4.1.2, the sense pairs for the word dead would look something like pair1 (No longer living, Grands chagrins), pair2 (No longer living, (Figuré) Fin, cessation d'activité), pair3 (No longer living, Arrêt définitif), ... Similarly, pairs can be made with (hated, Grands chagrins) and so on. If we want to know how many times sense a (No longer living) translates to the mort sense b (Grands chagrins), the above formula computes a weight for that.
Fig 4.2.2 Probability of english sense
Similarly, if we want to compute the probability, when translating from French to English, that sense b, i.e. Grands chagrins, occurs in translation to No longer living for the word mort (W):
Fig 4.2.3 Probability of french sense
Similarly, we can compute how each of the English senses relates to each of the French senses by giving them a probabilistic weight for the translation. The figure below makes this visually clearer.
Fig 4.2.4 Probability of translated senses from english to
french
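To make the counting procedure concrete, below is a minimal Python sketch. It assumes sentence-aligned text in which each sentence has already been reduced to the list of sense identifiers assigned by the disambiguator; the function and variable names are hypothetical and only illustrate the technique described above.

from collections import Counter
from itertools import product

def count_sense_pairs(aligned_sentence_senses):
    # aligned_sentence_senses: iterable of (english_senses, french_senses) per sentence pair.
    pair_counts = Counter()
    for english_senses, french_senses in aligned_sentence_senses:
        # Cross product: every English sense in the sentence is paired with
        # every French sense of the aligned sentence.
        for a, b in product(english_senses, french_senses):
            pair_counts[(a, b)] += 1
    return pair_counts

def english_to_french_weight(pair_counts, a, b):
    # Weight of the pair (a, b): how often a co-occurs with b, normalised by
    # the total number of times a occurs with any French sense.
    count_a = sum(c for (x, _), c in pair_counts.items() if x == a)
    return pair_counts[(a, b)] / count_a if count_a else 0.0

The French-to-English weight is obtained in the same way, normalising by the total count of the French sense b instead of the English sense a.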
5 Validations
Validating sense alignments across languages is a difficult task, as sense alignment datasets are scarce and limited to specific language pairs. Due to time constraints, it would be unrealistic to build such an evaluation dataset. However, as a preliminary validation step, we examine a few interesting examples that highlight the strengths and weaknesses of both proposed approaches. We take a very small subset of Europarl. A good example to check our work is a word with a high frequency of occurrence: this way we are likely to see the largest number of senses used and a larger statistical weight on each sense. For this we have chosen certain English words such as council, commission, house, rights, political, situation and issue. Based on the translations of these words into French according to our statistics, we can check the accuracy of our work. Checking is done by human judgement, since there is currently no system that contains parallel translations of senses in both languages.
5.1 Results
These are some of the translations which we obtain at the sense level.
Case 1: For the word commission, taking into account the monolingual case.
Fig 5.1.1 Senses sorted according to frequency; on the left is the English word and on the right the French one
In the English texts, commission occurs most frequently with sense 1, and the corresponding translation of sense 1 of the English commission is commission in French, so we have also ordered the sense frequencies of the French commission. It is quite evident that sense 1 in French can be a good translation of English sense 1, and sense 2 in both cases is quite similar, but sense 3 is not quite accurate: English sense 3 relates to both sense 1 and sense 3 in French. Similarly, we checked the word seal in English, whose most frequent translation is phoque; in this case only the first two ranks were good translations. There were also very bad examples, such as the English word house, whose target word's sense frequencies were different.
Case 2: For the parallel corpus we assigned each translation a probabilistic weight, taking the same example as in Fig 5.1.1.
All the sense pairs that occurred while disambiguating and aligning senses are given a weight; pairs that did not occur are either given a weight of zero or not referenced. In this case the sense pair weights are quite accurate. Due to limitations in time and the absence of some resources, we could not test many more cases.
5.2 Conclusion and Future Work
Based on the results above and human judgement, we can say that we are close to 70 percent accurate, which is not bad given the many factors that are out of the scope of this internship. There is also a problem of data sparseness, given the relatively small size of the dataset used. This can be one way of providing translation links at the sense level in the target language, which in our case can be from French to English. A lot can be done to improve this system: for example, we can make use of a word alignment model [Brown et al., 1993] and build our sense pairs on top of the word alignments. Another component, which has so far been treated as a black box, is the sense disambiguator. The disambiguator currently used, the Simulated-Annealing-Disambiguation method, takes a lot of time to disambiguate the senses; work can be done to reduce this time substantially.
Acknowledgments
I am grateful to Prof. Gilles Sérasset and Andon Tchechmedjiev for their helpful comments, discussions and supervision. Without their supervision this work would not have been possible.
References
[Brown et al., 1993] Peter F Brown, Vincent J Della Pietra,
Stephen A Della Pietra, and Robert L Mercer. The math-
ematics of statistical machine translation: Parameter esti-
mation. Computational linguistics, 19(2):263–311, 1993.
[Gale and Church, 1993] William A Gale and Kenneth W
Church. A program for aligning sentences in bilingual cor-
pora. Computational linguistics, 19(1):75–102, 1993.
[Koehn, 2005] Philipp Koehn. Europarl: A parallel corpus
for statistical machine translation. In MT summit, vol-
ume 5, pages 79–86. Citeseer, 2005.
[Sérasset, 2012] Gilles Sérasset. Dbnary: Wiktionary as a lemon-based multilingual lexical resource in RDF. Semantic Web Journal, Special issue on Multilingual Linked Open Data, 2012.
[Tchechmedjiev et al., 2014] Andon Tchechmedjiev, Gilles Sérasset, Jérôme Goulian, and Didier Schwab. Attaching translations to proper lexical senses in Dbnary. In 3rd Workshop on Linked Data in Linguistics: Multilingual Knowledge Resources and Natural Language Processing, pages to appear, 2014.