Word Sense Alignment with Probabilistic Sense Distribution in a Multilingual and
Monolingual Context ∗
Bhaskar Chatterjee
Grenoble INP, University Joseph Fourier
Grenoble, France
Bhaskar.Chatterjee@e.ujf-grenoble.fr
Supervised by: Gilles Sérasset, Andon Tchechmedjiev
I understand what plagiarism entails and I declare that this report
is my own, original work.
Name, date and signature: Bhaskar Chatterjee, 12/06/2015
Abstract
In this article we take interest in the problem
of aligning a source sense in one language to a
target sense in another language by exploiting
source-disambiguated translation links in DBnary.
We propose to use Word Sense Disambiguation
(WSD) to annotate corpora in order to estimate
sense distributions. In a first setting, where we
only have monolingual corpora, we estimate sense
assignment distributions in both the source and
the target language and align senses based on the
assumption that if the source and target languages
are closely related, the relative sense distributions
will be similar. We re-rank senses on both sides
and align senses that have the same rank. We then
leverage the Europarl parallel corpus and disambiguate
it to estimate the distribution of bilingual sense
alignments and the most probable alignment target
for each source sense. We tested both approaches
on a subset of Europarl, first ignoring sentence
alignments and then exploiting them to generate the
bilingual sense alignment model. We validate the
output of the alignment on a few significant examples.
Keywords: word sense disambiguation, multilingual
natural language processing, lexical semantics,
sense similarity
1 Introduction
Human language is highly ambiguous. There are a large number
of languages, and each language contains many words that have
more than one meaning, so it is very difficult to know which
meaning of a word corresponds correctly to a word in another
language. For instance, the English noun plant can mean green
plant or factory; similarly, the French word feuille can mean
leaf or paper. The
∗
These match the formatting instructions of IJCAI-07. The support
of IJCAI, Inc. is acknowledged.
correct sense of an ambiguous word can be selected based on
the context where it occurs, and correspondingly the prob-
lem of word sense disambiguation is defined as the task of
automatically assigning the most appropriate meaning to a
polysemous word within a given context. There are multilingual
lexical resources like DBnary that contain translation
links between top-level entries or between a sense and a top-level
entry. Initially there were only translation links between
top-level entries, and previous work [Tchechmedjiev et al.,
2014] aligned the translation links with specific source
senses based on the textual definitions that describe the source
sense. However, the targets of these translation links remain
top-level entries: there is no prior information that indicates
which target sense should be preferred. We have to turn to
external resources to extract information that will allow the
alignment of the targets.
One solution is to exploit large parallel corpora that
have been manually disambiguated, where the correct sense
is assigned to each word. However, sense-annotated corpora
exist in few languages (English, French) and are not parallel.
Moreover, they are relatively small.
Consequently, we turn to word sense disambiguation
to obtain annotated corpora that we can exploit to estimate
sense distributions. The figure below pictorially describes the
problem of sense alignment.
Fig 1.1 Sense-Word translation
In this article, we will focus on similarity-based methods.
These methods assign scores to word senses through semantic
similarity (between word senses), and globally find the sense
combinations maximising the score over a text. In other words,
a local measure is used to assign a similarity score between
two lexical objects (senses, words, constituents) and a global
algorithm is used to propagate the local measures to a higher
level.
For the multilingual setting where we do not have parallel
translated texts, we extract the senses used for words and
count the sense distribution for each word disambiguated by
an existing WSD system; assuming that the source language
and the target language have similar sense distributions, we
assign translation weights from each sense in the source
language to the target senses in the target language.
For the parallel texts, since we already have the translations,
we extract sense pairs from parallel sentences. On these pairs
we check how much each sense depends on the corresponding
sense by giving them a probabilistic weight.
2 State of the art
This section presents the existing systems upon which our
work is based.
2.1 Training Data from Parallel Texts
In this section, we describe the parallel texts used in our
experiments, and the process of gathering training data from
them. For our work we used the Europarl corpus [Koehn,
2005]. The Europarl parallel corpus is extracted from
the proceedings of the European Parliament. It includes
versions in 21 European languages: Romance (French,
Italian, Spanish, Portuguese, Romanian), Germanic (English,
Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech,
Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian,
Estonian), Baltic (Latvian, Lithuanian), and Greek. The goal
of the extraction and processing was to generate sentence-aligned
text for statistical machine translation systems.
Sentence boundaries are identified using a preprocessor. Europarl
is sentence-aligned by a tool based on the Gale and Church
algorithm [Gale and Church, 1993].
Size of Corpus
The table below lists, for each language pair, the number of
aligned sentences and the word counts on each side.

Parallel Corpus (L1-L2)   Sentences   L1 Words     English Words
Bulgarian-English         406,934     -            9,886,291
Czech-English             646,605     12,999,455   15,625,264
Danish-English            1,968,800   44,654,417   48,574,988
German-English            1,920,209   44,548,491   47,818,827
Greek-English             1,235,976   -            31,929,703
Spanish-English           1,965,734   51,575,748   49,093,806
Estonian-English          651,746     11,214,221   15,685,733
Finnish-English           1,924,942   32,266,343   47,460,063
French-English            2,007,723   51,388,643   50,196,035
Hungarian-English         624,934     12,420,276   15,096,358
Italian-English           1,909,115   47,402,927   49,666,692
Lithuanian-English        635,146     11,294,690   15,341,983
Latvian-English           637,599     11,928,716   15,411,980
Dutch-English             1,997,775   50,602,994   49,469,373
Polish-English            632,565     12,815,544   15,268,824
Portuguese-English        1,960,407   49,147,826   49,216,896
Romanian-English          399,375     9,628,010    9,710,331
Slovak-English            640,715     12,942,434   15,442,233
Slovene-English           623,490     12,525,644   15,021,497
Swedish-English           1,862,234   41,508,712   45,703,795

Fig 2.1.1 Europarl corpus: pairs of sentence-translated data
2.2 DBnary
DBnary [Sérasset, 2012] is the data extracted from Wiktionary
as a lemon-based multilingual lexical resource. The
extracted data is available as linked data. The main idea
of DBnary is to create a lexical resource that is structured
as a set of monolingual dictionaries plus bilingual translation
information. This way, the structure of the extracted data
follows the usual structure of Machine Readable Dictionaries (MRD).
DBnary Lexical Structure Example
Fig 2.2 DBnary lexical entry for cat. Figure taken from
[Sérasset, 2012]
3 Related Work
Despite a large body of work concerning word sense
disambiguation (WSD), the use of WSD on parallel corpora is
poorly studied; little has been done at the sense level for
parallel texts in both the source and the target language. In
DBnary, sense tagging is done only in the source language and
not in the target language.
Previously, [Sérasset, 2012] extracted lexical entries for 10
languages from Wiktionary. Wiktionary contains translations
that have glosses 1 associated at the source that identify
which sense they belong to in the source language; based on
these glosses, [Tchechmedjiev et al., 2014] used an adaptation
of various textual and semantic similarity techniques based on
partial or fuzzy gloss overlaps to disambiguate the translation
relations and then extract some of the sense number
information present.
Presently, DBnary provides translation links from senses in
the source language towards top-level entries (Vocables). In
this work we aim at improving DBnary by also aligning
these translation links to senses, rather than top-level entries,
in the target language.
There are some inherent challenges with this system: for
example, if the POS tagger fails to produce the right results,
or the sense disambiguation fails, then our whole pipeline may
produce incorrect results, because it is based on statistical
data that are heavily dependent on these systems.
Below is a pictorial representation of the flow of events.
1
Glosses are often associated with translations to make the
information available for computer programs that may in turn be
directed towards helping users understand, whether through a
textual definition or a target sense number.
Fig 3.1 Flow of Events. Figure taken from [Sérasset, 2012]
It is also very difficult to find errors, since the dataset is
huge (roughly 2 million sentences) and we do not have manually
annotated senses for the Europarl data.
4 Method
4.1 Monolingual Sense Frequencies
A large number of languages exist today, and for many
language pairs it is hard to get a large translated corpus
(comparable corpus) that is also aligned at the sentence level
(parallel corpus). In this case, we can only use monolingual
corpora. Finding the word in the target language that correctly
translates a word in the source text is then very difficult.
We want to assign a translation relation to a particular
sense, so we need information that tells us how often a word
in one sense is translated into a target word with a particular
sense. However, obtaining parallel corpora is costly, so we
look for another solution for language pairs where no parallel
text is available. Under such conditions we can use a
monolingual corpus on each side, but there is one very
important condition for the hypothesis to hold: we must
consider closely related languages and assume that sense
distributions (and thus the ordering of senses for particular
words) are similar across the two languages. We know that
closely related languages (especially if they are culturally
close) tend to share similar senses and sense distributions.
Assuming the languages are closely related, we sort senses by
frequency on both sides and align them: the translation link
from the source sense is assigned the same-ranked sense in the
target language.
For instance, the word dead in English has many meanings
(senses), of which one is No longer living, which translates
to the word mort in French. The French word mort has its own
list of senses. The real question is which meaning of the
word mort should be taken, as shown in Fig 4.1.1 below, for
the correct translation at the sense level.
Fig 4.1.1 Translation sense to word
Fig 4.1.2 For dead and mort, senses are reordered
according to their usage in the corresponding languages
and then aligned by position
Our solution (see Fig 4.1.2) is to order the senses of words
on both sides according to their usage in each language. We
then take the position of No longer living (ordered according
to its usage in English), which after reordering is 2, and map
it to the sense at the same position, after reordering, in the
target language, i.e. the sense at position 2 of mort: Moment
ou lieu où cet arrêt des fonctions vitales se produit
(assuming the sense distributions of the words remain the same
across the languages). This technique has its own limitations:
if the languages are culturally very different, for example
English and Hindi, then it is likely that the sense
distributions are divergent.
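As an illustration, the rank-based alignment described above can be sketched in a few lines of Python. The sense identifiers and the input format (a flat list of sense assignments produced by a WSD system over a monolingual corpus) are invented for the example and are not the actual DBnary/Europarl data structures.

```python
from collections import Counter

def rank_senses(annotations):
    """Given the sense ids assigned to a word by a WSD system over a
    monolingual corpus, return the senses sorted by frequency (rank 1 first)."""
    return [sense for sense, _ in Counter(annotations).most_common()]

def align_by_rank(source_annotations, target_annotations):
    """Align the i-th most frequent source sense with the i-th most frequent
    target sense. Only valid under the assumption that the two languages are
    closely related and share similar sense distributions."""
    src = rank_senses(source_annotations)
    tgt = rank_senses(target_annotations)
    return list(zip(src, tgt))  # senses left over on either side stay unaligned

# Toy corpora with invented sense ids for "dead" and "mort"
dead = ["dead#hated", "dead#no_longer_living", "dead#no_longer_living"]
mort = ["mort#grands_chagrins", "mort#arret_vital", "mort#arret_vital"]
print(align_by_rank(dead, mort))
```

Senses with the same rank on both sides are paired, which mirrors the reordering shown in Fig 4.1.2.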
4.2 Sense Frequencies in Parallel Translated Text
Since we are using the Europarl corpus and parallel texts are
available, there is no need to assume that the languages have
the same sense frequencies: in this case the computation of
the sense distribution is possible. Since sentence-aligned
texts are available, for each sense assigned to a word in
English (any language could be taken) we take all senses
assigned to the corresponding sentence in French (again, any
other language could be taken) and take the cross product of
the senses of the English word with all senses of the French
words. This way we obtain a list of English-French sense
pairs. Each unique sense pair then has a count which indicates
how frequently the pair occurs. To find out how dependent each
English word sense is on the corresponding senses in French,
we use a probabilistic approach where we calculate a probable
weight of each sense in a language over the total sense pair
count.
This way, for each sense pair, we can get the probability
that a particular sense will be used, which can be defined as
Fig 4.2.1 Probability of sense a
p(a) is the probability of sense a occurring in the sense
pair (a, b).
Count(a,b) is the total number of occurrences of the sense
pair (a, b) in the parallel text.
Count(a) is the total number of times sense a is used for the
word W in the translation of the corresponding target word in
the target language over the full corpus, e.g. how many times
the sense "No longer living" translates to the word mort,
irrespective of which sense of mort it translates to.
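The figure containing the formula is not reproduced in this text, so, as a reconstruction from the definitions above, the weight of Fig 4.2.1 can be written as:

```latex
p(a) = \frac{\mathrm{Count}(a,b)}{\mathrm{Count}(a)}
```

that is, the fraction of occurrences of source sense a that are paired with target sense b. Figs 4.2.2 and 4.2.3 would then be the analogous ratios for each translation direction, with Count(b) in the denominator for the French-to-English case.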
Explanation with an example:
Taking the example from Fig 4.1.2, for the word dead the
sense pairs look like this: pair1 (No longer living, Grands
chagrins), pair2 (No longer living, (Figuré) Fin, cessation
d'activité.), pair3 (No longer living, Arrêt définitif), ...
Similarly, pairs can be made with (hated, Grands chagrins)
and so on.
If we want to know how many times sense a (No longer living)
translates to the mort sense b (Grands chagrins), the formula
above computes a weight for that.
Fig 4.2.2 Probability of the English sense
Similarly, if we want to compute the probability, when
translating from French to English, that sense b, i.e. Grands
chagrins, occurs in translation to No longer living for the
word mort (W):
Fig 4.2.3 Probability of the French sense
Similarly, we can compute how each of the English senses
relates to each of the French senses by giving them a
probabilistic weight for the translation. The figure below
illustrates this visually.
Fig 4.2.4 Probability of translated senses from English to
French
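The sense-pair counting and weighting described in this section can be sketched as follows. The input format (per-sentence lists of assigned sense ids for each side of the aligned pair) and the sense names are hypothetical; Count(a) is taken here as the number of pairs in which sense a appears, one reading of the definition above.

```python
from collections import Counter
from itertools import product

def sense_pair_weights(annotated_pairs):
    """annotated_pairs: list of (english_senses, french_senses), one entry
    per aligned sentence pair, each side a list of WSD-assigned sense ids.
    Returns the weight Count(a, b) / Count(a) for every observed pair."""
    pair_counts = Counter()
    source_counts = Counter()
    for en_senses, fr_senses in annotated_pairs:
        # Cross product of the English senses with all French senses
        for a, b in product(en_senses, fr_senses):
            pair_counts[(a, b)] += 1
            source_counts[a] += 1
    return {(a, b): c / source_counts[a] for (a, b), c in pair_counts.items()}

# Toy sentence pairs using invented sense ids
corpus = [
    (["dead#no_longer_living"], ["mort#arret_definitif"]),
    (["dead#no_longer_living"], ["mort#grands_chagrins"]),
    (["dead#no_longer_living"], ["mort#arret_definitif"]),
]
weights = sense_pair_weights(corpus)
# Count(a, b) / Count(a) = 2 / 3 for the arret_definitif pairing
print(weights[("dead#no_longer_living", "mort#arret_definitif")])
```

Pairs that never occur simply do not appear in the returned dictionary, which matches the treatment of unseen pairs described in Section 5.1.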
5 Validations
Validating sense alignments across languages is a difficult
task, as sense alignment datasets are scarce and limited to
specific language pairs. Due to time constraints, it would be
unrealistic to build such an evaluation dataset. However, as a
preliminary validation step, we examine a few interesting
examples that highlight the strengths and weaknesses of the
two approaches proposed. We take a very small subset of
Europarl. A good example to check our work is a word with a
high frequency of occurrence: this way we are likely to see
the largest number of senses used and a larger statistical
weight on each sense. For this we have chosen certain English
words such as council, commission, house, rights, political,
situation, issue. Based on the translations of these words
into French, according to our statistics, we can check the
rough accuracy of our work. Checking is done by human
judgement, since there is currently no system that contains
parallel translations of senses in both languages.
5.1 Results
These are some of the translations which we have at the sense
level.
Case 1: For the word commission, taking into account the
monolingual case.
Fig 5.1.1 Sorted according to frequency of senses; on the
left is the English word and on the right the French
In the English texts, commission occurs most frequently with
sense 1, and the corresponding translation of sense 1 of the
English commission is commission in French, so we have also
ordered the sense frequencies of the French commission. It is
quite evident that sense 1 in French can be a good translation
of English sense 1, and sense 2 is also quite similar in both
cases, but sense 3 is not quite accurate: it relates to both
sense 1 and sense 3 in French. Similarly, we checked the word
seal in English; the most frequent translation of seal is
phoque, and in this case only the first two ranks were good
translations. There were also very bad examples, like the
English word house, whose target word's sense frequencies
were different.
Case 2: For the parallel corpus we assigned each translation
a probabilistic weight, taking the same example as in
Fig 5.1.1.
Fig 5.1.1 Sorted according to frequency of senses, in left is
english word and on right is french
All the sense pairs that occurred while disambiguating and
aligning senses are given a weight; pairs that did not get
paired are either given a weight of zero or not referenced.
In this case the sense pair weights are quite accurate. Due
to limitations in time and the absence of some resources, we
could not test many cases.
5.2 Conclusion and Future Work
Based on the results above and human judgement, we can say
that we are close to 70 percent accurate, which is not bad
given the many factors that are out of scope for this
internship. There is also a problem of data sparseness, given
the relatively small size of the dataset used. This can be one
way of providing translation links at the sense level in the
target language, which in our case can be from French to
English. A lot can be done to improve this system: for
example, we can make use of a word alignment model [Brown et
al., 1993] and build our sense pairs on top of the word
alignments. Another component which is so far treated as a
black box is the sense disambiguator. Currently the sense
disambiguator, which in our case is the Simulated Annealing
disambiguation method, takes a lot of time to disambiguate
the senses; work can be done to reduce this time substantially.
Acknowledgments
I am grateful to Prof. Gilles Sérasset and Andon
Tchechmedjiev for their helpful comments, discussions and
supervision. Without their supervision this work would not
have been possible.
References
[Brown et al., 1993] Peter F Brown, Vincent J Della Pietra,
Stephen A Della Pietra, and Robert L Mercer. The math-
ematics of statistical machine translation: Parameter esti-
mation. Computational linguistics, 19(2):263–311, 1993.
[Gale and Church, 1993] William A Gale and Kenneth W
Church. A program for aligning sentences in bilingual cor-
pora. Computational linguistics, 19(1):75–102, 1993.
[Koehn, 2005] Philipp Koehn. Europarl: A parallel corpus
for statistical machine translation. In MT summit, vol-
ume 5, pages 79–86. Citeseer, 2005.
[Sérasset, 2012] Gilles Sérasset. Dbnary: Wiktionary as a
lemon-based multilingual lexical resource in RDF. Semantic
Web Journal - Special issue on Multilingual Linked Open
Data, 2012.
[Tchechmedjiev et al., 2014] Andon Tchechmedjiev, Gilles
Sérasset, Jérôme Goulian, and Didier Schwab. Attaching
translations to proper lexical senses in DBnary. In
3rd Workshop on Linked Data in Linguistics: Multilingual
Knowledge Resources and Natural Language Processing,
pages to appear, 2014.

 
Parafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdfParafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdf
 
W17 5406
W17 5406W17 5406
W17 5406
 
An Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source LemmatizerAn Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source Lemmatizer
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
 
Nlp ambiguity presentation
Nlp ambiguity presentationNlp ambiguity presentation
Nlp ambiguity presentation
 
G2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian languageG2 pil a grapheme to-phoneme conversion tool for the italian language
G2 pil a grapheme to-phoneme conversion tool for the italian language
 
amta-decision-trees.doc Word document
amta-decision-trees.doc Word documentamta-decision-trees.doc Word document
amta-decision-trees.doc Word document
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
 

ijcai11

ilarly, the French word feuille can mean leaf or paper. The correct sense of an ambiguous word can be selected based on the context in which it occurs; correspondingly, the problem of word sense disambiguation is defined as the task of automatically assigning the most appropriate meaning to a polysemous word within a given context.

∗ These match the formatting instructions of IJCAI-07. The support of IJCAI, Inc. is acknowledged.

There are multilingual lexical resources, such as DBnary, that contain translation links between top-level entries or between a sense and a top-level entry. Initially there were only translation links between top-level entries; previous work [Tchechmedjiev et al., 2014] aligned the translation links with specific source senses, based on the textual definitions that describe the source sense. However, the targets of these translation links remain top-level entries: there is no prior information indicating which target sense should be preferred. We therefore have to turn to external resources to extract information that will allow the alignment of the targets. One solution is to exploit large parallel corpora that have been manually disambiguated, where the correct sense is assigned to each word. However, sense-annotated corpora exist in only a few languages (English, French), are not parallel, and are relatively small. Consequently, we turn to word sense disambiguation to obtain annotated corpora that we can exploit to estimate sense distributions. Fig 1.1 below pictorially describes the problem of sense alignment.

Fig 1.1 Sense-Word translation

In this article, we will focus on similarity-based methods.
These methods assign scores to word senses through a semantic similarity measure between word senses, and globally find the sense combination that maximises the score over a text. In other words, a local measure assigns a similarity score between two lexical objects (senses, words, constituents), and a global algorithm propagates the local measures to a higher level.

For multilingual corpora where we do not have parallel
texts, we extract the senses used for words and count the sense distribution of each word as disambiguated by an existing WSD system. Under the assumption that the source and target languages have similar sense distributions, we assign translation weights from each sense in the source language to the target senses in the target language. For parallel texts, since we already have the translations, we extract sense pairs from parallel sentences. On these pairs, we measure how much each sense depends on the corresponding sense by assigning them a probabilistic weight.

2 State of the Art

This section presents the existing systems upon which our work is based.

2.1 Training Data from Parallel Texts

In this section, we describe the parallel texts used in our experiments and the process of gathering training data from them. For our work we used the Europarl corpus [Koehn, 2005]. The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek. The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. Sentence boundaries are identified with a preprocessor, and Europarl is sentence-aligned by a tool based on the Gale and Church algorithm [Gale and Church, 1993].

Size of Corpus

The table below shows the sentence and word counts for each language pair.
Europarl Corpus

Parallel Corpus (L1-L2)   Sentences   L1 Words     English Words
Bulgarian-English         406,934     -            9,886,291
Czech-English             646,605     12,999,455   15,625,264
Danish-English            1,968,800   44,654,417   48,574,988
German-English            1,920,209   44,548,491   47,818,827
Greek-English             1,235,976   -            31,929,703
Spanish-English           1,965,734   51,575,748   49,093,806
Estonian-English          651,746     11,214,221   15,685,733
Finnish-English           1,924,942   32,266,343   47,460,063
French-English            2,007,723   51,388,643   50,196,035
Hungarian-English         624,934     12,420,276   15,096,358
Italian-English           1,909,115   47,402,927   49,666,692
Lithuanian-English        635,146     11,294,690   15,341,983
Latvian-English           637,599     11,928,716   15,411,980
Dutch-English             1,997,775   50,602,994   49,469,373
Polish-English            632,565     12,815,544   15,268,824
Portuguese-English        1,960,407   49,147,826   49,216,896
Romanian-English          399,375     9,628,010    9,710,331
Slovak-English            640,715     12,942,434   15,442,233
Slovene-English           623,490     12,525,644   15,021,497
Swedish-English           1,862,234   41,508,712   5,703,795

Fig 2.1.1

2.2 DBnary

DBnary [Sérasset, 2012] is the data extracted from Wiktionary as a lemon-based multilingual lexical resource. The extracted data is available as linked data. The main idea of DBnary is to create a lexical resource structured as a set of monolingual dictionaries plus bilingual translation information. This way, the structure of the extracted data follows the usual structure of Machine Readable Dictionaries (MRD).

DBnary Lexical Structure Example
Fig 2.2 DBnary lexical entry for cat (figure from [Sérasset, 2012])

3 Related Work

Despite a large body of work on word sense disambiguation (WSD), the use of WSD on parallel corpora is poorly studied; little has been done at the sense level for parallel texts in both the source and target languages. In DBnary, sense tagging is done only for the source language, not for the target language. Previously, [Sérasset, 2012] extracted lexical entries for 10 languages from Wiktionary. Wiktionary contains translations with glosses attached at the source that identify which sense they belong to in the source language; based on these glosses 1, an adaptation of various textual and semantic similarity techniques, using partial or fuzzy gloss overlaps, was used to disambiguate the translation relations and extract some of the sense number information present. Presently, DBnary provides translation links from senses in the source language towards top-level entries (Vocables). In this work, we aim at improving DBnary by also aligning these translation links to senses, rather than top-level entries, in the target language.

There are some inherent challenges in this pipeline: if the POS tagger or the sense disambiguation fails to produce the right results, our whole approach may produce incorrect results, because it is based on statistical data that depend heavily on these systems.
Below is a pictorial representation of the flow of events.

1 Glosses are often associated with translations to make the information available to computer programs that may in turn be directed towards helping users understand, whether through a textual definition or a target sense number.

Fig 3.1 Flow of Events (figure from [Sérasset, 2012])

It is also very difficult to find errors, since the dataset is huge (roughly 2 million sentences) and we do not have manually annotated senses for the Europarl data.

4 Method

4.1 Monolingual Sense Frequencies

A large number of languages exist today, and for many language pairs it is hard to obtain a large translated corpus (comparable corpus) that is also aligned at the sentence level (parallel corpus). In this case, we can only use monolingual corpora, and finding the word in the target language that correctly translates a word in the source text is very difficult. We want to assign a translation relation to a particular sense, so we need information that tells us how often a word used in one sense is translated into a target word with a particular sense. Obtaining parallel corpora is costly, so we look for another solution for language pairs where no parallel text is available. Under such conditions we can use a monolingual corpus on each side, but there is one very important condition for the hypothesis to hold: we must consider closely related languages, and we must assume that sense distributions (and thus the ordering of senses for particular words) are similar across the two languages. Closely related languages (especially culturally close ones) tend to share similar senses and sense distributions. Assuming the languages are closely related, we sort senses by frequency on both sides and align them: the translation link from the source sense is assigned to the sense with the same rank in the target language.
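The rank-based alignment just described can be sketched as follows. This is a minimal illustration, not the actual system: the sense identifiers and frequency counts are hypothetical, standing in for per-sense frequencies that a WSD system would estimate from monolingual corpora.

```python
# Hypothetical sketch of the monolingual rank-alignment heuristic:
# senses of a source word and of its candidate translation are each
# ranked by corpus frequency, and senses sharing a rank are aligned.

def rank_align(source_freqs, target_freqs):
    """source_freqs / target_freqs: dict mapping sense id -> frequency,
    as estimated from a disambiguated monolingual corpus."""
    src_ranked = sorted(source_freqs, key=source_freqs.get, reverse=True)
    tgt_ranked = sorted(target_freqs, key=target_freqs.get, reverse=True)
    # Align senses with the same rank; any leftover senses stay unaligned.
    return list(zip(src_ranked, tgt_ranked))

# Toy counts for English "dead" and French "mort" (illustrative only):
dead = {"no_longer_living": 120, "hated": 5, "obsolete": 30}
mort = {"arret_des_fonctions_vitales": 300, "grands_chagrins": 10,
        "fin_cessation_dactivite": 80}

print(rank_align(dead, mort))
```

The heuristic is only as good as its underlying assumption: if the two frequency orderings diverge (as for culturally distant languages), the aligned pairs will be wrong.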
For instance, the word dead in English has many meanings (senses), one of which is "No longer living", which translates to the word mort in French. The French word mort has its own list of senses. The real question is which meaning of mort should be taken, as shown in Fig 4.1.1 below, for the correct translation at the sense
level.

Fig 4.1.1 Translation sense to word

Fig 4.1.2 For dead and mort, senses are reordered according to their usage in the corresponding languages and then assigned the same position

Our solution (see Fig 4.1.2) is to order the senses of words on both sides according to their usage in each language. We then take the sense position of "No longer living" (ordered according to its usage in English), which after reordering is at position 2, and map it to the corresponding reordered sense in the target language, i.e. the sense at position 2 of mort, "Moment ou lieu où cet arrêt des fonctions vitales se produit" (the moment or place where this cessation of vital functions occurs), assuming the sense distributions of the words remain the same across the languages. This technique has its own limitations: if the languages are culturally very different, for example English and Hindi, then it is likely that the sense distributions are divergent.

4.2 Sense Frequencies in Parallel Translated Text

Since we are using the Europarl corpus and parallel texts are available, there is no need to assume that the languages have the same sense frequencies; in this case, the computation of the sense distribution is possible directly. Since sentence-aligned texts are available, we can take each sense assigned to a word in an English sentence (any language can be taken), take all senses assigned to the corresponding French sentence (again, any other language can be taken), and compute the cross product of the senses of the English word with all the senses of the French words. This gives a list of English-French sense pairs. Each unique sense pair then has a count which indicates how frequently the pair occurs. To find the dependency of each English word sense on the corresponding senses in French, we use a probabilistic approach where we calculate a probabilistic weight for each sense in a language over the total sense-pair count.
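The sense-pair extraction and counting step just described can be sketched as follows. The sense identifiers and the shape of the input are hypothetical; in practice the per-sentence sense annotations would come from the WSD system run over the sentence-aligned corpus.

```python
from collections import Counter
from itertools import product

def count_sense_pairs(aligned_sentences):
    """aligned_sentences: iterable of (en_senses, fr_senses) pairs,
    where each element is the list of sense ids a WSD system assigned
    to one English sentence and to its aligned French sentence."""
    pair_counts = Counter()
    for en_senses, fr_senses in aligned_sentences:
        # Cross product: every English sense of the sentence paired
        # with every French sense of the aligned sentence counts as
        # one co-occurrence of that sense pair.
        pair_counts.update(product(en_senses, fr_senses))
    return pair_counts

# Toy corpus of two aligned sentence pairs (hypothetical sense ids):
corpus = [
    (["dead#1"], ["mort#1", "mort#3"]),
    (["dead#1", "house#2"], ["mort#1"]),
]
counts = count_sense_pairs(corpus)
print(counts[("dead#1", "mort#1")])  # → 2
```

Counting every pair in the cross product is deliberately naive: without word alignment, a source sense is credited to every target sense in the sentence, which is exactly the source of noise that the word-alignment improvement mentioned in the conclusion would address.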
This way we can obtain, for each sense pair, the probability that a particular sense will be used, which can be defined as

Fig 4.2.1 Probability of sense a

p(a) is the probability of sense a occurring in the sense pair (a, b). count(a, b) is the total number of occurrences of the sense pair (a, b) in the parallel text. count(a) is the total number of times sense a is used for the word W in translations to the corresponding target word in the target language over the full corpus, e.g. how many times the sense "No longer living" translates to the word mort, irrespective of which sense of mort it translates to. In other words, the weight is count(a, b) / count(a).

Explanation with an example: taking the example from Fig 4.1.2, for the word dead the sense pairs look like this: pair 1 (No longer living, Grands chagrins), pair 2 (No longer living, (Figuré) Fin, cessation d'activité), pair 3 (No longer living, Arrêt définitif), and so on. Similar pairs can be made with (hated, Grands chagrins), etc. If we want to know how often the sense "No longer living" (a) translates to the mort sense "Grands chagrins" (b), the formula above computes a weight for that.

Fig 4.2.2 Probability of English sense

Similarly, we can compute the probability, when converting from French to English, that sense b, i.e. "Grands chagrins", occurs in translation to "No longer living" for the word mort (W).

Fig 4.2.3 Probability of French sense

In the same way, we can compute how each of the English senses relates to each of the French senses by giving them a probabilistic weight for the translation. The figure below makes this visually clearer.

Fig 4.2.4 Probability of translated senses from English to French

5 Validation

Validating sense alignments across languages is a difficult task, as sense alignment datasets are scarce and limited to specific language pairs. Due to time constraints, it would be unrealistic to build such an evaluation dataset.
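The probabilistic weight of Section 4.2, count(a, b) / count(a), can be sketched as follows, reusing the same kind of sense-pair counts described above (the sense identifiers and counts are hypothetical toy values):

```python
from collections import Counter

def translation_weight(pair_counts, a, b):
    """Weight of the sense pair (a, b): how often source sense a
    co-occurs with target sense b, normalised by the total number of
    co-occurrences of a with any target sense, i.e. count(a,b)/count(a)."""
    total_a = sum(c for (src, _), c in pair_counts.items() if src == a)
    if total_a == 0:
        return 0.0
    return pair_counts[(a, b)] / total_a

# Toy counts (illustrative only):
pair_counts = Counter({
    ("no_longer_living", "arret_definitif"): 8,
    ("no_longer_living", "grands_chagrins"): 2,
})
print(translation_weight(pair_counts, "no_longer_living", "arret_definitif"))  # → 0.8
```

The symmetric French-to-English weight (Fig 4.2.3) is obtained by swapping the roles of source and target in the normalisation.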
However, as a preliminary validation step, we examine a few interesting examples that highlight the strengths and weaknesses of both proposed approaches, taking a very small subset of Europarl. A good example to check our work
will be a word with a high frequency of occurrence; this way, we are likely to observe the largest number of senses used and a larger statistical weight on each sense. For this, we chose certain English words such as council, commission, house, rights, political, situation, and issue. Based on the translation of these words into French according to our statistics, we can check the rough accuracy of our work. Checking is done by human judgement, since there is currently no system that contains parallel translations of senses in both languages.

5.1 Results

These are some of the translations we obtained at the sense level.

Case 1: for the word commission, taking the monolingual case into account.

Fig 5.1.1 Sorted according to frequency of senses; on the left is the English word and on the right the French

In the English texts, commission occurs most frequently with sense 1, and the corresponding translation of sense 1 of the English commission is commission in French, so we have also ordered the sense frequencies of the French commission. It is quite evident that sense 1 in French is a good translation of English sense 1, and sense 2 is quite similar in both cases, but sense 3 is not accurate: it relates to both sense 1 and sense 3 in French. Similarly, we checked the English word seal, whose most frequent translation is phoque; in this case, only the first two ranks were good translations. There were also very bad examples, such as the English word house, whose target word's sense frequencies were different.

Case 2: for the parallel corpus, we assigned each translation a probabilistic weight, taking the same example as in Fig 5.1.1.

Fig 5.1.1 Sorted according to frequency of senses; on the left is the English word and on the right the French

All the sense pairs that occurred while disambiguating and aligning senses are given a weight; pairs that did not occur are either given a weight of zero or not referenced. In this case, the sense-pair weights are quite accurate.
Due to limited time and the absence of some resources, we could not test many more cases.

5.2 Conclusion and Future Work

Based on the results above and human judgement, we can say that we achieve close to 70 percent accuracy, which is not bad given the many factors that are out of scope for this internship. There is also a problem of data sparseness, given the relatively small size of the dataset used. This can be one way of providing translation links at the sense level in the target language, which in our case is from French to English. A lot can be done to improve this system. For example, we could make use of a word alignment model [Brown et al., 1993] and build our sense pairs on top of the word alignments. Another possible improvement concerns the sense disambiguator, which has so far been treated as a black box: the disambiguator we use, the Simulated Annealing disambiguation method, takes a long time to disambiguate the senses, and work could be done to reduce this time substantially.

Acknowledgments

I am grateful to Prof. Gilles Sérasset and Andon Tchechmedjiev for their helpful comments, discussions and supervision. Without their supervision, this work would not have been possible.

References

[Brown et al., 1993] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.

[Gale and Church, 1993] William A. Gale and Kenneth W. Church. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75–102, 1993.

[Koehn, 2005] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86, 2005.

[Sérasset, 2012] Gilles Sérasset. DBnary: Wiktionary as a lemon-based multilingual lexical resource in RDF. Semantic Web Journal, Special Issue on Multilingual Linked Open Data, 2012.
[Tchechmedjiev et al., 2014] Andon Tchechmedjiev, Gilles Sérasset, Jérôme Goulian, and Didier Schwab. Attaching translations to proper lexical senses in DBnary. In 3rd Workshop on Linked Data in Linguistics: Multilingual Knowledge Resources and Natural Language Processing, to appear, 2014.