Word Sense Alignment with Probabilistic Sense Distribution in a Multilingual and Monolingual Context ∗
Bhaskar Chatterjee
Grenoble INP, University Joseph Fourier
Grenoble, France
Bhaskar.Chatterjee@e.ujf-grenoble.fr
Supervised by: Gilles Sérasset, Andon Tchechmedjiev
I understand what plagiarism entails and I declare that this report
is my own, original work.
Name, date and signature: Bhaskar Chatterjee, 12/06/2015
Abstract
In this article we address the problem of aligning a source sense in one language to a target sense in another language by exploiting source-disambiguated translation links in Dbnary. We propose to use Word Sense Disambiguation (WSD) to annotate corpora in order to estimate sense distributions. In a first setting where we only have monolingual corpora, we estimate sense assignment distributions in both the source and the target language and align senses based on the assumption that, if the source and target languages are closely related, the relative sense distributions will be similar. We rerank senses on both sides and align senses that have the same rank. We then leverage the Europarl parallel corpus and disambiguate it to estimate the distribution of bilingual sense alignments and the most probable alignment target for each source sense. We tested both approaches on a subset of Europarl, first ignoring sentence alignments and then exploiting them to generate the bilingual sense alignment model. We validate the output of the alignment on a few significant examples.
Keywords: word sense disambiguation, multilingual natural language processing, lexical semantics, sense similarity
1 Introduction
Human language is highly ambiguous. There is a large number of languages, and each language contains many words with more than one meaning, so it is difficult to know which meaning of a word corresponds correctly to a word in another language. For instance, the English noun plant can mean green plant or factory; similarly, the French word feuille can mean leaf or paper. The correct sense of an ambiguous word can be selected based on the context where it occurs, and correspondingly the problem of word sense disambiguation is defined as the task of automatically assigning the most appropriate meaning to a polysemous word within a given context. There are multilingual lexical resources like Dbnary that contain translation links between top-level entries or between a sense and a top-level entry. Initially there were only translation links between top-level entries, and previous work [Tchechmedjiev et al., 2014] has aligned the translation links with specific source senses based on textual definitions that describe the source sense. However, the targets of these translation links remain top-level entries: there is no prior information that indicates which target sense should be preferred. We have to turn to external resources to extract information that will allow the alignment of the targets.

∗ These match the formatting instructions of IJCAI-07. The support of IJCAI, Inc. is acknowledged.
One solution is to exploit large parallel corpora that have been manually disambiguated and where the correct senses are assigned to each word. However, sense-annotated corpora exist in only a few languages (English, French), are not parallel, and are moreover relatively small.
Consequently, we turn to using word sense disambiguation to obtain annotated corpora that we can exploit to estimate sense distributions. The figure below pictorially describes the problem of sense alignment.
Fig 1.1 Sense-Word translation
In this article, we will focus on similarity-based methods. These methods assign scores to word senses through semantic similarity (between word senses), and globally find the sense combinations maximising the score over a text. In other words, a local measure is used to assign a similarity score between two lexical objects (senses, words, constituencies) and a global algorithm is used to propagate the local measures to a higher level.
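To make this idea concrete, the following is a minimal Python sketch of how a local similarity measure could be combined by a global algorithm over a short window of words. The function names and the brute-force search are illustrative assumptions only; the actual WSD system used in this work is not specified here.

from itertools import product

def disambiguate_window(words, senses_of, local_sim):
    """Choose one sense per word so that the sum of pairwise local
    similarities over the window is maximal (brute-force global search)."""
    candidate_lists = [senses_of(w) for w in words]
    best_combo, best_score = None, float("-inf")
    for combo in product(*candidate_lists):
        # global score: propagate the local measure by summing it
        # over every pair of chosen senses in the window
        score = sum(local_sim(a, b)
                    for i, a in enumerate(combo)
                    for b in combo[i + 1:])
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo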
For multilingual corpora where we do not have parallel translated texts, we extract the senses used for words and count the sense distribution of each word disambiguated by the existing WSD system; under the assumption that the source language and the target language have similar sense distributions, we assign translation weights from each source sense to the target senses in the target language. For parallel texts, since we already have the translations, we extract sense pairs from parallel sentences. On these pairs we check how much each sense depends on the corresponding sense by giving them a probabilistic weight.
2 State of the art
This section describes the existing systems upon which our work is based.
2.1 Training Data from Parallel Texts
In this section, we describe the parallel texts used in our experiments and the process of gathering training data from them. For our work we used the Europarl corpus [Koehn, 2005]. The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek. The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. Sentence boundaries are identified with a preprocessor, and Europarl is sentence-aligned by a tool based on the Church and Gale algorithm [Gale and Church, 1993].
Size of Corpus
The table below shows, for each language pair, the number of aligned sentences and the word counts on each side.

Parallel Corpus (L1-L2)    Sentences    L1 Words      English Words
Bulgarian-English            406,934           -          9,886,291
Czech-English                646,605   12,999,455        15,625,264
Danish-English             1,968,800   44,654,417        48,574,988
German-English             1,920,209   44,548,491        47,818,827
Greek-English              1,235,976           -         31,929,703
Spanish-English            1,965,734   51,575,748        49,093,806
Estonian-English             651,746   11,214,221        15,685,733
Finnish-English            1,924,942   32,266,343        47,460,063
French-English             2,007,723   51,388,643        50,196,035
Hungarian-English            624,934   12,420,276        15,096,358
Italian-English            1,909,115   47,402,927        49,666,692
Lithuanian-English           635,146   11,294,690        15,341,983
Latvian-English              637,599   11,928,716        15,411,980
Dutch-English              1,997,775   50,602,994        49,469,373
Polish-English               632,565   12,815,544        15,268,824
Portuguese-English         1,960,407   49,147,826        49,216,896
Romanian-English             399,375    9,628,010         9,710,331
Slovak-English               640,715   12,942,434        15,442,233
Slovene-English              623,490   12,525,644        15,021,497
Swedish-English            1,862,234   41,508,712         5,703,795

Fig 2.1.1 Europarl Corpus: sentence and word counts per language pair
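For reference, Europarl is distributed as sentence-aligned plain-text file pairs with one sentence per line on each side. The Python sketch below shows how such aligned pairs can be iterated over before disambiguation; the file names in the usage comment are examples from the public release, not paths used verbatim in this work.

def aligned_sentence_pairs(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from a sentence-aligned
    Europarl file pair (one sentence per line, identical line counts)."""
    with open(src_path, encoding=encoding) as src, \
         open(tgt_path, encoding=encoding) as tgt:
        for src_line, tgt_line in zip(src, tgt):
            yield src_line.strip(), tgt_line.strip()

# Example usage (hypothetical paths):
# for en, fr in aligned_sentence_pairs("europarl-v7.fr-en.en",
#                                      "europarl-v7.fr-en.fr"):
#     ...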
2.2 Dbnary
Dbnary [Sérasset, 2012] is the data extracted from Wiktionary as a lemon-based multilingual lexical resource. The extracted data is available as linked data. The main idea of Dbnary is to create a lexical resource that is structured as a set of monolingual dictionaries + bilingual translation information. This way, the structure of the extracted data follows the usual structure of Machine Readable Dictionaries (MRD).
Dbnary Lexical Structure Example
Fig 2.2 Dbnary lexical entry for cat (figures from [Sérasset, 2012])
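To make this structure concrete, here is a simplified sketch of a Dbnary-style entry. The field names are illustrative assumptions (the actual resource is lemon/RDF linked data); the point is that a lexical entry carries its word senses, while a translation link may be disambiguated on the source side but still points only to a top-level vocable on the target side.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class WordSense:
    sense_id: str   # identifier of one sense of the entry
    gloss: str      # textual definition describing the sense

@dataclass
class Translation:
    source_sense_id: Optional[str]  # known after source-side disambiguation
    target_language: str
    target_vocable: str             # top-level entry only: no target sense yet

@dataclass
class LexicalEntry:
    lemma: str
    language: str
    part_of_speech: str
    senses: List[WordSense] = field(default_factory=list)
    translations: List[Translation] = field(default_factory=list)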
3 Related Work
Despite a large body of work concerning word sense disambiguation (WSD), the use of WSD on parallel corpora is poorly studied; little has been done at the sense level for parallel texts in both the source and the target language. In Dbnary, sense tagging is done only in the source language and not in the target language. Previously, [Sérasset, 2012] extracted lexical entries for 10 languages from Wiktionary. Wiktionary contains translations that have glosses associated at the source side which identify what sense they belong to in the source language, and based on these glosses¹ previous work used an adaptation of various textual and semantic similarity techniques based on partial or fuzzy gloss overlaps to disambiguate the translation relations and then extract some of the sense number information present.
Presently, Dbnary provides translation links from senses in the source language towards top-level entries (Vocables). In this work we aim at improving Dbnary by also aligning these translation links to senses, rather than top-level entries, in the target language.
There are some inherent challenges with the system: for example, if the POS tagger fails to produce the right results or the sense disambiguation fails, there is a chance that our whole work produces incorrect results, because it is based on statistical data that are heavily dependent on these systems. Below is a pictorial representation of the flow of events.

Fig 3.1 Flow Of Events (figures from [Sérasset, 2012])

It is also very difficult to find errors, since the dataset is huge (roughly 2 million sentences) and we do not have manually annotated senses for the data in the Europarl corpus.

¹ Glosses are often associated with translations to make the information available for computer programmes that may in turn be directed towards helping users understand, whether through a textual definition or a target sense number.
4 Method
4.1 Monolingual Sense Frequencies
A large number of languages exist today, and for many language pairs it is hard to get a large translated corpus (comparable corpus) that is also aligned at the sentence level (parallel corpus). In this case, we can only use monolingual corpora, and finding the word in the target language that correctly translates a word from the source text is very difficult. We want to assign a translation relation to a particular sense, so we need information that tells us how often a word used in one sense is translated into a target word used in a particular sense. Obtaining parallel corpora is costly, so we are looking for another solution for language pairs where no parallel text is available. Under such conditions we can use a monolingual corpus on each side, but there is one very important condition for the hypothesis to hold: we must consider closely related languages and assume that the sense distributions (and thus the ordering of senses for particular words) are similar across the two languages. Closely related languages (especially if they are culturally close) tend to share similar senses and sense distributions. Assuming the languages are closely related, we sort senses by frequency on both sides and align them: the translation link from the source sense is assigned the sense with the same rank in the target language.
For instance, the word dead in English has many meanings (senses), one of which is No longer living, which translates to the word mort in French. This French word mort has its own list of senses. The real question is which meaning of the word mort should be taken, as shown in Fig 4.1.1 below, for the correct translation at sense level.
Fig 4.1.1 Translation sense to word
Fig 4.1.2 For dead and mort, senses are reordered according to their usage in the corresponding languages and then aligned by position
So our solution (see Fig 4.1.2) is to order the senses of words on both sides according to their usage in the respective languages. We then take the sense position of No longer living (ordered according to its usage in English), which after reordering is 2, and map it to the sense at the same position in the target language after reordering, i.e. sense 2 of mort, "Moment ou lieu où cet arrêt des fonctions vitales se produit" (the moment or place where this cessation of vital functions occurs), assuming the sense distributions of the words remain the same across the two languages. This technique has its own limitations: if the languages are culturally very different, for example English and Hindi, then it is likely that the sense distributions diverge.
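As a concrete illustration, the following is a minimal Python sketch of this rank-based alignment. It assumes sense-annotated occurrences of a word are already available from a monolingual corpus on each side; the variable names and the example sense labels are purely illustrative and are not part of Dbnary or of the actual system.

from collections import Counter

def rank_senses(annotated_occurrences):
    # Order the senses observed for one word by decreasing corpus frequency.
    counts = Counter(annotated_occurrences)
    return [sense for sense, _ in counts.most_common()]

def align_by_rank(source_occurrences, target_occurrences):
    # Pair the i-th most frequent source sense with the i-th most frequent target sense.
    source_ranked = rank_senses(source_occurrences)
    target_ranked = rank_senses(target_occurrences)
    return list(zip(source_ranked, target_ranked))

# Hypothetical sense-tagged occurrences of "dead" (English) and "mort" (French)
dead_occurrences = ["dead#no_longer_living", "dead#hated", "dead#no_longer_living"]
mort_occurrences = ["mort#arret_vital", "mort#grands_chagrins", "mort#arret_vital"]

print(align_by_rank(dead_occurrences, mort_occurrences))
# [('dead#no_longer_living', 'mort#arret_vital'), ('dead#hated', 'mort#grands_chagrins')]

Under the similar-distribution assumption, the pair returned for each rank is taken as the sense-level translation link; when the two ranked lists have different lengths, the extra senses on the longer side simply remain unaligned.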
4.2 Sense Frequencies in Parallel translated Text
Since we are using europarl corpus and parallel texts are
available, there is no need to make assumption of languages
with same sense frequencies in this case computation of
sense distribution is possible. Since sentence aligned texts
are available, we can compute for each sense assigned to a
word in English (any language can be taken) and take all
senses assigned to the corresponding sentence in French(here
as well other language can be taken) and take a cross
product of sense of the english word with all senses of the
french words. This way we will have a list of sense pair
in english-french. Now all unique sense pairs will have a
count which symbolizes how frequently these pairs occur
together. To find out the dependency of each english word
sense with the corresponding senses in french we are using a
probabilistic approach where we calculate a probable weight
of each sense in a language over the total sense pair count.
This way we can get for each sense pair what is the proba-
bility that a particular sense will be used that can be defined as
Fig 4.2.1 Probability of sense a
p(a) is the probability of sense a occurring in the sense pair (a, b).
Count(a, b) is the total number of occurrences of the sense pair (a, b) in the parallel text.
Count(a) is the total number of times sense a is used for the word W when translating to the corresponding target word in the target language, over the full corpus; e.g. how many times the sense "No longer living" translates to the word mort, irrespective of which sense of mort it translates to.
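The formula in Fig 4.2.1 is not reproduced here; one plausible reconstruction from the textual definitions above (an assumption on our part, not the exact formula in the figure) is

\[
p(a) = \frac{\mathrm{Count}(a,b)}{\mathrm{Count}(a)},
\qquad
p(b) = \frac{\mathrm{Count}(a,b)}{\mathrm{Count}(b)},
\]

where the left expression would correspond to the English-side weight (Fig 4.2.2) and the right one to the symmetric French-side weight (Fig 4.2.3).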
Explanation with an example: taking the example from Fig 4.1.2, the sense pairs for the word dead would look something like pair1 (No longer living, Grands chagrins), pair2 (No longer living, (Figuré) Fin, cessation d'activité), pair3 (No longer living, Arrêt définitif), ... Similarly, pairs can be made with (hated, Grands chagrins) and so on. If we want to know how many times sense a (No longer living) translates to the mort sense b (Grands chagrins), the above formula computes a weight for that.
Fig 4.2.2 Probability of english sense
Similarly, if we want to compute the probability, when translating from French to English, that sense b, i.e. Grands chagrins, occurs in translation to No longer living for the word mort (W):
Fig 4.2.3 Probability of french sense
Similarly, we can compute how each of the English senses relates to each of the French senses by giving them a probabilistic weight for the translation. The figure below makes this visually clearer.
Fig 4.2.4 Probability of translated senses from english to
french
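To make the counting procedure concrete, below is a minimal Python sketch. It assumes sentence-aligned text in which each sentence has already been reduced to the list of sense identifiers assigned by the disambiguator; the function and variable names are hypothetical and only illustrate the technique described above.

from collections import Counter
from itertools import product

def count_sense_pairs(aligned_sentence_senses):
    # aligned_sentence_senses: iterable of (english_senses, french_senses) per sentence pair.
    pair_counts = Counter()
    for english_senses, french_senses in aligned_sentence_senses:
        # Cross product: every English sense in the sentence is paired with
        # every French sense of the aligned sentence.
        for a, b in product(english_senses, french_senses):
            pair_counts[(a, b)] += 1
    return pair_counts

def english_to_french_weight(pair_counts, a, b):
    # Weight of the pair (a, b): how often a co-occurs with b, normalised by
    # the total number of times a occurs with any French sense.
    count_a = sum(c for (x, _), c in pair_counts.items() if x == a)
    return pair_counts[(a, b)] / count_a if count_a else 0.0

The French-to-English weight is obtained in the same way, normalising by the total count of the French sense b instead of the English sense a.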
5 Validations
Validating sense alignments across languages is a difficult task, as sense alignment datasets are scarce and limited to specific language pairs. Due to time constraints, it would be unrealistic to build such an evaluation dataset. However, as a preliminary validation step, we examine a few interesting examples that highlight the strengths and weaknesses of both proposed approaches. We take a very small subset of Europarl. A good example to check our work is a word with a high frequency of occurrence: this way we are likely to see the largest number of senses used and a larger statistical weight on each sense. For this we have chosen certain English words such as council, commission, house, rights, political, situation and issue. Based on the translations of these words into French according to our statistics, we can check the accuracy of our work. Checking is done by human judgement, since there is currently no system that contains parallel translations of senses in both languages.
5.1 Results
These are some of the translations which we obtain at the sense level.
Case 1: For the word commission, taking into account the monolingual case.
Fig 5.1.1 Senses sorted according to frequency; on the left is the English word and on the right the French one
In the English texts, commission occurs most frequently with sense 1, and the corresponding translation of sense 1 of the English commission is commission in French, so we have also ordered the sense frequencies of the French commission. It is quite evident that sense 1 in French can be a good translation of English sense 1, and sense 2 in both cases is quite similar, but sense 3 is not quite accurate: English sense 3 relates to both sense 1 and sense 3 in French. Similarly, we checked the word seal in English, whose most frequent translation is phoque; in this case only the first two ranks were good translations. There were also very bad examples, such as the English word house, whose target word's sense frequencies were different.
Case 2: For the parallel corpus we assigned each translation a probabilistic weight, taking the same example as in Fig 5.1.1.
All the sense pairs that occurred while disambiguating and aligning senses are given a weight; pairs that did not occur are either given a weight of zero or not referenced. In this case the sense pair weights are quite accurate. Due to limitations in time and the absence of some resources, we could not test many more cases.
5.2 Conclusion and Future Work
Based on the results above and human judgement, we can say that we are close to 70 percent accurate, which is not bad given the many factors that are out of the scope of this internship. There is also a problem of data sparseness, given the relatively small size of the dataset used. This can be one way of providing translation links at the sense level in the target language, which in our case can be from French to English. A lot can be done to improve this system: for example, we can make use of a word alignment model [Brown et al., 1993] and build our sense pairs on top of the word alignments. Another component, which has so far been treated as a black box, is the sense disambiguator. The disambiguator currently used, the Simulated-Annealing-Disambiguation method, takes a lot of time to disambiguate the senses; work can be done to reduce this time substantially.
Acknowledgments
I am grateful to Prof. Gilles Sérasset and Andon Tchechmedjiev for their helpful comments, discussions and supervision. Without their supervision this work would not have been possible.
References
[Brown et al., 1993] Peter F Brown, Vincent J Della Pietra,
Stephen A Della Pietra, and Robert L Mercer. The math-
ematics of statistical machine translation: Parameter esti-
mation. Computational linguistics, 19(2):263–311, 1993.
[Gale and Church, 1993] William A Gale and Kenneth W
Church. A program for aligning sentences in bilingual cor-
pora. Computational linguistics, 19(1):75–102, 1993.
[Koehn, 2005] Philipp Koehn. Europarl: A parallel corpus
for statistical machine translation. In MT summit, vol-
ume 5, pages 79–86. Citeseer, 2005.
[Sérasset, 2012] Gilles Sérasset. Dbnary: Wiktionary as a lemon-based multilingual lexical resource in RDF. Semantic Web Journal, Special issue on Multilingual Linked Open Data, 2012.
[Tchechmedjiev et al., 2014] Andon Tchechmedjiev, Gilles Sérasset, Jérôme Goulian, and Didier Schwab. Attaching translations to proper lexical senses in Dbnary. In 3rd Workshop on Linked Data in Linguistics: Multilingual Knowledge Resources and Natural Language Processing, pages to appear, 2014.