A Survey on Unsupervised Graph-based Word Sense Disambiguation

Elena-Oana Tăbăranu
Faculty of Computer Science
"Alexandru I. Cuza" University of Iași
elena.tabaranu@info.uaic.ro
Abstract. This paper presents comparative evaluations of graph-based word sense disambiguation techniques using several measures of word semantic similarity and several ranking algorithms. Unsupervised word sense disambiguation has received a lot of attention lately because of its fast execution time and its ability to make the most of a small input corpus. Recent state-of-the-art graph-based systems have tried to close the gap between the supervised and the unsupervised approaches.
Key words: WordNet, WSD, Semantic Graphs, SAN, HITS, PageRank, P-Rank
1 Introduction
The problem of word sense disambiguation (WSD) is defined by Sinha et al. [2] as the task of automatically assigning the most appropriate meaning to a polysemous word within a given context.
WSD methods are critical for solving natural language processing tasks like machine translation and speech processing, but they also boost the performance of other tasks like text retrieval, document classification and document clustering. Approaches found in the literature face a trade-off between unsupervised and supervised methods: the former have fast execution times but low accuracy, while the latter require training on a large amount of manually annotated data.
Graph-based methods make the most of the semantic model they employ, thus trying to close the gap between the unsupervised and supervised approaches.
This paper is organized as follows. It first describes the latest state-of-the-art methods for unsupervised graph-based word sense disambiguation. Next, it presents several comparative evaluations carried out on the Senseval data sets using the same semantic representation.
2 State of the Art
2.1 Supervised Word Sense Disambiguation
Supervised word sense disambiguation systems have an accuracy of 60%-70%, while the unsupervised ones struggle between 45% and 60%. Most approaches transform the sense of a particular word into a feature vector to be used in the learning process. The major disadvantage of such supervised learning methods emerges from the knowledge acquisition bottleneck: their accuracy is strongly connected to the amount of annotated corpora available. State-of-the-art results include Mihalcea and Csomai [3]'s SenseLearner, which employs seven semantic models trained using a memory-based algorithm, the Simil-Prime¹ system and the results reported by Hoste et al.²
SenseLearner uses a minimally supervised approach because its aim is to process a relatively small data set for training and also to generalize the learned concepts as global models for general word categories. SenseLearner takes as input raw text, which is preprocessed before computing the feature vectors. Next, a semantic model is learned for all predefined word categories, which are defined as groups of words that share some common syntactic or semantic properties. Once defined and trained, the models are used to annotate the ambiguous words in the test corpus with their corresponding meaning.
The SenseLearner system was trained on the SemCor semantically annotated dataset and evaluated on the Senseval 2 and 3 English All Words data sets, with results of 71.3% and 68.1% respectively. The best supervised results were reported by the SMUaw³ and GAMBL⁴ systems as winners of the Senseval 2 and 3 All English Words Tasks. The former is based on pattern learning from sense-tagged corpora and instance-based learning with automatic feature selection, while the latter needs extensive training using memory-based classifiers.
2.2 Unsupervised Word Sense Disambiguation
Unsupervised word sense disambiguation systems seek to identify the best sense candidates over a model of the word sense dependencies in text. Such systems use a metric of semantic similarity to compute the relatedness between senses and an algorithm which chooses their most likely combination.
¹ Kohomban, U., Lee, W.: Learning semantic classes for word sense disambiguation. In Proc. of ACL, pages 34-41, 2005.
² Hoste, V., Daelemans, W., Hendrickx, I., van den Bosch, A.: Evaluating the results of the memory-based word-expert approach to unrestricted word sense disambiguation. In Proc. of the ACL Workshop on Word Sense Disambiguation, 2002.
³ Mihalcea, R.: Word sense disambiguation with pattern learning and automatic feature selection. Natural Language Engineering, 1(1):1-15, 2002.
⁴ Decadt, B., Hoste, V., Daelemans, W., van den Bosch, A.: GAMBL, genetic algorithm optimization for memory-based WSD. In Proc. of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, 2004.
Fig. 1. Semantic model learning in SenseLearner.
Sinha et al. [2] have evaluated six measures of semantic similarity, assuming as input a pair of concepts from the WordNet⁵ hierarchy: Leacock & Chodorow⁶ (lch), Lesk⁷ (lesk), Wu & Palmer⁸, Resnik⁹, Lin¹⁰, and Jiang & Conrath¹¹ (jcn). They also use a normalization technique to implement a combination of the similarity measures, which accounts for the strength of each individual metric.
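The paper does not give the exact normalization Sinha et al. use, but the idea of combining measures after scaling can be sketched as follows. The function name, the min-max scaling, and the plain averaging are all illustrative assumptions, not the authors' implementation:

```python
def combine_normalized(scores_by_measure):
    """Combine several similarity measures by min-max scaling each
    measure's scores into [0, 1] over the candidate set, then averaging.
    Illustrative sketch only; the paper's exact normalization may differ."""
    n = len(scores_by_measure)
    candidates = next(iter(scores_by_measure.values()))
    combined = {cand: 0.0 for cand in candidates}
    for scores in scores_by_measure.values():
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against a constant measure
        for cand, s in scores.items():
            combined[cand] += (s - lo) / span / n
    return combined
```

Scaling first matters because measures like jcn and lesk live on very different numeric ranges, so an unscaled sum would let one measure dominate.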
Leacock & Chodorow is a similarity metric computed using equation (1), where length is the length of the shortest path between two concepts using
⁵ Fellbaum, C.: WordNet: an electronic lexical database. MIT Press, 1998.
⁶ Leacock, C., Chodorow, M.: Combining local context and WordNet sense similarity for word sense identification. In WordNet, An Electronic Lexical Database. The MIT Press, 1998.
⁷ Lesk, M.: Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the SIGDOC Conference 1986, Toronto, June 1986.
⁸ Wu, Z., Palmer, M.: Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, 1994.
⁹ Resnik, P.: Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, Canada, 1995.
¹⁰ Lin, D.: An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI, 1998.
¹¹ Jiang, J., Conrath, D.: Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan, 1997.
node-counting, and D is the maximum depth of the taxonomy.

$sim_{lch} = -\log \frac{length}{2 \cdot D}$   (1)
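Equation (1) is straightforward to compute once the path length and taxonomy depth are known. A minimal sketch (the function name and the depth value in the test are illustrative; D = 16 is commonly cited for the WordNet noun hierarchy, but is an assumption here):

```python
import math

def lch_similarity(path_length, max_depth):
    """Leacock & Chodorow similarity, equation (1): -log(length / (2 * D)).
    `path_length` is the shortest path between the two concepts counted by
    nodes; `max_depth` is the maximum depth D of the taxonomy."""
    return -math.log(path_length / (2.0 * max_depth))
```

Because of the negative log, concepts joined by a short path score higher than concepts far apart in the hierarchy.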
The metric introduced by Jiang & Conrath uses the least common subsumer (LCS) and combines the information content (IC) of the two input concepts:

$sim_{jcn} = \frac{1}{IC(concept_1) + IC(concept_2) - 2 \cdot IC(LCS)}$   (2)
The information content is defined as:

$IC(c) = -\log(P(c))$   (3)
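Equations (2) and (3) can be sketched directly; the function names and the probability values in the usage below are illustrative, not from the paper:

```python
import math

def information_content(probability):
    """Equation (3): IC(c) = -log(P(c)); rarer concepts carry more information."""
    return -math.log(probability)

def jcn_similarity(ic_c1, ic_c2, ic_lcs):
    """Equation (2): the closer the two concepts are, in information content,
    to their least common subsumer, the smaller the denominator and the
    higher the similarity."""
    return 1.0 / (ic_c1 + ic_c2 - 2.0 * ic_lcs)
```

A more specific (rarer, higher-IC) least common subsumer shrinks the denominator, so the similarity grows.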
The table below shows that a combination of the jcn, lch and lesk measures performs better than using them individually.
Table 1. Results for the individual and combined similarity measures

            jcn    lch    lesk   combined
Precision   51.57  41.47  51.87  53.43
Recall      19.12  16.02  44.97  53.43
F-measure   27.89  23.11  48.17  53.43
Tsatsaronis et al. [1] propose a new node similarity algorithm, P-Rank, for their graph representation, which actually does not seem to perform better than the other unsupervised methods. They justify the lower results based on Navigli and Lapata¹²'s observations, which also reported lower performance for the betweenness and indegree measures of structural similarity.
2.3 Graph-based Methods
Graph-based methods model the word sense dependency in text using a graph representation. Senses are represented as labelled nodes in the graph, and weighted edges are added to mark the dependency among them. Each word has a window associated with it, including several words before and after that word, which in turn means that each word has a corresponding graph associated with it, and it is that word that gets disambiguated after the ranking algorithms are run on that graph. The node with the highest value is chosen as the most probable sense for that word.
Sinha et al. [2] have noticed a remarkable property that makes these graph-based algorithms appealing: the fact that they take into account information
¹² Navigli, R., Lapata, M.: Graph connectivity measures for unsupervised word sense disambiguation. In Proc. of IJCAI, pages 1683-1688, 2007.
drawn from the entire graph, capturing relationships among all the words in a
sequence, which makes them superior to other approaches that rely only on local
information individually derived for each word.
2.4 Semantic Graph Construction
Graph-based methods usually associate a node with each word to be processed. Senses can be represented as labels, and their dependencies are indicated as edge weights. The likelihood of each sense can be determined using a graph-based ranking algorithm, which runs over the graph of potential senses and identifies the 'best' one.
Given a sequence of words $W = w_1, w_2, w_3, w_4$ and their corresponding labels $L_{w_i} = l_{w_i}^1, l_{w_i}^2, \ldots, l_{w_i}^{N_{w_i}}$, Sinha et al. [2] define a labeled graph $G = (V, E)$ such that there is a node $v \in V$ for every possible label $l_{w_i}^j$, $i = 1..n$, $j = 1..N_{w_i}$. Edges $e \in E$ map the dependencies between pairs of labels.
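This construction can be sketched in a few lines. The function name is hypothetical, and the default constant-weight similarity is a placeholder for one of the WordNet measures discussed above:

```python
import itertools

def build_label_graph(labels_per_word, similarity=lambda a, b: 1.0):
    """Build the labeled graph G = (V, E): one node per candidate label,
    and a weighted edge between every pair of labels belonging to
    *different* words (senses of the same word are never connected).
    `similarity` stands in for a real semantic similarity measure."""
    nodes = [(w, lab) for w, labels in labels_per_word.items() for lab in labels]
    edges = {}
    for (w1, l1), (w2, l2) in itertools.combinations(nodes, 2):
        if w1 != w2:  # no edges between labels of the same word
            edges[((w1, l1), (w2, l2))] = similarity(l1, l2)
    return nodes, edges
```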
Fig. 2. Sample semantic representation used by Sinha et al. [2] for a sequence of four words w1, w2, w3, w4 and their corresponding labels.
Tsatsaronis et al. [1] have used a semantic model which contains only the words that have an entry in the WordNet thesaurus. Their approach first adds all the words and their corresponding senses, represented by WordNet synsets, to the network (Initial Phase). The expansion phase extends the network iteratively for each word with all the semantically related senses from WordNet (Expansion Round 1) until the network is connected. Failing to construct a connected network implies that the words in the sentence cannot be disambiguated. In the next step, weights computed based on the frequency of each edge type are added (Expansion Round 2). At some point in the construction phase, some nodes could share the same sense (Expansion Example 2), and in this particular case only one labelled node is added to the network.
Fig. 3. Sample semantic representation used by Tsatsaronis et al. [1] for terms $t_i$ and $t_j$ and their corresponding senses.
Other approaches in the literature have used the gloss words of the WordNet entries¹³, have defined additional composite semantic relations¹⁴ or have used the Extended WordNet to enhance their model¹⁵.
2.5 Spreading of Activation (SAN) Method
The spreading of activation in semantic networks method proposed by Tsatsaronis et al. [4] considers all nodes to have an activation level of 0, except for the input nodes
¹³ Veronis, J., Ide, N.: Word sense disambiguation with very large neural networks extracted from machine readable dictionaries. In Proc. of COLING, pages 389-394, 1990.
¹⁴ Mihalcea, R., Tarau, P., Figa, E.: PageRank on semantic networks with application to word sense disambiguation. In Proc. of COLING, 2004.
¹⁵ Agirre, E., Soroa, A.: Personalizing PageRank for word sense disambiguation. In Proc. of EACL, pages 33-41, 2009.
which have a value of 1. At each iteration p, the node j propagates its output activation $O_j(p)$ to its neighbours as a function f of its current activation value $A_j(p)$ and the weights of the edges that connect it with its neighbours.

$O_j(p) = f(A_j(p))$   (4)
The activation level of a node k at iteration p is influenced by the output function at iteration p − 1 of every neighbour j with a direct edge $e_{jk}$. $W_{jk}$ is the function for the edge weights.

$A_k(p) = \sum_j O_j(p-1) \cdot W_{jk}$   (5)
The function to compute the output activation level must be chosen carefully since the network can be flooded. Tsatsaronis et al. [1] use the function of equation (6) with a threshold value τ to prevent nodes with a low activation level from influencing their corresponding neighbours. The factor $\frac{1}{1+p}$ reduces the influence of a node on its neighbours as iterations go by, while the $F_j$ function reduces the influence of nodes that connect to many neighbours. This algorithm requires no training.

$O_j(p) = \begin{cases} 0 & \text{if } A_j(p) < \tau \\ \frac{F_j}{p+1} \cdot A_j(p) & \text{otherwise} \end{cases}$   (6)
$C_T$ represents the total number of nodes, while $C_j$ is the number of nodes with a direct edge from j.

$F_j = 1 - \frac{C_j}{C_T}$   (7)
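One possible reading of equations (4)-(7) can be sketched as an iteration over a small graph. This is a hypothetical sketch, not the authors' implementation: the function name, the data layout (neighbour lists plus an edge-weight dict), and the seed handling are all assumptions:

```python
def spread_activation(neighbours, weights, seeds, tau=0.01, iterations=5):
    """Spread activation from seed nodes. `neighbours[j]` lists the nodes j
    links to; `weights[(j, k)]` is the edge weight W_jk. Seeds start at
    activation 1, everything else at 0."""
    total = len(neighbours)  # C_T, the total number of nodes
    activation = {n: (1.0 if n in seeds else 0.0) for n in neighbours}
    for p in range(1, iterations + 1):
        output = {}
        for j, a in activation.items():
            if a < tau:
                output[j] = 0.0  # below threshold: no propagation (eq. 6)
            else:
                fj = 1.0 - len(neighbours[j]) / total  # fan-out penalty (eq. 7)
                output[j] = fj / (p + 1) * a           # damped output (eq. 6)
        # Equation (5): each node sums the weighted outputs of the
        # neighbours that have a direct edge to it.
        activation = {
            k: sum(output[j] * weights.get((j, k), 0.0)
                   for j in neighbours if k in neighbours[j])
            for k in neighbours
        }
    return activation
```

After one iteration on a tiny two-node chain, activation flows from the seed to its neighbour while an isolated node stays at zero.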
2.6 Page-Rank Method
Page-Rank is a graph ranking algorithm based on the idea of “voting” or “rec-
ommendation”. When one node links to another one, it basically offers a recom-
mendation for that other node. The higher the number of recommendations that
are offered for a node, the higher the importance of the node. Furthermore, the
importance of the node offering the recommendation determines how important
the vote itself is, and this information is also taken into account by the ranking
algorithm.
$PageRank(V_a) = (1 - d) + d \sum_{(V_a, V_b) \in E} \frac{PageRank(V_b)}{|degree(V_b)|}$   (8)
Sinha et al. [2] have used the Page-Rank algorithm to recursively score the candidate nodes of a weighted undirected graph. $V_a$ and $V_b$ are two nodes in the
graph connected by edges with weight $w_{ba}$, and the Page-Rank score is computed based on the following equation:

$PageRank(V_a) = (1 - d) + d \sum_{(V_a, V_b) \in E} \frac{w_{ba}}{\sum_{(V_c, V_b) \in E} w_{bc}} PageRank(V_b)$   (9)
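The weighted variant of equation (9) can be sketched as follows; the function name and the edge-dict representation are illustrative assumptions, not from the paper:

```python
def weighted_pagerank(edges, d=0.85, iterations=30):
    """Weighted Page-Rank on an undirected graph, per equation (9):
    each neighbour b contributes its score scaled by this edge's share
    of b's total edge weight. `edges` maps (a, b) pairs to weights."""
    nodes = {n for pair in edges for n in pair}
    adj = {n: {} for n in nodes}
    for (a, b), w in edges.items():  # undirected: register both directions
        adj[a][b] = w
        adj[b][a] = w
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        score = {
            a: (1 - d) + d * sum(w / sum(adj[b].values()) * score[b]
                                 for b, w in adj[a].items())
            for a in nodes
        }
    return score
```

On a small chain a-b-c, the middle node collects recommendations from both ends and ranks highest, which is exactly the "voting" intuition described above.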
2.7 HITS Method
Tsatsaronis et al. [1] use the same semantic representation for the HITS ranking algorithm. This approach identifies the most important nodes in the graph, also known as authorities, and the nodes that point to this kind of nodes, also known as hubs. The major disadvantage of the HITS algorithm is that densely connected nodes can attract the highest score (clique attack). Every node has attached a pair of values for its authority and hub score, with initial values set to 1. Hubs and authorities are iteratively updated using equations (10) and (11).

$authority(p) = \sum_{q \in In(p)} hub(q)$   (10)

$hub(p) = \sum_{r \in Out(p)} authority(r)$   (11)
$In(i)$ contains all the nodes that link to i, and $Out(i)$ all the nodes i links to. Equations (10) and (11) are extended with weights for the graph edges. In equations (12) and (13), $w_{i,j}$ is the weight of the edge connecting node i with node j.

$authority(p) = \sum_{q \in In(p)} w_{q,p} \cdot hub(q)$   (12)

$hub(p) = \sum_{r \in Out(p)} w_{p,r} \cdot authority(r)$   (13)
A normalization is used for the scores which divides each authority by the sum
of all authority values and each hub by the sum of all hub values. The sense with
the highest authority score is chosen as the most likely one for each word.
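The weighted update and sum-normalisation steps can be sketched as below. This is an illustrative sketch under the paper's description (function name and edge-dict layout are assumptions):

```python
def weighted_hits(edges, iterations=20):
    """Weighted HITS. `edges` maps directed (source, target) pairs to
    weights. Authority scores collect weighted hub scores of in-links
    (eq. 12); hub scores collect weighted authority scores of out-links
    (eq. 13); both vectors are normalised by their sums each round."""
    nodes = {n for pair in edges for n in pair}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {p: sum(w * hub[q] for (q, t), w in edges.items() if t == p)
                for p in nodes}
        norm = sum(auth.values()) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(w * auth[r] for (q, r), w in edges.items() if q == p)
               for p in nodes}
        norm = sum(hub.values()) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub
```

In a small example where two nodes both point to a third, the target becomes the authority and the two sources become equal hubs.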
2.8 P-Rank Method
The P-Rank measure¹⁶ is a recently introduced method for the structural similarity of nodes in an information network and represents a generalization of
¹⁶ Zhao, P., Han, J., Sun, Y.: P-Rank: a comprehensive structural similarity measure over information networks. In Proc. of CIKM, pages 553-562, 2009.
other state-of-the-art measures like CoCitation¹⁷, Coupling¹⁸, Amsler¹⁹ and SimRank²⁰. P-Rank is based on the idea that two nodes are similar if they are referenced by, and also reference, similar nodes. $R_{k+1}(a, b)$ represents the P-Rank score for nodes a and b at iteration k + 1 and is computed based on the recursive equation:

$R_{k+1}(a,b) = \lambda \cdot \frac{C}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b)) + (1-\lambda) \cdot \frac{C}{|O(a)||O(b)|} \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} R_k(O_i(a), O_j(b))$
In equations (14) and (15), Incoming(a) and Outgoing(a) are the lists of incoming and outgoing neighbours of node a, and the definitions of $|I(a)|$ and $|O(a)|$ take into consideration the weights of all the edges that connect the neighbours of the node. The parameter $\lambda \in [0, 1]$ is used to balance the weight on the in- and out-link directions; the value Tsatsaronis et al. [1] have chosen for their experiments is 0.5. $C \in [0, 1]$ is a damping factor for the in- and out-link directions with a usual value of 0.8.

$|I(a)| = \sum_{i \in Incoming(a)} w_{i,a}$   (14)

$|O(a)| = \sum_{j \in Outgoing(a)} w_{a,j}$   (15)
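The recursive P-Rank equation can be sketched as a fixed-point iteration. This simplified sketch is *unweighted* (it uses plain neighbour counts rather than the weighted sums of equations 14 and 15), and the function name and data layout are illustrative assumptions:

```python
def p_rank(in_nb, out_nb, lam=0.5, c=0.8, iterations=10):
    """Unweighted P-Rank sketch. `in_nb` / `out_nb` map each node to its
    incoming / outgoing neighbour lists. The in-link term is weighted by
    lambda, the out-link term by (1 - lambda); C damps each recursion."""
    nodes = list(in_nb)
    r = {(a, b): (1.0 if a == b else 0.0) for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0  # a node is maximally similar to itself
                    continue
                in_term = out_term = 0.0
                if in_nb[a] and in_nb[b]:
                    # "referenced by similar nodes"
                    in_term = c * sum(r[(i, j)] for i in in_nb[a]
                                      for j in in_nb[b]) / (len(in_nb[a]) * len(in_nb[b]))
                if out_nb[a] and out_nb[b]:
                    # "reference similar nodes"
                    out_term = c * sum(r[(i, j)] for i in out_nb[a]
                                       for j in out_nb[b]) / (len(out_nb[a]) * len(out_nb[b]))
                new[(a, b)] = lam * in_term + (1 - lam) * out_term
        r = new
    return r
```

For two nodes cited by the same source, the score converges to λ·C·R(x, x) = 0.5 · 0.8 = 0.4, while a pair sharing no neighbours stays at zero.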
3 Experiments and Results
The Senseval 2 and 3 All English Words Task data sets are often used for testing WSD systems since they are manually annotated by human experts. Tables 2 and 3 present the statistics of the data sets for nouns (N), verbs (V), adjectives (Adj), adverbs (Adv) and all the words, computed considering their senses from the WordNet 2 thesaurus. Verbs are the most difficult to disambiguate and have an average polysemy close to 11, while adverbs have an average polysemy close to 1.
¹⁷ Small, H. G.: Co-citation in the scientific literature: A new measure of relationship between two documents. Journal of the American Society for Information Science, 24(4):265-269, 1973.
¹⁸ Kessler, M. M.: Bibliographic coupling between scientific papers. American Documentation, 14(1):10-25, 1963.
¹⁹ Amsler, R.: Application of citation-based automatic classification. Technical report, The University of Texas at Austin, Linguistics Research Center, Austin, TX, 1972.
²⁰ Jeh, G., Widom, J.: SimRank: A measure of structural-context similarity. In Proc. of KDD, pages 538-543, 2002.
Table 2. Polysemous and monosemous occurrences for the Senseval 2 words using WordNet 2

                             N     V      Adj   Adv   All
Monosemous                   260   33     80    91    464
Polysemous                   813   502    352   172   1839
Average polysemy             4.21  9.9    3.94  3.23  5.37
Average polysemy (P. only)   5.24  10.48  4.61  4.41  6.48
Table 3. Polysemous and monosemous occurrences for the Senseval 3 words using WordNet 2

                             N     V      Adj   Adv   All
Monosemous                   193   39     72    13    317
Polysemous                   699   686    276   1     1662
Average polysemy             5.07  11.49  4.13  1.07  7.23
Average polysemy (P. only)   6.19  12.08  4.95  2.0   8.41
A baseline was computed by selecting a random sense from WordNet. Other supervised systems have used as a baseline the most frequent sense in the thesaurus.
Table 4 presents a comparison between different WSD results, independently of the type of methods used. The top three unsupervised methods, PR, HITS and the method of Agirre and Soroa, are compared with the highest results reported in the literature for the Senseval 2 and 3 data sets. The best performing method is the supervised approach Simil-Prime, with an overall accuracy of 65%. The results table shows that, though the unsupervised systems do not perform as well as the supervised ones, they have indeed reduced the gap between the two approaches.
Table 4. Accuracies on the Senseval 2 and 3 All English Words Task data sets.
Dataset SenseLearner Simil-Prime SSI WE FS PR HITS Agi09
Senseval 2 64.82 65.00 n/a 63.2 63.7 58.8 58.3 59.5
Senseval 3 63.01 65.85 60.4 n/a 61.3 56.7 57.4 57.4
4 Conclusions
Recent state-of-the-art WSD systems minimise the gap between supervised and unsupervised approaches. This paper described several graph-based methods which make the most of the rich semantic model they employ. Unsupervised systems also have the advantage of seeking the optimal values for their parameters using as little data as possible and testing on as large a dataset as possible.
Future work could investigate the results of the recently introduced P-Rank
algorithm on a different model, like the one proposed by Sinha et al. [2]. This way, we could investigate the influence of the model upon each algorithm's results.
5 References

1. Tsatsaronis, G., Varlamis, I., Norvag, K.: An Experimental Study on Unsupervised Graph-based Word Sense Disambiguation. In Proc. of CICLing (2010).
2. Sinha, R., Mihalcea, R.: Unsupervised graph-based word sense disambiguation using measures of semantic similarity. In Proc. of ICSC (2007).
3. Mihalcea, R., Csomai, A.: SenseLearner: Word sense disambiguation for all words in unrestricted text. In Proc. of ACL, pages 53-56 (2005).
4. Tsatsaronis, G., Vazirgiannis, M., Androutsopoulos, I.: Word Sense Disambiguation with Spreading Activation Networks Generated from Thesauri. In Proc. of IJCAI (2007).