A Survey on Unsupervised Graph-based Word Sense Disambiguation



Elena-Oana Tăbăranu
Faculty of Computer Science, "Alexandru I. Cuza" University of Iași
elena.tabaranu@info.uaic.ro

Abstract. This paper presents comparative evaluations of graph-based word sense disambiguation techniques using several measures of word semantic similarity and several ranking algorithms. Unsupervised word sense disambiguation has received a lot of attention lately because of its fast execution time and its ability to make the most of a small input corpus. Recent state-of-the-art graph-based systems have tried to close the gap between the supervised and the unsupervised approaches.

Key words: WordNet, WSD, Semantic Graphs, SAN, HITS, PageRank, P-Rank

1 Introduction

The problem of word sense disambiguation (WSD) is defined by Sinha et al.[2] as the task of automatically assigning the most appropriate meaning to a polysemous word within a given context. WSD methods are critical for natural language processing tasks such as machine translation and speech processing, and they also boost the performance of tasks such as text retrieval, document classification and document clustering. Approaches in the literature face a trade-off between unsupervised and supervised methods: the former have fast execution times but low accuracy, while the latter require training on a large amount of manually annotated data. Graph-based methods make the most of the semantic model they employ, thus trying to close the gap between the unsupervised and supervised approaches.

This paper is organized as follows. It first describes the latest state-of-the-art methods for unsupervised graph-based word sense disambiguation. Next, it presents several comparative evaluations carried out on the Senseval data sets using the same semantic representation.
2 State of the Art

2.1 Supervised Word Sense Disambiguation

Supervised word sense disambiguation systems reach an accuracy of 60%-70%, while unsupervised ones range between 45% and 60%. Most approaches transform the sense of a particular word into a feature vector to be used in the learning process. The major disadvantage of such supervised learning methods is the knowledge acquisition bottleneck: their accuracy is strongly tied to the amount of annotated corpus available.

State-of-the-art results include Mihalcea and Csomai[3]'s SenseLearner, which employs seven semantic models trained with a memory-based algorithm, the Simil-Prime^1 system, and the results reported by Hoste et al.^2 SenseLearner uses a minimally supervised approach: its aim is to train on a relatively small data set and to generalize the learned concepts as global models for general word categories. SenseLearner takes raw text as input, which is preprocessed before computing the feature vectors. Next, a semantic model is learned for each predefined word category, where a category is a group of words that share some common syntactic or semantic properties. Once defined and trained, the models are used to annotate the ambiguous words in the test corpus with their corresponding meanings. The SenseLearner system was trained on the SemCor semantically annotated dataset and evaluated on the Senseval 2 and 3 English All Words data sets, with results of 71.3% and 68.1% respectively.

The best supervised results were reported by the SMUaw^3 and GAMBL^4 systems, winners of the Senseval 2 and 3 All English Words Task. The former is based on pattern learning from sense-tagged corpora and instance-based learning with automatic feature selection, while the latter needs extensive training using memory-based classifiers.
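The memory-based (instance-based) learning idea behind systems such as SenseLearner and GAMBL can be illustrated with a minimal nearest-neighbour sense classifier. The bag-of-words features and the sense-tagged toy instances below are illustrative assumptions, not the actual features of either system:

```python
from collections import Counter

def features(context_words):
    """Bag-of-words feature vector for an ambiguous word's context."""
    return Counter(w.lower() for w in context_words)

def overlap(f1, f2):
    """Number of shared context words (a crude similarity)."""
    return sum((f1 & f2).values())

def knn_disambiguate(context_words, training_instances):
    """Pick the sense of the most similar training instance (1-NN),
    the core idea of memory-based WSD."""
    target = features(context_words)
    best_sense, best_score = None, -1
    for sense, ctx in training_instances:
        score = overlap(target, features(ctx))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy sense-tagged instances for the noun "bank" (invented labels).
train = [
    ("bank%financial", ["deposit", "money", "account"]),
    ("bank%river", ["river", "water", "shore"]),
]
print(knn_disambiguate(["loan", "money", "interest"], train))
```

A real system would of course use richer features (part-of-speech tags, collocations) and a trained distance metric rather than raw overlap.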
2.2 Unsupervised Word Sense Disambiguation

Unsupervised word sense disambiguation systems seek to identify the best sense candidate using a model of the word sense dependencies in the text. Such systems use a metric of semantic similarity to compute the relatedness between senses and an algorithm which chooses their most likely combination.

1 Kohomban, U., Lee, W.: Learning semantic classes for word sense disambiguation. In Proc. of ACL, pages 34-41, 2005.
2 Hoste, V., Daelemans, W., Hendrickx, I., van den Bosch, A.: Evaluating the results of the memory-based word-expert approach to unrestricted word sense disambiguation. In Proc. of the ACL Workshop on Word Sense Disambiguation, 2002.
3 Mihalcea, R.: Word sense disambiguation with pattern learning and automatic feature selection. Natural Language Engineering, 1(1):1-15, 2002.
4 Decadt, B., Hoste, V., Daelemans, W., van den Bosch, A.: GAMBL, genetic algorithm optimization for memory-based WSD. In Proc. of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, 2004.
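The two ingredients named above, a sense-similarity metric and a combination algorithm, can be put together in a minimal end-to-end sketch. The gloss-overlap similarity, sense labels and glosses below are invented stand-ins for the WordNet-based metrics, and the ranking step is a weighted PageRank of the kind described in Section 2.6:

```python
from itertools import product

def toy_similarity(gloss1, gloss2):
    """Stand-in for a WordNet metric (here: Lesk-style gloss overlap)."""
    return len(set(gloss1) & set(gloss2))

def rank_senses(senses, d=0.85, iters=30):
    """Build a weighted sense graph and score nodes with weighted
    PageRank, then return the highest-scoring sense per word."""
    nodes = [(w, s) for w, ss in senses.items() for s in ss]
    # Edges connect senses of *different* words, weighted by similarity.
    w_edges = {}
    for (w1, s1), (w2, s2) in product(nodes, nodes):
        if w1 != w2:
            sim = toy_similarity(senses[w1][s1], senses[w2][s2])
            if sim > 0:
                w_edges[((w1, s1), (w2, s2))] = sim
    pr = {n: 1.0 for n in nodes}
    out_weight = {n: sum(w for (a, _), w in w_edges.items() if a == n)
                  for n in nodes}
    for _ in range(iters):
        pr = {n: (1 - d) + d * sum(
                  w / out_weight[a] * pr[a]
                  for (a, b), w in w_edges.items()
                  if b == n and out_weight[a] > 0)
              for n in nodes}
    # For each word, keep the sense with the highest score.
    return {w: max(ss, key=lambda s: pr[(w, s)]) for w, ss in senses.items()}

# Hypothetical sense inventory with toy gloss words.
senses = {
    "plant": {"plant%factory": ["industrial", "building"],
              "plant%flora": ["living", "organism", "green"]},
    "tree": {"tree%plant": ["living", "organism", "green", "tall"]},
}
print(rank_senses(senses))
```

The context of "tree" pulls the ranking towards the flora sense of "plant", which is exactly the mutual-reinforcement effect the graph-based methods below rely on.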
Fig. 1. Semantic model learning in SenseLearner.

Sinha et al.[2] have evaluated six measures of semantic similarity, each taking as input a pair of concepts from the WordNet^5 hierarchy: Leacock & Chodorow^6 (lch), Lesk^7 (lesk), Wu & Palmer^8, Resnik^9, Lin^10, and Jiang & Conrath^11 (jcn). They also use a normalization technique to combine the similarity measures, which accounts for the strength of each individual metric.

5 Fellbaum, C.: WordNet: an electronic lexical database. MIT Press, 1998.
6 Leacock, C., Chodorow, M.: Combining local context and WordNet sense similarity for word sense identification. In WordNet, An Electronic Lexical Database. The MIT Press, 1998.
7 Lesk, M.: Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the SIGDOC Conference, Toronto, June 1986.
8 Wu, Z., Palmer, M.: Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, 1994.
9 Resnik, P.: Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, Canada, 1995.
10 Lin, D.: An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI, 1998.
11 Jiang, J., Conrath, D.: Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan, 1997.

Leacock & Chodorow is a similarity metric computed with equation (1), where length is the length of the shortest path between two concepts using
node-counting, and D is the maximum depth of the taxonomy.

    sim_lch = -log(length / (2 * D))                                  (1)

The metric introduced by Jiang & Conrath uses the least common subsumer (LCS) and combines the information content (IC) of the two input concepts:

    sim_jcn = 1 / (IC(concept1) + IC(concept2) - 2 * IC(LCS))         (2)

The information content is defined as:

    IC(c) = -log(P(c))                                                (3)

Table 1 shows that a combination of the jcn, lch and lesk measures performs better than using them individually.

Table 1. Results for the individual and combined similarity measures

              jcn     lch     lesk    combined
  Precision   51.57   41.47   51.87   53.43
  Recall      19.12   16.02   44.97   53.43
  F-measure   27.89   23.11   48.17   53.43

Tsatsaronis et al.[1] propose a new node similarity algorithm, P-Rank, for their graph representation, which in fact does not perform better than the other unsupervised methods. They justify the lower results based on Navigli and Lapata^12's observations, which also reported lower performance for the betweenness and indegree measures of structural similarity.

12 Navigli, R., Lapata, M.: Graph connectivity measures for unsupervised word sense disambiguation. In Proc. of IJCAI, pages 1683-1688, 2007.

2.3 Graph-based Methods

Graph-based methods model the word sense dependencies in text using a graph representation. Senses are represented as labelled nodes, and weighted edges are added to mark the dependencies among them. Each word has a context window associated with it, covering several words before and after it, and thus each word has a corresponding graph; the word is disambiguated after the ranking algorithm is run on that graph. The node with the highest score is chosen as the most probable sense for that word.

Sinha et al.[2] have noticed a remarkable property that makes these graph-based algorithms appealing: they take into account information
drawn from the entire graph, capturing relationships among all the words in a sequence, which makes them superior to approaches that rely only on local information derived individually for each word.

2.4 Semantic Graph Construction

Graph-based methods usually associate a node with each word to be processed. Senses are represented as labels and their dependencies are indicated as edge weights. The likelihood of each sense can then be determined using a graph-based ranking algorithm, which runs over the graph of potential senses and identifies the "best" one.

Given a sequence of words W = w1, w2, w3, w4 and their corresponding labels Lwi = {l^1_wi, l^2_wi, ..., l^Nwi_wi}, Sinha et al.[2] define a labeled graph G = (V, E) such that there is a node v ∈ V for every possible label l^j_wi, i = 1..n, j = 1..Nwi. Edges e ∈ E map the dependencies between pairs of labels.

Fig. 2. Sample semantic representation used by Sinha et al.[2] for a sequence of four words w1, w2, w3, w4 and their corresponding labels.

Tsatsaronis et al.[1] have used a semantic model which contains only the words that have an entry in the WordNet thesaurus. Their approach first adds all the words and their corresponding senses, represented by WordNet synsets, to the network (Initial Phase). The expansion phase then extends the network iteratively for each word with all the semantically related senses from WordNet (Expansion Round 1) until the network is connected. Failing to construct a connected network implies that the words in the sentence cannot be disambiguated. In the next step, weights computed from the frequency of each edge type are added (Expansion Round 2). At some point in the construction phase, some
nodes could share the same sense (Expansion Example 2); in this particular case only one labelled node is added to the network.

Fig. 3. Sample semantic representation used by Tsatsaronis et al.[1] for words ti and tj and their corresponding senses.

Other approaches in the literature have used the gloss words of the WordNet entries^13, have defined additional composite semantic relations^14, or have used the Extended WordNet to enhance their model^15.

13 Veronis, J., Ide, N.: Word sense disambiguation with very large neural networks extracted from machine readable dictionaries. In Proc. of COLING, pages 389-394, 1990.
14 Mihalcea, R., Tarau, P., Figa, E.: PageRank on semantic networks with application to word sense disambiguation. In Proc. of COLING, 2004.
15 Agirre, E., Soroa, A.: Personalizing PageRank for word sense disambiguation. In Proc. of EACL, pages 33-41, 2009.

2.5 Spreading of Activation (SAN) Method

The spreading of activation in semantic networks proposed by Tsatsaronis et al.[4] considers all nodes to have an activation level of 0, except for the input nodes
which have a value of 1. At each iteration p, a node j propagates its output activation Oj(p) to its neighbours as a function f of its current activation level Aj(p):

    Oj(p) = f(Aj(p))                                                  (4)

The activation level of a node k at iteration p is influenced by the output at iteration p - 1 of every neighbour j with a direct edge e_jk, where W_jk is the weight of that edge:

    Ak(p) = Σ_j Oj(p - 1) · W_jk                                      (5)

The output activation function must be chosen carefully, since otherwise the network can be flooded. Tsatsaronis et al.[1] use the function in equation (6), with a threshold value τ that prevents nodes with a low activation level from influencing their neighbours. The factor 1/(p + 1) reduces the influence of a node on its neighbours as iterations go by, while the function Fj reduces the influence of nodes that connect to many neighbours. This algorithm requires no training.

    Oj(p) = 0                          if Aj(p) < τ
    Oj(p) = (Fj / (p + 1)) · Aj(p)     otherwise                      (6)

CT is the total number of nodes, while Cj is the number of nodes with a direct edge from j:

    Fj = (1 - Cj / CT)                                                (7)

2.6 Page-Rank Method

Page-Rank is a graph ranking algorithm based on the idea of "voting" or "recommendation". When one node links to another, it basically offers a recommendation for that other node. The higher the number of recommendations offered for a node, the higher the node's importance. Furthermore, the importance of the node offering the recommendation determines how important the vote itself is, and this information is also taken into account by the ranking algorithm.

    PageRank(Va) = (1 - d) + d · Σ_{(Va,Vb)∈E} PageRank(Vb) / degree(Vb)    (8)

Sinha et al.[2] have used the Page-Rank algorithm to recursively score the candidate nodes of a weighted undirected graph. Va and Vb are two nodes in the
graph, connected by edges with weight wba, and the Page-Rank score is computed based on the following equation:

    PageRank(Va) = (1 - d) + d · Σ_{(Va,Vb)∈E} (wba / Σ_{(Vc,Vb)∈E} wbc) · PageRank(Vb)    (9)

2.7 HITS Method

Tsatsaronis et al.[1] use the same semantic representation for the HITS ranking algorithm. This approach identifies the most important nodes in the graph, known as authorities, and the nodes that point to such nodes, known as hubs. The major disadvantage of the HITS algorithm is that densely connected nodes can attract the highest scores (the clique attack). Every node has a pair of values attached for its authority and hub scores, with initial values set to 1. Hubs and authorities are iteratively updated using equations (10) and (11):

    authority(p) = Σ_{q∈In(p)} hub(q)                                 (10)

    hub(p) = Σ_{r∈Out(p)} authority(r)                                (11)

In(p) contains all the nodes that link to p, and Out(p) all the nodes p links to. Equations (10) and (11) are extended with weights for the graph edges. In equations (12) and (13), w_{i,j} is the weight of the edge connecting node i with node j:

    authority(p) = Σ_{q∈In(p)} w_{q,p} · hub(q)                       (12)

    hub(p) = Σ_{r∈Out(p)} w_{p,r} · authority(r)                      (13)

The scores are normalized by dividing each authority value by the sum of all authority values, and each hub value by the sum of all hub values. The sense with the highest authority score is chosen as the most likely one for each word.

16 Zhao, P., Han, J., Sun, Y.: P-Rank: a comprehensive structural similarity measure over information networks. In Proc. of CIKM, pages 553-562, 2009.

2.8 P-Rank Method

The P-Rank measure^16 is a recently introduced method for the structural similarity of nodes in an information network and represents a generalization of
other state-of-the-art measures such as CoCitation^17, Coupling^18, Amsler^19 and SimRank^20. P-Rank is based on the idea that two nodes are similar if they are referenced by, and also reference, similar nodes. R_{k+1}(a, b) is the P-Rank score for nodes a and b at iteration k + 1 and is computed with the recursive equation:

    R_{k+1}(a, b) = λ · (C / (|I(a)| · |I(b)|)) · Σ_{i=1}^{|I(a)|} Σ_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b))
                  + (1 - λ) · (C / (|O(a)| · |O(b)|)) · Σ_{i=1}^{|O(a)|} Σ_{j=1}^{|O(b)|} R_k(O_i(a), O_j(b))

In equations (14) and (15), Incoming(a) and Outgoing(a) are the lists of incoming and outgoing neighbours of node a, and the definitions of |I(a)| and |O(a)| take into consideration the weights of all the edges that connect the neighbours of node a. The parameter λ ∈ [0, 1] balances the weight of the in- and out-link directions; the value Tsatsaronis et al.[1] have chosen for their experiments is 0.5. C ∈ [0, 1] is a damping factor for the in- and out-link directions, with a usual value of 0.8.

    |I(a)| = Σ_{i∈Incoming(a)} w_{i,a}                                (14)

    |O(a)| = Σ_{j∈Outgoing(a)} w_{a,j}                                (15)

3 Experiments and Results

The Senseval 2 and 3 All English Words Task data sets are often used for testing WSD systems, since they are manually annotated by human experts. Tables 2 and 3 show the statistics of the data sets for nouns (N), verbs (V), adjectives (Adj), adverbs (Adv) and all words, computed considering their senses from the WordNet 2 thesaurus. Verbs are the most difficult to disambiguate, with an average polysemy close to 11, while adverbs have an average polysemy close to 1.

17 Small, H. G.: Co-citation in the scientific literature: A new measure of relationship between two documents. Journal of the American Society for Information Science, 24(4):265-269, 1973.
18 Kessler, M. M.: Bibliographic coupling between scientific papers. American Documentation, 14(1):10-25, 1963.
19 Amsler, R.: Application of citation-based automatic classification. Technical report, The University of Texas at Austin, Linguistics Research Center, Austin, TX, 1972.
20 Jeh, G., Widom, J.: SimRank: A measure of structural-context similarity. In Proc. of KDD, pages 538-543, 2002.
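The P-Rank recurrence above can be sketched on a tiny unweighted toy graph, where |I(a)| and |O(a)| reduce to neighbour counts; the graph itself is invented, while λ = 0.5 and C = 0.8 follow the values quoted above:

```python
def p_rank(nodes, in_nb, out_nb, lam=0.5, C=0.8, iters=10):
    """Iterate the P-Rank recurrence: two nodes are similar if their
    in-neighbours and their out-neighbours are similar."""
    # Base case: R0(a, b) = 1 if a == b, else 0.
    R = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        nxt = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    nxt[(a, b)] = 1.0
                    continue
                s_in = s_out = 0.0
                if in_nb[a] and in_nb[b]:
                    s_in = (C / (len(in_nb[a]) * len(in_nb[b]))) * sum(
                        R[(i, j)] for i in in_nb[a] for j in in_nb[b])
                if out_nb[a] and out_nb[b]:
                    s_out = (C / (len(out_nb[a]) * len(out_nb[b]))) * sum(
                        R[(i, j)] for i in out_nb[a] for j in out_nb[b])
                nxt[(a, b)] = lam * s_in + (1 - lam) * s_out
        R = nxt
    return R

# Toy graph: a and b are both pointed to by c and both point to d,
# so P-Rank should judge them highly similar.
nodes = ["a", "b", "c", "d"]
in_nb = {"a": ["c"], "b": ["c"], "c": [], "d": ["a", "b"]}
out_nb = {"a": ["d"], "b": ["d"], "c": ["a", "b"], "d": []}
R = p_rank(nodes, in_nb, out_nb)
print(round(R[("a", "b")], 3))  # → 0.8
```

With λ = 0.5 the in- and out-link evidence is weighted equally, and the damping factor C keeps the similarity of distinct nodes strictly below 1.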
Table 2. Polysemous and monosemous occurrences for the Senseval 2 words using WordNet 2

                                       N      V      Adj    Adv    All
  Monosemous                           260    33     80     91     464
  Polysemous                           813    502    352    172    1839
  Average polysemy                     4.21   9.9    3.94   3.23   5.37
  Average polysemy (polysemous only)   5.24   10.48  4.61   4.41   6.48

Table 3. Polysemous and monosemous occurrences for the Senseval 3 words using WordNet 2

                                       N      V      Adj    Adv    All
  Monosemous                           193    39     72     13     317
  Polysemous                           699    686    276    1      1662
  Average polysemy                     5.07   11.49  4.13   1.07   7.23
  Average polysemy (polysemous only)   6.19   12.08  4.95   2.0    8.41

A baseline was computed by selecting a random sense from WordNet. Other, supervised systems have used the most frequent sense in the thesaurus as baseline.

Table 4 presents a comparison between different WSD results, independently of the type of method used. The top three unsupervised methods, PR, HITS and the method of Agirre and Soroa, are compared with the highest results reported in the literature for the Senseval 2 and 3 data sets. The best performing method is the supervised approach Simil-Prime, with an overall accuracy of about 65%. The results show that, though the unsupervised systems do not perform as well as the supervised ones, they have indeed reduced the gap between the two approaches.

Table 4. Accuracies on the Senseval 2 and 3 All English Words Task data sets

  Dataset     SenseLearner  Simil-Prime  SSI   WE    FS    PR    HITS  Agi09
  Senseval 2  64.82         65.00        n/a   63.2  63.7  58.8  58.3  59.5
  Senseval 3  63.01         65.85        60.4  n/a   61.3  56.7  57.4  57.4

4 Conclusions

Recent state-of-the-art WSD systems minimise the gap between supervised and unsupervised approaches. This paper described several graph-based methods which make the most of the rich semantic model they employ. Unsupervised systems also have the advantage of tuning their parameters with as little data as possible while testing on as large a dataset as possible.
Future work could investigate the results of the recently introduced P-Rank algorithm on a different model, such as the one proposed by Sinha et al.[2]. This way we could investigate the influence of the model on the results of each algorithm.

References

1. Tsatsaronis, G., Varlamis, I., Norvag, K.: An Experimental Study on Unsupervised Graph-based Word Sense Disambiguation. In Proc. of CICLing, 2010.
2. Sinha, R., Mihalcea, R.: Unsupervised graph-based word sense disambiguation using measures of semantic similarity. In Proc. of ICSC, 2007.
3. Mihalcea, R., Csomai, A.: SenseLearner: Word sense disambiguation for all words in unrestricted text. In Proc. of ACL, pages 53-56, 2005.
4. Tsatsaronis, G., Vazirgiannis, M., Androutsopoulos, I.: Word Sense Disambiguation with Spreading Activation Networks Generated from Thesauri. In Proc. of IJCAI, 2007.