# COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model

1. An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model. Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro (annalina.caputo@uniba.it), Department of Computer Science, SWAP Research Group, University of Bari Aldo Moro (Italy). Coling 2014, Dublin, 27th-29th August 2014. Slide footer: A. Caputo (annalina.caputo@uniba.it), Lesk-DSM, Coling 2014, 28 Aug. 2014.
2. Motivations: Problem. One word... many meanings. BANK:
    1. Sloping land (especially the slope beside a body of water)
    2. A financial institution that accepts deposits and channels the money into lending activities
    3. A long ridge or pile
    4. ...
6. Motivations: Lesk WSD, the simple Lesk approach. Insight: select the meaning whose gloss maximizes the overlap with the context. Example: "The bank keeps my money".
    1. Sloping land (especially the slope beside a body of water) ⇒ overlap = 0
    2. A financial institution that accepts deposits and channels the money into lending activities ⇒ overlap = 1
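The overlap counts on this slide can be reproduced with a minimal sketch of the simple Lesk idea. The stopword list and the function names `tokenize` and `simple_lesk` are assumptions made for illustration; the slides do not say how tokens are filtered.

```python
import re

# Illustrative stopword list (an assumption; the slides do not specify one).
STOPWORDS = {"a", "an", "the", "my", "of", "that", "and", "into", "at", "he"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def simple_lesk(context, glosses):
    """Return (index of best gloss, overlap per gloss) using exact string matching."""
    ctx = tokenize(context)
    overlaps = [len(ctx & tokenize(gloss)) for gloss in glosses]
    best = max(range(len(glosses)), key=overlaps.__getitem__)
    return best, overlaps


glosses = [
    "Sloping land (especially the slope beside a body of water)",
    "A financial institution that accepts deposits and channels the money "
    "into lending activities",
]
best, overlaps = simple_lesk("The bank keeps my money", glosses)
print(best, overlaps)  # 1 [0, 1]
```

Only "money" is shared between the sentence and the second gloss, so the financial sense wins with an overlap of 1.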
12. Motivations: Lesk WSD, the simple Lesk approach. Issues:
    1. The sense definition is short ⇒ reduced chances of matching
    2. Overlap is based on string matching ⇒ semantically related words are treated as different
    3. No knowledge about sense usage

    Lesk mismatch. Sentence to disambiguate: "he cashed a check at the bank". Right sense definition: "A financial institution that accepts deposits and channels the money into lending activities" ⇒ overlap = 0
16. Solution: Distributional Lesk. Idea:
    1. The sense definition is short ⇒ gloss expansion through related meanings
    2. Overlap is based on string matching ⇒ similarity computed in a WordSpace
    3. No knowledge about sense usage ⇒ exploit a sense-annotated corpus
18. Solution: Distributional Lesk. Solution 1: the sense definition is short ⇒ gloss expansion through related meanings. Gloss expansion, sentence to disambiguate: "he cashed a check at the bank". Expanded gloss: "A financial institution that accepts deposits and channels the money into lending activities" + "A financial institution that accepts demand deposits and makes loans and provides other services for the public... One of 12 regional banks that monitor and act as depositories for banks in their region... A corporation gaining financial control over another corporation or financial institution through a payment in cash or an exchange of stock..." ⇒ overlap = 1
24. Solution: Distributional Lesk. Solution 2: overlap is based on string matching ⇒ similarity computed in a WordSpace. Gloss expansion, sentence to disambiguate: "that bank holds the mortgage on my home". Expanded gloss: "A financial institution that accepts deposits and channels the money into lending activities" + "A financial institution that accepts demand deposits and makes loans and provides other services for the public... One of 12 regional banks that monitor and act as depositories for banks in their region... A corporation gaining financial control over another corporation or financial institution through a payment in cash or an exchange of stock..." ⇒ overlap = 0
29. Solution: Distributional Lesk. Solution 2: overlap is based on string matching ⇒ similarity computed in a WordSpace. Same sentence ("that bank holds the mortgage on my home") and the same expanded gloss of the financial sense. [Figure: WordSpace containing the words sloping, side, ground, mortgage, loans, lending, deposit, money, financial, cash, payment]
35. Solution: Gloss expansion. Leavening on a semantic network:
    - Recursively concatenate the glosses of related synsets until a depth d is reached
    - Exclude the antonym relation
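A minimal sketch of the expansion step follows, assuming a toy semantic network stored as a dictionary. The synset identifiers, relation names, and the `expand_gloss` helper are all hypothetical; the real system takes synsets and relations from BabelNet. The sketch walks the network breadth-first rather than recursively, which visits the same synsets and makes the distance of each gloss explicit.

```python
# Toy network (hypothetical ids and relations): id -> (gloss, [(relation, neighbour id)]).
NETWORK = {
    "bank#2": (
        "a financial institution that accepts deposits and channels the money "
        "into lending activities",
        [("hyponym", "bank#2a"), ("hyponym", "bank#2b"), ("antonym", "opposite#1")],
    ),
    "bank#2a": (
        "a financial institution that accepts demand deposits and makes loans",
        [("hyponym", "deep#1")],
    ),
    "bank#2b": (
        "one of 12 regional banks that monitor and act as depositories for banks",
        [],
    ),
    "opposite#1": ("placeholder gloss reachable only through an antonym link", []),
    "deep#1": ("placeholder gloss two edges away from the target", []),
}


def expand_gloss(synset, network, max_depth):
    """Collect (gloss, distance) pairs up to max_depth edges away, skipping antonyms."""
    expanded, seen, frontier = [], {synset}, [synset]
    for depth in range(max_depth + 1):
        next_frontier = []
        for node in frontier:
            gloss, relations = network[node]
            expanded.append((gloss, depth))
            for relation, neighbour in relations:
                if relation != "antonym" and neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return expanded


for gloss, distance in expand_gloss("bank#2", NETWORK, max_depth=1):
    print(distance, gloss)
```

Keeping the distance next to each gloss is a design choice of this sketch: it is the quantity a distance-based term weight would need later.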
37. Solution: Gloss expansion, term weighting. Idea: term relevance depends on both its frequency and the distance d of the related synset. Solutions:
    - Inverse gloss frequency (IGF): words occurring in all the extended glosses associated with the target word poorly characterize the meaning description
    - Distance weight: inversely proportional to the distance in the network (number of edges) between the target synset and the related synset

    Example, Bank: (1) Sloping land (especially the slope beside a body of water); (2) A financial institution that accepts deposits and channels the money into lending activities; ...; (8) A container (usually with a slot in the top) for keeping money at home
39. Solution: Gloss expansion, term weight. Inverse gloss frequency:

    $$IGF_k = 1 + \log_2 \frac{|S_i|}{gf_k} \qquad (1)$$

    where $gf_k$ is the number of extended glosses that contain the word $w_k$. The weight for a word $w_k$ appearing $h$ times in the extended gloss $g_{ij}$ is given by

    $$\mathrm{weight}(w_k, g_{ij}) = \sum_{h} \frac{1}{1 + d} \, IGF_k \qquad (2)$$
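Equations (1) and (2) can be checked on a small example. This is a sketch under assumptions: each extended gloss is represented as a list of `(word, distance)` occurrences, `|S_i|` is taken to be the number of extended glosses of the target word, and the function names are invented for illustration.

```python
import math
from collections import defaultdict


def igf(num_senses, gloss_freq):
    """Inverse gloss frequency, equation (1)."""
    return 1 + math.log2(num_senses / gloss_freq)


def term_weights(extended_glosses):
    """Return one {word: weight} dict per sense, following equation (2).

    extended_glosses holds, for each sense, a list of (word, distance)
    occurrences, where distance is the number of edges between the target
    synset and the synset whose gloss contributed the word.
    """
    num_senses = len(extended_glosses)
    gloss_freq = defaultdict(int)
    for occurrences in extended_glosses:
        for word in {w for w, _ in occurrences}:
            gloss_freq[word] += 1
    weights = []
    for occurrences in extended_glosses:
        sense_weights = defaultdict(float)
        for word, distance in occurrences:
            sense_weights[word] += igf(num_senses, gloss_freq[word]) / (1 + distance)
        weights.append(dict(sense_weights))
    return weights


# Made-up occurrences for two senses of one target word.
toy = [
    [("money", 0), ("deposit", 0), ("money", 1)],
    [("money", 0), ("slot", 0)],
]
print(term_weights(toy))
```

In this toy input "money" appears in both extended glosses, so its IGF is 1, while "deposit" and "slot" appear in one gloss each and get an IGF of 2. The second occurrence of "money" in the first sense comes from a synset one edge away and therefore contributes only half as much.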
40. Solution: WordSpace. Distributional Semantic Models (DSMs): "You shall know a word by the company it keeps!"
    - Words are represented as points in a geometric space
    - Words are related if they are close in that space
41. Solution: WordSpace, overlap in a DSM.
    - Gloss as a vector: weighted vector sum of the terms occurring in the expanded gloss
    - Context as a vector: vector sum of the words surrounding the target
    - Overlap: the cosine similarity between the gloss vector and the context vector

    Example: context "bank hold mortgage home"; glosses "financial institution accept deposit channel money lend activity..." and "sloping land especially slope beside body water..."
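The vector-based overlap can be sketched in a few lines of plain Python. The three-dimensional word vectors and the term weights below are made up purely for illustration; in the system described here the WordSpace is built with Latent Semantic Analysis on a large corpus.

```python
import math

# Made-up 3-dimensional word vectors (an assumption for illustration only).
VECTORS = {
    "mortgage": [0.9, 0.1, 0.0],
    "home": [0.6, 0.2, 0.2],
    "hold": [0.3, 0.3, 0.3],
    "money": [0.8, 0.2, 0.1],
    "lend": [0.9, 0.0, 0.1],
    "deposit": [0.7, 0.1, 0.1],
    "slope": [0.0, 0.9, 0.3],
    "land": [0.1, 0.8, 0.4],
    "water": [0.0, 0.6, 0.7],
}
DIMS = 3


def weighted_sum(weights):
    """Sum the word vectors scaled by their weights; unknown words are skipped."""
    total = [0.0] * DIMS
    for word, weight in weights.items():
        for i, value in enumerate(VECTORS.get(word, [0.0] * DIMS)):
            total[i] += weight * value
    return total


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Context vector: plain sum of the surrounding words.
context = weighted_sum({"hold": 1.0, "mortgage": 1.0, "home": 1.0})
# Gloss vectors: weighted sums (the weights here are invented).
financial = weighted_sum({"money": 1.0, "lend": 2.0, "deposit": 2.0})
river = weighted_sum({"slope": 2.0, "land": 2.0, "water": 1.0})

print(cosine(context, financial) > cosine(context, river))  # True
```

The context shares no word with either gloss, so string matching would give an overlap of 0 for both senses, yet the cosine still separates them because related words lie close together in the space.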
43. Solution: Sense distribution. Insight: analyze the distribution of meanings for each word. Solution:

    $$p(s_{ij} \mid w_i) = \frac{t(w_i, s_{ij}) + 1}{\#w_i + |S_i|} \qquad (3)$$

    where $t(w_i, s_{ij})$ is the number of times the word $w_i$ is tagged with $s_{ij}$ and $\#w_i$ is the number of occurrences of $w_i$.
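Equation (3) is an add-one smoothed relative frequency, which a short sketch makes concrete. The sense labels and counts below are invented, and the sketch assumes every occurrence of the word in the corpus carries a sense tag, so that the number of tags equals the number of occurrences.

```python
from collections import Counter


def sense_distribution(tagged_senses, sense_inventory):
    """Smoothed p(sense | word) following equation (3).

    tagged_senses: the sense label of each occurrence of one word in a
    sense-annotated corpus; sense_inventory: all senses of that word.
    """
    counts = Counter(tagged_senses)
    denominator = len(tagged_senses) + len(sense_inventory)
    return {s: (counts[s] + 1) / denominator for s in sense_inventory}


# Invented counts: 9 occurrences of "bank", 3 senses in the inventory.
tags = ["bank#2"] * 7 + ["bank#1"] * 2
probs = sense_distribution(tags, ["bank#1", "bank#2", "bank#3"])
print(probs)
```

The unseen sense `bank#3` still receives a probability of 1/12 rather than zero, and the three probabilities sum to 1.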
49. Solution: Methodology. Shaking the ingredients:
    1. For each word, retrieve the list of meanings
    2. Expand the glosses and build the corresponding vector for each expanded gloss
    3. Create the context vector from the surrounding words
    4. Compute the overlap in the DSM
    5. Combine the overlap with the sense distribution
    6. Select the meaning whose extended gloss has the maximum overlap
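Steps 5 and 6 can be sketched as a weighted sum followed by an argmax. The slides state only that the sense distribution is linearly combined with the cosine score and that the combination factor is 0.5; the exact form `factor * cosine + (1 - factor) * prior` used below is an assumption, as are the function name and the toy scores.

```python
def select_sense(cosines, priors, factor=0.5):
    """Return (best sense, scores) for an assumed linear combination.

    score(sense) = factor * cosine(sense) + (1 - factor) * prior(sense)
    """
    scores = {s: factor * cosines[s] + (1 - factor) * priors[s] for s in cosines}
    return max(scores, key=scores.get), scores


# Invented scores: the cosine slightly prefers the wrong sense,
# while the sense distribution favours the frequent one.
cosines = {"bank#1": 0.43, "bank#2": 0.40}
priors = {"bank#1": 0.2, "bank#2": 0.8}

best, scores = select_sense(cosines, priors)
print(best)  # bank#2
best_without_prior, _ = select_sense(cosines, priors, factor=1.0)
print(best_without_prior)  # bank#1
```

Setting `factor=1.0` corresponds to running without sense distribution, which is the comparison reported in the evaluation tables.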
50. Evaluation: Goals. Compare our system with:
    1. The simplified Lesk approach
    2. Other task participants

    Evaluate the system with and without sense distribution; the sense distribution is linearly combined with the cosine similarity score. Dataset: Task 12 of SemEval-2013, Multilingual Word Sense Disambiguation. Sense inventory: BabelNet. Metric: F-measure.
52. Evaluation: System setup.
    - Developed in Java, relying on the BabelNet API 1.1.1
    - Lucene analyzer to tokenize both the glosses and the context; Snowball library for stemming
    - Latent Semantic Analysis to build the DSM, considering the 100,000 most frequent words: BNC corpus for English, Wikipedia dump for Italian
    - Synset distance d is set to 1
    - Several context sizes: 3, 5, 10, 20 and the whole text
    - Combination factor for cosine similarity and sense distribution: 0.5
53. Evaluation: Results, English (context size W = whole text).

    | Run | Context size | Sense distr. | F |
    |---|---|---|---|
    | MFS | - | - | 0.656 |
    | EN.LESK.1 | 3 | N | 0.525 |
    | EN.LESK.6 | 3 | Y | 0.633 |
    | EN.DSM.1 | 3 | N | 0.536 |
    | EN.DSM.2 | 5 | N | 0.605 |
    | EN.DSM.3 | 10 | N | 0.633 |
    | EN.DSM.4 | 20 | N | 0.650 |
    | EN.DSM.5 | W | N | 0.687 |
    | EN.DSM.6 | 3 | Y | 0.669 |
    | EN.DSM.7 | 5 | Y | 0.677 |
    | EN.DSM.8 | 10 | Y | 0.689 |
    | EN.DSM.9 | 20 | Y | 0.696 |
    | EN.DSM.10 | W | Y | 0.715 |
54. Evaluation: Results, Italian (context size W = whole text).

    | Run | Context size | Sense distr. | F |
    |---|---|---|---|
    | MFS | - | - | 0.572 |
    | IT.LESK.2 | 5 | N | 0.530 |
    | IT.LESK.10 | W | Y | 0.607 |
    | IT.DSM.1 | 3 | N | 0.610 |
    | IT.DSM.2 | 5 | N | 0.607 |
    | IT.DSM.3 | 10 | N | 0.626 |
    | IT.DSM.4 | 20 | N | 0.628 |
    | IT.DSM.5 | W | N | 0.633 |
    | IT.DSM.6 | 3 | Y | 0.631 |
    | IT.DSM.7 | 5 | Y | 0.630 |
    | IT.DSM.8 | 10 | Y | 0.635 |
    | IT.DSM.9 | 20 | Y | 0.639 |
    | IT.DSM.10 | W | Y | 0.641 |
55. Evaluation: Task results, English.

    | System | F |
    |---|---|
    | EN.DSM.10 | 0.715 |
    | EN.DSM.5 | 0.687 |
    | UMCC-DLSI-2 | 0.685 |
    | UMCC-DLSI-3 | 0.680 |
    | UMCC-DLSI-1 | 0.677 |
    | MFS | 0.656 |
    | DAEBAK | 0.604 |
    | GETALP-BN-1 | 0.263 |
    | GETALP-BN-2 | 0.266 |
56. Evaluation: Task results, Italian.

    | System | F |
    |---|---|
    | UMCC-DLSI-2 | 0.658 |
    | UMCC-DLSI-1 | 0.657 |
    | IT.DSM.10 | 0.641 |
    | IT.DSM.5 | 0.633 |
    | DAEBAK | 0.613 |
    | MFS | 0.572 |
    | GETALP-BN-2 | 0.325 |
    | GETALP-BN-1 | 0.324 |
57. Conclusions and Future Work: Conclusions. Recap:
    - The proposed algorithm outperforms the simple Lesk one for both English and Italian
    - The system without knowledge about sense distribution always outperforms the MFS baseline
    - For English, the system obtained the best results in SemEval-2013 Task 12 with or without sense distribution
58. Conclusions and Future Work: Future work. What's next?
    - Extend the evaluation to other languages
    - Evaluate different DSMs and compositional approaches
    - Adapt our approach to a specific domain: use a domain corpus for DSM building and exploit a domain sense-annotated corpus for sense distribution
60. That's all folks! The system is available online: https://github.com/pippokill/lesk-wsd-dsm