COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model
An Enhanced Lesk Word Sense Disambiguation
Algorithm through a Distributional Semantic Model
Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro
annalina.caputo@uniba.it
Department of Computer Science - SWAP Research Group
University of Bari Aldo Moro (ITALY)
Coling 2014, Dublin, 27th-29th August 2014
A. Caputo (annalina.caputo@uniba.it) Lesk-DSM Coling 2014 - 28 Aug. 2014 1 / 21
Motivations Problem
One word... many meanings
BANK
1 Sloping land (especially the slope beside a body of water)
2 A financial institution that accepts deposits and channels the money into lending activities
3 A long ridge or pile
4 ...
Motivations Lesk WSD
Simple Lesk approach
Insight
Select the meaning whose gloss maximizes the context overlap
Example
The bank keeps my money
1 Sloping land (especially the slope beside a body of water) ⇒ overlap=0
2 A financial institution that accepts deposits and channels the money into lending activities ⇒ overlap=1
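The overlap count in this example can be reproduced with a minimal Python sketch. The stop-word list, the tokenizer and the sense labels `bank#1` and `bank#2` are illustrative assumptions, not part of the original system; without some stop-word filtering, function words such as "the" would also be counted as matches.

```python
import re

# Tiny illustrative stop list (an assumption; the slides do not specify one).
STOPWORDS = {"a", "the", "my", "of", "that", "and", "into", "especially", "beside"}


def tokens(text):
    """Lowercase, split on non-letters and drop stop words."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def simple_lesk(context, glosses):
    """Return (best_sense, overlaps); overlap = number of words shared with the context."""
    ctx = tokens(context)
    overlaps = {sense: len(ctx & tokens(gloss)) for sense, gloss in glosses.items()}
    return max(overlaps, key=overlaps.get), overlaps


# Hypothetical sense labels paired with the two glosses shown on the slide.
GLOSSES = {
    "bank#1": "Sloping land (especially the slope beside a body of water)",
    "bank#2": "A financial institution that accepts deposits and channels "
              "the money into lending activities",
}

best, overlaps = simple_lesk("The bank keeps my money", GLOSSES)
print(best, overlaps)
```

Only "money" is shared between the sentence and the second gloss, so the second sense is selected.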
Limitations
1 Definition is short ⇒ Reduced chances of matching
2 Overlap based on string matching ⇒ Semantically related words are treated as different
3 No knowledge about senses usage
Lesk mismatch
Sentence to disambiguate
he cashed a check at the bank
Right sense definition
A financial institution that accepts deposits and channels the money into lending activities
overlap=0
Solution Distributional Lesk
Idea
Solutions
1 Definition is short ⇒ Gloss expansion through related meanings
2 Overlap is based on string matching ⇒ Similarity computed in a WordSpace
3 No knowledge about senses usage ⇒ Exploiting sense annotated corpus
Gloss Expansion
A financial institution that accepts demand deposits and makes loans and provides other services for the public... One of 12 regional banks that monitor and act as depositories for banks in their region... A corporation gaining [...] financial institution through a payment in cash or an exchange of stock...
overlap=1
Solution Distributional Lesk
Idea
Solutions
2 Overlap is based on string matching ⇒ Similarity computed in a WordSpace
Gloss Expansion
Sentence to disambiguate
that bank holds the mortgage on my home
A financial institution that accepts demand deposits and makes loans and provides other services for the public... One of 12 regional banks that monitor and act as depositories for banks in their region... A corporation gaining [...] financial institution through a payment in cash or an exchange of stock...
overlap=0
[Figure: words plotted as points in the WordSpace: sloping, side, ground, mortgage, loans, lending, deposit, money, financial, cash, payment]
Solution Gloss expansion
Gloss expansion
Leavening on a semantic network
Recursively concatenate the glosses of related synsets until a depth d is reached
Exclude the antonym relation
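A minimal Python sketch of this expansion follows. The toy network, its synset identifiers and its glosses are invented for illustration, and a breadth-first traversal stands in for the recursive concatenation; the actual system works on BabelNet. Each gloss is returned with its distance from the target synset, since that distance is needed later for term weighting.

```python
from collections import deque

# Hypothetical toy network: synset -> (gloss, [(relation, related synset), ...]).
NETWORK = {
    "bank#2": ("a financial institution that accepts deposits",
               [("hyponym", "acquirer#1"), ("hypernym", "institution#1")]),
    "acquirer#1": ("a corporation that buys another company", []),
    "institution#1": ("an organization founded for a purpose",
                      [("hypernym", "organization#1"), ("antonym", "individual#1")]),
    "organization#1": ("a group of people who work together", []),
    "individual#1": ("a single human being", []),
}


def expand_gloss(synset, max_depth, network=NETWORK):
    """Collect (gloss, distance) pairs for synsets within max_depth edges of synset."""
    seen = {synset}
    queue = deque([(synset, 0)])
    collected = []
    while queue:
        current, depth = queue.popleft()
        gloss, relations = network[current]
        collected.append((gloss, depth))
        if depth == max_depth:
            continue  # do not follow relations beyond the depth limit
        for relation, neighbour in relations:
            if relation == "antonym" or neighbour in seen:
                continue  # antonyms are excluded; visited synsets are skipped
            seen.add(neighbour)
            queue.append((neighbour, depth + 1))
    return collected


depth_one = expand_gloss("bank#2", 1)
depth_two = expand_gloss("bank#2", 2)
print(" ".join(gloss for gloss, _ in depth_one))
```

With `max_depth=1` only the directly related synsets contribute; raising it to 2 also pulls in the gloss of `organization#1`, while the antonym `individual#1` is never reached.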
Solution Gloss expansion
Term weighting
Idea
Term relevance depends on both its frequency and the distance d of the related synset
Solutions
Inverse gloss frequency (IGF): words occurring in all the extended glosses associated with the target word poorly characterize the meaning description
Distance weight: inversely proportional to the distance in the network (number of edges) between the target synset and the related synset
Bank
1 Sloping land (especially the slope beside a body of water)
2 A financial institution that accepts deposits and channels the money into lending activities
...
8 A container (usually with a slot in the top) for keeping money at home
Solution Gloss expansion
Term weight
Inverse gloss frequency
$$IGF_k = 1 + \log_2 \frac{|S_i|}{gf_k} \qquad (1)$$
$gf_k$ is the number of extended glosses that contain a word $w_k$
Term weight
Weight for word $w_k$ appearing $h$ times in the extended gloss $g_{ij}$ is given by
$$weight(w_k, g_{ij}) = \sum_{h} \frac{1}{1 + d} \, IGF_k \qquad (2)$$
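Equations (1) and (2) can be sketched in Python as below. The data layout (each extended gloss as a list of `(word, d)` pairs) and the toy glosses are assumptions made for illustration. The sum in Eq. (2) is read here as running over the h occurrences of the word, each with the distance d of the synset it came from.

```python
import math
from collections import defaultdict


def igf(word, extended_glosses):
    """Eq. (1): IGF_k = 1 + log2(|S_i| / gf_k)."""
    gf = sum(1 for gloss in extended_glosses if any(w == word for w, _ in gloss))
    return 1 + math.log2(len(extended_glosses) / gf)


def term_weights(gloss, extended_glosses):
    """Eq. (2): add IGF_k / (1 + d) for every occurrence of each word in the gloss."""
    weights = defaultdict(float)
    for word, distance in gloss:
        weights[word] += igf(word, extended_glosses) / (1 + distance)
    return dict(weights)


# Hypothetical extended glosses for four senses of one target word.
# Each entry is (word, distance d of the synset whose gloss contained it).
EXTENDED = [
    [("financial", 0), ("money", 0), ("money", 1), ("loan", 1)],
    [("slope", 0), ("water", 0), ("land", 1)],
    [("container", 0), ("money", 0)],
    [("ridge", 0), ("pile", 0)],
]

weights = term_weights(EXTENDED[0], EXTENDED)
print(weights)
```

In this toy data "money" occurs in two of the four extended glosses, so its IGF is lower than that of "financial", which occurs in only one; the occurrence at distance 1 contributes half as much as the one at distance 0.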
Solution WordSpace
Distributional Semantic Models (DSMs)
You shall know a word by the company it keeps!
Words are represented as points in a geometric space
Words are related if they are close in that space
Solution WordSpace
Overlap in DSM
Gloss as a vector: weighted vector sum of the terms occurring in the expanded gloss
Context as a vector: vector sum of the words surrounding the target
Compute the overlap as the cosine similarity between the gloss vector and the context vector
Context: bank hold mortgage home
Gloss: financial institution accept deposit channel money lend activity...
Gloss: sloping land especially slope beside body water...
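The following Python sketch illustrates the overlap computation with hand-made three-dimensional vectors. The vector values and the uniform term weights are assumptions for illustration only; in the described system the word vectors come from a DSM built on a corpus, and the gloss terms carry their own weights.

```python
import math

# Hypothetical 3-dimensional word vectors; a real WordSpace is learned from a corpus.
VECTORS = {
    "hold": [0.2, 0.1, 0.0], "mortgage": [0.9, 0.1, 0.0], "home": [0.4, 0.2, 0.1],
    "financial": [0.8, 0.0, 0.1], "deposit": [0.7, 0.1, 0.0], "money": [0.9, 0.0, 0.0],
    "sloping": [0.0, 0.9, 0.1], "land": [0.1, 0.8, 0.0], "water": [0.0, 0.7, 0.3],
}


def weighted_sum(weighted_words):
    """Sum the vectors of a {word: weight} mapping; unknown words add nothing."""
    total = [0.0, 0.0, 0.0]
    for word, weight in weighted_words.items():
        for i, value in enumerate(VECTORS.get(word, [0.0, 0.0, 0.0])):
            total[i] += weight * value
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


context = weighted_sum({"hold": 1.0, "mortgage": 1.0, "home": 1.0})
financial = weighted_sum({"financial": 1.0, "deposit": 1.0, "money": 1.0})
river = weighted_sum({"sloping": 1.0, "land": 1.0, "water": 1.0})

print(cosine(context, financial), cosine(context, river))
```

Although the context shares no word with either gloss, the financial gloss vector lies much closer to the context vector, which is the effect that string matching cannot capture.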
Solution Sense Distribution
Sense distribution
Insight
Analyze the distribution of meanings according to each word
Solution
$$p(s_{ij} \mid w_i) = \frac{t(w_i, s_{ij}) + 1}{\#w_i + |S_i|} \qquad (3)$$
$t(w_i, s_{ij})$: number of times the word $w_i$ is tagged with $s_{ij}$
$\#w_i$: number of occurrences of $w_i$
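Equation (3) is an add-one smoothed relative frequency, so a sense never seen in the annotated corpus still receives a small non-zero probability. A minimal Python sketch, assuming a hypothetical corpus given as `(word, sense)` pairs in which every occurrence of the word is tagged:

```python
from collections import Counter


def sense_distribution(word, senses, tagged_corpus):
    """Eq. (3): p(s_ij | w_i) = (t(w_i, s_ij) + 1) / (#w_i + |S_i|)."""
    tags = Counter(sense for w, sense in tagged_corpus if w == word)
    occurrences = sum(tags.values())  # #w_i, assuming every occurrence is tagged
    return {s: (tags[s] + 1) / (occurrences + len(senses)) for s in senses}


# Hypothetical sense-annotated corpus and sense inventory.
CORPUS = [("bank", "bank#2")] * 6 + [("bank", "bank#1")] * 2 + [("plant", "plant#1")]
SENSES = ["bank#1", "bank#2", "bank#3"]

probs = sense_distribution("bank", SENSES, CORPUS)
print(probs)
```

Here `bank#3` is never observed, yet it gets probability 1/11, and the three probabilities sum to one.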
Solution Methodology
Shaking the ingredients
1 For each word retrieve the list of meanings
2 Expand the glosses and build the corresponding vector for each expanded gloss
3 Create the context vector from the surrounding words
4 Compute the overlap in DSM
5 Combine the overlap with sense distribution
6 Select the meaning whose extended gloss has the maximum overlap
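Steps 5 and 6 can be sketched in Python as follows, taking the cosine scores and the sense probabilities as given. The function name, the parameter `alpha` and the exact form `alpha * cosine + (1 - alpha) * probability` are assumptions; the slides only state that the two scores are linearly combined, with a combination factor of 0.5 in the evaluated setup.

```python
def disambiguate(cosine_scores, sense_probs, alpha=0.5):
    """Linearly combine overlap and sense probability, then pick the best sense."""
    combined = {
        sense: alpha * cosine_scores[sense] + (1 - alpha) * sense_probs.get(sense, 0.0)
        for sense in cosine_scores
    }
    return max(combined, key=combined.get), combined


# Hypothetical outputs of the earlier steps for the target word "bank".
cosine_scores = {"bank#1": 0.30, "bank#2": 0.98, "bank#3": 0.10}
sense_probs = {"bank#1": 3 / 11, "bank#2": 7 / 11, "bank#3": 1 / 11}

best, combined = disambiguate(cosine_scores, sense_probs)
print(best, combined)
```

Setting `alpha=1.0` in this sketch ignores the sense distribution and ranks senses by cosine similarity alone.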
Evaluation
Compared against
1 Simplified Lesk approach
2 Other task participants
Evaluate the system with and without sense distribution
Sense distribution linearly combined with the cosine similarity score
Dataset
Dataset: Task-12 of SemEval-2013 Multilingual Word Sense Disambiguation
Sense inventory: BabelNet
Metrics: F-measure
Evaluation System setup
System setup
Developed in Java relying on BabelNet API 1.1.1
Lucene analyzer to tokenize both glosses and the context, Snowball library for stemming
Latent Semantic Analysis for building the DSM, considering the 100,000 most frequent words
BNC corpus for English
Wikipedia dump for Italian
Synset distance d is set to 1
Several context sizes: 3, 5, 10, 20 and the whole text
Combination factor for cosine similarity and sense distribution: 0.5
Evaluation Results
English
Run         Context size   Sense distr.   F
MFS         -              -              0.656
EN.LESK.1   3              N              0.525
EN.LESK.6   3              Y              0.633
EN.DSM.1    3              N              0.536
EN.DSM.2    5              N              0.605
EN.DSM.3    10             N              0.633
EN.DSM.4    20             N              0.650
EN.DSM.5    W              N              0.687
EN.DSM.6    3              Y              0.669
EN.DSM.7    5              Y              0.677
EN.DSM.8    10             Y              0.689
EN.DSM.9    20             Y              0.696
EN.DSM.10   W              Y              0.715
W: whole text used as context
Evaluation Results
Italian
Run         Context size   Sense distr.   F
MFS         -              -              0.572
IT.LESK.2   5              N              0.530
IT.LESK.10  W              Y              0.607
IT.DSM.1    3              N              0.610
IT.DSM.2    5              N              0.607
IT.DSM.3    10             N              0.626
IT.DSM.4    20             N              0.628
IT.DSM.5    W              N              0.633
IT.DSM.6    3              Y              0.631
IT.DSM.7    5              Y              0.630
IT.DSM.8    10             Y              0.635
IT.DSM.9    20             Y              0.639
IT.DSM.10   W              Y              0.641
W: whole text used as context
Evaluation Task results
English
System        F
EN.DSM.10     0.715
EN.DSM.5      0.687
UMCC-DLSI-2   0.685
UMCC-DLSI-3   0.680
UMCC-DLSI-1   0.677
MFS           0.656
DAEBAK        0.604
GETALP-BN-1   0.263
GETALP-BN-2   0.266
Evaluation Task results
Italian
System        F
UMCC-DLSI-2   0.658
UMCC-DLSI-1   0.657
IT.DSM.10     0.641
IT.DSM.5      0.633
DAEBAK        0.613
MFS           0.572
GETALP-BN-2   0.325
GETALP-BN-1   0.324
Conclusions and Future Work Conclusions
Recap
The proposed algorithm outperforms the simple Lesk one for both English and Italian
Without knowledge about sense distribution, the system outperforms the MFS baseline with every context size for Italian and with the whole text as context for English
For English the system obtained the best results in the SemEval-2013 Task 12 with or without sense distribution
Conclusions and Future Work Future work
What's next?
Extend the evaluation to other languages
Evaluate different DSMs and compositional approaches
Adapt our approach to a specific domain
Use a domain corpus for DSM building
Exploit a domain sense annotated corpus for sense distribution
That's all folks!
The system is available on line
https://github.com/pippokill/lesk-wsd-dsm