Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm

Word Sense Disambiguation

Marina Santini
santinim@stp.lingfil.uu.se

Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden

Spring 2016

1
Previous Lecture: Word Senses
• Homonymy, polysemy, synonymy, metonymy, etc.

Practical activities:
1) SELECTIONAL RESTRICTIONS
2) MANUAL DISAMBIGUATION OF EXAMPLES USING SENSEVAL SENSES

AIMS OF PRACTICAL ACTIVITIES:
• STUDENTS SHOULD GET ACQUAINTED WITH REAL DATA
• EXPLORATION OF APPLICATIONS, RESOURCES AND METHODS.

2
No preset solutions (this slide is to tell you that you are doing well ☺)
• Whatever your experience with data, it is a valuable experience:
• Disappointment
• Frustration
• Feeling lost
• Happiness
• Power
• Excitement
• …
• All the students so far (also in previous courses) have presented their own solutions… many different solutions and it is ok…

3
J&M own solutions: Selectional Restrictions (just for your records; it does not mean they are necessarily better than yours…)

4
Other possible solutions…
• Kiss → concrete sense: touching with lips/mouth
• animate kiss [using lips/mouth] animate/inanimate
• Ex: he kissed her;
• The dolphin kissed the kid
• Why does the pope kiss the ground after he disembarks ...
• Kiss → figurative sense: touching
• animate kiss inanimate
• Ex: "Walk as if you are kissing the Earth with your feet."

pursed lips?

5
NO solution or comments provided for Senseval
• All your impressions and feelings are plausible and acceptable ☺

6
Remember that in both activities…
• You have experienced cases of POLYSEMY!
• YOU HAVE TRIED TO DISAMBIGUATE THE SENSES MANUALLY, I.E. WITH YOUR HUMAN SKILLS…

7
Previous lecture: end

8
Today: Word Sense Disambiguation (WSD)
• Given:
• A word in context;
• A fixed inventory of potential word senses;
• Create a system that automatically decides which sense of the word is correct in that context.
Word Sense Disambiguation: Definition
• Word Sense Disambiguation (WSD) is the TASK of determining the correct sense of a word in context.
• It is an automatic task: we create a system that automatically disambiguates the senses for us.
• Useful for many NLP tasks: information retrieval (apple the fruit or Apple the company?), question answering (does United serve Philadelphia?), machine translation (Eng. "bat" → It. pipistrello or mazza?)

10
Anecdote: the poison apple
• In 1954, Alan Turing died after biting into an apple laced with cyanide.
• It was said that this half-bitten apple inspired the Apple logo… but apparently it is a legend ☺
• http://mentalfloss.com/article/64049/did-alan-turing-inspire-apple-logo

11
Be alert…
• Word sense ambiguity is pervasive!!!

12
Acknowledgements
Most slides borrowed or adapted from:
Dan Jurafsky and James H. Martin
Dan Jurafsky and Christopher Manning, Coursera

J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/
Outline: WSD Methods
• Thesaurus/Dictionary Methods
• Supervised Machine Learning
• Semi-Supervised Learning (self-reading)

14
Word Sense Disambiguation
Dictionary and Thesaurus Methods
The Simplified Lesk algorithm
• Let's disambiguate "bank" in this sentence:
The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.
• given the following two WordNet senses:

(excerpt of the algorithm figure:)
if overlap > max-overlap then
    max-overlap ← overlap
    best-sense ← sense
end
return(best-sense)

Figure 16.6 The Simplified Lesk algorithm. The COMPUTEOVERLAP function returns the number of words in common between two sets, ignoring function words or other words on a stop list. The original Lesk algorithm defines the context in a more complex way. The Corpus Lesk algorithm weights each overlapping word w by its log P(w) and includes labeled training corpus data in the signature.

bank1  Gloss: a financial institution that accepts deposits and channels the money into lending activities
       Examples: "he cashed a check at the bank", "that bank holds the mortgage on my home"
bank2  Gloss: sloping land (especially the slope beside a body of water)
       Examples: "they pulled the canoe up on the bank", "he sat on the bank of the river and watched the currents"
The Simplified Lesk algorithm
The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.

(Figure 16.6 and the bank1/bank2 glosses repeated from the previous slide.)

Choose the sense with the most word overlap between gloss and context (not counting function words). A small Python sketch of this procedure follows below.
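Below is a minimal, illustrative sketch of Simplified Lesk in Python using NLTK's WordNet interface (it assumes the nltk package with the wordnet and stopwords corpora installed); it is not the lecture's reference implementation, just one way to realize the overlap idea.

from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords

STOP = set(stopwords.words('english'))

def simplified_lesk(word, sentence):
    # Pick the synset whose gloss + examples share the most
    # non-function words with the sentence context.
    context = {w.lower().strip('.,') for w in sentence.split()} - STOP
    best_sense, max_overlap = None, 0
    for sense in wn.synsets(word):
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = len(context & (signature - STOP))
        if overlap > max_overlap:
            max_overlap, best_sense = overlap, sense
    return best_sense

print(simplified_lesk(
    "bank",
    "The bank can guarantee deposits will eventually cover future "
    "tuition costs because it invests in adjustable-rate mortgage securities."))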
Drawback
• Glosses and examples might be too short and may not provide enough chance to overlap with the context of the word to be disambiguated.

18
The Corpus(-based) Lesk algorithm
• Assumes we have some sense-labeled data (like SemCor)
• Take all the sentences with the relevant word sense:
These short, "streamlined" meetings usually are sponsored by local banks1, Chambers of Commerce, trade associations, or other civic organizations.
• Now add these to the gloss + examples for each sense, and call this the "signature" of a sense. Basically, it is an expansion of the dictionary entry.
• Choose the sense with the most word overlap between context and signature (i.e. the context words provided by the resources).
Corpus Lesk: IDF weighting
• Instead of just removing function words
• Weigh each word by its 'promiscuity' across documents
• Down-weights words that occur in every 'document' (gloss, example, etc.)
• These are generally function words, but it is a more fine-grained measure
• Weigh each overlapping word by inverse document frequency (IDF); a toy sketch of this weighting follows below.

20
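The following is a toy sketch of the IDF-weighted overlap idea, with made-up token lists standing in for the glosses, examples and labelled sentences; names and data are illustrative only.

import math
from collections import Counter

def idf_weights(documents):
    # documents: a list of token lists (each gloss/example/labelled sentence
    # counts as one "document"); returns {word: log(N / document-frequency)}.
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return {w: math.log(n_docs / df[w]) for w in df}

def weighted_overlap(context, signature, idf):
    # Sum the IDF weights of the words shared by context and signature,
    # instead of just counting them (or filtering with a stop list).
    return sum(idf.get(w, 0.0) for w in set(context) & set(signature))

docs = [["he", "cashed", "a", "check", "at", "the", "bank"],
        ["the", "bank", "holds", "the", "mortgage", "on", "my", "home"],
        ["they", "pulled", "the", "canoe", "up", "on", "the", "bank"]]
idf = idf_weights(docs)
print(weighted_overlap(["it", "invests", "in", "mortgage", "securities"],
                       docs[1], idf))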
Graph-based methods
• First, WordNet can be viewed as a graph
• senses are nodes
• relations (hypernymy, meronymy) are edges
• Also add edges between words and unambiguous gloss words

An undirected graph is a set of nodes that are connected together by bidirectional edges (lines).

[Figure: a fragment of such a sense graph around "drink", with nodes like drink_v1, drink_n1, drinker_n1, drinking_n1, potation_n1, sip_n1, sip_v1, beverage_n1, milk_n1, liquid_n1, food_n1, helping_n1, sup_v1, consumption_n1, consumer_n1, consume_v1, toast_n4.]

21
How to use the graph for WSD
"She drank some milk"
• choose the most central sense (several algorithms have been proposed recently); a toy centrality sketch follows below.

[Figure: the senses of "drink" (drink_v1 … drink_v5) and "milk" (milk_n1 … milk_n4) linked through nodes such as drinker_n1, beverage_n1, boozing_n1, food_n1, drink_n1, nutriment_n1.]

22
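One family of such centrality algorithms is personalized PageRank over the sense graph. Here is a toy sketch with networkx (a library not mentioned in the lecture) on a tiny, hand-invented graph fragment; the node names and edges are illustrative, not real WordNet data.

import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("drink.v.01", "beverage.n.01"), ("beverage.n.01", "milk.n.01"),
    ("beverage.n.01", "food.n.01"),  ("milk.n.01", "food.n.01"),
    ("drink.v.02", "boozing.n.01"),  # an unrelated sense of "drink"
    ("milk.n.04", "river.n.01"),     # an unrelated sense of "milk"
])

# Restart the random walk from the senses of the context words.
targets = ("drink", "milk")
personalization = {n: (1.0 if n.startswith(targets) else 0.0) for n in G}
rank = nx.pagerank(G, personalization=personalization)

# For each target word, pick its highest-ranked (most central) sense.
for word in targets:
    senses = [n for n in G if n.startswith(word)]
    print(word, "->", max(senses, key=rank.get))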
Word Meaning and Similarity
Word Similarity: Thesaurus Methods
beg: c_w8
Word Similarity
• Synonymy: a binary relation
• Two words are either synonymous or not
• Similarity (or distance): a looser metric
• Two words are more similar if they share more features of meaning
• Similarity is properly a relation between senses
• We do not say "the word 'bank' is similar to the word 'slope'"; rather, we say:
• Bank1 is similar to fund3
• Bank2 is similar to slope5
• But we'll compute similarity over both words and senses
Why word similarity
• Information retrieval
• Question answering
• Machine translation
• Natural language generation
• Language modeling
• Automatic essay grading
• Plagiarism detection
• Document clustering
Word similarity and word relatedness
• We often distinguish word similarity from word relatedness
• Similar words: near-synonyms
• car, bicycle: similar
• Related words: can be related any way
• car, gasoline: related, not similar
Cf. synonyms: car & automobile
Two classes of similarity algorithms
• Thesaurus-based algorithms
• Are words "nearby" in the hypernym hierarchy?
• Do words have similar glosses (definitions)?
• Distributional algorithms: next time!
• Do words have similar distributional contexts?
Path-based similarity
• Two concepts (senses/synsets) are similar if they are near each other in the thesaurus hierarchy
• = have a short path between them
• concepts have path 1 to themselves
Refinements to path-based similarity
• pathlen(c1,c2) = (distance metric) = 1 + number of edges in the shortest path in the hypernym graph between sense nodes c1 and c2
• simpath(c1,c2) = 1 / pathlen(c1,c2)
• wordsim(w1,w2) = max sim(c1,c2) over c1 ∈ senses(w1), c2 ∈ senses(w2)

Sense similarity metric: 1 over the distance!
Word similarity metric: max similarity among pairs of senses.
For all senses of w1 and all senses of w2, take the similarity between each of the senses of w1 and each of the senses of w2, and then take the maximum similarity between those pairs.
Example: path-based similarity
simpath(c1,c2) = 1 / pathlen(c1,c2)
simpath(nickel,coin) = 1/2 = .5
simpath(fund,budget) = 1/2 = .5
simpath(nickel,currency) = 1/4 = .25
simpath(nickel,money) = 1/6 = .17
simpath(coinage,Richter scale) = 1/6 = .17
(A quick NLTK check of these path similarities follows below.)
Problem with basic path-based similarity
• Assumes each link represents a uniform distance
• But nickel to money seems to us to be closer than nickel to standard
• Nodes high in the hierarchy are very abstract
• We instead want a metric that
• Represents the cost of each edge independently
• Words connected only through abstract nodes
• are less similar
Information content similarity metrics
• In simple words:
• We define the probability of a concept C as the probability that a randomly selected word in a corpus is an instance of that concept.
• Basically, for each random word in a corpus we compute how probable it is that it belongs to a certain concept.

Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI
Formally: Information content similarity metrics
• Let's define P(c) as:
• The probability that a randomly selected word in a corpus is an instance of concept c
• Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
• for a given concept, each observed noun is either
• a member of that concept with probability P(c)
• not a member of that concept with probability 1 - P(c)
• All words are members of the root node (Entity)
• P(root) = 1
• The lower a node in the hierarchy, the lower its probability

Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI
Information content similarity
• For every concept (e.g. "natural elevation"), we count all the words in that concept, and then we normalize by the total number of words in the corpus.
• we get a probability value that tells us how probable it is that a random word is an instance of that concept (a toy estimation sketch follows below)

P(c) = ( Σ_{w ∈ words(c)} count(w) ) / N

[Figure: fragment of the WordNet hierarchy: entity → … → geological-formation, with hyponyms shore (→ coast), natural elevation (→ ridge, hill), cave (→ grotto), …]

In order to compute the probability of the term "natural elevation", we take ridge, hill + natural elevation itself.
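To make the formula concrete, here is a toy sketch of estimating P(c) from raw counts; the tiny corpus and the word-to-concept lists are invented for illustration and are not real WordNet data.

from collections import Counter

corpus = ("the hill rose above the coast and another hill lay "
          "beyond the ridge near the shore").split()
counts, N = Counter(corpus), len(corpus)

# words(c): all words subsumed by the concept (the concept's own term included)
concept_words = {
    "natural_elevation": ["natural_elevation", "hill", "ridge"],
    "shore": ["shore", "coast"],
}

def P(concept):
    # P(c) = sum of corpus counts of the words under c, divided by N
    return sum(counts[w] for w in concept_words[concept]) / N

print(P("natural_elevation"))   # 3/16 here (hill: 2, ridge: 1, natural_elevation: 0)
print(P("shore"))               # 2/16 here (shore: 1, coast: 1)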
Information content similarity
• WordNet hierarchy augmented with probabilities P(c)

D. Lin. 1998. An Information-Theoretic Definition of Similarity. ICML 1998
Information content: definitions
1. Information content:
   IC(c) = -log P(c)
2. Most informative subsumer (Lowest common subsumer):
   LCS(c1,c2) = the most informative (lowest) node in the hierarchy subsuming both c1 and c2
IC aka…
• A lot of people prefer the term surprisal to information or to information content.
-log p(x)
It measures the amount of surprise generated by the event x.
The smaller the probability of x, the bigger the surprisal is.
It's helpful to think about it this way, particularly for linguistics examples.

37
Using information content for similarity: the Resnik method
• The similarity between two words is related to their common information
• The more two words have in common, the more similar they are
• Resnik: measure common information as:
• The information content of the most informative (lowest) subsumer (MIS/LCS) of the two nodes
• simresnik(c1,c2) = -log P(LCS(c1,c2))
(An NLTK example follows below.)

Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI 1995.
Philip Resnik. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR 11, 95-130.
Dekang Lin method
• Intuition: Similarity between A and B is not just what they have in common
• The more differences between A and B, the less similar they are:
• Commonality: the more A and B have in common, the more similar they are
• Difference: the more differences between A and B, the less similar
• Commonality: IC(common(A,B))
• Difference: IC(description(A,B)) - IC(common(A,B))

Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. ICML
Dekang Lin similarity theorem
• The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are:

simLin(A,B) ∝ IC(common(A,B)) / IC(description(A,B))

• Lin (altering Resnik) defines IC(common(A,B)) as 2 x the information of the LCS:

simLin(c1,c2) = 2 log P(LCS(c1,c2)) / ( log P(c1) + log P(c2) )
Lin similarity function
simLin(A,B) = 2 log P(LCS(c1,c2)) / ( log P(c1) + log P(c2) )

simLin(hill,coast) = 2 log P(geological-formation) / ( log P(hill) + log P(coast) )
                   = 2 ln 0.00176 / ( ln 0.0000189 + ln 0.0000216 )
                   = .59
(The same pair scored with NLTK is shown below.)
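For comparison, NLTK also implements Lin's measure; a minimal sketch on the same pair is below. The value should be in the same ballpark as the slide's .59, though not necessarily identical, since it depends on the information-content counts used.

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')
hill, coast = wn.synset('hill.n.01'), wn.synset('coast.n.01')

# sim_lin(c1,c2) = 2 * IC(LCS(c1,c2)) / (IC(c1) + IC(c2))
print(hill.lin_similarity(coast, brown_ic))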
The (extended) Lesk Algorithm
• A thesaurus-based measure that looks at glosses
• Two concepts are similar if their glosses contain similar words
• Drawing paper: paper that is specially prepared for use in drafting
• Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
• For each n-word phrase that's in both glosses
• Add a score of n²
• paper and specially prepared → 1 + 2² = 5 (a toy scoring sketch follows below)
• Compute overlap also for other relations
• glosses of hypernyms and hyponyms
Summary: thesaurus-based similarity
Libraries for computing thesaurus-based similarity
• NLTK
• http://nltk.github.com/api/nltk.corpus.reader.html?highlight=similarity - nltk.corpus.reader.WordNetCorpusReader.res_similarity
• WordNet::Similarity
• http://wn-similarity.sourceforge.net/
• Web-based interface:
• http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi

44
Machine Learning based approach
Basic idea
• If we have data that has been hand-labelled with correct word senses, we can use a supervised learning approach and learn from it!
• We need to extract features and train a classifier
• The output of training is an automatic system capable of assigning sense labels to unlabelled words in context.

46
Two variants of WSD task
• Lexical Sample task
• (we need labelled corpora for individual senses)
• Small pre-selected set of target words (e.g. difficulty)
• And an inventory of senses for each word
• Supervised machine learning: train a classifier for each word
• All-words task
• (each word in each sentence is labelled with a sense)
• Every word in an entire text
• A lexicon with senses for each word
SENSEVAL 1-2-3
Supervised Machine Learning Approaches
• Summary of what we need:
• the tag set ("sense inventory")
• the training corpus
• A set of features extracted from the training corpus
• A classifier
Supervised WSD 1: WSD Tags
• What's a tag?
A dictionary sense?
• For example, for WordNet an instance of "bass" in a text has 8 possible tags or labels (bass1 through bass8).
8 senses of "bass" in WordNet
1. bass - (the lowest part of the musical range)
2. bass, bass part - (the lowest part in polyphonic music)
3. bass, basso - (an adult male singer with the lowest voice)
4. sea bass, bass - (flesh of lean-fleshed saltwater fish of the family Serranidae)
5. freshwater bass, bass - (any of various North American lean-fleshed freshwater fishes especially of the genus Micropterus)
6. bass, bass voice, basso - (the lowest adult male singing voice)
7. bass - (the member with the lowest range of a family of musical instruments)
8. bass - (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)
SemCor
<wf pos=PRP>He</wf>
<wf pos=VB lemma=recognize wnsn=4 lexsn=2:31:00::>recognized</wf>
<wf pos=DT>the</wf>
<wf pos=NN lemma=gesture wnsn=1 lexsn=1:04:00::>gesture</wf>
<punc>.</punc>

SemCor: 234,000 words from the Brown Corpus, manually tagged with WordNet senses.

51
Supervised WSD: Extract feature vectors
Intuition from Warren Weaver (1955):
"If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words…
But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word…
The practical question is: 'What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?'"

the window
Feature vectors
• Vectors of sets of feature/value pairs
Two kinds of features in the vectors
• Collocational features and bag-of-words features
• Collocational/Paradigmatic
• Features about words at specific positions near the target word
• Often limited to just word identity and POS
• Bag-of-words
• Features about words that occur anywhere in the window (regardless of position)
• Typically limited to frequency counts

Generally speaking, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance. But here the meaning is not exactly this…
Examples
• Example text (WSJ):
An electric guitar and bass player stand off to one side not really part of the scene
• Assume a window of +/- 2 from the target
Examples
• Example text (WSJ)
An electric guitar and bass player stand off to one side not really part of the scene,
• Assume a window of +/- 2 from the target
Collocational features
• Position-specific information about the words and collocations in the window
• guitar and bass player stand
• word 1,2,3-grams in a window of ±3 is common
(An extraction sketch follows below.)

From the textbook excerpt (J&M): for the ambiguous word bass in the following WSJ sentence
(16.17) An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
a collocational feature vector, extracted from a window of two words to the right and left of the target word, made up of the words themselves, their respective parts-of-speech, and pairs of words, that is,
[w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}, w_{i-2}^{i-1}, w_{i}^{i+1}]
would yield the following vector:
[guitar, NN, and, CC, player, NN, stand, VB, and guitar, player stand]
High-performing systems generally use POS tags and word collocations of length 1, 2, and 3 from a window of 3 words to the left and 3 to the right (Zhong and Ng).
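Below is a minimal sketch of extracting such position-specific features in Python with nltk.pos_tag (which needs the averaged_perceptron_tagger model); the exact tags it produces and the way the two word-pair features are formed (one pair to the left and one to the right of the target) are simplifications of the excerpt, for illustration only.

import nltk

sentence = ("An electric guitar and bass player stand off to one side "
            "not really part of the scene").split()
i = sentence.index("bass")                 # position of the target word
tagged = nltk.pos_tag(sentence)

def collocational_features(tagged_tokens, i):
    words = [w for w, _ in tagged_tokens]
    tags = [t for _, t in tagged_tokens]
    return [words[i - 2], tags[i - 2],     # w_{i-2}, POS_{i-2}
            words[i - 1], tags[i - 1],     # w_{i-1}, POS_{i-1}
            words[i + 1], tags[i + 1],     # w_{i+1}, POS_{i+1}
            words[i + 2], tags[i + 2],     # w_{i+2}, POS_{i+2}
            " ".join(words[i - 2:i]),      # word pair to the left of the target
            " ".join(words[i + 1:i + 3])]  # word pair to the right of the target

print(collocational_features(tagged, i))
# roughly: ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', <verb tag>,
#           'guitar and', 'player stand']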
Bag-of-words features
• "an unordered set of words" – position ignored
• Choose a vocabulary: a useful subset of words in a training corpus
• Either: the count of how often each of those terms occurs in a given window OR just a binary "indicator" 1 or 0
Co-Occurrence Example
• Assume we've settled on a possible vocabulary of 12 words in "bass" sentences:
[fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
• The vector for:
guitar and bass player stand
[0,0,0,1,0,0,0,0,0,0,1,0]
(A small sketch that builds this vector follows below.)
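A tiny sketch reproducing the indicator vector above; the fixed vocabulary is the one from the slide.

VOCAB = ["fishing", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "playing", "guitar", "band"]

def bow_vector(window_words, vocab=VOCAB):
    # Binary "indicator" version: 1 if the vocabulary word occurs
    # anywhere in the window, 0 otherwise (position is ignored).
    window = {w.lower() for w in window_words}
    return [1 if w in window else 0 for w in vocab]

print(bow_vector("guitar and bass player stand".split()))
# -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]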
Word Sense Disambiguation
Classification
Classification
• Input:
• a word w and some features f
• a fixed set of classes C = {c1, c2,…, cJ}
• Output: a predicted class c ∈ C

Any kind of classifier
• Naive Bayes
• Logistic regression
• Neural Networks
• Support-vector machines
• k-Nearest Neighbors
• etc.
(A small end-to-end sketch follows below.)
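To tie the pieces together, here is a minimal end-to-end sketch with scikit-learn (not named in the lecture, just one convenient choice): a handful of invented, hand-labelled "bass" contexts are turned into bag-of-words features and used to train a Naive Bayes classifier. The tiny data set is purely illustrative, so the predictions are only indicative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented sense-labelled contexts (lexical-sample style, one target word: "bass")
train_contexts = [
    "an electric guitar and bass player stand off to one side",
    "the band needs a new bass and a drummer",
    "he caught a huge bass while fly fishing on the river",
    "grilled sea bass served with lemon and herbs",
]
train_senses = ["bass_music", "bass_music", "bass_fish", "bass_fish"]

# Bag-of-words features + Naive Bayes classifier
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_contexts, train_senses)

print(clf.predict(["she plays bass in a jazz band"]))        # expect bass_music
print(clf.predict(["we caught a bass in the river today"]))  # expect bass_fish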
The end

62