Lecture: Word Sense Disambiguation

Keywords: word sense disambiguation (WSD), thesaurus-based methods, dictionary-based methods, supervised methods, Lesk algorithm, Michael Lesk, simplified Lesk, corpus Lesk, graph-based methods, word similarity, word relatedness, path-based similarity, information content, surprisal, Resnik method, Lin method, eLesk, extended Lesk, SemCor, collocational features, bag-of-words features, the window, lexical semantics, computational semantics, semantic analysis in language technology.


1. Semantic Analysis in Language Technology
   http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm

   Word Sense Disambiguation
   Marina Santini
   santinim@stp.lingfil.uu.se
   Department of Linguistics and Philology
   Uppsala University, Uppsala, Sweden
   Spring 2016
2. Previous Lecture: Word Senses
   • Homonymy, polysemy, synonymy, metonymy, etc.
   Practical activities:
   1) SELECTIONAL RESTRICTIONS
   2) MANUAL DISAMBIGUATION OF EXAMPLES USING SENSEVAL SENSES
   Aims of the practical activities:
   • Students should get acquainted with real data
   • Exploration of applications, resources and methods.
3. No preset solutions (this slide is to tell you that you are doing well :) )
   • Whatever your experience with the data, it is a valuable experience:
   • Disappointment
   • Frustration
   • Feeling lost
   • Happiness
   • Power
   • Excitement
   • …
   • All the students so far (also in previous courses) have presented their own solutions… many different solutions, and it is ok…
4. J&M's own solutions: Selectional Restrictions (just for your records; it does not mean they are necessarily better than yours…)
5. Other possible solutions…
   • Kiss → concrete sense: touching with lips/mouth
   • animate kiss [using lips/mouth] animate/inanimate
   • Ex: he kissed her;
   • The dolphin kissed the kid
   • Why does the pope kiss the ground after he disembarks ...
   • Kiss → figurative sense: touching
   • animate kiss inanimate
   • Ex: "Walk as if you are kissing the Earth with your feet."
   pursed lips?
6. No solutions or comments provided for Senseval
   • All your impressions and feelings are plausible and acceptable :)
7. Remember that in both activities…
   • You have experienced cases of POLYSEMY!
   • You have tried to disambiguate the senses manually, i.e. with your human skills…
8. Previous lecture: end
9. Today: Word Sense Disambiguation (WSD)
   • Given:
   • A word in context;
   • A fixed inventory of potential word senses;
   • Create a system that automatically decides which sense of the word is correct in that context.
10. Word Sense Disambiguation: Definition
   • Word Sense Disambiguation (WSD) is the TASK of determining the correct sense of a word in context.
   • It is an automatic task: we create a system that automatically disambiguates the senses for us.
   • Useful for many NLP tasks: information retrieval (apple the fruit or Apple the company?), question answering (does United serve Philadelphia?), machine translation (Eng. "bat" → It. "pipistrello" (the animal) or "mazza" (the baseball bat)?)
11. Anecdote: the poison apple
   • In 1954, Alan Turing died after biting into an apple laced with cyanide
   • It was said that this half-bitten apple inspired the Apple logo… but apparently it is a legend :)
   • http://mentalfloss.com/article/64049/did-alan-turing-inspire-apple-logo
12. Be alert…
   • Word sense ambiguity is pervasive!!!
13. Acknowledgements
   Most slides borrowed or adapted from:
   Dan Jurafsky and James H. Martin
   Dan Jurafsky and Christopher Manning, Coursera
   J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/
14. Outline: WSD Methods
   • Thesaurus/Dictionary Methods
   • Supervised Machine Learning
   • Semi-Supervised Learning (self-reading)
15. Word Sense Disambiguation: Dictionary and Thesaurus Methods
16. The Simplified Lesk algorithm
   • Let's disambiguate "bank" in this sentence:
   The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.
   • given the following two WordNet senses:
   bank1  Gloss: a financial institution that accepts deposits and channels the money into lending activities
          Examples: "he cashed a check at the bank", "that bank holds the mortgage on my home"
   bank2  Gloss: sloping land (especially the slope beside a body of water)
          Examples: "they pulled the canoe up on the bank", "he sat on the bank of the river and watched the currents"
   From Figure 16.6 (the Simplified Lesk algorithm), the core comparison step:
     if overlap > max-overlap then
       max-overlap ← overlap
       best-sense ← sense
     end
     return(best-sense)
   The COMPUTEOVERLAP function returns the number of words in common between two sets, ignoring function words or other words on a stop list. The original Lesk algorithm defines the context in a more complex way. The Corpus Lesk algorithm weights each overlapping word w by its log P(w) and includes labeled training corpus data in the signature.
17. The Simplified Lesk algorithm
   The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.
   (Figure 16.6 and the bank1/bank2 glosses repeated from the previous slide.)
   Choose the sense with the most word overlap between gloss and context (not counting function words).
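A minimal sketch of Simplified Lesk in Python, using NLTK's WordNet interface rather than the textbook's reference pseudocode; it assumes the NLTK wordnet and stopwords data are installed and uses naive whitespace tokenization.

    # Sketch of the Simplified Lesk algorithm with NLTK's WordNet (assumed setup).
    from nltk.corpus import wordnet as wn, stopwords

    STOP = set(stopwords.words("english"))

    def simplified_lesk(word, sentence):
        """Return the synset whose gloss + examples overlap most with the context.
        Ties go to the more frequent sense, since wn.synsets() lists senses by frequency."""
        context = {w.lower() for w in sentence.split()} - STOP
        best_sense, max_overlap = None, -1
        for sense in wn.synsets(word):
            # signature = gloss words + example words, minus stop words
            signature = set(sense.definition().lower().split())
            for ex in sense.examples():
                signature |= set(ex.lower().split())
            signature -= STOP
            overlap = len(signature & context)
            if overlap > max_overlap:
                best_sense, max_overlap = sense, overlap
        return best_sense

    print(simplified_lesk("bank",
        "The bank can guarantee deposits will eventually cover future "
        "tuition costs because it invests in adjustable-rate mortgage securities"))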
18. Drawback
   • Glosses and examples might be too short and may not provide enough chance to overlap with the context of the word to be disambiguated.
19. The Corpus(-based) Lesk algorithm
   • Assumes we have some sense-labeled data (like SemCor)
   • Take all the sentences with the relevant word sense:
   These short, "streamlined" meetings usually are sponsored by local banks1, Chambers of Commerce, trade associations, or other civic organizations.
   • Now add these to the gloss + examples for each sense, and call it the "signature" of a sense. Basically, it is an expansion of the dictionary entry.
   • Choose the sense with the most word overlap between context and signature (i.e. the context words provided by the resources).
20. Corpus Lesk: IDF weighting
   • Instead of just removing function words
   • Weigh each word by its 'promiscuity' across documents
   • Down-weights words that occur in every 'document' (gloss, example, etc.)
   • These are generally function words, but it is a more fine-grained measure
   • Weigh each overlapping word by inverse document frequency (IDF).
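A rough sketch of the IDF idea (not the exact Corpus Lesk weighting, which uses log P(w) from a labeled corpus): treat each sense signature as a "document", compute IDF over them, and score overlap as a weighted sum instead of a raw count. The helper names are made up for this illustration.

    import math
    from collections import Counter

    def idf_weights(signatures):
        """signatures: list of word sets, one per sense 'document'. IDF = log(N / df)."""
        n_docs = len(signatures)
        df = Counter()
        for sig in signatures:
            df.update(set(sig))
        return {w: math.log(n_docs / df[w]) for w in df}

    def weighted_overlap(signature, context, idf):
        # Words occurring in every signature (mostly function words) get weight 0,
        # so they no longer dominate the overlap score.
        return sum(idf.get(w, 0.0) for w in set(signature) & set(context))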
21. Graph-based methods
   • First, WordNet can be viewed as a graph
   • senses are nodes
   • relations (hypernymy, meronymy) are edges
   • Also add an edge between a word and unambiguous gloss words
   [Figure: an excerpt of the WordNet graph around drink_v1, with nodes such as beverage_n1, food_n1, milk_n1, liquid_n1, consume_v1, consumer_n1, consumption_n1, toast_n4, sip_n1, sip_v1, sup_v1, potation_n1, drinker_n1, drinking_n1, helping_n1, drink_n1.]
   An undirected graph is a set of nodes that are connected together by bidirectional edges (lines).
22. How to use the graph for WSD
   "She drank some milk"
   • choose the most central sense (several algorithms have been proposed recently)
   [Figure: the subgraph connecting the senses of "drink" (drink_v1..5, drink_n1) and "milk" (milk_n1..4) through nodes such as beverage_n1, food_n1, nutriment_n1, drinker_n1, boozing_n1.]
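As one hedged illustration of "choose the most central sense": the sketch below builds a small WordNet neighborhood graph with networkx and ranks candidate senses by PageRank. Using PageRank as the centrality measure, the two-step expansion, and the chosen relation types are all assumptions made for this sketch; the slide does not commit to a specific algorithm.

    # Sketch: graph-based WSD by centrality over a local WordNet graph (assumptions noted above).
    import networkx as nx
    from nltk.corpus import wordnet as wn

    def most_central_sense(target, context_words, depth=2):
        graph = nx.Graph()
        frontier = set()
        for w in [target] + context_words:
            frontier |= set(wn.synsets(w))
        for _ in range(depth):  # expand the graph along a few semantic relations
            new = set()
            for s in frontier:
                for rel in (s.hypernyms() + s.hyponyms() + s.part_meronyms()):
                    graph.add_edge(s, rel)
                    new.add(rel)
            frontier |= new
        rank = nx.pagerank(graph)
        candidates = [s for s in wn.synsets(target) if s in rank]
        return max(candidates, key=lambda s: rank[s], default=None)

    # "She drank some milk": which sense of "milk" is most central given "drink"?
    print(most_central_sense("milk", ["drink"]))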
23. Word Meaning and Similarity: Word Similarity (Thesaurus Methods)
24. Word Similarity
   • Synonymy: a binary relation
   • Two words are either synonymous or not
   • Similarity (or distance): a looser metric
   • Two words are more similar if they share more features of meaning
   • Similarity is properly a relation between senses
   • We do not say "the word 'bank' is similar to the word 'slope'"; instead we say:
   • Bank1 is similar to fund3
   • Bank2 is similar to slope5
   • But we'll compute similarity over both words and senses
25. Why word similarity?
   • Information retrieval
   • Question answering
   • Machine translation
   • Natural language generation
   • Language modeling
   • Automatic essay grading
   • Plagiarism detection
   • Document clustering
26. Word similarity and word relatedness
   • We often distinguish word similarity from word relatedness
   • Similar words: near-synonyms
   • car, bicycle: similar
   • Related words: can be related in any way
   • car, gasoline: related, not similar
   Cf. synonyms: car & automobile
27. Two classes of similarity algorithms
   • Thesaurus-based algorithms
   • Are words "nearby" in the hypernym hierarchy?
   • Do words have similar glosses (definitions)?
   • Distributional algorithms: next time!
   • Do words have similar distributional contexts?
28. Path-based similarity
   • Two concepts (senses/synsets) are similar if they are near each other in the thesaurus hierarchy
   • = have a short path between them
   • concepts have a path of length 1 to themselves
29. Refinements to path-based similarity
   • pathlen(c1,c2) (a distance metric) = 1 + number of edges in the shortest path in the hypernym graph between sense nodes c1 and c2
   • Sense similarity metric (1 over the distance):
     simpath(c1,c2) = 1 / pathlen(c1,c2)
   • Word similarity metric (maximum similarity among pairs of senses):
     wordsim(w1,w2) = max sim(c1,c2) over c1 ∈ senses(w1), c2 ∈ senses(w2)
   For all senses of w1 and all senses of w2, take the similarity between each of the senses of w1 and each of the senses of w2, and then take the maximum similarity among those pairs.
30. Example: path-based similarity
   simpath(c1,c2) = 1/pathlen(c1,c2)
   simpath(nickel, coin) = 1/2 = .5
   simpath(fund, budget) = 1/2 = .5
   simpath(nickel, currency) = 1/4 = .25
   simpath(nickel, money) = 1/6 = .17
   simpath(coinage, Richter scale) = 1/6 = .17
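With NLTK, path-based scores of this kind can be computed directly. Which synsets correspond to the slide's toy example is an assumption here, and the exact values depend on WordNet's actual hierarchy, so they need not match the numbers above.

    # Sketch: path-based similarity with NLTK's WordNet (data assumed installed).
    from nltk.corpus import wordnet as wn

    nickel = wn.synset("nickel.n.02")   # assumed: the coin sense of "nickel"
    coin = wn.synset("coin.n.01")
    money = wn.synset("money.n.01")

    # path_similarity implements 1 / pathlen(c1, c2)
    print(nickel.path_similarity(coin))
    print(nickel.path_similarity(money))

    def wordsim(w1, w2):
        """Word-level similarity = max over all sense pairs (as on the previous slide)."""
        return max((s1.path_similarity(s2) or 0)
                   for s1 in wn.synsets(w1) for s2 in wn.synsets(w2))

    print(wordsim("nickel", "budget"))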
31. Problem with basic path-based similarity
   • Assumes each link represents a uniform distance
   • But nickel to money seems to us to be closer than nickel to standard
   • Nodes high in the hierarchy are very abstract
   • We instead want a metric that
   • Represents the cost of each edge independently
   • Words connected only through abstract nodes
   • are less similar
32. Information content similarity metrics
   • In simple words:
   • We define the probability of a concept C as the probability that a randomly selected word in a corpus is an instance of that concept.
   • Basically, for each random word in a corpus we compute how probable it is that it belongs to a certain concept.
   Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI.
33. Formally: Information content similarity metrics
   • Let's define P(c) as:
   • The probability that a randomly selected word in a corpus is an instance of concept c
   • Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
   • for a given concept, each observed noun is either
   • a member of that concept with probability P(c)
   • not a member of that concept with probability 1 - P(c)
   • All words are members of the root node (Entity)
   • P(root) = 1
   • The lower a node is in the hierarchy, the lower its probability
   Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI.
34. Information content similarity
   • For every concept (e.g. "natural elevation"), we count all the words in that concept, and then we normalize by the total number of words in the corpus.
   • We get a probability value that tells us how probable it is that a random word is an instance of that concept.

     P(c) = ( Σ_{w ∈ words(c)} count(w) ) / N

   [Figure: a fragment of the WordNet hierarchy under "entity": geological-formation, with children such as natural elevation (hill, ridge), shore (coast), and cave (grotto).]
   In order to compute the probability of the term "natural elevation", we take ridge, hill + natural elevation itself.
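A tiny numeric sketch of the formula, using made-up counts for the "natural elevation" example (the numbers are purely illustrative, not real corpus counts):

    import math

    # Hypothetical counts for words(natural_elevation) = {ridge, hill, natural elevation}
    counts = {"ridge": 220, "hill": 680, "natural elevation": 100}
    N = 1_000_000        # assumed total number of corpus tokens

    p_concept = sum(counts.values()) / N   # P(c) = sum of word counts / N  -> 0.001
    ic = -math.log(p_concept)              # information content (surprisal) -> about 6.9 nats
    print(p_concept, ic)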
35. Information content similarity
   • WordNet hierarchy augmented with probabilities P(c)
   D. Lin. 1998. An Information-Theoretic Definition of Similarity. ICML 1998.
36. Information content: definitions
   1. Information content:
      IC(c) = -log P(c)
   2. Most informative subsumer (Lowest common subsumer):
      LCS(c1,c2) = the most informative (lowest) node in the hierarchy subsuming both c1 and c2
37. IC aka…
   • A lot of people prefer the term surprisal to information or to information content.
     -log p(x)
   It measures the amount of surprise generated by the event x.
   The smaller the probability of x, the bigger the surprisal is.
   It's helpful to think about it this way, particularly for linguistic examples.
38. Using information content for similarity: the Resnik method
   • The similarity between two words is related to their common information
   • The more two words have in common, the more similar they are
   • Resnik: measure common information as:
   • The information content of the most informative (lowest) subsumer (MIS/LCS) of the two nodes
   • simresnik(c1,c2) = -log P(LCS(c1,c2))
   Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI 1995.
   Philip Resnik. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR 11, 95-130.
39. Dekang Lin method
   • Intuition: Similarity between A and B is not just what they have in common
   • The more differences between A and B, the less similar they are:
   • Commonality: the more A and B have in common, the more similar they are
   • Difference: the more differences between A and B, the less similar they are
   • Commonality: IC(common(A,B))
   • Difference: IC(description(A,B)) - IC(common(A,B))
   Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. ICML.
40. Dekang Lin similarity theorem
   • The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are:

     simLin(A,B) ∝ IC(common(A,B)) / IC(description(A,B))

   • Lin (altering Resnik) defines IC(common(A,B)) as 2 x the information of the LCS:

     simLin(c1,c2) = 2 log P(LCS(c1,c2)) / ( log P(c1) + log P(c2) )
41. Lin similarity function
   simLin(c1,c2) = 2 log P(LCS(c1,c2)) / ( log P(c1) + log P(c2) )

   simLin(hill, coast) = 2 log P(geological-formation) / ( log P(hill) + log P(coast) )
                       = 2 ln 0.00176 / ( ln 0.0000189 + ln 0.0000216 )
                       = .59
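Both Resnik and Lin similarity are available in NLTK once an information-content file is loaded; the Brown-corpus IC file is one common choice. The resulting scores depend on that IC corpus and will not reproduce the slide's hand-computed probabilities exactly.

    # Sketch: Resnik and Lin similarity with NLTK (wordnet and wordnet_ic data assumed installed).
    from nltk.corpus import wordnet as wn, wordnet_ic

    brown_ic = wordnet_ic.ic("ic-brown.dat")   # P(c) estimated from the Brown corpus

    hill = wn.synset("hill.n.01")
    coast = wn.synset("coast.n.01")

    print(hill.res_similarity(coast, brown_ic))  # -log P(LCS(hill, coast))
    print(hill.lin_similarity(coast, brown_ic))  # 2 log P(LCS) / (log P(c1) + log P(c2))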
42. The (extended) Lesk Algorithm
   • A thesaurus-based measure that looks at glosses
   • Two concepts are similar if their glosses contain similar words
   • Drawing paper: paper that is specially prepared for use in drafting
   • Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
   • For each n-word phrase that's in both glosses
   • Add a score of n²
   • "paper" and "specially prepared" give 1 + 2² = 5
   • Compute the overlap also for other relations
   • glosses of hypernyms and hyponyms
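A rough sketch of the n² phrase-overlap scoring on the two glosses above. The greedy longest-match-first strategy is an implementation assumption for this sketch, not necessarily the exact procedure of the extended Lesk measure.

    def elesk_overlap(gloss1, gloss2):
        """Score overlapping phrases: each shared n-word phrase adds n**2."""
        a, b = gloss1.lower().split(), gloss2.lower().split()
        score, used = 0, set()
        for length in range(min(len(a), len(b)), 0, -1):   # longest phrases first
            for i in range(len(a) - length + 1):
                if any(j in used for j in range(i, i + length)):
                    continue  # do not reuse words already counted in a longer phrase
                phrase = a[i:i + length]
                if any(phrase == b[k:k + length] for k in range(len(b) - length + 1)):
                    score += length ** 2
                    used.update(range(i, i + length))
        return score

    d1 = "paper that is specially prepared for use in drafting"
    d2 = ("the art of transferring designs from specially prepared paper "
          "to a wood or glass or metal surface")
    print(elesk_overlap(d1, d2))   # "specially prepared" (2**2) + "paper" (1**2) = 5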
43. Summary: thesaurus-based similarity
44. Libraries for computing thesaurus-based similarity
   • NLTK
   • http://nltk.github.com/api/nltk.corpus.reader.html?highlight=similarity - nltk.corpus.reader.WordNetCorpusReader.res_similarity
   • WordNet::Similarity
   • http://wn-similarity.sourceforge.net/
   • Web-based interface:
   • http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi
45. Machine Learning-based approach
46. Basic idea
   • If we have data that has been hand-labelled with correct word senses, we can use a supervised learning approach and learn from it!
   • We need to extract features and train a classifier
   • The output of training is an automatic system capable of assigning sense labels to unlabelled words in context.
47. Two variants of the WSD task
   • Lexical Sample task
   • (we need labelled corpora for individual senses)
   • Small pre-selected set of target words (e.g. difficulty)
   • An inventory of senses for each word
   • Supervised machine learning: train a classifier for each word
   • All-words task
   • (each word in each sentence is labelled with a sense)
   • Every word in an entire text
   • A lexicon with senses for each word
   SENSEVAL 1-2-3
48. Supervised Machine Learning Approaches
   • Summary of what we need:
   • the tag set ("sense inventory")
   • the training corpus
   • A set of features extracted from the training corpus
   • A classifier
49. Supervised WSD 1: WSD Tags
   • What's a tag? A dictionary sense?
   • For example, for WordNet an instance of "bass" in a text has 8 possible tags or labels (bass1 through bass8).
50. 8 senses of "bass" in WordNet
   1. bass - (the lowest part of the musical range)
   2. bass, bass part - (the lowest part in polyphonic music)
   3. bass, basso - (an adult male singer with the lowest voice)
   4. sea bass, bass - (flesh of lean-fleshed saltwater fish of the family Serranidae)
   5. freshwater bass, bass - (any of various North American lean-fleshed freshwater fishes especially of the genus Micropterus)
   6. bass, bass voice, basso - (the lowest adult male singing voice)
   7. bass - (the member with the lowest range of a family of musical instruments)
   8. bass - (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)
51. SemCor
   <wf pos=PRP>He</wf>
   <wf pos=VB lemma=recognize wnsn=4 lexsn=2:31:00::>recognized</wf>
   <wf pos=DT>the</wf>
   <wf pos=NN lemma=gesture wnsn=1 lexsn=1:04:00::>gesture</wf>
   <punc>.</punc>
   SemCor: 234,000 words from the Brown Corpus, manually tagged with WordNet senses
52. Supervised WSD: Extract feature vectors
   Intuition from Warren Weaver (1955):
   "If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words…
   But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word…
   The practical question is: "What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?"
   → the window
53. Feature vectors
   • Vectors of sets of feature/value pairs
54. Two kinds of features in the vectors
   • Collocational features and bag-of-words features
   • Collocational/Paradigmatic
   • Features about words at specific positions near the target word
   • Often limited to just word identity and POS
   • Bag-of-words
   • Features about words that occur anywhere in the window (regardless of position)
   • Typically limited to frequency counts
   Generally speaking, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance. But here the meaning is not exactly this…
55. Examples
   • Example text (WSJ):
   An electric guitar and bass player stand off to one side not really part of the scene
   • Assume a window of +/- 2 from the target
56. Examples
   • Example text (WSJ):
   An electric guitar and bass player stand off to one side not really part of the scene,
   • Assume a window of +/- 2 from the target (here: guitar and [bass] player stand)
57. Collocational features
   • Position-specific information about the words and collocations in the window
   • guitar and bass player stand
   • word 1-, 2-, 3-grams in a window of ±3 are common
   From J&M: collocational features encode local lexical and grammatical information that can often accurately isolate a given sense. For example, consider the ambiguous word bass in the following WSJ sentence:
   (16.17) An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
   A collocational feature vector, extracted from a window of two words to the right and left of the target word, made up of the words themselves, their respective parts-of-speech, and pairs of words, that is,
     [w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}, w_{i-2}^{i-1}, w_{i}^{i+1}]
   would yield the following vector:
     [guitar, NN, and, CC, player, NN, stand, VB, and guitar, player stand]
   High-performing systems generally use POS tags and word collocations of length 1, 2, and 3 from a window of 3 words to the left and 3 to the right (Zhong and Ng).
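A sketch of extracting the ±2 collocational features with NLTK's POS tagger (assumed installed, together with its tagger model). It uses the Penn Treebank tagset, and the feature names and bigram encoding below are choices made for this sketch, so the output may differ in detail from the vector on the slide.

    import nltk

    sent = ("An electric guitar and bass player stand off to one side "
            "not really part of the scene").split()
    tagged = nltk.pos_tag(sent)
    i = sent.index("bass")                     # position of the target word

    features = {
        "w-2": tagged[i-2][0], "pos-2": tagged[i-2][1],
        "w-1": tagged[i-1][0], "pos-1": tagged[i-1][1],
        "w+1": tagged[i+1][0], "pos+1": tagged[i+1][1],
        "w+2": tagged[i+2][0], "pos+2": tagged[i+2][1],
        "prev_bigram": " ".join(sent[i-2:i]),  # the two words before the target
        "next_bigram": " ".join(sent[i+1:i+3]),# the two words after the target
    }
    print(features)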
58. Bag-of-words features
   • "an unordered set of words" - position ignored
   • Choose a vocabulary: a useful subset of words in a training corpus
   • Either: the count of how often each of those terms occurs in a given window OR just a binary "indicator" 1 or 0
59. Co-occurrence example
   • Assume we've settled on a possible vocabulary of 12 words in "bass" sentences:
   [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
   • The vector for: guitar and bass player stand
     [0,0,0,1,0,0,0,0,0,0,1,0]
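The binary bag-of-words vector on this slide can be reproduced in a couple of lines:

    # Binary bag-of-words vector over the fixed 12-word vocabulary from the slide.
    vocab = ["fishing", "big", "sound", "player", "fly", "rod",
             "pound", "double", "runs", "playing", "guitar", "band"]

    window = "guitar and bass player stand".lower().split()
    vector = [1 if w in window else 0 for w in vocab]
    print(vector)   # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]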
60. Word Sense Disambiguation: Classification
61. Classification
   • Input:
   • a word w and some features f
   • a fixed set of classes C = {c1, c2, …, cJ}
   • Output: a predicted class c ∈ C
   Any kind of classifier:
   • Naive Bayes
   • Logistic regression
   • Neural Networks
   • Support-vector machines
   • k-Nearest Neighbors
   • etc.
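A minimal end-to-end sketch with scikit-learn (assumed available): feature dictionaries like the collocational and bag-of-words features above are vectorized and fed to a Naive Bayes classifier. The tiny training set and the sense labels are invented for illustration; in practice the examples would come from a sense-tagged corpus such as SemCor or the Senseval lexical-sample data.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Made-up training examples for the two main senses of "bass".
    X = [{"w-1": "and", "w+1": "player", "bow:guitar": 1},
         {"w-1": "sea", "w+1": "fishing", "bow:rod": 1},
         {"w-1": "the", "w+1": "guitarist", "bow:band": 1},
         {"w-1": "a", "w+1": "caught", "bow:fishing": 1}]
    y = ["bass_music", "bass_fish", "bass_music", "bass_fish"]

    clf = make_pipeline(DictVectorizer(), MultinomialNB())
    clf.fit(X, y)
    print(clf.predict([{"w-1": "electric", "w+1": "player", "bow:guitar": 1}]))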
62. The end
