Lecture: Vector Semantics (aka Distributional Semantics)

term-context matrix, distributional models, Zellig Harris, John Rupert Firth, PMI, Pointwise Mutual Information, PPMI, Positive Pointwise Mutual Information, joint probability, marginals, smoothing, cosine metric, cosine similarity measure, dot product, vectors.


  1. 1. Semantic Analysis in Language Technology http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm
     Vector Semantics (aka Distributional Semantics)
     Marina Santini, santinim@stp.lingfil.uu.se
     Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
     Spring 2016
  2. 2. Previous Lecture: Word Sense Disambiguation
  3. 3. Similarity measures (dictionary-based)
  4. 4. Collocational features: supervised
     • Position-specific information about the words and collocations in a window
     • guitar and bass player stand
     • word 1, 2, 3 grams in a window of ±3 is common
     From J&M: local lexical and grammatical information can often accurately isolate a given sense. For example, consider the ambiguous word bass in the following WSJ sentence: (16.17) "An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps." A collocational feature vector, extracted from a window of two words to the right and left of the target word, made up of the words themselves, their respective parts of speech, and pairs of words, that is, [w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}, w_{i-2}^{i-1}, w_{i}^{i+1}], would yield the following vector: [guitar, NN, and, CC, player, NN, stand, VB, and guitar, player stand]. High-performing systems generally use POS tags and word collocations of length 1, 2, and 3 from a window of 3 words to the left and 3 to the right (Zhong and Ng).
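A minimal Python sketch of the collocational feature template quoted above; the tokens and POS tags are supplied by hand here, whereas a real WSD system would take them from a tagger:

    # Collocational features for a ±2 window around the target word, following
    # the slide's template: words, their POS tags, and adjacent word pairs.
    def collocational_features(tokens, pos_tags, i):
        feats = []
        for offset in (-2, -1, +1, +2):
            feats.append(tokens[i + offset])
            feats.append(pos_tags[i + offset])
        feats.append(tokens[i - 2] + " " + tokens[i - 1])   # word pair to the left
        feats.append(tokens[i + 1] + " " + tokens[i + 2])   # word pair to the right
        return feats

    tokens   = ["an", "electric", "guitar", "and", "bass", "player", "stand", "off"]
    pos_tags = ["DT", "JJ", "NN", "CC", "NN", "NN", "VB", "RP"]
    print(collocational_features(tokens, pos_tags, tokens.index("bass")))
    # -> ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB', 'guitar and', 'player stand']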
  5. 5. Bag-of-words features: supervised
     • Assume we've settled on a possible vocabulary of 12 words in "bass" sentences:
       [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
     • The vector for: guitar and bass player stand → [0,0,0,1,0,0,0,0,0,0,1,0]
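A minimal Python sketch of the binary bag-of-words vector above, using the 12-word vocabulary given on the slide:

    vocab = ["fishing", "big", "sound", "player", "fly", "rod",
             "pound", "double", "runs", "playing", "guitar", "band"]

    def bow_vector(context_words, vocab):
        # 1 if the vocabulary word occurs anywhere in the context, else 0
        context = set(context_words)
        return [1 if w in context else 0 for w in vocab]

    print(bow_vector("guitar and bass player stand".split(), vocab))
    # -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]   (matches the slide)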
  6. 6. Practical activity: Lesk algorithms
     • Michael Lesk (1986): Original Lesk
       • Compare the target word's signature with the signature of each of the context words
     • Kilgarriff and Rosenzweig (2000): Simplified Lesk
       • Compare the target word's signature with the context words
     • Vasilescu et al. (2004): Corpus Lesk
       • Add all the words in a labelled corpus sentence for a word sense into the signature of that sense (remember the labelled sentences in Senseval 2).
     signature <- set of words in the gloss and examples of a sense
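A minimal Python sketch of Simplified Lesk as summarised above: choose the sense whose signature (gloss + examples) overlaps most with the context words. The tiny sense inventory below is invented for illustration and is not taken from WordNet:

    STOPWORDS = {"a", "an", "the", "of", "at", "in", "on", "to", "and", "i", "my"}

    def simplified_lesk(senses, context_words):
        # senses: dict mapping sense id -> signature text (gloss + examples)
        context = {w.lower() for w in context_words} - STOPWORDS
        best_sense, best_overlap = None, 0
        for sense, signature in senses.items():
            overlap = len(context & (set(signature.lower().split()) - STOPWORDS))
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense

    bank_senses = {  # hypothetical glosses, for illustration only
        "bank#1": "sloping land beside a river or lake",
        "bank#2": "a financial institution that accepts deposits and lends money",
    }
    print(simplified_lesk(bank_senses, "I deposited my money at the bank yesterday".split()))
    # -> bank#2  (overlap on "money")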
  7. 7. Simplified Lesk: Time flies like an arrow
     • Common sense:
       • Modern English speakers unambiguously understand the sentence to mean "As a generalisation, time passes in the same way that an arrow generally flies (i.e. quickly)" (as in the common metaphor time goes by quickly).
  8. 8. Ref: Wikipedia
     • But formally/logically/syntactically/semantically → ambiguous:
     1. (as an imperative) Measure the speed of flies like you would measure that of an arrow - i.e. (You should) time flies as you would time an arrow.
     2. (imperative) Measure the speed of flies like an arrow would - i.e. (You should) time flies in the same manner that an arrow would time them.
     3. (imperative) Measure the speed of flies that are like arrows - i.e. (You should) time those flies that are like an arrow.
     4. (declarative) Time moves in a way an arrow would.
     5. (declarative, i.e. neutrally stating a proposition) Certain flying insects, "time flies," enjoy an arrow.
  9. 9. Simplified Lesk algorithm (2000) and WordNet (3.1)
     • Disambiguating time:
       • time#n#5 shares "pass" and "time flies as an arrow" with flies#v#8
     • Disambiguating flies:
       • flies#v#8 shares "pass" and "time flies as an arrow" with time#n#5
     So we select the following senses: time#n#5 and flies#v#8.
  10. 10. like & arrow
     • Disambiguating like:
       • like#a#1 shares "like" with flies#v#8
     • Arrow cannot be disambiguated
  11. 11. [Figure showing the selected WordNet senses: Similar a#3, like a#1, fly v#8, Time n#5]
  12. 12. Corpus Lesk Algorithm
     • Expands the approach by:
       • Adding all the words of any sense-tagged corpus data (like SemCor) for a word sense into the signature for that sense.
     • Signature = gloss + examples of a word sense
  13. 13. MacMillan dictionary [screenshot] Arrow??? Time n#1, Fly v#6, Like a#1
  14. 14. Arrow ???
  15. 15. Implementation?
     • What if the next activity was:
       • Build an implementation of your solution of the simplified Lesk?
     • Watch out: licences (commercial, academic, creative commons, etc.)
  16. 16. Problems with thesaurus-based meaning
     • We don't have a thesaurus for every language
     • Even if we do, they have problems with recall
       • Many words are missing
       • Most (if not all) phrases are missing
       • Some connections between senses are missing
       • Thesauri work less well for verbs, adjectives
  17. 17. End of previous lecture
  18. 18. Vector/Distributional Semantics
     • The meaning of a word is computed from the distribution of words around it.
     • These words are represented as a vector of numbers.
     • Very popular and very intriguing!
  19. 19. http://esslli2016.unibz.it/?page_id=256
  20. 20. (Oversimplified) Preliminaries (cf. also Lect 03: SA, Turney Algorithm)
     • Probability • Joint probability • Marginals • PMI • PPMI • Smoothing • Dot product (aka inner product) • Window
  21. 21. Probability
     • Probability is the measure of how likely an event is.
     Ex: John has a box with a book, a map and a ruler in it (Cantos Gomez, 2013).
     This sentence has 14 words and 5 nouns.
     The probability of picking up a noun is: P(noun) = 5/14 = 0.357
  22. 22. Joints and Marginals (oversimplifying)
     • Joint: the probability of word A occurring together with word B → the frequency with which the two words appear together
       • P(A,B)
     • Marginals: the probability of word A & the probability of the other word B
       • P(A)  P(B)
  23. 23. Can also be said in other ways: dependent and independent events, joints & marginals
     • Two events are dependent if the outcome or occurrence of the first affects the outcome or occurrence of the second so that the probability is changed.
       • Consider two dependent events, A and B. The joint probability that A and B occur together is:
       • P(A and B) = P(A)*P(B given A)  OR  P(A and B) = P(B)*P(A given B)
     • If two events are independent, each probability is multiplied together to find the overall probability for the set of events:
       • P(A and B) = P(A)*P(B)
     • Marginal probability is the probability of the occurrence of a single event in a joint probability.
     • Equivalent notations (joint): P(A,B) or P(A ∩ B)
  24. 24. Association measure
     • Pointwise mutual information:
       • How much more do events x and y co-occur than if they were independent?
     PMI(x, y) = log2 ( P(x,y) / (P(x) P(y)) )
     Read: the joint probability of the two dependent events (i.e. the 2 words that are supposed to be associated) divided by the product of the individual probabilities (i.e. we assume that the words are not associated, that they are independent), and we take the log of it.
     It tells us how much more the two events co-occur than if they were independent.
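A minimal Python sketch of the PMI formula above, computed from raw co-occurrence counts (the counts are invented for illustration):

    import math

    def pmi(count_xy, count_x, count_y, total):
        # PMI(x, y) = log2( P(x,y) / (P(x) * P(y)) )
        p_xy = count_xy / total
        p_x, p_y = count_x / total, count_y / total
        return math.log2(p_xy / (p_x * p_y))

    # a word pair that co-occurs far more often than chance would predict
    print(round(pmi(count_xy=500, count_x=8000, count_y=2000, total=1000000), 2))
    # -> 4.97  (the pair co-occurs about 31 times more often than expected)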
  25. 25. POSITIVE PMI
     • We replace all the negative values with 0.
  26. 26. Smoothing (additive, Laplace, etc.)
     • In very simple words: we add an arbitrary value to the counts.
     • In a bag-of-words model of natural language processing and information retrieval, additive smoothing allows the assignment of non-zero probabilities to words which do not occur in the sample → otherwise data sparseness leads to multiplication by a 0 probability whenever a count is 0.
     • Additive smoothing is commonly a component of naive Bayes classifiers.
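A minimal Python sketch of add-k (additive) smoothing as described above, over a toy fixed vocabulary:

    def add_k_probs(counts, k=1.0):
        # counts: dict word -> raw count over a fixed vocabulary
        total = sum(counts.values()) + k * len(counts)
        return {w: (c + k) / total for w, c in counts.items()}

    counts = {"data": 6, "pinch": 0, "sugar": 1}   # toy counts: "pinch" was never seen
    print(add_k_probs(counts, k=1.0))
    # -> {'data': 0.7, 'pinch': 0.1, 'sugar': 0.2}
    # "pinch" now gets a non-zero probability, so products of probabilities
    # (e.g. in a naive Bayes classifier) no longer collapse to zero.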
  27. 27. Dot product (aka inner product)
     • Given two vectors, the dot product multiplies them component by component and sums the results [worked example shown on the slide].
     • The dot product is written using a central dot.
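A minimal Python sketch of the dot product; the example vectors are invented, since the slide's own numbers are in an image:

    def dot(v, w):
        # multiply component-wise and sum
        assert len(v) == len(w)
        return sum(vi * wi for vi, wi in zip(v, w))

    print(dot([1, 2, 3], [4, -5, 6]))   # 1*4 + 2*(-5) + 3*6 = 12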
  28. 28. Window (around the ambiguous word)
     • The number of words that we take into account before and after the word we want to disambiguate.
     • We can decide any arbitrary value, e.g.:
       • -3 ??? +3:
     • Ex: The president said central banks should maintain flows of cheap credit to households
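A minimal Python sketch of extracting a ±3 context window, using the slide's example sentence and taking "banks" as the target word (an assumption made here for illustration):

    def context_window(tokens, i, size=3):
        # up to `size` tokens on each side of position i, target excluded
        return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]

    tokens = ("The president said central banks should maintain flows of "
              "cheap credit to households").split()
    print(context_window(tokens, tokens.index("banks"), size=3))
    # -> ['president', 'said', 'central', 'should', 'maintain', 'flows']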
  29. 29. Acknowledgements
     Most slides borrowed or adapted from:
     Dan Jurafsky and James H. Martin
     Dan Jurafsky and Christopher Manning, Coursera
     J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/
  30. 30. Distributional Semantics: Term-context matrix
  31. 31. Distributional models of meaning
     • Also called vector-space models of meaning
     • Offer much higher recall than hand-built thesauri
       • Although they tend to have lower precision
     • Zellig Harris (1954): "oculist and eye-doctor … occur in almost the same environments…. If A and B have almost identical environments we say that they are synonyms."
     • Firth (1957): "You shall know a word by the company it keeps!"
  32. 32. Intuition of distributional word similarity
     • Examples:
       A bottle of tesgüino is on the table.
       Everybody likes tesgüino.
       Tesgüino makes you drunk.
       We make tesgüino out of corn.
     • From context words humans can guess tesgüino means
       • an alcoholic beverage like beer
     • Intuition for algorithm:
       • Two words are similar if they have similar word contexts.
  33. 33. IR: Term-document matrix
     • Each cell: count of term t in a document d: tf_{t,d}
     • Each document is a count vector in ℕ^V: a column below
                     As You Like It   Twelfth Night   Julius Caesar   Henry V
       battle               1               1               8            15
       soldier              2               2              12            36
       fool                37              58               1             5
       clown                6             117               0             0
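A minimal Python sketch of the term-document matrix above, with each document stored as a count vector (a column of the matrix):

    terms = ["battle", "soldier", "fool", "clown"]
    docs = {
        "As You Like It": [1, 2, 37, 6],
        "Twelfth Night":  [1, 2, 58, 117],
        "Julius Caesar":  [8, 12, 1, 0],
        "Henry V":        [15, 36, 5, 0],
    }

    # e.g. the count of "fool" in "Twelfth Night"
    print(docs["Twelfth Night"][terms.index("fool")])   # -> 58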
  34. 34. Document similarity: Term-document matrix
     • Two documents are similar if their vectors are similar
     (term-document matrix repeated from slide 33)
  35. 35. The words in a term-document matrix
     • Each word is a count vector in ℕ^D: a row below
     (term-document matrix repeated from slide 33)
  36. 36. The words in a term-document matrix
     • Two words are similar if their vectors are similar
     (term-document matrix repeated from slide 33)
  37. 37. The intuition of distributional word similarity…
     • Instead of using entire documents, use smaller contexts
       • Paragraph
       • Window of 10 words
     • A word is now defined by a vector over counts of context words
  38. 38. Sample contexts: 20 words (Brown corpus)
     • equal amount of sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of clove and nutmeg,
     • on board for their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened to that of
     • of a recursive type well suited to programming on the digital computer. In finding the optimal R-stage policy from that of
     • substantially affect commerce, for the purpose of gathering data and information necessary for the study authorized in the first section of this
  39. 39. Term-context matrix for word similarity
     • Two words are similar in meaning if their context vectors are similar
                     aardvark   computer   data   pinch   result   sugar   …
       apricot           0          0        0      1        0       1
       pineapple         0          0        0      1        0       1
       digital           0          2        1      0        1       0
       information       0          1        6      0        4       0
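A minimal Python sketch of building a term-context matrix: for each target word, count the words that occur within a ±4 window. The two-line corpus is invented; on real data this produces counts like those in the matrix above:

    from collections import defaultdict

    def term_context_counts(sentences, window=4):
        counts = defaultdict(lambda: defaultdict(int))
        for sent in sentences:
            tokens = sent.lower().split()
            for i, w in enumerate(tokens):
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[w][tokens[j]] += 1
        return counts

    corpus = ["a pinch of apricot and a pinch of sugar",
              "digital computer data and digital result data"]
    counts = term_context_counts(corpus, window=4)
    print(counts["apricot"]["pinch"], counts["digital"]["data"])   # -> 2 3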
  40. 40. Should we use raw counts?
     • For the term-document matrix
       • We used tf-idf instead of raw term counts
     • For the term-context matrix
       • Positive Pointwise Mutual Information (PPMI) is common
  41. 41. Pointwise Mutual Information
     • Pointwise mutual information:
       • Do events x and y co-occur more than if they were independent?
       PMI(x, y) = log2 ( P(x,y) / (P(x) P(y)) )
     • PMI between two words (Church & Hanks 1989):
       • Do words x and y co-occur more than if they were independent?
       PMI(word1, word2) = log2 ( P(word1, word2) / (P(word1) P(word2)) )
     • Positive PMI between two words (Niwa & Nitta 1994):
       • Replace all PMI values less than 0 with zero
  42. 42. Computing PPMI on a term-context matrix
     • Matrix F with W rows (words) and C columns (contexts)
     • f_ij is the number of times w_i occurs in context c_j
       p_ij = f_ij / ( Σ_{i=1..W} Σ_{j=1..C} f_ij )
       p_i* = ( Σ_{j=1..C} f_ij ) / ( Σ_{i=1..W} Σ_{j=1..C} f_ij )    (row sum: the count of all the contexts where the word appears)
       p_*j = ( Σ_{i=1..W} f_ij ) / ( Σ_{i=1..W} Σ_{j=1..C} f_ij )    (column sum: the count of all the words that occur in that context)
       pmi_ij = log2 ( p_ij / (p_i* p_*j) )
       ppmi_ij = pmi_ij if pmi_ij > 0, otherwise 0
     • The double sum Σ_i Σ_j f_ij is the sum of all words in all contexts, i.e. all the numbers in the matrix.
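A minimal Python sketch of the PPMI computation above, applied to the count matrix from slide 39 (the all-zero aardvark column is dropped, as in the worked example on the next slides):

    import math

    words    = ["apricot", "pineapple", "digital", "information"]
    contexts = ["computer", "data", "pinch", "result", "sugar"]
    F = [[0, 0, 1, 0, 1],
         [0, 0, 1, 0, 1],
         [2, 1, 0, 1, 0],
         [1, 6, 0, 4, 0]]

    N   = sum(sum(row) for row in F)                      # all the numbers in the matrix (19)
    p_w = [sum(row) / N for row in F]                     # row marginals  p_i*
    p_c = [sum(F[i][j] for i in range(len(F))) / N        # column marginals p_*j
           for j in range(len(contexts))]

    def ppmi(i, j):
        if F[i][j] == 0:
            return 0.0                                    # zero count: clip to 0
        return max(math.log2((F[i][j] / N) / (p_w[i] * p_c[j])), 0.0)

    i, j = words.index("information"), contexts.index("data")
    print(round(ppmi(i, j), 2))                           # -> 0.57, as on slide 44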
  43. 43. p(w=information, c=data) = 6/19 = .32
     p(w=information) = 11/19 = .58
     p(c=data) = 7/19 = .37
     with p_ij = f_ij / N,  p(w_i) = ( Σ_{j=1..C} f_ij ) / N,  p(c_j) = ( Σ_{i=1..W} f_ij ) / N
     where N = the sum of all words in all contexts (all the numbers in the matrix); the row sum is the count of all the contexts where the word appears; the column sum is the count of all the words that occur in that context.
  44. 44. pmi_ij = log2 ( p_ij / (p_i* p_*j) )
     • pmi(information, data) = log2 ( .32 / (.37 * .58) ) = .58
       PPMI(w,context)   computer   data   pinch   result   sugar
       apricot               -        -     2.25     -       2.25
       pineapple             -        -     2.25     -       2.25
       digital             1.66     0.00     -      0.00      -
       information         0.00     0.57     -      0.47      -
  45. 45. Weighting PMI
     • PMI is biased toward infrequent events
     • Various weighting schemes help alleviate this
       • See Turney and Pantel (2010)
     • Add-one smoothing can also help
  46. 46. Add-2 Smoothed Count(w,context):
                     computer   data   pinch   result   sugar
       apricot           2        2      3       2        3
       pineapple         2        2      3       2        3
       digital           4        3      2       3        2
       information       3        8      2       6        2

     p(w,context) [add-2] and p(w):
                     computer   data   pinch   result   sugar    p(w)
       apricot         0.03     0.03    0.05    0.03     0.05    0.20
       pineapple       0.03     0.03    0.05    0.03     0.05    0.20
       digital         0.07     0.05    0.03    0.05     0.03    0.24
       information     0.05     0.14    0.03    0.10     0.03    0.36
       p(context)      0.19     0.25    0.17    0.22     0.17
  47. 47. Original vs add-2 smoothing
     PPMI(w,context) [add-2]:
                     computer   data   pinch   result   sugar
       apricot         0.00     0.00    0.56    0.00     0.56
       pineapple       0.00     0.00    0.56    0.00     0.56
       digital         0.62     0.00    0.00    0.00     0.00
       information     0.00     0.58    0.00    0.37     0.00

     PPMI(w,context):
                     computer   data   pinch   result   sugar
       apricot          -         -     2.25     -       2.25
       pineapple        -         -     2.25     -       2.25
       digital         1.66     0.00     -      0.00      -
       information     0.00     0.57     -      0.47      -
  48. 48. Distributional Semantics: Dependency relations
  49. 49. Using syntax to define a word's context
     • Zellig Harris (1968): "The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities"
     • Two words are similar if they have similar parse contexts
     • Duty and responsibility (Chris Callison-Burch's example):
       Modified by adjectives:  additional, administrative, assumed, collective, congressional, constitutional …
       Objects of verbs:        assert, assign, assume, attend to, avoid, become, breach …
  50. 50. Co-occurrence vectors based on syntactic dependencies (Dekang Lin, 1998, "Automatic Retrieval and Clustering of Similar Words")
     • The contexts C are different dependency relations
       • Subject-of "absorb"
       • Prepositional-object of "inside"
     • Counts for the word cell: [table shown on the slide]
  51. 51. PMI applied to dependency relations (Hindle, Don. 1990. Noun Classification from Predicate-Argument Structure. ACL)
     • "Drink it" is more common than "drink wine"
     • But "wine" is a better "drinkable" thing than "it"
       Object of "drink"   Count   PMI
       it                    3      1.3
       anything              3      5.2
       wine                  2      9.3
       tea                   2     11.8
       liquid                2     10.5

       Sorted by PMI:
       Object of "drink"   Count   PMI
       tea                   2     11.8
       liquid                2     10.5
       wine                  2      9.3
       anything              3      5.2
       it                    3      1.3
  52. 52. Cosine for computing similarity (Sec. 6.3)
     cos(v, w) = (v · w) / (|v| |w|) = (v/|v|) · (w/|w|) = Σ_{i=1..N} v_i w_i / ( sqrt(Σ_{i=1..N} v_i^2) * sqrt(Σ_{i=1..N} w_i^2) )
     • v · w is the dot product; v/|v| and w/|w| are unit vectors
     • v_i is the PPMI value for word v in context i; w_i is the PPMI value for word w in context i
     • cos(v, w) is the cosine similarity of v and w
  53. 53. Cosine as a similarity metric
     • -1: vectors point in opposite directions
     • +1: vectors point in the same direction
     • 0: vectors are orthogonal
     • Raw frequency or PPMI values are non-negative, so the cosine ranges from 0 to 1
  54. 54. Which pair of words is more similar?
                     large   data   computer
       apricot         1       0        0
       digital         0       1        2
       information     1       6        1

     cosine(apricot, information) = (1 + 0 + 0) / ( sqrt(1 + 0 + 0) * sqrt(1 + 36 + 1) ) = 1 / sqrt(38) = .16
     cosine(digital, information) = (0 + 6 + 2) / ( sqrt(0 + 1 + 4) * sqrt(1 + 36 + 1) ) = 8 / ( sqrt(38) * sqrt(5) ) = .58
     cosine(apricot, digital)     = (0 + 0 + 0) / ( sqrt(1 + 0 + 0) * sqrt(0 + 1 + 4) ) = 0
     (using cos(v, w) = Σ v_i w_i / ( sqrt(Σ v_i^2) * sqrt(Σ w_i^2) ))
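A minimal Python sketch of the cosine computation, reproducing the numbers above from the slide's small count table:

    import math

    vectors = {                         # contexts: large, data, computer
        "apricot":     [1, 0, 0],
        "digital":     [0, 1, 2],
        "information": [1, 6, 1],
    }

    def cosine(v, w):
        dot    = sum(vi * wi for vi, wi in zip(v, w))
        norm_v = math.sqrt(sum(vi * vi for vi in v))
        norm_w = math.sqrt(sum(wi * wi for wi in w))
        return dot / (norm_v * norm_w)

    for a, b in [("apricot", "information"), ("digital", "information"), ("apricot", "digital")]:
        print(a, b, round(cosine(vectors[a], vectors[b]), 2))
    # -> apricot information 0.16
    #    digital information 0.58
    #    apricot digital 0.0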
  55. 55. Other possible similarity measures
  56. 56. The end
