Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity


1. Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity. Mohammad Taher Pilehvar, David Jurgens and Roberto Navigli, ACL 2013. 最先端NLP勉強会 (Cutting-edge NLP study group) #5 @ Chiba, 2013/08/31. Presenter: Koji Matsuda. (Revised 2013/09/03)
2. Sentence Textual Similarity (STS): measure the degree of semantic equivalence between two sentences [Agirre+, SemEval-2012]. NOTE: it differs from Textual Entailment (TE) and Paraphrase detection (PARA): unlike TE, STS assumes a symmetric and graded equivalence between the pair; unlike PARA, STS needs to incorporate graded semantic similarity. → STS is more directly applicable to a number of NLP tasks: MT, summarization, deep QA, etc.
3. Example. Surface-based approaches label the pair DISSIMILAR due to minimal lexical overlap. Sense-representation-based approaches make it possible to consider similarity between word meanings (e.g. fire and terminate), but that information is difficult to incorporate because of polysemy and the need to represent individual senses.
4. Semantic Similarity at Multiple Levels: Sense-Sense, Word-Word, Text-Text.
5. Semantic Similarity at Multiple Levels: a unified semantic representation (the semantic signature) of a lexical item, i.e. an arbitrarily-sized piece of text or a sense, is compared at every level. Two questions: 1. How is a semantic signature created? 2. How is the similarity of two semantic signatures calculated?
6. Overview of the Proposed Method: a random walk over the WordNet graph produces the signatures; sense-level semantic signatures are compared with Cosine, Weighted Overlap, or Top-k Jaccard. Note: figure from slide by authors.
7. Semantic Signatures: a multi-seeded random walk over the WordNet graph. The input (a sense, a word, or a text) is first mapped to a set of senses, which serve as the seeds (v(0)); the random walk over the WordNet graph then yields the semantic signature, a multinomial distribution over senses (WordNet synsets).
8. Personalized PageRank. Yellow nodes: seed nodes (synsets); red node size: probability of the synset; edges: WordNet relations. Note: figure from slide by authors.
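To make the random-walk step concrete, here is a minimal sketch of multi-seeded personalized PageRank by power iteration. It is not the authors' implementation: the mini-graph, the sense labels such as fire#v#4, the damping factor of 0.85, and the convergence threshold are illustrative assumptions; in the paper the walk runs over the full WordNet synset graph.

```python
# Minimal personalized PageRank sketch (illustrative; not the authors' code).

def personalized_pagerank(graph, seeds, alpha=0.85, n_iter=100, tol=1e-10):
    """Multi-seeded random walk with restart.

    graph : dict mapping node -> list of neighbour nodes
    seeds : set of seed nodes (the senses obtained for the input lexical item)
    Returns a dict node -> stationary probability (the semantic signature).
    """
    nodes = list(graph)
    # Restart distribution: uniform mass over the seed synsets, zero elsewhere.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)  # start the walk at the seeds

    for _ in range(n_iter):
        new_rank = {n: (1.0 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                continue  # dangling nodes simply lose their mass in this sketch
            share = alpha * rank[n] / len(out)
            for m in out:
                new_rank[m] += share  # propagate mass along WordNet relations
        if sum(abs(new_rank[n] - rank[n]) for n in nodes) < tol:
            rank = new_rank
            break
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical mini-graph: nodes stand for synsets, edges for WordNet relations.
    graph = {
        "fire#v#4": ["terminate#v#4", "dismiss#v#1"],
        "terminate#v#4": ["fire#v#4", "dismiss#v#1", "end#v#1"],
        "dismiss#v#1": ["fire#v#4", "terminate#v#4"],
        "end#v#1": ["terminate#v#4"],
        "fire#n#1": ["flame#n#1"],
        "flame#n#1": ["fire#n#1"],
    }
    signature = personalized_pagerank(graph, seeds={"fire#v#4"})
    for synset, p in sorted(signature.items(), key=lambda kv: -kv[1]):
        print(f"{synset:16s} {p:.3f}")
```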
9. Alignment-Based Disambiguation. How do we extract the set of senses (the seeds) from a text or a word? This requires solving WSD. The authors propose alignment-based WSD: maximize the sum of similarities between the two texts/words being compared; any similarity measure over senses can be plugged in. (A code sketch follows the walkthrough below.)
10. Alignment-Based Disambiguation (word-level alignment): for the two texts (manager, fire, worker) and (employee, terminate, work, boss), compute word-level relatedness scores such as R(man, emp).
11. Word-level alignment: manager is compared against every word on the other side (R(man, emp), R(man, bos), R(man, ter), R(man, wor)), and the maximum relatedness at the word level is kept.
12. Sense-level alignment: for the best word-level pair (manager, boss), every sense pair is compared: R(m#1, b#1), R(m#1, b#2), R(m#2, b#1), R(m#2, b#2).
13. The maximum relatedness at the sense level is kept (here R(m#1, b#2)).
14. The same procedure is repeated for the remaining words: R(fir, ter), R(fir, wor), R(wor, emp).
15. Result: each word is assigned the sense that maximizes its alignment relatedness.
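A minimal sketch of the alignment-based disambiguation illustrated in slides 10-15, assuming a senses_of lookup and a relatedness function over senses (both are placeholders; in the paper the relatedness comes from sense-level signature similarity over WordNet):

```python
from itertools import product

def align_and_disambiguate(words_a, words_b, senses_of, relatedness):
    """For each word in words_a, pick the sense that maximizes relatedness
    to the best-matching word (and sense) on the other side.

    senses_of   : function word -> list of candidate senses
    relatedness : function (sense, sense) -> float (any sense-level measure)
    """
    chosen = {}
    for w in words_a:
        best_score, best_sense = float("-inf"), None
        # Word-level alignment: consider every word on the other side ...
        for v in words_b:
            # ... and, within each word pair, every sense pair (sense-level alignment).
            for s_w, s_v in product(senses_of(w), senses_of(v)):
                score = relatedness(s_w, s_v)
                if score > best_score:
                    best_score, best_sense = score, s_w
        chosen[w] = best_sense
    return chosen

# Toy usage with a hypothetical sense inventory and relatedness table.
senses = {"fire": ["fire#v#4", "fire#n#1"], "terminate": ["terminate#v#4"]}
rel = {("fire#v#4", "terminate#v#4"): 0.9, ("fire#n#1", "terminate#v#4"): 0.1}
print(align_and_disambiguate(
    ["fire"], ["terminate"],
    senses_of=senses.get,
    relatedness=lambda a, b: rel.get((a, b), 0.0),
))  # -> {'fire': 'fire#v#4'}
```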
16. Semantic Signature Similarity: how is the similarity of two semantic signatures calculated? Parametric: Cosine. Non-parametric (rank-based): Weighted Overlap and Top-k Jaccard. Both compare two distributions defined over the same sense dimensions (a, b, c, d, e).
17. Semantic Signature Similarity.
Weighted Overlap (ADW_WO): with sense ranks r1 = (a:2, b:4, c:1, d:absent, e:3) and r2 = (a:4, b:1, c:2, d:5, e:absent), where "absent" (rank 0 on the slide) marks a sense missing from a signature, sum the inverse rank sums over the senses shared by both signatures: Rwo ∝ 1/(2+4) + 1/(4+1) + 1/(1+2). It is maximal when the same sense has the same rank in both signatures.
Top-k Jaccard (ADW_Jac): with ranks r1 = (a:2, b:4, c:1, d:5, e:3) and r2 = (a:4, b:1, c:2, d:5, e:3) and k = 3, the top-3 sense sets are {a,c,e} and {b,c,e}, so Rjac = |{a,c,e} ∩ {b,c,e}| / |{a,c,e} ∪ {b,c,e}|. It is maximal when the top-k sets contain the same senses.
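A minimal sketch of the two rank-based measures applied to the example above. The rank dictionaries are toy inputs; in the paper the ranks are obtained by sorting the full semantic signatures, and Weighted Overlap also carries a normalization term that is omitted here.

```python
def weighted_overlap(rank1, rank2):
    """Sum of inverse rank sums over the senses present in both signatures.
    rank1, rank2 : dicts mapping sense -> rank (1 = most probable sense)."""
    shared = set(rank1) & set(rank2)
    return sum(1.0 / (rank1[s] + rank2[s]) for s in shared)

def topk_jaccard(rank1, rank2, k):
    """Jaccard coefficient between the top-k sense sets of the two signatures."""
    def top(r):
        return set(sorted(r, key=r.get)[:k])
    t1, t2 = top(rank1), top(rank2)
    return len(t1 & t2) / len(t1 | t2)

# The example from the slide: senses with rank 0 (absent) are left out of the dicts.
r1 = {"a": 2, "b": 4, "c": 1, "e": 3}            # d is not in signature 1
r2 = {"a": 4, "b": 1, "c": 2, "d": 5}            # e is not in signature 2
print(weighted_overlap(r1, r2))                  # 1/(2+4) + 1/(4+1) + 1/(1+2)

j1 = {"a": 2, "b": 4, "c": 1, "d": 5, "e": 3}
j2 = {"a": 4, "b": 1, "c": 2, "d": 5, "e": 3}
print(topk_jaccard(j1, j2, k=3))                 # |{a,c,e} & {b,c,e}| / union = 0.5
```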
18. Overview of the Proposed Method (recap): random walk over the WordNet graph; compare sense-level semantic signatures with Cosine, Weighted Overlap, or Top-k Jaccard. Note: figure from slide by authors.
19. Experiments. Textual similarity: SemEval-2012 STS task [Agirre+, SemEval-2012]. Word similarity: TOEFL dataset and RG-65 dataset. Sense similarity: sense coarsening (OntoNotes, Senseval-2).
20. Textual Similarity: SemEval-2012 STS task (task 17); roughly 400-750 pairs in each of the 5 test sets. Model: regression (Gaussian Process). Features: main features are ADW_cos, ADW_WO, and ADW_Jac (k = 250, 500, 1000, 2500); string-based features are longest common subsequence/substring, greedy string tiling, and character/word n-gram similarity. Example pairs with scores:
| id | Sentence pair | Score (0-5) |
| 1 | The bird is bathing in the sink. / Birdie is washing itself in the water basin. | 0 |
| 2 | In May 2010, the troops attempted to invade Kabul. / The US army invaded Kabul on May 7th last year, 2010. | 1 |
| 3 | John said he is considered a witness but not a suspect. / "He is not a suspect anymore." John said. | 2 |
| 4 | They flew out of the nest in groups. / They flew into the nest together. | 3 |
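As a rough illustration of the regression setup on this slide, the sketch below feeds one feature vector per sentence pair (ADW scores plus string-based scores) into a Gaussian Process regressor. The feature values, the number of features, and the default kernel are assumptions for illustration; the authors' exact feature extraction and kernel choice are not reproduced here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Each row is one sentence pair; the columns are hypothetical feature values:
# [ADW_cos, ADW_WO, ADW_Jac, longest-common-subsequence sim, char n-gram sim]
X_train = np.array([
    [0.91, 0.85, 0.80, 0.72, 0.40],
    [0.35, 0.30, 0.25, 0.60, 0.55],
    [0.10, 0.12, 0.05, 0.20, 0.15],
])
y_train = np.array([4.8, 2.5, 0.3])   # gold similarity scores on the 0-5 scale

gp = GaussianProcessRegressor()        # default kernel; the paper's choice may differ
gp.fit(X_train, y_train)

X_test = np.array([[0.60, 0.55, 0.50, 0.65, 0.45]])
print(gp.predict(X_test))              # predicted STS score for the new pair
```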
21. Textual Similarity Performance. Table 2: Pearson correlation coefficients.
22. Textual Similarity (detail). Mpar: MSR Paraphrase Corpus (web news), containing many named entities. Mvid: MSR Video Paraphrase Corpus. SMTe: French-to-English SMT output paired with reference translations from the Europarl corpus [ACL 2007/2008 SMT Workshop]. SMTn: same as SMTe, but on a news conversation corpus. OnWN: glosses from OntoNotes and WordNet.
23. Textual Similarity (detail). DW: without performing any alignment. ADW-MF: main features only (no string-based features). Observations: alignment is helpful (DW → ADW improves); on the Mpar dataset, which contains many named entities, the string-based methods are a strong baseline.
24. Word Similarity. TOEFL dataset [Landauer and Dumais, 1997]: synonym selection task, 80 multiple-choice questions with 4 choices each. RG-65 dataset [Rubenstein and Goodenough, 1965]: similarity grading for word pairs, 65 pairs judged by 51 human subjects on a 0-4 scale. Note: figure from slide by authors.
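For the TOEFL synonym selection task, evaluation reduces to picking the choice word most similar to the target. A minimal sketch, assuming a similarity function built on top of the semantic signatures (the similarity values below are toy numbers, not the paper's):

```python
def answer_toefl_question(target, choices, similarity):
    """Return the choice word most similar to the target under the given measure."""
    return max(choices, key=lambda choice: similarity(target, choice))

# Toy usage with a hypothetical similarity table for one TOEFL-style question.
toy_sim = {("levied", "imposed"): 0.9, ("levied", "believed"): 0.2,
           ("levied", "requested"): 0.4, ("levied", "correlated"): 0.1}
print(answer_toefl_question(
    "levied", ["imposed", "believed", "requested", "correlated"],
    similarity=lambda a, b: toy_sim.get((a, b), 0.0),
))  # -> "imposed"
```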
25. Word Similarity (TOEFL).
26. Word Similarity (RG-65).
27. Sense Similarity: coarsening the WordNet sense inventory. Note: figure from slide by authors.
28. Sense Coarsening. Onto: OntoNotes [Hovy+, 2006]; SE-2: Senseval-2 sense grouping set [Kilgarriff, 2001]. The task is binary classification (can two senses be merged or not?), evaluated with F-score.
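Sense coarsening can be read as a binary decision per sense pair: merge the two senses if their semantic signatures are similar enough. A minimal sketch with a hypothetical similarity table and threshold; the values and the threshold here are illustrative, not the paper's.

```python
def can_merge(sense1, sense2, similarity, threshold=0.5):
    """Binary decision: should two WordNet senses be grouped into one coarse sense?"""
    return similarity(sense1, sense2) >= threshold

# Toy usage: signature similarity between two senses of "fire" (hypothetical values).
toy_sim = {("fire#v#4", "fire#v#6"): 0.8, ("fire#v#4", "fire#n#1"): 0.1}
sim = lambda a, b: toy_sim.get((a, b), toy_sim.get((b, a), 0.0))
print(can_merge("fire#v#4", "fire#v#6", sim))  # True  -> merge
print(can_merge("fire#v#4", "fire#n#1", sim))  # False -> keep separate
```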
29. Conclusions. A unified approach for computing semantic similarity at multiple lexical levels, based on a random walk over the WordNet graph, alignment-based word sense disambiguation, and similarity measures defined over sense rankings. It achieves state-of-the-art performance on three similarity judgment tasks (sense, word, and text level).
30. My Comments. ☺ I think this method provides a simple but powerful representation of semantics for relatively long sentences as well as for individual words and word senses. ☺ As a result, it expands the range of STS problems that can be solved. ☹ However, it ignores word order and the parse tree, which I think are important for representing short phrases and compounds. In fact, this work is essentially a combination of Personalized-PageRank-based WSD [Agirre and Soroa, EACL 2009] and word-level alignment for similarity calculation [Corley and Mihalcea, ACL 2005]. ☹ From the perspective of compositional semantics, I think the work makes a questionable assumption: writing S(x) for the semantic signature of x, it effectively supposes S(xy) ∝ S(x) + S(y) (which roughly follows from the linearity of personalized PageRank in its restart vector), e.g. S(red car) ∝ S(red) + S(car)?
31. Toward STS with Various Clues (figure): clue types range from surface to word sense, syntax, and domain knowledge, from explicit to implicit, and from concrete to abstract. Related directions placed on this map: this work; compositional semantics; automatically extending lexical resources; robust similarity measures; named entity linking to knowledge bases.
32. Replies to comments received / other notes.
- Are all links between synsets used? (Prof. Inui) The original paper on Personalized-PageRank-based WSD [Agirre and Soroa, 09] states that all relations are used, and this paper follows that setting. Still, it may well be that some links, such as antonymy, should not simply be propagated over.
- Is it a general property that blurring a meaning (propagating it to surrounding synsets) improves WSD performance? (Prof. Inui) In knowledge-based WSD, the incompleteness of the knowledge base (sparseness, low coverage) is often a problem, and using soft information to mitigate it is common practice.
- Is alignment also performed in the word-to-word case? (Matsubara-san) Yes; in fact the alignment is done at the sense level (my figure did not make this clear).
- Since the alignment takes the maximum (it searches for the most favorable interpretation), it can be seen as computing something like a lower bound on the similarity. When polysemy is involved, it seems it could sometimes overestimate.
- Because the model defines similarity over pairs of sentences or words, it is hard to use the representation on its own. One workaround is to pair the item with the gloss of a WordNet synset.
