Concept-Based Information Retrieval using Explicit Semantic Analysis

My master's thesis seminar at the Technion, summarizing my research work, which was partly published in an AAAI-08 paper and has now been submitted to TOIS. Download and read the notes for more details. Comments and questions are very welcome!

  1. Concept-Based Information Retrieval using Explicit Semantic Analysis
     M.Sc. seminar talk
     Ofer Egozi, CS Department, Technion
     Supervisor: Prof. Shaul Markovitch
     24/6/09
  2. Information Retrieval
     (Diagram: a query is submitted to the IR system; results are evaluated by recall and precision.)
  3. Ranked retrieval
     (Diagram: the query is submitted to the IR system, which returns a ranked list of results.)
  4. Keyword-based retrieval
     Bag Of Words (BOW)
     (Diagram: the query and the documents are matched by the IR system as bags of words.)
  5. Problem: retrieval misses
     TREC document LA071689-0089: "ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
     TREC topic #411: salvaging shipwreck treasure
  6. The vocabulary problem
     - Identity: syntax (tokenization, stemming, ...)
     - Similarity: synonyms (WordNet etc.)
     - Relatedness: semantics / world knowledge (???)
     The document above shares no terms with the query "salvaging shipwreck treasure", and synonymy/polysemy make naive expansion risky: "shipwreck"/"treasure" also match shipping/treasurer, and "salvaging" also matches deliver/scavenge/relieve.
  7. Concept-based retrieval
     (Diagram: the document above and the query "salvaging shipwreck treasure" are matched by the IR system at the concept level rather than by shared keywords.)
  8. Concept-based representations
     - Human-edited thesauri (e.g. WordNet): source: editors; concepts: words; mapping: manual
     - Corpus-based thesauri (e.g. co-occurrence): source: corpus; concepts: words; mapping: automatic
     - Ontology mapping (e.g. KeyConcept): source: ontology; concepts: ontology node(s); mapping: automatic
     - Latent analysis (e.g. LSA, pLSA, LDA): source: corpus; concepts: word distributions; mapping: automatic
     Drawbacks: insufficient granularity, non-intuitive concepts, expensive repetitive computations, non-scalable solutions.
  9. Concept-based representations
     (Same comparison as the previous slide.) Is it possible to devise a concept-based representation that is scalable, computationally feasible, and uses intuitive and granular concepts?
  10. Explicit Semantic Analysis
      Gabrilovich and Markovitch (2005, 2006, 2007)
  11. Explicit Semantic Analysis (ESA)
      Wikipedia is viewed as an ontology: a collection of ~1M concepts (e.g. World War II, Panthera, Jane Fonda, Island).
  12. Explicit Semantic Analysis (ESA)
      Every Wikipedia article represents a concept. The article's words are associated with that concept via TF-IDF weights, e.g. Panthera: cat [0.92], leopard [0.84], roar [0.77].
  13. Explicit Semantic Analysis (ESA)
      (Same content as the previous slide, shown as a build step.)
  14. Explicit Semantic Analysis (ESA)
      The semantics of a word is the vector of its associations with Wikipedia concepts, e.g. cat: Cat [0.95], Panthera [0.92], Jane Fonda [0.07].
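The word-to-concept mapping on slides 12-14 can be sketched in a few lines. Below is a minimal, self-contained illustration over a three-article toy "Wikipedia"; the article texts and the simple TF-IDF variant are assumptions for the example, not the exact weighting scheme used by ESA.

```python
# Minimal sketch of the ESA word-to-concept mapping (slides 12-14).
# The toy articles and the simple TF-IDF variant are illustrative assumptions.
import math
from collections import Counter

toy_wikipedia = {                       # concept name -> article text
    "Panthera":   "cat leopard roar lion tiger cat",
    "Jane Fonda": "actress film workout cat",
    "Island":     "sea land ocean island beach",
}

n_concepts = len(toy_wikipedia)
doc_freq = Counter()                    # in how many articles each word appears
for text in toy_wikipedia.values():
    doc_freq.update(set(text.split()))

def word_concept_vector(word):
    """ESA semantics of a word: its TF-IDF association with every concept."""
    vector = {}
    for concept, text in toy_wikipedia.items():
        tokens = text.split()
        tf = tokens.count(word) / len(tokens)
        if tf > 0:
            vector[concept] = tf * math.log(n_concepts / doc_freq[word])
    return vector

print(word_concept_vector("cat"))       # strongest association: "Panthera"
```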
  15. Explicit Semantic Analysis (ESA)
      The semantics of a text fragment is the average vector (centroid) of the semantics of its words. In practice this also yields disambiguation: the individual vectors of "mouse" and "button" (with concepts such as Mouse (computing), Mouse (rodent), Mickey Mouse, John Steinbeck, Button, Dick Button, Game Controller) average to a vector for "mouse button" dominated by Mouse (computing), Drag-and-drop, and Game Controller.
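A minimal sketch of the centroid step on slide 15. The concept weights for "mouse" and "button" below are made-up illustrative numbers; real ESA vectors come from TF-IDF weights over all of Wikipedia, as in the previous sketch.

```python
# Minimal sketch of slide 15: the ESA vector of a text fragment is the centroid
# of its words' vectors. The weights below are illustrative, not real ESA values.
from collections import defaultdict

word_vectors = {
    "mouse":  {"Mouse (computing)": 0.84, "Mouse (rodent)": 0.91, "Mickey Mouse": 0.81},
    "button": {"Button": 0.93, "Dick Button": 0.84, "Game Controller": 0.32,
               "Mouse (computing)": 0.81},
}

def text_concept_vector(words):
    """Average (centroid) of the words' ESA vectors."""
    centroid = defaultdict(float)
    for word in words:
        for concept, weight in word_vectors.get(word, {}).items():
            centroid[concept] += weight / len(words)
    return sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)

# "Mouse (computing)" is supported by both words, so it rises to the top of the
# centroid: averaging implicitly disambiguates "mouse button" toward the
# computing sense, as the slide illustrates.
print(text_concept_vector(["mouse", "button"]))
```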
  16. MORAG*: an ESA-based information retrieval algorithm
      *Morag: "flail" in Hebrew
      "Concept-based feature generation and selection for information retrieval", AAAI-2008
  17. Enrich documents/queries
      Both the documents and the query are passed through ESA before reaching the IR engine. Constraint: use only the strongest concepts.
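A minimal sketch of the enrichment step on slide 17, under the constraint of keeping only the strongest concepts. The cutoff value and the `esa_vector` argument (standing in for the text-to-concepts mapping from the previous sketches) are illustrative assumptions.

```python
# Minimal sketch of slide 17: index (or query with) the strongest ESA concepts
# in addition to the ordinary keywords. `esa_vector` stands in for the ESA
# mapping sketched above; the cutoff k=50 is an illustrative assumption.
def enrich(text, esa_vector, k=50):
    """Return BOW terms plus the k strongest ESA concepts as extra index terms."""
    terms = text.lower().split()
    concepts = sorted(esa_vector(text).items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Namespace concept features so they can never collide with ordinary words.
    terms += ["CONCEPT::" + name for name, _ in concepts]
    return terms
```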
  18. Problem: document (in)coherence
      TREC document LA120790-0036: "REFERENCE BOOKS SPEAK VOLUMES TO KIDS; With the school year in high gear, it's a good time to consider new additions to children's home reference libraries... Also new from Pharos-World Almanac: 'The World Almanac InfoPedia,' a single-volume visual encyclopedia designed for ages 8 to 16... 'The Doubleday Children's Encyclopedia,' designed for youngsters 7 to 11, bridges the gap between single-subject picture books and formal reference books... 'The Lost Wreck of the Isis' by Robert Ballard is the latest adventure in the Time Quest Series from Scholastic-Madison Press ($15.95 hardcover). Designed for children 8 to 12, it tells the story of Ballard's 1988 discovery of an ancient Roman shipwreck deep in the Mediterranean Sea..."
      The document is judged relevant for topic 411 because of a single relevant passage. Concepts generated for the whole document will average toward the books/children concepts and lose the shipwreck mentions. This is not an issue in BOW retrieval, where words are indexed independently; how should concept-based retrieval deal with it?
  19. Solution: split into passages
      Index both the full document and its passages; best performance is achieved with fixed-length, overlapping sliding windows.
      ConceptScore(d) = ConceptScore(full document) + max over passages p in d of ConceptScore(p)
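A minimal sketch of the passage handling on slide 19. The window length and step are illustrative assumptions, and `concept_score(text_words, query)` stands in for the ESA-based similarity between a text and the query (not defined here).

```python
# Minimal sketch of slide 19: score a document by its full-text concept match
# plus the best-matching fixed-length, overlapping passage.
def sliding_passages(doc_words, size=50, step=25):
    """Fixed-length, overlapping sliding windows over the document's words."""
    return [doc_words[i:i + size] for i in range(0, len(doc_words), step)]

def document_concept_score(doc_words, query, concept_score):
    """ConceptScore(d) = ConceptScore(full doc) + max over passages of ConceptScore(passage)."""
    full_doc = concept_score(doc_words, query)
    best_passage = max((concept_score(p, query) for p in sliding_passages(doc_words)),
                       default=0.0)
    return full_doc + best_passage
```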
  20. Morag ranking
      Score(q, d) = λ · ConceptScore(q, d) + (1 - λ) · KeywordScore(q, d)
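The final ranking formula on slide 20, written out as code; the interpolation weight `lam` (lambda) is a tunable parameter whose value the slide does not fix.

```python
# Minimal sketch of slide 20: interpolate the concept-based and keyword-based
# scores. The default value of lam (lambda) is an illustrative assumption.
def morag_score(concept_score, keyword_score, lam=0.5):
    """Score(q, d) = lambda * ConceptScore(q, d) + (1 - lambda) * KeywordScore(q, d)."""
    return lam * concept_score + (1.0 - lam) * keyword_score
```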
  21. ESA-based retrieval example
      Query (TREC topic #411): salvaging shipwreck treasure
      Top ESA concepts for the query: Shipwreck; Treasure; Maritime archaeology; Marine salvage; History of the British Virgin Islands; Wrecking (shipwreck); Key West, Florida; Flotsam and jetsam; Wreck diving; Spanish treasure fleet
      Document (TREC LA071689-0089): "ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
      Top ESA concepts for the document: Scuba diving; Wreck diving; RMS Titanic; USS Hoel (DD-533); Shipwreck; Underwater archaeology; USS Maine (ACR-1); Maritime archaeology; Tomb Raider II; USS Meade (DD-602)
      The query and the document now share concepts (Shipwreck, Wreck diving, Maritime archaeology) even though they share no keywords.
  22. Problem: irrelevant documents retrieved
      Query (TREC topic #434): Estonia economy
      Top ESA concepts for the query: Estonia; Economy of Estonia; Estonia at the 2000 Summer Olympics; Estonia at the 2004 Summer Olympics; Estonia national football team; Estonia at the 2006 Winter Olympics; Baltic Sea; Eurozone; Tiit Vähi; Military of Estonia
      Document: "Olympic News In Brief: Cycling win for Estonia. Erika Salumae won Estonia's first Olympic gold when retaining the women's cycling individual sprint title she won four years ago in Seoul as a Soviet athlete."
      Top ESA concepts for the document: Estonia at the 2000 Summer Olympics; Estonia at the 2004 Summer Olympics; 2006 Commonwealth Games; Estonia at the 2006 Winter Olympics; 1992 Summer Olympics; Athletics at the 2004 Summer Olympics; 2000 Summer Olympics; 2006 Winter Olympics; Cross-country skiing at the 2006 Winter Olympics; New Zealand at the 2006 Winter Olympics
      The Olympics concepts in the query's vector match this clearly irrelevant document.
  23. "Economy" is not mentioned, but the TF-IDF weight of "Estonia" is strong enough to trigger this concept on its own...
  24. Problem: selecting query features
      Feature selection could remove noisy ESA concepts, but the IR task provides no training data...
      Focus on query concepts: the query is short and noisy, while selection at indexing time lacks context.
      A utility function U(+|-) requires a target measure, i.e., a training set.
      (Diagram: f = ESA(q) → filter, guided by U → f'.)
  25. Solution: pseudo-relevance feedback
      Use the BOW results as positive / negative examples.
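A minimal sketch of the pseudo-relevance feedback idea on slide 25: run the plain keyword query first, then treat its top-ranked results as pseudo-positive examples and lower-ranked results as pseudo-negative ones. The cutoffs below are illustrative assumptions, not the values used in Morag.

```python
# Minimal sketch of slide 25: derive pseudo-positive and pseudo-negative example
# documents from a plain BOW ranking. The cutoffs are illustrative assumptions.
def pseudo_relevance_examples(bow_ranking, n_pos=20, neg_start=100, n_neg=100):
    """bow_ranking: document ids ordered by descending BOW retrieval score."""
    positives = bow_ranking[:n_pos]                        # assume the top is mostly relevant
    negatives = bow_ranking[neg_start:neg_start + n_neg]   # assume the tail is mostly irrelevant
    return positives, negatives
```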
  26. ESA feature selection methods
      - IG (filter): calculate each feature's information gain in separating positive from negative examples, and keep the best-performing features.
      - RV (filter): add concepts from the positive examples to the candidate features, and re-weight all features based on their weights in the examples.
      - IIG (wrapper): find the subset of features that best separates positive from negative examples, using heuristic search.
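A minimal sketch of the IG filter described on slide 26: each candidate ESA concept is scored by the information gain of splitting the pseudo-positive and pseudo-negative documents on its presence, and the best-scoring concepts are kept. This is the standard entropy-based formulation; the exact variant used in Morag may differ, and RV/IIG are not shown.

```python
# Minimal sketch of the IG (information gain) filter on slide 26.
import math

def entropy(pos, neg):
    """Binary entropy of a pos/neg split (0 for empty or pure splits)."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(concept, positives, negatives):
    """positives/negatives: non-empty lists of concept sets, one per example document."""
    pos_with = sum(concept in doc for doc in positives)
    neg_with = sum(concept in doc for doc in negatives)
    pos_without = len(positives) - pos_with
    neg_without = len(negatives) - neg_with
    n = len(positives) + len(negatives)
    split_entropy = ((pos_with + neg_with) / n) * entropy(pos_with, neg_with) \
                  + ((pos_without + neg_without) / n) * entropy(pos_without, neg_without)
    return entropy(len(positives), len(negatives)) - split_entropy

def select_query_concepts(candidates, positives, negatives, k=10):
    """Keep the k candidate concepts with the highest information gain."""
    ranked = sorted(candidates,
                    key=lambda c: information_gain(c, positives, negatives),
                    reverse=True)
    return ranked[:k]
```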
  27. ESA-based retrieval: feature selection example
      Query (TREC topic #434): Estonia economy
      Original ESA concepts for the query, a mix of broad features and noise features: Estonia; Economy of Estonia; Estonia at the 2000 Summer Olympics; Estonia at the 2004 Summer Olympics; Estonia national football team; Estonia at the 2006 Winter Olympics; Baltic Sea; Eurozone; Tiit Vähi; Military of Estonia
      RV adds features, and the useful ones "bubble up": Monetary Policy; Euro; Economy of Europe; Nordic Countries; Prime Minister of Estonia; Neoliberalism
      The Olympic cycling document from slide 22, with its concepts (Estonia at the 2000 Summer Olympics; Estonia at the 2004 Summer Olympics; 2006 Commonwealth Games; Estonia at the 2006 Winter Olympics; 1992 Summer Olympics; Athletics at the 2004 Summer Olympics; 2000 Summer Olympics; 2006 Winter Olympics; Cross-country skiing at the 2006 Winter Olympics; New Zealand at the 2006 Winter Olympics), no longer matches the selected query features.
  28. Morag evaluation
      Tested on the TREC-8 and Robust-04 datasets (528K documents, 50 web-like queries).
      Feature selection is highly effective.
  29. Morag evaluation
      Significant performance improvement over our own baseline and also over the top-performing TREC-8 BOW baselines.
      Concept-based performance by itself is quite low; a major reason is the TREC "pooling" method, which implies that relevant documents found only by Morag will not have been judged as such...
  30. Morag evaluation
      Optimal ("Oracle") selection analysis shows much more potential for Morag.
  31. Morag evaluation
      Pseudo-relevance proves to be a good approximation of actual relevance.
  32. Conclusion
      - Morag: a new methodology for concept-based information retrieval
      - Documents and the query are enhanced with Wikipedia concepts
      - Informative features are selected using pseudo-relevance feedback
      - The generated features improve the performance of BOW-based systems
  33. Thank you!
