Concept-Based Information Retrieval using Explicit Semantic Analysis


My master's thesis seminar at the Technion, summarizing my research work which was partly published in a AAAI-08 paper and now submitted to TOIS. Download and read notes for more details. Comments/questions are very welcome!

Slide notes
  • This is a relevant document for this TREC query that is not retrieved by a standard BOW system – none of the keywords is found in the document
  • Methods for dealing mainly with synonymy. Each of the existing methods has its issues – stemming loses nuances, tokenization may create words the author did not intend, and synonyms may intensify ambiguity (polysemy). However, mapping “treasure” to “ancient artifacts found in a sunken Roman ship” requires significant world knowledge that these methods cannot offer.
  • The promise of concept-based retrieval is that by transforming to a domain of concepts and performing retrieval in it rather than in the domain of words, the previously described problems will be dramatically reduced.
  • Existing concept-based representation approaches. KeyConcept is most similar to ESA, but: 1) a very small ontology is used (1,564 concepts), 2) query processing is MANUAL
  • ESA can also be generated from other knowledge sources – it was successfully applied to the ODP – but recent papers focused on Wikipedia, which proved the best fit.
  • These vectors are for illustration only; the actual concepts and weights are different in real life (so don’t try the maths…)
  • First results were published in AAAI 2008
  • The constraint is due to the very large number of concepts in the vector, which can easily inflate the index to a huge scale
  • Enough overlap between the concepts of the query and the target document – the document is retrieved despite having no keyword match!
  • However, we seem to also have false positives, causing results to be far from optimal. These (and the previous slide’s) are the actual top-10 concepts generated for these texts.
  • Going back to the ESA classifier generation, we see why Baltic Sea was triggered by the query (although we still consider it not relevant enough)
  • So the bottom line is still that we prefer this not to happen. One option is to change how concepts are generated for texts of more than one word; in this research we decided not to make any changes to the ESA mechanism itself.
  • Indeed, when applying ESA to text categorization, training data was crucial in removing noisy features…
  • Relevance feedback is the process in which a user assigns relevance labels to retrieved documents. “Pseudo” means the system assumes that the top-ranked documents are relevant. Naturally this is less accurate, but better than no data at all.
  • More details on these actual methods are given in the paper.
  • Less relevant features usually either appear in both the positive and negative example sets, or in neither; useful features are those that appear mostly in the positive examples.
  • TREC is the most comprehensive and well-studied IR benchmark. We used two datasets, plus a third one (TREC-7) for parameter tuning. These graphs show the impact of feature selection, hence they show the performance of the concept-based subsystem alone.
  • The full MORAG system results. Parameter tuning works well. The improvement is most apparent when the baseline is weaker. Note that the concept-based retrieval score by itself is quite low; one major reason is that it is likely to find relevant documents that other systems did not find, and which were therefore never judged under TREC’s ‘pooling’ judgment method. That means that MORAG’s performance is probably underrated. See paper for more details.
  • We attempted to estimate the potential that further work may uncover. An exhaustive search over all feature subsets provides such an estimate, and it shows considerable room for improvement.
  • The graphs are not far apart. An interesting trend is that beyond a certain threshold, adding more pseudo-relevant documents harms performance as their relevance becomes less accurate; when using truly relevant documents this does not happen, which supports that explanation.
  • Transcript

    • 1. Concept-Based Information Retrieval using Explicit Semantic Analysis
      M.Sc. Seminar talk
      Ofer Egozi, CS Department, Technion
      Supervisor: Prof. Shaul Markovitch
      24/6/09
    • 2. Information Retrieval
      Query
      IR
      Recall
      Precision
    • 3. Ranked retrieval
      Query
      IR
    • 4. Keyword-based retrieval
      Bag Of Words (BOW)
      Query
      IR
    • 5. Problem: retrieval misses
      TREC document LA071689-0089
      “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
      TREC topic #411
      salvaging shipwreck treasure
      Query
      IR
    • 6. The vocabulary problem
      Identity: Syntax
      (tokenization, stemming…)
      Similarity: Synonyms (WordNet etc.)
      Relatedness: Semantics / world knowledge
      (???)
      “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
      ?
      [but also shipping/treasurer]
      Synonymy / Polysemy
      ?
      [but also deliver/scavenge/relieve]
      salvaging shipwreck treasure
    • 7. Concept-based retrieval
      “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
      IR
      salvaging shipwreck treasure
    • 8. Concept-based representations
      Human-edited Thesauri (e.g. WordNet)
      Source: editors , concepts: words, mapping: manual
      Corpus-based Thesauri (e.g. co-occurrence)
      Source: corpus , concepts: words , mapping: automatic
      Ontology mapping (e.g. KeyConcept)
      Source: ontology , concepts: ontology node(s) , mapping: automatic
      Latent analysis (e.g. LSA, pLSA, LDA)
      Source: corpus , concepts: word distributions , mapping: automatic
      Insufficient granularity
      Non-intuitive Concepts
      Expensive repetitive computations
      Non-scalable solution
    • 9. Concept-based representations
      Human-edited Thesauri (e.g. WordNet)
      Source: editors , concepts: words, mapping: manual
      Corpus-based Thesauri (e.g. co-occurrence)
      Source: corpus , concepts: words , mapping: automatic
      Ontology mapping (e.g. KeyConcept)
      Source: ontology , concepts: ontology node(s) , mapping: automatic
      Latent analysis (e.g. LSA, pLSA, LDA)
      Source: corpus , concepts: word distributions , mapping: automatic
      Is it possible to devise a
      concept-based representation, that is scalable, computationally
      feasible, and uses intuitive
      and granular concepts?
      Insufficient granularity
      Non-intuitive Concepts
      Expensive repetitive computations
      Non-scalable solution
    • 10. Explicit Semantic Analysis
      Gabrilovich and Markovitch (2005,2006,2007)
    • 11. Explicit Semantic Analysis (ESA)
      Wikipedia is viewed as an ontology - a collection of ~1M concepts
      World War II
      Panthera
      Jane Fonda
      Island
      concept
    • 12. Explicit Semantic Analysis (ESA)
      Wikipedia is viewed as an ontology - a collection of ~1M concepts
      Every Wikipedia article represents a concept
      Panthera
      Cat [0.92]
      Leopard [0.84]
      Article words are associated with the concept (TF·IDF)
      Roar [0.77]
      concept
    • 13. Explicit Semantic Analysis (ESA)
      Wikipedia is viewed as an ontology - a collection of ~1M concepts
      Every Wikipedia article represents a concept
      Panthera
      Cat [0.92]
      Leopard [0.84]
      Article words are associated with the concept (TF·IDF)
      Roar [0.77]
    • 14. Explicit Semantic Analysis (ESA)
      Wikipedia is viewed as an ontology - a collection of ~1M concepts
      Every Wikipedia article represents a concept
      Article words are associated with the concept (TF·IDF)
      Panthera
      The semantics of a word is the vector of its associations with Wikipedia concepts
      Cat [0.92]
      Leopard [0.84]
      Panthera
      [0.92]
      Cat
      [0.95]
      Jane Fonda
      [0.07]
      Cat
      Roar [0.77]
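The word-to-concept mapping on the slides above can be sketched in a few lines of code. This is a toy illustration, not the thesis implementation: a three-article “Wikipedia” stands in for the real ~1M-article collection, and a word’s vector is simply its TF·IDF weight in each article.

```python
import math
from collections import Counter

# Toy "Wikipedia": each article title is a concept (texts are illustrative only).
toy_wikipedia = {
    "Panthera":   "cat leopard lion tiger roar cat",
    "Cat":        "cat pet whiskers purr kitten cat",
    "Jane Fonda": "actress film workout aerobics",
}

# Term frequency per concept and document frequency per word.
tf = {concept: Counter(text.split()) for concept, text in toy_wikipedia.items()}
df = Counter(word for counts in tf.values() for word in counts)
n_concepts = len(toy_wikipedia)

def word_vector(word):
    """ESA semantics of a word: its TF·IDF association with every concept."""
    if word not in df:
        return {}
    idf = math.log(n_concepts / df[word])
    return {concept: counts[word] * idf
            for concept, counts in tf.items() if word in counts}

print(word_vector("roar"))  # associated only with the Panthera concept
print(word_vector("cat"))   # associated with both Panthera and Cat
```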
    • 15. Explicit Semantic Analysis (ESA)
      The semantics of a text fragment is the average vector (centroid) of the semantics of its words
      In practice – disambiguation…
      Mouse (computing)
      [0.81]
      Mickey Mouse [0.81]
      Game Controller
      [0.64]
      Button
      [0.93]
      Game Controller
      [0.32]
      Mouse (rodent)
      [0.91]
      John Steinbeck
      [0.17]
      Mouse (computing)
      [0.95]
      Mouse (rodent)
      [0.56]
      Dick Button
      [0.84]
      Mouse (computing)
      [0.84]
      Drag-and-drop
      [0.91]
      button
      mouse
      mouse button
      mouse button
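Slide 15’s centroid rule, and the disambiguation effect it yields, can be illustrated with hand-picked toy weights loosely following the slide (in the real system these come from the TF·IDF index over Wikipedia). The point is only that “Mouse (computing)” is reinforced by both words of “mouse button”, while the competing senses are averaged down.

```python
from collections import defaultdict

# Toy word-level concept vectors, roughly following the slide's example.
word_vectors = {
    "mouse":  {"Mouse (rodent)": 0.91, "Mouse (computing)": 0.84,
               "John Steinbeck": 0.17},
    "button": {"Button": 0.93, "Mouse (computing)": 0.81,
               "Game controller": 0.64, "Dick Button": 0.84},
}

def text_vector(words):
    """Semantics of a text fragment: the centroid of its words' concept vectors."""
    centroid = defaultdict(float)
    for w in words:
        for concept, weight in word_vectors.get(w, {}).items():
            centroid[concept] += weight / len(words)
    return dict(centroid)

# "Mouse (computing)" receives support from both words and dominates the
# centroid, while single-word senses such as "Mouse (rodent)" are halved.
print(sorted(text_vector(["mouse", "button"]).items(), key=lambda kv: -kv[1]))
```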
    • 16. MORAG*: An ESA-based information retrieval algorithm
      *MORAG: Flail in Hebrew
      “Concept-based feature generation and selection for information retrieval”, AAAI-2008
    • 17. Enrich documents/queries
      ESA
      IR
      Query
      Constraint: use only the strongest concepts
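One way to read the constraint on slide 17: each document or query contributes only its strongest ESA concepts as additional indexable features. The sketch below assumes details not stated on the slide — the "CONCEPT:" pseudo-term prefix and the cutoff k are illustrative choices, not Morag's exact mechanism.

```python
def enrich(text_tokens, concept_vector, k=50):
    """Append the k strongest ESA concepts to a token stream as pseudo-terms."""
    top = sorted(concept_vector.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return list(text_tokens) + [f"CONCEPT:{name}" for name, _ in top]

# Illustrative query enrichment (concept names and weights are made up).
tokens = ["salvaging", "shipwreck", "treasure"]
concepts = {"Shipwreck": 0.9, "Marine salvage": 0.8, "Treasure": 0.7, "Scuba diving": 0.2}
print(enrich(tokens, concepts, k=3))
```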
    • 18. Problem: documents (in)coherence
      TREC document LA120790-0036
      REFERENCE BOOKS SPEAK VOLUMES TO KIDS;
      With the school year in high gear, it's a good time to consider new additions to children's home reference libraries…
      …Also new from Pharos-World Almanac: "The World Almanac InfoPedia," a single-volume visual encyclopedia designed for ages 8 to 16…
      …"The Doubleday Children's Encyclopedia," designed for youngsters 7 to 11, bridges the gap between single-subject picture books and formal reference books…
      …"The Lost Wreck of the Isis" by Robert Ballard is the latest adventure in the Time Quest Series from Scholastic-Madison Press ($15.95 hardcover). Designed for children 8 to 12, it tells the story of Ballard's 1988 discovery of an ancient Roman shipwreck deep in the Mediterranean Sea…
      Document is judged relevant for topic 411 due to one relevant passage in it
      Not an issue in BOW retrieval, where words are indexed independently. How should this be handled in concept-based retrieval?
      Concepts generated for this document will average to the books / children concepts, and lose the shipwreck mentions…
    • 19. Solution: split to passages
      ESA
      IR
      Query
      ConceptScore(d) = ConceptScore(full-doc) + max_{passage ∈ d} ConceptScore(passage)
      Index both full document and passages.
      Best performance achieved by fixed-length overlapping sliding windows.
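A rough sketch of the passage scheme above, under assumptions: fixed-length overlapping windows over the document's tokens, and a document score that adds the best passage score to the full-document score. The esa_score argument stands in for scoring a query's concepts against an indexed text; the window size and step are illustrative, not the tuned values.

```python
def sliding_passages(tokens, size=50, step=25):
    """Fixed-length overlapping windows over the document's tokens."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, step)]

def document_concept_score(query, doc_tokens, esa_score):
    """ConceptScore(d) = ConceptScore(full-doc) + max over passages of ConceptScore(passage)."""
    return esa_score(query, doc_tokens) + max(
        esa_score(query, passage) for passage in sliding_passages(doc_tokens))

# Dummy scorer for demonstration only: counts query-word overlap.
toy_score = lambda q, toks: sum(toks.count(w) for w in q.split())
doc = ["shipwreck"] * 3 + ["book"] * 200
print(document_concept_score("shipwreck treasure", doc, toy_score))
```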
    • 20. Morag ranking
      Score(q,d) = λ · ConceptScore(q,d) + (1 − λ) · KeywordScore(q,d)
      IR
      Query
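The final ranking on slide 20 interpolates the two subsystems' scores; the weighting symbol was lost in the transcript, so λ below simply denotes that tuned mixing parameter. A trivial sketch, assuming both scores are already normalized to comparable ranges:

```python
def morag_score(concept_score, keyword_score, lam=0.5):
    """Score(q,d) = lam * ConceptScore(q,d) + (1 - lam) * KeywordScore(q,d)."""
    return lam * concept_score + (1.0 - lam) * keyword_score

print(morag_score(concept_score=0.8, keyword_score=0.3, lam=0.4))  # 0.5
```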
    • 21. ESA-based retrieval example
      “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."

      salvaging shipwreck treasure
    • 40. Problem: irrelevant docs retrieved
      TREC topic #434
      Estonia economy
      Top 10 concepts generated for the query:
      • Estonia
      • Economy of Estonia
      • Estonia at the 2000 Summer Olympics
      • Estonia at the 2004 Summer Olympics
      • Estonia national football team
      • Estonia at the 2006 Winter Olympics
      • Baltic Sea
      • Eurozone
      • Tiit Vähi
      • Military of Estonia
      Retrieved (irrelevant) document:
      “Olympic News In Brief: Cycling win for Estonia.
      Erika Salumae won Estonia's first Olympic gold when retaining the women's cycling individual sprint title she won four years ago in Seoul as a Soviet athlete.”
      Top 10 concepts generated for the document:
      • Estonia at the 2000 Summer Olympics
      • Estonia at the 2004 Summer Olympics
      • 2006 Commonwealth Games
      • Estonia at the 2006 Winter Olympics
      • 1992 Summer Olympics
      • Athletics at the 2004 Summer Olympics
      • 2000 Summer Olympics
      • 2006 Winter Olympics
      • Cross-country skiing 2006 Winter Olympics
      • New Zealand at the 2006 Winter Olympics
    • 59. “Economy” is not mentioned, but TF·IDF of “Estonia” is strong enough to trigger this concept on its own…
    • 60. Problem: selecting query features
      Selection could remove noisy ESA concepts
      However, the IR task provides no training data…
      Focus on query concepts – the query is short and noisy, while FS at indexing time lacks context
      Utility function U(+|−) requires a target measure ⇒ a training set
      [Diagram: f = ESA(q) → Filter → f′, guided by utility U]
    • 61. Solution: Pseudo Relevance Feedback
      Use BOW results as positive / negative examples
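A minimal sketch of the pseudo-relevance feedback step, with illustrative cutoffs (the actual numbers are parameters of the system): the top of the initial BOW ranking is assumed relevant and serves as positive examples, while lower-ranked documents serve as negative examples.

```python
def pseudo_feedback(bow_ranking, n_pos=10, n_neg=50):
    """Split an initial BOW ranking into assumed-relevant and assumed-irrelevant sets."""
    positives = bow_ranking[:n_pos]
    negatives = bow_ranking[n_pos:n_pos + n_neg]
    return positives, negatives

# Usage: bow_ranking is a list of document ids ordered by the keyword baseline.
pos, neg = pseudo_feedback([f"doc{i}" for i in range(200)])
```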
    • 62. ESA feature selection methods
      IG (filter) – calculate each feature’s Information Gain in separating positive and negative examples, take best performing features
      RV (filter) – add concepts in the positive examples to candidate features, and re-weight all features based on their weights in examples
      IIG (wrapper) – find subset of features that best separates positive and negative examples, employing heuristic search
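To make the IG filter concrete, here is a generic information-gain computation over the pseudo positive/negative document sets, scoring each candidate query concept by how well its presence separates the two sets. This is a standard IG formulation offered as a sketch, not the thesis' exact formulation; RV and IIG are not shown.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a pos/neg split."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(concept, positives, negatives):
    """positives/negatives: lists of documents, each given as a set of concept names."""
    p_with = sum(concept in d for d in positives)
    n_with = sum(concept in d for d in negatives)
    p_without, n_without = len(positives) - p_with, len(negatives) - n_with
    total = len(positives) + len(negatives)
    h_before = entropy(len(positives), len(negatives))
    h_after = ((p_with + n_with) / total) * entropy(p_with, n_with) \
            + ((p_without + n_without) / total) * entropy(p_without, n_without)
    return h_before - h_after

def select_by_ig(query_concepts, positives, negatives, k=20):
    """Keep the k query concepts with the highest information gain."""
    return sorted(query_concepts,
                  key=lambda c: information_gain(c, positives, negatives),
                  reverse=True)[:k]
```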
    • 63. ESA-based retrieval – FS example
      Estonia economy
      “Olympic News In Brief: Cycling win for Estonia.
      Erika Salumae won Estonia's first Olympic gold when retaining the women's cycling individual sprint title she won four years ago in Seoul as a Soviet athlete.”
      • Estonia
      • Economy of Estonia
      • Estonia at the 2000 Summer Olympics
      • Estonia at the 2004 Summer Olympics
      • Estonia national football team
      • Estonia at the 2006 Winter Olympics
      • Baltic Sea
      • Eurozone
      • Tiit Vähi
      • Military of Estonia
      Broad features
      Noise features
      RV adds features
      Useful ones “bubble up”
      • Neoliberalism
      • Estonia at the 2000 Summer Olympics
      • Estonia at the 2004 Summer Olympics
      • 2006 Commonwealth Games
      • Estonia at the 2006 Winter Olympics
      • 1992 Summer Olympics
      • Athletics at the 2004 Summer Olympics
      • 2000 Summer Olympics
      • 2006 Winter Olympics
      • Cross-country skiing 2006 Winter Olympics
      • New Zealand at the 2006 Winter Olympics
    • Morag evaluation
      Testing over TREC-8 and Robust-04 datasets (528K documents, 50 web-like queries)
      Feature selection is highly effective
    • 87. Morag evaluation
      Significant performance improvement over our own baseline, and also over the top-performing TREC-8 BOW baselines
      Concept-based performance by itself is quite low; a major reason is the TREC ‘pooling’ method, which implies that relevant documents found only by Morag will not be judged as such…
    • 88. Morag evaluation
      Optimal (“Oracle”) selection analysis shows much more potential for Morag
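The "Oracle" analysis can be pictured as an exhaustive search over feature subsets, sketched below under assumptions: evaluate stands in for running retrieval with a given subset of query concepts and returning a quality measure (e.g. average precision) computed against the true relevance judgments, so the best subset gives an upper bound on what any selection method could achieve.

```python
from itertools import combinations

def oracle_selection(query_concepts, evaluate):
    """Exhaustively search all non-empty concept subsets for the best-scoring one."""
    best_subset, best_score = (), float("-inf")
    for r in range(1, len(query_concepts) + 1):
        for subset in combinations(query_concepts, r):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```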
    • 89. Morag evaluation
      Pseudo-relevance proves to be a good approximation of actual relevance
    • 90. Conclusion
      Morag: a new methodology for concept-based information retrieval
      Documents and query are enhanced by Wikipedia concepts
      Informative features are selected using pseudo-relevance feedback
      The generated features improve the performance of BOW-based systems
    • 91. Thank you!