Concept-Based Information Retrieval using Explicit Semantic Analysis

My master's thesis seminar at the Technion, summarizing my research work, part of which was published in an AAAI-08 paper and is now submitted to TOIS. Download and read the notes for more details. Comments/questions are very welcome!

Notes
  • This is a relevant document for this TREC query that is not retrieved by a standard BOW system – none of the keywords appear in the document.
  • Methods for dealing mainly with synonymy. Each of the existing methods has its issues – stemming loses nuances, tokenization may create words the author did not intend, and synonyms may intensify ambiguity (polysemy). However, mapping “treasure” to “ancient artifacts found in a sunken Roman ship” requires significant world knowledge that these methods cannot offer.
  • The promise of concept-based retrieval is that by transforming to a domain of concepts and performing retrieval in it rather than in the domain of words, the previously described problems will be dramatically reduced.
  • Existing concept-based representation approaches. KeyConcept is most similar to ESA, but: 1) a very small ontology is used (1,564 concepts), 2) query processing is manual.
  • ESA can also be generated using other knowledge sources – it was successfully applied to the ODP – but recent papers focused on Wikipedia, which proved most fitting.
  • These vectors are for illustration only, actual concepts and weights are different in real life (so don’t try the maths…)
  • First results were published in AAAI 2008
  • The constraint is due to the very large number of concepts in the vector – it can easily inflate the index to a huge scale.
  • There is enough overlap between the concepts of the query and the target document – the document is retrieved despite having no keyword match!
  • However, we seem to also have false positives causing results to be far from optimal. These (and previous slide) are actual top 10 concepts generated for these texts.
  • Going back to the ESA classifier generation, we see why Baltic Sea was triggered by the query (although we still consider it not relevant enough)
  • Still, the bottom line is that we prefer this not to happen. One option is to change how concepts are generated for multi-word texts; in this research we decided not to make any changes to the ESA mechanism itself.
  • Indeed, when applying ESA to text categorization, training data was crucial in removing noisy features…
  • Relevance feedback is the process in which a user assigns relevance labels to retrieved documents. “Pseudo” means the system assumes the top-ranked documents are relevant. Naturally this is less accurate, but better than no data at all.
  • More details on these actual methods are given in the paper.
  • The less relevant features are those that usually either appear in both the positive and negative example sets, or appear in neither. Useful features are those that appear more in positive examples.
  • TREC is the most comprehensive and studied IR benchmark. We used two datasets and a third one (TREC-7) for parameter tuning. These graphs show the impact of feature selection, hence they show the performance of the concept-based subsystem alone.
  • The full MORAG system results. Parameter tuning works well. The improvement is most apparent when the baseline is weaker. Consider that concept-based retrieval by itself scores quite low; one major reason is that it has a high chance of finding relevant documents that other systems did not find, and that were therefore not judged under the TREC ‘pooling’ judgment method. This means that MORAG’s performance is probably underrated. See the paper for more details.
  • We attempted to estimate what further future work may uncover. Performing an exhaustive search over all feature subsets provides such an estimate, which shows a lot of potential for more work.
  • The graphs are not far apart. An interesting trend is that beyond a certain threshold, adding more pseudo-relevant documents harms performance as their relevance becomes less accurate; when using truly relevant documents this does not happen, which confirms that this is indeed the cause.

Transcript

  • 1. Concept-Based Information Retrieval using Explicit Semantic Analysis
    M.Sc. Seminar talk
    Ofer Egozi, CS Department, Technion
    Supervisor: Prof. Shaul Markovitch
    24/6/09
  • 2. Information Retrieval
    Query
    IR
    Recall
    Precision
  • 3. Ranked retrieval
    Query
    IR
  • 4. Keyword-based retrieval
    Bag Of Words (BOW)
    Query
    IR
  • 5. Problem: retrieval misses
    TREC document LA071689-0089
    “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
    TREC topic #411
    salvaging shipwreck treasure
    Query
    IR
  • 6. The vocabulary problem
    Identity: Syntax
    (tokenization, stemming…)
    Similarity: Synonyms (Wordnet etc.)
    Relatedness: Semantics / world knowledge
    (???)
    “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
    ?
    [but also shipping/treasurer]
    Synonymy / Polysemy
    ?
    [but also deliver/scavenge/relieve]
    salvaging shipwreck treasure
  • 7. Concept-based retrieval
    “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
    IR
    salvaging shipwreck treasure
  • 8. Concept-based representations
    Human-edited Thesauri (e.g. WordNet)
    Source: editors , concepts: words, mapping: manual
    Corpus-based Thesauri (e.g. co-occurrence)
    Source: corpus , concepts: words , mapping: automatic
    Ontology mapping (e.g. KeyConcept)
    Source: ontology , concepts: ontology node(s) , mapping: automatic
    Latent analysis (e.g. LSA, pLSA, LDA)
    Source: corpus , concepts: word distributions , mapping: automatic
    Insufficient granularity
    Non-intuitive Concepts
    Expensive repetitive computations
    Non-scalable solution
  • 9. Concept-based representations
    Human-edited Thesauri (e.g. WordNet)
    Source: editors , concepts: words, mapping: manual
    Corpus-based Thesauri (e.g. co-occurrence)
    Source: corpus , concepts: words , mapping: automatic
    Ontology mapping (e.g. KeyConcept)
    Source: ontology , concepts: ontology node(s) , mapping: automatic
    Latent analysis (e.g. LSA, pLSA, LDA)
    Source: corpus , concepts: word distributions , mapping: automatic
    Is it possible to devise a concept-based representation that is scalable, computationally feasible, and uses intuitive and granular concepts?
    Insufficient granularity
    Non-intuitive Concepts
    Expensive repetitive computations
    Non-scalable solution
  • 10. Explicit Semantic Analysis
    Gabrilovich and Markovitch (2005,2006,2007)
  • 11. Explicit Semantic Analysis (ESA)
    Wikipedia is viewed as an ontology - a collection of ~1M concepts
    World War II
    Panthera
    Jane Fonda
    Island
    concept
  • 12. Explicit Semantic Analysis (ESA)
    Wikipedia is viewed as an ontology - a collection of ~1M concepts
    Every Wikipedia article represents a concept
    Panthera
    Cat [0.92]
    Leopard [0.84]
    Article words are associated with the concept (TF.IDF)
    Roar [0.77]
    concept
  • 13. Explicit Semantic Analysis (ESA)
    Wikipedia is viewed as an ontology - a collection of ~1M concepts
    Every Wikipedia article represents a concept
    Panthera
    Cat [0.92]
    Leopard [0.84]
    Article words are associated with the concept (TF.IDF)
    Roar [0.77]
  • 14. Explicit Semantic Analysis (ESA)
    Wikipedia is viewed as an ontology - a collection of ~1M concepts
    Every Wikipedia article represents a concept
    Article words are associated with the concept (TF.IDF)
    Panthera
    The semantics of a word is the vector of its associations with Wikipedia concepts
    Cat [0.92]
    Leopard [0.84]
    Panthera [0.92]
    Cat [0.95]
    Jane Fonda [0.07]
    Cat
    Roar [0.77]
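The indexing step described in the last few slides can be illustrated with a short sketch. The following is a minimal illustration, not the thesis code: it assumes a toy set of three "articles", whereas the real system treats each of the ~1M Wikipedia articles as a concept and weighs its words by TF.IDF over the full article text.

```python
# Minimal sketch: build an ESA-style index from a toy set of "Wikipedia articles".
# Article texts, names, and the resulting weights are illustrative assumptions.
import math
from collections import Counter, defaultdict

articles = {  # concept (article title) -> article text
    "Panthera": "cat leopard roar lion tiger cat",
    "Jane Fonda": "actress film cat workout",
    "Island": "sea land ocean coast",
}

def tfidf_vectors(articles):
    """Return {concept: {word: TF.IDF weight}} over the toy article set."""
    tokenized = {c: text.split() for c, text in articles.items()}
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))
    n = len(articles)
    return {
        concept: {
            word: (count / len(words)) * math.log(n / df[word])
            for word, count in Counter(words).items()
        }
        for concept, words in tokenized.items()
    }

def invert(concept_vectors):
    """Invert to word -> [(concept, weight), ...]: the word's ESA concept vector."""
    index = defaultdict(list)
    for concept, vec in concept_vectors.items():
        for word, weight in vec.items():
            if weight > 0:
                index[word].append((concept, weight))
    return dict(index)

esa_index = invert(tfidf_vectors(articles))
print(esa_index["cat"])  # associations of "cat" with the toy concepts
```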
  • 15. Explicit Semantic Analysis (ESA)
    The semantics of a text fragment is the average vector (centroid) of the semantics of its words
    In practice – disambiguation…
    Mouse (computing) [0.81]
    Mickey Mouse [0.81]
    Game Controller [0.64]
    Button [0.93]
    Game Controller [0.32]
    Mouse (rodent) [0.91]
    John Steinbeck [0.17]
    Mouse (computing) [0.95]
    Mouse (rodent) [0.56]
    Dick Button [0.84]
    Mouse (computing) [0.84]
    Drag-and-drop [0.91]
    button
    mouse
    mouse button
    mouse button
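A minimal sketch of this interpretation step (an illustration, not the thesis code): the ESA vector of a text fragment is the centroid of its words' concept vectors, and two fragments can then be compared by cosine similarity. The toy word-to-concept mapping below borrows the weights shown in the earlier slides and is an assumption.

```python
# Minimal sketch: ESA vector of a text fragment = centroid of its words' vectors.
import math
from collections import defaultdict

# Toy word -> [(concept, weight)] mapping (weights borrowed from the slides).
esa_index = {
    "cat":     [("Panthera", 0.92), ("Cat", 0.95), ("Jane Fonda", 0.07)],
    "roar":    [("Panthera", 0.77)],
    "leopard": [("Panthera", 0.84)],
}

def esa_vector(text, index):
    """Centroid of the concept vectors of the words of `text`."""
    words = text.lower().split()
    centroid = defaultdict(float)
    for word in words:
        for concept, weight in index.get(word, []):
            centroid[concept] += weight / len(words)
    return dict(centroid)

def cosine(u, v):
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# No keyword overlap, yet both texts map strongly to the Panthera concept:
print(cosine(esa_vector("cat roar", esa_index),
             esa_vector("leopard", esa_index)))
```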
  • 16. MORAG*: An ESA-based information retrieval algorithm
    *MORAG: Flail in Hebrew
    “Concept-based feature generation and selection for information retrieval”, AAAI-2008
  • 17. Enrich documents/queries
    ESA
    IR
    Query
    Constraint: use only the strongest concepts
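A minimal sketch of that constraint: only the k strongest concepts of an ESA vector are kept as features before indexing, so the concept index does not blow up. The cutoff value below is an illustrative assumption, not the tuned one.

```python
# Minimal sketch: prune an ESA vector to its k strongest concepts before indexing.
def strongest_concepts(esa_vec, k=50):
    """Keep only the k highest-weighted concepts (k is an illustrative value)."""
    return dict(sorted(esa_vec.items(), key=lambda kv: kv[1], reverse=True)[:k])

print(strongest_concepts({"Panthera": 0.9, "Island": 0.2, "Jane Fonda": 0.07}, k=2))
# -> {'Panthera': 0.9, 'Island': 0.2}
```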
  • 18. Problem: documents (in)coherence
    TREC document LA120790-0036
    REFERENCE BOOKS SPEAK VOLUMES TO KIDS;
    With the school year in high gear, it's a good time to consider new additions to children's home reference libraries…
    …Also new from Pharos-World Almanac: "The World Almanac InfoPedia," a single-volume visual encyclopedia designed for ages 8 to 16…
    …"The Doubleday Children's Encyclopedia," designed for youngsters 7 to 11, bridges the gap between single-subject picture books and formal reference books…
    …"The Lost Wreck of the Isis" by Robert Ballard is the latest adventure in the Time Quest Series from Scholastic-Madison Press ($15.95 hardcover). Designed for children 8 to 12, it tells the story of Ballard's 1988 discovery of an ancient Roman shipwreck deep in the Mediterranean Sea…
    Document is judged relevant for topic 411 due to one relevant passage in it
    This is not an issue in BOW retrieval, where words are indexed independently. How should it be dealt with in concept-based retrieval?
    Concepts generated for this document will average to the books / children concepts, and lose the shipwreck mentions…
  • 19. Solution: split to passages
    ESA
    IR
    Query
    ConceptScore(d) = ConceptScore(full-doc) + max_{passage ∈ d} ConceptScore(passage)
    Index both full document and passages.
    Best performance achieved by fixed-length overlapping sliding windows.
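A minimal sketch of this scoring scheme, reusing the `esa_vector` and `cosine` helpers sketched earlier; the window length and step below are illustrative assumptions, not the tuned values.

```python
# Minimal sketch: score both the full document and fixed-length overlapping
# sliding-window passages; ConceptScore(d) combines the full-document score
# with the best passage score.
def sliding_windows(words, size=50, step=25):
    """Fixed-length overlapping word windows covering the document."""
    last_start = max(len(words) - size, 0)
    return [" ".join(words[i:i + size]) for i in range(0, last_start + 1, step)]

def concept_score(query_vec, doc_text, index):
    words = doc_text.split()
    full_doc = cosine(query_vec, esa_vector(doc_text, index))
    best_passage = max(
        (cosine(query_vec, esa_vector(p, index)) for p in sliding_windows(words)),
        default=0.0,
    )
    return full_doc + best_passage
```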
  • 20. Morag ranking
    Score(q,d) = α·ConceptScore(q,d) + (1−α)·KeywordScore(q,d)
    IR
    Query
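A minimal sketch of the ranking interpolation above; the weight (called `alpha` here) is a tuned parameter of the system, and the default below is only a placeholder.

```python
# Minimal sketch: the final Morag score interpolates the concept-based score
# with the keyword (BOW) score. alpha is a placeholder, not the tuned weight.
def morag_score(concept_score, keyword_score, alpha=0.5):
    return alpha * concept_score + (1 - alpha) * keyword_score
```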
  • 21. ESA-based retrieval example
    “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."

    salvaging shipwreck treasure
  • 40. Problem: irrelevant docs retrieved
    TREC topic #434
    Estonia economy
    Top 10 concepts generated for the query:
    • Estonia
    • Economy of Estonia
    • Estonia at the 2000 Summer Olympics
    • Estonia at the 2004 Summer Olympics
    • Estonia national football team
    • Estonia at the 2006 Winter Olympics
    • Baltic Sea
    • Eurozone
    • Tiit Vähi
    • Military of Estonia
    Irrelevant document retrieved:
    “Olympic News In Brief: Cycling win for Estonia.
    Erika Salumae won Estonia's first Olympic gold when retaining the women's cycling individual sprint title she won four years ago in Seoul as a Soviet athlete."
    Top 10 concepts generated for the document:
    • Estonia at the 2000 Summer Olympics
    • Estonia at the 2004 Summer Olympics
    • 2006 Commonwealth Games
    • Estonia at the 2006 Winter Olympics
    • 1992 Summer Olympics
    • Athletics at the 2004 Summer Olympics
    • 2000 Summer Olympics
    • 2006 Winter Olympics
    • Cross-country skiing 2006 Winter Olympics
    • New Zealand at the 2006 Winter Olympics
  • 59. “Economy” is not mentioned, but TF·IDF of “Estonia” is strong enough to trigger this concept on its own…
  • 60. Problem: selecting query features
    Selection could remove noisy ESA concepts
    However, the IR task provides no training data…
    Focus on query concepts – the query is short and noisy, while feature selection (FS) at indexing time lacks context
    Utility function U(+|−) requires a target measure ⇒ a training set
    f = ESA(q) → Filter (using U) → f’
  • 61. Solution: Pseudo Relevance Feedback
    Use BOW results as positive / negative examples
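A minimal sketch of how such example sets could be derived from the BOW ranking; the cutoffs and the choice of lower-ranked documents as negatives are illustrative assumptions, not necessarily the exact scheme used in the paper.

```python
# Minimal sketch: pseudo-relevance feedback takes the top BOW results as
# positive examples and lower-ranked results as negative examples.
# n_pos / n_neg are illustrative cutoffs.
def pseudo_feedback(bow_ranking, n_pos=20, n_neg=100):
    """bow_ranking: document ids sorted by descending BOW score."""
    positives = bow_ranking[:n_pos]
    negatives = bow_ranking[n_pos:n_pos + n_neg]
    return positives, negatives
```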
  • 62. ESA feature selection methods
    IG (filter) – calculate each feature’s Information Gain in separating positive and negative examples, take best performing features
    RV (filter) – add concepts in the positive examples to candidate features, and re-weight all features based on their weights in examples
    IIG (wrapper) – find subset of features that best separates positive and negative examples, employing heuristic search
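As one concrete illustration of the filter approach, here is a minimal sketch of an IG-style selector: each query concept is scored by the information gain of its presence/absence over the (pseudo) positive and negative example vectors, and only the best-scoring concepts are kept. This sketches the general idea, not the exact implementation in the paper.

```python
# Minimal sketch of IG-based feature selection over pseudo-relevance examples.
import math

def entropy(pos, neg):
    """Binary entropy of a (positive, negative) count pair."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(concept, pos_vecs, neg_vecs):
    """pos_vecs / neg_vecs: ESA vectors (dicts) of positive / negative examples."""
    def presence(vecs):
        has = sum(1 for v in vecs if concept in v)
        return has, len(vecs) - has
    p_has, p_not = presence(pos_vecs)
    n_has, n_not = presence(neg_vecs)
    total = len(pos_vecs) + len(neg_vecs)
    if total == 0:
        return 0.0
    before = entropy(len(pos_vecs), len(neg_vecs))
    after = ((p_has + n_has) / total) * entropy(p_has, n_has) \
          + ((p_not + n_not) / total) * entropy(p_not, n_not)
    return before - after

def ig_select(query_concepts, pos_vecs, neg_vecs, keep=10):
    """Keep the query concepts with the highest information gain."""
    ranked = sorted(query_concepts,
                    key=lambda c: information_gain(c, pos_vecs, neg_vecs),
                    reverse=True)
    return ranked[:keep]
```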
  • 63. ESA-based retrieval – FS example
    • Estonia
    • Economy of Estonia
    • Estonia at the 2000 Summer Olympics
    • Estonia at the 2004 Summer Olympics
    • Estonia national football team
    • Estonia at the 2006 Winter Olympics
    • Baltic Sea
    • Eurozone
    • Tiit Vähi
    • Military of Estonia
    Estonia economy
    “Olympic News In Brief: Cycling win for Estonia.
    Erika Salumae won Estonia's first Olympic gold when retaining the women's cycling individual sprint title she won four years ago in Seoul as a Soviet athlete."
    Broad features
    Noise features
    RV adds features
    Useful ones “bubble up”
    • Neoliberalism
    • Estonia at the 2000 Summer Olympics
    • Estonia at the 2004 Summer Olympics
    • 2006 Commonwealth Games
    • Estonia at the 2006 Winter Olympics
    • 1992 Summer Olympics
    • Athletics at the 2004 Summer Olympics
    • 2000 Summer Olympics
    • 2006 Winter Olympics
    • Cross-country skiing 2006 Winter Olympics
    • New Zealand at the 2006 Winter Olympics
  • Morag evaluation
    Testing over TREC-8 and Robust-04 datasets (528K documents, 50 web-like queries)
    Feature selection is highly effective
  • 87. Morag evaluation
    Significant performance improvement over our own baseline and also over the top-performing TREC-8 BOW baselines
    Concept-based performance by itself is quite low; a major reason is the TREC ‘pooling’ method, which implies that relevant documents found only by Morag will not be judged as such…
  • 88. Morag evaluation
    Optimal (“Oracle”) selection analysis shows much more potential for Morag
  • 89. Morag evaluation
    Pseudo-relevance proves to be a good approximation of actual relevance
  • 90. Conclusion
    Morag: a new methodology for concept-based information retrieval
    Documents and query are enhanced by Wikipedia concepts
    Informative features are selected using pseudo-relevance feedback
    The generated features improve the performance of BOW-based systems
  • 91. Thank you!