Concept-Based Information Retrieval using Explicit Semantic Analysis


    Notes on slide 1

    This is a relevant document for this TREC query that is not retrieved by a standard BOW system: none of the query keywords are found in the document.

    Methods for dealing mainly with synonymy. Each of the existing methods has its issues: stemming loses nuances, tokenization may create words the author did not intend, and synonyms may intensify ambiguity (polysemy). However, mapping “treasure” to “ancient artifacts found in a sunken Roman ship” requires significant world knowledge that these methods cannot offer.

    The promise of concept-based retrieval is that by transforming to a domain of concepts and performing retrieval in it rather than in the domain of words, the previously described problems will be dramatically reduced.

    Existing concept-based representation approaches. KeyConcept is most similar to ESA, but: 1) a very small ontology is used (1564 concepts), 2) query processing is MANUAL.

    ESA can also be generated using other knowledge sources (it was successfully applied to ODP), but recent papers focused on Wikipedia, which proved most fitting.

    These vectors are for illustration only, actual concepts and weights are different in real life (so don’t try the maths…)

    First results were published in AAAI 2008

    The constraint is due to the very large number of concepts in each vector, which can easily inflate the index to a huge scale.

    There is enough overlap between the concepts of the query and those of the target document, so the document is retrieved despite having no keyword match!

    However, we seem to also have false positives causing results to be far from optimal. These (and previous slide) are actual top 10 concepts generated for these texts.

    Going back to the ESA classifier generation, we see why Baltic Sea was triggered by the query (although we still consider it not relevant enough)

    So the bottom line is still that we prefer this not to happen. One option is to change how concepts are generated for texts of more than one word, but in this research we decided not to make any changes to the ESA mechanism itself.

    Indeed, when applying ESA to text categorization, training data was crucial in removing noisy features…

    Relevance feedback is the process where a user assigns relevancy labels to retrieved documents. Pseudo means we let the system decide that the top documents are considered relevant. Naturally this is less accurate, but better than no data at all.

    More details on these actual methods are given in the paper.

    The less relevant features are those that usually either appear in both the positive and the negative example sets, or appear in neither. Useful features are those that appear more in positive examples.

    TREC is the most comprehensive and most studied IR benchmark. We used two datasets for testing and a third one (TREC-7) for parameter tuning. These graphs show the impact of feature selection, hence they show the performance of the concept-based subsystem alone.

    The full MORAG system results. Parameter tuning works well. Improvement is most apparent when the baseline is weaker. Note that the performance of concept-based retrieval by itself is quite low. One major reason is that it has a high chance of finding relevant documents that other systems did not find, and which were therefore not judged under the TREC ‘pooling’ judgment method. That means that MORAG’s performance is probably underrated. See paper for more details.

    We attempted to estimate what future work may uncover. Performing an exhaustive search over all feature subsets provides such an estimate, which shows a lot of potential for more work.

    The graphs are not so far apart. An interesting trend is that beyond a certain threshold, adding more pseudo-relevant documents harms performance as their relevance becomes less accurate. When using truly relevant documents this doesn’t happen, which indicates that this is indeed the cause.


    Concept-Based Information Retrieval using Explicit Semantic Analysis - Presentation Transcript

    1. Concept-Based Information Retrieval using Explicit Semantic Analysis
      M.Sc. Seminar talk
      Ofer Egozi, CS Department, Technion
      Supervisor: Prof. Shaul Markovitch
      24/6/09
    2. Information Retrieval
      Query
      IR
      Recall
      Precision
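The two measures named on this slide can be computed from a set of retrieved documents and a set of documents judged relevant. The sketch below uses the standard definitions; the function name and the document ids are illustrative assumptions, not taken from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved doc ids against judged-relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: 4 documents retrieved, 3 judged relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
```

Here precision is 2 of 4 retrieved documents and recall is 2 of 3 relevant documents.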
    3. Ranked retrieval
      Query
      IR
    4. Keyword-based retrieval
      Bag Of Words (BOW)
      Query
      IR
    5. Problem: retrieval misses
      TREC document LA071689-0089
      “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
      TREC topic #411
      salvaging shipwreck treasure
      Query
      IR
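The miss on this slide can be checked directly. The sketch below tokenizes the topic and the document text quoted above with a plain lowercase word tokenizer (an assumption; no stemming or synonym expansion is applied) and intersects the two term sets.

```python
import re

def tokens(text):
    """Lowercase word tokens; no stemming or synonym expansion."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

query = "salvaging shipwreck treasure"
document = ("ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying "
            "underwater for more than 2,000 years in the wreck of a Roman ship "
            "that sank in the Gulf of Baratti, 12 miles off the island of Elba, "
            "newspapers reported Saturday.")

# Terms shared by query and document under exact keyword matching.
shared_terms = tokens(query) & tokens(document)
```

The intersection is empty: the document contains "wreck" and "ship" but not "shipwreck", so an exact-match keyword system has nothing to score.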
    6. The vocabulary problem
      Identity: Syntax (tokenization, stemming…) [but also shipping/treasurer]
      Similarity: Synonyms (WordNet etc.) [but also deliver/scavenge/relieve]
      Relatedness: Semantics / world knowledge (???)
      Synonymy / Polysemy
      “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
      salvaging shipwreck treasure
    7. Concept-based retrieval
      “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
      IR
      salvaging shipwreck treasure
    8. Concept-based representations
      Human-edited Thesauri (e.g. WordNet)
      Source: editors , concepts: words, mapping: manual
      Corpus-based Thesauri (e.g. co-occurrence)
      Source: corpus , concepts: words , mapping: automatic
      Ontology mapping (e.g. KeyConcept)
      Source: ontology , concepts: ontology node(s) , mapping: automatic
      Latent analysis (e.g. LSA, pLSA, LDA)
      Source: corpus , concepts: word distributions , mapping: automatic
      Insufficient granularity
      Non-intuitive Concepts
      Expensive repetitive computations
      Non-scalable solution
    9. Concept-based representations
      Human-edited Thesauri (e.g. WordNet)
      Source: editors , concepts: words, mapping: manual
      Corpus-based Thesauri (e.g. co-occurrence)
      Source: corpus , concepts: words , mapping: automatic
      Ontology mapping (e.g. KeyConcept)
      Source: ontology , concepts: ontology node(s) , mapping: automatic
      Latent analysis (e.g. LSA, pLSA, LDA)
      Source: corpus , concepts: word distributions , mapping: automatic
      Is it possible to devise a concept-based representation that is scalable, computationally feasible, and uses intuitive and granular concepts?
      Insufficient granularity
      Non-intuitive Concepts
      Expensive repetitive computations
      Non-scalable solution
    10. Explicit Semantic Analysis
      Gabrilovich and Markovitch (2005,2006,2007)
    11. Explicit Semantic Analysis (ESA)
      Wikipedia is viewed as an ontology - a collection of ~1M concepts
      World War II
      Panthera
      Jane Fonda
      Island
      concept
    12. Explicit Semantic Analysis (ESA)
      Wikipedia is viewed as an ontology - a collection of ~1M concepts
      Every Wikipedia article represents a concept
      Article words are associated with the concept (TF.IDF)
      Concept “Panthera”: Cat [0.92], Leopard [0.84], Roar [0.77]
    13. Explicit Semantic Analysis (ESA)
      Wikipedia is viewed as an ontology - a collection of ~1M concepts
      Every Wikipedia article represents a concept
      Article words are associated with the concept (TF.IDF)
      Concept “Panthera”: Cat [0.92], Leopard [0.84], Roar [0.77]
    14. Explicit Semantic Analysis (ESA)
      Wikipedia is viewed as an ontology - a collection of ~1M concepts
      Every Wikipedia article represents a concept
      Article words are associated with the concept (TF.IDF)
      Concept “Panthera”: Cat [0.92], Leopard [0.84], Roar [0.77]
      The semantics of a word is the vector of its associations with Wikipedia concepts
      Word “cat”: Cat [0.95], Panthera [0.92], Jane Fonda [0.07]
    15. Explicit Semantic Analysis (ESA)
      The semantics of a text fragment is the average vector (centroid) of the semantics of its words
      In practice – disambiguation…
      [Figure: illustrative concept vectors for the words “mouse” and “button” and for the fragment “mouse button”. Concepts shown: Mouse (computing), Mouse (rodent), Mickey Mouse, Game Controller, Button, Dick Button, Drag-and-drop, John Steinbeck.]
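The ESA steps described on slides 11 to 15 can be sketched on a toy scale: each article is a concept, each word gets a TF.IDF weight per concept, and a text fragment is the centroid of its words' vectors. The three "articles" below are invented for illustration (real ESA uses Wikipedia), and the exact TF.IDF variant used here (raw term frequency times log inverse document frequency) is an assumption.

```python
import math
from collections import Counter, defaultdict

# Hypothetical three-article knowledge source standing in for Wikipedia.
ARTICLES = {
    "Mouse (computing)": "mouse button click pointer computer button",
    "Mouse (rodent)": "mouse rodent tail cheese small mouse",
    "Button": "button clothing fastener shirt button",
}

def build_inverted_index(articles):
    """Map each word to {concept: TF-IDF weight}, i.e. the word's concept vector."""
    counts = {concept: Counter(text.split()) for concept, text in articles.items()}
    doc_freq = Counter(word for tf in counts.values() for word in tf)
    n = len(articles)
    index = defaultdict(dict)
    for concept, tf in counts.items():
        for word, freq in tf.items():
            index[word][concept] = freq * math.log(n / doc_freq[word])
    return index

def text_vector(text, index):
    """Centroid (average) of the concept vectors of the words in the text."""
    words = text.lower().split()
    total = defaultdict(float)
    for word in words:
        for concept, weight in index.get(word, {}).items():
            total[concept] += weight
    return {c: v / len(words) for c, v in total.items()} if words else {}

index = build_inverted_index(ARTICLES)
vec = text_vector("mouse button", index)
top_concept = max(vec, key=vec.get)
```

In this toy setup "mouse" alone leans towards the rodent article, but the fragment "mouse button" ranks Mouse (computing) first, because that is the only concept both words point to. This mirrors the disambiguation effect the slide refers to.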
    16. MORAG*: An ESA-based information retrieval algorithm
      *MORAG: Flail in Hebrew
      “Concept-based feature generation and selection for information retrieval”, AAAI-2008
    17. Enrich documents/queries
      ESA
      IR
      Query
      Constraint: use only the strongest concepts
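The constraint amounts to truncating each concept vector before indexing. A minimal sketch, assuming a vector is a dictionary from concept name to weight; the cutoff `k` and the weights below are hypothetical, not values from the talk.

```python
def strongest_concepts(vector, k):
    """Keep only the k highest-weighted concepts of a concept vector."""
    ranked = sorted(vector.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])

# Hypothetical weights for illustration only.
full = {"Shipwreck": 0.9, "Treasure": 0.8, "Key West, Florida": 0.3, "Tomb Raider II": 0.1}
pruned = strongest_concepts(full, 2)
```

Only the pruned vector would be written to the index, which keeps index size proportional to `k` per document rather than to the number of concepts in the knowledge source.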
    18. Problem: documents (in)coherence
      TREC document LA120790-0036
      REFERENCE BOOKS SPEAK VOLUMES TO KIDS;
      With the school year in high gear, it's a good time to consider new additions to children's home reference libraries…
      …Also new from Pharos-World Almanac: "The World Almanac InfoPedia," a single-volume visual encyclopedia designed for ages 8 to 16…
      …"The Doubleday Children's Encyclopedia," designed for youngsters 7 to 11, bridges the gap between single-subject picture books and formal reference books…
      …"The Lost Wreck of the Isis" by Robert Ballard is the latest adventure in the Time Quest Series from Scholastic-Madison Press ($15.95 hardcover). Designed for children 8 to 12, it tells the story of Ballard's 1988 discovery of an ancient Roman shipwreck deep in the Mediterranean Sea…
      Document is judged relevant for topic 411 due to one relevant passage in it
      Not an issue in BOW retrieval where words are indexed independently. How to deal with in concept-based?
      Concepts generated for this document will average to the books / children concepts, and lose the shipwreck mentions…
    19. Solution: split to passages
      ESA
      IR
      Query
      ConceptScore(d) = ConceptScore(full-doc) + max_{passage ∈ d} ConceptScore(passage)
      Index both full document and passages.
      Best performance achieved by fixed-length overlapping sliding windows.
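The passage scheme can be sketched as two pieces: cutting a document into fixed-length overlapping windows, and adding the best passage score to the full-document score. The use of cosine similarity between concept vectors, the window parameters, and all example weights are assumptions made for this sketch; the slide does not specify them.

```python
import math

def sliding_windows(words, size, step):
    """Fixed-length windows over a token list; step < size makes them overlap."""
    if len(words) <= size:
        return [words]
    starts = range(0, len(words) - size + step, step)
    return [words[s:s + size] for s in starts]

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors (dicts)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def concept_score(query_vec, doc_vec, passage_vecs):
    """Full-document score plus the score of the best-matching passage."""
    best_passage = max((cosine(query_vec, p) for p in passage_vecs), default=0.0)
    return cosine(query_vec, doc_vec) + best_passage

windows = sliding_windows(list(range(10)), size=4, step=2)

# Hypothetical vectors: the full document is dominated by a books concept,
# while one passage is clearly about a shipwreck.
query_vec = {"Shipwreck": 1.0}
doc_vec = {"Children's literature": 0.9, "Shipwreck": 0.1}
passage_vecs = [{"Encyclopedia": 1.0}, {"Shipwreck": 0.8, "Robert Ballard": 0.6}]
score = concept_score(query_vec, doc_vec, passage_vecs)
```

The full-document term contributes little here, but the second passage matches the query strongly, so the document still scores well. That is the behaviour wanted for documents like LA120790-0036.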
    20. Morag ranking
      Score(q,d) = λ·ConceptScore(q,d) + (1−λ)·KeywordScore(q,d)
      IR
      Query
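The ranking formula is a linear interpolation between the concept-based and keyword-based scores. A minimal sketch follows; the parameter name `weight`, its values, and the per-document scores are illustrative assumptions.

```python
def morag_score(concept_score, keyword_score, weight):
    """Linear interpolation of the concept-based and keyword-based scores."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * concept_score + (1.0 - weight) * keyword_score

# Hypothetical (concept score, keyword score) pairs per document id.
scores = {"LA071689-0089": (0.8, 0.0), "other-doc": (0.1, 0.5)}

def rank(scores, weight):
    """Document ids sorted by interpolated score, best first."""
    return sorted(scores, key=lambda d: morag_score(*scores[d], weight), reverse=True)

mixed = rank(scores, weight=0.5)
keyword_only = rank(scores, weight=0.0)
```

With `weight=0.0` the ranking reduces to pure keyword retrieval and the document with no keyword match drops to the bottom. With a non-zero weight the concept score can lift it above keyword-only matches.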
    21. ESA-based retrieval example
      • Shipwreck
      • Treasure
      • Maritime archaeology
      • Marine salvage
      • History of the British Virgin Islands
      • Wrecking (shipwreck)
      • Key West, Florida
      • Flotsam and jetsam
      • Wreck diving
      • Spanish treasure fleet
      “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."

      • Scuba diving
      • Wreck diving
      • RMS Titanic
      • USS Hoel (DD-533)
      • Shipwreck
      • Underwater archaeology
      • USS Maine (ACR-1)
      • Maritime archaeology
      • Tomb Raider II
      • USS Meade (DD-602)
      salvaging shipwreck treasure
    22. Problem: irrelevant docs retrieved
      • Estonia
      • Economy of Estonia
      • Estonia at the 2000 Summer Olympics
      • Estonia at the 2004 Summer Olympics
      • Estonia national football team
      • Estonia at the 2006 Winter Olympics
      • Baltic Sea
      • Eurozone
      • Tiit Vähi
      • Military of Estonia
      TREC topic #434
      Estonia economy
      “Olympic News In Brief: Cycling win for Estonia.
      Erika Salumae won Estonia's first Olympic gold when retaining the women's cycling individual sprint title she won four years ago in Seoul as a Soviet athlete. "

      • Estonia at the 2000 Summer Olympics
      • Estonia at the 2004 Summer Olympics
      • 2006 Commonwealth Games
      • Estonia at the 2006 Winter Olympics
      • 1992 Summer Olympics
      • Athletics at the 2004 Summer Olympics
      • 2000 Summer Olympics
      • 2006 Winter Olympics
      • Cross-country skiing 2006 Winter Olympics
      • New Zealand at the 2006 Winter Olympics
    23. “Economy” is not mentioned, but TF·IDF of “Estonia” is strong enough to trigger this concept on its own…
    24. Selection could remove noisy ESA concepts
      However, IR task provides no training data…
      Problem: selecting query features
      Focus on query concepts - Query is short and noisy, while FS at indexing lacks context
      Utility function U(+|-) requires target measure >> training set
      f = ESA(q) → Filter (U) → f’
    25. Solution: Pseudo Relevance Feedback
      Use BOW results as positive / negative examples
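One way to turn a keyword (BOW) ranking into example sets is sketched below. Taking positives from the top of the ranking and negatives from the bottom is an assumption of this sketch, as are the function name, the set sizes, and the document ids.

```python
def pseudo_relevance_sets(ranked_doc_ids, n_pos, n_neg):
    """Split a BOW ranking into pseudo-positive (top) and pseudo-negative (bottom) examples."""
    if n_pos + n_neg > len(ranked_doc_ids):
        raise ValueError("not enough ranked documents for disjoint example sets")
    positives = list(ranked_doc_ids[:n_pos])
    negatives = list(ranked_doc_ids[len(ranked_doc_ids) - n_neg:])
    return positives, negatives

# Hypothetical BOW ranking, best match first.
ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]
positives, negatives = pseudo_relevance_sets(ranking, n_pos=2, n_neg=2)
```

No user labels are involved: the system simply trusts the keyword ranking, which is less accurate than true feedback but supplies the training signal that feature selection needs.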
    26. ESA feature selection methods
      IG (filter) – calculate each feature’s Information Gain in separating positive and negative examples, take best performing features
      RV (filter) – add concepts in the positive examples to candidate features, and re-weight all features based on their weights in examples
      IIG (wrapper) – find subset of features that best separates positive and negative examples, employing heuristic search
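The IG filter can be sketched as follows, treating each query concept as a binary feature ("does this concept appear in the example's concept set?"). This is a textbook information-gain computation under that assumption, not necessarily the exact procedure of the paper; the example sets are hypothetical and reuse concept names from the Estonia example.

```python
import math

def entropy(pos, neg):
    """Binary entropy (in bits) of a set with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(concept, positives, negatives):
    """Reduction in label entropy from splitting the examples on presence of the concept."""
    total = len(positives) + len(negatives)
    if total == 0:
        return 0.0
    pos_with = sum(concept in ex for ex in positives)
    neg_with = sum(concept in ex for ex in negatives)
    pos_without = len(positives) - pos_with
    neg_without = len(negatives) - neg_with
    with_n = pos_with + neg_with
    conditional = (with_n / total) * entropy(pos_with, neg_with) \
        + ((total - with_n) / total) * entropy(pos_without, neg_without)
    return entropy(len(positives), len(negatives)) - conditional

def select_by_ig(query_concepts, positives, negatives, k):
    """Keep the k query concepts with the highest information gain."""
    ranked = sorted(query_concepts,
                    key=lambda c: information_gain(c, positives, negatives),
                    reverse=True)
    return ranked[:k]

# Hypothetical concept sets of pseudo-positive and pseudo-negative documents.
positives = [{"Economy of Estonia", "Estonia", "Eurozone"}, {"Economy of Estonia", "Estonia"}]
negatives = [{"Estonia", "Estonia at the 2000 Summer Olympics"}, {"Estonia", "Baltic Sea"}]
query_concepts = ["Estonia", "Economy of Estonia", "Estonia at the 2000 Summer Olympics"]
selected = select_by_ig(query_concepts, positives, negatives, k=1)
```

"Estonia" appears in every example and so carries no information, while "Economy of Estonia" separates the two sets perfectly. Note that plain information gain is symmetric: a concept found only in negative examples also scores above zero, so a practical filter may additionally require that a selected concept be more frequent among the positives.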
      • Estonia
      • Economy of Estonia
      • Estonia at the 2000 Summer Olympics
      • Estonia at the 2004 Summer Olympics
      • Estonia national football team
      • Estonia at the 2006 Winter Olympics
      • Baltic Sea
      • Eurozone
      • Tiit Vähi
      • Military of Estonia
      ESA-based retrieval – FS example
      • Monetary Policy
      • Euro
      • Economy of Europe
      • Nordic Countries
      • Prime Minister of Estonia
      Estonia economy
      “Olympic News In Brief: Cycling win for Estonia.
      Erika Salumae won Estonia's first Olympic gold when retaining the women's cycling individual sprint title she won four years ago in Seoul as a Soviet athlete. "





      [Figure annotations: broad features; noise features; RV adds features; useful ones “bubble up”]




      • Neoliberalism
      • Estonia at the 2000 Summer Olympics
      • Estonia at the 2004 Summer Olympics
      • 2006 Commonwealth Games
      • Estonia at the 2006 Winter Olympics
      • 1992 Summer Olympics
      • Athletics at the 2004 Summer Olympics
      • 2000 Summer Olympics
      • 2006 Winter Olympics
      • Cross-country skiing 2006 Winter Olympics
      • New Zealand at the 2006 Winter Olympics
    27. Morag evaluation
      Testing over TREC-8 and Robust-04 datasets (528K documents, 50 web-like queries)
      Feature selection is highly effective
    28. Morag evaluation
      Significant performance improvement, over our own baseline and also over top performing TREC-8 BOW baselines
      Concept-based performance by itself is quite low. A major reason is the TREC ‘pooling’ method, which implies that relevant documents found only by Morag will not be judged as such…
    29. Morag evaluation
      Optimal (“Oracle”) selection analysis shows much more potential for Morag
    30. Morag evaluation
      Pseudo-relevance proves to be a good approximation of actual relevance
    31. Conclusion
      Morag: a new methodology for concept-based information retrieval
      Documents and query are enhanced by Wikipedia concepts
      Informative features are selected using pseudo-relevance feedback
      The generated features improve the performance of BOW-based systems
    32. Thank you!

    Ofer Egozi, 4 months ago


    My master's thesis seminar at the Technion, summari more
