A data driven approach to query expansion in question answering



Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval (IR) at an initial stage, and IR performance, provided by engines such as Lucene, places a bound on overall system performance. For example, for almost 40% of questions, no answer-bearing documents appear among the top-ranked results.

As part of an investigation, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method. Data-driven extension words were found to help in over 70% of difficult questions.

These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.

  • The structure of this talk: examine the IR performance cap by trying out a few different IR engines, and by working out which are the toughest questions.
  • We used a simple, linear, clearly defined QA system built at Sheffield that has been entered into previous TREC QA track conferences. There are three steps: processing the question, including anaphora resolution and perhaps dealing with targets in a question series; performing IR to get texts relevant to the question; and using logic to extract a suitable answer from the retrieved texts. Any failure early on caps the performance of later components, which gave us a need to assess performance.
  • Coverage – the proportion of questions for which the IR engine returns at least one document containing the answer. Redundancy – the number of answer-containing documents found per question. TREC gives answers after each competition: a list of expressions that match answers, and the IDs of documents that judges have found useful. Due to the size of the corpora, these aren’t comprehensive lists, so it’s easy to get a false negative for (say) redundancy when a document turns up that’s actually helpful but was not assessed by TREC. We can match documents in two ways: lenient, where the answer text is found (though the context may be completely wrong), and strict, where the retrieved document not only contains the answer text but is also one that TREC judges marked as helpful.
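Under these definitions, both metrics can be computed directly from run output. A minimal sketch, assuming per-question ranked document lists; the function and variable names here are illustrative, not taken from the system described:

```python
from typing import Dict, List, Set, Tuple

def coverage_and_redundancy(
    retrieved: Dict[str, List[str]],  # question id -> ranked retrieved doc ids
    relevant: Dict[str, Set[str]],    # question id -> judged answer-bearing doc ids
) -> Tuple[float, float]:
    """Coverage: fraction of questions with at least one answer-bearing
    document retrieved. Redundancy: mean number of answer-bearing
    documents retrieved per question."""
    hits_per_q = [
        sum(1 for d in retrieved.get(q, []) if d in relevant[q])
        for q in relevant
    ]
    coverage = sum(1 for h in hits_per_q if h > 0) / len(hits_per_q)
    redundancy = sum(hits_per_q) / len(hits_per_q)
    return coverage, redundancy
```

With strict matching, `relevant` would hold only TREC-judged document IDs; with lenient matching, any document whose text matches an answer pattern.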
  • Because of the necessarily linear system design, IR component problems limit answer extraction (AE). If we can’t provide any useful documents, we’ve little chance of getting the right answer out. Paragraph vs. document level retrieval: paragraph retrieval introduces less noise, but is harder to do well; document level gives a huge coverage gain, but then chokes answer extraction. Work with the AE component found that about 20 paragraphs was right.
  • Coverage is between half and two-thirds. Rarely are more than one in twenty retrieved documents actually useful.
  • Is the problem with our IR implementation? Could another generic component work? We tested a few options. Which questions are tripping us up? Do they have common factors – grammatical features, expected answer type (a person’s name, a date, a number) – is one particular group failing? How can we tap into these tough questions?
  • Used Java. Scripted runs in a given environment – e.g. the number of documents to be retrieved – then post-processed the results of retrievals to score them by a variety of metrics.
  • No noticeable performance changes happened with alternative generic IR components. Alternatives seem slightly worse than the original in this configuration. Tuning generic IR parameters seems unlikely to yield large QA performance boosts.
  • There’s still a large body of difficult questions, and many are uniformly tough. If we’re to examine a concrete set of harder questions, a definition is required. An average redundancy measure, derived from multiple engines and configurations (e.g. paragraph/document, lenient/strict), is worked out for every question. All questions with average redundancy below a threshold are deemed difficult. A threshold as low as zero still provides a good sample to work with.
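The difficult-question filter described above reduces to averaging per-run redundancy and thresholding. A minimal sketch with illustrative names; note that at a threshold of zero the comparison must be "at or below", since redundancy is never negative:

```python
from typing import Dict, List, Set

def difficult_questions(
    per_run_redundancy: Dict[str, List[float]],  # question id -> one score per engine/config
    threshold: float = 0.0,
) -> Set[str]:
    """A question is 'difficult' when its redundancy, averaged over
    multiple engines and configurations (e.g. Lucene/Indri/Terrier,
    strict/lenient), does not exceed the threshold."""
    return {
        q for q, scores in per_run_redundancy.items()
        if sum(scores) / len(scores) <= threshold
    }
```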
  • To work out how these difficult questions should be answered, we consulted the TREC answer lists: details of useful documents, and regular expressions for each answer. Any unanswerable questions were removed from the difficult list.
  • Once we know where the answers are – the documents that contain them, and the paragraphs inside those documents – we can examine surrounding words for context. Using these words as extensions may improve coverage. How do we find out if this is true, and which words help?
  • Stick to the usable set of questions. The original question (OQ) is readily available for comparison. The OQ also acts as a canary for validating the IR parameters of a run – if its performance isn’t below the difficult question threshold, something’s gone wrong.
  • We started out by looking at questions that were impossible for answer extraction because no texts were found for them at the IR stage. All extension words that bring useful documents to the fore are considered useful. Three-quarters of tough questions can be made accessible by query extension with context-based words. This shows a possibility for significantly lifting the limit on AE performance.
  • Adding the name of the capital of the country in question immediately brought useful documents up. Adding the name of the country alongside its adjective also helped.
  • Adding these military-type words is helpful. Also helpful is adding a term related to events in the target’s past. This unit may not have fared so well during the period covered by the news articles in the corpora – decimated!
  • Pertainyms – variations on the part of speech of a location, e.g. the adjective describing something from a country, or its title. Greenwood (2004) investigates relations between these pertainyms and their effects on search performance. “Col” and “Gen” both brought the answers up from the index, but such titles are excluded by some stoplists, so we brought in a whitelist of words to make sure these were made into extension candidates. “Military” was also helpful in the 82nd Airborne question.
  • Now we have a set of words that a perfect expansion algorithm should provide. Comparing these with an expansion algorithm’s output eliminates the need to re-run IR for the whole set of expansion candidates. Such runs sometimes took over half a day on a reasonable system, so the time saving is considerable.
  • Basic relevance feedback executes an initial information retrieval, and uses features of it to pick expansion terms. We chose a term-frequency based measure, selecting common words from the initially retrieved texts (IRTs). The number of documents examined to find expansion words is ‘r’.
  • We used a trio of metrics. First, the coverage of the terms found in IRTs over the available set of helpful words. Next, the proportion of IRTs that contain any useful words at all; for example, when retrieving 20 documents, if only one has any helpful words, this metric is 5%. Finally, the intersection between words chosen for relevance feedback and those that are actually helpful gives a direct evaluation of the extension algorithm.
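The trio of metrics can be sketched in a few lines. This is an illustrative simplification that assumes the texts are already tokenised; all names are my own, not from the system described:

```python
from typing import List, Sequence, Tuple

def rf_metrics(
    irt_texts: List[List[str]],   # tokenised initially retrieved texts
    rf_terms: Sequence[str],      # expansion terms chosen by relevance feedback
    helpful_words: Sequence[str], # known-helpful data-driven extension words
) -> Tuple[float, float, float]:
    irt_vocab = {w for text in irt_texts for w in text}
    helpful = set(helpful_words)
    # 1. share of all helpful words that occur anywhere in the IRTs
    found = len(irt_vocab & helpful) / len(helpful)
    # 2. share of IRTs containing at least one helpful word
    docs_with = sum(1 for t in irt_texts if helpful & set(t)) / len(irt_texts)
    # 3. share of chosen RF terms that are actually helpful (the key statistic)
    rf_helpful = len(set(rf_terms) & helpful) / len(rf_terms)
    return found, docs_with, rf_helpful
```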
  • Examined an initial retrieval to see how helpful the IR data could be. Not many of the helpful words occurred (under 20%). Only around a third of documents contained any useful words – the rest only provided noise. The single-figure percentages for the intersection between the extensions used and the helpful words give a negative outlook for term-frequency based RF. Finally, adding massive amounts of noise – up to 98% of the added terms for the 2004 set – will push helpful documents out.
  • Testing this particular relevance feedback method shows that, as predicted by the very low occurrence of helpful words in the extensions, performance was low – in fact, consistently lower than when using no query extension at all, due to the excess noise introduced. This supports the hypothesis that TF-based RF is not helpful in IR for QA.
  • The particular implementation, using default configurations of general-purpose IR engines, isn’t too important. We can now predict how well an extension algorithm will work without performing a full retrieval. Term-frequency based relevance feedback, in the circumstances described, cannot help IR for QA. There are linguistic relationships between query terms and useful query expansions that, with further work, can be exploited to raise coverage.

    1. 1. A Data Driven Approach to Query Expansion in Question Answering. Leon Derczynski, Robert Gaizauskas, Mark Greenwood and Jun Wang. Natural Language Processing Group, Department of Computer Science, University of Sheffield, UK
    2. 2. Summary Introduce a system for QA Find that its IR component limits system performance  Explore alternative IR components  Identify which questions cause IR to stumble Using answer lists, find extension words that make these questions easier Show how knowledge of these words can rapidly accelerate the development of query expansion methods Show why one simple relevance feedback technique cannot improve IR for QA
    3. 3. How we do QA Question answering system follows a linear procedure to get from question to answers  Pre-processing  Text retrieval  Answer Extraction Performance at each stage affects later results
    4. 4. Measuring QA Performance Overall metrics  Coverage  Redundancy TREC provides answers  Regular expressions for matching text  IDs of documents deemed helpful Ways of assessing correctness  Lenient: the document text contains an answer  Strict: further, the document ID is listed by TREC
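The lenient/strict distinction on slide 4 can be sketched as a small judging function. A minimal sketch, assuming TREC answer patterns are regular expressions and the judged-helpful document IDs are available as a set (names are illustrative):

```python
import re
from typing import Iterable, Set, Tuple

def judge_document(
    doc_id: str,
    doc_text: str,
    answer_patterns: Iterable[str],  # TREC answer regexes for the question
    judged_doc_ids: Set[str],        # doc IDs that TREC judges deemed helpful
) -> Tuple[bool, bool]:
    """Lenient: the document text matches an answer pattern.
    Strict: additionally, the document ID is on TREC's judged list."""
    lenient = any(re.search(p, doc_text) for p in answer_patterns)
    strict = lenient and doc_id in judged_doc_ids
    return lenient, strict
```

A lenient match with a strict miss is exactly the "answer text in the wrong context" case the notes describe.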
    5. 5. Assessing IR Performance Low initial system performance Analysed each component in the system  Question pre-processing correct  Coverage and redundancy checked in IR part
    6. 6. IR component issues Only 65% of questions generate any text to be prepared for answer extraction IR failings cap the entire system performance Need to balance the amount of information retrieved for AE Retrieving more text boosts coverage, but also introduces excess noise
    7. 7. Initial performance Lucene statistics:
       Question year   Coverage   Redundancy
       2004            63.6%      1.62
       2005            56.6%      1.15
       2006            56.8%      1.18
       Using strict matching, at paragraph level
    8. 8. Potential performance inhibitors IR Engine  Is Lucene causing problems?  Profile some alternative engines Difficult questions  Identify which questions cause problems  Examine these:  Common factors  How can they be made approachable?
    9. 9. Information Retrieval Engines AnswerFinder uses a modular framework, including an IR plugin for Lucene Indri and Terrier are two public domain IR engines, which have both been adapted to perform TREC tasks  Indri – based on the Lemur toolkit and INQUERY engine  Terrier – developed in Glasgow for dealing with terabyte corpora Plugins are created for Indri and Terrier, which are then used as replacement IR components Automated testing of overall QA performance done using multiple IR engines
    10. 10. IR Engine performance
        Engine    Coverage   Redundancy
        Indri     55.2%      1.15
        Lucene    56.8%      1.18
        Terrier   49.3%      1.00
        With n=20; strict retrieval; TREC 2006 question set; paragraph-level texts.
        • Performance between engines does not seem to vary significantly
        • Non-QA-specific IR engine tweaking possibly not a great avenue for performance increases
    11. 11. Identification of difficult questions Coverage of 56.8% indicates that for over 40% of questions, no documents are found. Some questions are difficult for all engines How to define a “difficult” question? Calculate average redundancy (over multiple engines) for each question in a set Questions with average redundancy less than a certain threshold are deemed difficult A threshold of zero is usually enough to find a sizeable dataset
    12. 12. Examining the answer data TREC answer data provides hints to what documents an IR engine ideal for QA should retrieve  Helpful document lists  Regular expressions of answers Some questions are marked by TREC as having no answer; these are excluded from the difficult question set
    13. 13. Making questions accessible Given the answer bearing documents and answer text, it’s easy to extract words from answer-bearing paragraphs For example, where the answer is “baby monitor”: The inventor of the baby monitor found this device almost accidentally These surrounding words may improve coverage when used as query extensions How can we find out which extension words are most helpful?
    14. 14. Rebuilding the question set Only use answerable difficult questions For each question:  Add original question to the question set as a control  Find target paragraphs in “correct” texts  Build a list of all words in that paragraph, except: answers, stop words, and question words  For each word:  Create a sub-question which consists of the original question, extended by that word
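The rebuild procedure on slide 14 can be sketched as follows. This is an illustrative simplification that tokenises by whitespace; the real system's tokenisation and stop list will differ:

```python
from typing import List, Sequence

def build_subquestions(
    question: str,
    answer_paragraph: str,       # a known answer-bearing paragraph
    answers: Sequence[str],      # answer strings, excluded as extensions
    stopwords: Sequence[str],
) -> List[str]:
    """Emit the original question (as a control), then one sub-question
    per candidate extension word drawn from the answer-bearing paragraph,
    excluding answers, stop words, and words already in the question."""
    q_words = set(question.lower().split())
    banned = (
        q_words
        | {w.lower() for w in stopwords}
        | {w.lower() for a in answers for w in a.split()}
    )
    candidates = {w for w in answer_paragraph.lower().split() if w not in banned}
    return [question] + [f"{question} {w}" for w in sorted(candidates)]
```

For a question in a series, the target string would be appended to the question first (Q + T + E), as in the example on slide 15.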
    15. 15. Rebuilding the question set. Example: Single factoid question: Q + E  How tall is the Eiffel tower? + height Question in a series: Q + T + E  Where did he play in college? + Warren Moon + NFL
    16. 16. Do data-driven extensions help? Base performance is at or below the difficult question threshold (typically zero) Any extension that brings performance above zero is deemed a “helpful word” From the set of difficult questions, 75% were made approachable by using a data-driven extension If we can add these terms accurately to questions, the cap on answer extraction performance is raised
    17. 17. Do data-driven extensions help? Question Where did he play in college? Target Warren Moon Base redundancy is zero Extensions  Football Redundancy: 1  NFL Redundancy: 2.5 Adding some generic related words improves performance
    18. 18. Do data-driven extensions help? Question Who was the nominal leader after the overthrow? Target Pakistani government overthrown in 1999 Base redundancy is zero Extensions  Islamabad Redundancy: 2.5  Pakistan Redundancy: 4  Kashmir Redundancy: 4 Location based words can raise redundancy
    19. 19. Do data-driven extensions help? Question Who have commanded the division? Target 82nd Airborne Division Base redundancy is zero Question expects a list of answers Extensions  Col Redundancy: 2  Gen Redundancy: 3  officer Redundancy: 1  decimated Redundancy: 1 The proper names for ranks help; this can be hinted at by “Who” Events related to the target may suggest words Possibly not a victorious unit!
    20. 20. Observations on helpful words Inclusion of pertainyms has a positive effect on performance, agreeing with more general observations in Greenwood (2004) Army ranks stood out highly Use of an always-include list Some related words help, though there’s often no deterministic relationship between them and the questions
    21. 21. Measuring automated expansion Known helpful words are also the target set of words that any expansion method should aim for Once the target expansions are known, measuring automated expansion becomes easier No need to perform IR for every candidate expanded query (some runs over AQUAINT took up to 14 hours on a 4-core 2.3GHz system) Rapid evaluation permits faster development of expansion techniques
    22. 22. Relevance feedback in QA Simple RF works by using features of an initial retrieval to alter a query We picked the highest frequency words in the “initially retrieved texts”, and used them to expand a query The size of the IRT set is denoted r Previous work (Monz 2003) looked at relevance feedback using a small range of values for r Different sizes of initial retrievals are used, between r=5 and r=50
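The TF-based relevance feedback on slide 22 reduces to counting terms in the top r initially retrieved texts and appending the most frequent ones to the query. A minimal sketch with illustrative names, not the system's actual implementation:

```python
from collections import Counter
from typing import FrozenSet, List

def tf_relevance_feedback(
    query: str,
    initially_retrieved: List[str],  # texts from the initial retrieval, in rank order
    r: int = 5,                      # size of the IRT set used for feedback
    n_terms: int = 3,                # number of expansion terms to append
    stopwords: FrozenSet[str] = frozenset(),
) -> str:
    """Blind RF: count term frequencies over the top r initially retrieved
    texts and expand the query with the most frequent non-query,
    non-stopword terms."""
    query_words = set(query.lower().split())
    counts = Counter(
        w
        for text in initially_retrieved[:r]
        for w in text.lower().split()
        if w not in stopwords and w not in query_words
    )
    expansion = [w for w, _ in counts.most_common(n_terms)]
    return f"{query} {' '.join(expansion)}".strip()
```

Because the expansion terms come purely from frequency, any noise in the initial retrieval flows straight into the query, which is the failure mode analysed on the next two slides.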
    23. 23. Rapidly evaluating RF Three metrics show how a query expansion technique performs:  Percentage of all helpful words found in IRT  This shows the intersection between words in initially retrieved texts, and the helpful words.  Percentage of texts containing helpful words  If this is low, then the IR system does not retrieve many documents containing helpful words, given the initial query  Percentage of expansion terms that are helpful  This is a key statistic; the higher this is, the better performance is likely to be
    24. 24. Relevance feedback predictions RF selects some words to be added on to a query, based on an initial search.
                                        2004     2005     2006
        Helpful words found in IRT      4.2%     18.6%    8.9%
        IRT containing helpful words    10.0%    33.3%    34.3%
        RF words that are “helpful”     1.25%    1.67%    5.71%
        Less than 35% of the documents used in relevance feedback actually contain helpful words. Picking helpful words out from initial retrievals is not easy when there’s so much noise. Due to the small probability of adding helpful words, relevance feedback is unlikely to make difficult questions accessible, and adding noise to the query will drown out otherwise helpful documents for non-difficult questions.
    25. 25. Relevance feedback results
        Coverage at n docs   r=5      r=50     Baseline
        10                   34.7%    28.4%    43.4%
        20                   44.4%    39.8%    55.3%
        Only 1.25% - 5.71% of the words that relevance feedback chose were actually helpful; the rest only add noise Performance using TF-based relevance feedback is consistently lower than the baseline Hypothesis of poor performance is supported
    26. 26. Conclusions IR engine performance for QA does not vary wildly Identifying helpful words provides a tool for assessing query expansion methods TF-based relevance feedback cannot be generally effective in IR for QA Linguistic relationships exist that can help in query expansion
    27. 27. Any questions?