From TREC to Watson: is open domain question answering a solved problem?
Constantin Orasan
Research Group in Computational Linguistics, University of Wolverhampton, UK
http://www.wlv.ac.uk/~in6093/
Structure of the talk
4 July 2011, Constantin Orasan - KEPT 2011
- Brief introduction to QA
- Video 1: Where are we now – IBM Watson
- The structure of a QA system
- Video 2: Watson vs. humans
- Overview of Watson
- QA from the point of view of users/companies
- Conclusions
Information overload
“Getting information off the Internet is like taking a drink from a fire hydrant” – Mitchell Kapor
What is question answering?
- A way to address the problem of information overload
- Question answering aims at identifying the answer to a question posed in natural language in a large collection of documents
- The information provided by QA is more focused than information retrieval
- The output can be the exact answer or a text snippet which contains the answer
- The domain took off as a result of the introduction of the QA track in TREC, whilst cross-lingual QA took off as a result of CLEF
Types of QA systems
- Open-domain QA systems: can answer any question from any collection
  (+) can potentially answer any question
  (-) very low accuracy (especially in cross-lingual settings)
- Canned QA systems: rely on a very large repository of questions for which the answer is known
  (+) very little language processing necessary
  (-) limited to the answers in the database
- Closed-domain QA systems: are built for very specific domains and exploit expert knowledge in them
  (+) very high accuracy
  (-) can require extensive language processing and are limited to one domain
Evolution of the QA domain
- Early QA systems date back to the 1960s and were mainly front ends to databases
  - had limited usability
- Open-domain QA emerged as a result of the increasing amount of data available
  - to answer a question, the system needs to find and extract the answer
  - developed in the late 1990s as a result of the QA track at the Text REtrieval Conference (TREC)
  - emphasis on factoid questions, but other types of questions were also explored
- CLEF competitions have encouraged the development of cross-lingual systems
Where are we now?
- IBM and the Jeopardy! Challenge
- Jeopardy! is an American quiz show where participants are given clues and need to guess the question (e.g. if the clue is “The Father of Our Country; he didn't really chop down a cherry tree”, the contestant would respond “Who is George Washington?”)
- Watson is a QA system developed by IBM
- http://www.youtube.com/watch?v=FC3IryWr4c8
Structure of an open domain QA system
A typical open domain QA system consists of:
- Question processor
- Document processor
- Answer extractor (and validation)
It can have components for cross-lingual processing and has access to several external resources.
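The three-stage pipeline above can be sketched in a few lines of Python. All function names, the stopword list and the stub logic are illustrative, not taken from any actual system:

```python
# Minimal sketch of the classic three-stage QA pipeline: question processing,
# document retrieval, answer extraction. Real systems plug full NLP components
# into each stage; here each stage is a deliberately crude stand-in.

STOPWORDS = {"who", "is", "the", "of", "in", "a", "what", "where"}

def process_question(question):
    """Question processor: derive a keyword query (EAT detection omitted)."""
    return [w for w in question.rstrip("?").split() if w.lower() not in STOPWORDS]

def retrieve_passages(query, collection):
    """Document processor: keep passages containing any query keyword."""
    return [p for p in collection if any(w.lower() in p.lower() for w in query)]

def extract_answer(passages):
    """Answer extractor: stubbed to return the best passage as a snippet."""
    return passages[0] if passages else None

def answer(question, collection):
    return extract_answer(retrieve_passages(process_question(question), collection))
```

Returning a snippet rather than an exact answer mirrors the output options mentioned earlier: a real extractor would narrow the passage down to an entity of the expected answer type.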
Question processor
- Produces an interpretation of the question
  - Determines the Question Type (e.g. factoid, definition, procedure, etc.)
  - Determines the Expected Answer Type (EAT)
- On the basis of the question it produces a query
  - Determines syntactic and semantic relations between the words of the question
  - Expands the query with synonyms
- May perform translation of the keywords in the query in the case of cross-lingual QA
Expected answer type calculation
- Relies on the existence of an answer type taxonomy
- This taxonomy can be made open-domain by linking it to general ontologies such as WordNet
- The EAT can be determined using rule-based as well as machine learning approaches
  - Who is the president of Romania?
  - Where is Paris?
- Knowledge of the domain can greatly improve the identification of the EAT and help deal with ambiguities
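A rule-based EAT classifier of the kind mentioned above can be as simple as matching question prefixes. The taxonomy labels and rules below are a toy illustration; real systems use a much richer taxonomy, often linked to WordNet, or a machine-learned classifier:

```python
# Toy rule-based Expected Answer Type (EAT) classifier: the wh-word of the
# question selects a node from a (here: flat, invented) answer type taxonomy.

EAT_RULES = [
    ("who", "PERSON"),
    ("where", "LOCATION"),
    ("when", "DATE"),
    ("how many", "NUMBER"),
    ("what", "OTHER"),  # 'what' is highly ambiguous without domain knowledge
]

def expected_answer_type(question):
    q = question.lower().strip()
    for prefix, eat in EAT_RULES:
        if q.startswith(prefix):
            return eat
    return "UNKNOWN"
```

The ambiguity of 'what' questions is exactly where the domain knowledge mentioned on the slide helps: in a closed domain, 'what' can be mapped to a small set of concepts.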
Query formulation
- Produces a query from the question
  - as a list of keywords
  - as a list of phrases
- Identifies entities present in the question
- Produces variants of the query by introducing morphological, lexical and semantic variations
- Domain knowledge is very important for the identification of entities and the generation of valid variations, and vital in cross-lingual scenarios
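Variant generation can be sketched as follows. The tiny synonym table stands in for a lexical resource such as WordNet, and the suffix rule stands in for a real morphological analyser; both are illustrative only:

```python
# Sketch of query variant generation: expand each keyword with lexical
# (synonym) and crude morphological alternatives, then combine them into
# alternative queries.

from itertools import product

SYNONYMS = {"movie": ["film"], "buy": ["purchase"]}  # invented sample entries

def word_variants(word):
    variants = [word] + SYNONYMS.get(word, [])
    if word.endswith("s"):               # crude singular/plural variation
        variants.append(word[:-1])
    return variants

def query_variants(keywords):
    """All combinations of per-keyword variants, as alternative queries."""
    return [list(combo) for combo in product(*(word_variants(w) for w in keywords))]
```

The slide's caveat applies directly: without domain knowledge such blind expansion produces invalid variants, which is why entity identification matters before expansion.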
Document processing
- Uses the query produced in the previous step to retrieve paragraphs which may contain the answer
- It is largely domain independent as it relies on text retrieval engines
- Ranks results, but this is largely independent of the QA task
- For limited collections of texts it is possible to enrich the index with various linguistic information which can help further processing
- When the domain is known, characteristics of the input files can improve the retrieval (e.g. presence of metadata)
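A minimal version of this retrieval step scores each paragraph by keyword overlap with the query. Real systems delegate this to a retrieval engine with tf-idf or BM25 weighting; this sketch only illustrates the rank-and-cut shape of the component:

```python
# Minimal passage retrieval: score paragraphs by query keyword overlap and
# return the top-ranked ones that match at all.

def rank_passages(query, paragraphs, top_k=3):
    terms = {w.lower() for w in query}
    scored = [(sum(w in terms for w in p.lower().split()), p) for p in paragraphs]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for score, p in scored[:top_k] if score > 0]
```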
Answer extraction
- Uses a variety of techniques to identify the answer to a question
- The answer should have the type of the EAT
- Very often relies on previously created patterns (e.g. When was the telephone invented? can be answered if there is a sentence that matches the pattern The telephone was invented in <date>)
- Many patterns can express the same answer (e.g. the telephone, invented in <date>)
- Relations identified in the question between the expected answer and entities from the question can be exploited by patterns
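The telephone example above maps directly onto regular expressions: the <date> slot becomes a capture group, and both surface patterns fill the same answer slot. A minimal sketch:

```python
# Pattern-based answer extraction for "When was the telephone invented?":
# several surface patterns, one answer slot (the <date> capture group).

import re

PATTERNS = [
    re.compile(r"The telephone was invented in (\d{4})"),
    re.compile(r"the telephone, invented in (\d{4})"),
]

def extract_date_answer(sentence):
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return match.group(1)
    return None
```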
Answer extraction (II)
- Potential answers are ranked according to functions which are usually learned from the data
- The ranking and validation of answers can be done using external sources such as the Internet
- QA for well defined domains can rely on better patterns
- The learned functions usually work well only on the type of data used for training
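A common form for such a ranking function is a weighted combination of candidate features. In a real system the weights are learned from training data; the feature names and values below are invented for illustration:

```python
# Sketch of answer ranking: each candidate carries a feature vector (EAT type
# match, pattern confidence, redundancy across sources) and is scored by a
# weighted sum. The weights here are illustrative, not learned.

WEIGHTS = {"type_match": 2.0, "pattern_score": 1.5, "redundancy": 0.5}

def score_candidate(features):
    return sum(WEIGHTS[name] * value for name, value in features.items())

def rank_candidates(candidates):
    """candidates: list of (answer_string, feature_dict) pairs."""
    return [ans for ans, feats in
            sorted(candidates, key=lambda c: score_candidate(c[1]), reverse=True)]
```

The slide's last point shows up here too: weights fitted to one collection transfer poorly to another, which is one reason results across evaluations are hard to compare.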
Open domain QA - evaluation
- Great coverage, but low accuracy. For example:
  - The EPHYRA QA system at TREC 2007 reports an accuracy of 0.20 for factoid questions (Schlaefer et al. 2007)
  - OpenEphyra was used for a cross-lingual Romanian–English QA system and we obtained 0.11 accuracy for factoid questions (Dornescu et al. 2008) – the best performing system for all cross-lingual QA tasks in CLEF 2008
- The results are not directly comparable (different QA engines, tuned differently, different collections, different tasks)
- But does it make sense to do open domain question answering?
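The accuracy figures quoted above are simply the fraction of questions whose top-ranked answer is judged correct:

```python
# Accuracy as used in the TREC/CLEF factoid figures: proportion of questions
# for which the system's top answer matches the gold answer.

def accuracy(system_answers, gold_answers):
    assert len(system_answers) == len(gold_answers)
    correct = sum(s == g for s, g in zip(system_answers, gold_answers))
    return correct / len(gold_answers)
```

(In the actual evaluations, correctness is decided by human assessors against answer keys rather than by exact string match.)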
How did Watson perform?
http://www.youtube.com/watch?v=Puhs2LuO3Zc
How was this achieved?
- The starting point was the Practical Intelligent Question Answering Technology (PIQUANT) developed by IBM to participate in TREC
  - had been under development at IBM for more than 6 years by a team of 4 full-time researchers
  - was one of the top three to five systems in many TRECs
  - was performing at around 0.33 accuracy on the TREC data
  - used a standard architecture for QA
How was this achieved? (II)
- Lots of extra work was put into the system: a core team of 20 researchers working for almost 4 years
- The PIQUANT system was enriched with a large number of modules for language processing
- The processing was parallelised heavily
- Lots of components were developed to deal with specific problems (lots of experts)
- Watson tries to combine deep and shallow knowledge
- It had access to large data sets and very good hardware
Overview of Watson’s structure
Hardware used
“Watson is a workload optimized system designed for complex analytics, made possible by integrating massively parallel POWER7 processors and the IBM DeepQA software to answer Jeopardy! questions in under three seconds. Watson is made up of a cluster of ninety IBM Power 750 servers (plus additional I/O, network and cluster controller nodes in 10 racks) with a total of 2880 POWER7 processor cores and 16 Terabytes of RAM. Each Power 750 server uses a 3.5 GHz POWER7 eight-core processor, with four threads per core. The POWER7 processor's massively parallel processing capability is an ideal match for Watson's IBM DeepQA software, which is embarrassingly parallel (that is, a workload that is easily split up into multiple parallel tasks).
According to John Rennie, Watson can process 500 gigabytes, the equivalent of a million books, per second. IBM's master inventor and senior consultant Tony Pearson estimated Watson's hardware cost at about $3 million and that with 80 TeraFLOPs it would be placed 94th on the Top 500 Supercomputers list.”
From: http://en.wikipedia.org/wiki/Watson_(computer)
Speed of answer
- In Jeopardy! an answer needs to be provided in 3-5 seconds
- In initial experiments running Watson on a single processor, an answer was obtained in about 2 hours
- The system was implemented using Apache UIMA Asynchronous Scaleout
- Massively parallel architecture
- Indexes used to answer the questions had to be pre-processed using Hadoop
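The workload is embarrassingly parallel because each candidate answer can be scored independently. A minimal fan-out/merge sketch of that pattern (the scoring function is a stand-in for the expensive evidence scoring a real pipeline performs):

```python
# Fan out independent candidate scoring across workers, then merge by taking
# the top-scoring candidate. The score() function is a trivial stand-in.

from concurrent.futures import ThreadPoolExecutor

def score(candidate):
    # Stand-in for expensive, independent evidence scoring of one candidate.
    return candidate, float(len(candidate))

def best_answer(candidates, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = list(pool.map(score, candidates))
    return max(scored, key=lambda cs: cs[1])[0]
```

Watson applied the same idea at a vastly larger scale, distributing candidate generation and scoring across thousands of cores via UIMA Asynchronous Scaleout.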
Watson was not only NLP
- Betting strategy
- http://www.youtube.com/watch?v=vA9aqAd2iso
To sum up, Watson is:
- An amazing engineering project
- A massive investment
- Research in many domains of NLP
- A big PR stunt
- A way to improve IBM's position in text analytics
- But it is not really a technology ready to be deployed
But was it real progress in open-domain QA?
So is open domain QA a solved problem?
- Can we really solve open domain QA?
- Do we really need open domain QA?
- Do we care?
QA from the user perspective
Real user questions:
- are rarely open domain
- can rarely be formulated in one go
- do not always contain answers from only one source
Companies:
- have very well defined needs
- have access to previously asked questions
- need very high accuracy
- most of them cannot afford to invest millions of dollars

The QALL-ME project
- Question Answering Learning technologies in a multiLingual and Multimodal Environment (QALL-ME) – an FP6 funded project on Multilingual and Multimodal Question Answering
- Partners: FBK, Trento, Italy (coordinator); University of Wolverhampton, UK; DFKI, Germany; University of Alicante, Spain; Comdata, Italy; Ubiest, Italy; Waycom, Italy
- http://qallme.fbk.eu
- Has established an infrastructure for multilingual and multimodal question answering
The QALL-ME project (II)
- Demonstrators in the domain of tourism – can answer questions in the domain of cinema/movies and accommodation, e.g. What movies can I see in Wolverhampton this week? How can I get to Novotel Hotel, Wolverhampton?
- The questions can be asked in any of the four languages of the consortium
- A small-scale demonstrator was built for Romanian
QALL-ME framework
The QALL-ME ontology
- All the reasoning and processing is done using a domain ontology
- The ontology also provides the means of achieving cross-lingual QA
- It determines the way data is stored in the database
- Ontologies need to be developed for each domain
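The general idea of ontology-driven closed-domain QA can be illustrated with a toy example: a question pattern is grounded in a domain concept and answered from structured data. The data, pattern and concept names below are all invented; they only mirror the shape of the approach, not the actual QALL-ME implementation:

```python
# Toy ontology-driven QA: a question pattern grounded in (hypothetical)
# Movie/City concepts is mapped to a query over structured domain data.

import re

CINEMA_DB = [  # stand-in for instances stored under the domain ontology
    {"title": "Up", "city": "Wolverhampton"},
    {"title": "Rio", "city": "Birmingham"},
]

PATTERN = re.compile(r"What movies can I see in (?P<city>[A-Za-z]+)")

def answer_movie_question(question):
    match = PATTERN.search(question)
    if not match:
        return []
    city = match.group("city")  # entity grounded in the City concept
    return [m["title"] for m in CINEMA_DB if m["city"] == city]
```

Because both the question interpretation and the database schema hang off the same ontology, adding a new language means mapping its question patterns to the same concepts, which is how the cross-lingual behaviour falls out.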
Part of the tourism ontology
Evaluation of the QALL-ME prototype
- For the cinema domain the accuracy ranged between 60% and 85% depending on the language
- The system was tested on real questions posed by users which were completely independent from the ones used to develop the system
- The errors were mainly caused by wrongly identified named entities, missing patterns and mistakes of the entailment engine
- In a commercial environment this system can be revised every day in order to obtain much higher performance
Closed domain QA for commercial companies
- Closed domain QA has a certain appeal for companies
- These companies normally have large databases of questions and answers from customers
- The domain can be very clearly defined
- In some cases the systems needed are actually canned QA systems
Interactive QA
- It is easy to ask: Where can I eat paella tonight?
- But what about: What mobile phones are smart phones with a camera, have GPS, have a touch screen, are produced by HTC or Nokia, but definitely not Samsung, have 1G internal memory ….
- Interactivity is needed to deal with ambiguities, wrong answers, user feedback, etc.
Interactive QA (II)
- “A process where the user is a continual part of the information loop”
- At the intersection of Dialogue Systems and Question Answering
- Initiates a dialogue with the user in cases where there are too many or too few answers, or there is some ambiguity in the request
- Can suggest follow-up questions
- Provides a more natural way of locating needed information
- Companies are interested in developing IQA systems as a way of providing customer services
Hypothetical IQA session
[1] SYS: Hi!
[2] USER: Can you find me a smart phone with a camera?
[3] SYS: There are [NUMBER][big number] of smart phones featuring a camera, would you like to search for some [BRAND]?
[4] USER: No, which have GPS?
[5] SYS: [STILL BIG NUMBER] have GPS, are you interested in ones having TOUCH SCREEN? [Suggesting new constraints]
[6] USER: Yes, it would be great.
[7] SYS: [NUMBER] of Nokia phones, [NUMBER] HTC phones, [NUMBER] Samsung phones, [NUMBER] of other brands.
[8] USER: Ok, what about HTC phones?
[9] SYS: [NUMBER] of them have [feature], [NUMBER] of them have [feature].
[10] USER: What are the ones with [feature]?
[11] SYS: Here you are: [RESULTS].
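The session above is, at its core, a constraint-narrowing loop: each user turn adds a filter and the system reports how many candidates remain, offering new constraints while the set is large. A minimal sketch with invented data:

```python
# Constraint-narrowing loop behind the hypothetical IQA session: every user
# turn filters the remaining candidates, and the system reports the count
# (the [NUMBER] slots in the dialogue).

PHONES = [  # invented sample data
    {"brand": "Nokia", "camera": True, "gps": True},
    {"brand": "HTC", "camera": True, "gps": True},
    {"brand": "HTC", "camera": True, "gps": False},
    {"brand": "Samsung", "camera": False, "gps": True},
]

class IQASession:
    def __init__(self, items):
        self.items = list(items)

    def add_constraint(self, key, value):
        """One user turn: narrow the candidate set and report its size."""
        self.items = [i for i in self.items if i.get(key) == value]
        return len(self.items)
```

A full IQA system adds the dialogue-management layer on top: deciding when to ask, which constraint to suggest next, and how to recover when a constraint empties the set.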
Answers from more than one source
- Many complex questions need the answer to be composed from several sources:
  - List questions: List all the cantons in Switzerland which border Germany
  - Sentiment questions: What features do people like in Vista?
- This is part of the new trend in “deep QA”
- Even though users probably really need such answers, the technology is still at the stage of research projects
To sum up…
- Some researchers believe that search is dead and “deep QA” is the future
- This was largely fuelled by IBM's Watson winning the Jeopardy! challenge
- Watson is a fantastic QA system, but it does not solve the problem of open domain QA
- For real applications we still want to focus on very well defined domains
- We still want to have the user in the loop to facilitate asking questions
- Watson may have revived the interest in QA
Watson is not always right
- … but it kind of knows this
- http://www.youtube.com/watch?v=7h4baBEi0iA
Thank you for your attention