Building Watson



  1. BUILDING WATSON: The Science behind an Answer
  2. CONTENTS: Introduction; QA Approach; Basic Architecture; How Watson Answers; Algorithms; Applications; Future Uses; Conclusion; References
  3. Introduction: What is Watson?
     • “Watson is an artificial intelligence computer system capable of answering questions posed in natural language, developed in IBM's DeepQA project” – Wikipedia
     • “An application of advanced Natural Language Processing, Information Retrieval, Knowledge Representation and Reasoning, and Machine Learning technologies to the field of open domain question answering” – IBM
     • “DeepQA is an effective and extensible architecture that can be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of question answering (QA)” – AI Magazine
     SMVIT, Dept. of CSE 2012
  4. What is Watson? About Watson
     • Project started in 2007, led by Dr. David Ferrucci
     • Initial goal: create a system able to process natural language and extract knowledge faster than any other computer or human
     • Jeopardy! was chosen because it's a huge challenge for a computer to find the questions to such “human” answers under time pressure
     • Watson was NOT online!
     • Watson weighs the probability of its answer being right and doesn't ring the buzzer if it's not confident enough
     • A marvel in strategic gameplay
  5. What is Watson? Hardware
     • 90 x IBM Power 750 servers
     • 2880 POWER7 cores
     • POWER7 3.55 GHz chip
     • 500 GB per sec on-chip bandwidth
     • 10 Gb Ethernet network
     • 15 Terabytes of memory
     • 20 Terabytes of disk, clustered
     • Can operate at 80 Teraflops
     • Linux provides a scalable, open platform, optimized to exploit POWER7 performance
     • 10 racks include servers, networking, shared disk system, cluster controllers
  6. A Generic Framework for QA
  7. QA Framework
     The majority of current question answering systems designed to answer factoid questions consist of three distinct components:
     I. Question analysis
     II. Document or passage retrieval
     III. Answer extraction
     [Diagram: Question (Q) → Question Analysis → Document Retrieval over the corpus or document collection → top n text segments or sentences → Answer Extraction → Answer (A)]
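The three components above can be sketched as a small pipeline. This is a minimal illustration, not Watson's implementation: the function names, the stop-word list, and the toy stages (`toy_analyse`, `toy_retrieve`, `toy_extract`) are all assumptions made for the example.

```python
from typing import Callable, List

def answer_question(
    question: str,
    corpus: List[str],
    analyse: Callable[[str], List[str]],
    retrieve: Callable[[List[str], List[str], int], List[str]],
    extract: Callable[[List[str], List[str]], str],
    top_n: int = 3,
) -> str:
    """Run the three stages in order: analysis, retrieval, extraction."""
    query_terms = analyse(question)                  # I. question analysis
    passages = retrieve(query_terms, corpus, top_n)  # II. document or passage retrieval
    return extract(query_terms, passages)            # III. answer extraction

# Toy stand-ins for each stage (hypothetical, for illustration only).
STOP_WORDS = {"who", "what", "is", "the", "of", "a", "in", "was"}

def toy_analyse(question: str) -> List[str]:
    # Lower-case the question, drop the question mark and stop words.
    words = question.lower().replace("?", "").split()
    return [w for w in words if w not in STOP_WORDS]

def toy_retrieve(terms: List[str], corpus: List[str], top_n: int) -> List[str]:
    # Rank passages by how many query terms they contain.
    def overlap(passage: str) -> int:
        words = set(passage.lower().replace(".", "").replace(",", "").split())
        return sum(t in words for t in terms)
    return sorted(corpus, key=overlap, reverse=True)[:top_n]

def toy_extract(terms: List[str], passages: List[str]) -> str:
    # Return the best-ranked passage as the "answer".
    return passages[0] if passages else ""
```

Passing the stages in as functions keeps the three components independent, so any one of them can be replaced without touching the others.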
  8. Basic Architecture
  9. How Does Watson Answer a Question? In four not-so-simple steps
     [Timeline figure: Intro, Steps I to IV, Solution; processing-time axis from 0 to 3 sec]
  10. How Does Watson Answer a Question?
     Example clue (category: Characters in Classic Literature): “The first person mentioned by name in ‘The Man in the Iron Mask’ is this hero of a previous book by the same author.”
  11. Question Analysis
  12. Step I – Question Analysis
     As the first component in a QA system, it could easily be argued that question analysis is the most important part. Any mistakes made at this stage are likely to render useless any further processing of a question.
     Sub-tasks: Determining the Expected Answer Type; Query Formation
  13. Step I – Question Analysis
     Clue: “The first person mentioned by name in ‘The Man in the Iron Mask’ is this hero of a previous book by the same author.”
     • Parsing – Subject, Verb, Object
     • Identify the roles
     Answer: D'Artagnan
  14. Determining the Expected Answer Type
     Machine learning techniques are used to classify a question. The system can be trained on a corpus of thousands of tagged questions, provided by the Cognitive Computation Group at the Department of Computer Science, University of Illinois, to determine the expected answer type.
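To make the idea of learning answer types from tagged questions concrete, here is a minimal sketch using a bag-of-words Naive Bayes classifier with add-one smoothing. The slide does not say which learning algorithm was used, so Naive Bayes is an assumption; the six training questions and the labels `HUMAN`, `LOCATION`, and `DATE` are hypothetical stand-ins for the real tagged corpus.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tagged questions; real training would use thousands of examples.
TRAINING = [
    ("who wrote the three musketeers", "HUMAN"),
    ("who is the president of india", "HUMAN"),
    ("where is the taj mahal", "LOCATION"),
    ("where was vasco da gama born", "LOCATION"),
    ("when did vasco da gama reach india", "DATE"),
    ("when was the taj mahal completed", "DATE"),
]

def train(examples):
    """Count classes and per-class word occurrences."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(question, model):
    """Return the label with the highest log-probability for the question."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    words = question.lower().replace("?", " ").split()
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in words:
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With so little data the question word dominates the decision; a larger tagged corpus would let the classifier pick up subtler cues.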
  15. Query Formation
     The question analysis component of a QA system is usually responsible for formulating a query from a natural language question to maximise the performance of the IR engine used by the document retrieval component of the QA system.
     Example: Who is the president of India?
     Search: name; biography; person – president; place – India
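The "president of India" example can be sketched as follows: stop words are dropped and a few search terms are added according to the expected answer type. The stop-word list and the `TYPE_EXPANSIONS` mapping are assumptions chosen to mirror the slide's example; they are not taken from any real system.

```python
STOP_WORDS = {"who", "what", "where", "when", "is", "the", "of", "a", "an", "in"}

# Hypothetical mapping from expected answer type to extra search terms.
TYPE_EXPANSIONS = {
    "person": ["name", "biography"],
    "place": ["location"],
}

def form_query(question: str, answer_type: str) -> list:
    """Build a keyword query: type-specific terms followed by content words."""
    words = question.lower().replace("?", " ").split()
    keywords = [w for w in words if w not in STOP_WORDS]
    return TYPE_EXPANSIONS.get(answer_type, []) + keywords
```

For the slide's question, `form_query("Who is the president of India ?", "person")` yields the terms `name`, `biography`, `president`, and `india`.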
  16. Document Retrieval
  17. Document Retrieval
     The text collection over which a QA system works tends to be so large that it is impossible to process the whole of it to retrieve the answer. The task of the document retrieval module is to select a small subset of the collection that can be practically handled in the later stages.
     Example passage: “The Taj Mahal, completed around 1648, is a mausoleum located in Agra, India, that was built under Mughal Emperor Shah Jahan in memory of his favourite wife, Mumtaz Mahal.”
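Selecting a small, relevant subset can be sketched with a simple term-weighting scheme. The sketch below scores each document by term frequency times a smoothed inverse document frequency, which is one common variant and an assumption here; the slide does not specify the retrieval model.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query, documents, top_n=2):
    """Return up to top_n documents with a non-zero score, best first."""
    doc_tokens = [Counter(tokenize(d)) for d in documents]
    n_docs = len(documents)
    query_terms = set(tokenize(query))

    def idf(term):
        # Smoothed inverse document frequency: rarer terms weigh more.
        df = sum(1 for tokens in doc_tokens if term in tokens)
        return math.log((n_docs + 1) / (df + 1)) + 1.0

    scores = []
    for index, tokens in enumerate(doc_tokens):
        score = sum(tokens[t] * idf(t) for t in query_terms)
        scores.append((score, index))
    # Highest score first; ties keep the original document order.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [documents[i] for score, i in scores[:top_n] if score > 0]
```

Documents that share no term with the query are dropped entirely, so the later stages only see passages with at least some surface evidence.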
  18. Document Retrieval
  19. Answer Extraction
  20. Answer Extraction
     Responsible for ranking the sentences and giving a relative probability estimate to each one. It also registers the frequency of each individual phrase chunk marked by the NE recognizer for a given question class at a given rank.
     Components: Sentence Ranking; Answer Frequency; Sense/Semantic Similarity; Combined Answer Score
     Sense/Semantic Similarity:
     • Uses statistics to compute an information content value.
     • Assigns a probability to a concept in the taxonomy based on the occurrence of the target concept in a given corpus.
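The two similarity bullets match the usual corpus-based definition of information content, IC(c) = -log p(c), where p(c) is estimated from how often a concept, or anything below it in the taxonomy, occurs in a corpus. Whether the deck intends exactly this definition is an assumption. The taxonomy and the counts below are hypothetical.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {"entity": None, "person": "entity", "explorer": "person", "place": "entity"}
# Hypothetical corpus counts of direct mentions of each concept.
COUNTS = {"entity": 0, "person": 6, "explorer": 2, "place": 8}

def subsumed_count(concept):
    """Mentions of the concept plus mentions of everything below it."""
    total = COUNTS[concept]
    for child, parent in PARENT.items():
        if parent == concept:
            total += subsumed_count(child)
    return total

def information_content(concept):
    """IC(c) = -log p(c), with p(c) relative to the root of the taxonomy."""
    probability = subsumed_count(concept) / subsumed_count("entity")
    return -math.log(probability)
```

More specific concepts are rarer and therefore carry more information: here `explorer` scores higher than `person`, and the root scores zero.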
  21. Answer Extraction: Keyword Evidence
     Question: “In May 1898 Portugal celebrated the 400th anniversary of this explorer's arrival in India.”
     Passage: “In May, Gary arrived in India after he celebrated his anniversary in Portugal.”
     Keyword matching (question ↔ passage):
     • celebrated ↔ celebrated
     • In May 1898 ↔ In May
     • 400th anniversary ↔ anniversary
     • Portugal ↔ in Portugal
     • arrival in India ↔ arrived in India
     • explorer ↔ Gary
     Evidence suggests “Gary” is the answer, BUT the system must learn that keyword matching may be weak relative to other types of evidence.
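The weakness of keyword evidence can be reproduced with a simple overlap score: the fraction of question keywords that also appear in a passage. The stop-word list and the scoring rule are assumptions made for this sketch.

```python
import re

STOP_WORDS = {"in", "the", "of", "this", "his", "he", "after", "s", "on"}

def keywords(text):
    """Return the set of lower-cased tokens that are not stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def keyword_score(question, passage):
    """Fraction of the question's keywords that also occur in the passage."""
    question_terms = keywords(question)
    return len(question_terms & keywords(passage)) / len(question_terms)
```

On the slide's example the misleading “Gary” passage shares five of the nine question keywords, while a passage such as “On 27th May 1498, Vasco da Gama landed in Kappad Beach.” shares only one, so overlap alone ranks the wrong passage first.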
  22. Answer Extraction: Deeper Evidence
     Question: “In May 1898 Portugal celebrated the 400th anniversary of this explorer's arrival in India.”
     Passage: “On 27th May 1498, Vasco da Gama landed in Kappad Beach.”
     • Search far and wide
     • Explore many hypotheses
     • Find and judge evidence
     • Many inference algorithms
     Evidence (question ↔ passage):
     • May 1898, 400th anniversary ↔ 27th May 1498 (Temporal Reasoning, Date Math)
     • arrival in ↔ landed in (Statistical Paraphrasing, Paraphrases)
     • India ↔ Kappad Beach (GeoSpatial Reasoning, Geo-KB)
     • explorer ↔ Vasco da Gama
     Stronger evidence can be much harder to find and score. The evidence is still not 100% certain.
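The date-math step in this example is simple arithmetic: 1898 minus 400 years gives 1498, which matches the year in the passage. A minimal sketch of that check follows; the regular expressions and function names are assumptions for illustration and handle only this style of clue.

```python
import re

YEAR_PATTERN = r"\b(1\d{3}|20\d{2})\b"

def anniversary_year(clue):
    """Year implied by an 'Nth anniversary' and a four-digit year in the clue."""
    year = re.search(YEAR_PATTERN, clue)
    nth = re.search(r"\b(\d+)(?:st|nd|rd|th) anniversary\b", clue)
    if not year or not nth:
        return None
    return int(year.group(1)) - int(nth.group(1))

def passage_years(passage):
    """All four-digit years mentioned in the passage."""
    return [int(y) for y in re.findall(YEAR_PATTERN, passage)]

def dates_agree(clue, passage):
    """True when the year implied by the clue appears in the passage."""
    implied = anniversary_year(clue)
    return implied is not None and implied in passage_years(passage)
```

This kind of evidence supports “Vasco da Gama” even though the passage shares almost no keywords with the question.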
  23. Algorithms Explained
  24. Sense Ranking Algorithm
     The system starts with a query 2 to 8 words long and selects several documents that are relevant to the original query. An IR system can use these documents to expand the original query and choose more documents that accurately match the user's expectations. It is given by: [formula not reproduced in the transcript]
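The slide's own formula is not available, so the sketch below does not reproduce the Sense Ranking Algorithm. It only illustrates the general idea described in the text, expanding a short query with frequent terms from documents already judged relevant (often called pseudo-relevance feedback). The stop-word list and the `extra_terms` parameter are assumptions.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "is", "was", "and", "to", "by", "his"}

def expand_query(query, relevant_docs, extra_terms=2):
    """Append the most frequent new content words from the relevant documents."""
    original = re.findall(r"[a-z0-9]+", query.lower())
    counts = Counter()
    for doc in relevant_docs:
        for word in re.findall(r"[a-z0-9]+", doc.lower()):
            if word not in STOP_WORDS and word not in original:
                counts[word] += 1
    # Most frequent first; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return original + [word for word, _ in ranked[:extra_terms]]
```

The expanded query can then be sent back to the retrieval component to pull in documents that the original short query would have missed.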
  25. WordNet – Synset
     The aim here is to find, for each synset in WordNet, an article which is about the same concept as the synset, or decide that none exists (for the ‘car’ synset the match would be the ‘Automobile’ article). Aligning the two resources in this way allows both to be enriched.
     Let T = {ordered set of M(q_i) ∀ i ∈ [1, m]} in increasing order of d(q). The function θ_i is the distance of the i-th element in T; then the alignment score is given by: [formula not reproduced in the transcript]
     An alignment score of 1.0 signifies perfect alignment while a score of -1.0 signifies reverse order of occurrence.
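The alignment-score formula itself did not survive in the transcript. As a stand-in, the sketch below uses Kendall's tau, a different but standard rank-correlation measure that has the same stated property: 1.0 for a sequence in perfect order and -1.0 for the fully reversed order. It should be read as an illustration of that property, not as the slide's definition.

```python
from itertools import combinations

def kendall_tau(positions):
    """Kendall's tau of a sequence against its own index order (no ties assumed)."""
    pairs = list(combinations(range(len(positions)), 2))
    if not pairs:
        raise ValueError("need at least two elements")
    # A pair is concordant when the later element also has the larger value.
    concordant = sum(1 for i, j in pairs if positions[i] < positions[j])
    discordant = sum(1 for i, j in pairs if positions[i] > positions[j])
    return (concordant - discordant) / len(pairs)
```

For example, `kendall_tau([1, 2, 3, 4])` is 1.0 and `kendall_tau([4, 3, 2, 1])` is -1.0; partially shuffled sequences fall in between.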
  26. Answer Confidence Threshold
     • Evaluates each candidate answer's confidence against a threshold
     • Statistics are based on the evidence
     • Retrieved supporting evidence is also routed to the deep evidence scoring components, which evaluate the candidate answer in the context of the supporting evidence.
     Example: “Chile shares its longest land border with this country.” Candidates: Bolivia, Argentina. Evidence dimensions: Source…, Passage…, Location. Confidence axis: 0 to 1.
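One way to picture the threshold is to merge the per-dimension evidence scores into a single confidence and answer only when the best candidate clears the bar. The sketch below uses a logistic combination; the weights, the bias, the evidence values, and the 0.5 default threshold are all hypothetical, and the slide does not state how the scores are actually combined.

```python
import math

# Hypothetical weights; a real system would learn these from training data.
WEIGHTS = {"source": 1.0, "passage": 1.5, "location": 2.0}
BIAS = -2.0

def confidence(evidence):
    """Map evidence scores in [0, 1] to a confidence in (0, 1) with a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * evidence.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def best_answer(candidates, threshold=0.5):
    """Return (answer, confidence), or (None, confidence) if below the threshold."""
    name, scores = max(candidates.items(), key=lambda item: confidence(item[1]))
    value = confidence(scores)
    return (name, value) if value >= threshold else (None, value)
```

Raising the threshold trades coverage for precision: with a strict enough threshold even the top candidate is withheld, which corresponds to not ringing the buzzer.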
  29. Applications of Watson
     Real-life applications of Watson:
     • In hospitals and health centres: if each hospital had a Watson, it could diagnose a patient's problems in a matter of seconds.
     • In call centres and business analytics: call centres can be manned by Watson. It can also be used for analysis of business or financial supports.
  30. Applications (contd.)
     • Natural language and unstructured sources of information: Watson continues to examine hypotheses, ask for evidence, and evaluate each hypothesis based on the evidence.
     • Implications in Artificial Intelligence: focus on cognitive computing, which aims at simulating the fundamental behaviour of the human brain, which is to process information and respond on an event-driven basis instead of a clock-driven basis.
  31. Future Uses of Watson
     • Commercialization: Watson's data-crunching capability can help suggest treatment options and diagnoses to doctors. IBM intends to use Watson in other information-intensive fields as well, like telecom, financial services, government, etc.
     • Competition: in December 2011 Microsoft and GE announced a healthcare partnership to utilize technology in healthcare.
  32. Conclusion
     • Research in the broader field of AI
     • Efficient integration and ablation studies of probabilistic components
     • Result of rapid experimentation
     • 5500 independent experiments in 3 years
     • Won 64 per cent of the games
     • Advance QA research
     • Many more future aspects and purposes
  33. References
     • Ferrucci, D., and Lally, A. 2004. UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment.
     • Maybury, Mark, ed. 2004. New Directions in Question Answering. Menlo Park, CA: AAAI Press.
  34. THANK YOU