Hybrid system architecture overview

This is the deck for a Science Advisory Board review of our recent progress in setting up a basic infrastructure: a hybrid system architecture to facilitate automatic question answering in Project Halo, Vulcan's long-range strong-AI effort to attack a key problem in AI research.

  1. Overview of Hybrid Architecture in Project Halo. Jesse Wang, Peter Clark. March 18, 2013
  2. Status of Hybrid Architecture: Goals, Modularity, Dispatcher, Evaluation
  3. Hybrid System Near-Term Goals
     • Set up the infrastructure to communicate with the existing reasoners (AURA, CYC, TEQA)
     • Reliably dispatch questions and collect answers
     • Create related tools and resources: question generation/selection, answer evaluation, report analysis, etc.
     • Experiment with ways to choose answers from the available reasoners, i.e. as a hybrid solver
  4. Focus Areas of Hybrid Framework (until mid-2013)
     • Modularity: loose coupling, high cohesion, data-exchange protocols
     • Dispatching: send requests and handle the responses
     • Evaluation: ability to get ratings on answers and report the results
  5. Hybrid System Core Components (architecture diagram: AURA, CYC, TEQA, IR and DirectQA over the Campbell textbook and a filtered set of questions; yellow outline marks new or updated components)
     Legend: SQs = suggested questions; SQA = QA with suggested questions; TEQA = Textual Entailment QA; IR = Information Retrieval
  6. Infrastructure: Dispatchers (diagram: the dispatcher routes Live Single QA, Suggested QA and Batch QA requests to CYC, TEQA, AURA and IR)
  7. Dispatcher Features
     • Asynchronous batch mode and single/experiment mode
     • Parallel dispatching to reasoners
       o Very functional UI: live progress indicator, view of the question file, logs
       o Exception and error handling: retry a question when the server is busy
     • The batch service can continue to finish even if the client dies
       o Cancelling/stopping the batch process is also available
     • Input and output support both XML and CSV/TSV formats
       o Pipeline support: accepts Question-Selector input
     • Configurable dispatchers; select which reasoners to use
       o Collect answers and compute basic statistics
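For illustration, here is a minimal Python sketch of the parallel-dispatch-with-retry behaviour described on this slide. The reasoner names come from the slides, but call_reasoner, ServerBusyError and the back-off policy are hypothetical stand-ins, not the actual Halo dispatcher code.

```python
import concurrent.futures
import random
import time

REASONERS = ["AURA", "CYC", "TEQA", "IR"]  # reasoner names from the slides

class ServerBusyError(Exception):
    """Stand-in for a 'server busy' response from a reasoner service."""

def call_reasoner(reasoner, question):
    """Hypothetical stub; the real dispatcher would make an HTTP/XML request here."""
    if random.random() < 0.2:
        raise ServerBusyError(reasoner)
    return f"{reasoner} answer to: {question}"

def ask(reasoner, question, retries=3, backoff=1.0):
    """Send one question to one reasoner, retrying when the server reports busy."""
    for attempt in range(1, retries + 1):
        try:
            return call_reasoner(reasoner, question)
        except ServerBusyError:
            time.sleep(backoff * attempt)  # simple linear back-off between retries
    return None  # give up; the question stays unanswered for this reasoner

def dispatch(question):
    """Dispatch one question to all reasoners in parallel and collect the answers."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(REASONERS)) as pool:
        futures = {pool.submit(ask, r, question): r for r in REASONERS}
        return {futures[f]: f.result() for f in concurrent.futures.as_completed(futures)}

if __name__ == "__main__":
    print(dispatch("What does ribosome make?"))
```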
  8. Question Answering via Suggested Questions
     • Similar features to Live/Direct QA
     • Aggregates the suggested questions' answers as a solver
     • Unique features:
       o Interactively browse the suggested-questions database
       o Filter on certain facets
       o Use Q/A concepts, question types, etc. to improve relevance
       o Automatic comparison of filtered and non-filtered results by chapter
  9. Question and Answer Handling
     • Handling and parsing each reasoner's returned results
       o Customized programming per reasoner
     • Information on execution: details and summary
     • Report generation
       o Automatic evaluation
     • Question Selector
       o Supports multiple facets/filters
       o Question banks
       o Dynamic UI to pick questions
       o Hidden-tag support
  10. Automatic Evaluation: Status as of March 2013 (chart: user overall vs. AutoEval overall scores)
     • Automatic result-evaluation features
       o Web UI/service
       o Algorithms to score exact and variable answers: brevity/clarity, relevance (correctness + completeness), overall score
       o Report generation: summary & details, graph plots
     • Improving the accuracy of evaluation results
       o Using basic text-processing tricks (stop words, stemming, trigram similarity, etc.), location of the answer, length of the answer, bio concepts, counts of concepts, chapters referred to, question types, answer type
       o Experiments and analysis (several rounds, work in progress)
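One of the text-processing tricks listed above is trigram similarity. The sketch below shows one plausible way to score the overlap between a returned answer and a reference answer; the tiny stop-word list and the Jaccard-over-character-trigrams choice are illustrative assumptions, not the project's actual scoring algorithm.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "is", "are", "and", "in", "on", "at"}  # tiny illustrative list

def normalize(text):
    """Lower-case, keep alphanumeric tokens, drop stop words (a crude stand-in for stemming etc.)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def char_trigrams(tokens):
    """Character trigrams over the space-joined normalized tokens."""
    s = " ".join(tokens)
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(answer, reference):
    """Jaccard overlap of character trigrams between a system answer and a reference answer."""
    a, b = char_trigrams(normalize(answer)), char_trigrams(normalize(reference))
    return len(a & b) / len(a | b) if a | b else 0.0

print(trigram_similarity("A carbon atom can bond to four other atoms",
                         "Carbon can bind at most four different atoms"))
```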
  11. Hybrid Performance: how we evaluate, and how we can improve, overall system performance
  12. Caveats: Question Generation and Selection
     • Questions were generated by a small group of SMEs (senior biology students)
     • In natural language, without the textbook (only the syllabus)
  13. Question Set Facets (two pie charts: question-type distribution, dominated by FIND-A-VALUE at 46%, with smaller shares for IS-IT-TRUE-THAT, WHY, HAVE-RELATIONSHIP, HOW, WHERE, PROPERTY, HOW-MANY, WHAT-IS-A, WHAT-DOES-X-DO, HAVE-SIMILARITIES, X-OR-Y, HAVE-DIFFERENCES, FUNCTION-OF-X and Other; and question distribution over textbook chapters 0 and 4 through 12)
  14. Caveat: Evaluation Criteria
     • We provided a clear guideline, but ratings are still subjective
       o A (4.0) = correct, complete answer, no major weakness
       o B (3.0) = correct, complete answer with small cosmetic issues
       o C (2.0) = partially correct or partially complete answer, with some big issues
       o D (1.0) = somewhat relevant answer or information, or poor presentation
       o F (0.0) = wrong, irrelevant, conflicting, or hard-to-locate answers
     • Only 3 users rated the answers, under a tight timeline
     (chart: per-user rating preferences for Aura, Cyc and Text QA)
  15. Evaluation Example. Q: What is the maximum number of different atoms a carbon atom can bind at once?
  16. More Evaluation Samples (snapshot)
  17. Reasoner Quality Overview (bar chart: answer counts over rating, from 0.00 to 4.00 in steps of 0.33, for Aura, Cyc and Text QA)
  18. Performance Numbers (two bar charts: precision, recall and F1 for Aura, Cyc and Text QA, computed over all ratings 0..4 and over "good" answers rated >= 3.0)
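To make the metrics behind these charts concrete, here is a minimal sketch of how per-reasoner precision, recall and F1 might be computed from rated answers, treating a rating of at least 3.0 as a good answer. The data layout is an assumption for illustration only, not the project's evaluation code.

```python
def precision_recall_f1(answer_ratings, total_questions, good=3.0):
    """answer_ratings: average user ratings for the questions a reasoner answered."""
    answered = len(answer_ratings)
    correct = sum(1 for r in answer_ratings if r >= good)
    precision = correct / answered if answered else 0.0
    recall = correct / total_questions if total_questions else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: a reasoner answered 20 of 60 questions, 9 of them rated "good" (>= 3.0).
print(precision_recall_f1([3.5] * 9 + [1.0] * 11, total_questions=60))
```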
  19. Answers Over Question Types (two bar charts: count of answered questions, and overall answer rating, per question type: FIND-A-VALUE, HOW, HOW-MANY, PROPERTY, WHAT-DOES-X-DO, WHAT-IS-A, X-OR-Y, IS-IT-TRUE-THAT, HAVE-DIFFERENCES, HAVE-SIMILARITIES and HAVE-RELATIONSHIP; broken down by Aura, Cyc and Text QA)
  20. Answer Distribution Over Chapters (chart and data table: average answer quality per textbook chapter, chapters 0 and 4 through 12, for Aura, Cyc and Text QA)
  21. Answers on Questions with Exact/Various (E/V) Answer Type (two charts: count of E- and V-type answers and their answer quality for Aura, Cyc and Text QA)
  22. Improve Performance: Hybrid Solver – Combine!
     • Random selector (dumbest; baseline)
       o The total number of questions answered correctly should beat the best single solver
     • Priority selector (less dumb)
       o Pick a reasoner following a fixed preference order (e.g. Aura > Cyc > Text QA) *
       o Expected performance: better than the best individual
     • Trained selector: feature- and rule-based selector (smarter)
       o Decision-tree (CTree, ...) learning over question type, chapter, ...
       o Expected performance: slightly better than the above
     • Theoretical best selector: MAX, the upper limit (smartest)
       o Suppose we can always pick the best-performing reasoner
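As a concrete illustration of the two simplest selectors above, here is a minimal Python sketch of a priority selector and the theoretical MAX oracle. The answer and rating dictionaries are made-up examples; this is not the project's selector implementation.

```python
PRIORITY = ["Aura", "Cyc", "Text QA"]  # the example preference order from the slide

def priority_select(answers):
    """answers: dict reasoner -> answer text (or None). Return the first available answer."""
    for reasoner in PRIORITY:
        if answers.get(reasoner) is not None:
            return reasoner, answers[reasoner]
    return None, None

def max_oracle_select(answers, ratings):
    """Theoretical upper bound: pick the answer the human raters scored highest."""
    rated = {r: ratings[r] for r in answers if answers[r] is not None and r in ratings}
    if not rated:
        return None, None
    best = max(rated, key=rated.get)
    return best, answers[best]

answers = {"Aura": None, "Cyc": "Four", "Text QA": "A carbon atom can bond to four other atoms"}
ratings = {"Cyc": 3.0, "Text QA": 4.0}
print(priority_select(answers))             # ('Cyc', 'Four')
print(max_oracle_select(answers, ratings))  # ('Text QA', ...)
```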
  23. Performance (F1) with Hybrid Solvers (bar chart: F1 on good answers, rating >= 3.0, for Aura, Cyc, Text QA, Random, Priority, D-Tree and Max solvers)
  24. Conclusion
     • Each reasoner has its own strengths and weaknesses
       o Some aspects are not handled well by AURA & CYC
       o Low-hanging fruit: IS-IT-TRUE-THAT for all, WHAT-IS-A for CYC, ...
     • Aggregated performance easily beats the best individual (Text QA)
       o Even the random solver does a good job (mean F1 = 0.609): F1(MAX) - F1(random) ~ 2.5%
     • Little room for better performance via answer selection alone
       o F1(MAX) - F1(D-Tree) ~ 0.5%
       o Better to focus on MORE and/or BETTER solvers
  25. Future and Discussions
  26. Near-Future Plans
     • Include SQDB-based answers as a "solver"
       o Helps alleviate the reasoners' question-interpretation problems
     • Include information-retrieval-based answers as a "solver"
       o Helps us understand the extra power reasoners can have over search
     • Improve the evaluation mechanism
     • Extract more features from questions and answers to enable a better solver, and see how close we can get to the upper limit (MAX)
     • Improve the question selector to support multiple sources and automatic update/merge of question metadata
     • Find ways to handle question-bank evolution
  27. Further Technical Directions (June 2013 and beyond)
     • Get more, better reasoners
     • Machine learning, evidence combination
       o Extract and use more features to select the best answers
       o Evidence collection and weighing
     • Analytics & tuning
       o Make it easier to explore individual results and diagnose failures
       o Support tuning and optimizing performance over target question-answer datasets
     • Inter-solver communication
       o Support shared data and shared answers
       o Subgoaling: allow reasoners to call each other for subgoals
  28. Open *Data*
     • Requirements: clear semantics, a common (standard) format, easy to access, persistent (always available)
     • Data sources: question bank, training sets, knowledge base, protocols for intermediate and final data exchange
     • Open Data Access Layer: design and implement protocols and services for data I/O
  29. Open *Services*
     • Two categories: pure machine/algorithm-based, and human computation (social, crowdsourcing)
     • Requirements: communicate with the open data, generate metadata; more reliable, scalable, reusable
     • Goal: process and refine data, converting raw, noisy, inaccurate data into refined, structured, useful data
  30. Open *Environment*
     • Definition: an AI development environment to facilitate collaboration, efficiency and scalability
     • Operation: like a massively multiplayer online game, each "player" earns credits (contribution, resource consumption), interest, loans, ratings, ...
     • Opportunities: self-organized projects, growth potential, encouraged collaboration, a grand prize
  31. Thank You! For the opportunity for Q&A. Backup slides next.
  32. IBM Watson's "DeepQA" Hybrid Architecture (diagram slide)
  33. DeepQA Answer Merging and Ranking Module (diagram slide)
  34. Wolfram Alpha Hybrid Architecture
     • Data curation
     • Computation
     • Linguistic components
     • Presentation
  37. Answer Distribution (Density) (chart: count of answers vs. average user rating, from 0.00 to 4.00, for Text QA, Cyc and Aura)
  38. Data Table for Answer Quality Distribution (table slide)
  39. Work Performed
     • Created web-based dispatcher infrastructure
       o For both Live Direct QA and Live Suggested Questions
       o Batch mode to handle larger volumes
     • Built a web UI for UW students to rate answers to questions (HEF)
       o Coherent UI, duplicate removal, queued tasks
     • Established automatic ways to evaluate and compare results
     • Applied initial versions of the file and data exchange formats and protocols
     • Set up a faceted browsing and search (retrieval) UI
       o Plus web services for 3rd-party consumption
     • Carried out many rounds of relevance studies and analysis
  40. First Evaluation via the Halo Evaluation Framework
     • We sent each reasoner's QA result set to UW students for evaluation
     • First-round hybrid system evaluation:
       o Cyc SQA: 9 best (3 ties), 14 good, 15/60 answered
       o Aura QA: 1 best, 9 good, 14/60 answered
       o Aura SQA: 4 best (3 ties), 7 good, 8/60 answered
       o Text QA: 27 best, 29 good; Text SQA: 3 best, 5 good, 7/60 answered
       o Best-case scenario: 41/60 answered
       o Note: Cyc Live was not included
       o * SQA = answering via suggested questions
  41. Live Direct QA Dispatcher Service (screenshot: ask a question, e.g. "What does ribosome make?", wait for answers, answers returned)
  42. Live Suggested QA Dispatcher Service (screenshot)
  43. Batch QA Dispatcher Service (screenshot)
  44. Live Solver Service Dispatchers (screenshot)
  45. Direct Live QA: What does ribosome make? (screenshot)
  46. Direct Live QA: What does ribosome make? (screenshot, continued)
  47. Suggested Questions Dispatcher (screenshot)
  48. Results for the Suggested Question Dispatcher (screenshot)
  49. Batch Mode QA Dispatcher (screenshot)
  50. Batch QA Progress Bar (screenshot)
  51. Suggested Questions Database Browser (screenshot)
  52. Faceted Search on Suggested Questions (screenshot)
  53. Tuning the Suggested Question Recommendation
     Accomplished:
     • Indexed the suggested-questions database (concepts, questions, answers)
     • Created a web service for manual upload of new sets of suggested questions
     • Extracted chapter information from answer text (TEXT)
     • Analyzed question types (pattern-based)
     • Experimented with some basic retrieval criteria
     Not yet implemented:
     • Parsing the questions
     • More experiments (heuristics) on retrieval/ranking criteria
     • Get SMEs to generate training data for automatic evaluation
     • More feature extraction
  54. Parsing, Indexing and Ranking
     In place:
     • New local concept-extraction service
     • Extracted concepts are in the index
     • Both sentences and paragraphs are in the index
     • Basic sentence type identified
     • Chapter and section information in the index
     • Several ways of ranking evaluated
     Not yet implemented:
     • More sentence features: content type (question, figure, header, regular, review, ...), previous and next concepts, count of concepts, clauses, universal truth, relevance or not
     • Question parsing
     • More refinement of the ranking
     • Learning to Rank??
  55. Browse the Hybrid System (screenshot)
  56. WIP: Ranking Experiments (Ablation Study) (table: retrieval performance with only, or without, each feature on the easy and hard question sets; sentence text alone: 139/201 on the easy set, 31/146 on the hard set; sentence concepts alone: 79/201 on the easy set, 13/146 on the hard set; the remaining rows, covering prev/next sentence concepts, locality info such as chapter, stopword list, stemming comparison, other features such as type, and weighting variations, were not yet measured)
  57. Automatic Evaluation of IR Results
     • Inexpensive, consistent results for tuning
       o Always using human judgments would be expensive and somewhat inconsistent
     • Quick turnaround
     • Covers both "easy" and "difficult" question-answer sets
     • Validated by UW students to be trustworthy
       o 95% accuracy on average with thresholding
  58. First UW Students' Evaluation of AutoEval
     • Notation:
       o 0 = right on (100% is right, 0% is wrong)
       o -1 = false positive: we gave it a high score (>50%), but the retrieved text does NOT contain or imply the answer
       o +1 = false negative: we gave it a low score (<50%), but the retrieved text actually DOES contain or imply the answer
     • We gave each of 4 students:
       o 15 questions, i.e. 15*5 = 75 sentences and scores to check
       o 5 of the questions are shared, 10 are unique to each student
       o 23/45 questions from the "hard" set, 22/45 from the "easy" set
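The following is a minimal sketch of how AutoEval scores could be checked against the students' judgments using the 0 / -1 / +1 notation above; the data layout and threshold handling are assumptions for illustration, not the project's validation code.

```python
def label(auto_score, human_says_contains, threshold=0.5):
    """0 = AutoEval agrees with the student, -1 = false positive, +1 = false negative."""
    predicted = auto_score > threshold
    if predicted == human_says_contains:
        return 0
    return -1 if predicted else +1

def agreement(pairs, threshold=0.5):
    """pairs: (auto_score, student_judgment) tuples; fraction on which AutoEval agrees."""
    labels = [label(score, judgment, threshold) for score, judgment in pairs]
    return labels.count(0) / len(labels)

# Toy example: four retrieved sentences with AutoEval scores and student yes/no judgments.
pairs = [(0.9, True), (0.7, False), (0.2, True), (0.1, False)]
print(agreement(pairs))                 # 0.5 at the 50% threshold
print(agreement(pairs, threshold=0.8))  # 0.75 at the 80% threshold
```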
  59. Results: Auto-Evaluation Validity Verification (bar chart: per-student agreement with AutoEval at score thresholds of 50% and 80%)
  60. The "Easy" QA Set *
     • Task: automatically evaluate whether the retrieved sentences contain the answer
     • Scoring: max score, Mean Average Precision (MAP)
     • Results using the max score (with threshold at 80%):
       o 193 regular questions and 8 yes/no questions (scored via concept overlap)
       o Only with sentence text: 139 (69.2%)
       o Peter's test set: 149 (74.1%)
       o Peter's more refined set: 158 (78.6%)
       o (Lower) upper bound for IR: 170 (84.2%)
       o Jesse's best: ??
     * This evaluation covers the IR portion ONLY, with no answer pinpointing
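The slide mentions Mean Average Precision as one of the scoring options. Here is a minimal sketch of standard MAP over ranked retrieved sentences with binary relevance (1 = the sentence contains the answer); the toy data is illustrative and unrelated to the reported numbers.

```python
def average_precision(relevances):
    """relevances: ranked 0/1 flags, 1 if the retrieved sentence contains the answer."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """MAP over all questions in the set."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# Two toy questions: top-5 retrieved sentences each, 1 = the sentence contains the answer.
print(mean_average_precision([[1, 0, 1, 0, 0], [0, 0, 0, 1, 0]]))  # (0.8333 + 0.25) / 2
```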
  61. "Easy" QA Set Auto-Evaluation Results (bar chart: scores for question text only, Vulcan basic, Vulcan refined, BaseIR, and the current upper bound)
  62. Best Upper Bound for the Hard Set as of Today
     Uses weighting over answer text, answer concepts, question text and question concepts; matching over sentence text, sentence concepts, and concepts from the previous and next sentences, plus sentence type; comparison uses keyword overlap, concept overlap, stopword removal and smart stemming techniques.
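To make the weighted field-matching idea concrete, here is a minimal sketch of scoring a candidate sentence against weighted query fields. The weights, the Jaccard overlap measure and the field names are illustrative guesses, not the tuned configuration behind the reported upper bound.

```python
# Illustrative field weights; the real experiments tuned these, the values here are guesses.
WEIGHTS = {
    "answer_text": 0.4,
    "answer_concepts": 0.3,
    "question_text": 0.2,
    "question_concepts": 0.1,
}

def overlap(a, b):
    """Keyword overlap (Jaccard) between two token collections; a stand-in for the real matchers."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def score_sentence(query, sentence):
    """Weighted match of query fields against the sentence's text, concepts and neighbour concepts."""
    sentence_tokens = sentence["text"] + sentence["concepts"] + sentence["neighbor_concepts"]
    return sum(w * overlap(query[field], sentence_tokens) for field, w in WEIGHTS.items())

query = {"answer_text": ["four"],
         "answer_concepts": ["carbon", "atom"],
         "question_text": ["maximum", "atoms", "carbon", "bind"],
         "question_concepts": ["carbon", "atom"]}
sentence = {"text": ["carbon", "can", "bond", "to", "four", "other", "atoms"],
            "concepts": ["carbon", "atom", "covalent-bond"],
            "neighbor_concepts": ["electron"]}
print(score_sentence(query, sentence))
```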
  63. Sharing the Data and Knowledge
     • Information we want (and that each solver may also want):
     • Everyone's results
     • Everyone's confidence in their results
     • Everyone's supporting evidence
       o From textbook sentences, reviews, the homework section, figures, ...
       o From related web material, e.g. biology articles on Wikipedia
       o From common world knowledge: ParaPara, WordNet, ...
     • Training data, for offline use
  64. More Timeline Details for the First Integration
     We are in control:
     • AURA: now
     • Text: before 12/7
     • Vulcan IR baseline: before 12/15
     • Initial hybrid system output: before 12/21, without a unified data format and with a limited (possibly outdated) set of suggested questions
     Partners:
     • Cyc: ? Hopefully before EOY 2012
     • JHU: ?? Hopefully before EOY 2012
     • ReVerb: ??? End of January 2013
  65. Rounds of Improvements
     • Infrastructure (modules & services): integrate solvers, data I/O
     • Tricks (algorithms & data): refine the hybrid strategy, heuristics + machine learning
     • Analysis (evaluation): evaluation with humans, for each solver and for the hybrid system
  66. OpenHalo (diagram: AURA QA, CYC QA, SILK, TEQA and other QA systems collaborating with the Vulcan hybrid system through a shared data service)
