Hybrid system architecture overview

This is the deck for the Science Advisory Board review of our recent progress in setting up the basic infrastructure: a hybrid system architecture to facilitate automatic question answering in Project Halo, Vulcan's long-range strong AI effort to attack a key problem in AI research.

Speaker note: We have been debating whether it is necessary to evaluate a separate Information Retrieval module for comparison purposes: see how well an Information Retrieval-based module can do as a baseline, and how much better we can do on top of it, i.e. our value added.

    1. Overview of Hybrid Architecture in Project Halo. Jesse Wang, Peter Clark. March 18, 2013
    2. Status of Hybrid Architecture: Goals, Modularity, Dispatcher, Evaluation
    3. Hybrid System Near-Term Goals
       • Set up the infrastructure to communicate with existing reasoners (AURA, CYC, TEQA)
       • Reliably dispatch questions and collect answers
       • Create related tools and resources: question generation/selection, answer evaluation, report analysis, etc.
       • Experiment with ways to choose answers from the available reasoners, acting as a hybrid solver
       (Diagram: a Dispatcher connected to the AURA, CYC, and TEQA reasoners)
    4. Focus Areas of Hybrid Framework (until mid-2013)
       • Modularity: loose coupling, high cohesion, data exchange protocols
       • Dispatching: send requests and handle the responses
       • Evaluation: ability to get ratings on answers, and report results
    5. Hybrid System Core Components
       (Architecture diagram: direct QA and filtered question sets from the Campbell textbook flowing to AURA, CYC, TEQA, and IR; a yellow outline marks new or updated components)
       • SQs: suggested questions
       • SQA: QA with suggested questions
       • TEQA: Textual Entailment QA
       • IR: Information Retrieval
    6. Infrastructure: Dispatchers
       (Diagram: the Dispatcher routes Live Single QA, Suggested QA, and Batch QA requests to the CYC, TEQA, AURA, and IR reasoners)
    7. Dispatcher Features
       • Asynchronous batch mode and single/experiment mode
       • Parallel dispatching to reasoners
         o Very functional UI: live progress indicator, view question file, logs
         o Exception and error handling: retry a question when the server is busy
       • Batch service can continue to finish even if the client dies
         o Cancel/stop of the batch process is also available
       • Input and output support both XML and CSV/TSV formats
         o Pipeline support: accepts Question-Selector input
       • Configurable dispatchers, select reasoners
         o Collect answers and compute basic statistics
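The parallel dispatching and retry behavior listed above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the actual Project Halo dispatcher: the reasoner names, endpoint URLs, the ask() stub, and the result fields are all hypothetical.

```python
import concurrent.futures
import time

# Hypothetical reasoner endpoints; the names, URLs, and ask() signature are
# illustrative assumptions, not the actual Project Halo interfaces.
REASONERS = {
    "AURA": "http://aura.example/qa",
    "CYC": "http://cyc.example/qa",
    "TEQA": "http://teqa.example/qa",
}

def ask(reasoner_url, question, retries=3, delay=5.0):
    """Send one question to one reasoner service, retrying when it is busy."""
    for _ in range(retries):
        try:
            # Placeholder for the real HTTP/XML request to the reasoner service.
            return {"answer": f"stub answer from {reasoner_url}", "status": "ok"}
        except IOError:
            time.sleep(delay)  # server busy: wait, then retry
    return {"answer": None, "status": "failed"}

def dispatch(question):
    """Send the question to all reasoners in parallel and collect the answers."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(ask, url, question)
                   for name, url in REASONERS.items()}
        return {name: future.result() for name, future in futures.items()}

if __name__ == "__main__":
    print(dispatch("What does ribosome make?"))
```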
    8. Question-Answering via Suggested Questions
       • Similar features to Live/Direct QA
       • Aggregate the suggested questions' answers as a solver
       • Unique features:
         o Interactively browse the suggested questions database
         o Filter on certain facets
         o Use Q/A concepts, question types, etc. to improve relevance
         o Automatic comparison of filtered and non-filtered results by chapter
    9. Question and Answer Handling
       • Handling and parsing the reasoner's returned results
         o Customized programming
       • Information on execution: details and summary
       • Report generation
         o Automatic evaluation
       • Question Selector
         o Support multiple facets/filters
         o Question banks
         o Dynamic UI to pick questions
         o Hidden tags support
    10. Automatic Evaluation: Status as of 2013.3
       • Automatic result evaluation features
         o Web UI/service to use
         o Algorithms to score exact and variable answers: brevity/clarity, relevance (correctness + completeness), overall score
         o Generate reports: summary & details, graph plot
       • Improving evaluation result accuracy
         o Using basic text-processing tricks (stop words, stemming, trigram similarity, etc.), location of answer, length of answer, bio concepts, counts of concepts, chapters referred to, question types, answer type
         o Experiments and analysis (several rounds, work in progress)
       (Chart: user overall rating vs. AutoEval overall score)
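To make the "text-processing tricks" above concrete, here is a minimal sketch of one such lexical score: stop-word removal followed by character-trigram overlap between a returned answer and a gold answer. The tiny stop-word list, the Jaccard measure, and the example strings are illustrative assumptions, not the project's tuned scoring algorithm.

```python
import re

# Illustrative stop-word list; a real system would use a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "are", "in", "and", "that", "at"}

def normalize(text):
    """Lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def trigrams(text):
    """Character trigrams of the (lightly padded) normalized text."""
    padded = f"  {text} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(answer, gold):
    """Jaccard overlap of character trigrams, in [0, 1]."""
    a, g = trigrams(normalize(answer)), trigrams(normalize(gold))
    return len(a & g) / len(a | g) if a | g else 0.0

# Example: score one returned answer against an expected answer text.
print(similarity("A carbon atom can bond to four other atoms",
                 "Carbon can bind at most four atoms at once"))
```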
    11. Hybrid Performance: how we evaluate and how we can improve overall system performance
    12. Caveats: Question Generation and Selection
       • Generated by a small group of SMEs (senior biology students)
       • In natural language, without the textbook (only the syllabus)
    13. Question Set Facets
       (Pie charts: distribution of question types (FIND-A-VALUE, HOW, WHY, WHAT-IS-A, HOW-MANY, HAVE-RELATIONSHIP, WHERE, IS-IT-TRUE-THAT, WHAT-DOES-X-DO, PROPERTY, HAVE-SIMILARITIES, HAVE-DIFFERENCES, X-OR-Y, FUNCTION-OF-X, other) and of questions across chapters 0-12)
    14. Caveat: Evaluation Criteria
       • We provided a clear guideline, but ratings are still subjective
         o A (4.0) = correct, complete answer, no major weakness
         o B (3.0) = correct, complete answer with small cosmetic issues
         o C (2.0) = partially correct or complete answer, with some big issues
         o D (1.0) = somewhat relevant answer or information, or poor presentation
         o F (0.0) = wrong, irrelevant, conflicting, or hard-to-locate answers
       • Only 3 users rated the answers, under a tight timeline
       (Chart: per-user rating preferences for Aura, Cyc, and Text QA)
    15. Evaluation Example. Q: What is the maximum number of different atoms a carbon atom can bind at once?
    16. More Evaluation Samples (Snapshot)
    17. Reasoner Quality Overview
       (Chart: answer counts over rating, from 0.00 to 4.00 in steps of 0.33, for Aura, Cyc, and Text QA)
    18. Performance Numbers
       (Charts: precision, recall, and F1 for Aura, Cyc, and Text QA, computed over all ratings (0..4) and over "good" answers only (rating >= 3.0))
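As a rough illustration of how the precision, recall, and F1 numbers in these charts could be computed from rated answers, here is a sketch using the slide's "good" threshold of 3.0. The record layout and field names are assumptions for illustration.

```python
def prf1(results, total_questions, good=3.0):
    """Precision/recall/F1 where a 'correct' answer is one rated at or above `good`."""
    answered = [r for r in results if r["answer"] is not None]
    correct = [r for r in answered if r["rating"] >= good]
    precision = len(correct) / len(answered) if answered else 0.0
    recall = len(correct) / total_questions if total_questions else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 3 answers returned for a 5-question set, 2 of them rated as good.
demo = [{"answer": "x", "rating": 3.5},
        {"answer": "y", "rating": 1.0},
        {"answer": "z", "rating": 4.0}]
print(prf1(demo, total_questions=5))
```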
    19. Answers Over Question Types
       (Charts: count of answered questions and overall answer rating per question type (FIND-A-VALUE, HOW, HOW-MANY, PROPERTY, WHAT-DOES-X-DO, WHAT-IS-A, X-OR-Y, IS-IT-TRUE-THAT, HAVE-DIFFERENCES, HAVE-SIMILARITIES, HAVE-RELATIONSHIP) for Aura, Cyc, and Text QA)
    20. Answer Distribution Over Chapters
       (Chart: answer quality over chapters 0 and 4-12 for Aura, Cyc, and Text QA; per-reasoner values from the data table: Aura 3.13, 3.67, 1.83, 2.33, 0.58, 1.83, 1.00, 0.50; Cyc 1.75, 2.17, 1.00, 1.67, 3.17, 1.11, 1.83, 2.67; Text QA 2.21, 2.27, 1.23, 2.67, 2.89, 1.20, 1.28, 1.97, 2.06, 2.50)
    21. Answers on Questions with E/V Answer Type
       (Charts: counts and quality of exact (E) vs. various (V) answers for Aura, Cyc, and Text QA)
    22. Improve Performance: Hybrid Solver (Combine!)
       • Random selector (dumbest, baseline)
         o Total questions answered correctly should beat the best solver
       • Priority selector (less dumb)
         o Pick a reasoner following a good order (e.g. Aura > Cyc > Text QA)
         o Expected performance: better than the best individual
       • Trained selector: feature- and rule-based selector (smarter)
         o Decision-tree (CTree…) learning over question type, chapter, …
         o Expected performance: slightly better than the above
       • Theoretical best selector: MAX, the upper limit (smartest)
         o Suppose we can always pick the best-performing reasoner
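A minimal sketch of the selector strategies listed above: the random baseline, the priority selector, and the theoretical MAX oracle that always picks the best-rated answer. The answer-record structure and the example priority order are illustrative assumptions.

```python
import random

# Example priority order taken from the slide; everything else is illustrative.
PRIORITY = ["Aura", "Cyc", "Text QA"]

def random_selector(answers):
    """Baseline: pick uniformly at random among reasoners that answered."""
    answered = [a for a in answers.values() if a["answer"] is not None]
    return random.choice(answered) if answered else None

def priority_selector(answers):
    """Pick the first reasoner in the priority order that produced an answer."""
    for name in PRIORITY:
        candidate = answers.get(name)
        if candidate and candidate["answer"] is not None:
            return candidate
    return None

def max_selector(answers):
    """Theoretical upper bound: an oracle that picks the best-rated answer."""
    answered = [a for a in answers.values() if a["answer"] is not None]
    return max(answered, key=lambda a: a["rating"], default=None)

demo = {"Aura": {"answer": None, "rating": 0.0},
        "Cyc": {"answer": "ATP", "rating": 2.0},
        "Text QA": {"answer": "proteins", "rating": 4.0}}
print(priority_selector(demo), max_selector(demo))
```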
    23. Performance (F1) with Hybrid Solvers
       (Chart: F1 of solvers on good answers (rating >= 3.0) for Aura, Cyc, Text QA, Random, Priority, D-Tree, and Max)
    24. Conclusion
       • Each reasoner has its own strengths and weaknesses
         o Some aspects are not handled well by AURA & CYC
         o Low-hanging fruit: IS-IT-TRUE-THAT for all, WHAT-IS-A for CYC, …
       • Aggregated performance easily beats the best individual (Text QA)
         o The random solver does a good job (F1: mean = 0.609): F1(MAX) - F1(Random) ~ 2.5%
       • Little room for better performance via answer selection
         o F1(MAX) - F1(D-Tree) ~ 0.5%
         o Better to focus on MORE and/or BETTER solvers
    25. Future and Discussions
    26. Near Future Plans
       • Include SQDB-based answers as a "Solver"
         o Helps alleviate question-interpretation problems by reasoners
       • Include Information Retrieval-based answers as a "Solver"
         o Helps understand the extra power reasoners can have over search
       • Improve the evaluation mechanism
       • Extract more features from questions and answers to enable a better solver, and see how close we can get to the upper limit (MAX)
       • Improve the question selector to support multiple sources and automatic update/merge of question metadata
       • Find ways to handle question bank evolution
    27. Further Technical Directions (2013.6+)
       Get more, better reasoners
       Machine learning, evidence combination
       • Extract and use more features to select the best answers
       • Evidence collection and weighting
       Analytics & tuning
       • Easier exploration of individual results and diagnosis of failures
       • Support to tune and optimize performance over target question-answer datasets
       Inter-solver communication
       • Support shared data, shared answers
       • Subgoaling: allow reasoners to call each other for subgoals
    28. Open *Data*
       Requirements
       • Clear semantics, common format (standard), easy to access, persistent (available)
       Data sources
       • Question bank, training sets, knowledge base, protocol for intermediate and final data exchange
       Open Data Access Layer
       • Design and implement protocols and services for data I/O
    29. Open *Services*
       Two categories
       • Pure machine/algorithm based
       • Human computation (social, crowdsourcing)
       Requirements
       • Communicate with open data, generate metadata
       • More reliable, scalable, reusable
       Goal: process and refine data
       • Convert raw, noisy, inaccurate data into refined, structured, useful data
    30. Open *Environment*
       Definition
       • An AI development environment to facilitate collaboration, efficiency, and scalability
       Operation
       • Like a massively multiplayer online game, each "player" gets credits: contribution, resource consumption; interests, loans; ratings…
       Opportunities
       • Self-organized projects, growth potential, encourage collaboration, grand prize
    31. Thank You! And thanks for the opportunity for Q&A. (Backup slides next)
    32. IBM Watson's "DeepQA" Hybrid Architecture
    33. DeepQA Answer Merging and Ranking Module
    34. Wolfram Alpha Hybrid Architecture
       • Data curation
       • Computation
       • Linguistic components
       • Presentation
    35. (No extracted text)
    36. (No extracted text)
    37. Answer Distribution (Density)
       (Chart: count of answers vs. average user rating, 0.00 to 4.00, for Aura, Cyc, and Text QA)
    38. Data Table for Answer Quality Distribution
    39. Work Performed
       • Created web-based dispatcher infrastructure
         o For both Live Direct QA and Live Suggested Questions
         o Batch mode to handle larger amounts
       • Built a web UI for UW students to rate answers to questions (HEF)
         o Coherent UI, duplicate removal, queued tasks
       • Established automatic ways to evaluate and compare results
       • Employed initial file and data exchange formats and protocols
       • Set up a faceted browsing and search (retrieval) UI
         o And web services for 3rd-party consumption
       • Carried out many rounds of relevance studies and analysis
    40. First Evaluation via the Halo Evaluation Framework
       • We sent individual QA result sets to UW students for evaluation
       • First-round hybrid system evaluation:
         o Cyc SQA: 9 best (3 ties), 14 good, 15/60 answered
         o Aura QA: 1 best, 9 good, 14/60 answered
         o Aura SQA: 4 best (3 ties), 7 good, 8/60 answered
         o Text QA: 27 best, 29 good; SQA: 3 best, 5 good, 7/60 answered
         o Best scenario: 41/60 answered
         o Note: Cyc Live was not included
         o * SQA = answering via suggested questions
    41. Live Direct QA Dispatcher Service
       (Screenshot: ask a question, e.g. "What does ribosome make?", wait for answers, answers returned)
    42. Live Suggested QA Dispatcher Service
    43. Batch QA Dispatcher Service
    44. Live Solver Service Dispatchers
    45. Direct Live QA: What does ribosome make?
    46. Direct Live QA: What does ribosome make?
    47. Suggested Questions Dispatcher
    48. Results for Suggested Question Dispatcher
    49. Batch Mode QA Dispatcher
    50. Batch QA Progress Bar
    51. Suggested Questions Database Browser
    52. Faceted Search on Suggested Questions
    53. Tuning the Suggested Question Recommendation
       Accomplished:
       • Indexed the suggested questions database (concept, question, answers)
       • Created a web service for suggested questions
       • Extracted chapter information from answer text (TEXT)
       • Analyzed question types
       • Experimented with some basic retrieval criteria
       Not yet implemented:
       • Parsing the questions
       • More experiments (heuristics) on retrieval/ranking criteria; manual upload of new question sets
       • Get SMEs to generate training data to evaluate against; make evaluation automatic
       • More feature extraction (pattern-based)
    54. Parsing, Indexing and Ranking
       In place:
       • New local concept extraction service
       • Concepts extracted and in the index
       • Both sentences and paragraphs are in the index
       • Basic sentence type identified
       • Chapter and section information in the index
       • Several ways of ranking evaluated
       Not yet implemented:
       • More sentence features: content type (questions, figures, header, regular, review…), previous and next concepts, count of concepts, clauses, universal truth, relevance or not
       • Question parsing
       • More refining of ranking; Learning to Rank??
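A minimal sketch of the kind of per-sentence index record and concept-overlap ranking signal implied by the last two slides. All field names, content-type values, and weights are illustrative assumptions rather than the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SentenceRecord:
    """One indexed textbook sentence with the features named on the slide."""
    text: str
    concepts: List[str] = field(default_factory=list)
    prev_concepts: List[str] = field(default_factory=list)
    next_concepts: List[str] = field(default_factory=list)
    content_type: str = "regular"        # e.g. question, figure, header, review
    chapter: Optional[int] = None
    section: Optional[str] = None

def score(record: SentenceRecord, question_concepts: List[str]) -> float:
    """Toy ranking signal: concept overlap, with a small boost from neighbors."""
    own = len(set(record.concepts) & set(question_concepts))
    near = len(set(record.prev_concepts + record.next_concepts) & set(question_concepts))
    return own + 0.25 * near             # weights are arbitrary placeholders

rec = SentenceRecord("Ribosomes are the sites of protein synthesis.",
                     concepts=["ribosome", "protein synthesis"],
                     chapter=17)
print(score(rec, ["ribosome", "protein"]))
```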
    55. Browse Hybrid System
    56. WIP: Ranking Experiments (Ablation Study)

       Feature                          Only (Easy)   W/O (Easy)   Only (Hard)   W/O (Hard)
       Sentence text                    139/201       -            31/146        -
       Sentence concept                 79/201        -            13/146        -
       Prev/next sentence concept       -             -            -             -
       Locality info (chapter, etc.)    -             -            -             -
       Stopword list                    -             -            -             -
       Stemming comparison              -             -            -             -
       Other features (type…)           -             -            -             -
       Weighting (variations)           -             -            -             -
    57. Automatic Evaluation of IR Results
       • Inexpensive, consistent results for tuning
         o Always using human judgments would be expensive and somewhat inconsistent
       • Quick turnaround
       • With both "easy" and "difficult" question-answer sets
       • Validated by UW students to be trustworthy
         o 95% accuracy on average with a threshold
    58. First UW Students' Evaluation of AutoEval
       • Notation:
         o 0 = right on (100% means right, 0% means wrong)
         o -1 = false positive: we gave it a high score (>50%), but the retrieved text does NOT contain or imply the answer
         o +1 = false negative: we gave it a low score (<50%), but the retrieved text actually DOES contain or imply the answer
       • We gave each of 4 students:
         o 15 questions, 15*5 = 75 sentences and scores to rank
         o 5 of the questions are the same, 10 are unique to each student
         o 23/45 questions from the "hard" set, 22/45 from the "easy" set
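A small sketch of the 0 / -1 / +1 notation defined above, comparing the AutoEval score against a human judgment at the 50% threshold. The function and argument names are illustrative assumptions.

```python
def autoeval_label(auto_score, human_says_contains, threshold=0.5):
    """Return 0 (agreement), -1 (false positive), or +1 (false negative)."""
    auto_says_contains = auto_score > threshold
    if auto_says_contains == human_says_contains:
        return 0    # right on: AutoEval agrees with the human judgment
    if auto_says_contains and not human_says_contains:
        return -1   # false positive: high score, but no answer in the text
    return +1       # false negative: low score, but the text has the answer

# Example: AutoEval scored 0.8, but the student says the answer is not there.
print(autoeval_label(0.8, human_says_contains=False))   # -> -1
```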
    59. Results: Auto-Evaluation Validity Verification
       (Chart: validity of AutoEval against each of the four student raters, at thresholds of 50% and 80%)
    60. The "Easy" QA Set *
       • Task: automatically evaluate whether the retrieved sentences contain the answer
       • Scoring: max score, Mean Average Precision (MAP)
       • Result using Max (with threshold at 80%):
         o 193 regular questions and 8 yes/no questions (via concept overlap)
         o Only with sentence text: 139 (69.2%)
         o Peter's test set: 149 (74.1%)
         o Peter's more refined: 158 (78.6%)
         o (Lower) upper bound for IR: 170 (84.2%)
         o Jesse's best: ??
       * The evaluation is for the IR portion ONLY, no answer pinpointing
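A small sketch of the two scoring schemes named on this slide: a Max-score hit test at a threshold, and Mean Average Precision (MAP) over ranked retrieved sentences. The input format (per-question relevance flags and scores in rank order) is an assumption for illustration.

```python
def max_hit(relevances, scores, threshold=0.8):
    """Max score: the question counts as answered if any retrieved sentence
    scores at or above the threshold and actually contains the answer."""
    return any(rel and score >= threshold for rel, score in zip(relevances, scores))

def average_precision(relevances):
    """Average precision for one ranked list of True/False relevance flags."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(per_question_relevances):
    """MAP: mean of the per-question average precisions."""
    aps = [average_precision(r) for r in per_question_relevances]
    return sum(aps) / len(aps) if aps else 0.0

# Example: two questions, each with retrieved sentences in rank order.
print(max_hit([True, False], [0.9, 0.4]))
print(mean_average_precision([[True, False, True, False, False],
                              [False, False, False, True, False]]))
```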
    61. "Easy" QA Set Auto-Evaluation Result
       (Chart: result scores for Q text Only, Vulcan Basic, Vulcan Refined, BaseIR Current, and Upper Bound)
    62. Best Upper Bound for the Hard Set as of Today
       • With weighting on answer text, answer concepts, question text, and question concepts; matching over sentence text, concepts, concepts from previous and next sentences, and sentence type…
       • Comparison with keyword overlap, concept overlap, stopword removal, and smart stemming techniques…
    63. Sharing the Data and Knowledge
       • Information we want (and each solver may also want):
       • Everyone's results
       • Everyone's confidence in their results
       • Everyone's supporting evidence
         o From textbook sentences, reviews, homework sections, figures…
         o From related web material, e.g. biology Wikipedia
         o From common world knowledge: ParaPara, WordNet, …
       • Training data, for offline use
    64. More Timeline Details for the First Integration
       We are in control:
       • AURA: now
       • Text: before 12/7
       • Vulcan IR baseline: before 12/15
       • Initial hybrid system output: before 12/21, without a unified data format, with limited (possibly outdated) suggested questions
       Partners:
       • Cyc: ? hopefully before EOY 2012
       • JHU: ?? hopefully before EOY 2012
       • ReVerb: ??? EOM January 2013
    65. Rounds of Improvements
       Infrastructure (module & service)
       • Integrate solvers
       • Data I/O
       Tricks (algorithms & data)
       • Refine hybrid strategy
       • Heuristics + machine learning
       Analysis (evaluation)
       • Evaluation with humans
       • With each solver + the hybrid system
    66. OpenHalo
       (Diagram: the Vulcan Hybrid System collaborating through a data service with AURA QA, SILK QA, CYC QA, TEQA, and other QA solvers)