Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

QUANT-Question Answering Benchmark Curator

35 views

Published on

These slides were presented at SEMANTiCS 2019 in Karlsruhe, Germany. The corresponding paper can be found here: https://link.springer.com/chapter/10.1007/978-3-030-33220-4_25

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

QUANT-Question Answering Benchmark Curator

  1. 1. QUANT-Question Answering Benchmark Curator Ria Hari Gusmita, Rricha Jalota, Daniel Vollmers, Jan Reineke, Axel-Cyrille Ngonga Ngomo, and Ricardo Usbeck March 24, 2020 Gusmita et al QUANT March 24, 2020 1 / 33
  2. 2. Outline 1 Motivation 2 Approach 3 Evaluation 4 QALD-specific Analysis 5 Conclusion & Future Work Gusmita et al QUANT March 24, 2020 2 / 33
  3. 3. Motivation Drawback in evaluating Question Answering systems over knowledge bases Mainly based on benchmark datasets (benchmarks) Challenge in maintaining high-quality and benchmarks Gusmita et al QUANT March 24, 2020 3 / 33
  4. 4. Motivation Challenge in maintaining high-quality and benchmarks Change of the underlying knowledge base DBpedia 2016-04 DBpedia 2016-10 http://dbpedia.org/resource/Surfing http://dbpedia.org/resource/Surfer http://dbpedia.org/ontology/seatingCapacity http://dbpedia.org/property/capacity http://dbpedia.org/property/portrayer http://dbpedia.org/ontology/portrayer http://dbpedia.org/property/establishedDate http://dbpedia.org/ontology/foundingDate Gusmita et al QUANT March 24, 2020 4 / 33
  5. 5. Motivation Challenge in maintaining high-quality and benchmarks Metadata annotation errors Gusmita et al QUANT March 24, 2020 5 / 33
  6. 6. Motivation Degradation QALD benchmarks against various versions of DBpedia Gusmita et al QUANT March 24, 2020 6 / 33
  7. 7. Contribution QUANT, a framework for the intelligent creation and curation of QA benchmarks Definition Given B, D, and Q as benchmark, dataset, and questions respectively S represents QUANT’s suggestions ith version of a QA benchmark Bi as a pair (Di , Qi ) Given a query qij ∈ Qi with zero results on Dk with k > i S : qij −→ qij QUANT aims to ensure that queries from Bi can be reused for Bk to speed up the curation process as compared to the existing one Gusmita et al QUANT March 24, 2020 7 / 33
  8. 8. What QUANT supports 1 Creation of SPARQL queries 2 The validity of benchmark metadata 3 Spelling and grammatical correctness of questions Gusmita et al QUANT March 24, 2020 8 / 33
  9. 9. Approach Architecture Gusmita et al QUANT March 24, 2020 9 / 33
  10. 10. Approach Smart suggestions 1 SPARQL suggestion 2 Metadata suggestion 3 Multilingual Questions and Keywords Suggestion Gusmita et al QUANT March 24, 2020 10 / 33
  11. 11. Smart suggestion 1. How SPARQL suggestion module works Gusmita et al QUANT March 24, 2020 11 / 33
  12. 12. 1. SPARQL suggestion Missing prefix The original SPARQL query SELECT ? s WHERE { r e s : New_Delhi dbo : country ? s . } Gusmita et al QUANT March 24, 2020 12 / 33
  13. 13. 1. SPARQL suggestion Missing prefix The original SPARQL query SELECT ? s WHERE { r e s : New_Delhi dbo : country ? s . } The suggested SPARQL query PREFIX dbo : <http :// dbpedia . org / ontology/> PREFIX r e s : <http :// dbpedia . org / r e s o u r c e/> SELECT ? s WHERE { r e s : New_Delhi dbo : country ? s . } Gusmita et al QUANT March 24, 2020 13 / 33
  14. 14. 1. SPARQL suggestion Predicate change The original SPARQL query SELECT ? date WHERE { ? website r d f : type onto : Software . ? website onto : r e l e a s e D a t e ? date . ? website r d f s : l a b e l "DBpedia" . } Gusmita et al QUANT March 24, 2020 14 / 33
  15. 15. 1. SPARQL suggestion Predicate change The suggested SPARQL query SELECT ? date WHERE { ? website r d f : type onto : Software . ? website r d f s : l a b e l "DBpedia" . ? website dbp : l a t e s t R e l e a s e D a t e ? date . } Gusmita et al QUANT March 24, 2020 15 / 33
  16. 16. 1. SPARQL suggestion Predicate missing The original SPARQL query SELECT ? u r i WHERE { ? s u b j e c t r d f s : l a b e l "Tom␣Hanks" . ? s u b j e c t f o a f : homepage ? u r i } Gusmita et al QUANT March 24, 2020 16 / 33
  17. 17. 1. SPARQL suggestion Predicate missing The original SPARQL query SELECT ? u r i WHERE { ? s u b j e c t r d f s : l a b e l "Tom␣Hanks" . ? s u b j e c t f o a f : homepage ? u r i } The suggested SPARQL query The predicate foaf:homepage is missing in ?subject foaf:homepage ?uri Gusmita et al QUANT March 24, 2020 17 / 33
  18. 18. 1. SPARQL suggestion Entity change The original SPARQL query SELECT ? u r i WHERE { ? u r i r d f : type yago : C ap it als In Eu ro pe } Gusmita et al QUANT March 24, 2020 18 / 33
  19. 19. 1. SPARQL suggestion Entity change The original SPARQL query SELECT ? u r i WHERE { ? u r i r d f : type yago : C ap it als In Eu ro pe } The suggested SPARQL query SELECT ? u r i WHERE { ? u r i r d f : type yago : WikicatCapitalsInEurope } Gusmita et al QUANT March 24, 2020 19 / 33
  20. 20. 2. Metadata suggestion Gusmita et al QUANT March 24, 2020 20 / 33
  21. 21. 3. Multilingual questions and keywords suggestion Question with missing keywords and translations Gusmita et al QUANT March 24, 2020 21 / 33
  22. 22. 3. Multilingual questions and keywords suggestion Generated keywords: state, united, states, america, highest, density Utilizing Trans Shell tool→Generated keywords translations suggestion Gusmita et al QUANT March 24, 2020 22 / 33
  23. 23. 3. Multilingual questions and keywords suggestion Suggested Question Translations Gusmita et al QUANT March 24, 2020 23 / 33
  24. 24. Evaluation Three goals of the evaluation: 1 QUANT vs manual curation Graduate students curated 50 questions using QUANT and another 50-question manually 23 minutes vs 278 minutes 2 Effectiveness of smart suggestions 10 expert users got involved in creating a new joint benchmark, called QALD-9, with 653 questions 3 QUANT’s capability to provide a high-quality benchmark dataset The inter-rater agreement between each two users amounts up to 0.83 on average Group Inter-rater Agreement 1st Two-Users 0.97 2nd Two-Users 0.72 3rd Two-Users 0.88 4th Two-Users 0.77 5th Two-Users 0.96 Average 0.83 Gusmita et al QUANT March 24, 2020 24 / 33
  25. 25. Evaluation Users acceptance rate in % User 1 User 2 User 3 User 4 User 5 User 6 User 7 User 8 User 9 User 10 List of users 0 10 20 30 40 50 60 70 80 90 100 Acceptanceratein% acceptance rate per user QUANT provided 2380 suggestions and user acceptance rate on average is 81% The top 4 acceptance-rate are for QALD-7 and QALD-8 Gusmita et al QUANT March 24, 2020 25 / 33
  26. 26. Evaluation Number of accepted suggestions from all users User 1 User 2 User 3 User 4 User 5 User 6 User 7 User 8 User 9 User 10 List of users 0 100 200 300 400 500 Numberofacceptedsuggestion SPARQL Query Question Translations Out of Scope Onlydbo Keywords Translations Hybrid Answer Type Aggregation Most users accepted suggestion for out-of-scope metadata Keyword and question translation suggestions yielded the second and third highest acceptance rates. Gusmita et al QUANT March 24, 2020 26 / 33
  27. 27. Evaluation Number of users who accepted QUANT’s suggestions for each question’s attribute. Aggregation Answer Type Hybrid Keywords Translations Only Dbo Out of Scope Question Translations SPARQL Query Name of attributes 0 10 20 30 40 50 60 70 80 90 100 110 Numberofusersacceptedsuggestionin% Percentage 83.75% of the users accepted QUANT’s smart suggestions on average Hybrid and SPARQL suggestions were only accepted by 2 and 5 users respectively. Gusmita et al QUANT March 24, 2020 27 / 33
  28. 28. Evaluation Number of suggestions provided by users User 1 User 2 User 3 User 4 User 5 User 6 User 7 User 8 User 9 User 10 List of users 0 10 20 30 40 Numberofprovidedsuggestions SPARQL Query Question Translations Out of Scope Onlydbo Keywords Translations Hybrid Answer Type Aggregation Answer type, onlydbo, out-of-scope, and SPARQL query metadata were attributes whose value redefined by users Gusmita et al QUANT March 24, 2020 28 / 33
  29. 29. QALD-specific Analysis There are 1924 questions where 1442 questions are training data and 482 questions are test data Gusmita et al QUANT March 24, 2020 29 / 33
  30. 30. QALD-specific Analysis Duplication removal resulted 655 unique questions Removing 2 semantically similar questions produced 653 questions Using QUANT with 10 expert users, we got 558 total benchmark questions → increase QALD-8 size by 110.6% The new benchmark formed QALD-9 dataset Distribution of unique questions in all QALD versions Gusmita et al QUANT March 24, 2020 30 / 33
  31. 31. Conclusion QUANT’s evaluation highlights the need for better datasets and their maintenance QUANT speeds up the curation process by up to 91%. Smart suggestions motivate users to engage in more attribute corrections than if there were no hints Gusmita et al QUANT March 24, 2020 31 / 33
  32. 32. Future Work There is a need to invest more time into SPARQL suggestions as only 5 users accepted them We plan to support more file formats based on our internal library Gusmita et al QUANT March 24, 2020 32 / 33
  33. 33. Thank you for your attention! Ria Hari Gusmita ria.hari.gusmita@uni-paderborn.de https://github.com/dice-group/QUANT DICE Group at Paderborn University https: //dice-research.org/team/profiles/gusmita/ Gusmita et al QUANT March 24, 2020 33 / 33

×