NLP & DBpedia


Using BabelNet in Bridging the Gap Between Natural Language Queries and Linked Data Concepts

Published in: Education, Technology

  1. Using BabelNet in Bridging the Gap Between Natural Language Queries and Linked Data Concepts
     Khadija Elbedweihy, Stuart N. Wrigley, Fabio Ciravegna and Ziqi Zhang
     OAK Research Group, Department of Computer Science, University of Sheffield, UK
  2. Outline
     • Motivation and Problem Statement
     • Natural Language Query Approach
     • Approach Steps
     • Evaluation
     • Results and Discussion
  3. Motivation – Semantic Search
     • Wikipedia states that Semantic Search “seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results”.
     • Semantic search evaluations have reported that users prefer free natural language as a query approach (simple, fast and flexible) over controlled or view-based inputs.
  4. Problem Statement
     • Complete freedom makes it harder to match query terms with the underlying data and ontologies.
     • Word sense disambiguation (WSD) is core to the solution. Question: “How tall is ..... ?” (property: height)
       – tall is polysemous and should first be disambiguated:
         – great in vertical dimension: tall people, tall buildings, etc.
         – too improbable to admit of belief: a tall story, …
     • Another difficulty: Named Entity (NE) recognition and disambiguation.
  5. Approach
     • Free-NL semantic search approach, matching user query terms with the underlying ontology using:
       1) An extended-Lesk WSD approach.
       2) A NE recogniser.
       3) A set of advanced string similarity algorithms and ontology-based heuristics to match disambiguated query terms to ontology concepts and properties.
  6. Extended-Lesk WSD approach
     • WordNet is the predominant sense inventory, but its fine granularity makes high WSD performance hard to achieve.
     • BabelNet is a very large multilingual ontology with wide coverage, obtained from both WordNet and Wikipedia.
     • For disambiguation, the bags of words are extended with the senses’ glosses and different lexical and semantic relations.
     • These include synonyms, hyponyms, hypernyms, and the attribute, see also and similar to relations.
  7. Extended-Lesk WSD approach
     • Information added from a Wikipedia page (W) mapped to a WordNet synset includes:
       1. labels; page “Play (theatre)” → add play and theatre
       2. set of pages redirecting to W; Playlet redirects to Play
       3. set of pages linked from W; links in the page Play (theatre) include literature, comedy, etc.
     • Synonyms of a synset S associated with Wikipedia page W: the WordNet synonyms of S plus the lemmas of the Wikipedia information of W.
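The bag extension described on the two slides above can be illustrated with a minimal Lesk-style sketch. It is not the authors' implementation: the toy sense inventory `SENSES`, its glosses and the `STOPWORDS` list are made up for illustration, and no BabelNet or WordNet API is called. Each sense's bag is built from its gloss plus synonyms, hypernyms and the three kinds of Wikipedia information, and the sense whose bag overlaps most with the query context is chosen.

```python
import re

# Hypothetical stopword list, only large enough for this toy example.
STOPWORDS = {"a", "by", "on", "the", "is", "was", "which", "for", "that", "than", "of", "in"}

def tokens(text):
    """Lower-case word tokens of a string as a set, without stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

# Toy sense inventory (assumption): two senses of "play" with invented glosses.
SENSES = {
    "play (theatre)": {
        "gloss": "a dramatic work intended for performance by actors on a stage",
        "synonyms": ["drama"],
        "hypernyms": ["dramatic composition"],
        "wiki_labels": ["play", "theatre"],   # from the page title
        "wiki_redirects": ["playlet"],        # pages redirecting to W
        "wiki_links": ["literature", "comedy"],  # pages linked from W
    },
    "play (activity)": {
        "gloss": "activity by children that is guided by imagination rather than fixed rules",
        "synonyms": ["child's play"],
        "hypernyms": ["diversion", "recreation"],
        "wiki_labels": ["play", "activity"],
        "wiki_redirects": [],
        "wiki_links": ["game", "toy"],
    },
}

def sense_bag(entry):
    """Extended bag: gloss tokens plus tokens of every related phrase."""
    bag = tokens(entry["gloss"])
    for key in ("synonyms", "hypernyms", "wiki_labels", "wiki_redirects", "wiki_links"):
        for phrase in entry[key]:
            bag |= tokens(phrase)
    return bag

def disambiguate(target, context, senses):
    """Return the sense with the largest bag/context overlap, plus all scores."""
    context_bag = tokens(context) - {target}
    scores = {name: len(sense_bag(entry) & context_bag) for name, entry in senses.items()}
    return max(scores, key=scores.get), scores

best, scores = disambiguate(
    "play", "Which comedy play was performed on stage by famous actors?", SENSES
)
print(best, scores)
```

In this toy run the theatre sense wins because "comedy" (a linked page), "stage" and "actors" (gloss words) all appear in the query context.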
  8. Extended-Lesk WSD approach
     Feature                                              P (%)   R (%)   F1 (%)
     Baseline                                             58.09   57.98   58.03
     Synonyms                                             59.14   59.03   59.09
     Syn + hypo                                           62.16   62.07   62.12
     Syn + gloss examples (WN)                            61.97   61.86   61.92
     Syn + gloss examples (Wiki)                          61.14   61.02   61.08
     Syn + gloss examples (WN + Wiki)                     60.21   60.10   60.16
     Syn + hyper                                          60.36   60.26   60.31
     Syn + semRel                                         59.65   59.54   59.59
     Syn + hypo + gloss(WN)                               64.92   64.81   64.86
     Syn + hypo + gloss(WN) + hyper                       65.28   65.18   65.23
     Syn + hypo + gloss(WN) + hyper + semRel              65.45   65.33   65.39
     Syn + hypo + gloss(WN) + hyper + semRel + relGlosses 69.76   69.66   69.71
     • Sentences with fewer than seven words: F-measure of 81.34%
  9. Approach – Steps
     1. Recognition and disambiguation of Named Entities.
     2. Parsing and disambiguation of the NL query.
     3. Matching query terms with ontology concepts and properties.
     4. Generation of candidate triples.
     5. Integration of triples and generation of SPARQL queries.
  10. Step 1: Recognition and disambiguation of Named Entities
     • Named entities are recognised using AlchemyAPI.
     • AlchemyAPI had the best recognition performance in the NERD evaluation of state-of-the-art NE recognizers.
     • AlchemyAPI, however, exhibits poor disambiguation performance.
     • Each NE is therefore disambiguated using our BabelNet-based WSD approach.
  11. Step 1: Recognition and disambiguation of Named Entities
     • Example: “In which country does the Nile start?”
     • Matches of Nile in BabelNet include:
       – Nile (river)
       – Nile (singer)
       – Nile (TV series)
       – Nile (band)
     • Match selected (Nile: river): this sense shares more terms with the query (geography, area, culture, continent) than the other senses do.
  12. Step 2: Parsing and disambiguation of the NL query
     • The Stanford Parser is used to gather lemmas and POS tags.
     • Proper nouns identified by the parser but not recognized by AlchemyAPI are disambiguated and added to the recognized entities.
     • Example: “In which country does the Nile start?”
       – The algorithm does not miss the entity Nile, although it was not recognized by AlchemyAPI.
  13. Step 2: Parsing and disambiguation of the NL query
     • Example: “Which software has been developed by organizations founded in California?”
       Output:
       Word           Lemma       POS   Position
       software       software    NP    1
       developed      develop     VBN   2
       organizations  organize    NNS   3
       founded        find        VBN   4
       California     California  NP    5
     • Equivalent output is generated using keywords or phrases.
  14. Step 3: Matching Query Terms with Ontology Concepts & Properties
     • Noun phrases, nouns and adjectives are matched with concepts and properties.
     • Verbs are matched only with properties.
     • Candidate ontology matches are ordered using the Jaro-Winkler and Double Metaphone string similarity algorithms.
     • The Jaro-Winkler threshold for accepting a match is set to 0.791, shown in the literature to be the best threshold value.
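As a rough illustration of the thresholded ranking above, the sketch below implements the standard Jaro-Winkler similarity in plain Python and keeps only candidates scoring at least 0.791. The function names, the lower-casing step and the candidate labels are assumptions made for this example; Double Metaphone, which the slide also uses for ordering, is left out.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted by a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)

def rank_matches(term, labels, threshold=0.791):
    """Candidates at or above the threshold, best first (case-insensitive)."""
    scored = [(label, jaro_winkler(term.lower(), label.lower())) for label in labels]
    accepted = [pair for pair in scored if pair[1] >= threshold]
    return sorted(accepted, key=lambda pair: pair[1], reverse=True)

# Hypothetical candidate labels for the lemma "create".
for label, score in rank_matches("create", ["birthPlace", "creator"]):
    print(f"{label}: {score:.3f}")
```

Here "creator" clears the threshold for "create", while "birthPlace" falls well below it and is dropped.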
  15. Step 3: Matching Query Terms with Ontology Concepts & Properties
     • The matching process uses the following, in order:
       1. query term (e.g., created)
       2. lemma (e.g., create)
       3. derivationally related forms (e.g., creator)
     • If there are no matches, disambiguate the query term and use expansion terms, in order:
       1. synonyms
       2. hyponyms
       3. hypernyms
       4. semantic relations (e.g., height as an attribute for tall)
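The ordering on this slide amounts to a fallback cascade, sketched below. Everything here is illustrative: `ONTOLOGY`, `lookup` and `match_term` are hypothetical names, the property lists are toy data, and an exact dictionary lookup stands in for the string-similarity matching the approach actually uses.

```python
# Toy ontology index (assumption): label -> matching properties.
ONTOLOGY = {
    "creator": ["dbo:creator", "dbp:creator"],
    "height": ["dbo:height"],
}

def lookup(word):
    """Stand-in for the real similarity-based ontology matching."""
    return ONTOLOGY.get(word, [])

def match_term(term, lemma, derived_forms, expansions, ontology_lookup):
    """Try surface forms first, then expansion terms in the slide's order.

    Returns (form_that_matched, matches), or (None, []) if nothing matches.
    """
    # Stage 1: query term, then lemma, then derivationally related forms.
    for candidate in [term, lemma, *derived_forms]:
        hits = ontology_lookup(candidate)
        if hits:
            return candidate, hits
    # Stage 2: expansion terms obtained after disambiguation.
    for kind in ("synonyms", "hyponyms", "hypernyms", "semantic_relations"):
        for candidate in expansions.get(kind, []):
            hits = ontology_lookup(candidate)
            if hits:
                return candidate, hits
    return None, []

print(match_term("created", "create", ["creator"], {}, lookup))
print(match_term("tall", "tall", [],
                 {"synonyms": ["high"], "semantic_relations": ["height"]}, lookup))
```

The first call matches through the derivationally related form "creator"; the second finds nothing for "tall" or its synonym and falls through to the attribute relation "height".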
  16. Step 4: Generation of Candidate Query Triples
     • The structure of the ontology (taxonomy of classes, and domain and range of properties) is used to link matched concepts and properties and recognized entities to generate query triples.
     Three-Terms Rule
     • Every three consecutive terms are matched against a set of templates.
       E.g., “Which television shows were created by Walt Disney?”
     • Template (concept-property-instance) generates the triples:
       ?television_show <dbo:creator> <res:Walt_Disney>
       ?television_show <dbp:creator> <res:Walt_Disney>
       ?television_show <dbo:creativeDirector> <res:Walt_Disney>
  17. Three-Terms Rule
     Examples of templates used in the three-terms rule:
     • concept-property-instance
       – airports located in California
       – actors born in Germany
     • instance-property-instance
       – Was Natalie Portman born in the United States?
     • property-concept-instance
       – birthdays of actors of television show Charmed
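A minimal sketch of how a three-terms template could turn matched terms into candidate triples, reproducing the Walt Disney example from the slides. The `(kind, value)` input format and the function name `generate_triples` are assumptions; only two of the templates are covered, and the triples keep the slides' `<prefix:name>` shorthand.

```python
def generate_triples(terms):
    """Map three consecutive matched terms to candidate triples.

    `terms` is a list of (kind, value) pairs; a property's value is the
    list of all ontology properties matched for that query term.
    Returns [] when no template fits the sequence of kinds.
    """
    kinds = tuple(kind for kind, _ in terms)
    values = [value for _, value in terms]
    if kinds == ("concept", "property", "instance"):
        concept, properties, instance = values
        return [f"?{concept} <{p}> <{instance}>" for p in properties]
    if kinds == ("instance", "property", "instance"):
        subject, properties, obj = values
        return [f"<{subject}> <{p}> <{obj}>" for p in properties]
    return []

triples = generate_triples([
    ("concept", "television_show"),
    ("property", ["dbo:creator", "dbp:creator", "dbo:creativeDirector"]),
    ("instance", "res:Walt_Disney"),
])
for triple in triples:
    print(triple)
```

One triple is produced per matched property, which is why a single query term can yield several alternative triples.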
  18. Two-Terms Rule
     The Two-Terms Rule is used when:
       1) There are fewer than three derived terms.
       2) There is no match between the query terms and a three-terms template.
       3) The matched template did not generate candidate triples.
     E.g., “In which films directed by Garry Marshall was Julia Roberts starring?”
     <Garry Marshall, Julia Roberts, starring>: matched to a three-terms template but does not generate triples.
  19. Two-Terms Rule
     Question: “What is the area code of Berlin?”
     • Template (property-instance) generates the triples:
       <res:Berlin> <dbp:areaCode> ?area_code
       <res:Berlin> <dbo:areaCode> ?area_code
  20. Comparatives
     Scenarios:
     1) Comparative used with a numeric datatype property, e.g., “companies with more than 500,000 employees”:
       ?company <dbp:numEmployees> ?employee
       ?company <dbp:numberOfEmployees> ?employee
       ?company a <dbo:Company>
       FILTER (?employee > 500000)
  21. Comparatives
     2) Comparative used with a concept, e.g., “places with more than 2 caves”:
     • Generate the same triples as for places with caves:
       ?place a <>.
       ?cave a <>.
       ?place ?rel1 ?cave.
       ?cave ?rel1 ?place.
     • Add the aggregate restriction:
       GROUP BY ?place HAVING (COUNT(?cave) > 2)
  22. Comparatives
     3) Comparative used with an object property, e.g., “countries with more than 2 official languages”:
     • Similarly, generate the same triples for country and official language and add the restriction:
       GROUP BY ?country HAVING (COUNT(?official_language) > 2)
     4) Generic comparatives, e.g., “Which mountains are higher than the Nanga Parbat?”
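The first and third comparative scenarios could be assembled into query text roughly as follows. This is a sketch under assumptions: the function names are hypothetical, `dbo:officialLanguage` is an illustrative property choice rather than one given in the slides, and the output keeps the slides' `<prefix:name>` shorthand, so prefixes would need expanding to full IRIs before the query is sent to an endpoint.

```python
def numeric_comparative(var, type_uri, property_uri, value_var, op, number):
    """Scenario 1: compare a numeric datatype property with a constant."""
    return (
        f"SELECT DISTINCT ?{var} WHERE {{\n"
        f"  ?{var} a <{type_uri}> .\n"
        f"  ?{var} <{property_uri}> ?{value_var} .\n"
        f"  FILTER (?{value_var} {op} {number})\n"
        f"}}"
    )

def count_comparative(group_var, counted_var, triples, op, number):
    """Scenarios 2 and 3: count related resources per grouped variable."""
    body = "\n".join(f"  {triple} ." for triple in triples)
    return (
        f"SELECT ?{group_var} WHERE {{\n{body}\n}}\n"
        f"GROUP BY ?{group_var}\n"
        f"HAVING (COUNT(?{counted_var}) {op} {number})"
    )

# "companies with more than 500,000 employees"
print(numeric_comparative("company", "dbo:Company", "dbp:numberOfEmployees",
                          "employee", ">", 500000))

# "countries with more than 2 official languages"
print(count_comparative(
    "country", "official_language",
    ["?country <dbo:officialLanguage> ?official_language"],  # assumed property
    ">", 2,
))
```

The numeric case needs only a FILTER inside the WHERE clause, whereas the counting cases need the aggregate restriction after it, which is why the two are built separately.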
  23. Generic Comparatives
     • Difficulty: identifying the property referred to by the comparative term.
     1) Select the best relation according to the query context.
       – Identify all numeric datatype properties associated with the concept “mountain”, including “latS, longD, prominence, firstAscent, elevation, longM, …”
     2) Disambiguate the synsets of all properties and use the WSD approach to identify the synset most related to the query.
       – The property elevation is correctly selected.
  24. Step 5: Integration of Triples and Generation of SPARQL Queries
     • The generated triples are integrated to produce a SPARQL query.
     • Query term positions are used to order the generated triples.
     • Triples originating from the same query term are executed in order until an answer is found.
     • Duplicates are removed while merging the triples.
     • SELECT and WHERE clauses are added, along with any aggregate restrictions or solution modifiers.
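One possible reading of the integration step is sketched below: triples are grouped by the position of the query term they came from, alternatives for the same term yield separate candidate queries, and candidates are tried in order until one returns results. The function names, the `dbo:developer` property and the fake executor are assumptions for illustration; combining alternatives with a Cartesian product is also an assumption, since the slides do not spell out how alternatives are combined.

```python
from itertools import product

def candidate_queries(target_var, alternatives_by_position):
    """Yield one SELECT query per combination of alternative triples.

    `alternatives_by_position` maps a query-term position to the list of
    alternative triples generated for that term.
    """
    groups = [alternatives_by_position[pos] for pos in sorted(alternatives_by_position)]
    for combo in product(*groups):
        triples = list(dict.fromkeys(combo))  # drop duplicates, keep order
        body = " .\n  ".join(triples)
        yield f"SELECT DISTINCT ?{target_var} WHERE {{\n  {body} .\n}}"

def first_answer(queries, execute):
    """Run queries in order and return the first one with results."""
    for query in queries:
        results = execute(query)
        if results:
            return query, results
    return None, []

# "Which software has been developed by organizations founded in California?"
alternatives = {
    2: ["?software <dbo:developer> ?organization"],  # assumed property
    4: ["?organization <dbo:foundation> <res:California>",
        "?organization <dbo:foundationPlace> <res:California>"],
}

def fake_execute(query):
    """Stand-in for a SPARQL endpoint: only foundationPlace has answers."""
    return ["res:Some_Software"] if "foundationPlace" in query else []

query, results = first_answer(candidate_queries("software", alternatives), fake_execute)
print(query)
print(results)
```

In this toy run the first candidate (using `foundation`) returns nothing, so the second candidate (using `foundationPlace`) is the one that produces the answer.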
  25. Evaluation
     • Test data from the 2nd Open Challenge at QALD-2.
     • Results produced by the QALD-2 evaluation tool.
     • Very promising results: 76% of the questions answered were answered correctly (41 of 54).
     Approach     Answered  Correct  Precision  Recall  F1
     BELA               31       17       0.62    0.73  0.67
     QAKiS              35       11       0.39    0.37  0.38
     Alexandria         25        5       0.43    0.46  0.45
     SenseAware         54       41       0.51    0.53  0.52
     SemSeK             80       32       0.44    0.48  0.46
     MHE                97       30       0.36    0.40  0.38
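As a quick sanity check on the table, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall, and the share of answered questions that were correct. It is not the QALD-2 evaluation tool; since the inputs are already rounded to two decimals, a recomputed F1 may differ from the reported one by up to about 0.01.

```python
# (approach, answered, correct, precision, recall, reported F1), from the table.
ROWS = [
    ("BELA", 31, 17, 0.62, 0.73, 0.67),
    ("QAKiS", 35, 11, 0.39, 0.37, 0.38),
    ("Alexandria", 25, 5, 0.43, 0.46, 0.45),
    ("SenseAware", 54, 41, 0.51, 0.53, 0.52),
    ("SemSeK", 80, 32, 0.44, 0.48, 0.46),
    ("MHE", 97, 30, 0.36, 0.40, 0.38),
]

def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

for name, answered, correct, p, r, reported in ROWS:
    share = correct / answered
    print(f"{name:<11} F1={f1(p, r):.3f} (reported {reported:.2f}), "
          f"correct/answered={share:.1%}")
```

For SenseAware, 41 correct out of 54 answered gives roughly 76%, which is the figure quoted on the slide.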
  26. Discussion
     • Design choices are affected by the priority given to precision or recall:
     1. Query Relaxation
       e.g., “Give me all actors starring in Last Action Hero”
       – Restricting results to actors harms recall.
       – Not all entities in LD are typed, let alone correctly typed.
       – Query relaxation favors recall but affects precision.
       e.g., “How many films did Leonardo DiCaprio star in?”
       – Relaxation returns TV series as well as films, such as res:Parenthood (1990 TV series).
     • Decision: favor precision; keep the restriction when specified.
  27. Discussion
     2. Best or All Matches
       e.g., “software by organizations founded in California”
       – Properties matched: foundation and foundationPlace
       – Using only the best match (foundation) does not generate all results → affects recall.
       – Using all properties (some may not be relevant to the query) would harm precision.
     • Decision: use all matches, with a high value for the similarity threshold, and perform checks against the ontology structure to ensure that only relevant matches are used.
  28. Discussion
     3. Query Expansion
     • Can be useful for recall when the query term alone is not sufficient to return all answers.
     • Example: use both “website” and “homepage” if either of them is used in a query and both have matches in the ontology.
     • The quality of expansion terms is influenced by the WSD approach; wrong sense identification leads to a noisy list of terms.
     • Decision: perform query expansion only when no matches are found in the ontology for a term, or when no results are generated using the identified matches.
  29. Questions?
  30. Additional Slides