Qedia - Natural Language Queries on DBPedia



Published in: Education, Technology

Qedia – Natural Language Queries on DBPedia

Andreea-Georgiana Zbranca, Diana Andreea Gorea, Lucian Bentea
Faculty of Computer Science, “A.I. Cuza” University, Iași, Romania

Abstract. In this paper we present an application that allows users to query DBPedia through natural language, which is more intuitive than plain SPARQL.

1 Introduction

We present an application that translates natural language phrases conforming to a basic grammar into SPARQL queries, which are then run on the DBPedia knowledge base. To tag the parts of speech of a phrase, we use a lexical analyzer implemented by Ian Barber and available at http://phpir.com/part-of-speech-tagging. Syntactic analysis is performed with respect to a basic grammar that we describe in the following section. The resulting parse tree can also be interpreted as the RDF graph corresponding to the given phrase. We allow three types of phrases as natural language queries, also described below: one that is missing the subject, one that is missing the object and one that is missing both. Based on these categories, we automatically generate the corresponding SPARQL queries and run them on the DBPedia end-point. An additional SPARQL query is used to obtain statistics.

To increase flexibility, the graphical interface has been implemented in two versions: a Web page version using the Zend framework, which requires Apache or a similar local server to be running, and a Desktop version using the PHP-GTK 2 library. To run the queries from within PHP we use the ARC library, which is freely available for download from http://arc.semsol.org/. The results returned by each query are displayed in both tabular and text form, along with other statistics. The main RDF vocabularies used by DBPedia are automatically included with each SPARQL query.

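As an illustration of the automatic inclusion of vocabularies, the following minimal Python sketch prepends a fixed set of PREFIX declarations to a query body. It is not the application's PHP code. The dbpedia2, dbpedia and skos namespaces are the ones that appear in the ARC example of Section 2.4; the default ":" prefix for http://dbpedia.org/resource/ and the helper name with_prefixes are assumptions made for this sketch.

```python
# Namespace prefixes prepended to every generated query.
# dbpedia2, dbpedia and skos appear in the paper's ARC example;
# the default ':' prefix is an assumption of this sketch.
PREFIXES = {
    "": "http://dbpedia.org/resource/",
    "dbpedia2": "http://dbpedia.org/property/",
    "dbpedia": "http://dbpedia.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def with_prefixes(query, prefixes=PREFIXES):
    """Return the query with one PREFIX line per known vocabulary in front."""
    header = "\n".join(f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items())
    return header + "\n" + query


print(with_prefixes("SELECT ?abstract WHERE { :Guitar dbpedia2:abstract ?abstract }"))
```

Keeping the prefixes in one table means a generated query body can use short names such as dbpedia2:abstract without repeating the declarations.
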
2 Parsing a Phrase

2.1 Algorithm

The query is given in natural language. The sentence is transformed into an RDF triple (Subject-Predicate-Object). Once the parts of the sentence have been identified, the natural language query can be transformed into a SPARQL query. As input we get a phrase, and from it we obtain three arrays: nouns (which also holds adjectives and adverbs), parents (corresponding to the parse tree of the grammar) and verbs (the verbs that connect the nouns). The first step is to obtain the parts of speech of the phrase; after that we determine the parts of the sentence and build the three arrays.

To build this parser we first used an algorithm already implemented by Ian Barber. This system uses a corpus whose words are hand-tagged for part of speech. Some examples of tags are: NN for noun, VB for verb, VBD for verb in the past tense, JJ for adjective. From his code we removed some words that are unnecessary in the following steps. For example, we removed the word "the", which is a determiner for nouns. The output of this algorithm is the phrase tagged with its parts of speech, e.g.

    Input:  The quick brown fox jumped over the lazy dog.
    Output: The/DT quick/JJ brown/JJ fox/NN jumped/VBD over/IN the/DT lazy/JJ dog/NN.

According to the algorithm, the tagger was trained by analysing a corpus and noting the frequencies of the different tags for a given word. More information, together with the algorithm that we used for this step, can be found at http://phpir.com/part-of-speech-tagging.

The next step takes the phrase tagged by Ian Barber's algorithm as input and outputs the three arrays described above. To parse the phrase we used a simple grammar and built the parse tree of the phrase. All our valid phrases must conform to the following basic grammar:

    Prop  = Beg S P C
    Beg   = What | What does | What do
    S     = noun | S P.atr
    C     = noun | adjective | adverb | C P.atr
    P.atr = that P C
    P     = verb

where the terminals are What, What does, What do, that, noun, adjective, adverb and verb, and everything else is a non-terminal. An example of a phrase that conforms to this grammar is the following:

    What animal that has the color that is gray eats leaves that belong to the species that is Eucalyptus?

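The frequency idea behind the tagging step can be sketched in a few lines of Python. This is only an illustration of "note the tag frequencies per word, then pick the most frequent one"; it is not Ian Barber's implementation, the toy corpus is made up for the example, and falling back to NN for unknown words is an assumption of the sketch.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus (hypothetical data, for illustration only).
CORPUS = [
    ("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumped", "VBD"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"), ("over", "RB"), ("over", "IN"),
]


def train(corpus):
    """Count how often each tag occurs for each word; keep the most frequent."""
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(phrase, lexicon, default="NN"):
    """Tag each word with its most frequent tag; unknown words get `default`."""
    words = phrase.rstrip(".?!").split()
    return " ".join(f"{w}/{lexicon.get(w.lower(), default)}" for w in words)


lexicon = train(CORPUS)
print(tag("The quick brown fox jumped over the lazy dog.", lexicon))
# prints: The/DT quick/JJ brown/JJ fox/NN jumped/VBD over/IN the/DT lazy/JJ dog/NN
```

In the toy corpus "over" is seen twice as IN and once as RB, so the sketch tags it IN; a real lexicon would be built from a much larger hand-tagged corpus.
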
The parse tree that we aim to generate for this example phrase is essentially its RDF graph, and is depicted in Figure 1.

    Fig. 1. RDF graph (parse tree) for the phrase: What animal that has the color that is gray eats leaves that belong to the species that is Eucalyptus?
    [Figure: root node animal; edge has to color, edge is from color to gray; edge eats from animal to leaves, edge belong from leaves to species, edge is from species to Eucalyptus.]

We take the tagged phrase and remove all line breaks from the tags. We then build an array of pairs of the form (word, tag). After that we check each tag: if it is a noun, adjective or adverb, the word is added to our first array, which contains only nouns, adjectives and adverbs. The array of verbs is obtained in the same way.

To build the parents array we go through the elements one by one and check whether they are root nodes. When we find the root, we search for the predicate and split the phrase into two sub-trees. According to our grammar, the predicate lies between the root and the other sub-tree. If the phrase does not have a subject, we put the symbol * in the first position of the array. If the phrase does not have an object, we put the symbol # in the last position. In each sub-tree we check step by step whether a noun is followed by the word "that" and a verb; if so, the child of this noun is the first noun after that verb. The parent of the root is 0. When we form the verbs array, we determine which verb lies between a child and its parent and put it into the array. In the first position we put 0, because that position corresponds to the root.

2.2 Accepted Types of Phrases

To test our project we used three types of phrases that can be translated into SPARQL queries:

1. “What [property] has [subject]?” translated into:

    SELECT ?property WHERE { :[subject] dbpedia2:[property] ?property }

   For example, the phrase “What abstract has Guitar?” generates the following parse arrays:

    nouns-array: abstract guitar
    parents:     0 abstract
    verbs:       0 has

   and is translated into the SPARQL query:

    SELECT ?abstract WHERE { :Guitar dbpedia2:abstract ?abstract }

2. “What has [property] [object]?” translated into:

    SELECT ?subject WHERE { ?subject dbpedia2:[property] "[object]"@en }

   For example, the phrase “What has name that is animal?” generates the following parse arrays:

    nouns-array: * name animal
    parents:     0 * name
    verbs:       0 has is

   and is translated into the SPARQL query:

    SELECT ?subject WHERE { ?subject dbpedia2:name "Animal"@en }

3. “What has [property]?” translated into:

    SELECT ?subject ?object WHERE { ?subject dbpedia2:[property] ?object }

   For example, the phrase “What has regnum?” generates the following parse arrays:

    nouns-array: * regnum
    parents:     0 *
    verbs:       0 has

   and is translated into the SPARQL query:

    SELECT ?subject ?object WHERE { ?subject dbpedia2:regnum ?object }

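The array construction of Section 2.1 and the three translations listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's PHP code: the phrase is assumed to be already tagged in the word/TAG format, the function names parse_arrays and to_sparql are hypothetical, and capitalizing the resource name and the literal (guitar to Guitar, animal to Animal) is inferred from the examples rather than documented.

```python
NODE_TAGS = ("NN", "JJ", "RB")  # nouns, adjectives, adverbs (tag prefixes)


def parse_arrays(tagged):
    """Build the nouns, parents and verbs arrays from a 'word/TAG' string."""
    # Assumes every token is well formed, i.e. contains a '/'.
    tokens = [tuple(t.rsplit("/", 1)) for t in tagged.split()]
    words = []  # only nodes, verbs and the connective 'that' are kept
    for pos, (word, tag) in enumerate(tokens):
        w = word.lower().strip("?.")
        if (pos == 0 and w == "what") or (pos == 1 and w in ("does", "do")):
            continue  # question opener: What | What does | What do
        if w == "that":
            words.append((w, "THAT"))
        elif tag.startswith("VB"):
            words.append((w, "VERB"))
        elif tag.startswith(NODE_TAGS):
            words.append((w, "NODE"))

    nouns, parents, verbs = [], [], []
    i = 0

    def chain(parent, verb):
        """Consume 'node (that verb node)*'; each clause hangs off the node before it."""
        nonlocal i
        while i < len(words) and words[i][1] == "NODE":
            node = words[i][0]
            nouns.append(node)
            parents.append(parent)
            verbs.append(verb)
            i += 1
            if i + 1 < len(words) and words[i][1] == "THAT" and words[i + 1][1] == "VERB":
                parent, verb = node, words[i + 1][0]
                i += 2
            else:
                break

    if words and words[0][1] == "NODE":
        chain(0, 0)  # subject sub-tree; the root has parent 0 and verb 0
    else:
        nouns.append("*")  # missing subject
        parents.append(0)
        verbs.append(0)
    root = nouns[0]
    if i < len(words) and words[i][1] == "VERB":
        verb = words[i][0]
        i += 1
        if i < len(words) and words[i][1] == "NODE":
            chain(root, verb)  # object sub-tree, attached to the root
        else:
            nouns.append("#")  # missing object
            parents.append(root)
            verbs.append(verb)
    return nouns, parents, verbs


def to_sparql(nouns):
    """Translate the nouns array of one of the three accepted phrase types."""
    if nouns[0] != "*" and len(nouns) == 2:  # type 1: What [property] has [subject]?
        prop, subject = nouns
        return "SELECT ?%s WHERE { :%s dbpedia2:%s ?%s }" % (
            prop, subject.capitalize(), prop, prop)
    if nouns[0] == "*" and len(nouns) == 3:  # type 2: property and object given
        return 'SELECT ?subject WHERE { ?subject dbpedia2:%s "%s"@en }' % (
            nouns[1], nouns[2].capitalize())
    if nouns[0] == "*" and len(nouns) == 2:  # type 3: only the property given
        return "SELECT ?subject ?object WHERE { ?subject dbpedia2:%s ?object }" % nouns[1]
    raise ValueError("phrase is not one of the three accepted types")


nouns, parents, verbs = parse_arrays("What/WP has/VBZ name/NN that/WDT is/VBZ animal/NN")
print(nouns, parents, verbs)
print(to_sparql(nouns))
```

The sketch attaches each "that" clause to the nearest preceding noun, which reproduces the parse arrays given for the three example phrases; longer phrases that fall outside the three types are parsed but rejected by to_sparql.
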
When both the subject and the object are missing, as in the third type of phrase, it is advisable to limit the number of results returned by DBPedia using the LIMIT keyword, as in:

    SELECT ?subject ?object WHERE { ?subject dbpedia2:regnum ?object } LIMIT 20

2.3 Statistics

To obtain statistics, we go through the list of all nouns in the given phrase and, for each noun X, we query the number of languages in which its abstract is available, using:

    SELECT COUNT DISTINCT ?abstract WHERE { :X dbpedia2:abstract ?abstract }

In standard SPARQL 1.1 syntax the same count is written as SELECT (COUNT(DISTINCT ?abstract) AS ?count) WHERE { :X dbpedia2:abstract ?abstract }.

2.4 ARC Queries

The following example shows how SPARQL queries can be made from within PHP using the ARC library, which we have also used in our application.

    include_once('./arc/ARC2.php');
    $ssp = ARC2::getSPARQLScriptProcessor();

    // define the script
    $scr = '
      ENDPOINT <http://dbpedia.org/sparql>
      PREFIX dbpedia2: <http://dbpedia.org/property/>
      PREFIX dbpedia: <http://dbpedia.org/>
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
      $results = SELECT * WHERE {
        ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12>.
        ?episode dbpedia2:blackboard ?chalkboard_gag.
      }
    ';

    // run the script
    $ssp->processScript($scr);

    // display the results
    echo "\n\nQuery results:\n\n";
    print_r($ssp->env['vars']['results']['value']);

3 Conclusions and Future Developments

We presented a preliminary version of an application that allows users to query DBPedia using basic natural language phrases. Several features can be improved and new ones can be added. For instance, the basic grammar that we use to create the parse tree can be made more complex. The three types of phrases that we allow as natural language queries can also be made more complex and closer to everyday speech; they sound rather artificial at the moment.

Another feature that can be added is to allow users to query several end-points, not just DBPedia. The main problem is that each end-point may come with its own set of vocabularies, apart from the well-known skos, foaf, rdfs, etc. Thus, further knowledge of each end-point is necessary before implementing natural language queries that can be run on it.

As final remarks, a larger lexicon can be used to improve the lexical analysis step. The graphical interface can also be made more user friendly as the previously mentioned features are implemented.

References

1. The ARC open-source RDF system, at http://arc.semsol.org.
2. Ian Barber’s part of speech lexical analyzer, freely available at http://phpir.com/part-of-speech-tagging.
3. The DBPedia Wiki, at http://dbpedia.org/About.
4. The SPARQL online query interface on DBPedia, at http://dbpedia.org/snorql.