Transformation between Query Languages
Miguel Cristian Greciano Raiskila
Tokyo National Institute of Informatics
Technische Universität Darmstadt
Universidad Politécnica de Madrid
mc.greciano@gmail.com
Yusuke Miyao
Tokyo National Institute of Informatics
Associate Professor
Master Thesis Tutor
yusuke@nii.ac.jp
Abstract
The Semantic Web, an extension of the
Web that provides easier ways to retrieve
data, has seen major growth in recent
years. Not only do users of the Internet
desire information, they also desire better,
quicker, easier and more efficient ways to
access that information and find answers
to their questions.
A clear example of the Semantic Web phe-
nomenon is the popularity of the Freebase
data set. It is a big collection of struc-
tured data harvested from many sources.
Freebase runs on a database infrastructure
created in-house by Metaweb that uses a
graph model. Because its data structure
is non-hierarchical, Freebase is open for
users to enter new objects and relation-
ships into the underlying graph, a great
advantage. Since 2008, Freebase has supported
RDF (Resource Description Framework),
allowing Freebase to be used as
linked data and be queried by languages
such as SPARQL, which is also quite pop-
ular. SPARQL users normally create their
own SPARQL queries manually, or at least
semi-manually. There are even hundreds
of sample SPARQL queries on the web
with their associated natural language utterances
to show users how to create or adapt
their own queries. Ideally, however, this
whole step from natural language
to executing a query on the database
would be automated.
In this paper we wish to address a key
feature in the desired automation between
natural language and query execution:
transformation between query languages.
Indeed in the example of Freebase and
SPARQL, natural language utterances are
far in nature from RDF graphs, but closer
in nature to semantic trees. If we could
close the gap between trees and graphs by
transforming between different query lan-
guages, we would draw near to our final
goal of automation. In this paper we out-
line two algorithms that transform queries
between different languages, the SPARQL
→ λ-DCS one being the more relevant.
We also provide the necessary background
as to the reasons and utility for said algo-
rithms.
1 Introduction
This paper outlines the algorithms developed to
transform queries from one query language into
their equivalents in a different query language.
It also provides a brief but comprehensive back-
ground on the query languages and tools chosen
for this study, as well as the reasons why these
were chosen.
An obvious initial question is what benefits
transformation between query languages would
provide. It is true that equivalent queries
produce equivalent results ("Where was Barack
Obama born?" should return "Honolulu" no matter
which query language one is using). However,
different languages express equivalent queries
through different concepts, and they also differ in
efficiency and answer-retrieval speed. By analogy,
one can implement a database containing the books
in a library and their associated data in either
Java or C, but the two implementations will differ
because of the different nature of the two programming
languages: Java will use objects to represent the data,
whereas C will use other data structures and handle
storage and memory differently.
Furthermore, transformation between query
languages can be very useful when a specific query
language is more suitable than another for cer-
tain natural language processing tasks. Some lan-
guages have a graph structure and others have
a tree structure. Associating the syntactic trees that
arise from parsing a natural language sentence
with a query is intuitively easier if
the query itself has a tree structure. For example,
in "Semantic Parsing on Freebase
from Question-Answer Pairs" (Berant et al., 2013),
Jonathan Berant, Andrew Chou, Roy Frostig and
Percy Liang suggested a way to map natural
language utterances to queries on the Freebase
data set: through a tree-structured query language
called Lambda Dependency-Based Compositional
Semantics (from now on λ-DCS).
An ambitious project that our team is currently
working on, the Question-Answering Project
(from now on, the QA project), intends to automatically
create these Freebase queries from natural
language input. As mentioned in the abstract,
we already possess a good number of utterance-query
pairs. However, in order to train the interpreter
we must be able to traverse the chain
utterance, semantic tree, graph, query in both
directions.
The chosen query languages for the transforma-
tions are Prolog, SPARQL and λ-DCS. Section 3
contains a brief exposition of the syntax and pe-
culiarities of these languages. Section 4 outlines
the transformation algorithms between these lan-
guages. In section 5 we analyze the performance
of said algorithms. Section 6 explains the encoun-
tered problems and limitations to the algorithms
and the transformations, as well as some possible
solutions. Section 7 suggests future tasks to be
carried out from this project, and section 8 pro-
vides the conclusion to our work. Section 9 is an
Appendix containing some of the results of our
work: queries equivalently expressed in different
query languages.
As we will see, the λ-DCS → SPARQL transformation
is already implicit in the toolkit that
we use. The major contribution of this
paper is the reverse SPARQL → λ-DCS transformation.
It is not a trivial task: λ-DCS
→ SPARQL transforms trees into graphs,
whereas SPARQL → λ-DCS does the opposite, transforming
graphs into trees. Trees are by definition
a particular case of graphs and thus less expressive,
so transforming from graphs
to trees poses a significant challenge. The
SPARQL → λ-DCS transformation has not been proposed before
and is important for the QA project, which needs
to traverse from semantic trees to graphs (λ-DCS
→ SPARQL) and also from graphs to semantic
trees (SPARQL → λ-DCS) in order to carry out
training.
The chosen experimental database is GeoQuery
1. It is a small database containing geographical
information about all of the states in the US (cities,
rivers, roads, highest and lowest points...). An ex-
tended geographical database with more data is
included in Freebase, however we chose to work
with GeoQuery because it is simple, intuitive,
small, comes in various formats and is free and
open source (despite Freebase being also free and
open source, there are of course data sets which
are not, so this advantage should not be consid-
ered a given). It is also relatively well-known.
One of the formats GeoQuery comes in is the Prolog
format, and this is the main reason why we
chose Prolog as one of the query languages
in this research. GeoQuery comes with
more than 880 functional Prolog queries as samples,
all of them associated with natural language
questions such as "What is the longest river in Colorado?".
The samples are first-order-logic queries,
i.e. they deal with quantities and sets, not propositions.
Thus, we have good source material
for our objective of transforming functional
queries in one language into their equivalents in another
language. In this case, the algorithms are designed
to transform first from Prolog to SPARQL,
and then from SPARQL to λ-DCS. We want
the algorithms to function properly on the GeoQuery
database, with the hope that they will
generalize well when transforming queries on larger
datasets like Freebase.
2 Related Work
Automated transformation between query languages
is not a very common practice. Most of the
time, queries are written manually for the purpose
of retrieving desired information. Consequently,
related work is scarce. One can still
find similar attempts, though: the book "Reasoning
Web" contains a subsection addressing transformation
between SPARQL and GReQL (Aßmann
et al., 2010).
As we will see in section 3.3, the transformation
λ-DCS → SPARQL is implicit in the SEMPRE
toolkit; however, the reverse transformation
SPARQL → λ-DCS is not. One of the algorithms
proposed in this paper attempts to perform
this transformation. The other algorithm
proposed is for the Prolog → SPARQL transformation,
which, as far as the authors are aware,
nobody else has attempted to develop. Thus, even
though transformation between query languages
is not a new concept, it is still rare, and this paper
pioneers the transformations Prolog → SPARQL
and SPARQL → λ-DCS.
1 http://www.cs.utexas.edu/users/ml/nldata/geoquery.html
3 Query Languages Overview
Here we do not intend to explain all the intrica-
cies of the used languages. However we wish to
provide a simple background for each of them,
along with some basic definitions, so that the
reader can fully comprehend the algorithms de-
veloped in this work and the subsequent results of
the research. We also provide references to more
complete expositions and/or tutorials of these languages,
should the reader wish to deepen their
understanding of them.
3.1 Prolog
Prolog is a general purpose logic programming
language, with roots in first-order logic. Prolog
is declarative: the program logic is expressed in
terms of relations or relationships, the query of
which initiates a computation. The relationships
are thus arbitrary, i.e., the author decides how to
define said relationships, and there is no set of uni-
versal relationships. These relationships connect
Prolog variables with each other, as well as with
constants. An easy tool to interpret and execute
Prolog queries is SWI-Prolog. 2
2 http://www.swi-prolog.org/
Here is an overview of the GeoQuery database
in Prolog format. The database entries have the
following pattern:
# state(name, abbreviation, capital, population, area, state number, city1, city2, city3, city4)
# city(state, state abbreviation, name, population)
# river(name, length, [states through which it flows])
# border(state, state abbreviation, [states that border it])
# highlow(state, state abbreviation, highest point, highest elevation, lowest point, lowest elevation)
# mountain(state, state abbreviation, name, height)
# road(number, [states it passes through])
# lake(name, area, [states it is in])
and here we provide some instances of said pat-
terns as an example:
# state('arkansas', 'ar', 'little rock', 2286.0e+3, 53.2e+3, 25, 'little rock', 'fort smith', 'north little rock', 'pine bluff').
# state('california', 'ca', 'sacramento', 23.67e+6, 158.0e+3, 31, 'los angeles', 'san diego', 'san francisco', 'san jose').
# state('colorado', 'co', 'denver', 2889.0e+3, 104.0e+3, 38, 'denver', 'colorado springs', 'aurora', 'lakewood').
...
# river('mississippi', 3778, ['minnesota', 'wisconsin', 'iowa', 'illinois', 'missouri', 'kentucky', 'tennessee', 'arkansas', 'mississippi', 'louisiana', 'louisiana']).
# river('missouri', 3968, ['montana', 'north dakota', 'south dakota', 'iowa', 'nebraska', 'missouri', 'missouri']).
# river('colorado', 2333, ['colorado', 'utah', 'arizona', 'nevada', 'california']).
Apart from the entries in the database following
a known pattern, relationships have to be defined
in order to be understood by the Prolog interpreter.
Here are two examples of such definitions in Pro-
log:
# loc(cityid(City,St), stateid(State)) :- city(State,St,City,_).
# const(V,V).
The first example, the relation "loc", indicates
that a city identified by cityid(City,St) is located in the state
stateid(State) whenever the database contains a city entry whose
first, second and third properties are State, St and City;
the fourth property (the population) is ignored. The second
example, the relation "const", unifies its two arguments.
It is useful for binding Prolog variables to constants.
And finally we present one of the Prolog query
samples contained in the GeoQuery database:
answer(A,(city(A),loc(A,B),
const(B,stateid(virginia)))).
This query will retrieve all cities in the state of
Virginia. As we can see, the query asks for A to
be returned, where A is a city and A is located
in B, which is bound to the state whose constant
ID is virginia. We will use this query as
the example input for the algorithms described in
section 4.
3.2 SPARQL
SPARQL ("SPARQL Protocol and RDF Query
Language") is an RDF query language. In
other words, it is a semantic query language for
databases, able to retrieve and manipulate data
stored in Resource Description Framework (RDF)
format. It is recognized as one of the key technologies
of the semantic web, and it has become
an official W3C Recommendation. A very instructive
SPARQL tutorial can be found on the Apache
Jena homepage 3. We suggest executing
SPARQL queries either with the Apache Jena framework
or with a Virtuoso server. 4 In this subsection
we explain only the very basics of SPARQL;
please refer to the aforementioned tutorial to actually
learn SPARQL.
3 http://jena.apache.org/tutorials/sparql.html
4 http://kidehen.typepad.com/kingsley_idehens_typepad/
SPARQL is a very popular language to query
RDF graphs. Important datasets like Freebase are
stored in RDF format. RDF graphs basically consist
of a set of triples or statements: patterns
like Subject <Verb> Object or Entity1
<Relationship> Entity2. Here is an example
of such an RDF graph:
<state25> <type> 'state' .
<state25> <name> 'mississippi' .
...
<river1> <type> 'river' .
<river1> <name> 'mississippi' .
SPARQL matches triple templates against the RDF
graph and returns the bindings that fit the template.
The templates are expressed with SPARQL variables,
which can be recognized because they start
with a "?" symbol. Here are two examples of
SPARQL queries:
SELECT ?x WHERE {
  ?x <name> 'mississippi' .
}
SELECT ?x WHERE {
  ?x <type> 'state' .
  ?x <name> 'mississippi' .
}
As we can see, ?x is a SPARQL variable, and
in both cases it is matched against the elements that
appear on the left of RDF triples. It is also the variable
selected with the SELECT operator; it
is thus the variable being queried, and its values are
returned as the answer to the query. The WHERE
block indicates the patterns that the graph must
match. All patterns in the WHERE block must be
matched for the entity on the left to be bound
to ?x. Thus, if we execute both queries
on the RDF example above, the first query will
return <state25> and <river1> as answers,
but the second query will only return <state25>,
because only <state25> matches
both ?x <type> 'state' and ?x <name>
'mississippi', whereas <river1> only
matches the latter.
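The matching behaviour described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the graph is a list of (subject, predicate, object) tuples, and the hypothetical helper select_subjects handles only patterns of the form ?x <p> 'o' with a single variable in subject position. It is not a SPARQL engine.

```python
# Toy RDF graph from the example above, as (subject, predicate, object) tuples.
TRIPLES = [
    ("state25", "type", "state"),
    ("state25", "name", "mississippi"),
    ("river1", "type", "river"),
    ("river1", "name", "mississippi"),
]


def select_subjects(patterns, triples):
    """Return the subjects that satisfy every (predicate, object) pattern.

    Hypothetical helper: it only covers WHERE blocks whose lines all look
    like `?x <predicate> 'object' .` for one and the same variable ?x.
    """
    subjects = {s for s, _, _ in triples}
    for predicate, obj in patterns:
        matching = {s for s, p, o in triples if p == predicate and o == obj}
        subjects &= matching  # every pattern in the WHERE block must hold
    return subjects


first = select_subjects([("name", "mississippi")], TRIPLES)
second = select_subjects([("type", "state"), ("name", "mississippi")], TRIPLES)
print(sorted(first))
print(sorted(second))
```

Intersecting the per-pattern matches mirrors the requirement that all patterns in the WHERE block hold for the same binding of ?x.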
3.3 λ-DCS
Lambda dependency-based compositional semantics
(λ-DCS) is a new formal language for representing
logical forms in semantic parsing. It was
developed by Percy Liang (Liang, 2013). It attempts
to express logical forms in a simpler way
than lambda calculus. By eliminating variables
and making existential quantification implicit, λ-DCS
logical forms are generally more compact
than those in lambda calculus. Compared to the
graph structure of SPARQL, the tree structure of
λ-DCS, as well as the absence of variables, should
be very helpful when attempting to associate the
trees produced by parsing natural
language utterances with the database queries
that retrieve the answer.
To provide an insight on how λ-DCS is nota-
tionally simpler than lambda calculus, compare
the following expressions:
– Natural language utterance: "people who have lived in Seattle"
– Logical form (lambda calculus): λx.∃e.PlacesLived(x,e) ∧ Location(e,Seattle)
– Logical form (λ-DCS): PlacesLived.Location.Seattle
– SEMPRE notation:
(!<name> (and (<type> 'people') (<livedin> (and (<type> 'state') (<name> 'seattle')))))
All express the same concept; however, λ-DCS
lacks variables and thus has a much simpler
expression than lambda calculus.
If the reader is interested in a deeper understanding
of lambda calculus, we recommend
Barendregt's "Introduction to lambda calculus"
(Barendregt and Barendsen, 1984). SEMPRE is a
toolkit for training semantic parsers, which map
natural language utterances to denotations (answers)
via intermediate logical forms. It is the
toolkit Percy Liang developed in order to execute
λ-DCS queries. 5 We also used the SEMPRE
toolkit in this work, and when in this report we
refer to λ-DCS queries, we are actually referring
to their SEMPRE notation, not their logical
form. The SEMPRE query above can be read
as: "return the name of all entities of type 'people'
that have lived in the entities of type 'state'
and name 'seattle'". The logical group nature of
this language is thus clearly manifest, with operations
such as intersection (and) and union (or) being
used.
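To make this reading concrete, the following Python sketch evaluates a nested expression of this shape over a small triple list. Everything in it is an assumption made for illustration: the data are invented, formulas are written as nested Python tuples rather than SEMPRE strings, and the function evaluate is hypothetical and unrelated to the SEMPRE implementation.

```python
# Hypothetical toy data; entity names and facts are invented for illustration.
TRIPLES = [
    ("person1", "<type>", "people"),
    ("person1", "<name>", "alice"),
    ("person1", "<livedin>", "place1"),
    ("person2", "<type>", "people"),
    ("person2", "<name>", "bob"),
    ("person2", "<livedin>", "place2"),
    ("place1", "<type>", "state"),
    ("place1", "<name>", "seattle"),
    ("place2", "<type>", "state"),
    ("place2", "<name>", "boston"),
]


def evaluate(formula, triples):
    """Return the set denoted by a nested-tuple formula."""
    if isinstance(formula, str):          # a constant denotes a singleton set
        return {formula}
    if formula[0] == "and":               # intersection of all operands
        result = evaluate(formula[1], triples)
        for operand in formula[2:]:
            result = result & evaluate(operand, triples)
        return result
    relation, argument = formula
    values = evaluate(argument, triples)
    if relation.startswith("!"):          # reverse: objects reached from the set
        return {o for s, p, o in triples if p == relation[1:] and s in values}
    # forward: subjects whose object lies in the argument set
    return {s for s, p, o in triples if p == relation and o in values}


# Tuple form of the SEMPRE expression shown above.
QUERY = ("!<name>", ("and", ("<type>", "people"),
                     ("<livedin>", ("and", ("<type>", "state"),
                                    ("<name>", "seattle")))))
names = evaluate(QUERY, TRIPLES)
print(names)
```

No variable appears anywhere in the formula: each sub-expression denotes a set, and the operators combine sets.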
When executing λ-DCS queries, the SEMPRE
toolkit automatically transforms them into an
equivalent SPARQL query, which is then executed on
a Virtuoso server. Thus the λ-DCS → SPARQL
transformation is implicit within the toolkit. In
this paper we attempt to perform the opposite
transformation, SPARQL → λ-DCS, which can
be extremely useful for the aforementioned QA
project.
5 http://nlp.stanford.edu/software/sempre/
4 Transformation Algorithms
In this section we describe the transformation
algorithms step by step. Because the algorithms
are much easier to understand with a specific example,
we use a simple sample query associated
with the question "What are the cities in Virginia?"
4.1 Prolog → SPARQL Algorithm
The first thing to note is that the GeoQuery
database is not provided in RDF format. Thus, we
first need to transform the database to RDF format
so that SPARQL and λ-DCS queries can be exe-
cuted on the GeoQuery database. There are many
different trivial ways to do this transformation. In
our case we chose to transform an entry into a ge-
ographical entity with associated properties, all of
the entries constituting different and independent
graphs (no linking between graphs). For example,
this entry:
# state('arizona','az','phoenix',2718.0e+3,114.0e+3,48,'phoenix','tucson','mesa','tempe').
transforms to:
<state3> <type> 'state' .
<state3> <name> 'arizona' .
<state3> <abbreviation> 'az' .
<state3> <capital> 'phoenix' .
<state3> <population> 2718.0e+3 .
<state3> <area> 114.0e+3 .
<state3> <state number> 48 .
<state3> <city1> 'phoenix' .
<state3> <city2> 'tucson' .
<state3> <city3> 'mesa' .
<state3> <city4> 'tempe' .
Note that an extra property, <type>, is added
to state explicitly that the geographical entity <state3>
is a state. In Prolog the predicate name already identifies
the entry as a state, whereas the RDF format carries no such
information unless it is added.
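As a rough sketch of this conversion, the Python below turns the argument list of one database entry into triples. The function name entry_to_triples, the field list and the way the entity index is passed in are assumptions made for this illustration; the paper does not specify how its conversion was implemented.

```python
# Field names for a `state` entry, following the pattern given in section 3.1.
STATE_FIELDS = ["name", "abbreviation", "capital", "population", "area",
                "state number", "city1", "city2", "city3", "city4"]


def entry_to_triples(kind, index, values, fields):
    """Turn one database entry into (subject, predicate, object) triples.

    `index` is assumed to be supplied by the caller (for example the
    position of the entry in the file); this is an assumption of the sketch.
    """
    entity = f"<{kind}{index}>"
    # The extra <type> property records what Prolog expresses through
    # the predicate name.
    triples = [(entity, "<type>", kind)]
    for field, value in zip(fields, values):
        triples.append((entity, f"<{field}>", value))
    return triples


arizona = ["arizona", "az", "phoenix", 2718.0e+3, 114.0e+3, 48,
           "phoenix", "tucson", "mesa", "tempe"]
triples = entry_to_triples("state", 3, arizona, STATE_FIELDS)
for subject, predicate, obj in triples:
    print(subject, predicate, repr(obj), ".")
```

Each entry yields its own small graph with no links to other entities, which matches the design choice described above.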
Now that the GeoQuery database is also in RDF
format, we can attempt to transform the sample
Prolog queries into SPARQL queries. As stated
before, the Prolog relationships are arbitrarily se-
lected by the GeoQuery database, and thus this
Prolog → SPARQL transformation will not be
universal, but specific to this particular case of
the GeoQuery database. Other Prolog relation-
ships would require other transformations. The
SPARQL → λ-DCS transformation that will be
proposed afterwards, however, is indeed intended
to be universal. We will first enumerate the steps
in abstract, and then explain how the algorithm ex-
ecutes on an example Prolog query.
1. Use the NLTK toolkit to create a tree from
the Prolog query
2. Identify the Prolog variables (single capital
letters in the leaves of the tree) and store them
in an empty variables dictionary
3. Identify the type of the Prolog variables and
store the type in the variables dictionary
3.1. Look for single-leaf nodes
3.2. Look for two-leaf ”const” labeled nodes
3.3. Leave the non-type-informing nodes for
the next step
4. With the type of the variables, indicate the
relationships between the Prolog variables in
SPARQL format. This step must interpret the
nodes of the tree that were not interpreted in
Step 3.
5. Organize all collected information in a cor-
rect RDF graph form. In this work we opted
to have a string that concatenated the infor-
mation progressively as the algorithm was
executed.
Now we will see the algorithm executed on an
example. Given the sample query in Prolog that
retrieves all the cities in the state of Virginia:
answer(A,(city(A),loc(A,B),
const(B,stateid(virginia)))).
the first step is to parse the query and create an
NLTK tree from it (Step 1).
[Figure: NLTK tree of the sample query]
Note that the label "goals" has been added to an
unnamed node of the original Prolog query. The
name is arbitrary and is simply there so that all
nodes in the NLTK tree are labeled. The next step
is to identify the Prolog variables (Step 2) and their
types (Step 3), information required to interpret the
other Prolog relationships. Variables are single
capital letters located in the leaves of the tree. We
can obtain the type of a variable either from the
label of single-leaf subtrees, e.g. "city(A)" tells
us A is a city, or from two-leaf subtrees with the
label "const", e.g. const(B,stateid(virginia)) tells
us B is a state and its name is Virginia. We can
thus now create a variable dictionary containing
all the variables and their corresponding types:
varDict={'A':'city','B':'state'}
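Steps 1 to 3 can be sketched without NLTK using a small recursive-descent parser. The sketch below is an assumption-laden illustration, not the authors' code: trees are plain (label, children) tuples instead of NLTK trees, the helper names are invented, and the type of a "const" variable is derived by stripping an "id" suffix from labels such as stateid.

```python
import re


def tokenize(query):
    # Names, parentheses and commas; the trailing '.' is dropped.
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|[(),]", query)


def parse(tokens, pos=0):
    """Parse one term starting at tokens[pos]; return (tree, next_pos)."""
    token = tokens[pos]
    if token == "(":
        label = "goals"                    # unnamed conjunction of goals
    elif pos + 1 < len(tokens) and tokens[pos + 1] == "(":
        label, pos = token, pos + 1        # functor followed by '('
    else:
        return token, pos + 1              # leaf: variable or constant
    children = []
    pos += 1                               # skip '('
    while tokens[pos] != ")":
        if tokens[pos] == ",":
            pos += 1
            continue
        child, pos = parse(tokens, pos)
        children.append(child)
    return (label, children), pos + 1      # skip ')'


def is_variable(node):
    return isinstance(node, str) and len(node) == 1 and node.isupper()


def collect_types(tree, var_types):
    """Fill var_types from single-leaf nodes and two-leaf 'const' nodes."""
    if isinstance(tree, str):
        return
    label, children = tree
    if len(children) == 1 and is_variable(children[0]):
        var_types[children[0]] = label                       # e.g. city(A)
    elif (label == "const" and len(children) == 2
          and is_variable(children[0]) and isinstance(children[1], tuple)):
        id_label = children[1][0]                            # e.g. 'stateid'
        # Assumption of this sketch: the type is the label minus 'id'.
        var_types[children[0]] = (id_label[:-2] if id_label.endswith("id")
                                  else id_label)
    for child in children:
        collect_types(child, var_types)


QUERY = "answer(A,(city(A),loc(A,B),const(B,stateid(virginia))))."
tree, _ = parse(tokenize(QUERY))
var_dict = {}
collect_types(tree, var_dict)
print(tree)
print(var_dict)
```

Nodes such as loc(A,B) carry no type information and are left untouched here; they correspond to the nodes that Step 4 interprets once the types are known.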
Finally, we are now able to interpret the other
subtrees or Prolog relationships, like loc(A,B),
which states that city A is located in state B
(Step 4). We would be unable to infer this without
the types of A and B ("loc" could also refer to a river
A located in state B, which maps to a relationship with a different
name), so these subtrees can only be interpreted at
this stage. Now that all information about the Prolog
relationships has been retrieved, it can be
expressed as a SPARQL query (Step 5):
SELECT ?city WHERE {
?xA <name> ?city .
?xA <type> "city" .
?xA <name> ?A .
?xB <type> "state" .
?xB <name> ?B .
?xB <name> "virginia" .
?xA <state> ?B .
}
This is the equivalent query in SPARQL that retrieves
all the cities in the state of Virginia. Prolog
variables are written with an extra "x" in front
of them when they appear on the left in SPARQL,
because there they reference geographical entities. On
the right they reference string names, and thus
the two need to be differentiated. It also helps
clarity. Further post-processing of this
SPARQL query is possible, for example condensing
these four statements
?xB <type> "state" .
?xB <name> ?B .
?xB <name> "virginia" .
?xA <state> ?B .
into just this one:
?xA <state> "virginia" .
However the SPARQL query that the algorithm
provides expresses all the information given in the
original Prolog query, and thus we chose to leave
it as it is.
4.2 SPARQL → λ-DCS Algorithm
From the SPARQL query obtained with the
previous algorithm we will now create
an equivalent λ-DCS query. We first present
the steps of the algorithm in the abstract, and then
apply the algorithm to the
sample query to see the result we eventually arrive
at.
1. Parse and interpret every line in the SPARQL
query, and create a variable dictionary
1.1. Identify the variables (they start with "?")
1.2. Assign relationships between variables and/or constants
1.3. Add the reverse relationships (starting with "!") to the target variables
2. Traverse the variable dictionary to eliminate
the SPARQL variables
2.1. Start with the selected variable
2.2. Transcribe its relationships
2.3. No relationships → [], more than one relationship → "and" operator
2.4. Select the next SPARQL variable to traverse, and repeat this step until all variables have been traversed
3. Add special options ("Count", "Limit", "Or"...) where appropriate
Below, the reader can find this
algorithm applied iteration by iteration to the
sample query from the previous section.
EXECUTION OF THE ALGORITHM IN A
SAMPLE QUERY, STEP BY STEP
SELECT ?city WHERE {
?xA <name> ?city .
?xA <type> "city" .
?xA <name> ?A .
?xB <type> "state" .
?xB <name> ?B .
?xB <name> "virginia" .
?xA <state> ?B .
}
Step 0: original SPARQL query
Step 1: Create a variable dictionary with relation-
ships
Step 2: Start from the selected variable (green)
Step 3: Continue with next variable (?xA). Sev-
eral relationships translate as an ”and” operator in
SEMPRE. The reverse relationship with the previ-
ous variable is always ignored.
Step 4: Next variables: ?B and ?A. ?A has no
relationships left after ignoring its reverse rela-
tionship. Thus, it translates as a general variable,
[] in SEMPRE.
Step 5: Last variable: ?xB. No variables left -
SEMPRE query finished.
We now add some clarifying comments on
the steps above. Once again we wish to identify all
the variables in the query, but this time the variable
dictionary will also contain all the relationships
of said variables with constants and other
variables. Note that, unlike in Prolog, ?xA and
?A are different variables in SPARQL. In our variable
dictionary we also include the reverse
relationships between variables, expressed in λ-DCS
with an exclamation mark "!" in front of
the relationship. For example, when we read the
line ?xA <state> ?B . we include both
?xA - <state> - ?B and ?B - !<state> - ?xA
in the variable dictionary. This will be
essential when traversing variables. The variable
dictionary created from our sample query is shown in
Step 1.
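A variable dictionary of this kind could be built as follows. This is a minimal sketch under stated assumptions: the function name build_variable_dictionary and the regular expression are invented for the illustration, and only single-line triple patterns of the form shown in the sample query are handled.

```python
import re

WHERE_LINES = [
    '?xA <name> ?city .',
    '?xA <type> "city" .',
    '?xA <name> ?A .',
    '?xB <type> "state" .',
    '?xB <name> ?B .',
    '?xB <name> "virginia" .',
    '?xA <state> ?B .',
]


def build_variable_dictionary(where_lines):
    """Map each variable to its list of (relationship, target) pairs."""
    var_dict = {}
    for line in where_lines:
        match = re.match(r'\s*(\S+)\s+(<[^>]+>)\s+(.+?)\s*\.\s*$', line)
        if not match:
            continue                      # not a simple triple pattern
        subject, relation, obj = match.groups()
        var_dict.setdefault(subject, []).append((relation, obj))
        if obj.startswith("?"):
            # Reverse relationship, stored under the target variable.
            var_dict.setdefault(obj, []).append(("!" + relation, subject))
    return var_dict


var_dict = build_variable_dictionary(WHERE_LINES)
for variable, relationships in var_dict.items():
    print(variable, relationships)
```

Constants such as "virginia" never become keys of the dictionary; only variables receive entries, including the reverse ones marked with "!".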
In order to eliminate the SPARQL variables we
will iterate the transcription of relationships as we
traverse the variable dictionary. We start at the se-
lected variable, retrieved from the query’s first line
SELECT ?city WHERE { and transcribe its
only property into λ-DCS format. (Step 2)
We now proceed to eliminate ?xA. We ignore
the reverse relationship that joins ?xA with ?city
(as this relationship has just been transcribed)
and focus on all the other relationships - ”type”,
”name” and ”state”. (Step 3) Because we have
more than one relationship to transcribe from ?xA,
we use the λ-DCS ”and” operator to express the
intersection of the groups expressed by these rela-
tionships.
We repeat this step until we have traversed all
variables in the dictionary. In the particular case
of variable ?A, (Step 4) its only relationship is
ignored because it was already expressed in the
previous iteration. This leaves no relationships to
transcribe and thus ?A is changed to [], the λ-DCS
operator that expresses an undefined variable. Af-
ter all the iterations we arrive at the final λ-DCS
expression: (Step 5)
This is the equivalent query in λ-DCS that re-
trieves all the cities in the state of Virginia. As can
be noted, Prolog and SPARQL variables have dis-
appeared and only a mathematical group expres-
sion remains.
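The traversal described in this section can be sketched end to end in Python. The sketch rests on several assumptions that are not taken from the paper: the function name sparql_to_lambda_dcs is invented, the query is assumed to be tree-shaped (a cyclic pattern would make this recursion loop), only one selected variable and plain triple patterns are supported, and the string it produces is the output of this sketch rather than a reproduction of the authors' result.

```python
import re

QUERY = """SELECT ?city WHERE {
?xA <name> ?city .
?xA <type> "city" .
?xA <name> ?A .
?xB <type> "state" .
?xB <name> ?B .
?xB <name> "virginia" .
?xA <state> ?B .
}"""


def flip(relation):
    """<r> becomes !<r> and vice versa."""
    return relation[1:] if relation.startswith("!") else "!" + relation


def sparql_to_lambda_dcs(query):
    selected = re.search(r"SELECT\s+(\?\w+)", query).group(1)
    pattern = r'(\?\w+)\s+(<[^>]+>)\s+(\?\w+|"[^"]*")\s*\.'
    relations = {}
    for subject, relation, obj in re.findall(pattern, query):
        relations.setdefault(subject, []).append((relation, obj))
        if obj.startswith("?"):
            relations.setdefault(obj, []).append(("!" + relation, subject))

    def transcribe(node, incoming=None):
        if not node.startswith("?"):
            return node                       # constants are copied as-is
        edges = list(relations.get(node, []))
        if incoming in edges:
            edges.remove(incoming)            # ignore the edge we arrived by
        parts = [f"({rel} {transcribe(target, (flip(rel), node))})"
                 for rel, target in edges]
        if not parts:
            return "[]"                       # no relationships left
        if len(parts) == 1:
            return parts[0]
        return "(and " + " ".join(parts) + ")"

    return transcribe(selected)


result = sparql_to_lambda_dcs(QUERY)
print(result)
```

Passing the flipped relationship down the recursion is what implements the rule that the reverse relationship with the previous variable is ignored.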
As a last word, recall that this SPARQL → λ-DCS
algorithm is the more relevant of the two, as it is intended
to be universal. SPARQL grammar and relationships
are not arbitrary like Prolog relationships,
so one would expect this algorithm to perform
well regardless of the database and SPARQL
queries provided as input.
5 Results
Apart from the sample query that corresponds to
”What are the cities in Virginia?”, the Appendix
in Section 9 provides more examples of queries
transformed by the developed algorithms. In the
Appendix the reader can understand the different
nature of the different query languages by com-
paring equivalent queries, and appreciate Prolog
as a variable propositional language, SPARQL as
a graph language and λ-DCS as a variable-less
mathematical group language.
Note how for one of the queries, an equivalent
λ-DCS query was not possible to obtain with the
described algorithms. Besides, the Prolog query
and the SPARQL query there return different re-
sults. Flaws and limitations of the algorithms are
discussed in the next section.
Overall, the algorithms were able to successfully
transform about 90% of the 880 sample
queries provided in the GeoQuery database.
By "successfully transform" we mean that these
queries have been correctly expressed in Prolog,
SPARQL and λ-DCS, providing equivalent results
when executed. The queries that were not successfully
transformed into some language, along
with those whose transformation did not provide
equivalent results, make up the remaining 10%. The
algorithms' coverage is thus satisfactory,
especially considering that all basic queries can
be successfully transformed with these algorithms.
Operators beyond the basic ones, e.g.
count, descending order, max, union/or..., were
also successfully interpreted by the algorithms.
The algorithms still admit refinement,
improvement and extension, however, since some operators
are not yet included or are problematic. These
problems and limitations are outlined in the
following section.
6 Encountered Problems and Limitations
The development of the algorithms described above
encountered some difficulties, and in some cases
we were unable to successfully transform certain
queries. Here we explain these difficulties and,
where applicable, how they were overcome.
First we address some of the problems
encountered when dealing with Prolog.
Because relationships in Prolog are defined arbitrarily,
in some cases
one could define a better relationship to ease
the transformation to SPARQL. For example,
instead of the relationships capital(A) + loc(A,B)
it would be much better to
define the relationship capital(A,B),
which combines both and avoids having to
create a statement ?xStateCapitalOf <capital> ?A .
that contains an undefined
variable. In addition, some of the sample GeoQuery
queries contain redundancy that then
propagates as identical statements in SPARQL. As
seen in Query 2 of the Appendix, the Prolog
state(B),const(B,stateid(oregon))
could simply be expressed as
const(B,stateid(oregon)), without redundancy.
Finally, inconsistencies within the format
of the Prolog GeoQuery database
lead to problems when trying to test equivalent
queries. The property <lowest elevation>,
for example, is only defined for those states
that do not border the sea. Those states which
do border the sea are assumed to have a lowest
elevation equal to zero; however, the absence of
such a relationship leads to the inconsistencies
between Prolog and SPARQL queries expressed
in query Y of Table X, apart from requiring
an OPTIONAL operator in SPARQL which, as
we will address shortly, cannot be expressed in
λ-DCS.
The main problem when dealing with SPARQL
was the absence of a good SPARQL parser, which
would greatly simplify interpreting nested blocks.
The algorithm so far can only interpret a very
simple nested UNION block, but not, for example,
a query like this:
SELECT ?A WHERE {
  {
    SELECT ?B WHERE {
      ...
    }
  }
  ...
}
As a mathematical group and logical language,
λ-DCS has a wider coverage than SPARQL, but
only when one group or one variable is being
queried. A serious limitation of λ-DCS is that the
language is unable to query two variables or two
groups simultaneously. For example, a query to
retrieve the name and surname of all employees in
a company would look like this in SPARQL:
SELECT ?name ?surname WHERE {
?person <name> ?name .
?person <surname> ?surname .
}
This SPARQL query will return a table of two
columns, one column for the names and one col-
umn for the corresponding surnames. However,
due to having two variables to be retrieved, it
is impossible to express this in λ-DCS. λ-DCS
could retrieve a list of the names of the employ-
ees and a list of the surnames of the employees,
i.e. two separate lists, but not a single list with
name-surname pairs, which would be the equiv-
alent of the two-column answer from SPARQL.
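The difference between one two-column table and two separate lists can be illustrated with a short Python sketch. The data and the helper values_by_subject are hypothetical and serve only to show where the pairing information lives.

```python
# Hypothetical employee data, invented for this illustration.
TRIPLES = [
    ("person1", "name", "Ada"), ("person1", "surname", "Lovelace"),
    ("person2", "name", "Alan"), ("person2", "surname", "Turing"),
]


def values_by_subject(triples, predicate):
    return {s: o for s, p, o in triples if p == predicate}


names = values_by_subject(TRIPLES, "name")
surnames = values_by_subject(TRIPLES, "surname")

# Two-variable SELECT: each row keeps the name and surname of one ?person.
table = sorted((names[p], surnames[p]) for p in names if p in surnames)

# Two single-variable queries: each yields a set of values, and neither
# set records which name belongs to which surname.
name_set = set(names.values())
surname_set = set(surnames.values())

print(table)
print(name_set, surname_set)
```

The rows of the table are joined through the shared ?person entity, and that entity is exactly what disappears when each variable is retrieved as a set of its own.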
This limitation is a significant setback to transforming
arbitrary SPARQL queries into equivalent λ-DCS expressions,
as a major strength of SPARQL is retrieving
associated variables and properties in tables;
a large number of SPARQL queries
therefore retrieve more than one variable and
are impossible to transform to λ-DCS. Furthermore,
the OPTIONAL operator in SPARQL
cannot be expressed as a logical mathematical
group, which means it cannot be transformed to λ-DCS
either. The SPARQL OPTIONAL operator
addresses the sparsity and irregularity of properties
in RDF graph databases, allowing a query to
match a relationship whether it exists or not. For
example, a SPARQL query that retrieves the name
of the lowest point of a state and its corresponding
height, if it exists in the database, would be
similar to this:
SELECT ?lowpoint ?height WHERE {
?xA <type> 'highlow' .
?xA <lowest point> ?lowpoint .
OPTIONAL { ?xA <lowest height> ?height . }
}
The result would be a table with two columns:
one for the name of the lowest point and one for
its corresponding height. If the <lowest height>
property is not found, the name is still retrieved
and the corresponding cell in the second column is
left blank. This cannot be expressed in λ-DCS
because of the OPTIONAL operator and, as explained
above, because more than one variable is retrieved.
The main use of the OPTIONAL operator in SPARQL
is usually tied to retrieving more than one
variable, so these two limitations can generally
be seen as one. It is nevertheless a considerable
limitation to expressing SPARQL queries in λ-DCS,
as multivariable queries and OPTIONAL operators
are quite common in SPARQL. The only way to
tackle this problem would be to develop an
extension to λ-DCS that allows multiple logical
groups to be retrieved simultaneously and allows
some properties of said groups to be optional.
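The behaviour of OPTIONAL described in this section corresponds to a left outer join. The following Python sketch imitates it on made-up data: the point names, the height value and the helpers optional_join and inner_join are hypothetical, and None stands in for the blank cell.

```python
# Toy data (made up): every entry has a lowest point, only some have a height.
lowest_point = {"x1": "point_one", "x2": "point_two"}
lowest_height = {"x1": 85}  # "x2" has no height recorded


def optional_join(required, optional):
    """Keep every required binding; add the optional value or None."""
    return [(value, optional.get(key)) for key, value in required.items()]


def inner_join(required, optional):
    """Without OPTIONAL, entries lacking the second property drop out."""
    return [(value, optional[key]) for key, value in required.items()
            if key in optional]


print(optional_join(lowest_point, lowest_height))
# [('point_one', 85), ('point_two', None)]
print(inner_join(lowest_point, lowest_height))
# [('point_one', 85)]
```

Both functions return rows with two columns, which is why the multivariable limitation and the OPTIONAL limitation tend to show up together.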
7 Future Work
As stated throughout this paper, there is still
considerable room for improvement in the proposed
algorithms, either by extending them to cover
more operators or by using tools to better
interpret the input queries. For example, if a good
SPARQL parser were to be developed, interpreting
nested SPARQL blocks would become a much
more feasible task.
Another important step is to test the proposed
algorithms on other data sets and observe
their performance. The Prolog → SPARQL
algorithm does not generalize well because of
Prolog's arbitrary declarative nature, whereas
the SPARQL → λ-DCS algorithm is designed
to be universal. The latter should therefore
be tested using sample SPARQL queries from
databases like Freebase or QALD (footnote 6) as input.
Finally, we eagerly await the deployment of the
aforementioned QA project that would make full
use of the proposed algorithms for its purposes.
The success of said project would broaden the per-
spective of the utility behind transforming queries
between different query languages.
6 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=home&q=5
8 Conclusion
In this work we carried out the transformation
between the Prolog, SPARQL and λ-DCS query
languages. We found that it is a feasible task
when dealing with queries that originate from
natural language utterances or requests. We are
satisfied with how the proposed algorithms
transform the large majority of basic queries
successfully, and we consider it worthwhile to
continue the work and refine the algorithms.
Furthermore, working on these transformations
has given us a deeper understanding of the
similarities and differences of the target query
languages, and of how some are better suited to
particular tasks. As one might expect, there are
also concepts and queries that cannot be expressed
in all languages, so a transformation with total
coverage is impossible. However, this should not
discourage performing said transformations where
they are viable. Indeed, we hope to see the QA
project reach the full potential of the
transformations presented in this paper.
9 APPENDIX: Queries in different query languages
Utterance: What are the cities in Virginia?
Prolog:
answer(A,(city(A),loc(A,B), const(B,stateid(virginia)))).
SPARQL:
SELECT ?city WHERE {
?xA <name> ?city .
?xA <type> "city" .
?xA <name> ?A .
?xB <type> "state" .
?xB <name> ?B .
?xB <name> "virginia" .
?xA <state> ?B .
}
λ-DCS:
(!<name> (and (and (<type> 'city') (<name> [])) (<state> (!<name>
(and (<type> 'state') (<name> 'virginia'))))))
Query 1 (Used as example)
Utterance: What is the name of the highest point in Oregon?
Prolog:
answer(A,highest(A,(place(A),loc(A,B),state(B),const(B,stateid(oregon))))).
SPARQL:
SELECT ?A WHERE {
?xB <type> "state" .
?xB <name> ?B .
?xB <type> "state" .
?xB <name> ?B .
?xB <name> "oregon" .
?xA <highest point> ?A .
?xA <highest elevation> ?height0 .
?xA <state> ?B .
}
ORDER BY DESC(?height0) LIMIT 1
λ-DCS:
(!<highest point> (argmax 1 1 (<state> (!<name> (and (and
(<type> 'state') (<type> 'state')) (<name> (!<name> (and (and
(<type> 'state') (<type> 'state')) (<name> 'oregon')))))))
<highest elevation>))
Query 2
Utterance: What is the capital of Texas?
Prolog:
answer(A,(capital(A),loc(A,B),const(B,stateid(texas)))).
SPARQL:
SELECT ?A WHERE {
?xB <type> "state" .
?xB <name> ?B .
?xB <name> "texas" .
?xStateCapitalOf <capital> ?A .
?xA <name> ?A .
?xA <state> ?B .
}
λ-DCS:
(and (!<capital> []) (!<name> (<state> (!<name> (and (<type> 'state')
(<name> 'texas'))))))
Query 3
Utterance: How many states have a lower elevation than Arizona?
Prolog:
answer(A,count(B,(state(B),low_point(B,C),lower(C,D),
low_point(E,D),const(E,stateid(arizona))),A)).
SPARQL: (does not provide results equivalent to those of the Prolog query!)
SELECT (COUNT (?B) AS ?numberOFstate) WHERE {
?xB <type> "state" .
?xB <name> ?B .
?xC <state> ?B .
?xC <lowest point> ?C .
?xE <type> "state" .
?xE <name> ?E .
?xE <name> "arizona" .
OPTIONAL {?xC <lowest elevation> ?height0 . }
OPTIONAL {?xD <lowest elevation> ?height1 . }
FILTER ( IF ( BOUND(?height1), ?height0, 0 ) <
IF ( BOUND(?height1), ?height1, 0 ) )
?xD <state> ?E .
?xD <lowest point> ?D .
}
λ-DCS: Not possible! (See Section 6: Encountered problems and limitations)
Query 4
Utterance: What is the name of the lakes in Michigan?
Prolog:
answer(A,(lake(A),loc(A,B),const(B,stateid(michigan)))).
SPARQL:
SELECT ?lake WHERE {
?xA <name> ?lake .
?xA <type> "lake" .
?xA <name> ?A .
?xB <type> "state" .
?xB <name> ?B .
?xB <name> "michigan" .
?xA <isin> ?B .
}
λ-DCS:
(!<name> (and (and (<type> 'lake') (<name> [])) (<isin> (!<name> (and
(<type> 'state') (<name> 'michigan'))))))
Query 5
Utterance: How many rivers flow through Colorado?
Prolog:
answer(A,count(B,(river(B),loc(B,C),const(C,stateid(colorado))),A)).
SPARQL:
SELECT (COUNT (?B) AS ?numberOFriver) WHERE {
?xB <type> "river" .
?xB <name> ?B .
?xC <type> "state" .
?xC <name> ?C .
?xC <name> "colorado" .
?xB <flowsthru> ?C .
}
λ-DCS:
(count (!<name> (and (<type> 'river') (<flowsthru> (!<name> (and
(<type> 'state') (<name> 'colorado')))))))
Query 6
Utterance: What are the names of the highest points of the states bordering Mississippi?
Prolog:
answer(A,(high_point(B,A),state(B),next_to(B,C),const(C,stateid(mississippi)))).
SPARQL:
SELECT ?A WHERE {
?xB <type> "state" .
?xB <name> ?B .
?xC <type> "state" .
?xC <name> ?C .
?xC <name> "mississippi" .
?x0 <state> ?B .
?x0 <highest point> ?A .
?x1 <state> ?C .
?x1 <borderingstate> ?B .
}
λ-DCS:
(!<highest point> (<state> (and (!<name> (<type> 'state'))
(!<borderingstate> (<state> (!<name> (and (<type> 'state') (<name>
'mississippi'))))))))
Query 7
Utterance: Give me all the cities in the USA
Prolog:
answer(A,(city(A),loc(A,B),const(B,countryid(usa)))).
SPARQL:
SELECT ?A WHERE {
?xA <type> "city" .
?xA <name> ?A .
}
λ-DCS:
(!<name> (<type> 'city'))
Query 8
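Query 2 pairs ORDER BY DESC(...) LIMIT 1 in SPARQL with argmax in λ-DCS. As a rough sketch of why the two forms select the same entry, the Python below applies both strategies to made-up elevation values. The point names and numbers are hypothetical, and ties are assumed not to occur.

```python
# Toy bindings (made up): candidate point -> elevation.
elevation = {"point_a": 1200, "point_b": 3400, "point_c": 2100}

# SPARQL style: sort all rows in descending order, then keep the first one.
by_order_limit = sorted(elevation, key=elevation.get, reverse=True)[:1]

# argmax style: pick the entry that maximises the comparison property.
by_argmax = [max(elevation, key=elevation.get)]

print(by_order_limit)  # ['point_b']
print(by_argmax)       # ['point_b']
```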
References
Uwe Aßmann, Andreas Bartho, and Christian Wende. 2010. Reasoning Web. Semantic Technologies for Software
Engineering: 6th International Summer School 2010, Dresden, Germany, August 30-September 3, 2010.
Tutorial Lectures, volume 6325. Springer Science & Business Media.
Henk P Barendregt and Erik Barendsen. 1984. Introduction to lambda calculus. Nieuw archief voor wiskunde,
4(2):337–372.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from
question-answer pairs. In EMNLP, pages 1533–1544.
Percy Liang. 2013. Lambda dependency-based compositional semantics. arXiv preprint arXiv:1309.4408.

More Related Content

What's hot

Extractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised ApproachExtractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised Approach
Findwise
 
A Mathematical Approach to Ontology Authoring and Documentation
A Mathematical Approach to Ontology Authoring and DocumentationA Mathematical Approach to Ontology Authoring and Documentation
A Mathematical Approach to Ontology Authoring and Documentation
Christoph Lange
 
RDF and Java
RDF and JavaRDF and Java
RDF and Java
Constantin Stan
 
Query Translation for Data Sources with Heterogeneous Content Semantics
Query Translation for Data Sources with Heterogeneous Content Semantics Query Translation for Data Sources with Heterogeneous Content Semantics
Query Translation for Data Sources with Heterogeneous Content Semantics
Jie Bao
 
Semantic Web Nature
Semantic Web NatureSemantic Web Nature
Semantic Web Nature
Constantin Stan
 
Automated building of taxonomies for search engines
Automated building of taxonomies for search enginesAutomated building of taxonomies for search engines
Automated building of taxonomies for search engines
Boris Galitsky
 
Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
Lemon-aid: using Lemon to aid quantitative historical linguistic analysisLemon-aid: using Lemon to aid quantitative historical linguistic analysis
Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
mbruemmer
 
Grosof haley-talk-semtech2013-ver6-10-13
Grosof haley-talk-semtech2013-ver6-10-13Grosof haley-talk-semtech2013-ver6-10-13
Grosof haley-talk-semtech2013-ver6-10-13
Brian Ulicny
 
Bhagaban Mallik
Bhagaban MallikBhagaban Mallik
Using linguistic analysis to translate
Using linguistic analysis to translateUsing linguistic analysis to translate
Using linguistic analysis to translate
IJwest
 
search engine
search enginesearch engine
search engine
Musaib Khan
 
Hyponymy extraction of domain ontology
Hyponymy extraction of domain ontologyHyponymy extraction of domain ontology
Hyponymy extraction of domain ontology
IJwest
 
Framester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data HubFramester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data Hub
Mehwish Alam
 
Semantics
SemanticsSemantics
TextRank: Bringing Order into Texts
TextRank: Bringing Order into TextsTextRank: Bringing Order into Texts
TextRank: Bringing Order into Texts
Shubhangi Tandon
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain Ontology
Keerti Bhogaraju
 

What's hot (16)

Extractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised ApproachExtractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised Approach
 
A Mathematical Approach to Ontology Authoring and Documentation
A Mathematical Approach to Ontology Authoring and DocumentationA Mathematical Approach to Ontology Authoring and Documentation
A Mathematical Approach to Ontology Authoring and Documentation
 
RDF and Java
RDF and JavaRDF and Java
RDF and Java
 
Query Translation for Data Sources with Heterogeneous Content Semantics
Query Translation for Data Sources with Heterogeneous Content Semantics Query Translation for Data Sources with Heterogeneous Content Semantics
Query Translation for Data Sources with Heterogeneous Content Semantics
 
Semantic Web Nature
Semantic Web NatureSemantic Web Nature
Semantic Web Nature
 
Automated building of taxonomies for search engines
Automated building of taxonomies for search enginesAutomated building of taxonomies for search engines
Automated building of taxonomies for search engines
 
Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
Lemon-aid: using Lemon to aid quantitative historical linguistic analysisLemon-aid: using Lemon to aid quantitative historical linguistic analysis
Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
 
Grosof haley-talk-semtech2013-ver6-10-13
Grosof haley-talk-semtech2013-ver6-10-13Grosof haley-talk-semtech2013-ver6-10-13
Grosof haley-talk-semtech2013-ver6-10-13
 
Bhagaban Mallik
Bhagaban MallikBhagaban Mallik
Bhagaban Mallik
 
Using linguistic analysis to translate
Using linguistic analysis to translateUsing linguistic analysis to translate
Using linguistic analysis to translate
 
search engine
search enginesearch engine
search engine
 
Hyponymy extraction of domain ontology
Hyponymy extraction of domain ontologyHyponymy extraction of domain ontology
Hyponymy extraction of domain ontology
 
Framester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data HubFramester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data Hub
 
Semantics
SemanticsSemantics
Semantics
 
TextRank: Bringing Order into Texts
TextRank: Bringing Order into TextsTextRank: Bringing Order into Texts
TextRank: Bringing Order into Texts
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain Ontology
 

Viewers also liked

"Vouchers" para el desarrollo de mercados de servicios empresariales: ¿de qué...
"Vouchers" para el desarrollo de mercados de servicios empresariales: ¿de qué..."Vouchers" para el desarrollo de mercados de servicios empresariales: ¿de qué...
"Vouchers" para el desarrollo de mercados de servicios empresariales: ¿de qué...
José Pedro Alberti
 
IMAGES_Winter 2017_D
IMAGES_Winter 2017_DIMAGES_Winter 2017_D
IMAGES_Winter 2017_D
Jared Glover
 
Todd Goodall Qualtrics preso
Todd Goodall Qualtrics preso Todd Goodall Qualtrics preso
Todd Goodall Qualtrics preso
Todd Goodall
 
Multifamily Social Media Summit
Multifamily Social Media SummitMultifamily Social Media Summit
Multifamily Social Media Summit
Dylan Sellberg
 
Case Study 1
Case Study 1Case Study 1
Case Study 1
Jordan Schultz
 
August 2016 Connection Newsletter
August 2016 Connection NewsletterAugust 2016 Connection Newsletter
August 2016 Connection Newsletter
ComplianceSigns, LLC
 
Feb mar 2017 newsletter
Feb mar 2017 newsletterFeb mar 2017 newsletter
Feb mar 2017 newsletter
EpworthUMC
 
Identificacion estilos de aprendizaje
Identificacion estilos de aprendizajeIdentificacion estilos de aprendizaje
Identificacion estilos de aprendizaje
valen_MARIN
 
Subdrenaje o drenaje subterraneo
Subdrenaje o drenaje subterraneoSubdrenaje o drenaje subterraneo
Subdrenaje o drenaje subterraneo
Francis L Marquez C
 
Organizacion andreinarodriguez 20980822
Organizacion andreinarodriguez 20980822Organizacion andreinarodriguez 20980822
Organizacion andreinarodriguez 20980822
guillencindy
 
NSCZ_EI_Nuestras aulas cuentan con...
NSCZ_EI_Nuestras aulas cuentan con...NSCZ_EI_Nuestras aulas cuentan con...
NSCZ_EI_Nuestras aulas cuentan con...
mrch900
 
CV2017COMMKTAI_sp
CV2017COMMKTAI_spCV2017COMMKTAI_sp
CV2017COMMKTAI_sp
Renata Gimenes
 

Viewers also liked (12)

"Vouchers" para el desarrollo de mercados de servicios empresariales: ¿de qué...
"Vouchers" para el desarrollo de mercados de servicios empresariales: ¿de qué..."Vouchers" para el desarrollo de mercados de servicios empresariales: ¿de qué...
"Vouchers" para el desarrollo de mercados de servicios empresariales: ¿de qué...
 
IMAGES_Winter 2017_D
IMAGES_Winter 2017_DIMAGES_Winter 2017_D
IMAGES_Winter 2017_D
 
Todd Goodall Qualtrics preso
Todd Goodall Qualtrics preso Todd Goodall Qualtrics preso
Todd Goodall Qualtrics preso
 
Multifamily Social Media Summit
Multifamily Social Media SummitMultifamily Social Media Summit
Multifamily Social Media Summit
 
Case Study 1
Case Study 1Case Study 1
Case Study 1
 
August 2016 Connection Newsletter
August 2016 Connection NewsletterAugust 2016 Connection Newsletter
August 2016 Connection Newsletter
 
Feb mar 2017 newsletter
Feb mar 2017 newsletterFeb mar 2017 newsletter
Feb mar 2017 newsletter
 
Identificacion estilos de aprendizaje
Identificacion estilos de aprendizajeIdentificacion estilos de aprendizaje
Identificacion estilos de aprendizaje
 
Subdrenaje o drenaje subterraneo
Subdrenaje o drenaje subterraneoSubdrenaje o drenaje subterraneo
Subdrenaje o drenaje subterraneo
 
Organizacion andreinarodriguez 20980822
Organizacion andreinarodriguez 20980822Organizacion andreinarodriguez 20980822
Organizacion andreinarodriguez 20980822
 
NSCZ_EI_Nuestras aulas cuentan con...
NSCZ_EI_Nuestras aulas cuentan con...NSCZ_EI_Nuestras aulas cuentan con...
NSCZ_EI_Nuestras aulas cuentan con...
 
CV2017COMMKTAI_sp
CV2017COMMKTAI_spCV2017COMMKTAI_sp
CV2017COMMKTAI_sp
 

Similar to master_thesis_greciano_v2

eureka09
eureka09eureka09
eureka09
tutorialsruby
 
eureka09
eureka09eureka09
eureka09
tutorialsruby
 
IRJET- An Efficient Way to Querying XML Database using Natural Language
IRJET-  	  An Efficient Way to Querying XML Database using Natural LanguageIRJET-  	  An Efficient Way to Querying XML Database using Natural Language
IRJET- An Efficient Way to Querying XML Database using Natural Language
IRJET Journal
 
Sem facet paper
Sem facet paperSem facet paper
Sem facet paper
DBOnto
 
Making the semantic web work
Making the semantic web workMaking the semantic web work
Making the semantic web work
Paul Houle
 
Algebraic Data Types for Data Oriented Programming - From Haskell and Scala t...
Algebraic Data Types forData Oriented Programming - From Haskell and Scala t...Algebraic Data Types forData Oriented Programming - From Haskell and Scala t...
Algebraic Data Types for Data Oriented Programming - From Haskell and Scala t...
Philip Schwarz
 
Realization of natural language interfaces using
Realization of natural language interfaces usingRealization of natural language interfaces using
Realization of natural language interfaces using
unyil96
 
Explanations in Dialogue Systems through Uncertain RDF Knowledge Bases
Explanations in Dialogue Systems through Uncertain RDF Knowledge BasesExplanations in Dialogue Systems through Uncertain RDF Knowledge Bases
Explanations in Dialogue Systems through Uncertain RDF Knowledge Bases
Daniel Sonntag
 
Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678
Editor IJARCET
 
Automatically converting tabular data to
Automatically converting tabular data toAutomatically converting tabular data to
Automatically converting tabular data to
IJwest
 
May 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language ComputingMay 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language Computing
kevig
 
Evaluation criteria for nosql databases
Evaluation criteria for nosql databasesEvaluation criteria for nosql databases
Evaluation criteria for nosql databases
Ebenezer Daniel
 
INTELLIGENT-MULTIDIMENSIONAL-DATABASE-INTERFACE
INTELLIGENT-MULTIDIMENSIONAL-DATABASE-INTERFACEINTELLIGENT-MULTIDIMENSIONAL-DATABASE-INTERFACE
INTELLIGENT-MULTIDIMENSIONAL-DATABASE-INTERFACE
Mohamed Reda
 
.Net and Rdf APIs
.Net and Rdf APIs.Net and Rdf APIs
.Net and Rdf APIs
Recean Denis
 
Stay fresh
Stay freshStay fresh
Stay fresh
Ahmed Mohamed
 
Oop
OopOop
EasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdfEasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdf
NohaGhoweil
 
Object relationship mapping and hibernate
Object relationship mapping and hibernateObject relationship mapping and hibernate
Object relationship mapping and hibernate
Joe Jacob
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniques
iosrjce
 
D017232729
D017232729D017232729
D017232729
IOSR Journals
 

Similar to master_thesis_greciano_v2 (20)

eureka09
eureka09eureka09
eureka09
 
eureka09
eureka09eureka09
eureka09
 
IRJET- An Efficient Way to Querying XML Database using Natural Language
IRJET-  	  An Efficient Way to Querying XML Database using Natural LanguageIRJET-  	  An Efficient Way to Querying XML Database using Natural Language
IRJET- An Efficient Way to Querying XML Database using Natural Language
 
Sem facet paper
Sem facet paperSem facet paper
Sem facet paper
 
Making the semantic web work
Making the semantic web workMaking the semantic web work
Making the semantic web work
 
Algebraic Data Types for Data Oriented Programming - From Haskell and Scala t...
Algebraic Data Types forData Oriented Programming - From Haskell and Scala t...Algebraic Data Types forData Oriented Programming - From Haskell and Scala t...
Algebraic Data Types for Data Oriented Programming - From Haskell and Scala t...
 
Realization of natural language interfaces using
Realization of natural language interfaces usingRealization of natural language interfaces using
Realization of natural language interfaces using
 
Explanations in Dialogue Systems through Uncertain RDF Knowledge Bases
Explanations in Dialogue Systems through Uncertain RDF Knowledge BasesExplanations in Dialogue Systems through Uncertain RDF Knowledge Bases
Explanations in Dialogue Systems through Uncertain RDF Knowledge Bases
 
Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678
 
Automatically converting tabular data to
Automatically converting tabular data toAutomatically converting tabular data to
Automatically converting tabular data to
 
May 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language ComputingMay 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language Computing
 
Evaluation criteria for nosql databases
Evaluation criteria for nosql databasesEvaluation criteria for nosql databases
Evaluation criteria for nosql databases
 
INTELLIGENT-MULTIDIMENSIONAL-DATABASE-INTERFACE
INTELLIGENT-MULTIDIMENSIONAL-DATABASE-INTERFACEINTELLIGENT-MULTIDIMENSIONAL-DATABASE-INTERFACE
INTELLIGENT-MULTIDIMENSIONAL-DATABASE-INTERFACE
 
.Net and Rdf APIs
.Net and Rdf APIs.Net and Rdf APIs
.Net and Rdf APIs
 
Stay fresh
Stay freshStay fresh
Stay fresh
 
Oop
OopOop
Oop
 
EasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdfEasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdf
 
Object relationship mapping and hibernate
Object relationship mapping and hibernateObject relationship mapping and hibernate
Object relationship mapping and hibernate
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniques
 
D017232729
D017232729D017232729
D017232729
 

master_thesis_greciano_v2

  • 1. Transformation between Query Languages Miguel Cristian Greciano Raiskila Tokyo National Institute of Informatics Technische Universit¨at Darmstadt Universidad Polit´ecnica de Madrid mc.greciano@gmail.com Yusuke Miyao Tokyo National Institute of Informatics Associate Professor Master Thesis Tutor yusuke@nii.ac.jp Abstract The Semantic Web, an extension of the Web that provides easier ways to retrieve data, has seen a major growth in the last years. Not only do users of the Inter- net desire information, they desire better, quicker, easier and more efficient ways to access that information and find answers to their questions. A clear example of the Semantic Web phe- nomenon is the popularity of the Freebase data set. It is a big collection of struc- tured data harvested from many sources. Freebase runs on a database infrastructure created in-house by Metaweb that uses a graph model. Because its data structure is non-hierarchical, Freebase is open for users to enter new objects and relation- ships into the underlying graph, a great advantage. Since 2008, Freebase imple- ments RDF (Resource Description For- mat), allowing Freebase to be used as linked data and be queried by languages such as SPARQL, which is also quite pop- ular. SPARQL users normally create their own SPARQL queries manually, or at least semi-manually. There are even hundreds of sample SPARQL queries on the web with their associated natural language ut- terance to hint users how to create or adapt their own queries. It would be however ideal if this whole step from natural lan- guage to executing a query in the database was automated. In this paper we wish to address a key feature in the desired automation between natural language and query execution: transformation between query languages. Indeed in the example of Freebase and SPARQL, natural languages utterances are far in nature from RDF graphs, but closer in nature to semantic trees. 
If we could close the gap between trees and graphs by transforming between different query lan- guages, we would draw near to our final goal of automation. In this paper we out- line two algorithms that transform queries between different languages, the SPARQL → λ-DCS one being the more relevant. We also provide the necessary background as to the reasons and utility for said algo- rithms. 1 Introduction This paper outlines the algorithms developed to transform queries from one query language into their equivalents in a different query language. It also provides a brief but comprehensive back- ground on the query languages and tools chosen for this study, as well as the reasons to why these were chosen for this study. An obvious initial question is what kind of benefits would transformation between query lan- guages provide. It is true that equivalent queries produce equivalent results (”Where was Barack Obama born?” should return ”Honolulu” no mat- ter which query language one is using), however, different languages express different concepts for equivalent queries, as well as being different in efficiency and answer-retrieving speed for equiv- alent queries. Just as one can express a database containing the books in a library and associated data both in, for example, programming languages Java and C, the implementation of said database will be obviously different due to the different inherent nature between both programming lan- guages. Java will use objects to represent the data and C will use other data structures and care about storage and memory differently. Furthermore, transformation between query languages can be very useful when a specific query
  • 2. language is more suitable than another for cer- tain natural language processing tasks. Some lan- guages have a graph structure and others have a tree structure. Associating syntactic trees that arise from parsing a natural language sentence with a query could be intuitively easier done if the query itself has a tree structure. For ex- ample, in their ”Semantic Parsing on Freebase from Question-Answer Pairs” (Berant et al., 2013) Jonathan Berant, Andrew Chou, Roy Frostig and Percy Liang already suggested a way to map natu- ral language utterances to queries on the Freebase data set: through a tree-structured query language they name Lambda Dependency-Based Composi- tional Semantics (from now on λ-DCS). An ambitious project that our team is currently working on, the Question-Answering Project (from now on, the QA project), intends to auto- matically create these Freebase queries from natu- ral language input. As mentioned in the abstract, we already possess a good amount of utterance- query pairs, however in order to train the inter- preter we must be able to traverse from utterance- semantic tree-graph-query and the reverse way too. The chosen query languages for the transforma- tions are Prolog, SPARQL and λ-DCS. Section 3 contains a brief exposition of the syntax and pe- culiarities of these languages. Section 4 outlines the transformation algorithms between these lan- guages. In section 5 we analyze the performance of said algorithms. Section 6 explains the encoun- tered problems and limitations to the algorithms and the transformations, as well as some possible solutions. Section 7 suggests future tasks to be carried out from this project, and section 8 pro- vides the conclusion to our work. Section 9 is an Appendix containing some of the results of our work: queries equivalently expressed in different query languages. As we will see, the λ-DCS → SPARQL trans- formation is already implicit in the toolkit that we have. 
However the major contribution of this paper is the reverse SPARQL → λ-DCS trans- formation. It is not a trivial task as λ-DCS → SPARQL is transforming trees into graphs, whereas SPARQL → λ-DCS is the opposite: rans- forming graphs into trees. Trees are by defini- tion a particular case of graphs and thus less ex- pressive, so in theory transforming from graphs to trees should pose a significant challenge. The SPARQL → λ-DCS has not been proposed yet and is important for the QA project, which needs to traverse from semantic trees to graphs (λ-DCS → SPARQL) and also from graphs to semantic trees (SPARQL → λ-DCS) in order to execute training. The chosen experimental database is GeoQuery 1. It is a small database containing geographical information about all of the states in the US (cities, rivers, roads, highest and lowest points...). An ex- tended geographical database with more data is included in Freebase, however we chose to work with GeoQuery because it is simple, intuitive, small, comes in various formats and is free and open source (despite Freebase being also free and open source, there are of course data sets which are not, so this advantage should not be consid- ered a given). It is also relatively well-known. One of the formats GeoQuery comes in is the Pro- log format, and this is the main reason why we chose to select Prolog as one of the query lan- guages in this research. GeoQuery comes with more than 880 functional Prolog queries as sam- ples, all of them associated with natural language questions such as ”What is the longest river in Col- orado?”. The samples are first-order-logic queries, i.e. they treat with quantities and sets, not propo- sitions. Thus, we have a good starting source ma- terial for our objective of transforming functional queries in one language to their equivalents in an- other language. In this case, the algorithms are de- signed to transform first from Prolog to SPARQL, and then from SPARQL to λ-DCS. 
We thus wish the algorithms function properly in the GeoQuery database, with the hope that such algorithms gen- eralize well when transforming queries in larger datasets like Freebase. 2 Related Work Automated transformation between query lan- guages is not a very common practice. Most of the times queries are written manually for the purpose of retrieving desired information. Consequently, related work is scarce on the web. One can still find similar attempts though: in the book ”Reason- ing Web” there’s a subsection addressing transfor- mation between SPARQL and GReQL. (A mann et al., 2010) As we will see in section 4.3, the transforma- 1 http://www.cs.utexas.edu/users/ml/ nldata/geoquery.html
  • 3. tion λ-DCS → SPARQL is implicit in the SEM- PRE toolkit, however the reverse transformation SPARQL → λ-DCS is not. One of the algo- rithms proposed in this paper attempts to exe- cute this transformation. The other algorithm pro- posed is for the Prolog → SPARQL transforma- tion, which as far as the authors are concerned, nobody else has attempted to develop. Thus, even though Transformation between Query Languages is not a new concept, it is still rare, and this paper pioneers the transformations Prolog → SPARQL and SPARQL → λ-DCS. 3 Query Languages Overview Here we do not intend to explain all the intrica- cies of the used languages. However we wish to provide a simple background for each of them, along with some basic definitions, so that the reader can fully comprehend the algorithms de- veloped in this work and the subsequent results of the research. We also provide references to more complete expositions and/or tutorials of these lan- guages should the reader wish to deepen his un- derstanding of these query languages. 3.1 Prolog Prolog is a general purpose logic programming language, with roots in first-order logic. Prolog is declarative: the program logic is expressed in terms of relations or relationships, the query of which initiates a computation. The relationships are thus arbitrary, i.e., the author decides how to define said relationships, and there is no set of uni- versal relationships. These relationships connect Prolog variables with each other, as well as with constants. An easy tool to interpret and execute Prolog queries is SWI-Prolog. 2 Here is an overview of the GeoQuery database in Prolog format. 
The database entries have the following patterns:

    # state(name, abbreviation, capital, population, area, state number, city1, city2, city3, city4)
    # city(state, state abbreviation, name, population)
    # river(name, length, [states through which it flows])
    # border(state, state abbreviation, [states that border it])
    # highlow(state, state abbreviation, highest point, highest elevation, lowest point, lowest elevation)
    # mountain(state, state abbreviation, name, height)
    # road(number, [states it passes through])
    # lake(name, area, [states it is in])

Here are some instances of these patterns as an example:

    # state('arkansas', 'ar', 'little rock', 2286.0e+3, 53.2e+3, 25, 'little rock', 'fort smith', 'north little rock', 'pine bluff').
    # state('california', 'ca', 'sacramento', 23.67e+6, 158.0e+3, 31, 'los angeles', 'san diego', 'san francisco', 'san jose').
    # state('colorado', 'co', 'denver', 2889.0e+3, 104.0e+3, 38, 'denver', 'colorado springs', 'aurora', 'lakewood').
    ...
    # river('mississippi', 3778, ['minnesota', 'wisconsin', 'iowa', 'illinois', 'missouri', 'kentucky', 'tennessee', 'arkansas', 'mississippi', 'louisiana', 'louisiana']).
    # river('missouri', 3968, ['montana', 'north dakota', 'south dakota', 'iowa', 'nebraska', 'missouri', 'missouri']).
    # river('colorado', 2333, ['colorado', 'utah', 'arizona', 'nevada', 'california']).

2 http://www.swi-prolog.org/

Apart from the entries in the database following a known pattern, relationships have to be defined in order to be understood by the Prolog interpreter. Here are two examples of such definitions in Prolog:
    # loc(cityid(City,St), stateid(State)) :- city(State,St,City,_).
    # const(V,V).

The first example, the relation "loc", indicates that when "loc" is applied to a city and a state, the interpreter resolves it using the first, second and third properties (State, St and City) of the corresponding city entry; the fourth property (the population) is ignored. The second example, the relation "const", indicates that both arguments are to be unified with each other. "const" is useful for binding Prolog variables to constants.

Finally, we present one of the sample Prolog queries contained in the GeoQuery database:

    answer(A,(city(A),loc(A,B),const(B,stateid(virginia)))).

This query retrieves all cities in the state of Virginia. As we can see, the query asks for A to be returned, where A is a city and A is located in B, which corresponds to the state whose constant ID is Virginia. We will use this query as the example input for the algorithms described in Section 4.

3.2 SPARQL

SPARQL ("SPARQL Protocol and RDF Query Language") is an RDF query language. In other words, it is a semantic query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format. It is recognized as one of the key technologies of the Semantic Web, and it is an official W3C Recommendation. A very instructive SPARQL tutorial can be found on the Apache Jena homepage (see footnote 3). We suggest executing SPARQL queries either with the Apache Jena framework or with a Virtuoso server (see footnote 4). In this subsection we explain only the very basics of SPARQL; please refer to the aforementioned tutorial to actually learn the language.

SPARQL is a very popular language for querying RDF graphs. Important datasets like Freebase are stored in RDF format.
RDF graphs basically consist of a set of triples or statements, i.e. patterns like Subject <Verb> Object or Entity1 <Relationship> Entity2. Here is an example of such an RDF graph:

    <state25> <type> 'state' .
    <state25> <name> 'mississippi' .
    ...
    <river1> <type> 'river' .
    <river1> <name> 'mississippi' .

SPARQL matches triple templates against the RDF graph and returns the triples that fit the template. The templates are expressed with SPARQL variables, which can be recognized because they start with a "?" symbol. Here are two examples of SPARQL queries:

    SELECT ?x WHERE {
      ?x <name> 'mississippi' .
    }

    SELECT ?x WHERE {
      ?x <type> 'state' .
      ?x <name> 'mississippi' .
    }

As we can see, ?x is a SPARQL variable, and in both cases it is matched against the elements that appear on the left-hand side (the subject position) of the RDF triples. It is also the variable named in the SELECT clause, so it is the variable being queried and its values are returned as the answer. The WHERE block lists the patterns that the graph must match. All patterns in the WHERE block must be matched for the entity on the left to be bound to ?x. Thus, if we execute both queries on the RDF example above, the first query returns <state25> and <river1> as answers, but the second query returns only <state25>, because only <state25> matches both ?x <type> 'state' and ?x <name> 'mississippi', whereas <river1> matches only the latter.

3 http://jena.apache.org/tutorials/sparql.html
4 http://kidehen.typepad.com/kingsley_idehens_typepad/

3.3 λ-DCS

Lambda dependency-based compositional semantics (λ-DCS) is a new formal language for representing logical forms in semantic parsing. It was developed by Percy Liang (Liang, 2013). It attempts to express logical forms in a simpler way than lambda calculus. By eliminating variables and making existential quantification implicit, λ-DCS logical forms are generally more compact than those in lambda calculus. Compared to the graph structure of SPARQL, the tree structure of λ-DCS and its absence of variables should be very helpful when associating the trees produced by parsing natural language utterances with the database queries that retrieve the answer.

To show how λ-DCS is notationally simpler than lambda calculus, compare the following expressions:

– Natural language utterance: "people who have lived in Seattle"
– Logical form (lambda calculus): λx.∃e.PlacesLived(x,e) ∧ Location(e,Seattle)
– Logical form (λ-DCS): PlacesLived.Location.Seattle
– SEMPRE notation: (!<name> (and (<type> 'people') (<livedin> (and (<type> 'state') (<name> 'seattle')))))

All of these express the same concept; however, λ-DCS lacks variables and thus yields a much simpler expression than lambda calculus. If the reader is interested in a deeper understanding of lambda calculus, we recommend Barendregt's "Introduction to lambda calculus" (Barendregt and Barendsen, 1984).

SEMPRE is a toolkit for training semantic parsers, which map natural language utterances to denotations (answers) via intermediate logical forms. It is the toolkit Percy Liang developed in order to execute λ-DCS queries (see footnote 5). We also used the SEMPRE toolkit in this work, and when we refer to λ-DCS queries in this report, we are actually referring to their SEMPRE notation, not their logical form. The SEMPRE query above can be read as: "return the name of all entities of type 'people' that have lived in the entities of type 'state' and name 'seattle'". The set-based nature of this language is thus clearly manifest, with operations such as intersection (and) and union (or) being used.
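The set-based reading of SEMPRE expressions described above can be illustrated with a minimal Python sketch. It uses the four triples from the RDF example in Section 3.2. The helper names `join` and `reverse_join` are hypothetical and are not part of SEMPRE; the sketch only assumes that `(<prop> X)` denotes the entities whose property value lies in X, and that `!` reverses the direction of a property.

```python
# Toy triple store: the RDF example from Section 3.2 as (subject, property, object).
TRIPLES = [
    ("state25", "type", "state"),
    ("state25", "name", "mississippi"),
    ("river1", "type", "river"),
    ("river1", "name", "mississippi"),
]


def join(prop, values):
    """(<prop> X): entities whose <prop> value is a member of the set X."""
    return {s for s, p, o in TRIPLES if p == prop and o in values}


def reverse_join(prop, entities):
    """(!<prop> X): <prop> values of the entities in the set X."""
    return {o for s, p, o in TRIPLES if p == prop and s in entities}


# (<name> 'mississippi') matches both the state and the river.
named_mississippi = join("name", {"mississippi"})

# (and (<type> 'state') (<name> 'mississippi')) is a set intersection.
state_mississippi = join("type", {"state"}) & named_mississippi

# (!<name> (<type> 'river')) returns the names of all rivers.
river_names = reverse_join("name", join("type", {"river"}))

print(sorted(named_mississippi))   # ['river1', 'state25']
print(sorted(state_mississippi))   # ['state25']
print(sorted(river_names))         # ['mississippi']
```

The two intermediate sets mirror the answers of the two SPARQL queries in Section 3.2: the name pattern alone matches both entities, while adding the type pattern leaves only the state.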
When executing λ-DCS queries, the SEMPRE toolkit automatically transforms them into an equivalent SPARQL query, which is then executed on a Virtuoso server. Thus the λ-DCS → SPARQL transformation is implicit within the toolkit. In this paper we attempt to perform the opposite transformation, SPARQL → λ-DCS, which can be extremely useful for the aforementioned QA project.

5 http://nlp.stanford.edu/software/sempre/

4 Transformation algorithms

In this section we describe the transformation algorithms step by step. Because the algorithms are much easier to understand with a specific example, we will use a simple sample query associated with the question "What are the cities in Virginia?"

4.1 Prolog → SPARQL Algorithm

The first thing to note is that the GeoQuery database is not provided in RDF format. Thus, we first need to transform the database to RDF format so that SPARQL and λ-DCS queries can be executed on it. There are many straightforward ways to do this transformation. In our case we chose to transform each entry into a geographical entity with associated properties, with the entries forming different and independent graphs (no linking between graphs). For example, this entry:

    # state('arizona','az','phoenix',2718.0e+3,114.0e+3,48,'phoenix','tucson','mesa','tempe').

transforms to:

    <state3> <type> 'state' .
    <state3> <name> 'arizona' .
    <state3> <abbreviation> 'az' .
    <state3> <capital> 'phoenix' .
    <state3> <population> 2718.0e+3 .
    <state3> <area> 114.0e+3 .
    <state3> <state number> 48 .
    <state3> <city1> 'phoenix' .
    <state3> <city2> 'tucson' .
    <state3> <city3> 'mesa' .
    <state3> <city4> 'tempe' .

Note that an extra property, <type>, is added to make explicit that the geographical entity <state3> is a state. Prolog directly identifies a database entry as a state through its predicate name; the RDF format does not.

Now that the GeoQuery database is also available in RDF format, we can attempt to transform the sample Prolog queries into SPARQL queries. As stated before, the Prolog relationships are arbitrarily chosen by the GeoQuery database, and thus this Prolog → SPARQL transformation is not universal, but specific to the GeoQuery database. Other Prolog relationships would require other transformations. The SPARQL → λ-DCS transformation proposed afterwards, however, is intended to be universal. We first enumerate the steps in the abstract, and then explain how the algorithm executes on an example Prolog query.

1. Use the NLTK toolkit to create a tree from the Prolog query.
2. Identify the Prolog variables (single capital letters in the leaves of the tree) and store them in an initially empty variables dictionary.
3. Identify the type of the Prolog variables and store the type in the variables dictionary.
   3.1. Look for single-leaf nodes.
   3.2. Look for two-leaf nodes labeled "const".
   3.3. Leave the nodes that carry no type information for the next step.
4. Using the types of the variables, express the relationships between the Prolog variables in SPARQL format. This step must interpret the nodes of the tree that were not interpreted in Step 3.
5. Organize all collected information in a correct RDF graph form. In this work we opted for a string to which the information is appended progressively as the algorithm executes.

Now we will see the algorithm executed on an example. Given the sample Prolog query that retrieves all the cities in the state of Virginia:

    answer(A,(city(A),loc(A,B),const(B,stateid(virginia)))).

the first step is to parse the query and create an NLTK tree from it. This is what the tree looks like (Step 1):

[Figure: NLTK tree of the sample Prolog query]

Note that the label "goals" has been added at an unnamed node in the original Prolog query.
The name is arbitrary and is simply there so that all nodes in the NLTK tree are labeled.

The next step is to identify the Prolog variables (Step 2) and their types (Step 3), information required to interpret the other Prolog relationships. Variables are single capital letters located in the leaves of the tree. We can obtain the type of a variable either from the label of a single-leaf subtree (e.g. "city(A)" tells us A is a city) or from a two-leaf subtree with the label "const" (e.g. const(B,stateid(virginia)) tells us B is a state, and its name is Virginia). We can thus create a variable dictionary containing all the variables and their corresponding types:

    varDict={'A':'city','B':'state'}

Finally, we are now able to interpret the other subtrees or Prolog relationships, like loc(A,B), which states that city A is located in state B (Step 4). We would be unable to infer this without the types of A and B ("loc" could equally refer to a river A located in state B, which corresponds to a differently named RDF relationship), so these subtrees can only be interpreted at this stage. Now that all information about the Prolog relationships has been retrieved, it can be expressed as a SPARQL query (Step 5):

    SELECT ?city WHERE {
      ?xA <name> ?city .
      ?xA <type> "city" .
      ?xA <name> ?A .
      ?xB <type> "state" .
      ?xB <name> ?B .
      ?xB <name> "virginia" .
      ?xA <state> ?B .
    }
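Steps 1 to 5 can be sketched in Python under some loudly stated assumptions. The sketch swaps the NLTK tree for a small hand-written parser that yields (label, children) tuples; the function names `parse_term`, `variable_types` and `to_sparql` are hypothetical; only the `const` and `loc` relationships of the sample query are handled; and the answer variable is selected directly as ?A, without the extra ?city alias of the Step 5 query. It is not the authors' implementation.

```python
import re


def parse_term(text):
    """Step 1: parse a Prolog term into (label, children) tuples."""
    tokens = re.findall(r"[A-Za-z_]\w*|[(),]", text)
    pos = 0

    def term():
        nonlocal pos
        label = "goals"                  # label for unnamed "( ... )" groups
        if tokens[pos] != "(":
            label = tokens[pos]
            pos += 1
        children = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1                     # consume "("
            children.append(term())
            while tokens[pos] == ",":
                pos += 1                 # consume ","
                children.append(term())
            pos += 1                     # consume ")"
        return (label, children)

    return term()


def is_variable(node):
    """Prolog variables are single capital letters in the leaves."""
    label, children = node
    return not children and re.fullmatch(r"[A-Z]", label) is not None


def variable_types(tree):
    """Steps 2-3: build the variable dictionary, e.g. {'A': 'city'}."""
    var_types = {}

    def visit(node):
        label, children = node
        if len(children) == 1 and is_variable(children[0]):
            var_types[children[0][0]] = label              # city(A)
        elif label == "const" and len(children) == 2 and is_variable(children[0]):
            kind = children[1][0]                          # e.g. "stateid"
            var_types[children[0][0]] = kind[:-2] if kind.endswith("id") else kind
        for child in children:
            visit(child)

    visit(tree)
    return var_types


def to_sparql(query):
    """Steps 4-5 for the sample query only: handles const/2 and loc/2."""
    tree = parse_term(query)
    types = variable_types(tree)
    lines = []
    for var, kind in types.items():
        lines.append('?x%s <type> "%s" .' % (var, kind))
        lines.append("?x%s <name> ?%s ." % (var, var))

    def visit(node):
        label, children = node
        if label == "const" and len(children) == 2 and children[1][1]:
            constant = children[1][1][0][0]                # e.g. "virginia"
            lines.append('?x%s <name> "%s" .' % (children[0][0], constant))
        elif label == "loc" and len(children) == 2:
            inner, outer = children[0][0], children[1][0]
            # Assumption: the property is named after the container's type.
            lines.append("?x%s <%s> ?%s ." % (inner, types[outer], outer))
        for child in children:
            visit(child)

    visit(tree)
    answer_var = tree[1][0][0]                             # first argument of answer(...)
    return "SELECT ?%s WHERE {\n  %s\n}" % (answer_var, "\n  ".join(lines))


QUERY = "answer(A,(city(A),loc(A,B),const(B,stateid(virginia))))."
print(variable_types(parse_term(QUERY)))   # {'A': 'city', 'B': 'state'}
print(to_sparql(QUERY))
```

The printed dictionary matches varDict from Steps 2 and 3. The generated statements cover the same type, name and location constraints as the Step 5 query, although their order differs. Relationships such as a river flowing through a state would need their own rules, which is why the transformation is specific to GeoQuery.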
The SPARQL query produced in Step 5 is equivalent to the original Prolog query: it retrieves all the cities in the state of Virginia. Prolog variables are written with an extra "x" in front of them when they appear on the left-hand side of a SPARQL statement, because there they reference geographical entities. On the right-hand side they reference string names, and thus the two uses need to be differentiated. The prefix also helps clarity. Further post-processing of this SPARQL query is possible, for example condensing these four statements

    ?xB <type> "state" .
    ?xB <name> ?B .
    ?xB <name> "virginia" .
    ?xA <state> ?B .

into just this one:

    ?xA <state> "virginia" .

However, the SPARQL query that the algorithm produces expresses all the information given in the original Prolog query, and thus we chose to leave it as it is.

4.2 SPARQL → λ-DCS Algorithm

From the SPARQL query obtained with the previous algorithm, we now attempt to create an equivalent λ-DCS query. We first present the steps of the algorithm in the abstract, and then show the algorithm applied to the sample query and the result we eventually arrive at.

1. Parse and interpret every line in the SPARQL query, and create a variable dictionary.
   1.1. Identify the variables (they start with "?").
   1.2. Assign relationships between variables and/or constants.
   1.3. Add the reverse relationships (starting with "!") to the target variables.
2. Traverse the variable dictionary to eliminate the SPARQL variables.
   2.1. Start with the selected variable.
   2.2. Transcribe its relationships.
   2.3. No relationships → [], more than one relationship → "and" operator.
   2.4. Select the next SPARQL variable to traverse, and repeat this step until all variables have been traversed.
3. Add special options ("Count", "Limit", "Or"...) where appropriate.

The walkthrough that follows applies this algorithm, iteration by iteration, to the sample query from the previous section.
EXECUTION OF THE ALGORITHM ON A SAMPLE QUERY, STEP BY STEP

Step 0: Original SPARQL query.

    SELECT ?city WHERE {
      ?xA <name> ?city .
      ?xA <type> "city" .
      ?xA <name> ?A .
      ?xB <type> "state" .
      ?xB <name> ?B .
      ?xB <name> "virginia" .
      ?xA <state> ?B .
    }

Step 1: Create a variable dictionary with relationships.

Step 2: Start from the selected variable (highlighted in green in the original figure).

Step 3: Continue with the next variable (?xA). Several relationships translate as an "and" operator in SEMPRE. The reverse relationship with the previous variable is always ignored.

Step 4: Next variables: ?B and ?A. ?A has no relationships left after ignoring its reverse relationship. Thus, it translates as a general variable, [] in SEMPRE.

Step 5: Last variable: ?xB. No variables left; the SEMPRE query is finished.

[Figures: variable dictionary and partial SEMPRE expression at each of Steps 1 to 5]
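The walkthrough can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: the names `build_var_dict`, `transcribe` and `sparql_to_lambda_dcs` are hypothetical, triple patterns are read with a regular expression instead of a SPARQL parser, constants are assumed to be double-quoted strings, the graph pattern is assumed to be free of cycles, and the special options of the third algorithm step (count, limit, union) are left out.

```python
import re

# One triple pattern: ?subject <property> (?variable | "constant") .
TRIPLE = re.compile(r'(\?\w+)\s+(<[^>]+>)\s+(\?\w+|"[^"]*")\s*\.')

QUERY = """
SELECT ?city WHERE {
  ?xA <name> ?city .
  ?xA <type> "city" .
  ?xA <name> ?A .
  ?xB <type> "state" .
  ?xB <name> ?B .
  ?xB <name> "virginia" .
  ?xA <state> ?B .
}
"""


def build_var_dict(sparql):
    """Map each variable to its list of (relationship, target) pairs."""
    var_dict = {}
    for subj, rel, obj in TRIPLE.findall(sparql):
        var_dict.setdefault(subj, []).append((rel, obj))
        if obj.startswith("?"):
            # Reverse relationship, marked with "!" as in SEMPRE.
            var_dict.setdefault(obj, []).append(("!" + rel, subj))
    return var_dict


def reverse(rel):
    """Flip the direction of a relationship: <r> <-> !<r>."""
    return rel[1:] if rel.startswith("!") else "!" + rel


def transcribe(var, var_dict, came_from=None, path=()):
    """Replace a variable by the expression built from its relationships."""
    if var in path:
        raise ValueError("cyclic graph patterns are not handled by this sketch")
    parts = []
    skipped = False
    for rel, target in var_dict.get(var, []):
        if not skipped and (rel, target) == came_from:
            skipped = True               # ignore the edge we arrived through
            continue
        if target.startswith("?"):
            inner = transcribe(target, var_dict, (reverse(rel), var), path + (var,))
        else:
            inner = "'" + target.strip('"') + "'"
        parts.append("(%s %s)" % (rel, inner))
    if not parts:
        return "[]"                      # no relationships left
    expr = parts[0]
    for part in parts[1:]:
        expr = "(and %s %s)" % (expr, part)
    return expr


def sparql_to_lambda_dcs(sparql):
    """Start at the selected variable and eliminate all variables."""
    selected = re.search(r"SELECT\s+(\?\w+)", sparql).group(1)
    return transcribe(selected, build_var_dict(sparql))


print(sparql_to_lambda_dcs(QUERY))
```

For the sample query, the dictionary entry for ?B holds both `!<name>` (towards ?xB) and `!<state>` (towards ?xA), which is what lets the traversal cross from ?xA to ?xB. Multiple relationships are combined into left-nested binary `and` expressions, matching the shape of the expression listed for Query 1 in the Appendix.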
Some clarifying comments on Steps 1 to 5 are in order. Once again we wish to identify all the variables in the query, but this time the variable dictionary also contains all the relationships of these variables with constants and with other variables. Note that, unlike in Prolog, ?xA and ?A are different variables in SPARQL. In our variable dictionary we also include the reverse relationships between variables, expressed in λ-DCS with an exclamation mark "!" in front of the relationship. For example, when we read the line

    ?xA <state> ?B .

we include both ?xA - <state> - ?B and ?B - !<state> - ?xA in the variable dictionary. This is essential when traversing variables. The variable dictionary created from our sample query is shown in Step 1.

In order to eliminate the SPARQL variables, we iterate the transcription of relationships as we traverse the variable dictionary. We start at the selected variable, retrieved from the query's first line

    SELECT ?city WHERE {

and transcribe its only property into λ-DCS format (Step 2). We now proceed to eliminate ?xA. We ignore the reverse relationship that joins ?xA with ?city (as this relationship has just been transcribed) and focus on all the other relationships: "type", "name" and "state" (Step 3). Because we have more than one relationship to transcribe from ?xA, we use the λ-DCS "and" operator to express the intersection of the sets expressed by these relationships.

We repeat this step until we have traversed all variables in the dictionary. In the particular case of variable ?A (Step 4), its only relationship is ignored because it was already expressed in the previous iteration. This leaves no relationships to transcribe, and thus ?A is changed to [], the λ-DCS operator that expresses an undefined variable. After all the iterations we arrive at the final λ-DCS expression (Step 5):

    (!<name> (and (and (<type> 'city') (<name> [])) (<state> (!<name> (and (<type> 'state') (<name> 'virginia'))))))

This is the equivalent query in λ-DCS that retrieves all the cities in the state of Virginia.
As can be noted, the Prolog and SPARQL variables have disappeared and only a set expression remains.

As a last word, recall that this SPARQL → λ-DCS algorithm is much more relevant, as it is intended to be universal. SPARQL grammar and relationships are not arbitrary in the way Prolog relationships are, so one would expect this algorithm to perform well regardless of the database and the SPARQL queries provided as input.

5 Results

Apart from the sample query that corresponds to "What are the cities in Virginia?", the Appendix in Section 9 provides more examples of queries transformed by the developed algorithms. In the Appendix the reader can appreciate the different nature of the query languages by comparing equivalent queries: Prolog as a propositional language with variables, SPARQL as a graph language, and λ-DCS as a variable-free, set-based language.

Note that for one of the queries (Query 4), an equivalent λ-DCS query could not be obtained with the described algorithms. In addition, the Prolog query and the SPARQL query for that utterance return different results. Flaws and limitations of the algorithms are discussed in the next section.

Overall, the algorithms were able to successfully transform about 90% of the 880 sample queries provided in the GeoQuery database. By "successfully transform" we mean that these queries have been correctly expressed in Prolog, SPARQL and λ-DCS, providing equivalent results when executed. The queries that could not be transformed into one of the languages, together with those whose transformation did not provide equivalent results, make up the remaining 10%. The coverage of the algorithms is thus quite satisfactory, especially considering that all basic queries can be successfully transformed. Operators beyond the basic ones, e.g. count, descending order, max and union/or, were also successfully interpreted by the algorithms.
The algorithms still admit refinement, improvement and extension, however, since some operators are not yet included or are problematic. These problems and limitations are outlined in the following section.

6 Encountered Problems and Limitations

The development of the algorithms described above did encounter some difficulties, and in some cases we were unable to successfully transform certain queries. Here we explain these difficulties and, where applicable, how they were overcome.
First we address some of the problems encountered when dealing with Prolog. Because Prolog relationships are defined arbitrarily, in some cases one could define a better relationship to ease the transformation to SPARQL. For example, instead of the relationships

    capital(A) + loc(A,B)

it would be much better to define the relationship

    capital(A,B)

which combines both and avoids having to create a statement

    ?xStateCapitalOf <capital> ?A .

that contains an undefined variable. In addition, some of the sample GeoQuery queries contain redundancy that then propagates as identical statements in SPARQL. As seen in Query 2 of the Appendix, the Prolog

    state(B),const(B,stateid(oregon))

could simply be expressed as

    const(B,stateid(oregon))

without redundancy. Finally, inconsistencies within the format of the Prolog GeoQuery database lead to problems when trying to test equivalent queries. The property <lowest elevation>, for example, is only defined for those states that do not border the sea. Those states which do border the sea are assumed to have a lowest elevation equal to zero; however, the absence of such a relationship leads to the inconsistencies between the Prolog and SPARQL queries shown in Query 4 of the Appendix, apart from requiring an OPTIONAL operator in SPARQL which, as we will address shortly, cannot be expressed in λ-DCS.

The main problem when dealing with SPARQL was the absence of a good SPARQL parser, which would greatly simplify interpreting nested blocks. The algorithm so far can only interpret a very simple nested UNION block, but not, for example, a query like this:

    SELECT ?A WHERE {
      { SELECT ?B WHERE { ... } }
      ...
    }

As a set-based and logical language, λ-DCS has a wider coverage than SPARQL, but only when a single set or a single variable is being queried. A serious limitation of λ-DCS is that the language is unable to query two variables or two sets simultaneously.
For example, a query to retrieve the name and surname of all employees in a company would look like this in SPARQL:

    SELECT ?name ?surname WHERE {
      ?person <name> ?name .
      ?person <surname> ?surname .
    }

This SPARQL query returns a table with two columns, one for the names and one for the corresponding surnames. However, because two variables are retrieved, it is impossible to express this in λ-DCS. λ-DCS could retrieve a list of the names of the employees and a list of the surnames of the employees, i.e. two separate lists, but not a single list of name-surname pairs, which would be the equivalent of the two-column answer from SPARQL. This is a big setback for transforming arbitrary SPARQL queries into equivalent λ-DCS expressions: a major strength of SPARQL is retrieving associated variables and properties in tables, so a large number of SPARQL queries retrieve more than one variable and are impossible to transform to λ-DCS.

Furthermore, the OPTIONAL operator in SPARQL cannot be expressed as a logical set expression, which means it cannot be transformed to λ-DCS either. The SPARQL OPTIONAL operator addresses the sparsity and irregularity of properties in RDF graph databases, allowing a query to match a relationship whether it exists or not. For example, a SPARQL query that retrieves the name of the lowest point of a state and its corresponding height, if it exists in the database, would be similar to this:

    SELECT ?lowpoint ?height WHERE {
      ?xA <type> 'highlow' .
      ?xA <lowest point> ?lowpoint .
      OPTIONAL { ?xA <lowest height> ?height . }
    }

The result would be a table with two columns: one for the name of the lowest point, and one for its corresponding height. If the <lowest height> property is not found, the name is still retrieved and its corresponding cell in the second column is left blank. This cannot be expressed in λ-DCS because of the OPTIONAL operator and, as explained above, because more than one variable is retrieved. It is true that the main utility of the OPTIONAL operator in SPARQL is usually tied to retrieving more than one variable, so these two limitations can generally be seen as one. It is nevertheless a considerable limitation to expressing SPARQL queries in λ-DCS, as multivariable queries and OPTIONAL operators are quite common in SPARQL. The only way to tackle this problem would be to develop an extension to λ-DCS that allows multiple logical sets to be retrieved simultaneously, and that allows some properties of these sets to be optional.

7 Future Work

As already stated throughout this paper, there is still considerable room for improvement in refining the proposed algorithms, either through extensions that cover more operators or through tools that better interpret the input queries. For example, if a good SPARQL parser were available, interpreting nested SPARQL blocks would become a much more feasible task.

Another important step is to test the proposed algorithms on other datasets and observe their performance. The Prolog → SPARQL algorithm does not generalize well, due to the arbitrary way in which Prolog relationships are defined; the SPARQL → λ-DCS algorithm, however, is designed to be universal. The latter should therefore be tested using sample SPARQL queries from databases like Freebase or QALD (see footnote 6) as input.

Finally, we eagerly await the deployment of the aforementioned QA project, which would make full use of the proposed algorithms. The success of that project would broaden the perspective on the utility of transforming queries between different query languages.
8 Conclusion

In this work we carried out the transformation between the Prolog, SPARQL and λ-DCS query languages. We found that it is a feasible task when dealing with queries that originate from natural language utterances or requests. We are satisfied that the proposed algorithms transform the large majority of basic queries successfully, and we consider it worthwhile to continue the work and refine the algorithms.

Furthermore, working on these transformations has given us a deeper understanding of the similarities and differences between the target query languages, and of how some adapt better to different tasks. As one could intuitively expect from the beginning, there are also concepts and queries that cannot be expressed in all languages, and thus a total-coverage transformation is impossible. However, this should not be a setback to performing these transformations where they are viable. Indeed, we hope to see the QA project reach the full potential of the transformations presented in this paper.

6 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=home&q=5
9 APPENDIX: Queries in different query languages

Utterance: What are the cities in Virginia?

Prolog:

    answer(A,(city(A),loc(A,B),const(B,stateid(virginia)))).

SPARQL:

    SELECT ?city WHERE {
      ?xA <name> ?city .
      ?xA <type> "city" .
      ?xA <name> ?A .
      ?xB <type> "state" .
      ?xB <name> ?B .
      ?xB <name> "virginia" .
      ?xA <state> ?B .
    }

λ-DCS:

    (!<name> (and (and (<type> 'city') (<name> [])) (<state> (!<name> (and (<type> 'state') (<name> 'virginia'))))))

Query 1 (used as example)

Utterance: What is the name of the highest point in Oregon?

Prolog:

    answer(A,highest(A,(place(A),loc(A,B),state(B),const(B,stateid(oregon))))).

SPARQL:

    SELECT ?A WHERE {
      ?xB <type> "state" .
      ?xB <name> ?B .
      ?xB <type> "state" .
      ?xB <name> ?B .
      ?xB <name> "oregon" .
      ?xA <highest point> ?A .
      ?xA <highest elevation> ?height0 .
      ?xA <state> ?B .
    }
    ORDER BY DESC(?height0)
    LIMIT 1

λ-DCS:

    (!<highest point> (argmax 1 1 (<state> (!<name> (and (and (<type> 'state') (<type> 'state')) (<name> (!<name> (and (and (<type> 'state') (<type> 'state')) (<name> 'oregon'))))))) <highest elevation>))

Query 2
Utterance: What is the capital of Texas?

Prolog:

    answer(A,(capital(A),loc(A,B),const(B,stateid(texas)))).

SPARQL:

    SELECT ?A WHERE {
      ?xB <type> "state" .
      ?xB <name> ?B .
      ?xB <name> "texas" .
      ?xStateCapitalOf <capital> ?A .
      ?xA <name> ?A .
      ?xA <state> ?B .
    }

λ-DCS:

    (and (!<capital> []) (!<name> (<state> (!<name> (and (<type> 'state') (<name> 'texas'))))))

Query 3

Utterance: How many states have a lower elevation than Arizona?

Prolog:

    answer(A,count(B,(state(B),low_point(B,C),lower(C,D),low_point(E,D),const(E,stateid(arizona))),A)).

SPARQL (does not provide results equivalent to the Prolog query!):

    SELECT (COUNT (?B) AS ?numberOFstate) WHERE {
      ?xB <type> "state" .
      ?xB <name> ?B .
      ?xC <state> ?B .
      ?xC <lowest point> ?C .
      ?xE <type> "state" .
      ?xE <name> ?E .
      ?xE <name> "arizona" .
      OPTIONAL {?xC <lowest elevation> ?height0 . }
      OPTIONAL {?xD <lowest elevation> ?height1 . }
      FILTER ( IF ( BOUND(?height1), ?height0, 0 ) < IF ( BOUND(?height1), ?height1, 0 ) )
      ?xD <state> ?E .
      ?xD <lowest point> ?D .
    }

λ-DCS: Not possible! (See Section 6: Encountered Problems and Limitations)

Query 4
Utterance: What is the name of the lakes in Michigan?

Prolog:

    answer(A,(lake(A),loc(A,B),const(B,stateid(michigan)))).

SPARQL:

    SELECT ?lake WHERE {
      ?xA <name> ?lake .
      ?xA <type> "lake" .
      ?xA <name> ?A .
      ?xB <type> "state" .
      ?xB <name> ?B .
      ?xB <name> "michigan" .
      ?xA <isin> ?B .
    }

λ-DCS:

    (!<name> (and (and (<type> 'lake') (<name> [])) (<isin> (!<name> (and (<type> 'state') (<name> 'michigan'))))))

Query 5

Utterance: How many rivers flow through Colorado?

Prolog:

    answer(A,count(B,(river(B),loc(B,C),const(C,stateid(colorado))),A)).

SPARQL:

    SELECT (COUNT (?B) AS ?numberOFriver) WHERE {
      ?xB <type> "river" .
      ?xB <name> ?B .
      ?xC <type> "state" .
      ?xC <name> ?C .
      ?xC <name> "colorado" .
      ?xB <flowsthru> ?C .
    }

λ-DCS:

    (count (!<name> (and (<type> 'river') (<flowsthru> (!<name> (and (<type> 'state') (<name> 'colorado')))))))

Query 6
Utterance: What are the names of the highest points of the states bordering Mississippi?

Prolog:

    answer(A,(high_point(B,A),state(B),next_to(B,C),const(C,stateid(mississippi)))).

SPARQL:

    SELECT ?A WHERE {
      ?xB <type> "state" .
      ?xB <name> ?B .
      ?xC <type> "state" .
      ?xC <name> ?C .
      ?xC <name> "mississippi" .
      ?x0 <state> ?B .
      ?x0 <highest point> ?A .
      ?x1 <state> ?C .
      ?x1 <borderingstate> ?B .
    }

λ-DCS:

    (!<highest point> (<state> (and (!<name> (<type> 'state')) (!<borderingstate> (<state> (!<name> (and (<type> 'state') (<name> 'mississippi'))))))))

Query 7

Utterance: Give me all the cities in the USA

Prolog:

    answer(A,(city(A),loc(A,B),const(B,countryid(usa)))).

SPARQL:

    SELECT ?A WHERE {
      ?xA <type> "city" .
      ?xA <name> ?A .
    }

λ-DCS:

    (!<name> (<type> 'city'))

Query 8
References

Uwe Aßmann, Andreas Bartho, and Christian Wende. 2010. Reasoning Web. Semantic Technologies for Software Engineering: 6th International Summer School 2010, Dresden, Germany, August 30-September 3, 2010. Tutorial Lectures, volume 6325. Springer Science & Business Media.

Henk P. Barendregt and Erik Barendsen. 1984. Introduction to lambda calculus. Nieuw Archief voor Wiskunde, 4(2):337–372.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP, pages 1533–1544.

Percy Liang. 2013. Lambda dependency-based compositional semantics. arXiv preprint arXiv:1309.4408.