Transformation between Query Languages
Miguel Cristian Greciano Raiskila
Tokyo National Institute of Informatics
Technische Universität Darmstadt
Universidad Politécnica de Madrid
mc.greciano@gmail.com
Yusuke Miyao
Tokyo National Institute of Informatics
Associate Professor
Master Thesis Tutor
yusuke@nii.ac.jp
Abstract
The Semantic Web, an extension of the Web that provides easier ways to retrieve data, has grown substantially in recent years. Internet users do not merely desire information; they desire better, quicker, easier and more efficient ways to access that information and find answers to their questions.
A clear example of the Semantic Web phenomenon is the popularity of the Freebase data set, a large collection of structured data harvested from many sources. Freebase runs on a graph-model database infrastructure created in-house by Metaweb. Because its data structure is non-hierarchical, Freebase is open for users to enter new objects and relationships into the underlying graph, a great advantage. Since 2008, Freebase has implemented RDF (Resource Description Framework), allowing it to be used as linked data and to be queried with languages such as SPARQL, which is itself quite popular. SPARQL users normally create their own SPARQL queries manually, or at least semi-manually. There are even hundreds of sample SPARQL queries on the web, paired with natural language utterances, that hint to users how to create or adapt their own queries. Ideally, however, the whole path from a natural language utterance to the execution of a query on the database would be automated.
In this paper we address a key step in the desired automation from natural language to query execution: transformation between query languages. Indeed, in the example of Freebase and SPARQL, natural language utterances are far in nature from RDF graphs, but closer in nature to semantic trees. If we could close the gap between trees and graphs by transforming between different query languages, we would draw nearer to our final goal of automation. In this paper we outline two algorithms that transform queries between different languages, the SPARQL → λ-DCS one being the more relevant. We also provide the necessary background on the motivation for and utility of these algorithms.
1 Introduction
This paper outlines the algorithms developed to transform queries from one query language into their equivalents in a different query language. It also provides a brief but comprehensive background on the query languages and tools chosen for this study, as well as the reasons why they were chosen.
An obvious initial question is what benefits transformation between query languages would provide. Equivalent queries produce equivalent results ("Where was Barack Obama born?" should return "Honolulu" no matter which query language is used). However, different languages express equivalent queries through different concepts, and they also differ in efficiency and answer-retrieval speed. Similarly, a database containing the books in a library and their associated data can be implemented in both Java and C, but the two implementations will differ because of the different nature of the two programming languages: Java will use objects to represent the data, whereas C will use other data structures and handle storage and memory differently.
Furthermore, transformation between query languages can be very useful when one query language is more suitable than another for certain natural language processing tasks. Some languages have a graph structure and others have a tree structure. Associating the syntactic trees that arise from parsing a natural language sentence with a query is intuitively easier if the query itself has a tree structure. For example, in "Semantic Parsing on Freebase from Question-Answer Pairs" (Berant et al., 2013), Jonathan Berant, Andrew Chou, Roy Frostig and Percy Liang suggested a way to map natural language utterances to queries on the Freebase data set: through a tree-structured query language they call Lambda Dependency-Based Compositional Semantics (from now on λ-DCS).
An ambitious project that our team is currently working on, the Question-Answering Project (from now on, the QA project), intends to create these Freebase queries automatically from natural language input. As mentioned in the abstract, we already possess a good number of utterance-query pairs. However, in order to train the interpreter we must be able to traverse the chain utterance → semantic tree → graph → query, and also the reverse.
The chosen query languages for the transforma-
tions are Prolog, SPARQL and λ-DCS. Section 3
contains a brief exposition of the syntax and pe-
culiarities of these languages. Section 4 outlines
the transformation algorithms between these lan-
guages. In section 5 we analyze the performance
of said algorithms. Section 6 explains the encoun-
tered problems and limitations to the algorithms
and the transformations, as well as some possible
solutions. Section 7 suggests future tasks to be
carried out from this project, and section 8 pro-
vides the conclusion to our work. Section 9 is an
Appendix containing some of the results of our
work: queries equivalently expressed in different
query languages.
As we will see, the λ-DCS → SPARQL transformation is already implicit in the toolkit that we use. The major contribution of this paper is the reverse transformation, SPARQL → λ-DCS. It is not a trivial task: λ-DCS → SPARQL transforms trees into graphs, whereas SPARQL → λ-DCS does the opposite, transforming graphs into trees. Trees are by definition a particular case of graphs and thus less expressive, so transforming graphs into trees should in theory pose a significant challenge. The SPARQL → λ-DCS transformation has not been proposed yet and is important for the QA project, which needs to traverse from semantic trees to graphs (λ-DCS → SPARQL) and also from graphs to semantic trees (SPARQL → λ-DCS) in order to execute training.
The chosen experimental database is GeoQuery (see footnote 1). It is a small database containing geographical information about all of the states in the US (cities, rivers, roads, highest and lowest points...). An extended geographical database with more data is included in Freebase; however, we chose to work with GeoQuery because it is simple, intuitive, small, available in various formats, and free and open source (Freebase is also free and open source, but other data sets are not, so this advantage should not be taken for granted). It is also relatively well known. One of the formats GeoQuery comes in is the Prolog format, and this is the main reason why we selected Prolog as one of the query languages in this research. GeoQuery comes with more than 880 functional Prolog queries as samples, all of them associated with natural language questions such as "What is the longest river in Colorado?". The samples are first-order-logic queries, i.e. they deal with quantities and sets, not propositions. Thus, we have good starting material for our objective of transforming functional queries in one language into their equivalents in another language. In this case, the algorithms are designed to transform first from Prolog to SPARQL, and then from SPARQL to λ-DCS. We want the algorithms to function properly on the GeoQuery database, in the hope that they generalize well when transforming queries on larger datasets like Freebase.
2 Related Work
Automated transformation between query languages is not a very common practice. Most of the time, queries are written manually for the purpose of retrieving the desired information. Consequently, related work is scarce. Similar attempts can still be found, though: the book "Reasoning Web" contains a subsection addressing transformation between SPARQL and GReQL (Aßmann et al., 2010).
As we will see in Section 3.3, the transformation λ-DCS → SPARQL is implicit in the SEMPRE toolkit; the reverse transformation SPARQL → λ-DCS, however, is not. One of the algorithms proposed in this paper attempts to execute this transformation. The other algorithm proposed is for the Prolog → SPARQL transformation, which, as far as the authors are aware, nobody else has attempted to develop. Thus, even though transformation between query languages is not a new concept, it is still rare, and this paper pioneers the transformations Prolog → SPARQL and SPARQL → λ-DCS.
1 http://www.cs.utexas.edu/users/ml/nldata/geoquery.html
3 Query Languages Overview
Here we do not intend to explain all the intricacies of the languages used. However, we wish to provide a simple background for each of them, along with some basic definitions, so that the reader can fully comprehend the algorithms developed in this work and the subsequent results of the research. We also provide references to more complete expositions and tutorials, should the reader wish to deepen their understanding of these query languages.
3.1 Prolog
Prolog is a general-purpose logic programming language with roots in first-order logic. Prolog is declarative: the program logic is expressed in terms of relations (or relationships), and a computation is initiated by running a query over them. The relationships are arbitrary, i.e., the author decides how to define them, and there is no set of universal relationships. These relationships connect Prolog variables with each other, as well as with constants. An easy tool to interpret and execute Prolog queries is SWI-Prolog. 2
2 http://www.swi-prolog.org/
Here is an overview of the GeoQuery database in Prolog format. The database entries have the following patterns:
# state(name, abbreviation, capital, population, area, state number, city1, city2, city3, city4)
# city(state, state abbreviation, name, population)
# river(name, length, [states through which it flows])
# border(state, state abbreviation, [states that border it])
# highlow(state, state abbreviation, highest point, highest elevation, lowest point, lowest elevation)
# mountain(state, state abbreviation, name, height)
# road(number, [states it passes through])
# lake(name, area, [states it is in])
Here are some instances of these patterns:
# state('arkansas', 'ar', 'little rock', 2286.0e+3, 53.2e+3, 25, 'little rock', 'fort smith', 'north little rock', 'pine bluff').
# state('california', 'ca', 'sacramento', 23.67e+6, 158.0e+3, 31, 'los angeles', 'san diego', 'san francisco', 'san jose').
# state('colorado', 'co', 'denver', 2889.0e+3, 104.0e+3, 38, 'denver', 'colorado springs', 'aurora', 'lakewood').
...
# river('mississippi', 3778, ['minnesota', 'wisconsin', 'iowa', 'illinois', 'missouri', 'kentucky', 'tennessee', 'arkansas', 'mississippi', 'louisiana', 'louisiana']).
# river('missouri', 3968, ['montana', 'north dakota', 'south dakota', 'iowa', 'nebraska', 'missouri', 'missouri']).
# river('colorado', 2333, ['colorado', 'utah', 'arizona', 'nevada', 'california']).
Apart from the database entries, which follow a known pattern, relationships have to be defined in order to be understood by the Prolog interpreter. Here are two examples of such definitions in Prolog:
# loc(cityid(City,St), stateid(State)) :- city(State,St,City,_).
# const(V,V).
The first example, the relation "loc", states that a city identified by cityid(City,St) is located in the state identified by stateid(State) whenever the database contains a city entry whose first, second and third properties are State, St and City (the fourth property, the population, is ignored). The second example, the relation "const", states that both of its arguments must be identical. "const" is useful for binding Prolog variables to constants.
And finally we present one of the Prolog query
samples contained in the GeoQuery database:
answer(A,(city(A),loc(A,B),
const(B,stateid(virginia)))).
This query retrieves all cities in the state of Virginia. As we can see, the query asks for A to be returned, where A is a city and A is located in B, which corresponds to the state whose constant ID is Virginia. We will use this query as the example input for the algorithms described in section 4.
3.2 SPARQL
SPARQL ("SPARQL Protocol and RDF Query Language") is an RDF query language. In other words, it is a semantic query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format. It is recognized as one of the key technologies of the Semantic Web, and it is an official W3C Recommendation. A very instructive SPARQL tutorial can be found on the Apache Jena homepage. 3 We suggest executing SPARQL queries either with the Apache Jena framework or with a Virtuoso server. 4 In this subsection we explain only the very basics of SPARQL; please refer to the aforementioned tutorial to actually learn the language.
3 http://jena.apache.org/tutorials/sparql.html
4 http://kidehen.typepad.com/kingsley_idehens_typepad/
SPARQL is a very popular language for querying RDF graphs. Important datasets like Freebase are stored in RDF format. RDF graphs basically consist of a set of triples or statements, i.e. patterns like Subject <Verb> Object or Entity1 <Relationship> Entity2. Here is an example of such an RDF graph:
<state25> <type> 'state' .
<state25> <name> 'mississippi' .
...
<river1> <type> 'river' .
<river1> <name> 'mississippi' .
SPARQL matches triple templates against the RDF graph and returns the matches that fit the template. The templates are expressed with SPARQL variables, which can be recognized because they start with a "?" symbol. Here are two examples of SPARQL queries:
SELECT ?x WHERE {
?x <name> 'mississippi' .
}
SELECT ?x WHERE {
?x <type> 'state' .
?x <name> 'mississippi' .
}
As we can see, ?x is a SPARQL variable, and in both cases it tries to match the elements that appear on the left of RDF triples. It is also the variable selected with the SELECT operator; it is thus the variable being queried, and its values are returned as the answer to the query. The WHERE block indicates the patterns that the graph must match. All patterns in the WHERE block must be matched in order for the entity on the left to be bound to ?x. Thus, if we execute both queries on the RDF example above, the first query will return <state25> and <river1> as answers, but the second query will only return <state25>, because only <state25> matches both ?x <type> 'state' and ?x <name> 'mississippi', whereas <river1> only matches the latter.
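The matching behaviour of the two queries can be illustrated with a minimal Python sketch. It is only a toy evaluator for a WHERE block of plain triple patterns, not how Jena or Virtuoso work internally; the helper names `extend` and `select` are hypothetical and the graph is the four-triple example from this section.

```python
# Toy RDF graph from the example: (subject, predicate, object) triples.
GRAPH = [
    ("state25", "type", "state"),
    ("state25", "name", "mississippi"),
    ("river1", "type", "river"),
    ("river1", "name", "mississippi"),
]


def is_var(term):
    """SPARQL variables start with '?'."""
    return term.startswith("?")


def extend(binding, pattern, triple):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def select(var, patterns, graph=GRAPH):
    """Return the values of `var` for which every pattern matches."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in graph:
                new = extend(binding, pattern, triple)
                if new is not None:
                    extended.append(new)
        bindings = extended
    return sorted({b[var] for b in bindings})


print(select("?x", [("?x", "name", "mississippi")]))
print(select("?x", [("?x", "type", "state"), ("?x", "name", "mississippi")]))
```

The first call prints both entities named 'mississippi', while the second prints only the state, because every pattern in the WHERE block narrows the set of surviving bindings.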
3.3 λ-DCS
Lambda dependency-based compositional semantics (λ-DCS) is a new formal language for representing logical forms in semantic parsing, developed by Percy Liang (Liang, 2013). It attempts to express logical forms in a simpler way than Lambda Calculus. By eliminating variables and making existential quantification implicit, λ-DCS logical forms are generally more compact than those in Lambda Calculus. Compared to the graph structure of SPARQL, the tree structure of λ-DCS, as well as its absence of variables, should be very helpful when attempting to associate the trees produced by parsing natural language utterances with the database queries that will retrieve the answer.
To give an insight into how λ-DCS is notationally simpler than lambda calculus, compare the following expressions:
- Natural language utterance: "people who have lived in Seattle"
- Logical form (lambda calculus): λx.∃e.PlacesLived(x,e) ∧ Location(e,Seattle)
- Logical form (λ-DCS): PlacesLived.Location.Seattle
- SEMPRE notation: (!<name> (and (<type> 'people') (<livedin> (and (<type> 'state') (<name> 'seattle')))))
All express the same concept; however, λ-DCS lacks variables and thus has a much simpler expression than Lambda Calculus. If the reader is interested in a deeper understanding of Lambda Calculus, we recommend Barendregt's "Introduction to lambda calculus" (Barendregt and Barendsen, 1984). SEMPRE is a toolkit for training semantic parsers, which map natural language utterances to denotations (answers) via intermediate logical forms. It is the toolkit Percy Liang developed in order to execute λ-DCS queries. 5 We also used the SEMPRE toolkit in this work, and when in this report we refer to λ-DCS queries, we are actually referring to their SEMPRE notation, not their logical form. The SEMPRE query above can be read as: "return the name of all entities of type 'people' that have lived in the entities of type 'state' and name 'seattle'". The logical group nature of this language is thus clearly manifest, with operations such as intersection (and) and union (or) being used.
5 http://nlp.stanford.edu/software/sempre/
When executing λ-DCS queries, the SEMPRE toolkit automatically transforms them into an equivalent SPARQL query, which is then executed on a Virtuoso server. Thus the λ-DCS → SPARQL transformation is implicit within the toolkit. In this paper we attempt to perform the opposite transformation, SPARQL → λ-DCS, which can be extremely useful for the aforementioned QA project.
4 Transformation algorithms
In this section we will describe the transformation
algorithms step by step. Because the algorithms
are much easier to understand given a specific ex-
ample, we will use a simple sample query asso-
ciated to the question ”What are the cities in Vir-
ginia?”
4.1 Prolog → SPARQL Algorithm
The first thing to note is that the GeoQuery database is not provided in RDF format. Thus, we first need to transform the database to RDF format so that SPARQL and λ-DCS queries can be executed on it. There are many straightforward ways to do this transformation. In our case we chose to transform each entry into a geographical entity with associated properties, with all of the entries constituting different and independent graphs (no linking between graphs). For example, this entry:
# state('arizona','az','phoenix',2718.0e+3,114.0e+3,48,'phoenix','tucson','mesa','tempe').
transforms to:
<state3> <type> 'state' .
<state3> <name> 'arizona' .
<state3> <abbreviation> 'az' .
<state3> <capital> 'phoenix' .
<state3> <population> 2718.0e+3 .
<state3> <area> 114.0e+3 .
<state3> <state number> 48 .
<state3> <city1> 'phoenix' .
<state3> <city2> 'tucson' .
<state3> <city3> 'mesa' .
<state3> <city4> 'tempe' .
Note that an extra property, <type>, is added to state explicitly that the geographical entity <state3> is a state. Prolog identifies directly that the entry in the database is a state; the RDF format, however, does not.
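The entry-to-triples conversion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `entry_to_triples` and `format_triple` are hypothetical, the property names are taken from the state pattern in Section 3.1, and the entity identifier (`state3`) is passed in by hand rather than generated.

```python
# Property names for a state entry, in the order of the Prolog pattern.
STATE_FIELDS = ["name", "abbreviation", "capital", "population", "area",
                "state number", "city1", "city2", "city3", "city4"]


def entry_to_triples(entity_id, entity_type, fields, values):
    """Turn one database entry into (subject, property, value) triples."""
    # The extra <type> property records what Prolog encodes in the functor.
    triples = [(entity_id, "type", entity_type)]
    triples += [(entity_id, field, value) for field, value in zip(fields, values)]
    return triples


def format_triple(subject, prop, value):
    """Render a triple in the notation used in this paper."""
    obj = f"'{value}'" if isinstance(value, str) else repr(value)
    return f"<{subject}> <{prop}> {obj} ."


arizona = ("arizona", "az", "phoenix", 2718.0e+3, 114.0e+3, 48,
           "phoenix", "tucson", "mesa", "tempe")
triples = entry_to_triples("state3", "state", STATE_FIELDS, arizona)
for triple in triples:
    print(format_triple(*triple))
```

Numeric values are printed with Python's default formatting, so the population appears as 2718000.0 rather than 2718.0e+3; the value itself is the same.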
Now that the GeoQuery database is also in RDF
format, we can attempt to transform the sample
Prolog queries into SPARQL queries. As stated
before, the Prolog relationships are arbitrarily se-
lected by the GeoQuery database, and thus this
Prolog → SPARQL transformation will not be
universal, but specific to this particular case of
the GeoQuery database. Other Prolog relation-
ships would require other transformations. The
SPARQL → λ-DCS transformation that will be
proposed afterwards, however, is indeed intended
to be universal. We will first enumerate the steps
in abstract, and then explain how the algorithm ex-
ecutes on an example Prolog query.
1. Use the NLTK toolkit to create a tree from
the Prolog query
2. Identify the Prolog variables (single capital
letters in the leaves of the tree) and store them
in an empty variables dictionary
3. Identify the type of the Prolog variables and
store the type in the variables dictionary
3.1. Look for single-leaf nodes
3.2. Look for two-leaf ”const” labeled nodes
3.3. Leave the non-type-informing nodes for
the next step
4. With the type of the variables, indicate the
relationships between the Prolog variables in
SPARQL format. This step must interpret the
nodes of the tree that were not interpreted in
Step 3.
5. Organize all collected information in a cor-
rect RDF graph form. In this work we opted
to have a string that concatenated the infor-
mation progressively as the algorithm was
executed.
Now we will see the algorithm executed on an
example. Given the sample query in Prolog that
retrieves all the cities in the state of Virginia:
answer(A,(city(A),loc(A,B),
const(B,stateid(virginia)))).
the first step is to parse the query and create an NLTK tree from it. This is what the tree looks like, written in bracketed form: (Step 1)
(answer A (goals (city A) (loc A B) (const B (stateid virginia))))
Note that the label "goals" has been assigned to a node that is unnamed in the original Prolog query. The name is arbitrary and is simply there so that all nodes in the NLTK tree are labeled. The next step is to identify the Prolog variables (Step 2) and their types (Step 3), information required to interpret the other Prolog relationships. Variables are single capital letters located in the leaves of the tree. We can obtain the type of a variable either from the label of a single-leaf subtree (e.g. "city(A)" tells us A is a city) or from a two-leaf subtree with the label "const" (e.g. const(B,stateid(virginia)) tells us B is a state, and its name is Virginia). We can thus now create a variable dictionary containing all the variables and their corresponding types:
varDict = {'A': 'city', 'B': 'state'}
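Steps 1 to 3 can be sketched in plain Python as follows. This is not the implementation used in this work, which relies on NLTK trees; here the tree is a nested tuple `(label, children)` produced by a small hand-written parser, and the names `parse_prolog` and `build_var_dict` are hypothetical. Stripping the "id" suffix from "stateid" to obtain the type "state" is an assumption made for the example.

```python
import re


def parse_prolog(query):
    """Parse a Prolog term into nested (label, children) tuples (Step 1)."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[(),]", query)
    pos = 0

    def parse_term():
        nonlocal pos
        if tokens[pos] == "(":
            label = "goals"  # arbitrary label for the unnamed conjunction
        else:
            label = tokens[pos]
            pos += 1
            if pos >= len(tokens) or tokens[pos] != "(":
                return label  # a leaf: variable or constant
        pos += 1  # consume "("
        children = [parse_term()]
        while tokens[pos] == ",":
            pos += 1
            children.append(parse_term())
        pos += 1  # consume ")"
        return (label, children)

    return parse_term()


def is_variable(leaf):
    """Prolog variables are single capital letters (Step 2)."""
    return isinstance(leaf, str) and len(leaf) == 1 and leaf.isupper()


def build_var_dict(tree):
    """Collect the variables and infer their types (Step 3)."""
    var_dict = {}

    def visit(node):
        if isinstance(node, str):
            if is_variable(node):
                var_dict.setdefault(node, None)
            return
        label, children = node
        if len(children) == 1 and is_variable(children[0]):
            var_dict[children[0]] = label  # e.g. city(A)
        elif (label == "const" and len(children) == 2
              and is_variable(children[0]) and isinstance(children[1], tuple)):
            kind = children[1][0]  # e.g. "stateid"
            var_dict[children[0]] = kind[:-2] if kind.endswith("id") else kind
        for child in children:
            visit(child)

    visit(tree)
    return var_dict


tree = parse_prolog("answer(A,(city(A),loc(A,B),const(B,stateid(virginia)))).")
print(build_var_dict(tree))
```

Nodes such as loc(A,B) carry no type information and are simply skipped here, which corresponds to Step 3.3.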
Finally, we are now able to interpret the other subtrees or Prolog relationships, such as loc(A,B), which states that city A is located in state B. (Step 4) We would be unable to infer this without the types of A and B ("loc" could also refer to a river A located in state B, which corresponds to a relationship with a different name), so these subtrees can only be interpreted at this stage. Now that all information about the Prolog relationships has been retrieved, it can be expressed as a SPARQL query: (Step 5)
SELECT ?city WHERE {
?xA <name> ?city .
?xA <type> "city" .
?xA <name> ?A .
?xB <type> "state" .
?xB <name> ?B .
?xB <name> "virginia" .
?xA <state> ?B .
}

This is the equivalent query in SPARQL that retrieves all the cities in the state of Virginia. Prolog variables are written with an extra "x" in front of them when they appear on the left of a SPARQL statement, because there they reference geographical entities. On the right they reference string names, and thus need to be differentiated. The prefix also helps clarity. Further post-processing of this SPARQL query is possible, for example condensing these four statements
?xB <type> "state" .
?xB <name> ?B .
?xB <name> "virginia" .
?xA <state> ?B .
into just this one:
?xA <state> "virginia" .
However, the SPARQL query that the algorithm provides expresses all the information given in the original Prolog query, and thus we chose to leave it as it is.
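Steps 4 and 5 can be sketched as follows, starting from the variable dictionary and the list of goals of the sample query. This is a simplified illustration, not the implementation used in this work: the function name `goals_to_sparql` and the lookup table `RELATIONS`, which maps a Prolog relation together with its argument types to an RDF property, are hypothetical, and only the single entry needed for the sample query is filled in.

```python
# Hypothetical lookup table: (Prolog relation, argument types) -> RDF property.
RELATIONS = {("loc", ("city", "state")): "state"}

# Goals of the sample query, as (label, arguments) pairs.
GOALS = [
    ("city", ["A"]),
    ("loc", ["A", "B"]),
    ("const", ["B", ("stateid", ["virginia"])]),
]


def goals_to_sparql(answer_var, goals, var_dict, relations=RELATIONS):
    """Emit a SPARQL query string from typed Prolog goals."""
    answer_type = var_dict[answer_var]
    lines = [f"?x{answer_var} <name> ?{answer_type} ."]
    # One <type> and one <name> statement per variable, plus a constant
    # name when a "const" goal fixes the variable.
    for var, var_type in var_dict.items():
        lines.append(f'?x{var} <type> "{var_type}" .')
        lines.append(f"?x{var} <name> ?{var} .")
        for label, args in goals:
            if label == "const" and args[0] == var:
                constant = args[1][1][0]
                lines.append(f'?x{var} <name> "{constant}" .')
    # Step 4: relations between variables need the types to be resolved.
    for label, args in goals:
        if label == "const" or len(args) < 2:
            continue
        arg_types = tuple(var_dict[a] for a in args)
        prop = relations[(label, arg_types)]
        lines.append(f"?x{args[0]} <{prop}> ?{args[1]} .")
    # Step 5: concatenate everything into one query string.
    return "SELECT ?%s WHERE {\n%s\n}" % (answer_type, "\n".join(lines))


query = goals_to_sparql("A", GOALS, {"A": "city", "B": "state"})
print(query)
```

Keying the table on the argument types as well as the relation name reflects the observation that "loc" maps to different properties depending on what is being located.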
4.2 SPARQL → λ-DCS Algorithm
From the SPARQL query obtained with the previous algorithm we will now attempt to create an equivalent λ-DCS query. We first present the steps of the algorithm in the abstract, and then apply the algorithm to the sample query to see the result we eventually arrive at.
1. Parse and interpret every line in the SPARQL
query, and create a variable dictionary
1.1. Identify the variables (they start with
”?”)
1.2. Assign relationships between variables
and/or constants
1.3. Add the reverse relationships (starting
with ”!”) in the target-variables
2. Traverse the variable dictionary to eliminate
the SPARQL variables
2.1. Start with the selected variable
2.2. Transcribe its relationships
2.3. 0 relationships → [], more than 1 relationship → "and" operator
2.4. Select the next SPARQL variable to traverse, and repeat this step until all variables have been traversed
3. Add special options ("Count", "Limit", "Or"...) where appropriate
Next, this algorithm is applied iteration by iteration to the sample query from the previous section.

EXECUTION OF THE ALGORITHM ON A SAMPLE QUERY, STEP BY STEP
Step 0: Original SPARQL query:
SELECT ?city WHERE {
?xA <name> ?city .
?xA <type> "city" .
?xA <name> ?A .
?xB <type> "state" .
?xB <name> ?B .
?xB <name> "virginia" .
?xA <state> ?B .
}
Step 1: Create a variable dictionary with relationships.
Step 2: Start from the selected variable (?city).
Step 3: Continue with the next variable (?xA). Several relationships translate as an "and" operator in SEMPRE. The reverse relationship with the previous variable is always ignored.
Step 4: Next variables: ?B and ?A. ?A has no relationships left after ignoring its reverse relationship. Thus, it translates as a general variable, [] in SEMPRE.
Step 5: Last variable: ?xB. No variables left, so the SEMPRE query is finished.


We now add some clarifying comments on the steps above. Once again we wish to identify all the variables in the query, but this time the variable dictionary will also contain all the relationships of these variables with constants and other variables. Note that, unlike in Prolog, ?xA and ?A are different variables in SPARQL. In our variable dictionary we will also include the reverse relationships between variables, expressed in λ-DCS with an exclamation mark "!" in front of the relationship. For example, when we read the line ?xA <state> ?B . we will include both ?xA - <state> - ?B and ?B - !<state> - ?xA in the variable dictionary. This will be essential when traversing variables. The variable dictionary created from our sample query is in Step 1.
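Building the variable dictionary can be sketched as follows. This is a minimal illustration that assumes one triple per line, as in the sample query, and does not replace a real SPARQL parser; the names `TRIPLE` and `build_relation_dict` are hypothetical.

```python
import re

SAMPLE_QUERY = """SELECT ?city WHERE {
?xA <name> ?city .
?xA <type> "city" .
?xA <name> ?A .
?xB <type> "state" .
?xB <name> ?B .
?xB <name> "virginia" .
?xA <state> ?B .
}"""

# One statement per line: subject, <property>, object, final dot.
TRIPLE = re.compile(r"^\s*(\S+)\s+<([^>]+)>\s+(.+?)\s*\.\s*$")


def build_relation_dict(sparql):
    """Return the selected variable and a dict: variable -> [(relation, target)]."""
    selected = re.search(r"SELECT\s+(\?\w+)", sparql).group(1)
    relations = {}
    for line in sparql.splitlines():
        match = TRIPLE.match(line)
        if not match:
            continue  # SELECT line, closing brace, blank lines
        subject, prop, obj = match.groups()
        relations.setdefault(subject, []).append((f"<{prop}>", obj))
        if obj.startswith("?"):
            # Reverse relationship, marked with "!" as in lambda-DCS.
            relations.setdefault(obj, []).append((f"!<{prop}>", subject))
    return selected, relations


selected, relations = build_relation_dict(SAMPLE_QUERY)
for variable, edges in relations.items():
    print(variable, edges)
```

Reverse relationships are only stored when the object is a variable, since constants are never traversed.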
In order to eliminate the SPARQL variables we
will iterate the transcription of relationships as we
traverse the variable dictionary. We start at the se-
lected variable, retrieved from the query’s first line
SELECT ?city WHERE { and transcribe its
only property into λ-DCS format. (Step 2)
We now proceed to eliminate ?xA. We ignore
the reverse relationship that joins ?xA with ?city
(as this relationship has just been transcribed)
and focus on all the other relationships - ”type”,
”name” and ”state”. (Step 3) Because we have
more than one relationship to transcribe from ?xA,
we use the λ-DCS ”and” operator to express the
intersection of the groups expressed by these rela-
tionships.
We repeat this step until we have traversed all
variables in the dictionary. In the particular case
of variable ?A, (Step 4) its only relationship is
ignored because it was already expressed in the
previous iteration. This leaves no relationships to
transcribe and thus ?A is changed to [], the λ-DCS
operator that expresses an undefined variable. Af-
ter all the iterations we arrive at the final λ-DCS
expression: (Step 5)
This is the equivalent query in λ-DCS that re-
trieves all the cities in the state of Virginia. As can
be noted, Prolog and SPARQL variables have dis-
appeared and only a mathematical group expres-
sion remains.
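The traversal that eliminates the SPARQL variables can be sketched recursively. This is an illustration of the steps as described in this section, not the implementation used in this work and not verified against the SEMPRE grammar: the function names are hypothetical, the "and" operator is emitted with as many arguments as there are relationships, and raising an error on cyclic patterns is a design choice of the sketch. The dictionary literal is the one obtained in Step 1 for the sample query.

```python
# Variable dictionary for the sample query, including reverse relationships.
RELATIONS = {
    "?city": [("!<name>", "?xA")],
    "?xA": [("<name>", "?city"), ("<type>", '"city"'),
            ("<name>", "?A"), ("<state>", "?B")],
    "?A": [("!<name>", "?xA")],
    "?xB": [("<type>", '"state"'), ("<name>", "?B"), ("<name>", '"virginia"')],
    "?B": [("!<name>", "?xB"), ("!<state>", "?xA")],
}


def invert(relation):
    """<p> becomes !<p> and vice versa."""
    return relation[1:] if relation.startswith("!") else "!" + relation


def to_lambda_dcs(var, relations, came_from=None, seen=frozenset()):
    """Transcribe the relationships of `var`, eliminating the variable."""
    if var in seen:
        raise ValueError(f"cyclic pattern at {var}: not expressible as a tree")
    edges = list(relations.get(var, []))
    if came_from is not None:
        edges.remove(came_from)  # ignore the reverse of the edge just used
    parts = []
    for relation, target in edges:
        if target.startswith("?"):
            inner = to_lambda_dcs(target, relations,
                                  (invert(relation), var), seen | {var})
        else:
            inner = target  # constant
        parts.append(f"({relation} {inner})")
    if not parts:
        return "[]"  # no relationships left: general variable
    if len(parts) == 1:
        return parts[0]
    return "(and " + " ".join(parts) + ")"


result = to_lambda_dcs("?city", RELATIONS)
print(result)
```

For the sample query the sketch prints (!<name> (and (<type> "city") (<name> []) (<state> (!<name> (and (<type> "state") (<name> "virginia")))))), which follows the order of the iterations in Steps 2 to 5; the exact formatting produced by the original implementation may differ.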
As a last word, recall that this SPARQL to λ-
DCS algorithm is much more relevant, as it is in-
tended to be universal. SPARQL grammar and re-
lationships are not arbitrary like Prolog relation-
ships, thus one would expect this algorithm to per-
form well no matter the database and SPARQL
queries that are provided as input.
5 Results
Apart from the sample query that corresponds to
”What are the cities in Virginia?”, the Appendix
in Section 9 provides more examples of queries
transformed by the developed algorithms. In the
Appendix the reader can understand the different
nature of the different query languages by com-
paring equivalent queries, and appreciate Prolog
as a variable propositional language, SPARQL as
a graph language and λ-DCS as a variable-less
mathematical group language.
Note how for one of the queries, an equivalent
λ-DCS query was not possible to obtain with the
described algorithms. Besides, the Prolog query
and the SPARQL query there return different re-
sults. Flaws and limitations of the algorithms are
discussed in the next section.
Overall, the algorithms were able to successfully transform about 90% of the 880 sample queries provided in the GeoQuery database. By "successfully transform" we mean that these queries have been correctly expressed in Prolog, SPARQL and λ-DCS, providing equivalent results when executed. The remaining 10% consists of the queries that could not be transformed into one of the languages, together with those whose transformation did not provide equivalent results. The coverage of the algorithms is thus satisfactory, especially considering that all basic queries can be successfully transformed. Operators beyond the basic ones, e.g. count, descending order, max and union/or, were also successfully interpreted by the algorithms. The algorithms still admit refinement, improvement and extension, however, since some operators are not yet included or are problematic. These problems and limitations are outlined in the following section.
6 Encountered Problems and
Limitations
The development of the algorithms described above did encounter some difficulties, and in some cases we were unable to transform certain queries successfully. Here we describe these difficulties and, where applicable, how they were overcome.
First we address some of the problems encountered when dealing with Prolog. Because Prolog relationships are defined arbitrarily, in some cases one could define a better relationship to ease the transformation to SPARQL. For example, instead of the relationships capital(A) + loc(A,B) it would be much better to define the relationship capital(A,B), which combines both and avoids having to create a statement ?xStateCapitalOf <capital> ?A . that contains an undefined variable. In addition, some of the sample GeoQuery queries contain redundancy, which then propagates as identical statements in SPARQL. As seen in Query 2 of the Appendix, the Prolog state(B),const(B,stateid(oregon)) could simply be expressed as const(B,stateid(oregon)), without redundancy. Finally, inconsistencies within the format of the Prolog GeoQuery database lead to problems when trying to test equivalent queries. The property <lowest elevation>, for example, is only defined for those states that do not border the sea. Those states which do border the sea are assumed to have a lowest elevation equal to zero; however, the absence of such a relationship leads to the inconsistencies between Prolog and SPARQL queries expressed in query Y of Table X, and also requires an OPTIONAL operator in SPARQL which, as we will address shortly, cannot be expressed in λ-DCS.
The main problem when dealing with SPARQL was the absence of a good SPARQL parser, which would greatly simplify interpreting nested blocks. So far the algorithm can only interpret a very simple nested UNION block, but not, for example, a query with a nested subquery like this:
SELECT ?A WHERE {
{
SELECT ?B WHERE {
...
}
}
...
}
As a mathematical group and logical language,
λ-DCS has a wider coverage than SPARQL, but
only when one group or one variable is being
queried. A serious limitation of λ-DCS is that the
language is unable to query two variables or two
groups simultaneously. For example, a query to
retrieve the name and surname of all employees in
a company would look like this in SPARQL:
SELECT ?name ?surname WHERE {
?person <name> ?name .
?person <surname> ?surname .
}
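The difference between a two-column answer and two separate lists can be made concrete with a small Python sketch. The data and the function names are hypothetical and serve only to illustrate the point; they do not come from any dataset used in this work.

```python
# Hypothetical toy data: two employees, each with a name and a surname.
TRIPLES = [
    ("p1", "name", "Ada"), ("p1", "surname", "Lovelace"),
    ("p2", "name", "Alan"), ("p2", "surname", "Turing"),
]


def select_pairs(triples):
    """Two selected variables: each row keeps name and surname together."""
    rows = []
    for person, prop, name in triples:
        if prop != "name":
            continue
        for other, other_prop, surname in triples:
            if other == person and other_prop == "surname":
                rows.append((name, surname))
    return rows


def select_single(triples, prop):
    """One selected variable: a flat list of values."""
    return [value for _, p, value in triples if p == prop]


print(select_pairs(TRIPLES))
print(select_single(TRIPLES, "name"), select_single(TRIPLES, "surname"))
```

The two flat lists contain the same values as the table, but the information about which surname belongs to which name is carried only by the shared ?person binding, which a single-variable answer cannot return.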
This SPARQL query will return a table of two columns, one for the names and one for the corresponding surnames. However, because two variables are to be retrieved, it is impossible to express this in λ-DCS. λ-DCS could retrieve a list of the names of the employees and a list of the surnames of the employees, i.e. two separate lists, but not a single list of name-surname pairs, which would be the equivalent of the two-column answer from SPARQL. This is a big setback for transforming arbitrary SPARQL queries into equivalent λ-DCS expressions: a great strength of SPARQL is retrieving associated variables and properties in tables, so a large number of SPARQL queries retrieve more than one variable and are impossible to transform to λ-DCS. Furthermore, the OPTIONAL operator in SPARQL cannot be expressed as a logical mathematical group, which means it cannot be transformed to λ-DCS either. The SPARQL OPTIONAL operator addresses the sparsity and irregularity of properties in RDF graph databases, allowing a query to match a relationship whether it exists or not. For example, a SPARQL query that retrieves the name of the lowest point of a state and its corresponding height, if it exists in the database, would be similar to this:
SELECT ?lowpoint ?height WHERE {
?xA <type> 'highlow' .
?xA <lowest point> ?lowpoint .
OPTIONAL { ?xA <lowest height> ?height . }
}
The result would be a table with two columns:
one for the name of the lowest point and
one for its corresponding height. If the
<lowest height> property is not found, the
name is still retrieved and the corresponding
cell in the second column is left blank.
This cannot be expressed in λ-DCS because of the
OPTIONAL operator and, as explained above,
because more than one variable is retrieved.
Since the main use of OPTIONAL in SPARQL is
usually tied to retrieving more than one variable,
these two limitations can generally be seen as
one. It remains a considerable limitation to
expressing SPARQL queries in λ-DCS, as
multivariable queries and OPTIONAL operators
are quite common in SPARQL. The only
way to tackle this problem would be to develop an
extension to λ-DCS that allows multiple logical
sets to be retrieved simultaneously and allows
some properties of those sets to be optional.
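The two limitations can be made concrete with a small Python sketch over toy data. It is only an illustration of the argument in this section, not part of the proposed algorithms: the triples, entity identifiers and function names are all hypothetical. A single set-valued answer loses the pairing between name and surname, whereas a two-variable SELECT keeps it, and OPTIONAL additionally keeps rows whose second property is missing.

```python
# Toy triples; all identifiers and values are hypothetical.
TRIPLES = [
    ("e1", "name", "Ada"), ("e1", "surname", "Lovelace"),
    ("e2", "name", "Alan"), ("e2", "surname", "Turing"),
    ("e3", "name", "Grace"),  # no surname stored for this entity
]


def objects(subject, prop):
    """All values of <prop> for one subject."""
    return [o for s, p, o in TRIPLES if s == subject and p == prop]


def subjects_with(prop):
    """Subjects that have at least one <prop> triple, in a stable order."""
    return sorted({s for s, p, _ in TRIPLES if p == prop})


def unary(prop):
    """A single set-valued answer: the values of <prop>, pairing lost."""
    return {o for _, p, o in TRIPLES if p == prop}


def select_pairs(p1, p2):
    """Two-variable SELECT: a row exists only if both patterns match."""
    return [(a, b) for s in subjects_with(p1)
            for a in objects(s, p1) for b in objects(s, p2)]


def select_optional(p1, p2):
    """Same query with the second pattern in OPTIONAL (None = unbound)."""
    return [(a, b) for s in subjects_with(p1)
            for a in objects(s, p1) for b in (objects(s, p2) or [None])]


if __name__ == "__main__":
    print(unary("name"), unary("surname"))
    print(select_pairs("name", "surname"))
    print(select_optional("name", "surname"))
```

In this toy setting, the two sets returned by unary cannot be recombined into the rows of select_pairs without extra information, which is the gap an extension to λ-DCS would have to close.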
7 Future Work
As already stated throughout this paper, there is
still considerable room for improvement in refining
the proposed algorithms, either by extending them
to cover more operators or by using tools to better
interpret the input queries. For example, if a good
SPARQL parser were developed, interpreting
nested SPARQL blocks would become a much
more feasible task.
Another important step is to test the proposed
algorithms on other data sets and observe
their performance. The Prolog → SPARQL
algorithm does not generalize well due
to Prolog's arbitrary declarative nature; the
SPARQL → λ-DCS algorithm, however, is designed
to be universal. The latter should therefore
be tested using sample SPARQL queries from
databases like Freebase or QALD 6 as input.
Finally, we eagerly await the deployment of the
aforementioned QA project, which would make full
use of the proposed algorithms. The success of
said project would demonstrate the broader
utility of transforming queries between different
query languages.
8 Conclusion
In this work we carried out the transformation
between the Prolog, SPARQL and λ-DCS query
languages. We found that it is a feasible task
when dealing with queries that originate from
natural language utterances or requests. We are
satisfied with how the proposed algorithms
transform the large majority of basic queries
successfully, and we consider it worthwhile to
continue the work and refine the algorithms.
Furthermore, working on these transformations
has brought us a deeper understanding of the
similarities and differences of the target query
languages, and of how some adapt better to
different tasks. As one could intuitively expect
from the beginning, there are also concepts and
queries that cannot be expressed in all languages,
and thus a total-coverage transformation is
impossible. However, this should not be a setback
to performing said transformations where they are
viable. Indeed, we hope to see the QA project
reach the full potential of the transformations
presented in this paper.
6 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=home&q=5

9 APPENDIX: Queries in different query languages
Utterance: What are the cities in Virginia?
Prolog:
answer(A,(city(A),loc(A,B), const(B,stateid(virginia)))).
SPARQL:
SELECT ?city WHERE {
?xA <name> ?city .
?xA <type> "city" .
?xA <name> ?A .
?xB <name> ?B .
?xA <state> ?B .
}
λ-DCS:
(!<name> (and (and (<type> 'city') (<name> [])) (<state> (!<name>
(and (<type> 'state') (<name> 'virginia'))))))
Query 1 (Used as example)
Utterance: What is the name of the highest point in Oregon?
Prolog:
answer(A,highest(A,(place(A),loc(A,B),state(B),const(B,stateid(oregon))))).
SPARQL:
SELECT ?A WHERE {
?xB <name> ?B .
?xB <name> ?B .
?xB <name> "oregon" .
?xA <highest point> ?A .
?xA <highest elevation> ?height0 .
?xA <state> ?B .
}
ORDER BY DESC(?height0) LIMIT 1
λ-DCS:
(!<highest point> (argmax 1 1 (<state> (!<name> (and (and
(<type> 'state') (<type> 'state')) (<name> (!<name> (and (and
(<type> 'state') (<type> 'state')) (<name> 'oregon')))))))
<highest elevation>))
Query 2

Utterance: What is the capital of Texas?
Prolog:
answer(A,(capital(A),loc(A,B),const(B,stateid(texas)))).
SPARQL:
SELECT ?A WHERE {
?xB <name> ?B .
?xB <name> "texas" .
?xStateCapitalOf <capital> ?A .
?xA <name> ?A .
?xA <state> ?B .
}
λ-DCS:
(and (!<capital> []) (!<name> (<state> (!<name> (and (<type> 'state')
(<name> 'texas'))))))
Query 3
Utterance: How many states have a lower elevation than Arizona?
Prolog:
answer(A,count(B,(state(B),low_point(B,C),lower(C,D),
low_point(E,D),const(E,stateid(arizona))),A)).
SPARQL: (does not return results equivalent to the Prolog query!)
SELECT (COUNT (?B) AS ?numberOFstate) WHERE {
?xB <name> ?B .
?xC <state> ?B .
?xC <lowest point> ?C .
?xE <type> "state" .
?xE <name> ?E .
?xE <name> "arizona" .
OPTIONAL {?xC <lowest elevation> ?height0 . }
OPTIONAL {?xD <lowest elevation> ?height1 . }
FILTER ( IF ( BOUND(?height1), ?height0, 0 ) <
IF ( BOUND(?height1), ?height1, 0 ) )
?xD <state> ?E .
?xD <lowest point> ?D .
}
λ-DCS: Not possible! (See Section 6: Encountered problems and limitations)
Query 4

Utterance: What is the name of the lakes in Michigan?
Prolog:
answer(A,(lake(A),loc(A,B),const(B,stateid(michigan)))).
SPARQL:
SELECT ?lake WHERE {
?xA <name> ?lake .
?xA <type> "lake" .
?xA <name> ?A .
?xB <name> ?B .
?xB <name> "michigan" .
?xA <isin> ?B .
}
λ-DCS:
(!<name> (and (and (<type> 'lake') (<name> [])) (<isin> (!<name> (and
(<type> 'state') (<name> 'michigan'))))))
Query 5
Utterance: How many rivers flow through Colorado?
Prolog:
answer(A,count(B,(river(B),loc(B,C),const(C,stateid(colorado))),A)).
SPARQL:
SELECT (COUNT (?B) AS ?numberOFriver) WHERE {
?xB <type> "river" .
?xB <name> ?B .
?xC <type> "state" .
?xC <name> ?C .
?xC <name> "colorado" .
?xB <flowsthru> ?C .
}
λ-DCS:
(count (!<name> (and (<type> 'river') (<flowsthru> (!<name> (and
(<type> 'state') (<name> 'colorado')))))))
Query 6

Utterance: What are the names of the highest points of the states bordering Mississippi?
Prolog:
answer(A,(high_point(B,A),state(B),next_to(B,C),const(C,stateid(mississippi)))).
SPARQL:
SELECT ?A WHERE {
?xB <name> ?B .
?xC <type> "state" .
?xC <name> ?C .
?xC <name> "mississippi" .
?x0 <state> ?B .
?x0 <highest point> ?A .
?x1 <state> ?C .
?x1 <borderingstate> ?B .
}
λ-DCS:
(!<highest point> (<state> (and (!<name> (<type> 'state'))
(!<borderingstate> (<state> (!<name> (and (<type> 'state') (<name>
'mississippi'))))))))
Query 7
Utterance: Give me all the cities in the USA
Prolog:
answer(A,(city(A),loc(A,B),const(B,countryid(usa)))).
SPARQL:
SELECT ?A WHERE {
?xA <type> "city" .
?xA <name> ?A .
}
λ-DCS:
(!<name> (<type> ’city’))
Query 8
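To make the reading of the λ-DCS expressions in this appendix more concrete, the following Python sketch evaluates the expression of Query 6 over a handful of toy triples. It is a minimal illustration under stated assumptions, not the evaluator used in this work: the triples and entity identifiers are hypothetical, expressions are encoded as nested tuples for convenience, and only the operators appearing in Query 6 are covered. It assumes that (<p> X) denotes the subjects with a <p> edge into X, and that (!<p> X) denotes the objects reached by <p> from X.

```python
# Toy triples; entity ids (r1, s1, ...) and facts are hypothetical.
TRIPLES = {
    ("r1", "type", "river"), ("r1", "name", "colorado river"),
    ("r1", "flowsthru", "colorado"),
    ("r2", "type", "river"), ("r2", "name", "arkansas"),
    ("r2", "flowsthru", "colorado"),
    ("r3", "type", "river"), ("r3", "name", "hudson"),
    ("r3", "flowsthru", "new york"),
    ("s1", "type", "state"), ("s1", "name", "colorado"),
    ("s2", "type", "state"), ("s2", "name", "new york"),
}


def join(prop, values):
    """(<prop> X): subjects that have a <prop> edge into X."""
    return {s for s, p, o in TRIPLES if p == prop and o in values}


def reverse_join(prop, subjects):
    """(!<prop> X): objects reached by <prop> from subjects in X."""
    return {o for s, p, o in TRIPLES if p == prop and s in subjects}


def evaluate(expr):
    """Evaluate a nested-tuple expression to a set, or to an int for count."""
    if isinstance(expr, str):          # a constant such as 'river'
        return {expr}
    op, *args = expr
    if op == "and":                    # set intersection
        left, right = (evaluate(a) for a in args)
        return left & right
    if op == "count":
        return len(evaluate(args[0]))
    if op.startswith("!"):             # reversed property
        return reverse_join(op[1:], evaluate(args[0]))
    return join(op, evaluate(args[0]))


# Query 6 of the appendix, re-encoded as nested tuples.
QUERY_6 = ("count", ("!name", ("and", ("type", "river"),
           ("flowsthru", ("!name", ("and", ("type", "state"),
                                    ("name", "colorado")))))))

if __name__ == "__main__":
    print(evaluate(QUERY_6))
```

Every intermediate result is a single set, which is why the expression composes cleanly for one queried variable but has no natural place for a second output column.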

References
Uwe Aßmann, Andreas Bartho, and Christian Wende. 2010. Reasoning Web. Semantic Technologies for Soft-
ware Engineering: 6th International Summer School 2010, Dresden, Germany, August 30-September 3, 2010.
Tutorial Lectures, volume 6325. Springer Science & Business Media.
Henk P Barendregt and Erik Barendsen. 1984. Introduction to lambda calculus. Nieuw Archief voor Wiskunde,
4(2):337–372.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-
answer pairs. In EMNLP, pages 1533–1544.
Percy Liang. 2013. Lambda dependency-based compositional semantics. arXiv preprint arXiv:1309.4408.
