discovering information among the vast amount available on the Web
operates on (nearly) all of the Web and on all kinds of Web documents at the same time
very simple interface (usually just a bag of words) that many casual users are familiar with
limited to filtering & ranking relevant documents (for human consumption)
search requests are often only vaguely related to the search intent, so at best a ranking can be provided
highly parallel (and thus scalable) evaluation
requires a human in the loop to ultimately judge the relevance of a search result
that’s search, but what about “querying”
allow automated processing, aggregation, and deduction of data
really? If you believe that I have some bank stocks
database-style queries on Web (mostly XML or RDF) data
operate on Web data formats such as XML and RDF, the presumptive foundation for the Semantic Web
operate on a small set of documents, i.e., on a small fraction of the Web's data
require knowledge of the language itself as well as of the data to be queried
require significant training to be employed effectively, comparable to a programming task
but: allow automated processing, aggregation, and deduction of data
precise selection of data items in Web documents
scale no better than traditional SQL database technology, and thus are clearly unable to process significant subsets of all Web data
sort by score
hierarchical databases have been there before? -> what’s new? declarative access/query languages
XPath axes and their efficient evaluation already are and will remain one of the chief outcomes of XML research
despite rich functionality, simple yet powerful; & = combined with
example of composition is still an easy one, hard (but too long for this talk) are construction of intermediary trees
Dec 2008/Jan 2009 spike in XQuery news: release of new XQuery 1.1 working draft?
Web Queries: From a Web of Data to a Semantic Web - Presentation Transcript
Web Queries
From a Web of Data to a Semantic Web
François Bry, Tim Furche, Klara Weiand
based on work in the European project KiWi “Knowledge in a Wiki”
1
?
if you believe that I have some
bank stocks for you ...
7
or
query: value of specific element
Share value: 27.40$
rule: automatically sell at < 28.00$
Action!
8
Criterion | Web Search | Web Queries
Scale | "Web", TB/PBs | a few "documents", GBs
Parallelizability | very high (NC) | high for basic selection languages, otherwise very low
Data | independent documents, heterogeneous | trees (XML) or graphs (RDF), homogeneous
Used by | (almost) everyone, many casual users | few experts
Expertise level/knowledge required | low | very high
Expressiveness | very low | very high
Result presentation | ranking, clustering, ... | programmed
Matching | often fuzzy or vague | only precise answers
Return value | documents (or summaries thereof) | parts of trees or graphs
Actionability | very low | very high
9
[Figure: web search and web queries placed on two axes, ease of use and expressive power]
In a nutshell...
10
Web Queries
Search + Query = Easy + Automation
Goals:
make web querying accessible to casual users
e.g., to enable data aggregation/mashups by users
allow precise queries over vastly heterogeneous data
precise queries and rules critical to the (Social) Semantic Web
unless they are accessible no widespread adoption
Classification of approaches to combining Web search & query
enhance search: add data extraction / object search
enhance querying: add keyword search / information retrieval
keyword-based QLs for structured data: ground-up redesign
11
Search + Query
Approach 1: Enhance Search
“Peek” into web documents
Extract data items / Web objects from web documents
to provide more fine-grained answers
Examples
Google (Squared)
Google Rich Snippets
Yahoo! Search Monkey
12
Search + Query
Approach 2: Enhance Querying
Enhance existing (XML) web query languages
by adding information retrieval functionality
ranking, scoring, fuzzy matching
Examples
XQuery and XPath Full Text 1.0, W3C Cand. Recommendation
doc('bib.xml')/bib/book[title ftcontains 'programming']

for $b score $s in doc('bib.xml')/bib/book
where $b ftcontains 'web' ftand 'query'
order by $s descending
return <title> {$b//title} </title>
13
Search + Query
Approach 3: Keyword Queries
Keyword query for structured Web (XML and RDF) data
apply web search paradigm to querying tree and graph data
operates in the same setting as e.g. XQuery and SPARQL
few or single document, no Web scale
but: easier handling of very heterogeneous data
querying of semi-structured data through easy-to-use interface
Ultimate goal:
Easy usage combined with enough power
→ to automate data processing tasks
Focus of this tutorial
K. Weiand, T. Furche, and F. Bry. Quo vadis, web queries? In Web4Web, 2008.
14
Web Queries
Overview
1. Summary of web query research in the 00s
1.1. XML
1.2. RDF
2. Keyword query languages
2.1. Motivation
2.2. Classification
2.3. Issues
3. KWQL
4. Discussion and Outlook
15
16
Part 1
Web Queries
Summary of XML & RDF
Query Language Research
17
1.1
XML
Tree Data & Tree Queries
Data–XPath–XQuery
18
The following listing shows a small XML fragment that illustrates elements and element nesting:

<bib xmlns:dc="http://purl.org/dc/elements/1.1/">
  <article journal="Computer Journal" id="12">
    <dc:title>...Semantic Web...</dc:title>
    <year>2005</year>
    <authors>
      <author>
        <first>John</first> <last>Doe</last> </author>
      <author>
        <first>Mary</first> <last>Smith</last> </author>
    </authors>
  </article>
  <article journal="Web Journal">
    <dc:title>...Web...</dc:title>
    <year>2003</year>
    <authors>
      <author>
        <first>Peter</first> <last>Jones</last> </author>
      <author>
        <first>Sue</first> <last>Robinson</last> </author>
    </authors>
  </article>
</bib>

In addition, we can observe attributes (name, value pairs associated with start tags) that are essentially like elements but may only contain character data, no other nested attributes or elements. Also, by definition, element order is significant, attribute order is not. For instance
<author><last>Doe</last><first>John</first></author>
represents different information than the first author element of the listing (John Doe), but
<article id="12" journal="Computer Journal">...</article>
represents the same element as the first article of the listing.

Figure 1 gives a graphical representation of the XML document that is referenced in preceding illustrations. When represented as a graph, an XML document without links is a labeled tree where each node in the tree corresponds to an element and its type. Edges connect nodes and their children, that is, elements and the elements nested in them, elements and their content, and elements and their attributes. Since the parent-child relationship can be distinguished visually without edge labels, and since attributes receive no special treatment in the research presented in this text, edges will not be labeled in the following figures.

Elements, attributes, and character data are XML's most common information types. In addition, XML documents may also contain comments, processing instructions (name-value pairs with specific semantics that can be placed anywhere an element can be placed), document level information (such as the XML or the document type declarations), entities, and notations, which are essentially just other kinds of information containers.

On top of these information types, two additional facilities relevant to the information content of XML documents are introduced by subsequent specifications: Namespaces [33] and Base URIs [140].

[Fig. 1. Visual representation of sample XML document]
XML: ordered tree (sometimes graph), mostly uniform edges
19
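The point that element order is significant while attribute order is not can be checked with a few lines of Python. This is only a minimal sketch using the standard library's `xml.etree.ElementTree`; the helper name `children` is introduced here for illustration and is not part of the original material.

```python
import xml.etree.ElementTree as ET

def children(xml_text):
    """Return the child tags of the root element in document order."""
    return [child.tag for child in ET.fromstring(xml_text)]

first_last = "<author><first>John</first><last>Doe</last></author>"
last_first = "<author><last>Doe</last><first>John</first></author>"

# Element order is significant: the two author elements differ.
elements_differ = children(first_last) != children(last_first)

a1 = ET.fromstring('<article journal="Computer Journal" id="12"/>')
a2 = ET.fromstring('<article id="12" journal="Computer Journal"/>')

# Attribute order is not significant: both parse to the same attribute mapping.
attributes_equal = a1.attrib == a2.attrib
```

Comparing child tag sequences is of course a simplification of XML equivalence, but it is enough to show the asymmetry between elements and attributes.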
?
what’s new about trees? didn’t we
try that before?
20
[Figure: XPath axes relative to a context node: ancestor, self, parent, child, preceding-sibling, following-sibling, following, preceding, descendant]
XPath: axes for navigation/selection in a tree relative to a context node; horizontal axes for order
21
XPath: Navigation in Trees
Intro in 5 Points
Data: rooted, ordered, unranked, finite trees (no ID/IDREF resolution)
Paths are sequences of steps (axis & test on properties of node)
adorned with existential predicates (in []) to obtain tree queries
some more advanced features: value joins, aggregation, …
Answers are sets of nodes from the input document
/child::html/descendant::h1[not(preceding::h1 = "Introduction")]/child::p[attribute::class="abstract"]
no variables, no construction, no grouping/ordering
22
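The idea of paths as sequences of steps adorned with existential predicates can be tried out in Python. The following is a rough sketch, assuming the standard library's `xml.etree.ElementTree`, which supports only a small XPath subset (child and descendant steps with simple predicates, no reverse or horizontal axes). The document is a trimmed, namespace-free variant of the bibliography sample; that simplification is an assumption made for brevity.

```python
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <article journal="Computer Journal" id="12">
    <title>...Semantic Web...</title>
    <year>2005</year>
    <authors>
      <author><first>John</first><last>Doe</last></author>
      <author><first>Mary</first><last>Smith</last></author>
    </authors>
  </article>
  <article journal="Web Journal">
    <title>...Web...</title>
    <year>2003</year>
    <authors>
      <author><first>Peter</first><last>Jones</last></author>
      <author><first>Sue</first><last>Robinson</last></author>
    </authors>
  </article>
</bib>
"""

root = ET.fromstring(BIB)

# A path is a sequence of steps; [...] holds an existential predicate.
# Roughly: child::article[child::year = '2005']/child::authors/child::author/child::last
last_names = [n.text for n in
              root.findall("article[year='2005']/authors/author/last")]

# Attribute predicate followed by a descendant step (written // in abbreviated syntax).
web_journal_firsts = [n.text for n in
                      root.findall("article[@journal='Web Journal']//first")]
```

The answers are nodes of the input document, returned in document order; nothing new is constructed.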
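\[
\begin{aligned}
[\![\textit{axis}]\!]_{\text{Nodes}}(n) &= \{ n' : R_{\textit{axis}}(n, n') \} \\
[\![\lambda]\!]_{\text{Nodes}}(n) &= \{ n' : \mathrm{Lab}_{\lambda}(n') \} \\
[\![\text{node()}]\!]_{\text{Nodes}}(n) &= \mathrm{Nodes}(T) \\
[\![\textit{axis}\text{::}\textit{nt}[\textit{qual}]]\!]_{\text{Nodes}}(n) &= \{ n' : n' \in [\![\textit{axis}]\!]_{\text{Nodes}}(n) \wedge n' \in [\![\textit{nt}]\!]_{\text{Nodes}}(n) \wedge [\![\textit{qual}]\!]_{\text{Bool}}(n') \} \\
[\![\textit{step}/\textit{path}]\!]_{\text{Nodes}}(n) &= \{ n'' : n' \in [\![\textit{step}]\!]_{\text{Nodes}}(n) \wedge n'' \in [\![\textit{path}]\!]_{\text{Nodes}}(n') \} \\
[\![\textit{path}_1 \cup \textit{path}_2]\!]_{\text{Nodes}}(n) &= [\![\textit{path}_1]\!]_{\text{Nodes}}(n) \cup [\![\textit{path}_2]\!]_{\text{Nodes}}(n) \\
[\![\textit{path}]\!]_{\text{Bool}}(n) &= [\![\textit{path}]\!]_{\text{Nodes}}(n) \neq \emptyset \\
[\![\textit{path}_1 \wedge \textit{path}_2]\!]_{\text{Bool}}(n) &= [\![\textit{path}_1]\!]_{\text{Bool}}(n) \wedge [\![\textit{path}_2]\!]_{\text{Bool}}(n) \\
[\![\textit{path}_1 \vee \textit{path}_2]\!]_{\text{Bool}}(n) &= [\![\textit{path}_1]\!]_{\text{Bool}}(n) \vee [\![\textit{path}_2]\!]_{\text{Bool}}(n) \\
[\![\neg \textit{path}]\!]_{\text{Bool}}(n) &= \neg [\![\textit{path}]\!]_{\text{Bool}}(n) \\
[\![\text{lab()} = \lambda]\!]_{\text{Bool}}(n) &= \mathrm{Lab}_{\lambda}(n) \\
[\![\textit{path}_1 = \textit{path}_2]\!]_{\text{Bool}}(n) &= \exists n', n'' : n' \in [\![\textit{path}_1]\!]_{\text{Nodes}}(n) \wedge n'' \in [\![\textit{path}_2]\!]_{\text{Nodes}}(n) \wedge {\doteq}(n', n'')
\end{aligned}
\]
TABLE I
23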
XPath: Navigation in Trees
A Success Story
Commercial success: One of the most successful QLs
widely implemented and used, several W3C standards
Research success:
designed with little concern for formal “beauty”
but with few restrictions: turns out it hit a formal sweet spot
Expressiveness: monadic datalog (datalog with only unary
intensional predicates, i.e., answer predicates)
also: two-variable first-order logic (FO2)
Polynomial complexity, linear for navigational XPath
24
XPath: Navigation in Trees
A Success Story (cont.)
Completeness: falls short of being first-order complete
for first-order completeness a “conditional axis” (UNTIL operator) needed
e.g., all a nodes reachable by a path of only b nodes
but: partially justified by results on elimination of reverse axes
reverse axis: parent, ancestor, preceding, …
eliminating these axes possible in navigational XPath
though in few cases at exponential cost or resulting in
introduction of expensive node-identity joins
we can safely ignore them in most cases
in conditional XPath this elimination is not possible
M. Marx. Conditional XPath, the First Order Complete XPath Dialect. PODS 2004.
D. Olteanu, H. Meuss, T. Furche, and F. Bry. XPath: Looking Forward. XMLDM @ EDBT, LNCS 2490, 2002.
C. Ley and M. Benedikt. How big must complete XML query languages be? ICDT 2009.
25
XPath: Navigation in Trees
A Success Story (cont.)
Evaluation:
naïve decomposition: sequence of joins, descendant with closure over child relation
better: structural joins for descendant (single lookup using tree labeling)
e.g., pre/post encoding
fits nicely with relational storage
twig joins: stack-based, holistic, not easily adapted to relational DBS
Structural Joins
Computing descendants by a single theta-join on the representation relation.
Following(x, y) :⇔ x <pre y ∧ x <post y
From these axes, all others can be defined in first-order logic.
pre- and post-orders are sufficient to represent the full tree structure.
Example tree (label:pre:post):
a:1:7
  b:2:3
    a:3:1
    c:4:2
  a:5:6
    b:6:4
    d:7:5
Relation R:
pre | post | lab
1 | 7 | a
2 | 3 | b
3 | 1 | a
4 | 2 | c
5 | 6 | a
6 | 4 | b
7 | 5 | d
Example
CREATE VIEW descendant AS
SELECT r1.pre AS ancestor, r2.pre AS descendant FROM R r1, R r2 WHERE r1.pre < r2.pre
AND r2.post < r1.post;
Such joins are called structural joins [Al-Khalifa et al., 2002].
26
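The pre/post encoding and the descendant join can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the tree is the seven-node example from the slide, written as nested tuples, and the function names `encode`, `descendants`, and `following` are chosen here for illustration only.

```python
from itertools import count

# Example tree as (label, children); the pre/post numbers are computed below.
TREE = ("a", [("b", [("a", []), ("c", [])]),
              ("a", [("b", []), ("d", [])])])

def encode(tree):
    """Return rows (pre, post, label) from one depth-first traversal."""
    rows, pre, post = [], count(1), count(1)

    def visit(node):
        label, kids = node
        my_pre = next(pre)                        # numbered when the node is entered
        for kid in kids:
            visit(kid)
        rows.append((my_pre, next(post), label))  # post number assigned on leaving

    visit(tree)
    return sorted(rows)

R = encode(TREE)

def descendants(rel):
    """Structural join: (ancestor pre, descendant pre) pairs via one theta-join."""
    return [(r1[0], r2[0]) for r1 in rel for r2 in rel
            if r1[0] < r2[0] and r2[1] < r1[1]]

def following(rel):
    """Following axis: y follows x iff x comes first in both pre- and post-order."""
    return [(x[0], y[0]) for x in rel for y in rel
            if x[0] < y[0] and x[1] < y[1]]
```

The nested loops stand in for the relational theta-join; a database would evaluate the same predicate with an index on the pre and post columns rather than by comparing all pairs.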
XPath: Navigation in Trees
A Success Story (cont.)
1 simple, fairly easy to learn
2 formal “sweet” spot
3 efficient evaluation
M. Benedikt and C. Koch. XPath Leashed. In ACM Computing Surveys, 2007
27
?
that’s it? everyone happy?
28
let $auction := doc('auction.xml')
for $o in $auction/open_auctions/open_auction
return
  <corrupt id='{ $o/@id }'>
    { if (sum($o/(initial | bidder/increase)) = $o/current)
      then text { 'no' }
      else $auction//people/person[@id = $o//@person]/name }
  </corrupt>
29
XQuery: Graph Queries & Construction
From XPath to Hell
Adds an enormous set of features to XPath
price: highly complex language, Turing-complete, very hard to implement
here: just some highlights
Variables: XPath plus variables → no longer FO2
“an article that has the conference’s chair as author”
no longer polynomial, but NP-/PSPACE-complete
Construction: harmless if only at end of query
but: composition (i.e., querying of data constructed in the same query)
“find all papers written by the author with the most papers”
NEXPTIME-complete or in EXPSPACE (can not be captured by datalog)
30
XQuery: Graph Queries & Construction
From XPath to Hell
Recursion: programmable
with composition → Turing complete
i.e., like Prolog or datalog with value invention
Sequences: rather than sets
without composition little effect
with composition: drastically harder optimization as iteration semantics of
a query must be precisely observed
31
[Figure: excerpt from the MonetDB/XQuery paper. Query Q3: for $y in 2001 to 2008 return if ($y lt 2007) then 'WD/CR/PR' else 'REC' (W3C Recommendation track maturity level; annotations denote dynamic iteration scopes). Shown alongside: the associated loop tables with schema iter|pos|item, paraphrases of Q3 in Linq, Links, Ruby, Haskell, and Python, an excerpt of the algebra operators, and the loop-lifted algebraic code to evaluate Q3.]
P. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, J. Teubner. MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine, SIGMOD 2006.
32
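The iter|pos|item representation behind loop lifting can be illustrated for query Q3 (for $y in 2001 to 2008 return if ($y lt 2007) then 'WD/CR/PR' else 'REC'). The Python below is a deliberately simplified sketch of the idea and not the algebra used by MonetDB/XQuery; all variable names are chosen here for illustration.

```python
# Every intermediate value is a table of (iter, pos, item) rows.

# The input sequence 2001 to 2008, evaluated in the single outer iteration 1.
seq = [(1, pos, year) for pos, year in enumerate(range(2001, 2009), start=1)]

# One inner iteration per item of the sequence; $y is a singleton in each.
y = [(inner, 1, item) for inner, (_, _, item) in enumerate(seq, start=1)]

# The condition $y lt 2007 is evaluated for all iterations at once.
cond = [(it, pos, item < 2007) for it, pos, item in y]

# Split the loop into the iterations taking the then- and the else-branch.
then_iters = [it for it, _, ok in cond if ok]
else_iters = [it for it, _, ok in cond if not ok]

# Attach the constant result of each branch to its iterations and union them.
body = sorted([(it, 1, "WD/CR/PR") for it in then_iters] +
              [(it, 1, "REC") for it in else_iters])

# Map back to the outer iteration: the inner iteration order becomes the position.
result = [(1, pos, item) for pos, (_, _, item) in enumerate(body, start=1)]
```

Order lives in the data (the pos column) rather than in the control flow, which is what lets an unordered relational engine respect XQuery's sequence semantics.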
XQuery: Graph Queries & Construction
From XPath to Hell
1 significantly harder than XPath
2 composition & recursion: expensive operations
3 yet significant steps toward fast implementations
M. Benedikt and C. Koch. Interpreting Tree-to-Tree Queries. ICALP 2006.
P. Boncz, T. Grust, et al. MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine, SIGMOD 2006.
33
34
1.2
RDF
Graph Data for the Web
35
[Figure: RDF graph of the bibliography example. Legend: Class, Resource, Literal, Other. Node labels include smith2005, Article, Journal, Person, Bag, _1, _2, "Computer Journal", 11, 2005, John, Doe, Mary, Smith, and "...Semantic Web..."; edge labels include type, title, year, isPartOf, author, name, number, first, and last.]
RDF: unordered, unranked, unrooted finite graph, edge and node labeled, different node kinds
36
?
what’s new about that? isn’t that just
a fancy repr. of a ternary relation?
37
RDF: Semantics for the Web?
What’s new about RDF?
data model of RDF very similar to relational
but: “unique” identifiers in form of URIs that can be shared on the Web
uniqueness only thanks to careful assignment practice (unlike GUIDs)
human-readable (unlike GUIDs)
but: existential information in form of blank nodes
like named null values in SQL, also known as Codd tables
but: implied information due to RDF/S semantics (and, thus, entailment)
e.g., class hierarchy and typing, domain and range of properties, axiomatic triples
notion of redundancy-free or lean graph/answer
38
CONSTRUCT { ?j bib:hasPart ?a }
WHERE { ?a rdf:type bib:Article AND ?a bib:isPartOf ?j
        AND ?j bib:name 'Computer Journal' }

The query illustrates SPARQL's fundamental query construct: a pattern (s, p, o) for RDF triples (whose components are usually thought of as subject, predicate, object). Any RDF triple is also a triple pattern, but triple patterns allow variables for each component. Furthermore, SPARQL also allows literals in subject position, anticipating the same change also in RDF itself. We use the variant syntax for SPARQL discussed in [161] to ease the definition of syntax and semantics of the language. For instance, standard SPARQL uses . instead of AND for triple conjunction. We consider two forms of SPARQL queries, viz. SELECT queries that return a list of variable bindings and CONSTRUCT queries that return new RDF graphs. Triple patterns contained in a CONSTRUCT clause (or "template") are instantiated with the variable bindings provided by the evaluation of the triple pattern in the WHERE clause. We omit named graphs and assume that all queries are on the single input graph.
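\[
\begin{aligned}
[\![(s, p, o)]\!]^{D}_{\text{Subst}} &= \{ \theta : \mathrm{dom}(\theta) = \mathrm{Vars}((s, p, o)) \wedge (s, p, o)\theta \in D \} \\
[\![\textit{pattern}_1 \text{ AND } \textit{pattern}_2]\!]^{D}_{\text{Subst}} &= [\![\textit{pattern}_1]\!]^{D}_{\text{Subst}} \bowtie [\![\textit{pattern}_2]\!]^{D}_{\text{Subst}} \\
[\![\textit{pattern}_1 \text{ UNION } \textit{pattern}_2]\!]^{D}_{\text{Subst}} &= [\![\textit{pattern}_1]\!]^{D}_{\text{Subst}} \cup [\![\textit{pattern}_2]\!]^{D}_{\text{Subst}} \\
[\![\textit{pattern}_1 \text{ MINUS } \textit{pattern}_2]\!]^{D}_{\text{Subst}} &= [\![\textit{pattern}_1]\!]^{D}_{\text{Subst}} \setminus [\![\textit{pattern}_2]\!]^{D}_{\text{Subst}} \\
[\![\textit{pattern}_1 \text{ OPT } \textit{pattern}_2]\!]^{D}_{\text{Subst}} &= [\![\textit{pattern}_1]\!]^{D}_{\text{Subst}} \mathbin{⟕} [\![\textit{pattern}_2]\!]^{D}_{\text{Subst}} \\
[\![\textit{pattern} \text{ FILTER } \textit{condition}]\!]^{D}_{\text{Subst}} &= \{ \theta \in [\![\textit{pattern}]\!]^{D}_{\text{Subst}} : \mathrm{Vars}(\textit{condition}) \subset \mathrm{dom}(\theta) \wedge [\![\textit{condition}]\!]^{D}_{\text{Bool}}(\theta) \} \\
[\![\textit{condition}_1 \wedge \textit{condition}_2]\!]^{D}_{\text{Bool}}(\theta) &= [\![\textit{condition}_1]\!]^{D}_{\text{Bool}}(\theta) \wedge [\![\textit{condition}_2]\!]^{D}_{\text{Bool}}(\theta) \\
[\![\textit{condition}_1 \vee \textit{condition}_2]\!]^{D}_{\text{Bool}}(\theta) &= [\![\textit{condition}_1]\!]^{D}_{\text{Bool}}(\theta) \vee [\![\textit{condition}_2]\!]^{D}_{\text{Bool}}(\theta) \\
[\![\neg \textit{condition}]\!]^{D}_{\text{Bool}}(\theta) &= \neg [\![\textit{condition}]\!]^{D}_{\text{Bool}}(\theta) \\
[\![\text{BOUND}(?v)]\!]^{D}_{\text{Bool}}(\theta) &= v\theta \neq \text{nil} \\
[\![\text{isLITERAL}(?v)]\!]^{D}_{\text{Bool}}(\theta) &= v\theta \in L \\
[\![\text{isIRI}(?v)]\!]^{D}_{\text{Bool}}(\theta) &= v\theta \in I \\
[\![\text{isBLANK}(?v)]\!]^{D}_{\text{Bool}}(\theta) &= v\theta \in B \\
[\![?v = \textit{literal}]\!]^{D}_{\text{Bool}}(\theta) &= v\theta = \textit{literal} \\
[\![?u = {?v}]\!]^{D}_{\text{Bool}}(\theta) &= u\theta = v\theta \wedge u\theta \neq \text{nil} \\
[\![\textit{triple}]\!]^{D}_{\text{Graph}}(\theta) &= \textit{triple}\,\theta \text{ if } \forall v \in \mathrm{Vars}(\textit{triple}) : v\theta \neq \text{nil}, \text{ otherwise } \emptyset \\
[\![\textit{template}_1 \text{ AND } \textit{template}_2]\!]^{D}_{\text{Graph}}(\theta) &= [\![\textit{template}_1]\!]^{D}_{\text{Graph}}(\theta) \cup [\![\textit{template}_2]\!]^{D}_{\text{Graph}}(\theta) \\
[\![\text{CONSTRUCT } t \text{ WHERE } p]\!]^{D} &= \bigcup_{\theta \in [\![p]\!]^{D}_{\text{Subst}}} [\![\mathrm{std}(t)]\!]^{D}_{\text{Graph}}(\theta) \\
[\![\text{SELECT } V \text{ WHERE } p]\!]^{D} &= \pi_{V}\big([\![p]\!]^{D}_{\text{Subst}}\big)
\end{aligned}
\]
TABLE V
SEMANTICS FOR SPARQL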
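The substitution-based reading of triple patterns, AND, and CONSTRUCT can be made concrete with a small evaluator. This is only a sketch under stated assumptions: the data set `D` and its identifiers (`smith2005`, `jones2003`, `cj`, `wj`) are made up for illustration, terms starting with `?` are treated as variables, and FILTER, UNION, OPT, and MINUS are left out.

```python
# Hypothetical input graph D as a set of (subject, predicate, object) triples.
D = {
    ("smith2005", "rdf:type", "bib:Article"),
    ("smith2005", "bib:isPartOf", "cj"),
    ("cj", "bib:name", "Computer Journal"),
    ("jones2003", "rdf:type", "bib:Article"),
    ("jones2003", "bib:isPartOf", "wj"),
    ("wj", "bib:name", "Web Journal"),
}

def is_var(term):
    return term.startswith("?")

def match(pattern, data):
    """All substitutions theta (as dicts) such that pattern*theta is in data."""
    result = []
    for triple in data:
        theta = {}
        if all((theta.setdefault(p, t) == t) if is_var(p) else p == t
               for p, t in zip(pattern, triple)):
            result.append(theta)
    return result

def compatible(a, b):
    """Two substitutions are compatible if they agree on shared variables."""
    return all(a[v] == b[v] for v in a.keys() & b.keys())

def join(left, right):
    """AND: merge every pair of compatible substitutions."""
    return [{**a, **b} for a in left for b in right if compatible(a, b)]

def construct(template, substitutions):
    """CONSTRUCT: instantiate the template triple once per substitution."""
    return {tuple(theta.get(t, t) for t in template) for theta in substitutions}

where = [("?a", "rdf:type", "bib:Article"),
         ("?a", "bib:isPartOf", "?j"),
         ("?j", "bib:name", "Computer Journal")]

thetas = [{}]                      # the empty substitution is neutral for join
for pattern in where:
    thetas = join(thetas, match(pattern, D))

graph = construct(("?j", "bib:hasPart", "?a"), thetas)
```

Only the article that is part of the journal named "Computer Journal" survives the joins, so the constructed graph contains a single bib:hasPart triple.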
RDF: Semantics for the Web?
Simple RDF QL?
SPARQL: Simple Protocol and RDF Query Language
really simple?
same expressiveness as full relational algebra, PSPACE-complete
but: no composition, no order → simpler than full SQL or XQuery
really an RDF query language?
blank node construction only limited (no quantifier alternation)
talks vaguely about extension to entailment regimes
but no support in SPARQL as defined
no support for lean answers (justified partially by high computational cost)
40
RDF: Semantics for the Web?
SPARQL: Simple RDF QL?
Complexity:
SPARQL with only AND, FILTER, and UNION: NP-complete
i.e., no computationally interesting subset of full relational SPJU queries
full SPARQL: PSPACE-complete
again no computationally interesting subset of full first-order queries
why? due to negation “hidden” in OPTIONAL
reduction from 3SAT using isBound to encode negation
lacks completeness w.r.t. RDF transformations:
relational algebra: can express all PSPACE-transformations on relations
SPARQL: fails at the same for RDF
J. Perez, M. Arenas, and C. Gutierrez. Semantics and Complexity of SPARQL. ISWC 2006
41
Select all persons and return one blank node connected to all these persons using "member"
[Figure: a single blank node with member edges to John, Mary, and Joe]
Not expressible in SPARQL
Why not? no quantifier alternation
SPARQL: limitations of blank node construction
F. Bry, T. Furche, C. Ley, B. Linse, B. Marnette, and O. Poppe. SPARQLog: SPARQL with Rules and Quantification. In: Semantic Web Information Management: A Model-based Perspective, 2009.
RDF: Semantics for the Web?
SPARQL: Future
doesn’t hit a sweet spot (like XPath)
at least from a formal, database perspective
contributions from database community very rare
yet a significant improvement over most previous RDF QLs
already forms basis for future QL research on RDF
rule extensions for RDF
“From SPARQL to Rules” (Polleres, WWW 2007),
Networked Graphs (Schenk et al, WWW 2008)
no or limited blank node support
but: with rules, restricted quantification as in SPARQL (∃∀) is as expressive as unrestricted quantifier alternation
F. Bry, T. Furche, C. Ley, B. Linse, B. Marnette, and O. Poppe. SPARQLog: SPARQL with Rules and Quantification. In: Semantic Web Information Management: A Model-based Perspective, 2009.
RDF: Semantics for the Web?
SPARQL: Summary
1 syntax for first-order queries on RDF
2 foundation for research but little innovation in the language itself
3 lackluster support for RDF specifics
J. Perez, M. Arenas, and C. Gutierrez. Semantics and Complexity of SPARQL. ISWC 2006.
Part 2
Keyword Queries
Classification & main issues
Queries as Keywords
Main Characteristics
Queries are (mostly) unstructured bags of words
Used in general purpose web search engines
but also elsewhere (Amazon, Facebook,...)
Implicit conjunctive semantics (with limitations)
Often combined with IR techniques
ranking, fuzzy matching
Research focuses on
application to semi-structured data
general structured data (Web objects & tables, relations)
Queries as Keywords
Why Keyword Qs for Structured Data
Success in other areas
Users are already familiar with the paradigm
Allow casual users to query structured data without
having deep knowledge of
the query language
the structure & schema of the data
Enable querying of heterogeneous data
Queries as Keywords
Classification of Keyword QLs
Data type
XML
RDF
Implementation
stand-alone systems
translation to conventional query language
keyword-enhanced query languages
Queries as Keywords
Classification of Keyword QLs
Complexity of atomic queries
keyword-only
label-keyword
keyword-enhanced
Querying of elements
Values
Node labels
Edge labels
Queries as Keywords
XML Keyword QLs
                   Stand-alone   Enhancement
Keyword-only       9             0
Label-keyword      2             0
Keyword-enhanced   --            3
K. Weiand, T. Furche, and F. Bry. Quo vadis, web queries? In Web4Web, 2008.
Queries as Keywords
RDF Keyword QLs
                   Stand-alone   Translation
Keyword-only       2             2
Label-keyword      0             1
Keyword-enhanced   --            --
K. Weiand, T. Furche, and F. Bry. Quo vadis, web queries? In Web4Web, 2008.
Queries as Keywords
XML Keyword QLs More Popular
Time
first XML keyword query languages are as old as RDF
Familiarity with XML
Complexity of RDF
Graph-shaped
Labeled Edges
Blank nodes
2.1 Computing Query Answers
Turning keyword matches into answers
Computing Query Answers
Problem Setting
Query K yields match lists L1 ... L|K|
Answer sets S1 … Sm are constructed from the match lists
Each answer set Si may contain either
only one match per match list (|Si| = |K|) or
several (|Si| ≥ |K|)
Data-centric vs. document-centric XML
Computing Query Answers
Problem Setting
K={Smith, web}, L1={11}, L2={3,23}, S1={11,3}, S2={11,23}
(1) bib
  (2) article
    (3) title: ...Semantic Web...
    (4) year: 2005
    (5) authors
      (6) author
        (7) first: John
        (8) last: Doe
      (9) author
        (10) first: Mary
        (11) last: Smith
    (12) journal: Computer Journal
  (13) article
    (14) title: ...XML...
    (15) year: 2003
    (16) authors
      (17) author
        (18) first: Peter
        (19) last: Jones
      (20) author
        (21) first: Sue
        (22) last: Robinson
    (23) journal: Web Journal
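To make the problem setting concrete, the following minimal sketch enumerates the answer sets that take exactly one match from each match list, using the match lists of the example above. The function name `answer_sets` is an illustrative assumption and not part of any of the surveyed systems.

```python
from itertools import product

def answer_sets(match_lists):
    """Enumerate answer sets that take exactly one match from every match list."""
    return [set(combo) for combo in product(*match_lists)]

# Match lists of the example: K = {Smith, web}
L1 = [11]      # nodes matching "Smith"
L2 = [3, 23]   # nodes matching "web"

sets = answer_sets([L1, L2])
print(sets)  # the two candidate answer sets S1 = {11, 3} and S2 = {11, 23}
```

The number of candidate answer sets is the product of the match list sizes, which is one reason why the approaches that follow filter candidates instead of returning all of them.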
Computing Query Answers
Problem Setting
Entire XML document
usually too big to serve as a good query answer
Matched nodes alone
usually not informative
Meaningful results
a smaller unit has to be found
Computing Query Answers
Approaches
1. Determine entities based on the schema
only keyword matches within one entity or
apply LCA at entity-level
done manually or using schema partitioning with
cardinality as criterion
Lowest Entity Node, Minimal Information Unit, XSeek...
2. Determine entities by connecting keyword matches
Answer entity:
contains (at least) one match for each keyword
Computing Query Answers
Connecting Keyword Matches
Classic concept: Lowest Common Ancestor (LCA)
Use LCA to find the maximally specific concept that is
common to the keyword matches
LCA: Lowest node that is ancestor to all elements in
an answer set
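The LCA definition can be sketched in a few lines. The `PARENT` map below is a hypothetical encoding of the node numbering of the example bib tree (child to parent), and the helper names are illustrative assumptions.

```python
# Child -> parent map for the example bib tree (node 1, bib, is the root).
PARENT = {
    2: 1, 13: 1,
    3: 2, 4: 2, 5: 2, 12: 2, 6: 5, 9: 5, 7: 6, 8: 6, 10: 9, 11: 9,
    14: 13, 15: 13, 16: 13, 23: 13,
    17: 16, 20: 16, 18: 17, 19: 17, 21: 20, 22: 20,
}

def ancestors_or_self(node):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lca(nodes):
    """Lowest node that is an ancestor-or-self of every node in `nodes`."""
    nodes = list(nodes)
    common = set(ancestors_or_self(nodes[0]))
    for n in nodes[1:]:
        common &= set(ancestors_or_self(n))
    # Walking upwards, the first node shared by all paths is the lowest one.
    return next(a for a in ancestors_or_self(nodes[0]) if a in common)

print(lca({3, 11}))   # 2
print(lca({11, 23}))  # 1
```

For the query K={Smith, web}, the first call yields the article node that contains both matches, while the second yields the document root.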
Computing Query Answers
Lowest Common Ancestor (LCA)
LCA(S1={3,11}) = 2
[Figure: example bib tree, as on the Problem Setting slide; the LCA of nodes 3 and 11 is node 2]
Computing Query Answers
Lowest Common Ancestor (LCA)
LCA(S2={11,23}) = 1 — False positive
[Figure: example bib tree, as on the Problem Setting slide; the LCA of nodes 11 and 23 is the root node 1]
2.1.1 Improving LCA
Approaches to improve LCA for computing answer entities
Computing Query Answers
Lowest Common Ancestor (LCA)
Many approaches to improve over LCA
SLCA, MLCA, Interconnection Semantics, VLCA, Amoeba Join...
Filter LCAs to avoid false positives
Result is subset of LCA result
Improving LCA
Interconnection Relationship
Idea: Two different nodes with the same label
correspond to different entities of the same type.
Two nodes are interconnected if,
on the shortest path between them,
every node label occurs only once
Interconnection in answer sets:
star-related: a node in Si interconnected with all nodes in Si
all-pairs related: all nodes in Si pair-wise interconnected
S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSearch: A Semantic Search Engine for XML. VLDB 2003
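The interconnection test and the two relatedness notions can be sketched as follows. This is a minimal sketch that follows the definition as stated on the slide (every label on the path, endpoints included, occurs only once). `PARENT` and `LABEL` are hypothetical encodings of the example bib tree, and all function names are illustrative assumptions.

```python
from itertools import combinations

# Child -> parent map and node labels for the example bib tree (nodes 1..23).
PARENT = {
    2: 1, 13: 1,
    3: 2, 4: 2, 5: 2, 12: 2, 6: 5, 9: 5, 7: 6, 8: 6, 10: 9, 11: 9,
    14: 13, 15: 13, 16: 13, 23: 13,
    17: 16, 20: 16, 18: 17, 19: 17, 21: 20, 22: 20,
}
LABEL = dict(zip(range(1, 24), (
    "bib article title year authors author first last author first last journal "
    "article title year authors author first last author first last journal"
).split()))

def up(node):
    """Path from node to the root, starting with node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def path_between(a, b):
    """Nodes on the unique tree path from a to b, including both endpoints."""
    up_a, up_b = up(a), up(b)
    anc_b = set(up_b)
    top = next(n for n in up_a if n in anc_b)  # LCA of a and b
    return up_a[:up_a.index(top) + 1] + up_b[:up_b.index(top)][::-1]

def interconnected(a, b):
    """True if no label occurs twice on the path between a and b."""
    labels = [LABEL[n] for n in path_between(a, b)]
    return len(labels) == len(set(labels))

def star_related(s):
    """Some node in s is interconnected with all other nodes in s."""
    return any(all(interconnected(c, n) for n in s if n != c) for c in s)

def all_pairs_related(s):
    """All nodes in s are pairwise interconnected."""
    return all(interconnected(a, b) for a, b in combinations(s, 2))

print(star_related({3, 8, 11}))       # True
print(all_pairs_related({3, 8, 11}))  # False
```

The path between nodes 8 and 11 passes through two author nodes, which is why the set {3, 8, 11} is star-related (via node 3) but not all-pairs related.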
Improving LCA
Interconnection Relationship
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
Interconnection Relationship
For |K| = 2, star-related ≡ all-pairs related
Consider S1={3, 8, 11}
Improving LCA
Interconnection Relationship
3 and 8 are interconnected
...so are 3 and 11
...but 8 and 11 are not
[Figure: example bib tree, as on the Problem Setting slide, with the paths between nodes 3, 8, and 11]
Improving LCA
Interconnection Relationship
S1={3, 8, 11} is star-related but not all-pairs related
False negative when using all-pairs relatedness
Consider S1={3,7,9}
Improving LCA
Interconnection Relationship
(1) bib
  (2) article
    (3) name: ...Semantic Web...
    (4) year: 2005
    (5) authors
      (6) author
        (7) name: John Doe
      (8) author
        (9) name: Mary Smith
    (10) journal: Computer Journal
  (11) article
    (12) title: ...XML...
    (13) year: 2003
    (14) authors
      (15) author
        (16) name: Peter Jones
      (17) author
        (18) name: Sue Robinson
    (19) journal: Web Journal
Improving LCA
Interconnection Relationship
S1={3,7,9} is a false negative in both measures
False positives for both when node labels are
different but refer to similar concepts
e.g. “article” vs. “book” vs. “text”
False negatives when node labels are identical but
refer to different concepts
e.g. the name of a person vs. the name of a journal
Improving LCA
XKSearch: Smallest LCA (SLCA)
Idea: Enhance LCA with a minimality constraint
SLCA nodes are LCAs which
do not have LCA nodes among their descendants
Similar to XRank/ELCA but stricter
Y. Xu and Y. Papakonstantinou. Efficient Keyword Search for Smallest LCAs in XML Databases. SIGMOD 2005
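The minimality constraint can be illustrated with a naive sketch that first computes the LCA of every candidate answer set and then discards LCAs that have another LCA below them. This is not the efficient algorithm of the XKSearch paper; `PARENT` is a hypothetical encoding of the example bib tree and the function names are illustrative assumptions.

```python
from itertools import product

# Child -> parent map for the example bib tree (node 1, bib, is the root).
PARENT = {
    2: 1, 13: 1,
    3: 2, 4: 2, 5: 2, 12: 2, 6: 5, 9: 5, 7: 6, 8: 6, 10: 9, 11: 9,
    14: 13, 15: 13, 16: 13, 23: 13,
    17: 16, 20: 16, 18: 17, 19: 17, 21: 20, 22: 20,
}

def up(node):
    """Path from node to the root, starting with node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lca(nodes):
    """Lowest common ancestor-or-self of all given nodes."""
    nodes = list(nodes)
    common = set(up(nodes[0]))
    for n in nodes[1:]:
        common &= set(up(n))
    return next(a for a in up(nodes[0]) if a in common)

def slca(match_lists):
    """LCAs of one-match-per-keyword answer sets without an LCA descendant."""
    lcas = {lca(combo) for combo in product(*match_lists)}
    # up(w)[1:] are the proper ancestors of w; drop v if it is one of them.
    return {v for v in lcas if not any(v in up(w)[1:] for w in lcas)}

# K = {Smith, web}: L1 = {11}, L2 = {3, 23}
print(slca([[11], [3, 23]]))  # {2}
```

The root node 1 is an LCA of {11, 23} but is discarded because the LCA node 2 lies below it.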
Improving LCA
XKSearch: Smallest LCA (SLCA)
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
XKSearch: Smallest LCA (SLCA)
No false positive
[Figure: example bib tree, as on the Problem Setting slide, extended by a further article (1035) whose authors (1038) contain an author (1039) with first (1040) Tom and last (1041) Smith]
Improving LCA
XKSearch: Smallest LCA (SLCA)
But of course...
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
SLCA: False Negatives
(1) article
  (2) title: ...Semantic Web...
  (3) year: 2005
  (4) authors
    (5) author
      (6) first: John
      (7) last: Doe
    (8) author
      (9) first: Mary
      (10) last: Smith
  (11) journal: Computer Journal
  (12) references
    (13) article
      (14) title: Web Querying...
      (15) authors
        (16) author
          (17) first: Mary
          (18) last: Smith
Improving LCA
Meaningful LCA (MLCA)
MLCA: LCA where
for each pair of nodes,
no descendant LCA of nodes with the same label exists
Similar to SLCA but less strict, since it only applies when the node label constraint is fulfilled
Problems similar to SLCA and Interconnection Semantics
Y. Li, C. Yu, and H. V. Jagadish. Schema-Free XQuery. VLDB 2004
Improving LCA
Meaningful LCA (MLCA)
S1={3,11},S2={23,11}
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
Meaningful LCA (MLCA)
K={Smith, journal}
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
Amoeba Join
Si is a valid answer set if LCA(Si) ∈ Si
only allows matches where
one matched node is
an ancestor of (or identical to) all matched nodes
K={Smith, web}, L1={11}, L2={3,23}, S1={11,3}, S2={11,23}
T. Saito and S. Morishita. Amoeba Join: Overcoming Structural Fluctuations in XML Data. WebDB
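The Amoeba Join condition, LCA(Si) ∈ Si, is easy to check directly. The sketch below assumes the node numbering of the example bib tree, encoded in a hypothetical `PARENT` map; the function names are illustrative and not taken from the Amoeba Join paper.

```python
# Child -> parent map for the example bib tree (node 1, bib, is the root).
PARENT = {
    2: 1, 13: 1,
    3: 2, 4: 2, 5: 2, 12: 2, 6: 5, 9: 5, 7: 6, 8: 6, 10: 9, 11: 9,
    14: 13, 15: 13, 16: 13, 23: 13,
    17: 16, 20: 16, 18: 17, 19: 17, 21: 20, 22: 20,
}

def up(node):
    """Path from node to the root, starting with node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lca(nodes):
    """Lowest common ancestor-or-self of all given nodes."""
    nodes = list(nodes)
    common = set(up(nodes[0]))
    for n in nodes[1:]:
        common &= set(up(n))
    return next(a for a in up(nodes[0]) if a in common)

def is_amoeba(answer_set):
    """Valid only if one matched node is an ancestor-or-self of all matches."""
    return lca(answer_set) in answer_set

print(is_amoeba({11, 3}))     # False: LCA is node 2, which is not a match
print(is_amoeba({11, 23}))    # False: LCA is node 1, which is not a match
print(is_amoeba({2, 3, 11}))  # True: the matched article node 2 is the LCA
```

For K={Smith, web} both candidate answer sets are rejected, including the intuitively correct {11, 3}. Adding a keyword that matches the enclosing article node makes the answer set {2, 3, 11} valid.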
Improving LCA
Amoeba Join
K={web, Smith} — No false positive
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
Amoeba Join
K={web, Smith} — False negative
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
Amoeba Join
K={web, Smith, article}
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
Amoeba Join
K={web, Smith, article} — False positive
[Figure: nested article tree, as on the SLCA: False Negatives slide]
Improving LCA
Compact LCA (CLCA)
CLCA: LCA node of an answer set Si
where LCA(Si) dominates all nodes in Si
LCA(Si) dominates a node vj ∈ Si if there is no other answer set Sk containing vj where LCA(Sk) is a descendant of LCA(Si)
Simpler: CLCA is an LCA
where no node in the associated answer set can be
part of a more specific answer set
Similar to SLCA
Improving LCA
Compact Valuable LCA (CVLCA)
CVLCA: CLCA node which is also a VLCA node
Recall problems of SLCA and Interconnection Semantics
Improving LCA
Relaxed Tightest Fragment (RTF)
Subtrees are complete with respect to matches
while being as small as possible
Similar to XRank:
maximum match without contained descendant LCAs
K={Smith, web}, L1={11}, L2={3,23}, S1={11,3,23}, S2={11,3}, S3={11,23} and LCA(11,3)=2, LCA(11,23)=1, LCA(11,3,23)=1
Improving LCA
Relaxed Tightest Fragment (RTF)
S1={11,3,23}, c1: ✖
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
Relaxed Tightest Fragment (RTF)
S3={11,23}, c1: ✔, c2: ✖
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
Relaxed Tightest Fragment (RTF)
S2={11,3}, c1: ✔, c2: ✔, c3: ✔
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
XRank: Exclusive LCA
Idea: Prefer more specific LCAs
unless reason not to (more matches)
Result node is LCA node which
does not contain further LCAs or
which is still LCA if LCA subtrees are ignored
As before, K={Smith, web}, L1={11}, L2={3,23}, S1={11,3}, S2={11,23} and LCA(11,3)=2, LCA(11,23)=1
Lin Guo et al. XRANK: Ranked keyword search over XML documents. SIGMOD 2003
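One way to read the ELCA condition is sketched below: a node qualifies if it still contains a match for every keyword once the subtrees of descendants that already contain all keywords are ignored. This is a naive illustration under that reading, not XRank's index-based algorithm. `PARENT` is a hypothetical encoding of the example bib tree, `EXTENDED` adds the third article with author Tom Smith (nodes 1035 to 1041), and all names are illustrative assumptions.

```python
# Child -> parent map for the example bib tree (node 1, bib, is the root).
PARENT = {
    2: 1, 13: 1,
    3: 2, 4: 2, 5: 2, 12: 2, 6: 5, 9: 5, 7: 6, 8: 6, 10: 9, 11: 9,
    14: 13, 15: 13, 16: 13, 23: 13,
    17: 16, 20: 16, 18: 17, 19: 17, 21: 20, 22: 20,
}

def up(parent, node):
    """Path from node to the root, starting with node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def elca(parent, match_lists):
    """Nodes that contain every keyword outside of fully matching subtrees."""
    nodes = set(parent) | set(parent.values())

    def contains_all(v):
        # v has at least one match of every keyword in its subtree.
        return all(any(v in up(parent, m) for m in ml) for ml in match_lists)

    full = {v for v in nodes if contains_all(v)}

    def free_match(v, m):
        # m lies below v and is not hidden inside a fully matching subtree.
        path = up(parent, m)
        if v not in path:
            return False
        return not any(u in full for u in path[:path.index(v)])

    return {v for v in full
            if all(any(free_match(v, m) for m in ml) for ml in match_lists)}

# K = {Smith, web} on the example tree: only the article node 2 qualifies.
print(elca(PARENT, [[11], [3, 23]]))  # {2}

# With a third article whose author is Tom Smith (last name at node 1041),
# the root additionally qualifies through the matches 1041 and 23.
EXTENDED = {**PARENT, 1035: 1, 1038: 1035, 1039: 1038, 1040: 1039, 1041: 1039}
print(elca(EXTENDED, [[11, 1041], [3, 23]]))  # nodes 1 and 2
```

In the extended tree the root is returned although its two remaining matches belong to different articles, so the exclusive variant can reintroduce the false positive that plain LCA produced.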
Improving LCA
XRank: Exclusive LCA
Only (2) is a valid result node
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
XRank: Exclusive LCA
But...
[Figure: example bib tree, as on the Problem Setting slide, extended by a further article (1035) whose authors (1038) contain an author (1039) with first (1040) Tom and last (1041) Smith]
Improving LCA
XRank: Exclusive LCA
Objection:
XRank targeted at document-centric XML
Does this solve the problem?
Consider K={XML,RDF}
Improving LCA
XRank: Exclusive LCA
S1={13}, S2={38,80}...
Is 10 a good return node?
(1) article
  (2) article: ...Semantic Web...
  (3) authors
    (4) author
      (5) first: John
      (6) last: Doe
    (7) author
      (8) first: Mary
      (9) last: Smith
  (10) body
    (11) section
      (12) title: Introduction...
      (13) content: ...XML...RDF
    (37) section
      (38) title: ...XML...
      ...
    (78) section
      (79) title: Conclusion
      (80) content: ...RDF...
Improving LCA
Summary
No heuristic with perfect precision and recall
Data-driven solutions, not universally applicable
Monotonicity and consistency are desirable but often
violated
In some cases, no heuristic produces suitable results
Improving LCA
Summary
K={Smith,2003}
[Figure: example bib tree, as on the Problem Setting slide]
Improving LCA
Summary
K={Doe, Smith}
[Figure: example bib tree, as on the Problem Setting slide]
Queries as Keywords
RDF
Summarize RDF graph (Q2RDF, Q2Semantic)
Generate queries or query results by connecting
matches
Dijkstra’s Algorithm (Q2RDF)
Kruskal’s Minimum Spanning Tree Algorithm (SPARK)
Templates based on types (SemSearch)
Cost-based heuristics (Q2Semantic)
2.2 Determining Return Values
From an answer node to an answer representation
Determining Return Values
Approach 2: Matched Nodes
Matched nodes (Interconnection Semantics)
(3) title: ...Semantic Web...
(11) last: Smith
Determining Return Values
Approach 3: LCA & Path
Path from LCA to matched nodes
e.g. BANKS, RTF
(2) article
  (3) title: ...Semantic Web...
  (5) authors
    (9) author
      (11) last: Smith
Determining Return Values
Approach 4: Entire Subtree
LCA or entity subtree e.g. SLCA, XRank, VLCA
(2) article
  (3) title: ...Semantic Web...
  (4) year: 2005
  (5) authors
    (6) author
      (7) first: John
      (8) last: Doe
    (9) author
      (10) first: Mary
      (11) last: Smith
  (12) journal: Computer Journal
Determining Return Values
Comparison & Summary
No approach is always satisfying
subtree may be too big
path or node may not provide enough information
More controlled return value is desirable
use query elements to determine a suitable return value automatically
Determining Return Values
Exemplar: XSeek
Processing steps:
1. Match query on node labels and content
2. Group matches using SLCA
3. Extract return nodes from query: If a term in K matches a
node label and no descendant content is matched,
consider the term a return node (explicit)
4. When there are no return nodes, return SLCA entity
subtree (implicit)
Determining Return Values
Exemplar: XSeek
Terms in the query that are not return nodes are
search predicates i.e. keywords to find
Entities and their attributes inferred from the schema
using cardinality information
Final return value:
explicit or implicit return nodes and entities’ attributes
Determining Return Values
Exemplar: XSeek
[Figure: example bib tree, as on the Problem Setting slide, with each node marked as entity node (1:n), attribute node (1:1), or connection node]
Determining Return Values
Exemplar: XSeek
K={Smith, web}
(2) article
  (3) title: ...Semantic Web...
  (4) year: 2005
  (12) journal: Computer Journal
(9) author
  (10) first: Mary
  (11) last: Smith
Determining Return Values
Exemplar: XSeek
K={Smith, title}
(3) title: ...Semantic Web...
(11) last: Smith
Z. Liu, J. Walker, and Y. Chen. XSeek: a semantic XML search engine using keywords. VLDB 2007
Queries as Keywords
Expressiveness
Queries: unordered lists with implicit conjunction
+ Conjunction, disjunction (Multiway, Abbaci et al.)
+ Inclusion, sibling, negation, precedence (Abbaci et al.)
User-selected return value (XSeek, MIU)
Numeric comparison operators (MIU)
Optional terms (Interconnection)
label:keyword terms (Interconnection, XSearch)
Keyword-enhanced languages
Queries as Keywords
Expressiveness
[Figure: example bib tree, as on the Problem Setting slide]
Queries as Keywords
Expressiveness
Transform query into binary tree
Construct set of matching nodes and their ancestors for each term
Bottom-up processing:
Apply operator to sets of nodes (i.e. intersection for conjunction)
The nodes remaining at the root are valid answers (LCA-like)
[Figure: binary query tree with AND at the root over the terms web and NOT semantic, annotated with the node sets computed at each step]
F. Abbaci et al. Index and Search XML Documents by Combining Content. International Conference on Internet Computing 2006
Queries as Keywords
Ranking
Size of answer subtree
Distance between matched nodes and answer tree
root node
tf-idf/vector space model
XRank: PageRank-like ranking
Queries as Keywords
XRank — Ranking factors
Result specificity: vertical distance
Keyword proximity: horizontal distance
Hyperlink awareness: PageRank value, adapted for
XML
distinction between XML edges and IDREF links
Bidirectional propagation between XML edges
distinction between following forward and backward links
Lin Guo et al. XRANK: Ranked keyword search over XML documents. SIGMOD 2003
Queries as Keywords
Limitations
Mostly limited to tree data
Determining semantic entities in structured data
Assumption: No element outside of LCA subtree is relevant
No universal solution, data-driven
Relatively low expressiveness
Queries as Keywords
Limitations
Query answers
Little control over selection
Result may be too verbose or not informative enough
No construction or aggregation
Exception: Keyword-enhanced languages
No querying of data in mixed formats (RDF & XML)
Part 3
KWQL
Keyword-based query language
for Semantic Wikis
Semantic Wikis:
The (Semantic) Web in the Small
As on the Semantic Web we have
pages and links between them and annotations
content created by many different people
But we also have
central control, organization and administration
a small (or at least manageable) number of pages
Strong social factors and collaboration
→Wikis as a testbed for the “real” web
KWQL: Keyword-Based QL for Wikis
Characteristics
KWQL can access all elements the user interacts with
Combined querying of text, annotation and metadata
Querying of informal to formal annotations
Combination of selection criteria from several data sources
in one query
Aggregation and construction
Data construction
Embedded queries
Continuous queries
F. Bry and K. Weiand. Flavors of KWQL, a Keyword Query Language for a Semantic Wiki. SOFSEM, 2010.
KWQL: Keyword-Based QL for Wikis
Characteristics
Varying complexity of queries
Simple label-keyword queries
Conjunction/disjunction/optional
Structural queries
Link traversal
KWQL: Keyword-Based QL for Wikis
Examples
1 Java
2 author:"Mary"
3 ci(text:Java OR (tag(name:XML) AND author:Mary))
4 ci(tag(name:Java) link(target:ci(title:Lucene) tag(name:uses)))
5 ci(title:Contents text:($A "-" ALL($T,","))) @ ci(title:$T author:$A)
KWQL: Keyword-Based QL for Wikis
visKWQL
KWQL’s visual counterpart
Query by example paradigm
Round-tripping between KWQL and visKWQL
Visualization of textual queries
visKWQL as a tool to learn KWQL
KWQL: Keyword-Based QL for Wikis
KWQL and visKWQL
Part 4
Conclusion
Web Queries
Conclusion
Is it even possible to find a universal grouping
mechanism?
How easy to use can a query language be while still
being powerful enough?
How complicated can a query language be without
becoming too hard to use for casual users?
Web Queries
Conclusion
Slides and links at
http://pms.ifi.lmu.de/wise
Keyword-based Web query languages are one significant effort towards combining the virtues of Web search, viz. being accessible to untrained users and able to cope with vastly heterogeneous data, with those of database-style Web queries. These languages operate essentially in the same setting as XQuery or SPARQL but with an interface for untrained users.
Keyword-based query languages trade some of the precision that languages like XQuery enable by allowing users to formulate exactly what data to select and how to process it, for an easier interface accessible to untrained users. The yardstick for these languages becomes an easily accessible interface that does not sacrifice the essential premise of database-style Web queries, that selection and construction are precise enough to fully automate data processing tasks.
To ground the discussion of keyword-based query languages, we give a summary of what we perceive as the main contributions of research and development on Web query languages in the past decade. This summary focuses specifically on what sets Web query languages apart from their predecessors for databases.
Further, this tutorial (1) gives an overview of keyword-based query languages for XML and RDF, (2) discusses where the existing approaches succeed and what, in our opinion, are the most glaring open issues, and (3) shows where, beyond keyword-based query languages, we see the need, the challenges, and the opportunities for combining the ease of use of Web search with the virtues of Web queries.