Fosdem 2013 petra selmer flexible querying of graph data

Flexible querying of graph data

Graph processing room
FOSDEM, 2 Feb 2013

Petra Selmer
petra.selmer.uk@gmail.com
http://www.dcs.bbk.ac.uk/~lselm01/

Introduction

 I shall be presenting my PhD topic which involves
a declarative query language allowing for the
flexible querying of graph-structured data with
complex paths.


Agenda

 Who (am I)?
 Why (the motivation)?
 Some background info
 What (is the query language and what
can it do)?
 Illustrative examples
 How (is it done)?


Who?

 Petra Selmer
 Part-time PhD student:
 Birkbeck College, University of London
 Prof. Alexandra Poulovassilis
 Dr. Peter T. Wood
 Software Architect:
 University College London’s Institute of Neurology
(Wellcome Trust Centre for Neuroimaging)


Why?

 Amount of graph-structured data is
growing fast
 The structure of this data is
becoming more complex, especially
when multiple, heterogeneous data
sources are integrated together
 The structure of the data is also
always subject to change...


Why?
 Users of such systems may not be familiar with the underlying data
structure: available paths etc
 The user may not be able to obtain meaningful answers (or indeed,
any answers) from the data IF the querying system is limited to exact
matching of users’ queries
 Also, the user may wish to explore the data by starting from a set of
initial answers and proceeding from there
 The user may additionally wish to derive some intelligence from the
connections....

[Figure: the data, the query and the user]


Background: Ontologies

 Currently part of the Semantic Web stack (Tim Berners-Lee, RDF, triple stores)
 Models a domain of interest: inferences, reasoning...
 It can be thought of as a “schema” for graph data
 The following inference rules are included (among
others):
 Subclass: ‘History’, ‘Languages’ are subclasses of
‘Humanities’
 Subproperty, Domain, Range...


What?
 Data model: G = (V, E)
 Very general model
 V : vertices (or nodes); each labelled with some
constant
 E : directed, labelled edges; labels drawn from an
alphabet Σ ∪ {‘type’}
 The query language is called Flex-It (it is
declarative)
 The basis is that of conjunctive regular path
queries
 There are two operators which may be applied to the
original query
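As an illustration of the data model only (not the prototype's storage layer), a directed, edge-labelled graph can be sketched in a few lines of Python. The class name `Graph`, its methods and the node names `ep1` and `ep2` are assumptions made for this example.

```python
from collections import defaultdict


class Graph:
    """Directed graph G = (V, E) whose edges carry a label."""

    def __init__(self):
        self.nodes = set()                 # V
        self.out = defaultdict(list)       # source -> [(label, target)], i.e. E

    def add_edge(self, source, label, target):
        self.nodes.update((source, target))
        self.out[source].append((label, target))

    def successors(self, node, label):
        """Targets reachable from node over a single edge with this label."""
        return [t for (l, t) in self.out.get(node, []) if l == label]


g = Graph()
g.add_edge("ep1", "type", "University")    # 'type' is the reserved label
g.add_edge("ep1", "prereq", "ep2")         # any other label comes from the alphabet
```

Storing edges per source node keeps a path traversal down to one dictionary lookup per step.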


What?
 Conjunctive regular path queries:
 The paths to be traversed in the graph are specified by a
regular expression
 A single regular path query conjunct: (X, R, Y)
 X, Y: either constants or variables
 R: the regular expression
 “Conjunctive”: joining multiple conjuncts; e.g. (X, R1, Y), (Y,
R2, Z), (Z, R3, A)
 Conjuncts are joined on shared variables: the Y’s are matched, the Z’s are matched, etc.

Example graph: N1 -n-> N2 -n-> N3 -p-> N4
1) (N1, n+, ?Y): Y = N2, N3
2) (N1, n*p, ?Y): Y = N4
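The two example queries can be checked with a small sketch. It assumes the example graph is the chain N1 -n-> N2 -n-> N3 -p-> N4 and writes each regular expression by hand as an NFA (a transition table, an initial state and a set of accepting states). The function `evaluate` and this NFA encoding are illustrative names, not part of Flex-It.

```python
from collections import deque

# Assumed example graph: a chain with edge labels n, n, p.
EDGES = [("N1", "n", "N2"), ("N2", "n", "N3"), ("N3", "p", "N4")]

# Hand-written NFAs: (transitions, initial state, accepting states).
N_PLUS = ({(0, "n"): {1}, (1, "n"): {1}}, 0, {1})      # n+
N_STAR_P = ({(0, "n"): {0}, (0, "p"): {1}}, 0, {1})    # n*p
P = ({(0, "p"): {1}}, 0, {1})                          # p


def evaluate(start, nfa, edges=EDGES):
    """Nodes Y such that some path from start to Y spells a word the NFA accepts."""
    transitions, initial, accepting = nfa
    seen = {(start, initial)}
    queue = deque(seen)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            answers.add(node)
        for source, label, target in edges:
            if source != node:
                continue
            for next_state in transitions.get((state, label), ()):
                pair = (target, next_state)
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return answers


# Two conjuncts (N1, n+, ?Y), (?Y, p, ?Z): the Y's are matched by joining on Y.
pairs = {(y, z) for y in evaluate("N1", N_PLUS) for z in evaluate(y, P)}
```

Here `evaluate("N1", N_PLUS)` gives N2 and N3, `evaluate("N1", N_STAR_P)` gives N4, and the joined query keeps only the pair (N3, N4). Tracking visited (node, state) pairs keeps the search finite even when the graph has cycles.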

What?
 Approximation allows for the approximate matching
of labels in the path
 An edit operation is applied to each edge label in
the path denoted by the regular expression:
 Edit operations: insertions, deletions, inversions,
substitutions and transpositions of labels
 Each operation has a ‘cost’: usually 1
 Example:
 Query conjunct: (X, a*.b, Y)
 R = a*.b [answers returned at cost 0]
 R’ = p.a*.b (insertion of ‘p’) [answers returned at cost 1]
 R’’ = p.a*.b- (inversion of ‘b’) [answers returned at cost 2]
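The costs in this example can be reproduced with a rough sketch that computes the cheapest sequence of edit operations under which a given path's label sequence is accepted by the query's NFA. It makes several assumptions: every operation costs 1, an inverted label such as `b-` is treated as a distinct label (so an inversion is counted like a substitution), and transpositions are left out. The name `approx_cost` and the NFA encoding are illustrative.

```python
import heapq

# Hand-written NFA for a*.b : (transitions, initial state, accepting states).
A_STAR_B = ({(0, "a"): {0}, (0, "b"): {1}}, 0, {1})


def approx_cost(word, nfa):
    """Lowest total edit cost at which the label sequence `word` is accepted."""
    transitions, initial, accepting = nfa
    best = {}
    heap = [(0, 0, initial)]                  # (cost, position in word, NFA state)
    while heap:
        cost, i, state = heapq.heappop(heap)
        if (i, state) in best:
            continue
        best[(i, state)] = cost
        if i == len(word) and state in accepting:
            return cost
        for (src, label), targets in transitions.items():
            if src != state:
                continue
            for nxt in targets:
                # Deletion: drop a query label without reading a path label.
                heapq.heappush(heap, (cost + 1, i, nxt))
                if i < len(word):
                    # Exact match (cost 0) or substitution/inversion (cost 1).
                    step = 0 if word[i] == label else 1
                    heapq.heappush(heap, (cost + step, i + 1, nxt))
        if i < len(word):
            # Insertion: an extra label is added to the query to absorb word[i].
            heapq.heappush(heap, (cost + 1, i + 1, state))
    return None
```

With this sketch, `approx_cost(["a", "a", "b"], A_STAR_B)` is 0, `approx_cost(["p", "a", "b"], A_STAR_B)` is 1 and `approx_cost(["p", "a", "b-"], A_STAR_B)` is 2, matching the costs of R, R’ and R’’ above.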


What?
 Relaxation is applied by using inference
rules from an ontology (if one exists).
 Achieved by applying logical relaxation of the query
conditions using the data’s ontology definition
 Relaxation operations: subclass, subproperty, domain
and range
 Each operation has a ‘cost’ – usually 1
 Example:
 We have an ontology:
 Humanities (superclass)
 Languages and History (subclasses of Humanities)
 Assume our query states Languages may be relaxed
 Languages is relaxed to Humanities:
 Instances of Languages will be returned at cost 0
 Instances of History will be returned at cost 1
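A minimal sketch of subclass relaxation on this ontology follows. It assumes each class has at most one direct superclass and that each relaxation step costs 1. The instance names `French` and `ModernHistory` are hypothetical, as are the function names.

```python
# The ontology from the example: child class -> direct superclass.
SUBCLASS_OF = {"Languages": "Humanities", "History": "Humanities"}

# Hypothetical instance data, invented for this sketch.
INSTANCES = {"Languages": ["French"], "History": ["ModernHistory"]}


def relax(cls):
    """Yield (cost, class): the class itself, then its superclasses, one step per unit cost."""
    cost = 0
    while cls is not None:
        yield cost, cls
        cls = SUBCLASS_OF.get(cls)
        cost += 1


def descendants(cls):
    """The class together with all of its direct and indirect subclasses."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for sub, sup in SUBCLASS_OF.items():
            if sup in found and sub not in found:
                found.add(sub)
                changed = True
    return found


def relaxed_instances(cls):
    """Map each instance to the lowest relaxation cost at which it is returned."""
    result = {}
    for cost, relaxed_cls in relax(cls):
        for sub in descendants(relaxed_cls):
            for instance in INSTANCES.get(sub, []):
                result.setdefault(instance, cost)   # keep the cheapest cost seen
    return result
```

Calling `relaxed_instances("Languages")` returns the Languages instance at cost 0 and the History instance at cost 1, as in the example.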


What?

 Answers are ranked according to how
closely they match the original query;
higher-cost answers have a lower ranking
 All answers at a certain distance d are
ranked the same and returned before
answers at a higher distance
 We allow for incremental execution: exact
answers returned first; then answers at
distance 1; ...
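The ranking behaviour can be sketched with a priority queue that hands back one batch of answers per distance. This is a simplification: it assumes the scored answers are already available, whereas an incremental evaluator would compute each batch only when asked for it. The sample answers are taken from the examples later in this deck; `ranked_batches` is an illustrative name.

```python
import heapq

# (distance, (Y, Z)) pairs, as reported in the examples later in the deck.
ANSWERS = [
    (2, ("ep24", "AssistantEditor")),
    (3, ("ep32", "PersonalAssistant")),
    (1, ("ep22", "AirTravelAssistant")),
    (2, ("ep23", "Journalist")),
]


def ranked_batches(scored_answers):
    """Yield (distance, answers) batches in order of increasing distance."""
    heap = list(scored_answers)
    heapq.heapify(heap)
    while heap:
        distance = heap[0][0]
        batch = []
        # All answers at the same distance are ranked the same.
        while heap and heap[0][0] == distance:
            batch.append(heapq.heappop(heap)[1])
        yield distance, sorted(batch)


for distance, batch in ranked_batches(ANSWERS):
    print(distance, batch)
```

Because `ranked_batches` is a generator, a caller can stop after the first batch if the lowest-cost answers are enough.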

Example – ‘Lifelong learner metadata’

[Figure: example data graph for the lifelong learner metadata; only the labels ‘sc’ and ‘History’ survive]


 Query: “What work positions can I reach, having a degree in English”?
 Y = the episode; Z = the job
(?Y, ?Z) ←
(?X, type, University),
(?X, qualif.type, EnglishStudies),
(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)

 Query: “What work positions can I reach, having a degree in English”?
 Y = the episode; Z = the job
(?Y, ?Z) ←
(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
 No results from User 2 will be returned...even though it is relevant!

 Allowing query approximation can yield some answers:
 Replacing the edge label prereq by next, at an edit cost of 1, we get this variant of the
query:
(?Y, ?Z) ←
APPROX(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
 prereq+ can be approximated by next.prereq* at edit distance 1:
 Result: Y = ep22, Z = AirTravelAssistant

 Allowing query approximation can yield some answers:
 Replacing the edge label prereq by next, at an edit cost of 1, we get this
variant of the query:
(?Y, ?Z) ←
APPROX(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
 next.prereq* can be approximated by next.next.prereq*, now at edit distance 2:
 Results:
 Y = ep23, Z = Journalist
 Y = ep24, Z = AssistantEditor

 Query: “What jobs are open to me if I study English, or something similar, at University”?
(?Y, ?Z) ←
(?X, type, University), (?X, qualif, ?D),
RELAX (?D, type, EnglishStudies),
APPROX (?X, prereq+, ?Y),
(?Y, type, Work), (?Y, job.type, ?Z)
 In addition to the answers (from User 2) obtained by the previous query, we now also have
answers from the timeline of User 3
 prereq+ can be approximated by next.prereq* (distance 1) and EnglishStudies can be relaxed,
via Languages, to Humanities (distance 2), which encompasses History
 Result: Y = ep32, Z = PersonalAssistant (distance of 3 from original query)

 Query: “What jobs are open to me if I study English, or something similar, at
University”?
(?Y, ?Z) ←
(?X, type, University), (?X, qualif, ?D),
RELAX (?D, type, EnglishStudies),
APPROX (?X, prereq+, ?Y),
(?Y, type, Work), (?Y, job.type, ?Z)
 next.prereq* can be approximated by next.next.prereq* (distance 2), with
EnglishStudies again relaxed to Humanities (distance 2)
 Results: (both at distance 4 from the original query)
 Y = ep33, Z = Author
 Y = ep34, Z = AssociateEditor

How?
 Theory
 Construction of a weighted non-deterministic finite
automaton (NFA) to represent the regular expression
 We add new states and transitions to the NFA to represent the
approximation and relaxation operations
 Formation of a product automaton: NFA with data
graph G
 We perform a lowest cost path traversal of the product
automaton; construct query tree, do joins etc
 Polynomial time complexity
 Correctness of algorithms proven
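The lowest-cost traversal can be sketched as Dijkstra's algorithm over pairs (graph node, NFA state), so the product automaton is explored on the fly rather than built up front. This is a simplified illustration, not the prototype's C# code: only label substitution (cost 1) is modelled, and the timeline in `EDGES` is hypothetical, as are all the names.

```python
import heapq

# Hypothetical timeline, invented for this sketch.
EDGES = [("e1", "next", "e2"), ("e2", "next", "e3"), ("e3", "prereq", "e4")]

# NFA for prereq+ : (transitions, initial state, accepting states).
PREREQ_PLUS = ({(0, "prereq"): {1}, (1, "prereq"): {1}}, 0, {1})


def approx_answers(edges, nfa, start, max_cost):
    """Map each answer node to the lowest cost at which it is reached, up to max_cost."""
    transitions, initial, accepting = nfa
    done = set()
    answers = {}
    heap = [(0, start, initial)]              # (cost, graph node, NFA state)
    while heap:
        cost, node, state = heapq.heappop(heap)
        if cost > max_cost or (node, state) in done:
            continue
        done.add((node, state))
        if state in accepting and node not in answers:
            answers[node] = cost              # first visit is the cheapest
        for source, label, target in edges:
            if source != node:
                continue
            for (src, query_label), next_states in transitions.items():
                if src != state:
                    continue
                # Exact match costs 0; substituting the edge's label costs 1.
                step = 0 if label == query_label else 1
                for next_state in next_states:
                    heapq.heappush(heap, (cost + step, target, next_state))
    return answers
```

On this toy timeline, `approx_answers(EDGES, PREREQ_PLUS, "e1", 2)` reaches e2 at cost 1 and e3 and e4 at cost 2, while an exact query (`max_cost` of 0) returns nothing. Each (node, state) pair is finalised at most once, which keeps this sketch polynomial in the size of the graph and the automaton.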


How?

 Implementation of prototype
 Graph database: DEX (http://www.sparsity-technologies.com/dex)
 Programming language: C#
 Further work
 New flexible operation combining APPROX and
RELAX → FLEX
 Optimisation!


Any questions?

Thank you for your attention!

petra.selmer.uk@gmail.com