1. Collaborative development of cross-database
Bio2RDF queries
Peter Ansell
Microsoft Queensland University of Technology
eResearch Centre
p.ansell@qut.edu.au
2. Introduction
● Large number of cross-disciplinary datasources in
different locations
● Scientists require simplified access to many of the
datasources for complex research
● Recently, a large number of datasources have been
published as RDF
● Now, we need to learn how to query across them and
be able to share that knowledge with others
Sydney, Australia 3rd eResearch Australasia Conference 9-13 Nov 2009
4. Linked Data
1) Use URIs as names for things
2) Use HTTP URIs so that people can look up those
names.
3) When someone looks up a URI, provide useful
information, using the standards (RDF, SPARQL)
4) Include links to other URIs, so that they can
discover more things.
http://www.w3.org/DesignIssues/LinkedData.html
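Principles 2 and 3 can be illustrated with a small sketch of how a client might look up a name. It only builds the HTTP request and does not send it. The example URI and the choice of media types in the Accept header are assumptions made for illustration, not something the slides specify.

```python
from urllib.request import Request


def build_lookup_request(uri: str) -> Request:
    """Build (but do not send) an HTTP GET request that asks for RDF."""
    # Principle 2: names should be HTTP URIs so that they can be looked up.
    if not uri.startswith(("http://", "https://")):
        raise ValueError("Linked Data names should be HTTP URIs")
    # Principle 3: ask the server for a standard RDF representation.
    # The media types listed here are an assumed preference order.
    accept = "application/rdf+xml, text/turtle;q=0.9"
    return Request(uri, headers={"Accept": accept})


# Hypothetical URI, used only for illustration.
request = build_lookup_request("http://example.org/thing/1")
```

Passing the request to urllib.request.urlopen would perform the actual lookup.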
5. Linked Data querying
● Strategy 1 (Naive):
– Retrieve resources
– Retrieve linked resources
– Cache the resources and perform queries locally
● Strategy 2 (Search engine):
– Retrieve resources
– Retrieve directly linked resources
– Query a semantic search engine for related resources
– Cache the resources and perform queries locally
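A minimal sketch of Strategy 1, assuming triples are plain (subject, predicate, object) tuples and that resolving a URI is simulated by a dictionary lookup. All names here (FAKE_WEB, fetch, crawl, query_local) are hypothetical. Strategy 2 would differ only in that a search engine contributes extra URIs to the queue.

```python
from collections import deque

# Hypothetical stand-in for the web: URI -> triples returned when it is resolved.
FAKE_WEB = {
    "http://example.org/a": [
        ("http://example.org/a", "http://example.org/linksTo", "http://example.org/b"),
    ],
    "http://example.org/b": [
        ("http://example.org/b", "http://example.org/label", "B"),
    ],
}


def fetch(uri):
    """Simulate resolving a URI; a real client would issue an HTTP request."""
    return FAKE_WEB.get(uri, [])


def crawl(start_uri, fetch, max_depth=1):
    """Retrieve a resource and its linked resources into a local cache."""
    cache = set()
    seen = {start_uri}
    queue = deque([(start_uri, 0)])
    while queue:
        uri, depth = queue.popleft()
        for s, p, o in fetch(uri):
            cache.add((s, p, o))
            # Follow object URIs as links until the depth limit is reached.
            if depth < max_depth and o.startswith("http://") and o not in seen:
                seen.add(o)
                queue.append((o, depth + 1))
    return cache


def query_local(cache, predicate):
    """Answer a simple query from the cache without further network access."""
    return sorted((s, o) for s, p, o in cache if p == predicate)


cache = crawl("http://example.org/a", fetch)
labels = query_local(cache, "http://example.org/label")
```

The cache grows with every hop followed, which is the cost that the distributed approach tries to avoid.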
6. Linked Data querying
● Strategy 3 (Distributed query):
– Mix SPARQL endpoint queries with URI based
resolution to avoid having a large local cache
– Normalise results from each site to form final query
result
– Users can process the results, and perform one or more
queries based on their interpretation of the results
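The normalisation step can be sketched as follows, assuming each site uses its own URI prefix for the same records and that each provider is a function returning triples. The provider functions, prefixes and predicate URIs are hypothetical, and a real provider would send a SPARQL query or resolve a URI instead of returning fixed data.

```python
def normalise(value, prefix_map):
    """Rewrite a site-specific URI prefix to the shared prefix."""
    for site_prefix, shared_prefix in prefix_map.items():
        if value.startswith(site_prefix):
            return shared_prefix + value[len(site_prefix):]
    return value


def distributed_query(providers, prefix_map):
    """Ask each provider, normalise its results, and merge them."""
    merged = set()
    for provider in providers:
        for s, p, o in provider():
            merged.add((normalise(s, prefix_map), p, normalise(o, prefix_map)))
    return merged


def endpoint_provider():
    """Hypothetical stand-in for a SPARQL endpoint query."""
    return [("http://site-a.example/go/0000345",
             "http://example.org/label", "example label")]


def resolver_provider():
    """Hypothetical stand-in for URI-based resolution at another site."""
    return [
        ("http://site-b.example/term?id=0000345",
         "http://example.org/label", "example label"),
        ("http://site-b.example/term?id=0000345",
         "http://example.org/seeAlso", "http://site-a.example/go/0000001"),
    ]


PREFIX_MAP = {
    "http://site-a.example/go/": "http://example.org/go:",
    "http://site-b.example/term?id=": "http://example.org/go:",
}

results = distributed_query([endpoint_provider, resolver_provider], PREFIX_MAP)
```

Because both sites describe the same record, the duplicate label statement collapses into one triple once the URIs are normalised.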
7. Bio2RDF distributed queries
● Assign Namespaces to providers
● Query across relevant providers given a user's query
● Aggregate all results into a single RDF document
and return to the user
● It works: 700,000 queries during the last month
● The largest dataset, the Protein Data Bank, has 10
billion triples; the others together make up about
5 billion triples
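The routing and aggregation steps above might look like the following sketch. It assumes queries take the form namespace:identifier (as in go:0000345), that the registry maps a namespace to provider functions, and that every result is a URI-to-URI statement so it can be written as one N-Triples line. The registry contents and provider functions are hypothetical and do not reflect the actual Bio2RDF configuration.

```python
def split_identifier(query):
    """Split 'namespace:identifier' into its two parts."""
    namespace, sep, identifier = query.partition(":")
    if not (sep and namespace and identifier):
        raise ValueError(f"expected 'namespace:identifier', got {query!r}")
    # Lower-casing the namespace is an assumption of this sketch.
    return namespace.lower(), identifier


def aggregate(query, registry):
    """Query the providers assigned to the namespace and merge the results
    into a single N-Triples document (URI objects only in this sketch)."""
    namespace, _ = split_identifier(query)
    triples = set()
    for provider in registry.get(namespace, []):
        triples.update(provider(query))
    return "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in sorted(triples))


def provider_a(query):
    """Hypothetical provider returning one cross-reference."""
    return [(f"http://example.org/{query}", "http://example.org/xref",
             "http://example.org/pdb:1abc")]


def provider_b(query):
    """Hypothetical provider that overlaps with provider_a."""
    return provider_a(query) + [
        (f"http://example.org/{query}", "http://example.org/xref",
         "http://example.org/go:0000001"),
    ]


REGISTRY = {"go": [provider_a, provider_b], "pdb": []}

document = aggregate("go:0000345", REGISTRY)
```

Using a set for the intermediate results means a statement returned by several providers appears only once in the final document.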
9. Collaboration
● Query and provider definitions have an RDF
representation
● Any other person is able to take a definition and
change it to suit their needs and redistribute their
definition
● If definitions are Linked Data themselves, as the
Bio2RDF configuration items are, the HTTP URI can
be used by others to pull the definition into their
own software
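As a rough illustration of taking a definition and adapting it, the sketch below models a query definition as an immutable record and derives a modified copy under a new URI. The field names, URIs and the query template are hypothetical and are not the Bio2RDF configuration vocabulary.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QueryDefinition:
    """Hypothetical query definition; a real one would be expressed in RDF."""
    uri: str            # HTTP URI that identifies the definition
    namespaces: tuple   # namespaces the query applies to
    template: str       # query text with a {uri} placeholder
    derived_from: str = ""  # URI of the definition this one was adapted from


original = QueryDefinition(
    uri="http://example.org/query/describe",
    namespaces=("go",),
    template="DESCRIBE <{uri}>",
)

# Someone else adapts the definition and republishes it under their own URI.
adapted = replace(
    original,
    uri="http://example.net/query/describe-extended",
    namespaces=original.namespaces + ("pdb",),
    derived_from=original.uri,
)

query_text = adapted.template.format(uri="http://example.org/go:0000345")
```

Keeping the original untouched and recording where the copy came from lets both versions coexist and be referenced by URI.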
10. Provenance for queries and data
● Provenance can be attached to each item, including
details such as OpenID URIs and dates
● The sources and queries that were used for a
particular URI can be found by utilising the query
plan option
– http://bio2rdf.org/queryplan/label/go:0000345
– Enables automatic query collaboration, as you could
take this query plan and add or modify queries
inside the query plan
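One way to picture modifying a query plan is the sketch below, which treats a plan as a list of steps and attaches provenance (a contributor URI and a date) to each added step. The plan structure, field names, provider names and URIs are assumptions made for illustration; the actual query plan format returned by the server is not shown in these slides.

```python
from datetime import date


def add_query(plan, provider, query, contributor, when):
    """Return a new plan with one extra step; the input plan is not modified."""
    step = {
        "provider": provider,
        "query": query,
        # Provenance: who contributed the step (e.g. an OpenID URI) and when.
        "provenance": {"contributor": contributor, "date": when.isoformat()},
    }
    return plan + [step]


# Hypothetical starting plan, standing in for one retrieved from a server.
base_plan = [
    {
        "provider": "provider_a",
        "query": "label",
        "provenance": {"contributor": "http://example.org/openid/alice",
                       "date": "2009-11-01"},
    },
]

extended_plan = add_query(
    base_plan,
    provider="provider_b",
    query="label",
    contributor="http://example.org/openid/bob",
    when=date(2009, 11, 9),
)
```

Since each step carries its own provenance, a reader of the extended plan can tell which queries came from which contributor.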
11. Conclusion
● Many large distributed datasources
● Single interface, RDF
● Distribute queries efficiently across the endpoints
● Enable people to publish definitions of what they
did so that others can collaborate, via a shared
server or by copy and paste