ISWC 2017 In-Use paper. Despite the advantages of Linked Data as a data integration paradigm, accessing and consuming Linked Data is still a cumbersome task. Linked Data applications need to use technologies such as RDF and SPARQL that, despite their expressive power, belong to the data integration stack. As a result, applications and data cannot be cleanly separated: SPARQL queries, endpoint addresses, namespaces, and URIs end up as part of the application code. Many publishers address these problems by building RESTful APIs around their Linked Data. However, this solution has two pitfalls: these APIs are costly to maintain, and they black-box functionality by hiding the queries they use. In this paper we describe grlc, a gateway between Linked Data applications and the LOD cloud that offers a RESTful, reusable and uniform means to routinely access any Linked Data. It generates an OpenAPI-compatible API by using parametrized queries shared on the Web. The resulting APIs require no coding, rely on low-cost external query storage and versioning services, contain abundant provenance information, and integrate access to different publishing paradigms into a single API. We evaluate grlc qualitatively, by describing its reported value to current users, and quantitatively, by measuring the overhead added when generating API specifications and answering calls.
AUTOMATIC QUERY-CENTRIC API FOR ROUTINE ACCESS TO LINKED DATA
Albert Meroño-Peñuela
Rinke Hoekstra
Vrije Universiteit Amsterdam
ISWC 2017, October 24th
LINKED DATA ACCESS
Linked Data is great for information integration on the Web, but:
> Heterogeneity of access methods: SPARQL, #LD, dumps, HTML/RDFa, LDA
> Hard technological requirements: RDF, SPARQL
> Coupling with Linked Data specific libraries
> Web developers want APIs and JSON
Queries are second-class Web citizens (i.e. volatile):
> Lost after execution (reusability?)
> Multiple out-of-sync instances if shared among applications (reliability?)
How can we make semantic queries automatically repeatable for Linked Data consumers?
LINKED DATA APIS
OpenPHACTS: RESTful entry point to Linked Data hubs for Web applications
Query = Service = URI
However:
• The API (e.g. Swagger spec, code itself) still needs to be coded and maintained
• SPARQL is hidden from clients, which blocks query reuse
BASIL
Automatically builds Swagger specs and API code
Takes SPARQL queries as input (1 API operation = 1 SPARQL query)
> API call functionality limited to SPARQL expressivity
Makes SPARQL queries uniquely referenceable by using their equivalent LDA operation
> Stores SPARQL internally
> But we already have uniquely referenceable SPARQL…
Writing SPARQL by trial and error calls for versioning support
A variety of access interfaces is needed
GITHUB AS A HUB OF SPARQL QUERIES
One .rq file per SPARQL query
Good support for query curation processes:
> Versioning
> Branching
> Clone-pull-push
Web-friendly features!
> One URI per query
> Uniquely identifiable
> De-referenceable (raw.githubusercontent.com)
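Because each query lives in its own .rq file, it can be dereferenced directly from raw.githubusercontent.com. A minimal sketch of how such a raw URL is assembled (the owner, repository and file names below are illustrative, not an actual repo):

```python
def raw_query_url(owner: str, repo: str, filename: str, ref: str = "master") -> str:
    """Build the raw.githubusercontent.com URL that dereferences a .rq file."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{filename}"

# Hypothetical repository and query file, for illustration only.
print(raw_query_url("CLARIAH", "wp4-queries", "MyQuery.rq"))
# → https://raw.githubusercontent.com/CLARIAH/wp4-queries/master/MyQuery.rq
```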
GRLC
Same basic principle as BASIL: 1 SPARQL query = 1 API operation
Automatically builds Swagger spec and UI from SPARQL
Answers API calls
But:
> External query management
> Organization of SPARQL queries in the GitHub repo matches the organization of the API
> Thin layer – nothing stored server-side
> Maps the GitHub API onto the Swagger spec
THE GRLC SERVICE
Assuming your repo is at https://github.com/:owner/:repo
> http://grlc.io/api/:owner/:repo/spec returns the JSON swagger spec
> http://grlc.io/api/:owner/:repo/ returns the swagger UI
> http://grlc.io/api/:owner/:repo/:operation?p_1=v_1...p_n=v_n calls operation with specific parameter values
> Uses BASIL’s SPARQL variable name convention for query parameters
Sends requests to
> https://api.github.com/repos/:owner/:repo to look for SPARQL queries and their decorators
> https://raw.githubusercontent.com/:owner/:repo/master/file.rq to dereference queries, get the SPARQL, and parse it
> Supports versioning through http://grlc.io/api/:owner/:repo/commit/:sha
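The idea behind BASIL's variable name convention is that specially prefixed SPARQL variables (`?_name` for mandatory parameters, `?__name` for optional ones) become the API operation's parameters. A simplified sketch of the mechanism, not grlc's actual code (BASIL's full convention also encodes parameter types in suffixes, which this sketch omits):

```python
import re

# Match BASIL-style parameter variables: ?_name (mandatory) or ?__name (optional).
PARAM_RE = re.compile(r"\?(_{1,2})([A-Za-z0-9]+)")

def extract_params(query: str) -> dict:
    """Map parameter name -> True if mandatory, False if optional."""
    return {name: underscores == "_" for underscores, name in PARAM_RE.findall(query)}

def rewrite_query(query: str, values: dict) -> str:
    """Substitute caller-supplied values for the parameter variables."""
    for name, value in values.items():
        query = re.sub(r"\?_{1,2}" + re.escape(name) + r"\b", value, query)
    return query

q = "SELECT ?type WHERE { ?_city <http://example.org/hasType> ?type }"
print(extract_params(q))   # {'city': True}
print(rewrite_query(q, {"city": "<http://example.org/Amsterdam>"}))
```

Plain variables such as `?type` are untouched, so the projected results of the query are unaffected by parameter substitution.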
SPARQL + RDFA + DUMPS + #LD
• Compatible with most Linked Data access methods
• Loads remote RDFa/dumps in memory
• Uses TPF for #LD servers
• Mixes all these into one homogeneous API
PROVENANCE
• Two sources: query history, spec generation
• Uses W3C PROV
• Uses Git2PROV to get query history
• Adds spec provenance at generation time
• Visualizations with PROV-O-Viz (http://provoviz.org/)
ENUMERATIONS & DROPDOWNS
• Fills in the swag[paths][op][method][parameters][enum] array
• Evaluates the triple pattern of the SPARQL query’s BGP against the same SPARQL endpoint to obtain the allowed values
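A rough sketch of the idea, assuming a Swagger-style spec dict (the helper names and spec layout are illustrative, not grlc's actual code): an allowed-values query is derived from the triple pattern mentioning the parameter variable, and the (externally obtained) results are written into the spec's enum array.

```python
def enum_query(triple_pattern: str, param_var: str) -> str:
    """SPARQL query asking the endpoint for the distinct values a parameter may take."""
    return f"SELECT DISTINCT {param_var} WHERE {{ {triple_pattern} }}"

def fill_enum(spec: dict, path: str, method: str, param_name: str, values: list) -> dict:
    """Write the allowed values into swag[paths][op][method][parameters][enum]."""
    for param in spec["paths"][path][method]["parameters"]:
        if param["name"] == param_name:
            param["enum"] = sorted(values)
    return spec

print(enum_query("?_city <http://example.org/hasType> ?type", "?_city"))
spec = {"paths": {"/houseType_all": {"get": {"parameters": [{"name": "city"}]}}}}
fill_enum(spec, "/houseType_all", "get", "city", ["Utrecht", "Amsterdam"])
```

In the Swagger UI this enum array is what turns a free-text parameter field into a dropdown.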
CONTENT NEGOTIATION
• API endpoints can now end with .content_type (e.g. grlc.io/CLARIAH/wp-queries/MyQuery.csv)
• Supports .csv, .json, .html (can be extended)
• grlc sets the ‘Accept’ HTTP header and agnostically returns the same ‘Content-Type’ as the SPARQL endpoint
• It is up to the SPARQL endpoint to honor it
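A minimal sketch of this extension-based negotiation; the extension-to-MIME table below is illustrative and grlc's real mapping may differ:

```python
# Illustrative mapping from URL extension to the Accept header value forwarded
# to the SPARQL endpoint.
EXTENSION_TO_MIME = {
    ".csv": "text/csv",
    ".json": "application/json",
    ".html": "text/html",
}

def negotiate(path: str):
    """Split a trailing .content_type extension off an API path.

    Returns (Accept header value or None, bare operation path)."""
    for ext, mime in EXTENSION_TO_MIME.items():
        if path.endswith(ext):
            return mime, path[: -len(ext)]
    return None, path  # no extension: let the endpoint pick its default

print(negotiate("grlc.io/CLARIAH/wp-queries/MyQuery.csv"))
# → ('text/csv', 'grlc.io/CLARIAH/wp-queries/MyQuery')
```

Extending the supported formats then amounts to adding an entry to the table.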
PAGINATION
• Large query results are typically nasty for consuming applications
• Split the result into multiple parts (or “pages”)
• Size? #+ pagination: 100
• Navigating pages:
  • rel=next,prev,first,last links in the HTTP headers (GitHub API traversal convention)
  • Extra request parameter ?page (defaults to 1)

~ curl -X GET -H "Accept: text/csv" -I http://localhost:8088/api/CEDAR-project/Queries/houseType_all
HTTP/1.0 200 OK
Content-Type: text/csv; charset=UTF-8
Content-Length: 18447
Server: grlc/1.0.0
Link: <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=next,
 <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last

~ curl -X GET -H "Accept: text/csv" -I http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=3
HTTP/1.0 200 OK
Content-Type: text/csv; charset=UTF-8
Content-Length: 18142
Server: grlc/1.0.0
Link: <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=4>; rel=next,
 <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=prev,
 <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=1>; rel=first,
 <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last
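On the consuming side, a client can walk these pages by reading the Link header. A client-side helper sketch (not part of grlc), which assumes the URLs themselves contain no commas, as in the responses above:

```python
def parse_link_header(link: str) -> dict:
    """Parse a Link header in the GitHub traversal convention into {rel: url}."""
    rels = {}
    for part in link.split(","):
        url_part, _, rel_part = part.partition(";")
        url = url_part.strip().lstrip("<").rstrip(">")
        rel = rel_part.strip()
        if rel.startswith("rel="):
            rels[rel[len("rel="):].strip('"')] = url
    return rels

links = ("<http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=4>; rel=next, "
         "<http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=prev")
print(parse_link_header(links))
```

A client then simply follows `rel=next` until it is absent, mirroring how the GitHub API itself is traversed.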
DOCKER CONTAINER
• Uses Docker for an infrastructure-independent install
• Bundles (composes) all required packages (Python, Python libs, grlc, nginx); can be easily extended with more
• Publicly available at hub.docker.com
• One-command server deploy: docker pull clariah/grlc
QUALITATIVE EVALUATION
915 unique visitors since July 2016
1,878 sessions
46.4% return rate
5 active open source contributors, 14 pull requests
A community of users and devs
QUALITATIVE EVALUATION
> “multiple copies of the same queries in different places (…) was problematic. grlc allows queries to be maintained in a single location”
> “with grlc the R code becomes clearer due to the decoupling with SPARQL; and shorter, since a curl suffices to retrieve the data”
> “it allows us to manage SPARQL queries separate from the rest of the API – this enables, for instance, to have different queries without having to deploy a new version of the API”
> “we use grlc to provide FAQ for those who would prefer REST over SPARQL, but also to explore the data”
> “we use grlc to expose the ECAI conference proceedings not only as Linked Data that can be used by Semantic Web practitioners, but also as a Web API that web developers can consume”
> “grlc helps to share, extend and repurpose queries by providing a URI for the resulted queries and by supporting collaborative update of those queries”
CONCLUSIONS
The spectrum of Linked Data clients: SPARQL-intensive applications vs RESTful API applications
grlc uses decoupling of SPARQL from all client applications (including LDA) as a powerful practice
Separates query curation workflows from everything else
Allows at the same time:
> Web-friendly SPARQL queries
> Web-friendly RESTful APIs
(Moderate) costs, mainly due to HTTP requests and query payload
> SPARQL projections
> Reusability of query catalogs
THANK YOU!
@ALBERTMERONYO
DATALEGEND.NET
CLARIAH.NL
Speaker notes:
> Various access methods: SPARQL, dumps, #LD, etc.
> Our community knows little of these, yet understands the benefits and wants to use Linked Data in their Linked Data consuming applications
> Queries become non-reusable; they must be kept in sync across different applications
> Integration is done; APIs exist around that integrated space to access it
> Applications using APIs and SPARQL need to coexist
> Our community is keen on using version control systems to maintain queries
> Trial and error requires versioning support