ISWC 2017 In-Use paper. Despite the advantages of Linked Data as a data integration paradigm, accessing and consuming Linked Data is still a cumbersome task. Linked Data applications need to use technologies such as RDF and SPARQL that, despite their expressive power, belong to the data integration stack. As a result, applications and data cannot be cleanly separated: SPARQL queries, endpoint addresses, namespaces, and URIs end up as part of the application code. Many publishers address these problems by building RESTful APIs around their Linked Data. However, this solution has two pitfalls: these APIs are costly to maintain, and they black-box functionality by hiding the queries they use. In this paper we describe grlc, a gateway between Linked Data applications and the LOD cloud that offers a RESTful, reusable and uniform means to routinely access any Linked Data. It generates an OpenAPI-compatible API by using parametrized queries shared on the Web. The resulting APIs require no coding, rely on low-cost external query storage and versioning services, contain abundant provenance information, and integrate access to different publishing paradigms into a single API. We evaluate grlc qualitatively, by describing its reported value to current users, and quantitatively, by measuring the overhead added when generating API specifications and answering calls.
AUTOMATIC QUERY-CENTRIC API FOR ROUTINE ACCESS TO LINKED DATA
Albert Meroño-Peñuela
Rinke Hoekstra
Vrije Universiteit Amsterdam
ISWC 2017, October 24th
LINKED DATA ACCESS
Linked Data is great for information integration on the Web, but:
> Heterogeneity of access methods: SPARQL, #LD, dumps, HTML/RDFa, LDA
> Hard technological requirements: RDF, SPARQL
> Coupling with Linked Data specific libraries
> Web developers want APIs and JSON
Queries are second-class Web citizens (i.e. volatile):
> Lost after execution (reusability?)
> Multiple out-of-sync instances if shared among applications (reliability?)
How can we make semantic queries automatically repeatable for Linked Data consumers?
LINKED DATA APIS
OpenPHACTS: RESTful entry point to Linked Data hubs for Web applications
Query = Service = URI
However:
• The API (e.g. Swagger spec, code itself) still needs to be coded and maintained
• SPARQL is hidden from clients, which blocks query reuse
BASIL
Automatically builds Swagger specs and API code
Takes SPARQL queries as input (1 API operation = 1 SPARQL query)
> API call functionality limited to SPARQL expressivity
Makes SPARQL queries uniquely referenceable by using their equivalent LDA operation
> Stores SPARQL internally
> But we already have uniquely referenceable SPARQL…
Writing SPARQL by trial and error calls for versioning support
A variety of access interfaces is needed
GITHUB AS A HUB OF SPARQL QUERIES
One .rq file per SPARQL query
Good support for query curation processes:
> Versioning
> Branching
> Clone-pull-push
Web-friendly features!
> One URI per query
> Uniquely identifiable
> De-referenceable (raw.githubusercontent.com)
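Because each query lives in its own .rq file, it can be dereferenced directly from raw.githubusercontent.com. A minimal sketch of how such a raw URL is assembled (the owner, repository and file names below are illustrative, not an actual repo):

```python
def raw_query_url(owner: str, repo: str, filename: str, ref: str = "master") -> str:
    """Build the raw.githubusercontent.com URL that dereferences a .rq file."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{filename}"

# Hypothetical repository and query file, for illustration only.
print(raw_query_url("CLARIAH", "wp4-queries", "MyQuery.rq"))
# → https://raw.githubusercontent.com/CLARIAH/wp4-queries/master/MyQuery.rq
```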
GRLC
Same basic principle as BASIL: 1 SPARQL query = 1 API operation
Automatically builds Swagger spec and UI from SPARQL
Answers API calls
But:
> External query management
> Organization of SPARQL queries in the GitHub repo matches the organization of the API
> Thin layer – nothing stored server-side
> Maps the GitHub API onto the Swagger spec
THE GRLC SERVICE
Assuming your repo is at https://github.com/:owner/:repo
> http://grlc.io/api/:owner/:repo/spec returns the JSON swagger spec
> http://grlc.io/api/:owner/:repo/ returns the swagger UI
> http://grlc.io/api/:owner/:repo/:operation?p_1=v_1...p_n=v_n calls operation with specific parameter values
> Uses BASIL’s SPARQL variable name convention for query parameters
Sends requests to
> https://api.github.com/repos/:owner/:repo to look for SPARQL queries and their decorators
> https://raw.githubusercontent.com/:owner/:repo/master/file.rq to dereference queries, get the SPARQL, and parse it
> Supports versioning through http://grlc.io/api/:owner/:repo/commit/:sha
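The idea behind BASIL's variable name convention is that specially prefixed SPARQL variables (`?_name` for mandatory parameters, `?__name` for optional ones) become the API operation's parameters. A simplified sketch of the mechanism, not grlc's actual code (BASIL's full convention also encodes parameter types in suffixes, which this sketch omits):

```python
import re

# Match BASIL-style parameter variables: ?_name (mandatory) or ?__name (optional).
PARAM_RE = re.compile(r"\?(_{1,2})([A-Za-z0-9]+)")

def extract_params(query: str) -> dict:
    """Map parameter name -> True if mandatory, False if optional."""
    return {name: underscores == "_" for underscores, name in PARAM_RE.findall(query)}

def rewrite_query(query: str, values: dict) -> str:
    """Substitute caller-supplied values for the parameter variables."""
    for name, value in values.items():
        query = re.sub(r"\?_{1,2}" + re.escape(name) + r"\b", value, query)
    return query

q = "SELECT ?type WHERE { ?_city <http://example.org/hasType> ?type }"
print(extract_params(q))   # {'city': True}
print(rewrite_query(q, {"city": "<http://example.org/Amsterdam>"}))
```

Plain variables such as `?type` are untouched, so the projected results of the query are unaffected by parameter substitution.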
SPARQL + RDFA + DUMPS + #LD
• Compatible with most Linked Data access methods
• Loads remote RDFa/dumps in memory
• Uses TPF for #LD servers
• Mixes all these into one homogeneous API
PROVENANCE
• Two sources: query history, spec generation
• Uses W3C PROV
• Uses Git2PROV to get query history
• Adds spec provenance at generation time
• Visualizations with PROV-O-Viz (http://provoviz.org/)
ENUMERATIONS & DROPDOWNS
• Fills in the swag[paths][op][method][parameters][enum] array
• Evaluates the triple pattern of the SPARQL query’s BGP against the same SPARQL endpoint to obtain the allowed values
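A rough sketch of the idea, assuming a Swagger-style spec dict (the helper names and spec layout are illustrative, not grlc's actual code): an allowed-values query is derived from the triple pattern mentioning the parameter variable, and the (externally obtained) results are written into the spec's enum array.

```python
def enum_query(triple_pattern: str, param_var: str) -> str:
    """SPARQL query asking the endpoint for the distinct values a parameter may take."""
    return f"SELECT DISTINCT {param_var} WHERE {{ {triple_pattern} }}"

def fill_enum(spec: dict, path: str, method: str, param_name: str, values: list) -> dict:
    """Write the allowed values into swag[paths][op][method][parameters][enum]."""
    for param in spec["paths"][path][method]["parameters"]:
        if param["name"] == param_name:
            param["enum"] = sorted(values)
    return spec

print(enum_query("?_city <http://example.org/hasType> ?type", "?_city"))
spec = {"paths": {"/houseType_all": {"get": {"parameters": [{"name": "city"}]}}}}
fill_enum(spec, "/houseType_all", "get", "city", ["Utrecht", "Amsterdam"])
```

In the Swagger UI this enum array is what turns a free-text parameter field into a dropdown.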
CONTENT NEGOTIATION
• API endpoints can now end with .content_type (e.g. grlc.io/CLARIAH/wp-queries/MyQuery.csv)
• Supports .csv, .json, .html (can be extended)
• grlc sets the ‘Accept’ HTTP header and agnostically returns the same ‘Content-Type’ as the SPARQL endpoint
• It is up to the SPARQL endpoint to honor it
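A minimal sketch of this extension-based negotiation; the extension-to-MIME table below is illustrative and grlc's real mapping may differ:

```python
# Illustrative mapping from URL extension to the Accept header value forwarded
# to the SPARQL endpoint.
EXTENSION_TO_MIME = {
    ".csv": "text/csv",
    ".json": "application/json",
    ".html": "text/html",
}

def negotiate(path: str):
    """Split a trailing .content_type extension off an API path.

    Returns (Accept header value or None, bare operation path)."""
    for ext, mime in EXTENSION_TO_MIME.items():
        if path.endswith(ext):
            return mime, path[: -len(ext)]
    return None, path  # no extension: let the endpoint pick its default

print(negotiate("grlc.io/CLARIAH/wp-queries/MyQuery.csv"))
# → ('text/csv', 'grlc.io/CLARIAH/wp-queries/MyQuery')
```

Extending the supported formats then amounts to adding an entry to the table.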
PAGINATION
• Large query results are typically nasty for consuming applications
• Split the result into multiple parts (or “pages”)
• Size? #+ pagination: 100
• Navigating pages:
  • rel=next,prev,first,last links in the HTTP headers (GitHub API traversal convention)
  • Extra request parameter ?page (defaults to 1)

~ curl -X GET -H "Accept: text/csv" -I http://localhost:8088/api/CEDAR-project/Queries/houseType_all
HTTP/1.0 200 OK
Content-Type: text/csv; charset=UTF-8
Content-Length: 18447
Server: grlc/1.0.0
Link: <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=next,
 <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last

~ curl -X GET -H "Accept: text/csv" -I http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=3
HTTP/1.0 200 OK
Content-Type: text/csv; charset=UTF-8
Content-Length: 18142
Server: grlc/1.0.0
Link: <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=4>; rel=next,
 <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=prev,
 <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=1>; rel=first,
 <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last
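On the consuming side, a client can walk these pages by reading the Link header. A client-side helper sketch (not part of grlc), which assumes the URLs themselves contain no commas, as in the responses above:

```python
def parse_link_header(link: str) -> dict:
    """Parse a Link header in the GitHub traversal convention into {rel: url}."""
    rels = {}
    for part in link.split(","):
        url_part, _, rel_part = part.partition(";")
        url = url_part.strip().lstrip("<").rstrip(">")
        rel = rel_part.strip()
        if rel.startswith("rel="):
            rels[rel[len("rel="):].strip('"')] = url
    return rels

links = ("<http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=4>; rel=next, "
         "<http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=prev")
print(parse_link_header(links))
```

A client then simply follows `rel=next` until it is absent, mirroring how the GitHub API itself is traversed.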
DOCKER CONTAINER
• Uses Docker for an infrastructure-independent install
• Bundles (composes) all required packages (Python, Python libs, grlc, nginx); can be easily extended with more
• Publicly available at hub.docker.com
• One-command server deploy: docker pull clariah/grlc
QUALITATIVE EVALUATION
915 unique visitors since July 2016
1,878 sessions
46.4% return rate
5 active open source contributors, 14 pull requests
A community of users and devs
QUALITATIVE EVALUATION
> “multiple copies of the same queries in different places (…) was problematic. grlc allows queries to be maintained in a single location”
> “with grlc the R code becomes clearer due to the decoupling with SPARQL; and shorter, since a curl suffices to retrieve the data”
> “it allows us to manage SPARQL queries separate from the rest of the API – this enables, for instance, to have different queries without having to deploy a new version of the API”
> “we use grlc to provide FAQ for those who would prefer REST over SPARQL, but also to explore the data”
> “we use grlc to expose the ECAI conference proceedings not only as Linked Data that can be used by Semantic Web practitioners, but also as a Web API that web developers can consume”
> “grlc helps to share, extend and repurpose queries by providing a URI for the resulted queries and by supporting collaborative update of those queries”
CONCLUSIONS
The spectrum of Linked Data clients: SPARQL-intensive applications vs RESTful API applications
grlc uses decoupling of SPARQL from all client applications (including LDA) as a powerful practice
Separates query curation workflows from everything else
Allows at the same time:
> Web-friendly SPARQL queries
> Web-friendly RESTful APIs
(Moderate) costs, mainly due to HTTP requests and query payload
> SPARQL projections
> Reusability of query catalogs
THANK YOU!
@ALBERTMERONYO
DATALEGEND.NET
CLARIAH.NL
Speaker notes:
> Various access methods: SPARQL, dumps, #LD, etc.
> Our community knows little of these, yet understands the benefits and wants to use Linked Data in their Linked Data consuming applications
> Queries become non-reusable; they must be kept in sync across different applications
> Integration is done; APIs exist around that integrated space to access it
> Applications using APIs and SPARQL need to coexist
> Our community is keen on using version control systems to maintain queries
> Trial and error requires versioning support