REPEATABLE SEMANTIC QUERIES
FOR THE LINKED DATA AGNOSTIC
Albert Meroño-Peñuela
Rinke Hoekstra
Richard Zijdeman
Auke Rijpma
Ashkan Ashkpour
Carlos Martínez
… and many others
@albertmeronyo
Linked Data, Research, Culture:
Distributed Publishing, Searching, and
Archiving
01-05-2017
• VU University Amsterdam – Computer Science (Knowledge Representation & Reasoning group)
• International Institute of Social History (IISG), Amsterdam
• CLARIAH – National Infrastructure for Digital Humanities
> DataLegend: Structured Data Hub
• DANS: CEDAR – Dutch historical censuses as 5-star LOD
2
INSTITUTIONAL SLIDE
Links > Queries
3
QUERIES AND LINKS
4
• Publishing Dutch historical censuses as 5-star LD
> Intensive use of RDF Data Cube
> Harmonization rules
> Provenance
> http://lod.cedar-project.nl/
• First historical census data as Linked Data (1795-1971)
• 8 million observations (sex, marital status, occupation position, housing type)
• External links
> Geographical: 2.7M
> Occupations: 350K https://collab.iisg.nl/web/hisco
> Religion: 250K http://licr.io/
• High value for social historians
CEDAR-PROJECT.NL
• CLARIAH-WP4: Structured data hub for social history
• IPUMS, NAPP, CEDAR, etc.
> Macro-, micro-, meso-data
> Civil registries, occupation, religion, country-level economic indicators
> National (Netherlands) and international
• Mostly CSV tables turned into RDF Data Cube and CSVW
• More than 1B triples
• Higher variety of humanities scholars → higher variety of data access requirements
5
CLARIAH-WP4: DATALEGEND.NET
[Diagram: the Structured Data Hub – existing variables & codes, frequency tables, mappings for variables that do not yet exist, publish and augment steps, provenance tracking, and external datasets and (meta)data, including external Linked Data and standard vocabularies, e.g. World Bank]
7
LINKED DATA ACCESS
Linked Data is great for information integration on the Web, but:
• Hard technological requirements: RDF, SPARQL
• Heterogeneity of access methods: SPARQL, #LD, dumps, HTML/RDFa, LDA
• Web developers want APIs and JSON
• Breaking the law endpoint
• Queries are second-class Web citizens (i.e. volatile)
> Lost after execution (reusability?)
> Multiple out-of-sync instances if shared among applications (reliability?)
How to make semantic queries repeatable for the Linked Data agnostic?
8
LINKED DATA ACCESS
• OpenPHACTS: RESTful entry point to Linked Data hubs for Web applications
• Query = Service = URI
9
LINKED DATA APIS
However:
• The API (e.g. Swagger spec, the code itself) still needs to be coded and maintained
• Exclusion of SPARQL ← what about query reuse?
• Automatically builds Swagger specs and API code
• Takes SPARQL queries as input (1 API operation = 1 SPARQL query)
> API call functionality limited to SPARQL expressivity
• Makes SPARQL queries uniquely referenceable by using their equivalent LDA operation
> Stores SPARQL internally
> But we already have uniquely referenceable SPARQL…
10
BASIL
• Writing SPARQL is trial and error ← calls for versioning support
• Variety of access interfaces needed
11
CENSUS DATA QUERYING INTERFACES
13
• One .rq file per SPARQL query
• Good support for query curation processes
> Versioning
> Branching
> Clone-pull-push
• Web-friendly features!
> One URI per query
> Uniquely identifiable
> De-referenceable (raw.githubusercontent.com; see the example below)
GITHUB AS A HUB OF SPARQL QUERIES
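For instance, a query stored in such a repository can be dereferenced with a single HTTP request. A sketch: the repository and branch appear elsewhere in this deck, but the exact file name is an assumption.

# Dereference a SPARQL query stored on GitHub (hypothetical file name)
curl -s https://raw.githubusercontent.com/CEDAR-project/Queries/master/houseType_all.rq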
14
Query centralization helps maintain distributed applications
15
• Same basic principle as BASIL: 1 SPARQL query = 1 API operation
• Automatically builds Swagger spec and UI from SPARQL
• Answers API calls
But:
• External query management
• Organization of SPARQL queries in the GitHub repo matches organization of the API
• Thin layer – nothing stored server-side
• Maps
> GitHub API
> Swagger spec
16
MAPPING GITHUB AND SWAGGER
17
SPARQL DECORATOR SYNTAX
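This slide shows grlc's decorator syntax: structured comments at the top of each .rq file that configure the corresponding API operation. Below is a minimal sketch of such a file; apart from #+ pagination (used later in this deck) and the BASIL-style ?_year parameter naming (mentioned on the next slide), the decorator names, endpoint path, and vocabulary are illustrative assumptions, not taken from the slides.

cat > houseType_all.rq <<'EOF'
#+ summary: House types per municipality for a given census year
#+ endpoint: http://lod.cedar-project.nl/sparql
#+ method: GET
#+ pagination: 100

PREFIX cedar: <http://example.org/cedar/vocab#>
SELECT ?municipality ?houseType ?population
WHERE {
  ?obs cedar:municipality ?municipality ;
       cedar:houseType    ?houseType ;
       cedar:population   ?population ;
       cedar:year         ?_year .   # ?_year is exposed as a required API parameter (BASIL naming convention)
}
EOF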
18
THE GRLC SERVICE
• Assuming your repo is at https://github.com/:owner/:repo
> http://grlc.io/api/:owner/:repo/spec returns the JSON Swagger spec
> http://grlc.io/api/:owner/:repo/ returns the Swagger UI
> http://grlc.io/api/:owner/:repo/:operation?p_1=v_1...p_n=v_n calls the operation with the specified parameter values
> Uses BASIL’s SPARQL variable name convention for query parameters
• Sends requests to
> https://api.github.com/repos/:owner/:repo to look for SPARQL queries and their decorators
> https://raw.githubusercontent.com/:owner/:repo/master/file.rq to dereference queries, get the SPARQL, and parse it
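In practice a client only needs plain HTTP. A sketch using the CEDAR-project/Queries repository and the houseType_all operation that appear in the pagination example later in this deck:

# Fetch the generated Swagger spec for a repository
curl -s http://grlc.io/api/CEDAR-project/Queries/spec

# Call one API operation as CSV (the ?page parameter is described on the pagination slide)
curl -s -H "Accept: text/csv" "http://grlc.io/api/CEDAR-project/Queries/houseType_all?page=1"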
19
SPICED-UP SWAGGER UI
20
SPARQL + RDFA + DUMPS + #LD
• Compatible with most Linked Data access methods
• Loads remote RDFa/dumps in memory
• Uses TPF for #LD servers
• Mixes all of these into one homogeneous API
21
PROVENANCE
• Two sources: query history, spec generation
• Uses W3C PROV
• Uses Git2PROV to get query history
• Adds spec provenance at generation time
• Visualizations with PROV-O-Viz (http://provoviz.org/)
22
ENUMERATIONS & DROPDOWNS
• Fills in the swag[paths][op][method][parameters][enum] array
• Uses the triple pattern of the SPARQL query’s BGP against the same SPARQL endpoint
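In other words, to populate a dropdown for a parameter, grlc runs a lookup derived from that parameter's triple pattern against the endpoint. A rough, hypothetical sketch of such a lookup (endpoint path, predicate, and exact query shape are all assumptions for illustration):

# Hypothetical enumeration lookup: distinct values a ?_houseType parameter could take
curl -s -H "Accept: application/sparql-results+json" \
  --data-urlencode 'query=SELECT DISTINCT ?houseType WHERE { ?obs <http://example.org/cedar/vocab#houseType> ?houseType }' \
  http://lod.cedar-project.nl/sparql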
23
CONTENT NEGOTIATION
• API endpoints can now end with .content_type (e.g. grlc.io/CLARIAH/wp-queries/MyQuery.csv)
• Supports .csv, .json, .html (can be extended)
• grlc sets the ‘Accept’ HTTP header and agnostically returns the same ‘Content-Type’ as the SPARQL endpoint
• Up to the SPARQL endpoint to accept it
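For example (repository and operation borrowed from the pagination example on the next slide), the same operation can be requested through an extension or through an Accept header; whether the negotiated type is actually honoured is, as noted above, up to the SPARQL endpoint:

# Extension-based: ask for JSON
curl -s http://grlc.io/api/CEDAR-project/Queries/houseType_all.json

# Header-based: ask for CSV
curl -s -H "Accept: text/csv" http://grlc.io/api/CEDAR-project/Queries/houseType_all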
24
PAGINATION
• Large query results are typically hard for consuming applications to handle
• Split the result into multiple parts (or “pages”)
• Page size? Set with the #+ pagination: 100 decorator
• Navigating pages
> rel=next,prev,first,last links in the HTTP headers (GitHub API traversal convention)
> Extra request parameter ?page (defaults to 1)
~ curl -X GET -H "Accept: text/csv" -I http://localhost:8088/api/CEDAR-project/Queries/houseType_all
HTTP/1.0 200 OK
Content-Type: text/csv; charset=UTF-8
Content-Length: 18447
Server: grlc/1.0.0
Link: <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=next,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last

~ curl -X GET -H "Accept: text/csv" -I http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=3
HTTP/1.0 200 OK
Content-Type: text/csv; charset=UTF-8
Content-Length: 18142
Server: grlc/1.0.0
Link: <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=4>; rel=next,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=prev,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=1>; rel=first,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last
25
CACHE
• Moved implementation outside of grlc (not its direct responsibility)
• grlc sets the HTTP header Cache-Control to public, max-age=900 (15 minutes, customizable)
• nginx caches all grlc-generated JSON (and other static/dynamic assets)
• nginx becomes part of the bundle
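A quick way to observe this from the outside (header value as stated above; host and path as in the pagination example):

# Inspect the caching header grlc attaches to its responses
curl -s -I http://localhost:8088/api/CEDAR-project/Queries/spec | grep -i cache-control
# expected, per this slide: Cache-Control: public, max-age=900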
26
DOCKER CONTAINER
• Uses Docker
• Infrastructure-independent install
• Bundles (composes) all required packages (Python, Python libs, grlc, nginx); can easily be extended with more
• Publicly available at hub.docker.com
• One-command server deploy: docker pull clariah/grlc
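After pulling the image, one docker run brings the whole bundle up. A sketch: the image name comes from the slide, but the port mapping (and the port the container listens on) are assumptions.

docker pull clariah/grlc
docker run -d --name grlc -p 8088:80 clariah/grlc
# then browse http://localhost:8088/api/:owner/:repo/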
27
SPARQL2GIT
• GUI for maintaining SPARQL query repositories in GitHub
• Easy editing of grlc parameters to use your SPARQL as an API
• Update query contents transparently – no git knowledge required!
• Gets your SPARQL in GitHub + your Linked Data APIs at once
• http://sparql2git.com/
28
EVALUATION – USE CASES
• CEDAR: Access to census data for historians
> Hides SPARQL
> Allows them to fill query parameters through forms
> Co-existence of SPARQL and non-SPARQL clients
• CLARIAH – Born Under a Bad Sign: do prenatal and early-life conditions have an impact on socioeconomic and health outcomes later in life? (uses 1891 Canada and Sweden Linked Census Data)
> Reduction of coupling between SPARQL libs and R
> Shorter R code – input stream as CSV
The spectrum of Linked Data clients: SPARQL-intensive applications vs RESTful API applications
grlc decouples SPARQL from all client applications (including LDAs), which is a powerful practice
• Separates query curation workflows from everything else
• Allows at the same time
> Web-friendly SPARQL queries
> Web-friendly RESTful APIs
• Helps you easily organise your LDA – just organise your SPARQL repository and you’re set
• Try it out!
> http://grlc.io/
> https://github.com/CLARIAH/grlc
29
CONCLUSIONS
1. What's your current practice when you code (semantic) queries into applications? Where do you store them? How do you reuse them (if at all)?
2. How much effort does it cost you to keep your APIs up to date? What does this maintenance consist of exactly?
30
QUESTIONS
THANK YOU!
@ALBERTMERONYO
DATALEGEND.NET
CLARIAH.NL
31

Editor's Notes

  • #4 If the Web is really about variety, both methods should be allowed, empowered and freely exchangeable…
  • #10 Integration is done; APIs are around to access that integrated space
  • #18 MIME, enumerate, method, pagination