REPEATABLE SEMANTIC QUERIES
FOR THE LINKED DATA AGNOSTIC
Albert Meroño-Peñuela
Rinke Hoekstra
Richard Zijdeman
Auke Rijpma
Ashkan Ashkpour
Carlos Martínez
… and many others
@albertmeronyo
Linked Data, Research, Culture:
Distributed Publishing, Searching, and
Archiving
01-05-2017
• VU University Amsterdam – Computer Science (Knowledge Representation & Reasoning group)
• International Institute of Social History (IISG), Amsterdam
• CLARIAH – National Infrastructure for Digital Humanities
> DataLegend: Structured Data Hub
• DANS: CEDAR – Dutch historical censuses as 5-star LOD
2
INSTITUTIONAL SLIDE
Links > Queries
3
QUERIES AND LINKS
4
• Publishing Dutch historical censuses as 5-star LD
> Intensive use of RDF Data Cube
> Harmonization rules
> Provenance
> http://lod.cedar-project.nl/
• First historical census data as Linked Data (1795-1971)
• 8 million observations (sex, marital status, occupation position, housing type)
• External links
> Geographical: 2.7M
> Occupations: 350K https://collab.iisg.nl/web/hisco
> Religion: 250K http://licr.io/
• High value for social historians
CEDAR-PROJECT.NL
• CLARIAH-WP4: Structured data hub for social history
• IPUMS, NAPP, CEDAR, etc.
> Macro-, micro-, meso-data
> Civil registries, occupation, religion, country-level economic indicators
> National (Netherlands) and international
• Mostly CSV tables turned into RDF Data Cube and CSVW
• More than 1B triples
• Higher variety of humanities scholars → higher variety of data access requirements
5
CLARIAH-WP4: DATALEGEND.NET
[Diagram: the Structured Data Hub – existing variables & codes, frequency tables, mappings for variables that do not yet exist, publish and augment steps, provenance tracking, and external datasets and (meta)data, including external Linked Data and standard vocabularies, e.g. World Bank]
7
LINKED DATA ACCESS
Linked Data is great for information integration on the Web, but:
• Hard technological requirements: RDF, SPARQL
• Heterogeneity of access methods: SPARQL, #LD, dumps, HTML/RDFa, LDA
• Web developers want APIs and JSON
• Breaking the law endpoint
• Queries are second-class Web citizens (i.e. volatile)
> Lost after execution (reusability?)
> Multiple out-of-sync instances if shared among applications (reliability?)
How to make semantic queries repeatable for the Linked Data agnostic?
8
LINKED DATA ACCESS
• OpenPHACTS: RESTful entry point to Linked Data hubs for Web applications
• Query = Service = URI
9
LINKED DATA APIS
However:
• The API (e.g. Swagger spec, the code itself) still needs to be coded and maintained
• Exclusion of SPARQL ← what about query reuse?
• Automatically builds Swagger specs and API code
• Takes SPARQL queries as input (1 API operation = 1 SPARQL query)
> API call functionality limited to SPARQL expressivity
• Makes SPARQL queries uniquely referenceable by using their equivalent LDA operation
> Stores SPARQL internally
> But we already have uniquely referenceable SPARQL…
10
BASIL
• Writing SPARQL is trial and error ← calls for versioning support
• Variety of access interfaces needed
11
CENSUS DATA QUERYING INTERFACES
13
• One .rq file per SPARQL query
• Good support for query curation processes
> Versioning
> Branching
> Clone-pull-push
• Web-friendly features!
> One URI per query
> Uniquely identifiable
> De-referenceable (raw.githubusercontent.com; see the example below)
GITHUB AS A HUB OF SPARQL QUERIES
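For instance, a query stored in such a repository can be dereferenced with a single HTTP request. A sketch: the repository and branch appear elsewhere in this deck, but the exact file name is an assumption.

# Dereference a SPARQL query stored on GitHub (hypothetical file name)
curl -s https://raw.githubusercontent.com/CEDAR-project/Queries/master/houseType_all.rq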
14
Query centralization helps maintain distributed applications
15
• Same basic principle as BASIL: 1 SPARQL query = 1 API operation
• Automatically builds Swagger spec and UI from SPARQL
• Answers API calls
But:
• External query management
• Organization of SPARQL queries in the GitHub repo matches organization of the API
• Thin layer – nothing stored server-side
• Maps
> GitHub API
> Swagger spec
16
MAPPING GITHUB AND SWAGGER
17
SPARQL DECORATOR SYNTAX
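This slide shows grlc's decorator syntax: structured comments at the top of each .rq file that configure the corresponding API operation. Below is a minimal sketch of such a file; apart from #+ pagination (used later in this deck) and the BASIL-style ?_year parameter naming (mentioned on the next slide), the decorator names, endpoint path, and vocabulary are illustrative assumptions, not taken from the slides.

cat > houseType_all.rq <<'EOF'
#+ summary: House types per municipality for a given census year
#+ endpoint: http://lod.cedar-project.nl/sparql
#+ method: GET
#+ pagination: 100

PREFIX cedar: <http://example.org/cedar/vocab#>
SELECT ?municipality ?houseType ?population
WHERE {
  ?obs cedar:municipality ?municipality ;
       cedar:houseType    ?houseType ;
       cedar:population   ?population ;
       cedar:year         ?_year .   # ?_year is exposed as a required API parameter (BASIL naming convention)
}
EOF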
18
THE GRLC SERVICE
• Assuming your repo is at https://github.com/:owner/:repo
> http://grlc.io/api/:owner/:repo/spec returns the JSON Swagger spec
> http://grlc.io/api/:owner/:repo/ returns the Swagger UI
> http://grlc.io/api/:owner/:repo/:operation?p_1=v_1...p_n=v_n calls the operation with the specified parameter values
> Uses BASIL’s SPARQL variable name convention for query parameters
• Sends requests to
> https://api.github.com/repos/:owner/:repo to look for SPARQL queries and their decorators
> https://raw.githubusercontent.com/:owner/:repo/master/file.rq to dereference queries, get the SPARQL, and parse it
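In practice a client only needs plain HTTP. A sketch using the CEDAR-project/Queries repository and the houseType_all operation that appear in the pagination example later in this deck:

# Fetch the generated Swagger spec for a repository
curl -s http://grlc.io/api/CEDAR-project/Queries/spec

# Call one API operation as CSV (the ?page parameter is described on the pagination slide)
curl -s -H "Accept: text/csv" "http://grlc.io/api/CEDAR-project/Queries/houseType_all?page=1"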
19
SPICED-UP SWAGGER UI
20
SPARQL + RDFA + DUMPS + #LD
• Compatible with most Linked Data access methods
• Loads remote RDFa/dumps in memory
• Uses TPF for #LD servers
• Mixes all of these into one homogeneous API
21
PROVENANCE
• Two sources: query history, spec generation
• Uses W3C PROV
• Uses Git2PROV to get query history
• Adds spec provenance at generation time
• Visualizations with PROV-O-Viz (http://provoviz.org/)
22
ENUMERATIONS & DROPDOWNS
• Fills in the swag[paths][op][method][parameters][enum] array
• Uses the triple pattern of the SPARQL query’s BGP against the same SPARQL endpoint
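In other words, to populate a dropdown for a parameter, grlc runs a lookup derived from that parameter's triple pattern against the endpoint. A rough, hypothetical sketch of such a lookup (endpoint path, predicate, and exact query shape are all assumptions for illustration):

# Hypothetical enumeration lookup: distinct values a ?_houseType parameter could take
curl -s -H "Accept: application/sparql-results+json" \
  --data-urlencode 'query=SELECT DISTINCT ?houseType WHERE { ?obs <http://example.org/cedar/vocab#houseType> ?houseType }' \
  http://lod.cedar-project.nl/sparql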
23
CONTENT NEGOTIATION
• API endpoints can now end with .content_type (e.g. grlc.io/CLARIAH/wp-queries/MyQuery.csv)
• Supports .csv, .json, .html (can be extended)
• grlc sets the ‘Accept’ HTTP header and agnostically returns the same ‘Content-Type’ as the SPARQL endpoint
• Up to the SPARQL endpoint to accept it
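For example (repository and operation borrowed from the pagination example on the next slide), the same operation can be requested through an extension or through an Accept header; whether the negotiated type is actually honoured is, as noted above, up to the SPARQL endpoint:

# Extension-based: ask for JSON
curl -s http://grlc.io/api/CEDAR-project/Queries/houseType_all.json

# Header-based: ask for CSV
curl -s -H "Accept: text/csv" http://grlc.io/api/CEDAR-project/Queries/houseType_all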
24
PAGINATION
• Large query results are typically hard for consuming applications to handle
• Split the result into multiple parts (or “pages”)
• Page size? Set with the #+ pagination: 100 decorator
• Navigating pages
> rel=next,prev,first,last links in the HTTP headers (GitHub API traversal convention)
> Extra request parameter ?page (defaults to 1)
~ curl -X GET -H "Accept: text/csv" -I http://localhost:8088/api/CEDAR-project/Queries/houseType_all
HTTP/1.0 200 OK
Content-Type: text/csv; charset=UTF-8
Content-Length: 18447
Server: grlc/1.0.0
Link: <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=next,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last

~ curl -X GET -H "Accept: text/csv" -I http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=3
HTTP/1.0 200 OK
Content-Type: text/csv; charset=UTF-8
Content-Length: 18142
Server: grlc/1.0.0
Link: <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=4>; rel=next,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=prev,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=1>; rel=first,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last
25
CACHE
• Moved implementation outside of grlc (not its direct responsibility)
• grlc sets the HTTP header Cache-Control to public, max-age=900 (15 minutes, customizable)
• nginx caches all grlc-generated JSON (and other static/dynamic assets)
• nginx becomes part of the bundle
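A quick way to observe this from the outside (header value as stated above; host and path as in the pagination example):

# Inspect the caching header grlc attaches to its responses
curl -s -I http://localhost:8088/api/CEDAR-project/Queries/spec | grep -i cache-control
# expected, per this slide: Cache-Control: public, max-age=900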
26
DOCKER CONTAINER
• Uses Docker
• Infrastructure-independent install
• Bundles (composes) all required packages (Python, Python libs, grlc, nginx); can easily be extended with more
• Publicly available at hub.docker.com
• One-command server deploy: docker pull clariah/grlc
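After pulling the image, one docker run brings the whole bundle up. A sketch: the image name comes from the slide, but the port mapping (and the port the container listens on) are assumptions.

docker pull clariah/grlc
docker run -d --name grlc -p 8088:80 clariah/grlc
# then browse http://localhost:8088/api/:owner/:repo/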
27
SPARQL2GIT
• GUI for maintaining SPARQL query repositories in GitHub
• Easy editing of grlc parameters to use your SPARQL as an API
• Update query contents transparently – no git knowledge required!
• Gets your SPARQL in GitHub + your Linked Data APIs at once
• http://sparql2git.com/
28
EVALUATION – USE CASES
• CEDAR: Access to census data for historians
> Hides SPARQL
> Allows them to fill query parameters through forms
> Co-existence of SPARQL and non-SPARQL clients
• CLARIAH – Born Under a Bad Sign: do prenatal and early-life conditions have an impact on socioeconomic and health outcomes later in life? (uses 1891 Canada and Sweden Linked Census Data)
> Reduction of coupling between SPARQL libs and R
> Shorter R code – input stream as CSV
The spectrum of Linked Data clients: SPARQL-intensive applications vs RESTful API applications
grlc decouples SPARQL from all client applications (including LDAs), which is a powerful practice
• Separates query curation workflows from everything else
• Allows at the same time
> Web-friendly SPARQL queries
> Web-friendly RESTful APIs
• Helps you easily organise your LDA – just organise your SPARQL repository and you’re set
• Try it out!
> http://grlc.io/
> https://github.com/CLARIAH/grlc
29
CONCLUSIONS
1. What's your current practice when you code (semantic) queries into applications? Where do you store them? How do you reuse them (if at all)?
2. How much effort does it cost you to keep your APIs up to date? What does this maintenance consist of exactly?
30
QUESTIONS
THANK YOU!
@ALBERTMERONYO
DATALEGEND.NET
CLARIAH.NL
31

Editor's Notes

  • #4 If the Web is really about variety, both methods should be allowed, empowered and freely exchangeable…
  • #10 Integration is done; APIs are around to access that integrated space
  • #18 MIME, enumerate, method, pagination