Making social science more reproducible by encapsulating access to linked data
1.
MAKING SOCIAL SCIENCE MORE REPRODUCIBLE
BY ENCAPSULATING ACCESS TO LINKED DATA
Albert Meroño-Peñuela
Richard Zijdeman
Ashkan Ashkpour
Rinke Hoekstra
ESSHC 2018
2. Vrije Universiteit Amsterdam
VU University Amsterdam – Computer Science (Knowledge Representation & Reasoning group)
International Institute of Social History (IISG), Amsterdam
CLARIAH – National Infrastructure for Digital Humanities
> DataLegend: Structured Data Hub
DANS: CEDAR – Dutch historical censuses as 5-star LOD
2
INSTITUTIONAL SLIDE
4. Vrije Universiteit Amsterdam
Reproducibility: proxy for replicability
Key for the scientific method
Currently, we include a small link to “cite datasets” as data
provenance
4
REPRODUCIBILITY & DATA PROVENANCE
5. Vrije Universiteit Amsterdam
Point to an “open dataset”
Two big problems
Combination of multiple datasets
Subset of the original data
Usually these two are resolved with “data munging”
NOT part of the citation
Critical for reproducibility!
5
DATA “CITATIONS”
6. Vrije Universiteit Amsterdam
Solutions achieved through Linked (Open) Data
> Combination of datasets: RDF
> Selecting and transforming subsets of the data: SPARQL
Successfully applied in many disciplines, including
social history
> See http://linkeddatabook.com/editions/1.0/
6
SEMANTIC WEB SOLUTIONS
9. Vrije Universiteit Amsterdam
Unfortunately, there are two problems...
1. Encoding a research question in SPARQL is difficult (as
you’ve seen)
2. Lack of methods and tools to save, maintain and execute
these queries quickly (i.e. without having to rewrite them)
9
OKAY, HOW CAN I USE THEM?
10.
http://grlc.io/ and
https://github.com/CLARIAH/grlc
1. Incentivizes collaborative writing
of SPARQL queries in GitHub
2. Automatically builds APIs using
those queries, providing
executable URIs (HTTP links)
This means:
External query management
API is organized just as the GitHub
repository
Thin layer – nothing stored server-
side
11.
Collaborative writing of research
questions in SPARQL
Good support for query curation
processes
> Versioning
> Branching
> Clone-pull-push
> Pull requests
Web-friendly features!
> One URI per query
> Uniquely identifiable
> De-referenceable
(raw.githubusercontent.com)
SPARQL IN GITHUB
12. Vrije Universiteit Amsterdam
12
AUTOMATIC BUILD OF APIS
• 1 research question = 1 SPARQL query = 1 URI (HTTP link)
• Actionable (executable if we click them)
• JSON specification, Swagger-UI for human readability
13. Vrije Universiteit Amsterdam
13
… AND THE ACTIONABLE LINKS?
Assuming your queries are at https://github.com/:owner/:repo
> http://grlc.io/api/:owner/:repo/spec returns the JSON Swagger spec
> http://grlc.io/api/:owner/:repo/ returns the Swagger UI
> http://grlc.io/api/:owner/:repo/:operation?p_1=v_1...p_n=v_n calls the operation with the specified parameter values
> Uses BASIL’s SPARQL variable name convention for query parameters
Sends requests to
> https://api.github.com/repos/:owner/:repo to look for SPARQL queries and their decorators
> https://raw.githubusercontent.com/:owner/:repo/master/file.rq to dereference queries, get the SPARQL, and parse it
> Supports versioning through http://grlc.io/api/:owner/:repo/commit/:sha
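The URL patterns above can be sketched as a small client-side helper. This is a minimal sketch, not grlc code; the owner, repo, and operation names are illustrative, and placing the operation after /commit/:sha is an assumption:

```python
from urllib.parse import urlencode

GRLC = "http://grlc.io/api"

def grlc_url(owner, repo, operation=None, sha=None, **params):
    """Build a grlc API URL following the patterns on this slide:
    /api/:owner/:repo/spec (JSON Swagger spec),
    /api/:owner/:repo/:operation?p_1=v_1&... (call an operation),
    /api/:owner/:repo/commit/:sha/... (pinned to a Git version)."""
    base = f"{GRLC}/{owner}/{repo}"
    if sha is not None:
        base += f"/commit/{sha}"
    if operation is None:
        return base + "/spec"
    url = f"{base}/{operation}"
    if params:
        url += "?" + urlencode(sorted(params.items()))
    return url

# Per BASIL's convention, a variable such as ?_country in the SPARQL
# query surfaces as an HTTP parameter named 'country' on the API.
print(grlc_url("CLARIAH", "wp4-queries", "houseType_all", page=2))
```

One research question thus maps to one stable, executable URI, which is what makes these links citable in a paper.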
16. Vrije Universiteit Amsterdam
16
SPARQL + RDFa + DUMPS + #LD
• Compatible with most Linked Data access methods
• Loads remote RDFa/dumps in memory
• Uses TPF for #LD servers
• Mixes all these into one homogeneous API
17. Vrije Universiteit Amsterdam
17
PROVENANCE
• Two sources: query history, spec generation
• Uses W3C PROV
• Uses Git2PROV to get query history
• Adds spec provenance at generation time
• Visualizations with PROV-O-Viz (http://provoviz.org/)
18. Vrije Universiteit Amsterdam
18
ENUMERATIONS & DROPDOWNS
• Fills in the swag[paths][op][method][parameters][enum] array
• Uses the triple pattern of the SPARQL query’s BGP against the same SPARQL endpoint
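For example, if a parameter variable appears in the query’s basic graph pattern, its possible values can be enumerated by running that triple pattern against the same endpoint. A sketch (the predicate URI and variable name are illustrative, not from the slides):

```sparql
# Original parameterized query: ?_type becomes an API parameter.
SELECT ?s WHERE {
  ?s <http://example.org/houseType> ?_type .
}

# To populate the parameter's enum/dropdown, the same triple
# pattern is asked for distinct values against the same endpoint:
SELECT DISTINCT ?type WHERE {
  ?s <http://example.org/houseType> ?type .
}
```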
19. Vrije Universiteit Amsterdam
19
CONTENT NEGOTIATION
• API endpoints can now end with .content_type (e.g. grlc.io/CLARIAH/wp-queries/MyQuery.csv)
• Supports .csv, .json, .html (can be extended)
• grlc sets the ‘Accept’ HTTP header and agnostically returns the same ‘Content-Type’ as the SPARQL endpoint
• Up to the SPARQL endpoint to accept it
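The extension-to-Accept mapping described above can be sketched as follows. The exact MIME types grlc sends, and the fallback type, are assumptions for illustration:

```python
# Map a requested file extension to the Accept header sent to the
# SPARQL endpoint; grlc then relays the endpoint's Content-Type as-is.
ACCEPT_BY_EXTENSION = {
    "csv": "text/csv",
    "json": "application/json",
    "html": "text/html",
}

def accept_header(path):
    """Return the Accept header for a request like .../MyQuery.csv."""
    ext = path.rsplit(".", 1)[-1] if "." in path else None
    # Fallback when no known extension is given (the choice of
    # SPARQL JSON results here is an assumption, not from the slides).
    return ACCEPT_BY_EXTENSION.get(ext, "application/sparql-results+json")

print(accept_header("CLARIAH/wp-queries/MyQuery.csv"))
```

Because grlc only forwards the header, whether the format is actually honored remains up to the SPARQL endpoint.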
20. Vrije Universiteit Amsterdam
20
SAME QUESTIONS, DIFFERENT DATASETS
Three ways of separating queries (research questions, SPARQL) from the data (datasets, endpoints):
• Using a grlc-repository endpoint.txt file (OK, but all queries are asked against the same data)
• Using a query-dependent #+ endpoint: decorator (OK, but the endpoint still depends on the query at execution)
• Using a query-dependent HTTP parameter (OK, profit!)
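The first two options correspond to grlc’s decorators, written as comments at the top of a .rq file in the repository. A sketch, where the summary text, endpoint URL, and query body are illustrative (only #+ endpoint: and #+ pagination: appear in these slides):

```sparql
#+ summary: House types in the Dutch historical censuses
#+ endpoint: https://example.org/sparql
#+ pagination: 100

SELECT ?houseType (COUNT(?house) AS ?n) WHERE {
  ?house <http://example.org/houseType> ?houseType .
} GROUP BY ?houseType
```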
21. Vrije Universiteit Amsterdam
21
SAME QUESTIONS, DIFFERENT DATASETS
If our SPARQL query (= research question) is at
http://grlc.io/api/user/repo/query
we can ask it against many different endpoints using
http://grlc.io/api/user/repo/query?endpoint=dataset1
http://grlc.io/api/user/repo/query?endpoint=dataset2
etc.
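A sketch of building these per-endpoint call URLs; the repo path and endpoint URLs are placeholders from the slide, not real datasets:

```python
from urllib.parse import urlencode

QUERY_URL = "http://grlc.io/api/user/repo/query"

def with_endpoint(endpoint):
    """Return the grlc call URL targeting a specific SPARQL endpoint."""
    return QUERY_URL + "?" + urlencode({"endpoint": endpoint})

# The same research question, asked against two different datasets:
for ep in ("https://dataset1.example/sparql", "https://dataset2.example/sparql"):
    print(with_endpoint(ep))
```

This keeps the research question fixed while the data it runs over varies, which is the separation the slide argues for.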
22. Vrije Universiteit Amsterdam
1,551 unique visitors since July 2016
3,251 sessions
58.97% return rate
5 active open source contributors, 31 pull requests
Community of users and developers
22
QUANTITATIVE EVALUATION
23. Vrije Universiteit Amsterdam
> “multiple copies of the same queries in different places
(…) was problematic. grlc allows queries to be
maintained in a single location”
> “with grlc the R code becomes clearer due to the
decoupling with SPARQL; and shorter, since a curl
suffices to retrieve the data”
> “it allows us to manage SPARQL queries separate from
the rest of the API – this enables, for instance, to have
different queries without having to deploy a new version
of the API”
> “we use grlc to provide FAQ for those who would prefer
REST over SPARQL, but also to explore the data”
> “we use grlc to expose the ECAI conference proceedings
not only as Linked Data that can be used by Semantic
Web practitioners, but also as a Web API that web
developers can consume”
> “grlc helps to share, extend and repurpose queries by
providing a URI for the resulting queries and by
supporting collaborative updates of those queries”
23
USE CASES
25. Vrije Universiteit Amsterdam
The need for data links in papers for reproducibility
Dataset citations are not enough
> Combine datasets
> Reduce and transform datasets
We use Linked Open Data and Semantic Web technology in grlc
> Motivates collaborative writing of research questions in SPARQL
> Enables the maintenance and creation of ACTIONABLE DATA LINKS to
combine and transform datasets
> Allows separating queries from data, so the same questions can be asked of
different datasets
Success in multiple domains, including Social Science & History
Open source
http://grlc.io
https://github.com/CLARIAH/grlc
25
CONCLUSIONS
26.
THANK YOU!
DATALEGEND.NET
CLARIAH.NL
26
http://grlc.io/
27. Vrije Universiteit Amsterdam
27
PAGINATION
• Large query results are typically hard for consuming applications to handle
• Split the result into multiple parts (or “pages”)
• Size? #+ pagination: 100
• Navigating pages:
  • rel=next,prev,first,last links in the HTTP headers (GitHub API traversal convention)
  • Extra request parameter ?page (defaults to 1)
~ curl -X GET -H "Accept: text/csv" -I http://localhost:8088/api/CEDAR-project/Queries/houseType_all
HTTP/1.0 200 OK
Content-Type: text/csv; charset=UTF-8
Content-Length: 18447
Server: grlc/1.0.0
Link: <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=next,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last

~ curl -X GET -H "Accept: text/csv" -I http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=3
HTTP/1.0 200 OK
Content-Type: text/csv; charset=UTF-8
Content-Length: 18142
Server: grlc/1.0.0
Link: <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=4>; rel=next,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=prev,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=1>; rel=first,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last
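A client can walk these pages by reading the Link header. A minimal sketch of parsing it into a rel → URL map (not grlc code, just an illustration of consuming the convention):

```python
import re

def parse_link_header(value):
    """Parse an HTTP Link header such as
    '<http://.../q?page=2>; rel=next, <http://.../q?page=889>; rel=last'
    into a dict mapping relation names (next, prev, first, last) to URLs."""
    links = {}
    for url, rel in re.findall(r'<([^>]+)>\s*;\s*rel="?(\w+)"?', value):
        links[rel] = url
    return links

header = ('<http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=next, '
          '<http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last')
print(parse_link_header(header)["next"])
```

Following rel=next until it disappears retrieves the full result set page by page.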
28. Vrije Universiteit Amsterdam
28
DOCKER CONTAINER
• Uses Docker
• Infrastructure-independent install
• Bundles (composes) all required packages (Python, Python libs, grlc, nginx); can easily be extended with more
• Publicly available at hub.docker.com
• One-command server deploy: docker pull clariah/grlc
SPEAKER NOTES
“A way of studying the connections of the past through connections in the present…”
Talk about how all these social science history related projects brought interesting new ideas to Semantic Web research
SSH can benefit from these results too, from a methodological point of view
Links like the ones we use on the Web: HTTP resources (like web pages)
If the Web is really about variety, both methods should be allowed, empowered and freely exchangeable…
Great for attribution and tracking of use
But to what extent does it enable replication?
Not enough: whole tracks in conferences are devoted to replication studies and negative results
Why does this happen?
Instructions on how to merge them? How to clean them? Outliers? Suited for the purpose of the study?
Critical in social history -> combining datasets is mandatory for interdisciplinary studies (demography, economics, history of work, etc.)
“Projections”, “aggregations” and other kinds of queries and transformations that we do before analysis