Making social science more reproducible by encapsulating access to linked data
1.
MAKING SOCIAL SCIENCE MORE REPRODUCIBLE
BY ENCAPSULATING ACCESS TO LINKED DATA
Albert Meroño-Peñuela
Richard Zijdeman
Ashkan Ashkpour
Rinke Hoekstra
ESSHC 2018
2. Vrije Universiteit Amsterdam
VU University Amsterdam – Computer Science (Knowledge Representation & Reasoning group)
International Institute of Social History (IISG), Amsterdam
CLARIAH – National Infrastructure for Digital Humanities
> DataLegend: Structured Data Hub
DANS: CEDAR – Dutch historical censuses as 5-star LOD
2
INSTITUTIONAL SLIDE
4. Vrije Universiteit Amsterdam
Reproducibility: proxy for replicability
Key for the scientific method
Currently, we include a small link to “cite datasets” as data
provenance
4
REPRODUCIBILITY & DATA PROVENANCE
5. Vrije Universiteit Amsterdam
Point to an “open dataset”
Two big problems
Combination of multiple datasets
Subset of the original data
Usually these two are resolved with “data munging”
NOT part of the citation
Critical for reproducibility!
5
DATA “CITATIONS”
6. Vrije Universiteit Amsterdam
Solutions achieved through Linked (Open) Data
> Combination of datasets: RDF
> Selecting and transforming subsets of the data: SPARQL
Successfully applied in many disciplines, including
social history
> See http://linkeddatabook.com/editions/1.0/
6
SEMANTIC WEB SOLUTIONS
9. Vrije Universiteit Amsterdam
Unfortunately, there are two problems...
1. Encoding a research question in SPARQL is difficult (as
you’ve seen)
2. Lack of methods and tools to save, maintain and execute
these queries quickly (i.e. without having to rewrite them)
9
OKAY, HOW CAN I USE THEM?
10.
http://grlc.io/ and
https://github.com/CLARIAH/grlc
1. Incentivizes collaborative writing
of SPARQL queries in GitHub
2. Automatically builds APIs using
those queries, providing
executable URIs (HTTP links)
This means:
External query management
API is organized just as the GitHub
repository
Thin layer – nothing stored server-
side
11.
Collaborative writing of research
questions in SPARQL
Good support for query curation
processes
> Versioning
> Branching
> Clone-pull-push
> Pull requests
Web-friendly features!
> One URI per query
> Uniquely identifiable
> De-referenceable
(raw.githubusercontent.com)
SPARQL IN GITHUB
12. Vrije Universiteit Amsterdam
12
AUTOMATIC BUILD OF APIS
• 1 research question = 1 SPARQL query = 1 URI (HTTP link)
• Actionable (executable if we click them)
• JSON specification, Swagger-UI for human readability
13. Vrije Universiteit Amsterdam
13
… AND THE ACTIONABLE LINKS?
Assuming your queries are at https://github.com/:owner/:repo
> http://grlc.io/api/:owner/:repo/spec returns the JSON Swagger spec
> http://grlc.io/api/:owner/:repo/ returns the Swagger UI
> http://grlc.io/api/:owner/:repo/:operation?p_1=v_1...p_n=v_n calls the operation with the specified parameter values
> Uses BASIL’s SPARQL variable name convention for query parameters
Sends requests to
> https://api.github.com/repos/:owner/:repo to look for SPARQL queries and their decorators
> https://raw.githubusercontent.com/:owner/:repo/master/file.rq to dereference queries, get the SPARQL, and parse it
> Supports versioning through http://grlc.io/api/:owner/:repo/commit/:sha
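The URL patterns above can be sketched as a small client-side helper. This is a minimal sketch, not grlc code; the owner, repo, and operation names are illustrative, and placing the operation after /commit/:sha is an assumption:

```python
from urllib.parse import urlencode

GRLC = "http://grlc.io/api"

def grlc_url(owner, repo, operation=None, sha=None, **params):
    """Build a grlc API URL following the patterns on this slide:
    /api/:owner/:repo/spec (JSON Swagger spec),
    /api/:owner/:repo/:operation?p_1=v_1&... (call an operation),
    /api/:owner/:repo/commit/:sha/... (pinned to a Git version)."""
    base = f"{GRLC}/{owner}/{repo}"
    if sha is not None:
        base += f"/commit/{sha}"
    if operation is None:
        return base + "/spec"
    url = f"{base}/{operation}"
    if params:
        url += "?" + urlencode(sorted(params.items()))
    return url

# Per BASIL's convention, a variable such as ?_country in the SPARQL
# query surfaces as an HTTP parameter named 'country' on the API.
print(grlc_url("CLARIAH", "wp4-queries", "houseType_all", page=2))
```

One research question thus maps to one stable, executable URI, which is what makes these links citable in a paper.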
16. Vrije Universiteit Amsterdam
16
SPARQL + RDFa + DUMPS + #LD
• Compatible with most Linked Data access methods
• Loads remote RDFa/dumps in memory
• Uses TPF for #LD servers
• Mixes all these into one homogeneous API
17. Vrije Universiteit Amsterdam
17
PROVENANCE
• Two sources: query history, spec generation
• Uses W3C PROV
• Uses Git2PROV to get query history
• Adds spec provenance at generation time
• Visualizations with PROV-O-Viz (http://provoviz.org/)
18. Vrije Universiteit Amsterdam
18
ENUMERATIONS & DROPDOWNS
• Fills in the swag[paths][op][method][parameters][enum] array
• Uses the triple pattern of the SPARQL query’s BGP against the same SPARQL endpoint
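For example, if a parameter variable appears in the query’s basic graph pattern, its possible values can be enumerated by running that triple pattern against the same endpoint. A sketch (the predicate URI and variable name are illustrative, not from the slides):

```sparql
# Original parameterized query: ?_type becomes an API parameter.
SELECT ?s WHERE {
  ?s <http://example.org/houseType> ?_type .
}

# To populate the parameter's enum/dropdown, the same triple
# pattern is asked for distinct values against the same endpoint:
SELECT DISTINCT ?type WHERE {
  ?s <http://example.org/houseType> ?type .
}
```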
19. Vrije Universiteit Amsterdam
19
CONTENT NEGOTIATION
• API endpoints can now end with .content_type (e.g. grlc.io/CLARIAH/wp-queries/MyQuery.csv)
• Supports .csv, .json, .html (can be extended)
• grlc sets the ‘Accept’ HTTP header and agnostically returns the same ‘Content-Type’ as the SPARQL endpoint
• Up to the SPARQL endpoint to accept it
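The extension-to-Accept mapping described above can be sketched as follows. The exact MIME types grlc sends, and the fallback type, are assumptions for illustration:

```python
# Map a requested file extension to the Accept header sent to the
# SPARQL endpoint; grlc then relays the endpoint's Content-Type as-is.
ACCEPT_BY_EXTENSION = {
    "csv": "text/csv",
    "json": "application/json",
    "html": "text/html",
}

def accept_header(path):
    """Return the Accept header for a request like .../MyQuery.csv."""
    ext = path.rsplit(".", 1)[-1] if "." in path else None
    # Fallback when no known extension is given (the choice of
    # SPARQL JSON results here is an assumption, not from the slides).
    return ACCEPT_BY_EXTENSION.get(ext, "application/sparql-results+json")

print(accept_header("CLARIAH/wp-queries/MyQuery.csv"))
```

Because grlc only forwards the header, whether the format is actually honored remains up to the SPARQL endpoint.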
20. Vrije Universiteit Amsterdam
20
SAME QUESTIONS, DIFFERENT DATASETS
Three ways of separating queries (research questions, SPARQL) from the data (datasets, endpoints):
• Using a grlc-repository endpoint.txt file (OK, but all queries are asked against the same data)
• Using a query-dependent #+ endpoint: decorator (OK, but the endpoint still depends on the query at execution)
• Using a query-dependent HTTP parameter (OK, profit!)
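The first two options correspond to grlc’s decorators, written as comments at the top of a .rq file in the repository. A sketch, where the summary text, endpoint URL, and query body are illustrative (only #+ endpoint: and #+ pagination: appear in these slides):

```sparql
#+ summary: House types in the Dutch historical censuses
#+ endpoint: https://example.org/sparql
#+ pagination: 100

SELECT ?houseType (COUNT(?house) AS ?n) WHERE {
  ?house <http://example.org/houseType> ?houseType .
} GROUP BY ?houseType
```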
21. Vrije Universiteit Amsterdam
21
SAME QUESTIONS, DIFFERENT DATASETS
If our SPARQL query (= research question) is at
http://grlc.io/api/user/repo/query
we can ask it against many different endpoints using
http://grlc.io/api/user/repo/query?endpoint=dataset1
http://grlc.io/api/user/repo/query?endpoint=dataset2
etc.
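A sketch of building these per-endpoint call URLs; the repo path and endpoint URLs are placeholders from the slide, not real datasets:

```python
from urllib.parse import urlencode

QUERY_URL = "http://grlc.io/api/user/repo/query"

def with_endpoint(endpoint):
    """Return the grlc call URL targeting a specific SPARQL endpoint."""
    return QUERY_URL + "?" + urlencode({"endpoint": endpoint})

# The same research question, asked against two different datasets:
for ep in ("https://dataset1.example/sparql", "https://dataset2.example/sparql"):
    print(with_endpoint(ep))
```

This keeps the research question fixed while the data it runs over varies, which is the separation the slide argues for.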
22. Vrije Universiteit Amsterdam
1,551 unique visitors since July 2016
3,251 sessions
58.97% return rate
5 active open source contributors, 31 pull requests
Community of users and developers
22
QUANTITATIVE EVALUATION
23. Vrije Universiteit Amsterdam
> “multiple copies of the same queries in different places
(…) was problematic. grlc allows queries to be
maintained in a single location”
> “with grlc the R code becomes clearer due to the
decoupling with SPARQL; and shorter, since a curl
suffices to retrieve the data”
> “it allows us to manage SPARQL queries separate from
the rest of the API – this enables, for instance, to have
different queries without having to deploy a new version
of the API”
> “we use grlc to provide FAQ for those who would prefer
REST over SPARQL, but also to explore the data”
> “we use grlc to expose the ECAI conference proceedings
not only as Linked Data that can be used by Semantic
Web practitioners, but also as a Web API that web
developers can consume”
> “grlc helps to share, extend and repurpose queries by
providing a URI for the resulting queries and by
supporting collaborative updates of those queries”
23
USE CASES
25. Vrije Universiteit Amsterdam
The need for data links in papers for reproducibility
Dataset citations are not enough
> Combine datasets
> Reduce and transform datasets
We use Linked Open Data and Semantic Web technology in grlc
> Motivates collaborative writing of research questions in SPARQL
> Enables the maintenance and creation of ACTIONABLE DATA LINKS to
combine and transform datasets
> Allows separating queries from data, so the same questions can be asked of
different datasets
Success in multiple domains, including Social Science & History
Open source
http://grlc.io
https://github.com/CLARIAH/grlc
25
CONCLUSIONS
26.
THANK YOU!
DATALEGEND.NET
CLARIAH.NL
26
http://grlc.io/
27. Vrije Universiteit Amsterdam
27
PAGINATION
• Large query results are typically hard for consuming applications to handle
• Split the result into multiple parts (or “pages”)
• Size? #+ pagination: 100
• Navigating pages:
  • rel=next,prev,first,last links in the HTTP headers (GitHub API traversal convention)
  • Extra request parameter ?page (defaults to 1)
~ curl -X GET -H "Accept: text/csv" -I http://localhost:8088/api/CEDAR-project/Queries/houseType_all
HTTP/1.0 200 OK
Content-Type: text/csv; charset=UTF-8
Content-Length: 18447
Server: grlc/1.0.0
Link: <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=next,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last

~ curl -X GET -H "Accept: text/csv" -I http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=3
HTTP/1.0 200 OK
Content-Type: text/csv; charset=UTF-8
Content-Length: 18142
Server: grlc/1.0.0
Link: <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=4>; rel=next,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=prev,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=1>; rel=first,
      <http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last
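A client can walk these pages by reading the Link header. A minimal sketch of parsing it into a rel → URL map (not grlc code, just an illustration of consuming the convention):

```python
import re

def parse_link_header(value):
    """Parse an HTTP Link header such as
    '<http://.../q?page=2>; rel=next, <http://.../q?page=889>; rel=last'
    into a dict mapping relation names (next, prev, first, last) to URLs."""
    links = {}
    for url, rel in re.findall(r'<([^>]+)>\s*;\s*rel="?(\w+)"?', value):
        links[rel] = url
    return links

header = ('<http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=2>; rel=next, '
          '<http://localhost:8088/api/CEDAR-project/Queries/houseType_all?page=889>; rel=last')
print(parse_link_header(header)["next"])
```

Following rel=next until it disappears retrieves the full result set page by page.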
28. Vrije Universiteit Amsterdam
28
DOCKER CONTAINER
• Uses Docker
• Infrastructure-independent install
• Bundles (composes) all required packages (Python, Python libs, grlc, nginx); can easily be extended with more
• Publicly available at hub.docker.com
• One-command server deploy: docker pull clariah/grlc
SPEAKER NOTES
“A way of studying the connections of the past through connections in the present…”
Talk about how all these social science history related projects brought interesting new ideas to Semantic Web research
SSH can benefit from these results too, from a methodological point of view
Links like the ones we use on the Web: HTTP resources (like web pages)
If the Web is really about variety, both methods should be allowed, empowered and freely exchangeable…
Great for attribution and tracking of use
But to what extent does it enable replication?
Not enough: whole tracks in conferences are devoted to replication studies and negative results
Why does this happen?
Instructions on how to merge them? How to clean them? Outliers? Suited for the purpose of the study?
Critical in social history -> combining datasets is mandatory for interdisciplinary studies (demography, economics, history of work, etc.)
“Projections”, “aggregations” and other kinds of queries and transformations that we do before analysis