Research institutions, governments and sometimes even industry are promoting ways of publishing data that conform to principles of openness, such as being Findable, Accessible, Interoperable and Reusable (FAIR).
These principles can be adhered to in a multitude of ways: Linked Open Data is one of them; it is favoured by scientific communities, but its adoption is not limited to research contexts. In this talk I will provide an account of how my research projects enjoyed the benefits of being on either side of the FAIR data supply chain.
1. FAIR Linked Data
Publishing them, using them, and
why it doesn’t take a giant to do it.
Alessandro Adamou
Open Scholarship Week 2021
2. WHOAMI
Digital Humanities Scientist at
Bibliotheca Hertziana - Max Planck Institute for Art History
prior: Research Fellow at Data Science Institute, NUI Galway
Computer scientist background, eventually chose DH as
application domain
Research projects in multiple domains:
- Education (academic and informal), music history,
eGov, smart cities, literature, industry 4.0
My roles in those projects involved:
- Creating data myself
- Cataloguing/integrating data by others
3. MSc degree studies (until 2007):
Was initially introduced to structured data
- Storing in a relational DBMS (MySQL, MSAccess) or
native XML database (basex, eXistDB)
- Publishing: XML and not much more than that
- No training in how to design good data schemas,
other than RDBMS optimisation
Otherwise, was perfectly happy with calling “data” the
tables in an HTML page, an Excel sheet, or even text in a
PDF or Word document.
First contact with a Data Paradigm
4. Rationales for hierarchical data
● Both are built according to somewhat sensible
rationales
● Neither way is standard or conformant to a shared
logical paradigm.
● Just good enough for a thesaurus
5. Masters and Doctoral theses (2008-12):
Machine-readable data (mostly structured data with
annotations)
- Learned about triple stores but didn’t use them (more
on that later)
- Publishing: RDF format and the many ways to
serialise it: XML, JSON, Turtle
- The world of “good data schemas” opens to me:
ontologies and semantics
Evolution of the Data Paradigm
6. No longer just hierarchies!
● Hierarchy is only taxonomical: what is
of one type is also of the type above
● I can make up as many types of relation
as I want
[Diagram]
Data: Galway locatedIn Connacht; Connacht locatedIn Ireland
Galway a City; Connacht a Region; Ireland a Country
ONTOLOGY: City, Settlement, Region, Country, Place; (locatedIn) also relates the types
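The Galway example can be sketched as a handful of subject-predicate-object triples. This is only an illustration in plain Python, not a triple store: the exact type hierarchy (City under Settlement under Place, Region and Country under Place) and the decision to follow locatedIn transitively are assumptions made for the sketch, since the slide only names the types and relations.

```python
# Triples as (subject, predicate, object) tuples, taken from the slide.
DATA = {
    ("Galway", "a", "City"),
    ("Connacht", "a", "Region"),
    ("Ireland", "a", "Country"),
    ("Galway", "locatedIn", "Connacht"),
    ("Connacht", "locatedIn", "Ireland"),
}

# Assumed taxonomy: each type maps to the type directly above it.
SUBCLASS_OF = {
    "City": "Settlement",
    "Settlement": "Place",
    "Region": "Place",
    "Country": "Place",
}


def types_of(thing):
    """Return the asserted types of a thing plus every type above them."""
    found = set()
    for subject, predicate, obj in DATA:
        if subject == thing and predicate == "a":
            current = obj
            while current is not None:
                found.add(current)
                current = SUBCLASS_OF.get(current)
    return found


def located_in(thing):
    """Follow locatedIn links step by step (assumption: treated as transitive)."""
    reached = []
    frontier = [thing]
    while frontier:
        node = frontier.pop()
        for subject, predicate, obj in DATA:
            if subject == node and predicate == "locatedIn" and obj not in reached:
                reached.append(obj)
                frontier.append(obj)
    return reached
```

With these definitions, `types_of("Galway")` walks up the hierarchy, while `located_in("Galway")` uses a relation that has nothing to do with the hierarchy at all.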
7. Linked Data
“Like the web of hypertext, the
web of data is constructed with
documents on the web. However,
unlike the web of hypertext,
where links are relationships
anchors in hypertext documents
written in HTML, for data they are
links between arbitrary things.”
- Tim Berners-Lee, 2006
Berners-Lee, T. “Linked Data: Design Issues” (2006).
https://www.w3.org/DesignIssues/LinkedData.html
2006
8. Linked Data
● Not a format, but:
○ a set of principles and recommendations,
○ for all of which, Web standards and technologies exist
Berners-Lee, T. “Linked Data: Design Issues” (2006). https://www.w3.org/DesignIssues/LinkedData.html
PRINCIPLE | STANDARD | TECHNOLOGY
1. Have a system of identifiers as names for all things in your data (people, artworks, time periods, books, events…) | URI | Some custom code (for convention); or: Linked Data API implementations
2. Make it possible to look up those names | HTTP URI | Web/application servers (Apache, Jetty, Nginx...)
3. When looked up, the information returned is formatted using standards. | RDF, SPARQL | Triple stores / Graph databases with SPARQL servers; Client programming libraries
4. This information contains links to other things, using the same system of identifiers. | External URIs | External LD services (e.g. Wikidata)
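Principles 2 and 3 boil down to an ordinary HTTP request that asks for a standard RDF serialisation instead of a Web page. A minimal sketch with the Python standard library follows; it assumes the server honours content negotiation through the Accept header, and the Wikidata entity URI is only an illustrative choice. The function names are made up for this example.

```python
from urllib.request import Request, urlopen


def build_lookup(uri, media_type="text/turtle"):
    """Prepare a GET request that asks for an RDF serialisation of `uri`."""
    return Request(uri, headers={"Accept": media_type})


def look_up(uri, media_type="text/turtle"):
    """Send the request and return the body as text (needs network access)."""
    with urlopen(build_lookup(uri, media_type), timeout=30) as response:
        return response.read().decode("utf-8")


# Illustrative identifier: the Wikidata entity for Douglas Adams.
request = build_lookup("http://www.wikidata.org/entity/Q42")
```

Calling `look_up("http://www.wikidata.org/entity/Q42")` would then fetch the description of that thing, which in turn contains links to other URIs (principle 4).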
9. Post-doctoral work (2013-):
Linked data
- Shared conventions encompass not only the data
schemas (ontologies), but also the data elements!
- Storage: Graph Databases - triple stores and more
(Virtuoso, Jena, GraphDB, Neo4J...)
- Publishing: RDF format + public query endpoint
(SPARQL language)
Evolution of the Data Paradigm
10. Many ways to publish Linked Data
Embedded inside HTML pages.
Serialise to XML, JSON or
plaintext and provide a
downloadable data dump.
Publish a Web service for users
to query with the SPARQL
language.
Make a URI point to RDF
snippets that describe that
thing in your data.
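Two of these options, the downloadable dump and the data embedded in an HTML page, can be sketched in a few lines. This is a simplified illustration only: the `https://example.org/` namespace and both function names are hypothetical, and every term is assumed to live in that single namespace.

```python
import json

BASE = "https://example.org/"  # hypothetical namespace for this sketch


def to_ntriples(triples):
    """Serialise (subject, predicate, object) names as N-Triples lines."""
    return "\n".join(
        f"<{BASE}{s}> <{BASE}{p}> <{BASE}{o}> ." for s, p, o in triples
    )


def to_embedded_jsonld(subject, predicate, obj):
    """Wrap one statement as JSON-LD inside an HTML script element."""
    document = {
        "@context": {"@vocab": BASE},
        "@id": BASE + subject,
        predicate: {"@id": BASE + obj},
    }
    return (
        '<script type="application/ld+json">'
        + json.dumps(document)
        + "</script>"
    )


dump = to_ntriples([("Galway", "locatedIn", "Connacht")])
snippet = to_embedded_jsonld("Galway", "locatedIn", "Connacht")
```

The dump can be offered as a plain file, while the snippet can be pasted into the page that describes Galway.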
12. The Open Data stars
Open and on the Web
“Make your stuff available on the Web (whatever format) under an open license.”
Machine-readable
“Make it available as structured data (e.g., Excel instead of image scan of a table).”
Open format
“Make it available in a non-proprietary open format (e.g. CSV instead of Excel).”
URIs for everything
“Use URIs to denote things, so that people can point at your stuff.”
Linking
“Link your data to other data to provide context.” → Linked Open Data
★★★★★
A 5-Star deployment scheme for your data.
★★★★
★★★
★★
★
Source: “5 ★ Open Data”. https://5stardata.info/
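Because each star builds on the ones below it, rating a dataset is a matter of counting criteria in order until one fails. The sketch below illustrates that reading; the criterion names are invented labels for the five levels, not part of the 5★ scheme itself.

```python
# Invented labels for the five levels, lowest first.
CRITERIA = [
    "open_licence_on_web",
    "machine_readable",
    "open_format",
    "uses_uris",
    "linked",
]


def open_data_stars(**flags):
    """Count stars cumulatively: stop at the first criterion not met."""
    stars = 0
    for criterion in CRITERIA:
        if not flags.get(criterion, False):
            break
        stars += 1
    return stars


# An openly licensed Excel sheet on the Web: structured, but proprietary.
excel_sheet = open_data_stars(open_licence_on_web=True, machine_readable=True)
```

Under this reading, a dataset that links to other data but uses a proprietary format still stops at two stars.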
14. ● Linked (Open) Data are a set of principles born from questions arising within
the scientific community.
○ Several of the resulting technologies were born here at NUI Galway 😊
● They are standardised at a scholarly/technical level (W3C)...
● but not at a political level (European Union, G20)
● Legitimation of LOD by policy makers had not been sought, but a reference
framework to evaluate LOD against was still missing...
The scholarly-political gap
15. 2016
“What constitutes ‘good data
management’ is largely undefined,
and is generally left as a decision
for the data or repository owner.
Therefore, bringing some clarity
around the goals and desiderata of
good data management and
stewardship [...] would be of great
utility.”
- Mark D. Wilkinson et al, 2016
Findable, Accessible, Interoperable, Reusable
16. FAIR Principles
The first set of guidelines to have gained support from policy-makers.
★ Findable
Data are assigned a globally unique and persistent identifier and indexed in
a searchable source.
★ Accessible
Standard and open communications protocol; metadata always available.
★ Interoperable
Shared knowledge representation language. Data schemas are also FAIR.
★ Reusable
Data are richly described, with a clear license and provenance information.
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and
stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).
17. 5★ data as a FAIR implementation
[Table: 5★ levels (Online - Open, Machine-readable, Non-proprietary, Standards, Linked) marked against the FAIR principles (Findable, Accessible, Interoperable, Reusable)]
18. What you need to publish 5★ data
- Manage a Web domain/host to publish HTTP URIs with
- e.g. bnb.data.bl.uk , wikidata.org …
- Storage: a Triple store / Graph database with SPARQL
- Many open source or free-to-use solutions (e.g. Jena,
Virtuoso, GraphDB, Blazegraph)
- CPU power to handle user queries
- Varies, but affordable hosted solutions
- Programming skills: only some Web
development (HTML+JS+CSS, Web apps)
- Or, CMS and Linked Data frameworks
Data stores
Many triple stores and graph databases come in an open source
“community edition” with limited performance capabilities, and an
“uncapped” proprietary edition.
19. What you need to use 5★ data
- Human consumption: a Web browser is enough
- Especially if the linked data are embedded in pages
- Developers:
- The curl program alone can do wonders!
- Client libraries available for many
languages (Java, Python, NodeJS...)
- But even without them, HTTP client
libraries will do the job just fine!
- Core resource is network traffic
Did you know?
Search engines like Google use Linked Data to display “rich”
interactive search result entries and the info boxes that often
appear to the right of search results.
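As an illustration of the point about plain HTTP client libraries, here is a sketch that queries a SPARQL endpoint using only the Python standard library. It assumes the endpoint accepts GET requests with a `query` parameter and can return the standard SPARQL JSON results format; the Wikidata endpoint is used as a default example, and the function names and User-Agent string are made up for this sketch.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_sparql_request(query, endpoint=WIKIDATA_ENDPOINT):
    """Prepare a GET request carrying the query and asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        "User-Agent": "linked-data-slides-example/0.1",  # made-up client name
    }
    return Request(url, headers=headers)


def parse_bindings(json_text):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    results = json.loads(json_text)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in results["results"]["bindings"]
    ]


def run_query(query, endpoint=WIKIDATA_ENDPOINT):
    """Send the query and return the parsed rows (needs network access)."""
    with urlopen(build_sparql_request(query, endpoint), timeout=30) as response:
        return parse_bindings(response.read().decode("utf-8"))


request = build_sparql_request("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")
```

Calling `run_query(...)` with the same query string would return up to five rows, each a dictionary keyed by `s`, `p` and `o`.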
20. An emerging new paradigm
Scholars in cultural heritage and digital humanities
are among the most vocal in lamenting that LOD
principles do not address every common issue.
In particular, while understanding the principles may
be easy, dealing with FAIR LOD may not be!
Conforming data to scholarly recommendations does
not necessarily entail visibility and uptake!
22. Linked Open Usable Data (LOUD)
2018
“If our data isn’t used, then no
value is gained from the
resources that were invested in
its creation, publication,
maintenance and improvement.
If we want our data to be used,
then it needs to be usable.”
- Robert Sanderson, 2018
23. The LOUD stars
Not quite (yet) as formalised as open data stars.
★ The right abstraction for the audience
“Some use cases and requirements should drive
the interoperability layer between systems.”
★ Few barriers to entry
“If it takes time to understand the model, [...]
query syntax and so forth, then developers will
look for easier targets.”
24. The LOUD stars
★ Comprehensible by introspection
“Data should be understandable by looking at it,
rather than requiring the developer to read the
ontology and vocabularies.”
★ Documentation with working examples
“Documentation clarifies the patterns that the
developer can expect to encounter, such that they
can implement robustly.”
★ Few exceptions, many consistent patterns
“While not everything is homogenous, a set of
patterns that manage exceptions well is better than
many custom fields.”
25. The Listening Experience Database,
https://led.kmi.open.ac.uk
➢ A music history catalogue
➢ Developed since 2013
- (i.e. LOD but before FAIR)
➢ Linked Data features
- Data license is CC BY-NC-SA
- Data documentation page
- External query endpoint
- Links to MusicBrainz and more
- Embedded data in pages
➢ How many Open Data stars?
➢ How FAIR? How LOUD?
A born-linked Humanities data project
26. Just like Napalm Death music:
To be FAIR, it must be LOUD!
Image: Sven Mandel / CC-BY-SA-4.0