The LOD Gateway: Open Source Infrastructure for Linked Data

CIDOC-CRM 2023, Mexico City
David Newbury
Assistant Director, Software and User Experience, Getty

Hi! I’m David.
I lead the software and user experience teams at Getty.
Getty is a big museum/research hub in Los Angeles. We do lots of things with
data.
All of the actual work here was done by my fabulously talented team. I just talk.
And we’re not Getty Images. Same rich family, same last name, no connection.
Introduction

Getty has been doing Linked Data since 2014,
starting with the Getty Vocabularies.
It’s a collection of concepts, people, and places
deeply relevant to the study of art and
architecture.
Getty’s Linked Data: Getty Vocabularies

Since then, we’ve moved most of our major
systems to use Linked Data—including our
archives…
Getty’s Linked Data: Archival Records

… and our museum collection.
Getty’s Linked Data: Archival Records

We’ve also built a complex, powerful
infrastructure to support doing this across
our application landscape.
It’s been fun. We’ve learned a lot.
Getty’s Linked Data: APIs

Behind the scenes, all of these applications
are powered by a utility called The LOD
Gateway.
We’ve recently open-sourced this tool, and I’d
like to share it with you today.
Getty’s Linked Data: The LOD Gateway

This API system was designed to help Getty
manage one of the fundamental complications
that comes with using Linked Data:
Graphs vs. Documents.

Let’s take a basic JSON-LD record:
{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "object/1",
  "type": "HumanMadeObject",
  "identified_by": {
    "id": "object/1/name",
    "type": "Name",
    "content": "Irises"
  },
  "produced_by": {
    "id": "object/1/production",
    "carried_out_by": {"id": "person/1"}
  }
}

And a second, related one:
{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "person/1",
  "type": "Person",
  "identified_by": {
    "id": "person/1/name",
    "type": "Name",
    "content": "Vincent Van Gogh"
  }
}

These could be seen as two separate documents:
[Figure: the person/1 record and the object/1 record, shown as two separate JSON-LD documents.]

Or as a single graph.

From the point of view of the data, these
two structures are equivalent—they contain
the same facts.
But from a usability perspective, they make
different things easy or hard.
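
One way to see the equivalence is to flatten both documents into subject-predicate-object triples and merge them. The sketch below is a deliberate simplification: it assumes every node carries an "id", it ignores the JSON-LD @context (a real system would use a JSON-LD processor), and the helper name to_triples is hypothetical.

```python
def to_triples(node):
    """Flatten a nested JSON-LD-like dict into (subject, predicate, object) triples.

    Simplifying assumptions: every dict has an "id", and "@context" is ignored.
    """
    triples = set()
    subject = node["id"]
    for key, value in node.items():
        if key in ("id", "@context"):
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # A nested node: link to it, then flatten its own properties.
                triples.add((subject, key, item["id"]))
                triples |= to_triples(item)
            else:
                triples.add((subject, key, item))
    return triples


irises = {
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "id": "object/1",
    "type": "HumanMadeObject",
    "identified_by": {"id": "object/1/name", "type": "Name", "content": "Irises"},
    "produced_by": {
        "id": "object/1/production",
        "carried_out_by": {"id": "person/1"},
    },
}

van_gogh = {
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "id": "person/1",
    "type": "Person",
    "identified_by": {
        "id": "person/1/name",
        "type": "Name",
        "content": "Vincent Van Gogh",
    },
}

# Two documents in, one graph out: the same facts, no document boundaries.
graph = to_triples(irises) | to_triples(van_gogh)
```

The triple that links the production to person/1 is what joins the two documents into one connected graph.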
[Figure: the person/1 and object/1 JSON-LD records from the previous slides.]

Documents are optimized for access:
They provide a specific set of data, bundled
together by the data creator, that provides all
the facts you need…given a specific context.
Documents: For Access and Discovery
[Figure: the object/1 record as a single JSON-LD document.]

Graphs, alternatively, are optimized for querying:
They allow a user to define a specific context based
on novel criteria of interest, and return that
subset of facts.
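
As a rough illustration of what "querying" means here, the sketch below treats a query as pattern matching over triples, with the person asking deciding which facts belong together. The triple set, the short predicate names, and the helpers match and names_of_works_by are all assumptions made for the example; a real triplestore would answer this with SPARQL.

```python
# A tiny in-memory graph using the two example records (short names, no @context).
TRIPLES = {
    ("object/1", "type", "HumanMadeObject"),
    ("object/1", "identified_by", "object/1/name"),
    ("object/1/name", "content", "Irises"),
    ("object/1", "produced_by", "object/1/production"),
    ("object/1/production", "carried_out_by", "person/1"),
    ("person/1", "type", "Person"),
    ("person/1", "identified_by", "person/1/name"),
    ("person/1/name", "content", "Vincent Van Gogh"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def names_of_works_by(triples, person_id):
    """Follow person -> production -> object -> name -> text across the graph."""
    names = []
    for production, _, _ in match(triples, p="carried_out_by", o=person_id):
        for work, _, _ in match(triples, p="produced_by", o=production):
            for _, _, name_node in match(triples, s=work, p="identified_by"):
                for _, _, text in match(triples, s=name_node, p="content"):
                    names.append(text)
    return sorted(names)


works = names_of_works_by(TRIPLES, "person/1")
```

No document was designed to answer "what did person/1 make?", but the graph can answer it because the path between the facts exists.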
Graphs: For Queries

“What objects does Getty have that have images larger than
1200px on the longest side that have been exhibited in both New
York and Paris and were created by artists who lived before 1850?”
and
“What is the tombstone data about Irises?”
Imagine two Questions:

At the Getty, we have never asked:
“What objects does Getty have that have images larger than
1200px on the longest side that have been exhibited in both New
York and Paris and were created by artists who lived before 1850?”
…but we ask
“What is the tombstone data about Irises?”
several thousand times a day.
Imagine two Questions:

Having an interface for documents lets us
provide a simple, easily understandable
record that maps well to known contexts.
This is important, because people usually
expect these contexts. It makes answering
common questions simple.

It also maps nicely to the sort of affordances
that work well on the internet—REST APIs,
cache control, JSON documents, webpages.
This is also important, because using these
well-known systems helps us make our
systems fast and easy to build.

Research is different—each scholar brings
their own question and their own context.
Meeting their need means empowering
them to draw their own boundaries within
the data.
Graphs: For Asking Questions

Doing so is complex: it moves the burden of
defining the relevant context to the user of
the data, not the creator of the data.
But it makes asking new questions possible,
even if it might be inefficient or complicated.
Graphs: For Asking Questions

The LOD Gateway is a tool designed to
allow for both use cases.
It allows you to create, update, and delete
JSON-LD documents, and behind the
scenes it will keep a triplestore in sync with
those changes.
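
The idea of keeping two views in sync can be sketched in a few lines. This is not the LOD Gateway's implementation: the class SyncedStore, the helper flatten, and the use of one triple set per document (loosely like a named graph per record) are assumptions chosen to keep the example small and in memory.

```python
def flatten(node):
    """Turn a nested JSON-LD-like dict into a set of (s, p, o) triples.

    Assumes every dict has an "id"; "@context" is ignored.
    """
    triples = set()
    for key, value in node.items():
        if key in ("id", "@context"):
            continue
        for item in value if isinstance(value, list) else [value]:
            if isinstance(item, dict):
                triples.add((node["id"], key, item["id"]))
                triples |= flatten(item)
            else:
                triples.add((node["id"], key, item))
    return triples


class SyncedStore:
    """Document store whose graph view is rebuilt on every write."""

    def __init__(self):
        self.documents = {}  # document id -> JSON document
        self.graphs = {}     # document id -> triples derived from that document

    def put(self, doc):
        # Create and update are the same operation: replace document and triples.
        self.documents[doc["id"]] = doc
        self.graphs[doc["id"]] = flatten(doc)

    def delete(self, doc_id):
        self.documents.pop(doc_id, None)
        self.graphs.pop(doc_id, None)

    def triples(self):
        """The graph view: the union of every document's triples."""
        return set().union(*self.graphs.values())


store = SyncedStore()
name = {"id": "object/1/name", "type": "Name", "content": "Irises"}
store.put({"id": "object/1", "type": "HumanMadeObject", "identified_by": name})
after_create = store.triples()

# Hypothetical edit to the title, purely for illustration.
renamed = dict(name, content="Irises (revised title)")
store.put({"id": "object/1", "type": "HumanMadeObject", "identified_by": renamed})
after_update = store.triples()

store.delete("object/1")
after_delete = store.triples()
```

Because the triples for a document are replaced as a unit, stale facts from the old version never linger in the graph view.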
Meeting Both Needs

This works for Linked.Art records, IIIF
Manifests, Web Annotations: any JSON-LD
document.
If you POST it to the LOD Gateway, that
record will be available at the URL defined
in that document’s id property.
Meeting Both Needs

It’s also RDF-aware: If there are nested children included in the main document,
it automatically makes those dereferenceable, too.
LOD Gateway: RDF Aware
https://example.com/object/1
https://example.com/object/1/identifier/1
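
A minimal sketch of how nested nodes could each get their own URL, assuming relative ids are resolved against a base of https://example.com/. The function index_nodes, the rule for skipping bare references, and the identifier value are all made up for illustration and are not taken from the LOD Gateway code.

```python
from urllib.parse import urljoin

BASE = "https://example.com/"  # assumed base URL for relative ids


def index_nodes(node, index=None):
    """Map the URL of every identified node, at any depth, to that node."""
    if index is None:
        index = {}
    if isinstance(node, dict):
        # Skip bare references such as {"id": "person/1"}: they carry no data
        # of their own and point at a record stored elsewhere.
        if "id" in node and len(node) > 1:
            index[urljoin(BASE, node["id"])] = node
        for value in node.values():
            index_nodes(value, index)
    elif isinstance(node, list):
        for item in node:
            index_nodes(item, index)
    return index


record = {
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "id": "object/1",
    "type": "HumanMadeObject",
    "identified_by": [
        {
            "id": "object/1/identifier/1",
            "type": "Identifier",
            "content": "placeholder-number",  # made-up value
        }
    ],
    "produced_by": {
        "id": "object/1/production",
        "carried_out_by": {"id": "person/1"},
    },
}

index = index_nodes(record)
urls = sorted(index)
```

Looking up https://example.com/object/1/identifier/1 in the index returns just the nested identifier, which is the behavior the slide describes.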

You can also request documents in other RDF formats.
LOD Gateway: RDF Aware
https://example.com/object/1?format=turtle

It also provides both a full SPARQL API and
an embedded GUI for testing queries.
It can be configured to use any SPARQL
triplestore; we use Fuseki in testing and
Amazon Neptune in production.
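
For a sense of what a query against such an endpoint could look like, here is a sketch that builds a request URL in the style of the SPARQL 1.1 Protocol, where the query travels in a `query` parameter. The endpoint path /sparql and the absolute person URI are assumptions, and the CIDOC-CRM predicates reflect my reading of how the Linked Art context maps identified_by, produced_by, carried_out_by, and content. Nothing is sent over the network.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://example.com/sparql"  # assumed endpoint path

# "What are the titles of objects produced by person/1?"
QUERY = """
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
SELECT ?object ?title WHERE {
  ?object crm:P108i_was_produced_by ?production ;
          crm:P1_is_identified_by ?name .
  ?production crm:P14_carried_out_by <https://example.com/person/1> .
  ?name crm:P190_has_symbolic_content ?title .
}
"""


def sparql_get_url(endpoint, query):
    """Build a GET URL that carries the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = sparql_get_url(SPARQL_ENDPOINT, QUERY)
```

The resulting URL can be pasted into a browser or fetched with any HTTP client that asks for SPARQL JSON results.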
LOD Gateway: SPARQL-Enabled

This flexibility makes it simple to write and retrieve
data in a form that matches your primary use case,
but still allows you the flexibility to go beyond
that (either for research or for unexpected
features) without needing to rewrite your API.
Two Views: One Set of Facts

You can also configure it to run without the
RDF integration as a JSON document store.
We do this all the time, because of another
feature of the LOD Gateway: Change Logs!
LOD Gateway: now SPARQL-Free

The third critical use for our data is synchronization across systems.
An editor changes a record, which means the API needs to be updated, which means the website
needs to be updated, and the search interfaces, and third-party systems…
LOD Gateway: Tracking Changes

Every time you create, update, or delete a
record in the LOD Gateway, it adds an entry
to an Activity Stream.
This lets a consuming system identify
only the records that have changed
since it last synced.

You can do this for the whole dataset, for a
given entity type, or even for a single entity.
This happens automatically, every time you
update a record in the LOD Gateway.
It’s even smart enough to not generate a
change event if the data didn’t change.
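
The behavior described above, including skipping the event when nothing changed, can be sketched with a checksum comparison. The class ChangeLog, the use of SHA-256 over canonical JSON, and the integer sequence numbers are assumptions for illustration; the talk does not say how the LOD Gateway detects unchanged data.

```python
import hashlib
import json


class ChangeLog:
    """Append-only log of Create / Update / Delete events for JSON records."""

    def __init__(self):
        self.checksums = {}   # record id -> checksum of the latest version
        self.activities = []  # (sequence number, activity type, record id)

    @staticmethod
    def _checksum(doc):
        # Sort keys so that key order alone never counts as a change.
        canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def _log(self, activity_type, record_id):
        self.activities.append((len(self.activities) + 1, activity_type, record_id))

    def save(self, doc):
        """Store a record; return True only if an event was logged."""
        checksum = self._checksum(doc)
        previous = self.checksums.get(doc["id"])
        if previous == checksum:
            return False  # identical data: no change event
        self.checksums[doc["id"]] = checksum
        self._log("Create" if previous is None else "Update", doc["id"])
        return True

    def delete(self, record_id):
        if self.checksums.pop(record_id, None) is not None:
            self._log("Delete", record_id)

    def changed_since(self, sequence):
        """Ids of records touched after the given sequence number."""
        return sorted({rid for seq, _, rid in self.activities if seq > sequence})


log = ChangeLog()
first = log.save({"id": "object/1", "type": "HumanMadeObject"})
repeat = log.save({"type": "HumanMadeObject", "id": "object/1"})  # same data
edited = log.save({"id": "object/1", "type": "HumanMadeObject", "label": "Irises"})
log.save({"id": "person/1", "type": "Person"})
pending = log.changed_since(1)  # a consumer that last synced at event 1
```

A consumer only needs to remember the last sequence number it processed to find out what it has to refetch.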

These change logs follow the
W3C Activity Streams standard and are
implemented using the patterns from the
IIIF Change Discovery API.
Using standards makes it easy for external
consumers to build integrations against
these flows.
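
A consumer built on these patterns typically walks the stream from the newest page backward and stops once it reaches activities it has already seen. The sketch below follows the general shape of the IIIF Change Discovery API (OrderedCollectionPage, orderedItems, prev, endTime), but the page ids, timestamps, and records are invented, and the pages live in a dict rather than being fetched over HTTP.

```python
OBJECT_1 = {"id": "https://example.com/object/1", "type": "HumanMadeObject"}
PERSON_1 = {"id": "https://example.com/person/1", "type": "Person"}

# Two pages of an activity stream, oldest activities first within each page.
PAGES = {
    "page/1": {
        "type": "OrderedCollectionPage",
        "orderedItems": [
            {"type": "Create", "endTime": "2023-01-01T00:00:00Z", "object": OBJECT_1},
            {"type": "Create", "endTime": "2023-01-02T00:00:00Z", "object": PERSON_1},
        ],
    },
    "page/2": {
        "type": "OrderedCollectionPage",
        "prev": {"id": "page/1", "type": "OrderedCollectionPage"},
        "orderedItems": [
            {"type": "Update", "endTime": "2023-02-01T00:00:00Z", "object": OBJECT_1},
            {"type": "Delete", "endTime": "2023-02-02T00:00:00Z", "object": PERSON_1},
        ],
    },
}


def changes_since(pages, last_page_id, last_sync):
    """Return {record URL: most recent activity type} for activity after last_sync.

    Timestamps are compared as strings, which is only safe because they all
    share the same UTC "Z" format.
    """
    latest = {}
    page_id = last_page_id
    while page_id is not None:
        page = pages[page_id]
        for activity in reversed(page["orderedItems"]):
            if activity["endTime"] <= last_sync:
                return latest  # everything older was handled by the last sync
            # Walking newest-first, so the first activity seen per record wins.
            latest.setdefault(activity["object"]["id"], activity["type"])
        page_id = page.get("prev", {}).get("id")
    return latest


todo = changes_since(PAGES, "page/2", "2023-01-15T00:00:00Z")
```

Here the consumer learns that object/1 must be refetched and person/1 removed, without reading page/1 past its newest entry.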
LOD Gateway: ActivityStreams and Standards

The change log only describes which
records changed. But for some kinds of
data, it's valuable to also be able to see what
has changed over time for a given record.
To do so, the LOD Gateway also supports
Memento, the standard underneath the
Internet Archive.

This feature lets you automatically open older
versions of the record—providing an audit log
and the ability for scholars to understand
how knowledge changes over time.
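
In Memento (RFC 7089), a client asks for a past state of a resource by sending an Accept-Datetime header, and the server picks the version that was current at that moment. The sketch below shows both halves in memory; the version history, its dates, and the earlier title are hypothetical, and the function names are mine rather than the LOD Gateway's.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(moment):
    """Format a timezone-aware datetime as an Accept-Datetime header value."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def version_at(versions, moment):
    """Pick the newest version saved at or before `moment`.

    `versions` is a list of (saved_at, document) tuples sorted by saved_at.
    """
    timestamps = [saved_at for saved_at, _ in versions]
    position = bisect_right(timestamps, moment)
    if position == 0:
        return None  # the record did not exist yet
    return versions[position - 1][1]


# Hypothetical history of one name record.
VERSIONS = [
    (datetime(2020, 1, 1, tzinfo=timezone.utc),
     {"id": "object/1/name", "content": "Irises (working title)"}),
    (datetime(2022, 6, 1, tzinfo=timezone.utc),
     {"id": "object/1/name", "content": "Irises"}),
]

asked_for = datetime(2021, 3, 15, tzinfo=timezone.utc)
header_value = accept_datetime(asked_for)
snapshot = version_at(VERSIONS, asked_for)
```

A request dated between the two saves gets the earlier version back, which is what lets a scholar see what the record said at a given time.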

How do we use this?
How can you use this?

Getty’s Data Infrastructure: Managing Complexity

Getty’s Data Infrastructure: 14 Instances

One tool, many needs.
Building this tool has let a small team support 14 different APIs—and put in place
new ones whenever we need.
Our smallest instance is 250 records. Our largest is over 1 million.
LOD Gateway: Consistent Patterns, Consistent Tools

Critical Infrastructure.
The only way we’ve built what we have is using this tool.
Every research tool, every API.

And now you can, too.
As of today, we’ve released this tool as open source software under the BSD-3
license.
https://github.com/thegetty/lod-gateway

This is a “Third System”:
This is heavily tested infrastructure, built because we have made so many
mistakes.
It’s not perfect, but our hope is that it helps you avoid at least the mistakes we know
about, and allows the brilliant modeling ecosystem CIDOC builds to be used in
production by others around the world.
LOD Gateway: Built on top of our mistakes

Thank you!
Find me or ask me questions at:
dnewbury@getty.edu
