Open Source Infrastructure for Linked Data
CIDOC-CRM 2023, Mexico City
The LOD Gateway
David Newbury
Assistant Director, Software and User Experience, Getty
Hi! I’m David.
I lead the software and user experience teams at Getty.
Getty is a big museum/research hub in Los Angeles. We do lots of things with
data.
All of the actual work here was done by my fabulously talented team. I just talk.
And we’re not Getty Images. Same rich family, same last name, no connection.
2
Introduction
Getty has been doing Linked Data since 2014,
starting with the Getty Vocabularies.
It’s a collection of concepts, people, and places
deeply relevant to the study of art and
architecture.
3
Getty’s Linked Data: Getty Vocabularies
Since then, we’ve moved most of our major
systems to use Linked Data—including our
archives…
4
Getty’s Linked Data: Archival Records
… and our museum collection.
5
Getty’s Linked Data: Archival Records
We’ve also built a complex, powerful
infrastructure to support doing this across
our application landscape.
It’s been fun. We’ve learned a lot.
6
Getty’s Linked Data: APIs
Behind the scenes, all of these applications
are powered by a utility called The LOD
Gateway.
We’ve recently open-sourced this tool, and I’d
like to share it with you today.
7
Getty’s Linked Data: The LOD Gateway
This API system was designed to help Getty
manage one of the fundamental complications
that comes with using Linked Data:
Graphs vs. Documents.
8
Getty’s Linked Data: The LOD Gateway
Let’s take a basic JSON-LD record:
{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "object/1",
  "type": "HumanMadeObject",
  "identified_by": {
    "id": "object/1/name",
    "type": "Name",
    "content": "Irises"
  },
  "produced_by": {
    "id": "object/1/production",
    "carried_out_by": {"id": "person/1"}
  }
}
9
Getty’s Linked Data: The LOD Gateway
And a second, related one:
{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "person/1",
  "type": "Person",
  "identified_by": {
    "id": "person/1/name",
    "type": "Name",
    "content": "Vincent Van Gogh"
  }
}
10
Getty’s Linked Data: The LOD Gateway
These could be seen as two separate documents:
11
Getty’s Linked Data: The LOD Gateway
{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "person/1",
  "type": "Person",
  "identified_by": {
    "id": "person/1/name",
    "type": "Name",
    "content": "Vincent Van Gogh"
  }
}

{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "object/1",
  "type": "HumanMadeObject",
  "identified_by": {
    "id": "object/1/name",
    "type": "Name",
    "content": "Irises"
  },
  "produced_by": {
    "id": "object/1/production",
    "carried_out_by": {"id": "person/1"}
  }
}
Or as a single graph.
12
Getty’s Linked Data: The LOD Gateway
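To make "a single graph" concrete, here is a minimal Python sketch that flattens the two records above into one set of subject/predicate/object triples. It is a simplification under stated assumptions: the helper name `to_triples` is hypothetical, the JSON-LD context is not expanded (keys such as `produced_by` stand in for full predicate IRIs), and this is not how the LOD Gateway itself converts documents.

```python
def to_triples(node, triples=None):
    """Flatten a nested JSON-LD-style dict into (subject, predicate, object) tuples.

    Simplification: keys are used as predicates without expanding the @context.
    """
    if triples is None:
        triples = set()
    subject = node["id"]
    for key, value in node.items():
        if key in ("id", "@context"):
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # A nested node becomes an edge to its id, then is flattened too.
                triples.add((subject, key, item["id"]))
                to_triples(item, triples)
            else:
                triples.add((subject, key, item))
    return triples


painting = {
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "id": "object/1",
    "type": "HumanMadeObject",
    "identified_by": {"id": "object/1/name", "type": "Name", "content": "Irises"},
    "produced_by": {
        "id": "object/1/production",
        "carried_out_by": {"id": "person/1"},
    },
}

artist = {
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "id": "person/1",
    "type": "Person",
    "identified_by": {
        "id": "person/1/name",
        "type": "Name",
        "content": "Vincent Van Gogh",
    },
}

# Two documents, one graph: the shared id "person/1" is what joins them.
graph = to_triples(painting) | to_triples(artist)
for triple in sorted(graph):
    print(triple)
```

The document boundaries disappear in `graph`; only the shared identifier `person/1` records that the two documents were ever related.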
From the point of view of the data, these
two structures are equivalent—they contain
the same facts.
But from a usability perspective, they make
different things easy or hard.
13
Getty’s Linked Data: The LOD Gateway
{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "person/1",
  "type": "Person",
  "identified_by": {
    "id": "person/1/name",
    "type": "Name",
    "content": "Vincent Van Gogh"
  }
}

{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "object/1",
  "type": "HumanMadeObject",
  "identified_by": {
    "id": "object/1/name",
    "type": "Name",
    "content": "Irises"
  },
  "produced_by": {
    "id": "object/1/production",
    "carried_out_by": {"id": "person/1"}
  }
}
Documents are optimized for access:
they provide a specific set of data, bundled
together by the data creator, that provides all
the facts you need…given a specific context.
14
Documents: For Access and Discovery
{
  "@context": "https://linked.art/ns/v1/linked-art.json",
  "id": "object/1",
  "type": "HumanMadeObject",
  "identified_by": {
    "id": "object/1/name",
    "type": "Name",
    "content": "Irises"
  },
  "produced_by": {
    "id": "object/1/production",
    "carried_out_by": {"id": "person/1"}
  }
}
Graphs, alternatively, are optimized for querying:
they allow a user to define a specific context based
on novel criteria of interest, and return that
subset of facts.
15
Graphs: For Queries
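As a rough illustration of what "defining your own context" means, the Python sketch below answers a question that neither document answers on its own: who made the object named "Irises"? The triples are hand-written from the two example records, and `match` and `makers_of` are hypothetical helpers; a real triplestore would answer this with SPARQL rather than nested loops.

```python
# Hand-written triples for the two example records (context not expanded).
TRIPLES = {
    ("object/1", "type", "HumanMadeObject"),
    ("object/1", "identified_by", "object/1/name"),
    ("object/1/name", "type", "Name"),
    ("object/1/name", "content", "Irises"),
    ("object/1", "produced_by", "object/1/production"),
    ("object/1/production", "carried_out_by", "person/1"),
    ("person/1", "type", "Person"),
    ("person/1", "identified_by", "person/1/name"),
    ("person/1/name", "type", "Name"),
    ("person/1/name", "content", "Vincent Van Gogh"),
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def makers_of(title):
    """Walk name -> object -> production -> person -> person's name."""
    names = []
    for name_node, _, _ in match(TRIPLES, p="content", o=title):
        for obj, _, _ in match(TRIPLES, p="identified_by", o=name_node):
            for _, _, production in match(TRIPLES, s=obj, p="produced_by"):
                for _, _, person in match(TRIPLES, s=production, p="carried_out_by"):
                    for _, _, label in match(TRIPLES, s=person, p="identified_by"):
                        names += [t[2] for t in match(TRIPLES, s=label, p="content")]
    return names


print(makers_of("Irises"))
```

The query crosses the boundary between the object record and the person record, which is exactly the work a document API would otherwise leave to the caller.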
“What objects does Getty have that have images larger than
1200px on the longest side that have been exhibited in both New
York and Paris and were created by artists who lived before 1850?”
and
“What is the tombstone data about Irises?”
16
Imagine two Questions:
At the Getty, we have never asked:
“What objects does Getty have that have images larger than
1200px on the longest side that have been exhibited in both New
York and Paris and were created by artists who lived before 1850?”
…but we ask
“What is the tombstone data about Irises?”
several thousand times a day.
17
Imagine two Questions:
Having an interface for documents lets us
provide a simple, easily understandable
record that maps well to known contexts.
This is important, because people usually
expect these contexts. It makes answering
common questions simple.
18
Documents: For Access and Discovery
It also maps nicely to the sort of affordances
that work well on the internet—REST APIs,
cache control, JSON documents, webpages.
This is also important, because using these
well-known systems helps us make our
systems fast and easy to build.
19
Documents: For Access and Discovery
Research is different—each scholar brings
their own question and their own context.
Meeting their need means empowering
them to draw their own boundaries within
the data.
20
Graphs: For Asking Questions
Doing so is complex—it moves the burden of
defining the relevant context to the user of
the data, not the creator of the data.
But it makes asking new questions possible,
even if it might be inefficient or complicated.
21
Graphs: For Asking Questions
The LOD Gateway is a tool designed to
allow for both use cases.
It allows you to create, update, and delete
JSON-LD documents, and behind the
scenes it will keep a triplestore in sync with
those changes.
22
Meeting Both Needs
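The idea of "documents in front, triples kept in sync behind" can be sketched as a toy in-memory store in Python. Everything here is an assumption made for illustration: `SyncedStore` and `flatten` are hypothetical names, each document's triples are held in their own per-document graph so that an update or delete only touches that document's facts, and the real LOD Gateway talks to an external triplestore rather than a Python set.

```python
def flatten(node):
    """Turn a nested JSON-LD-style dict into a set of (s, p, o) tuples."""
    triples = set()
    subject = node["id"]
    for key, value in node.items():
        if key in ("id", "@context"):
            continue
        for item in (value if isinstance(value, list) else [value]):
            if isinstance(item, dict):
                triples.add((subject, key, item["id"]))
                triples |= flatten(item)
            else:
                triples.add((subject, key, item))
    return triples


class SyncedStore:
    """Toy document store that keeps one graph of triples per document."""

    def __init__(self):
        self.documents = {}  # id -> JSON document
        self.graphs = {}     # id -> triples derived from that document

    def upsert(self, doc):
        # Create or update: replace the document and its whole graph together.
        self.documents[doc["id"]] = doc
        self.graphs[doc["id"]] = flatten(doc)

    def delete(self, doc_id):
        self.documents.pop(doc_id, None)
        self.graphs.pop(doc_id, None)

    def triples(self):
        # The queryable view is the union of every per-document graph.
        return set().union(*self.graphs.values())


store = SyncedStore()
person = {
    "id": "person/1",
    "type": "Person",
    "identified_by": {"id": "person/1/name", "type": "Name", "content": "Van Gogh"},
}
store.upsert(person)
after_create = store.triples()

person["identified_by"]["content"] = "Vincent Van Gogh"
store.upsert(person)
after_update = store.triples()

store.delete("person/1")
after_delete = store.triples()
```

Replacing a document's whole graph on every write is the simplest way to guarantee that stale facts, such as the old name, never linger in the queryable view.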
This works for Linked Art records, IIIF
Manifests, Web Annotations: any JSON-LD
document.
If you POST it to the LOD Gateway, that
record will be available at the URL defined
in that document’s id property.
23
Meeting Both Needs
It’s also RDF-aware: if there are nested children included in the main document,
it automatically makes those dereferenceable, too.
24
LOD Gateway: RDF Aware
https://example.com/object/1
https://example.com/object/1/identifier/1
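One way to picture this is a Python function that walks a document and lists the nested nodes that could be served at their own URLs. This is a sketch under an assumption the slides do not confirm: a nested node counts as a child when its `id` starts with the parent document's `id` plus a slash. The name `child_ids` is hypothetical.

```python
def child_ids(doc):
    """List ids of nested nodes that live 'under' the document's own id.

    Assumption: a child is any nested node whose id begins with
    "<document id>/"; references to other records are ignored.
    """
    prefix = doc["id"].rstrip("/") + "/"
    found = set()

    def walk(value):
        if isinstance(value, dict):
            node_id = value.get("id")
            if isinstance(node_id, str) and node_id.startswith(prefix):
                found.add(node_id)
            for child in value.values():
                walk(child)
        elif isinstance(value, list):
            for child in value:
                walk(child)

    for value in doc.values():
        walk(value)
    return sorted(found)


painting = {
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "id": "object/1",
    "type": "HumanMadeObject",
    "identified_by": {"id": "object/1/name", "type": "Name", "content": "Irises"},
    "produced_by": {
        "id": "object/1/production",
        "carried_out_by": {"id": "person/1"},
    },
}

print(child_ids(painting))
```

Note that `person/1` is not listed: it is a reference to a separate record, not a child of `object/1`.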
You can also request documents in other RDF formats.
25
LOD Gateway: RDF Aware
https://example.com/object/1?format=turtle
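Put together, a record's address follows from its `id`, and an alternate serialization is one query parameter away. The Python sketch below only builds URLs and sends nothing; `record_url` is a hypothetical helper, `https://example.com/` is the placeholder host from the slides, and `turtle` is the only format value the slides show.

```python
from urllib.parse import urlencode, urljoin


def record_url(base_url, doc, fmt=None):
    """Resolve a document's relative id against the gateway's base URL.

    `fmt` is appended as the ?format= parameter shown on the slide.
    """
    url = urljoin(base_url, doc["id"])
    if fmt:
        url += "?" + urlencode({"format": fmt})
    return url


doc = {"id": "object/1", "type": "HumanMadeObject"}

print(record_url("https://example.com/", doc))
print(record_url("https://example.com/", doc, fmt="turtle"))
```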
It also provides both a full SPARQL API and
an embedded GUI for testing queries.
It can be configured to use any SPARQL
Triplestore—we use Fuseki in testing and
Amazon Neptune in production.
26
LOD Gateway: SPARQL-Enabled
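For a sense of what a query against that SPARQL API might look like, here is a Python sketch that builds a SPARQL 1.1 Protocol GET request URL without sending it. The endpoint path `https://example.com/sparql` is a hypothetical placeholder, not the gateway's documented route, and the query assumes the Linked Art context maps `produced_by` and `carried_out_by` to the CIDOC-CRM properties `P108i_was_produced_by` and `P14_carried_out_by`.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoint; check your own deployment for the real path.
SPARQL_ENDPOINT = "https://example.com/sparql"

# "Which objects were produced by person/1?"
QUERY = """
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
SELECT ?object WHERE {
  ?object crm:P108i_was_produced_by ?production .
  ?production crm:P14_carried_out_by <https://example.com/person/1> .
}
"""


def sparql_get_url(endpoint, query):
    """Encode a query as the `query` parameter of a SPARQL Protocol GET."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(SPARQL_ENDPOINT, QUERY)
# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(decoded == QUERY)
```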
This flexibility makes it simple to write and retrieve
data in a form that matches your primary use case,
but still allows you the flexibility to go beyond
that—either for research or for unexpected
features—without needing to rewrite your API.
27
Two Views: One Set of Facts
You can also configure it to run without the
RDF integration as a JSON document store.
We do this all the time, because of another
feature of the LOD Gateway: Change Logs!
28
LOD Gateway: now SPARQL-Free
The third critical use for our data is synchronization across systems.
An editor changes a record, which means the API needs to be updated, which means the website
needs to be updated, and the search interfaces, and third-party systems…
29
LOD Gateway: Tracking Changes
Every time you create, update, or delete a
record in the LOD Gateway, it adds an entry
to an Activity Stream.
This lets a consuming system identify
only the records that have been changed
since the last time it synced.
30
LOD Gateway: Tracking Changes
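On the consuming side, a sync pass might look roughly like the Python sketch below. The sample activities are invented for illustration and follow the general Activity Streams shape (a `type`, an `object` with an `id`, and an `endTime`); the helper names are hypothetical, and a real consumer would fetch and page through the stream over HTTP rather than use a hard-coded list.

```python
from datetime import datetime


def parse_time(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def changed_since(ordered_items, last_sync):
    """Map each record changed after `last_sync` to its most recent activity type.

    Assumes activities are listed oldest first, so later entries win.
    """
    latest = {}
    for activity in ordered_items:
        if parse_time(activity["endTime"]) > last_sync:
            latest[activity["object"]["id"]] = activity["type"]
    return latest


# Invented sample data in an Activity Streams-like shape.
page = [
    {"type": "Create", "endTime": "2023-08-30T10:00:00Z",
     "object": {"id": "https://example.com/object/1", "type": "HumanMadeObject"}},
    {"type": "Update", "endTime": "2023-09-02T09:30:00Z",
     "object": {"id": "https://example.com/object/1", "type": "HumanMadeObject"}},
    {"type": "Delete", "endTime": "2023-09-03T14:15:00Z",
     "object": {"id": "https://example.com/object/2", "type": "HumanMadeObject"}},
]

last_sync = parse_time("2023-09-01T00:00:00Z")
to_process = changed_since(page, last_sync)
print(to_process)
```

The consumer then re-fetches records marked `Update` or `Create` and removes those marked `Delete`, without touching anything else.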
You can do this for the whole dataset, for a
given entity type, or even for a single entity.
This happens automatically, every time you
update a record in the LOD Gateway.
It’s even smart enough to not generate a
change event if the data didn’t change.
31
LOD Gateway: Tracking Changes
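The slides do not say how the gateway decides that "the data didn't change". One plausible approach, sketched here in Python purely as an assumption, is to compare a checksum of a canonical serialization of each record before logging an event. `ChangeLog` and `checksum` are hypothetical names.

```python
import hashlib
import json


def checksum(doc):
    """Hash a canonical JSON form so key order and spacing do not matter."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ChangeLog:
    """Toy change log that skips writes which leave a record unchanged."""

    def __init__(self):
        self.checksums = {}  # record id -> checksum of the last saved version
        self.events = []     # (activity type, record id), oldest first

    def save(self, doc):
        digest = checksum(doc)
        previous = self.checksums.get(doc["id"])
        if previous == digest:
            return False  # identical content: no event generated
        self.checksums[doc["id"]] = digest
        self.events.append(("Create" if previous is None else "Update", doc["id"]))
        return True


log = ChangeLog()
log.save({"id": "object/1", "type": "HumanMadeObject", "label": "Irises"})
# Same data with keys in a different order: treated as unchanged.
log.save({"label": "Irises", "type": "HumanMadeObject", "id": "object/1"})
log.save({"id": "object/1", "type": "HumanMadeObject", "label": "Irises (1889)"})
print(log.events)
```

Skipping no-op writes matters for downstream systems: a nightly full re-export would otherwise look like every record changed every night.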
These change logs follow the
W3C Activity Streams standard and are
implemented using the patterns from the
IIIF Change Discovery API.
Using standards makes it easy for external
consumers to build integrations against
these flows.
32
LOD Gateway: ActivityStreams and Standards
The change log only describes which
records changed. But for some kinds of
data, it's valuable to also be able to see what
has changed over time for a given record.
To do so, the LOD Gateway also supports
Memento, the standard underneath the
Internet Archive’s Wayback Machine.
33
LOD Gateway: ActivityStreams and Standards
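In Memento, a client asks for a past version by sending an `Accept-Datetime` header in HTTP date format. The Python sketch below builds such a request without sending it. It assumes, for illustration only, that the record URL itself accepts the header; which URL the gateway exposes for this purpose is not stated in the slides, and `memento_request` is a hypothetical helper.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def memento_request(url, when):
    """Build a GET request asking for the version of `url` as of `when`.

    `when` must be a timezone-aware datetime; it is converted to UTC and
    formatted as an HTTP date, as the Accept-Datetime header requires.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(url, headers={"Accept-Datetime": http_date})


request = memento_request(
    "https://example.com/object/1",
    datetime(2023, 1, 15, 12, 0, tzinfo=timezone.utc),
)
# urllib normalizes header names, so look it up as "Accept-datetime".
print(request.get_header("Accept-datetime"))
```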
This feature lets you automatically open older
versions of the record—providing an audit log
and the ability for scholars to understand
how knowledge changes over time.
34
LOD Gateway: ActivityStreams and Standards
How do we use this?
How can you use this?
35
36
Getty’s Data Infrastructure: Managing Complexity
37
Getty’s Data Infrastructure: 14 Instances
One tool, many needs.
Building this tool has let a small team support 14 different APIs, and put
new ones in place whenever we need them.
Our smallest instance is 250 records. Our largest is over 1 million.
38
LOD Gateway: Consistent Patterns, Consistent Tools
Critical Infrastructure.
The only way we’ve built what we have is using this tool.
Every research tool, every API.
39
LOD Gateway: Consistent Patterns, Consistent Tools
And now you can, too.
As of today, we’ve released this tool as open source software under the BSD 3-Clause
license.
https://github.com/thegetty/lod-gateway
40
LOD Gateway: Consistent Patterns, Consistent Tools
This is a “Third System”:
This is heavily tested infrastructure, built because we have made so many
mistakes.
It’s not perfect, but our hope is that it helps you avoid at least the mistakes we know
about, and allows the brilliant modeling ecosystem CIDOC builds to be used in
production by others around the world.
41
LOD Gateway: Built on top of our mistakes
Thank you!
Find me or ask me questions at:
dnewbury@getty.edu
42
