Presentation given at EuropeanaTech 2018 in Rotterdam, The Netherlands. Provides a summary of insights gained from working for about a decade on challenges related to temporal aspects of the web and to persistence.
International Image Interoperability Framework (IIIF): Sharing high-resolution images - LIBIS
On Monday, April 23rd, 2018, Roxanne Wyns (LIBIS - KU Leuven Libraries) gave a lecture at the University of Antwerp for Digital Humanities students and researchers. IIIF, or the International Image Interoperability Framework, is a community-developed framework for sharing high-resolution images in an efficient and standardized way across institutional boundaries. Using an IIIF manifest URL, a researcher can pull image-based resources and related contextual information, such as the structure of a complex object or document, metadata, and rights information, into any IIIF-compliant viewer such as the Mirador viewer. Simply put, a researcher can access high-resolution images from the British Library and from the KU Leuven Libraries in a single viewer for research. The lecture introduces IIIF and its concepts, highlights projects and viewers, and gives an in-depth view of its current and future application options for DH research.
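As a sketch of what "pulling image-based resources via a manifest URL" involves in practice, the following walks the structure of a IIIF Presentation API 2.x manifest (sequences, canvases, images) and collects the image URLs a viewer would load. The manifest shown is a hypothetical minimal example, not a real British Library or KU Leuven document.

```python
def canvas_image_urls(manifest):
    """Collect image resource URLs from a IIIF Presentation 2.x
    manifest dict: sequences -> canvases -> images -> resource."""
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                resource = image.get("resource", {})
                if "@id" in resource:
                    urls.append(resource["@id"])
    return urls

# Hypothetical minimal manifest for illustration.
manifest = {
    "@id": "https://example.org/iiif/book1/manifest",
    "label": "Example manuscript",
    "sequences": [{
        "canvases": [{
            "label": "f. 1r",
            "images": [{"resource": {
                "@id": "https://example.org/iiif/book1/f1r/full/full/0/default.jpg"
            }}]
        }]
    }]
}

urls = canvas_image_urls(manifest)
```

A IIIF-compliant viewer performs essentially this traversal, then requests each image (or tiles of it) from the provider's IIIF Image API endpoint.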
Presentation during the 2016 American Library Association (ALA) Annual Conference in Orlando (Florida), given at the ALCTS Program "Linked Data - Globally Connecting Libraries, Archives, and Museums", Sponsor: ALCTS International Relations Committee, Co-Sponsor: Linked Library Data Interest Group
A Framework for Aggregating Private and Public Web Archives (JCDL 2018)
Mat Kelly, Michael L. Nelson, and Michele C. Weigle
Old Dominion University
Web Science & Digital Libraries Research Group {mkelly, mln, mweigle}@cs.odu.edu @machawk1 • @WebSciDL
#jcdl2018
Early Chinese Periodicals Online (ECPO): From Digitization Towards Open Data - Matthias Arnold
This paper presents the project “Early Chinese Periodicals Online (ECPO)”. It introduces the database and discusses two major directions of current development: 1) the installation of a cross-database agents service to identify names, assign names to persons, and relate persons to authorities (GND, VIAF, Wikidata); 2) the conceptualization of a TEI module to expand the database with full-text functionality, thereby touching on issues such as semi-automatic page segmentation, the involvement of non-Chinese-speaking communities in crowdsourcing, and the selection of relevant TEI markup to encode Republican-era publications.
Using Knowledge Graphs in Data Science: From Symbolic to Latent Representations - Heiko Paulheim
Knowledge Graphs are often used as a symbolic representation mechanism for representing knowledge in data-intensive applications, both for integrating corporate knowledge and for providing general, cross-domain knowledge in public knowledge graphs such as Wikidata. As such, they have been identified as a useful way of injecting background knowledge into data analysis processes. To fully harness the potential of knowledge graphs, latent representations of entities in the graphs, so-called knowledge graph embeddings, show superior performance, but sacrifice one central advantage of knowledge graphs: the explicit symbolic knowledge representation. In this talk, I will shed some light on the usage of knowledge graphs and embeddings in data analysis, and give an outlook on research directions which aim at combining the best of both worlds.
Using knowledge graphs in data mining typically requires a propositional, i.e., vector-shaped representation of entities. RDF2vec is an example for generating such vectors from knowledge graphs, relying on random walks for extracting pseudo-sentences from a graph, and utilizing word2vec for creating embedding vectors from those pseudo-sentences. In this talk, I will give insights into the idea of RDF2vec, possible application areas, and recently developed variants incorporating different walk strategies and training variations.
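The walk-extraction step described above can be sketched as follows; the graph and entity names below are a toy example, and the subsequent word2vec training over the pseudo-sentences is omitted.

```python
import random

def random_walks(graph, entity, n_walks=4, depth=2, seed=0):
    """Extract RDF2vec-style pseudo-sentences from a knowledge graph
    given as an adjacency list {subject: [(predicate, object), ...]}.
    Each walk alternates entity and predicate tokens; the resulting
    token lists would then be fed to word2vec to learn embeddings."""
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walks):
        walk, node = [entity], entity
        for _ in range(depth):
            edges = graph.get(node)
            if not edges:
                break  # dead end: stop this walk early
            predicate, obj = rng.choice(edges)
            walk += [predicate, obj]
            node = obj
        walks.append(walk)
    return walks

# Toy graph with DBpedia-flavored names, for illustration only.
kg = {
    "dbr:Mannheim": [("dbo:country", "dbr:Germany"),
                     ("dbo:federalState", "dbr:Baden-Wuerttemberg")],
    "dbr:Germany": [("dbo:capital", "dbr:Berlin")],
}
sentences = random_walks(kg, "dbr:Mannheim")
```

Each pseudo-sentence starts at the focus entity; different walk strategies (the variants mentioned above) change how the next edge is chosen.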
This presentation shows approaches for knowledge graph construction from Wikipedia and other Wikis that go beyond the "one entity per page" paradigm. We see CaLiGraph, which extracts entities from categories and listings, as well as DBkWik, which extracts and integrates information from thousands of Wikis.
Machine Learning with and for Semantic Web Knowledge Graphs - Heiko Paulheim
Large-scale cross-domain knowledge graphs, such as DBpedia or Wikidata, are some of the most popular and widely used datasets of the Semantic Web. In this paper, we introduce some of the most popular knowledge graphs on the Semantic Web. We discuss how machine learning is used to improve those knowledge graphs, and how they can be exploited as background knowledge in popular machine learning tasks, such as recommender systems.
How are Knowledge Graphs created?
What is inside public Knowledge Graphs?
Addressing typical problems in Knowledge Graphs (errors, incompleteness)
New Knowledge Graphs: WebIsALOD, DBkWik
Registration / Certification Interoperability Architecture (Overlay Peer Review) - Herbert Van de Sompel
Presentation for the COAR meeting on Overlay Peer Review held at INRIA, Paris, France. It provides overall context regarding a scholarly communication system in which the core functions of scholarly communication (registration, certification, awareness, archiving) are implemented in a decoupled manner, whereby each function can simultaneously be fulfilled by different parties, potentially in different ways. It shows how notifications can be used to achieve loosely coupled, point-to-point interoperability in such an environment, zooming in on interoperability between registration and certification, i.e., between repositories and overlay peer-review services.
Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems - Heiko Paulheim
AI is not just about machine learning; it also requires knowledge about the world. In this talk, I give an introduction to knowledge graphs, how they are built at scale, and how they are used in modern AI systems.
The original Semantic Web vision foresees describing entities in a way whose meaning can be interpreted by both machines and humans. Following that idea, large-scale knowledge graphs capturing a significant portion of knowledge have been developed. In the recent past, vector space embeddings of Semantic Web knowledge graphs - i.e., projections of a knowledge graph into a lower-dimensional, numerical feature space (a.k.a. latent feature space) - have been shown to yield superior performance in many tasks, including relation prediction, recommender systems, and the enrichment of predictive data mining tasks. At the same time, those projections describe an entity as a numerical vector, without any semantics attached to the dimensions. Thus, embeddings are as far from the original Semantic Web vision as can be. As a consequence, the results achieved with embeddings - as impressive as they are in terms of quantitative performance - are most often not interpretable, and it is hard to obtain a justification for a prediction, e.g., an explanation of why an item has been suggested by a recommender system. In this paper, we make a claim for semantic embeddings and discuss possible ideas towards their construction.
LOCAH Project and Considerations of Linked Data Approaches - Adrian Stevenson
Presentation given at JISC 'Managing Research Data International Workshop', Birmingham, UK. 29th March 2011
http://www.jisc.ac.uk/whatwedo/programmes/mrd/rdmevents/mrdinternationalworkshop.aspx
Mining the Web of Linked Data with RapidMiner - Heiko Paulheim
Lots of data from different domains is published as Linked Open Data. While there are quite a few browsers for that data, as well as intelligent tools for particular purposes, a versatile tool for deriving additional knowledge by mining the Web of Linked Data is still missing. In this challenge entry, we introduce the RapidMiner Linked Open Data extension. The extension hooks into the powerful data mining platform RapidMiner and offers operators for accessing Linked Open Data in RapidMiner, allowing it to be used in sophisticated data analysis workflows without the need to know SPARQL or RDF. As an example, we show how statistical data on scientific publications, published as an RDF data cube, can be linked to further datasets and analyzed using additional background knowledge from various LOD datasets.
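The core move the extension automates - attaching background knowledge from LOD datasets to the entities in a statistical table so that the result is a flat, mineable table - can be sketched as below. The data and feature names are toy placeholders, not the extension's actual API.

```python
def propositionalize(observations, background):
    """Join statistical observations (entity URI -> observed value)
    with background features looked up per URI in a knowledge source
    (URI -> feature dict), yielding flat rows usable by any standard
    data mining tool."""
    rows = []
    for uri, value in observations.items():
        row = {"uri": uri, "value": value}
        # Entities missing from the background source keep only
        # their observed value.
        row.update(background.get(uri, {}))
        rows.append(row)
    return rows

# Toy data cube observations and toy background knowledge.
rows = propositionalize(
    {"dbr:Germany": 120.5},
    {"dbr:Germany": {"population": 83000000, "gdp_rank": 4}})
```

In the real extension, the background dict would be filled by dereferencing the entity URIs on the Web of Linked Data; here it is supplied inline to keep the sketch self-contained.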
Keynote talk presented at Web Archiving and Digital Libraries (WADL) 2018
June 6, 2018 - Fort Worth, TX
Michele C. Weigle (@weiglemc)
Web Science and Digital Libraries (WS-DL) Research Group (@WebSciDL)
Old Dominion University
Norfolk, VA
This presentation looks back at several efforts, conducted in the past fifteen years, aimed at establishing interoperability for web-based scholarly communication. It tries to characterize the perspectives/approaches taken by these efforts and, based upon that, proposes a HATEOAS-based approach to interlink scholarly nodes on the web. This was first presented at the Research Data Alliance meeting in Paris, France, on September 22, 2015.
Researcher Pod: Scholarly Communication Using the Decentralized Web - Herbert Van de Sompel
The presentation provides an overview of the motivation and direction of the Mellon-funded Researcher Pod project that investigates technical aspects of scholarly communication in a decentralized web setting.
This slide deck provides an overview of proposals to use HTTP Links as a means to address some long standing problems related to scholarly resources on the web.
These slides go with the paper "Reminiscing About 15 Years of Interoperability Efforts" which is available at http://dx.doi.org/10.1045/november2015-vandesompel
Slides were used for a presentation at the Fall 2015 Membership Meeting of the Coalition for Networked Information.
Automated interpretability of linked data ontologies: an evaluation within the... - Nuno Freire
Publication and usage of linked data has been highly pursued by cultural heritage institutions and service providers in this domain. Much research and cooperation is taking place in adapting and improving cultural heritage data models for linked data, in defining ontologies and vocabularies, and in setting up services based on linked data. This article presents an evaluation of ontologies and vocabularies published as linked data that originate from the cultural heritage domain or are frequently used and linked to in this domain. Our study aims to evaluate their usability by crawlers operating on the web of data, according to specifications and practices of linked data, the Semantic Web, and ontology reasoning. Our evaluation is guided by the use case of general data-consumption applications based on RDF, RDF Schema, OWL, SKOS, and linked data guidelines. We evaluated twelve ontologies and vocabularies and identified that four were not fully compliant and that alignments between ontologies are not included in the ontologies' definitions. This study contributes to research on novel services consuming linked data. It also allows a better assessment of the automation that can be achieved in handling the variety and large volume of linked data, when assessing the viability of new linked-data-based services in cultural heritage.
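One kind of compliance check that a crawler can automate, in the spirit of the evaluation above, is flagging terms that a vocabulary uses but never declares. A minimal sketch, with prefixed names and toy triples invented for illustration:

```python
RDF_TYPE = "rdf:type"
# Types whose instances count as declarations of classes/properties.
DECLARING = {"rdfs:Class", "owl:Class", "rdf:Property",
             "owl:ObjectProperty", "owl:DatatypeProperty"}

def undeclared_terms(triples):
    """triples: iterable of (subject, predicate, object) tuples using
    prefixed names. Returns properties used in the data, and classes
    used as rdf:type objects, that the graph never declares."""
    declared = {s for s, p, o in triples
                if p == RDF_TYPE and o in DECLARING}
    used = set()
    for s, p, o in triples:
        if p != RDF_TYPE:
            used.add(p)       # property in use
        else:
            used.add(o)       # class in use
    return used - declared - DECLARING

# Toy graph: ex:Book is declared, ex:title is used but undeclared.
triples = [
    ("ex:Book", "rdf:type", "owl:Class"),
    ("ex:Book1", "rdf:type", "ex:Book"),
    ("ex:Book1", "ex:title", "A title"),
]
missing = undeclared_terms(triples)
```

A real crawler would of course operate on dereferenced RDF rather than inline tuples, and would apply further checks (e.g., for alignments between ontologies).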
A 4 hour hands on linked data workshop held at ELAG 2013 - http://elag2013.org/ws2-very-gentle-linked-data/. Resources at http://data.archiveshub.ac.uk/workshops/elag2013/
Linked Statistical Data: does it actually pay off? - Oscar Corcho
Invited keynote at the ISWC2015 Workshop on Semantics and Statistics (SemStats 2015). http://semstats.github.io/2015/
The release of the W3C RDF Data Cube recommendation was a significant milestone towards improving the maturity of the area of Linked Statistical Data. Many Data Cube-based datasets have been released since then, and tools for the generation and exploitation of such datasets have also appeared. While the benefits of using RDF Data Cube and generating Linked Data in this area seem clear, there are still many challenges associated with the generation and exploitation of such data. In this talk we will reflect on them, based on our experience generating and exploiting this type of data, and hopefully provoke some discussion about what the next steps should be.
How can design help us communicate data easily to users? Where does this stem from? What methods of design are easy for users to engage with? What should we be trying to achieve with these designs?
The cultural sector is a big adopter of open data and semantic web technologies; institutions have embraced the ideas and are weaving them into everything they do. So, who is doing what? What data sets are available? And how have these been presented to the public?
Using case studies from the cultural sector, we will explore the practical challenges associated with complex UI designs. Looking at work-in-progress through to finished products we will discuss best practice, finding innovation, and the challenges of working with data sets.
Evaluation of Schema.org for Aggregation of Cultural Heritage Metadata - Nuno Freire
In the World Wide Web, a very large number of resources are made available through digital libraries. The existence of many individual digital libraries, maintained by different organizations, brings challenges to the discoverability, sharing, and reuse of the resources. A widely-used approach is metadata aggregation, where centralized efforts like Europeana facilitate the discoverability and use of the resources by collecting their associated metadata. The cultural heritage domain embraced the aggregation approach while, at the same time, the technological landscape kept evolving. Nowadays, cultural heritage institutions are increasingly applying technologies designed for wider interoperability on the Web. In this context, we have identified the Schema.org vocabulary as a potential technology for innovating metadata aggregation. We conducted two case studies that analysed Schema.org metadata from collections from cultural heritage institutions. We used the requirements of the Europeana Network as evaluation criteria. These include the recommendations of the Europeana Data Model, which is a collaborative effort from all the domains represented in Europeana: libraries, museums, archives, and galleries. We concluded that Schema.org poses no obstacle that cannot be overcome to allow data providers to deliver metadata in full compliance with Europeana requirements and with the desired semantic quality. However, Schema.org's cross-domain applicability raises the need to accompany its adoption with recommendations and/or specifications regarding how data providers should create their Schema.org metadata, so that they can meet the specific requirements of Europeana or other cultural aggregation networks.
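On the harvesting side, an aggregator's first step is pulling Schema.org metadata out of provider pages, commonly embedded as JSON-LD. A minimal sketch using only the standard library; the sample page is hypothetical.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the JSON-LD blocks embedded in an HTML page via
    <script type="application/ld+json"> elements."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.items.append(json.loads(data))

# Hypothetical provider page for illustration.
page = ('<html><head><script type="application/ld+json">'
        '{"@type": "Book", "name": "An example record"}'
        '</script></head></html>')
extractor = JsonLdExtractor()
extractor.feed(page)
```

Evaluating the harvested items against Europeana Data Model requirements would then operate on `extractor.items`.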
Presentation about reference rot given at the Complexity Science Hub in Vienna, November 2021.
Links to web resources frequently break (link rot), and linked content can change at unpredictable rates (content drift). These dynamics of the Web are detrimental when references to web resources provide evidence or supporting information.
This presentation will report on research that assessed the extent of these problems for links to web resources in scholarly literature, by using three vast corpora of publications and a range of public web archives. It will also describe the Robust Link approach that offers a proactive, uniform, and machine-actionable way to combat link rot and content drift. Finally, it will introduce the Robustify web service and API that was devised to generate links that remain functional over time, paying special attention to challenges related to deploying infrastructure that is required to be long lasting.
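The Robust Link approach decorates an ordinary HTML anchor with machine-actionable attributes recording the originally referenced URI and the snapshot datetime. A sketch of producing one (the URLs are placeholders; the specification also allows the inverse arrangement, with the original URI in `href` and the memento in a `data-versionurl` attribute):

```python
def robust_link(href, original_url, version_date, text):
    """Build an HTML anchor carrying Robust Links data- attributes:
    data-originalurl preserves the originally referenced URI and
    data-versiondate the datetime the snapshot was taken, so the
    link stays actionable even if the live page rots or drifts."""
    return (f'<a href="{href}" '
            f'data-originalurl="{original_url}" '
            f'data-versiondate="{version_date}">{text}</a>')

link = robust_link(
    "https://web.archive.org/web/20211101000000/https://example.org/page",
    "https://example.org/page",
    "2021-11-01",
    "an example page")
```

A service like the Robustify API mentioned above automates exactly this decoration, including creating the archival snapshot.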
Presentation for a workshop about persistent identifiers organized by the Royal Library of The Netherlands and DANS. Highlights the non-trivial commitments required of all parties involved in persistent identifier systems to actually keep links based on persistent identifiers ... err ... persistent.
Various FAIR criteria pertaining to machine interaction with scholarly artifacts can commonly be addressed by means of repository-wide affordances that are uniformly provided for all hosted artifacts rather than through artifact-specific interventions. If various repository platforms provide such affordances in an interoperable manner, devising tools - for both human and machine use - that leverage them becomes easier.
My involvement, over the years, in a range of interoperability efforts has brought the insight that two factors strongly influence adoption: addressing a burning issue and delivering a KISS solution to tackle it. Undoubtedly, FAIR and FAIR DOs are burning issues. FAIR Signposting <https://signposting.org/FAIR/> is an ad-hoc repository interoperability effort that squarely fits in this problem space and that purposely specifies a KISS solution, hoping to inspire wide adoption.
Slides used for a keynote presentation at the VIVO 2019 Conference in Podgorica, Montenegro.
Abstract: The invitation to present a keynote at the VIVO Conference and the goal of the VIVO platform, as stated on the DuraSpace site, to create an integrated record of the scholarly work of an organisation reminded me of various efforts that I have been involved in over the past years that had similar goals. EgoSystem (2014) attempted to gather information about postdocs that had left the organisation, leaving little or no contact details behind. Autoload (2017), an operational service, discovers papers by organisational researchers in order to upload them in the institutional repository. myresearch.institute (2018), an experiment that is still in progress, discovers artefacts that researchers deposit in web productivity portals and subsequently archives them. More recently, I have been involved in thinking about the future of NARCIS, a portal that provides an overview of research productivity in The Netherlands. The approach taken in all these efforts share a characteristic motivated by a desire to devise scalable and sustainable solutions: let machines rather than humans do the work. In this talk, I will provide an overview of these efforts, their motivations, the challenges involved, and the nature of success (if any).
Presentation for PIDapalooza 2019, Dublin, Ireland.
The Scholarly Orphans project, funded by the Andrew W. Mellon Foundation, explores technical approaches aimed at capturing and archiving scholarly artifacts that researchers deposit in web productivity portals as a means to collaborate and communicate with their peers. These artifacts are not collected by other frameworks aimed at archiving the scholarly record (e.g., LOCKSS, Portico, Institutional Repositories) and are only incidentally captured by web archives. The project explores an institution-driven approach inspired by web archiving. To demonstrate the ongoing thinking, the project has devised an experimental automated pipeline that continuously discovers, captures, and archives artifacts. These are created by actual researchers who, for the purpose of the experiment, were virtually enlisted in a fictive research institution. A portal at myresearch.institute provides an overview of the artifacts that were discovered and provides access to archived versions stored in both an institutional and a cross-institutional archive. The set-up leverages a range of technologies that share a flavor of persistence: Memento, Memento Tracer, Robust Links, Signposting.
As a memento of my last week of working at LANL, I put together a slide deck that provides an overview of major efforts conducted during the time I was there.
"Scholarly Communication: Deconstruct and Decentralize" was presented at the Fall 2017 Meeting of the Coalition for Networked Information. It explores working towards a Scholarly Commons by applying decentralized web ideas to scholarly communication.
Looks at hyperlinks from the perspective of a managed collection of resources for which link persistence/integrity is considered a quality of service concern. Distinguishes between links into other managed collections and to the web at large. Considers link rot and content drift.
Presentation for PIDapalooza 2016. PIDs need to be used to achieve their intended persistence. Our research (reported at WWW2016, see http://arxiv.org/1602.09102) found that a disturbing percentage of references to papers that have DOIs actually use the landing page HTTP URI instead of the DOI HTTP URI. The problem is likely related to tools used for collecting references such as bookmarks and reference managers. These select the landing page URI instead of the DOI URI because the former is what's available in the address bar. It can safely be assumed that the same problem exists for other types of PIDs. The net result is that the true potential of PIDs is not realized. In order to ameliorate this problem we propose a Signposting pattern for PIDs (http://signposting.org/identifier/). It consists of adding a Link header to HTTP HEAD/GET responses for all resources identified by a DOI, including the landing page and content resources such as "the PDF" and "the dataset". The Link header contains a link, which points with the "identifier" relation type to the DOI HTTP URI. When such a link is available, tools can automatically discover and use the DOI URI instead of the other URIs (landing page, PDF, dataset) associated with the DOI-identified object.
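On the consumer side, the pattern amounts to parsing the Link header of a response and picking out the target of the "identifier" relation. A sketch; the header string is a made-up example, and the parser assumes the common '<uri>; rel="..."' serialization rather than handling every corner of the header grammar.

```python
import re

def identifier_target(link_header):
    """Return the target URI of the link with rel="identifier" in an
    HTTP Link header value, or None if no such link is present."""
    for match in re.finditer(r'<([^>]+)>\s*;\s*rel="([^"]*)"', link_header):
        # rel is a space-separated list of relation types.
        if "identifier" in match.group(2).split():
            return match.group(1)
    return None

# Made-up Link header as a landing page or PDF might serve it.
header = ('<https://example.org/articles/123.pdf>; rel="item", '
          '<https://doi.org/10.9999/example>; rel="identifier"')
doi_uri = identifier_target(header)
```

A reference manager that runs this check on the page in the address bar can record the DOI HTTP URI instead of the landing page URI.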
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT - Herbert Van de Sompel
DBpedia is the Linked Data version of Wikipedia. Starting in 2007, several DBpedia dumps have been made available for download. In 2010, the Research Library at the Los Alamos National Laboratory used these dumps to deploy a Memento-compliant DBpedia Archive, in order to demonstrate the applicability and appeal of accessing temporal versions of Linked Data sets using the Memento “Time Travel for the Web” protocol. The archive supported datetime negotiation to access various temporal versions of RDF descriptions of DBpedia subject URIs.
In a recent collaboration with the iMinds Group of Ghent University, the DBpedia Archive received a major overhaul. The initial MongoDB storage approach, which was unable to handle increasingly large DBpedia dumps, was replaced by HDT, the Binary RDF Representation for Publication and Exchange. And, in addition to the existing subject URI access point, Triple Pattern Fragments access, as proposed by the Linked Data Fragments project, was added. This allows datetime negotiation for URIs that identify RDF triples that match subject/predicate/object patterns. To add this powerful capability, native Memento support was added to the Linked Data Fragments Server of Ghent University.
In this talk, we will include a brief refresher of Memento, and will cover Linked Data Fragments, Triple Pattern Fragments, and HDT in more detail. We will share lessons learned from this effort and demo the new DBpedia Archive, which, at this point, holds over 5 billion RDF triples.
Extended version of slides presented at the "404/File Not Found" symposium held at Georgetown University on October 24 2014, see http://www.law.georgetown.edu/library/404/ . The presentation provides a brief overview of the link/reference rot problem and then discusses three complimentary strategies to combat it: Pro-actively capturing web resources that are linked from a seed collection; Referencing the captures by means of annotated links; Accessing the captures using Memento infrastructure.
This presentation introduces ResourceSync, a specification aimed to enable web-based synchronization of resources. The specification is the result of a collaboration between NISO and the Open Archives Initiative funded by the Sloan Foundation and JISC. The proposed resource synchronization approach is based on several existing specifications (e.g. Sitemaps, PubSubHubbub, well-known URI) and is aligned with common architectural principles (e.g. REST, follow your nose).
A 15 minute video version of these slides is available at https://www.youtube.com/watch?v=ASQ4jMYytsA
This presentation provides an overview of the Memento "Time Travel for the Web" framework that is aligned with the stable version of the Memento protocol, specified in RFC 7089.
As the scholarly communication system evolves to become natively web-based and starts supporting the communication of a wide variety of objects, the manner in which its essential functions – registration, certification, awareness, archiving - are fulfilled co-evolves. This presentation focuses on the nature of the archival function based on a perspective of the future scholarly communication infrastructure. This presentation, prepared for a meeting in June 2014, is based on and updates a previous one that was prepared for a January 2014 meeting. The latter is available at http://www.slideshare.net/atreloar/scholarly-archiveofthefuture
The slides were used to accompany an overview of the outcomes of the ResourceSync project at the 2014 Spring Membership Meeting of the Coalition for Networked Information (CNI).
The launch of ResourceSync, a joint project of the National Information Standards Organization (NISO) and the Open Archives Initiative (OAI) funded by the Alfred P. Sloan Foundation, was motivated by the ubiquitous need to synchronize resources for applications in the realm of cultural heritage and research communication. After an initial problem definition and scoping phase, the project has designed, specified, and tested a framework for web-based synchronization that is based on SiteMaps, a protocol widely used by web servers to advertise the resources they make available to search engines for indexing. This choice allows repositories to address both search engine optimization and resource synchronization needs using the same technology.
The ResourceSync framework specifies various modular capabilities that a repository can support in order to allow third party systems to remain synchronized with its evolving resources. For example, a Resource List provides an inventory of resources whereas a Change List details resources that were created, deleted or updated during a given temporal interval. Support for capabilities can be combined in order to meet local or community requirements. The framework specifies capabilities that require a third party to recurrently poll for up-to-date information about a repositories’ resources but also publish/subscribe capabilities that keep third parties informed about changes through notifications, thereby significantly reducing synchronization latency.
Persistent Identifiers and the Web: The Need for an Unambiguous MappingHerbert Van de Sompel
Presentation given at the International Digital Curation Conference in San Francisco, February 26 2014. Highlights the lack of machine-actionability of persistent identifiers assigned to scholarly communication assets. Proposes an approach to address the issue that meets requirements that take into account the changing nature of web based research communication. A draft paper provides more details: http://public.lanl.gov/herbertv/papers/Papers/2014/IDCC2014_vandesompel.pdf
Slides used for a presentation at the CNI 2013 Fall meeting. Discusses the problem domain of the Hiberlink project, a collaboration between the Los Alamos National Laboratory and the University of Edinburgh, funded by the Andrew W. Mellon Foundation. Hiberlink investigates reference rot in web-based scholarly communication.
Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.
Presentation given at the EMTACL12 conference in Trondheim, Norway, on October 1 2012. Discusses the evolution towards a highly dynamic scholarly record (assets don't have the sense of fixity they used to have; assets are highly interdependent) and how the archiving infrastructure used for scholarly communication can not adequately deal with this dynamism.
1. Herbert Van de Sompel @hvdsomp
EuropeanaTech 2018, Rotterdam, The Netherlands, 15/05/18
Herbert Van de Sompel
Los Alamos National Laboratory
@hvdsomp
Perseverance on Persistence
a future-note about the past
2.
OAI-ORE
3.
2006
• OAI-ORE observation: Scholarly assets are rapidly becoming compound, consisting of multiple resources with various:
  • Relationships
  • Interdependencies
• How to convey this compound-ness in an interoperable manner so that applications can access and consume such assets?
http://www.openarchives.org/ore/1.0/toc
4.
ORE Insight 1 – Web-Centric Interoperability Paradigm
Address interoperability challenges from the perspective of the web:
• The resource at the center of the universe
• The notion of a repository (or even of a web server) does not exist in the architecture of the web
• Neither does the notion of a Digital Object
• The tools of the interoperability trade are the primitives of the web
5.
Tools of the Web-Centric Interoperability Trade
• Resource
• URI
• HTTP as the API: HEAD/GET, POST, PUT, DELETE
• Representation
• Media Type
• Link
• Content Negotiation
• Typed Link
• Controlled Vocabularies for Typed Links
W3C Architecture of the World Wide Web
RDF, RDFS, OWL
6.
7.
OAI-ORE in EDM
Europeana v1.0 2009
8.
ORE Insight 2 – How to Access Temporal State of an Aggregation
The web-centric ORE approach allowed using off-the-shelf web tools to archive evolving compound objects:
• Evolving versions of Resource Maps and Aggregated Resources were captured in a web archive
• But how to use the URI of the Aggregation or Resource Map to see the status of an Aggregation at a specific moment in the past?
H. Van de Sompel (2007) Compound Information Object Prototype Demonstration. https://www.dropbox.com/s/dd7xd427y90q4jx/CT_Watch_hvds_20070703.mov?dl=0
9.
H. Van de Sompel, M. L. Nelson, R. Sanderson (2013) RFC 7089 - HTTP Framework for Time-Based Access to Resource States – Memento. https://tools.ietf.org/html/rfc7089
Memento
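The protocol's core mechanic is datetime negotiation: a client sends an Accept-Datetime header and follows "original"/"timegate"/"memento" typed links in the Link response header. A minimal stdlib-only sketch of both sides of that exchange; the URIs and the sample Link header below are illustrative, not a live response:

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime_header(dt):
    """Build the Accept-Datetime request header (RFC 7089) for a past moment."""
    return {"Accept-Datetime": format_datetime(dt, usegmt=True)}

def parse_link_header(value):
    """Parse an HTTP Link header into (target URI, rel) pairs."""
    links = []
    for target, params in re.findall(r'<([^>]*)>([^,]*)', value):
        m = re.search(r'rel="([^"]*)"', params)
        if m:
            links.append((target, m.group(1)))
    return links

# Headers a Memento client would send to an Original Resource or TimeGate:
hdrs = accept_datetime_header(datetime(2009, 5, 8, tzinfo=timezone.utc))

# Illustrative Link header, shaped like a TimeGate response:
link = ('<http://example.org/page>; rel="original", '
        '<http://archive.example/web/20090508/http://example.org/page>; '
        'rel="memento"; datetime="Fri, 08 May 2009 00:00:00 GMT"')
for uri, rel in parse_link_header(link):
    print(rel, uri)
```

The client picks the "memento" link closest to the requested datetime and dereferences it; no archive-specific URL conventions are needed.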
10.
Tools of the Web-Centric Interoperability Trade – HTTP Stack
• Resource
• URI
• HTTP as the API
• Representation
• Media Types
• Link
• Content Negotiation, e.g. for preferred Media Type
• Typed Link
• Controlled Vocabularies for Typed Links
W3C Architecture of the World Wide Web
HTTP Links, IANA link relation registry, community link relation types
HATEOAS – Hypermedia As The Engine Of Application State
http://en.wikipedia.org/wiki/HATEOAS
11.
Original Resource and Mementos
12.
Bridge from Present to Past
13.
Bridge from Present to Past
14.
Bridge from Past to Present
15.
timegate Link: Link to Your Own History
Can link to preferred web archive, but also:
• Maintain your own resource version history
• timegate link to your own history
• Distributed management of resource history
• Uniform access to resource history across systems
• Follow links across systems subject to time
16.
No timegate Link – Client Intelligence
Client uses the TimeGate of its preferred web archive, but:
• The Internet Archive is massive, yet substantial unique materials exist in other archives
• Introduce an aggregated TimeGate: the Memento Aggregator
17.
Routing TimeGate Requests Using Machine Learning
Bornand, N., Balakireva, L., Van de Sompel, H. (2016) Routing Memento Requests Using Binary Classifiers. JCDL16. https://arxiv.org/abs/1606.09136
• Memento Aggregator covers 20+ web archives
• Distributed systems problem: As the number of archives (and incoming requests) grows, sending requests to each archive for every incoming request is not feasible
  • Response times
  • Load on distributed archives
• After various optimization attempts, devised an approach using binary classifiers per web archive:
  • Trained on the basis of cached URIs, using URI features only
  • Operational since 2016: 80% reduction in # queries; 1/3 reduction in response times; recall 85%
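The routing idea can be caricatured in a few lines. This is an illustrative stdlib-only sketch, not the Aggregator's code: the hand-written rules stand in for per-archive binary classifiers that are actually trained on cached URI lookups, and the archive names and rules are examples of mine:

```python
from urllib.parse import urlparse

def uri_features(uri):
    """Extract simple lexical features from a URI; URI features are the
    only input the routing classifiers use (no network lookups)."""
    p = urlparse(uri)
    return {
        "tld": p.netloc.rsplit(".", 1)[-1],
        "path_depth": len([s for s in p.path.split("/") if s]),
        "has_query": bool(p.query),
    }

# Stand-ins for trained per-archive binary classifiers answering
# "is this archive likely to hold a memento for this URI?"
CLASSIFIERS = {
    "uk_web_archive": lambda f: f["tld"] == "uk",
    "portuguese_web_archive": lambda f: f["tld"] == "pt",
    "internet_archive": lambda f: True,  # broad coverage: always queried
}

def route(uri):
    """Query only the archives whose classifier predicts a hit."""
    f = uri_features(uri)
    return [name for name, clf in CLASSIFIERS.items() if clf(f)]

print(route("http://www.bl.uk/collections"))
```

Skipping archives whose classifier predicts a miss is what yields the reduction in query volume, at the cost of some recall.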
18.
Screenshot: time travel via the Internet Archive, selecting dates Mar 20 2007 and Apr 03 2007
Various Memento Tools (client/server)
https://github.com/machawk1/awesome-memento
19.
Pockets of Persistence
20.
Creating Pockets of Persistence
• With Memento’s time travel capability in place, what would it take to support faithfully navigating the web of the Past?
• There are two major forces that hinder achieving this goal:
  • Link rot: A link stops working altogether
  • Content drift: The linked content changes over time and may eventually no longer be representative of the content that was originally linked
• Without these forces at work, the web of the Present would be the same as the web of the Past
• But that clearly is not the case
21.
Hyperlinks in Theory
22.
Hyperlinks in Reality
23.
Hyperlinks in Reality
24.
Link Rot
25.
Link Rot - PMC
Martin Klein, Herbert Van de Sompel, Robert Sanderson, Harihar Shankar, et al. (2014) Scholarly context not found. In: PLOS ONE. https://doi.org/10.1371/journal.pone.0115253
26.
Hyperlinks in Reality
27.
Content Drift
28.
Content Drift
29.
Content Drift
http://icecube.wisc.edu/ on May 8 2009 (left) and August 27 2009 (right)
30.
No Content Drift
http://www.ifa.hawaii.edu/~cowie/k_table.html on June 9 1997 (left) and March 2016 (right)
31.
Content Drift - PMC
Shawn Jones, Herbert Van de Sompel, Harihar Shankar, Martin Klein, et al. (2016) Scholarly Context Adrift. In: PLOS ONE. https://doi.org/10.1371/journal.pone.0167475
32.
Creating Pockets of Persistence
• What would it take to really support faithfully navigating the web of the Past?
• This challenge exists for the entire web. Some communities with well-managed collections care about addressing it:
  • Scholarly communication
  • Cultural heritage
  • Legal publications
  • Journalism
  • Wikipedia
• Why?
  • Link Rot: Quality of Service
  • Content Drift: integrity of the record, reliable evidence, revisiting the state of knowledge, transparency of editorial process, …
33.
US Supreme Court Opinion – Link Rot Activism
http://ssnat.com
34.
Two Types of Links from a Managed Collection
35.
Take 1 – PID Approach
Diagram: a PID minted for resource B
36.
Managed Collection => Managed Collection
37.
PID Approach
Combat:
• Link Rot: Link to the PID; redirect to the current location
• Content Drift: Mint a PID per version; link to the version PID
With PID links:
• Web of Present = Web of Past
38.
39.
URI References - PMC
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent. In: WWW2016. http://arxiv.org/1602.09102
40.
cite-as Relation Type
Herbert Van de Sompel et al. (2018) cite-as: A Link Relation to Convey a Preferred URI for Referencing. https://datatracker.ietf.org/doc/draft-vandesompel-citeas/
http://signposting.org
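A consuming tool discovers the preferred URI by inspecting the Link header of any of the object's resources (landing page, PDF, dataset) for the cite-as relation. A minimal illustrative sketch; the sample header values are made up:

```python
import re

def find_cite_as(link_header):
    """Return the target of the first link with rel="cite-as", or None."""
    for target, params in re.findall(r'<([^>]*)>([^,]*)', link_header):
        m = re.search(r'rel="([^"]*)"', params)
        if m and "cite-as" in m.group(1).split():
            return target
    return None

# Illustrative Link header as a landing page or PDF might return it:
link = ('<https://doi.org/10.1045/november2015-vandesompel>; rel="cite-as", '
        '<https://example.org/article/landing>; rel="canonical"')
print(find_cite_as(link))  # the DOI HTTP URI, not the landing page URI
```

With such a link in place, reference managers and bookmarking tools can record the PID URI instead of whatever happens to be in the address bar.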
41.
PID Approach – Division of Labor
42.
Managed Collection => Web at Large
43.
44.
PID Approach
45.
Take 2 – Robust Links Approach
46.
Managed Collection => Web at Large
47.
Snapshot Approach
Combat:
• Link Rot & Content Drift: Custodian of A creates a snapshot of B, in a web archive or locally
Regarding links:
• Intuition suggests linking to the snapshot of B …
48.
Linking to Snapshot of B = Potentially Creating a Rotten Link
• Existing practice for linking to snapshots:
  <a href="URL of snapshot of B">
• Problems with existing practice:
  o Impossible to visit the original URI, if desired
  o Requires the permanent existence/uptime of the archive that holds the snapshot
    - One link rot problem replaced by another
http://robustlinks.mementoweb.org/about/
49.
Permanent Existence/Uptime of Archives?
Remnant of discontinued web archive http://mummify.it captured on February 14 2014
https://web.archive.org/web/20140214233752/https://www.mummify.it/
50.
Permanent Existence/Uptime of Archives?
http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html
51.
Permanent Existence/Uptime of Archives?
http://web.archive.org/web/20121101043952/http://vogin.nl on March 6 2017 at 15:59 CET
52.
Decorate the Link
• Proposed practice for linking to captures:
  <a href="URL of snapshot of B"
     data-originalurl="B"
     data-versiondate="datetime of snapshot of B">
  <a href="B"
     data-versionurl="URL of snapshot of B"
     data-versiondate="datetime of snapshot of B">
http://robustlinks.mementoweb.org/spec/
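A publishing tool can emit such decorated links mechanically. A minimal sketch: the `robust_link` helper is mine, but the `data-versionurl`/`data-versiondate` attribute names come from the Robust Links specification above, and the example URLs are illustrative:

```python
from html import escape

def robust_link(href, version_url, version_date, text):
    """Emit an <a> element decorated per the Robust Links attributes:
    the href stays on the original URI, the snapshot rides along as data."""
    return (f'<a href="{escape(href, quote=True)}" '
            f'data-versionurl="{escape(version_url, quote=True)}" '
            f'data-versiondate="{version_date}">{escape(text)}</a>')

print(robust_link(
    "http://example.org/page",
    "https://web.archive.org/web/20090508000000/http://example.org/page",
    "2009-05-08",
    "an example page"))
```

Because the decoration is plain data attributes, the link still works as an ordinary hyperlink even when no Robust Links tooling is present.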
53.
Robust Links: Link Decoration in Action
Van de Sompel H. & Nelson, M.L. (2015) Reminiscing about 15 years of interoperability efforts. In: D-Lib Magazine. https://doi.org/10.1045/november2015-vandesompel
JavaScript makes the link decorations actionable
54.
Robust Links: Refuse to Die
55.
56.
Snapshot Approach – Division of Labor
57.
Managed Collection => Managed Collection
58.
Cool URI Approach
Combat:
• Link Rot: Link to B; redirect to the current location
• Content Drift: Generic URI; version URIs
With Cool URI links:
• Tension between linking to the generic URI and a version URI
59.
Robust Links: Refuse to Die
60.
61.
Cool URI Approach – Division of Labor
62.
Robust Links Approach
63.
Summary
Table comparing the PID and Robust Links (RL) approaches on labor and other criteria
64.
Robust Links for Linked Data?
Sanderson, R., Ciccarese, P., and Young, B. (2017) Web Annotation Vocabulary. W3C Recommendation 23 February 2017. https://www.w3.org/TR/annotation-vocab/
65.
Handling Resource Versions, Captures
Diagram: resource B and its versions/captures at times t1 and t2
66.
Systems with Resource Versions
67.
DBpedia Snapshot Archive Using HDT, TPF, Memento
Vander Sande, M., Verborgh, R., Hochstenbach, P., and Van de Sompel, H. (2017) Towards sustainable publishing and querying of distributed Linked Data archives.
Temporal: subject URI access ; ?s ?p ?o queries ; SPARQL queries
68.
Memento Tracer
http://tracer.mementoweb.org
69.
Resource Capture: Tension Between Scale and Quality
• Web crawling: optimized for scale
  • Problems with capturing resources accessible via interactive affordances
• webrecorder.io: optimized for quality
  • Personal archiving
  • User records web navigation session
  • Not used for archiving at scale
• LOCKSS: optimized for scholarly journals
  • Pages in publisher/journal portals share layout, affordances
  • Heuristics per publisher/journal to improve capture quality
70.
Memento Tracer: New Sweet Spot Between Scale and Quality
• ~ web crawling: server-side process to capture resources
• ~ LOCKSS: leverages the insight that web publications in any given portal are based on the same template:
  • share layout
  • share interactive affordances
• ~ webrecorder.io: human guidance to achieve quality
• But, with Memento Tracer:
  • the user does not record a specific web publication
  • the user records heuristics that apply to a class of web publications
71.
Memento Tracer
72.
A Trace for slideshare Presentations
{
  "portal_url_match": "(slideshare.net)/([^/]+)/([^/]+)",
  "actions": [
    {
      "action_order": "1",
      "value": "div.j-next-btn.arrow-right",
      "type": "CSSSelector",
      "action": "repeated_click",
      "repeat_until": {
        "condition": "changes",
        "type": "resource_url"
      }
    },
    {
      "action_order": "2",
      "value": "div.notranslate.transcript.add-padding-right.j-transcript a",
      "type": "CSSSelector",
      "action": "click"
    }
  ], …
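The portal_url_match expression is what scopes a Trace to a whole class of web publications rather than a single page. A quick illustrative check of how that pattern matches; the deck URL below is hypothetical:

```python
import re

# Pattern copied from the Trace above: matches slideshare user/deck URLs.
portal_url_match = r"(slideshare.net)/([^/]+)/([^/]+)"

m = re.search(portal_url_match, "https://www.slideshare.net/hvdsomp/some-deck")
print(m.groups())  # → ('slideshare.net', 'hvdsomp', 'some-deck')
```

Any URL matching the pattern is captured with the same recorded actions, which is how one recorded Trace generalizes across a portal.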
73.
Memento Tracer: Experimental
• Promising results thus far
• Currently investigating challenges, including:
  • User interface to support recording Traces for complex sequences of interactions
  • Limitations of the browser event listener approach for recording Traces
  • Language used to express Traces
  • Organization of the shared repository for Traces
  • Selection of a Trace for capturing a web publication in cases where different page layouts and interactive affordances are available for web publications that share a URI pattern
74.
Demo: Recording a Trace for a Web Publication
https://github.com/www.gorillatoolkit/pkg/mux
75.
Demo: Capturing another Web Publication Using the Trace
https://github.com/mementoweb/node-solid-server
76.
Demo: Capturing another Web Publication Using the Trace
https://github.com/mementoweb/node-solid-server
77.
Demo: Playing Back the Captured Web Publication
Capture of https://github.com/mementoweb/node-solid-server
78.
Herbert Van de Sompel
Los Alamos National Laboratory
@hvdsomp
Perseverance on Persistence
a future-note about the past