This document summarizes a presentation about using MementoMaps to efficiently route memento lookup requests to appropriate web archives. MementoMaps provide concise summaries of what URIs are held by each archive in order to avoid broadcasting requests to all archives. They can be generated from archive indexes, compacted for size, and published for discovery. Adopting MementoMaps could significantly reduce wasted lookup requests across archives.
Archive Assisted Archival Fixity Verification Framework (Sawood Alam)
The number of public and private web archives has increased, and we implicitly trust content delivered by these archives. Fixity is checked to ensure an archived resource has remained unaltered since the time it was captured. Some web archives do not allow users to access fixity information and, more importantly, even if fixity information is available, it is provided by the same archive from which the archived resources are requested. In this research, we propose two approaches, namely Atomic and Block, to establish and check fixity of archived resources.
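The core of any fixity scheme is comparing a cryptographic digest computed when the resource was captured against one computed at verification time. A minimal sketch of that idea (our own illustration, with hypothetical field names; not the Atomic or Block implementation from the paper):

```python
import hashlib

def fixity_record(uri_m, content, headers):
    """Record a digest over a memento's payload plus selected headers,
    so a later replay of the same memento can be compared against it.
    (Illustrative only; field names are ours, not the paper's.)"""
    return {
        "uri-m": uri_m,
        "sha256": hashlib.sha256(content).hexdigest(),
        "headers": {k: headers[k] for k in ("Content-Type",) if k in headers},
    }

def verify(record, content):
    """Re-fetch the memento later and check the payload hash still matches."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]

body = b"<html>archived page</html>"
rec = fixity_record("https://web.archive.org/web/2016/https://example.com/",
                    body, {"Content-Type": "text/html"})
assert verify(rec, body)
assert not verify(rec, b"<html>tampered page</html>")
```

Publishing such records to archives other than the one serving the memento is what removes the need to trust a single archive about its own content.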
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing (Sawood Alam)
Topic: Doctoral Dissertation Defense
Title: MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
Student: Sawood Alam
University: Old Dominion University
Date: Friday, December 4, 2020
Impact of HTTP Cookie Violations in Web Archives (Sawood Alam)
Certain HTTP cookies on certain sites can be a source of content bias in archival crawls. Accommodating cookies at crawl time but not utilizing them at replay time may cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store cookies with a short expiration time and that archival replay systems account for values in the Vary header along with URIs.
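One way to honor the Vary header at replay time is to key captures by the URI plus the request-header values the response varied on. A hypothetical sketch (the index layout and variant labels are our own, not an actual replay system):

```python
# Index captures by URI plus the request-header values named in the
# response's Vary header, so replay serves the matching variant.
def variant_key(uri, vary_header, request_headers):
    varied = [h.strip().lower() for h in vary_header.split(",") if h.strip()]
    return (uri, tuple(request_headers.get(h, "") for h in sorted(varied)))

index = {}
uri = "https://example.com/feed"

# Crawl time: one capture made with a session cookie, one without.
index[variant_key(uri, "Cookie", {"cookie": "session=abc"})] = "logged-in capture"
index[variant_key(uri, "Cookie", {})] = "anonymous capture"

# Replay time: a client without the cookie gets the anonymous variant
# instead of a defaced composite of both.
assert index[variant_key(uri, "Cookie", {})] == "anonymous capture"
```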
Readying Web Archives to Consume and Leverage Web Bundles (Sawood Alam)
The potential utilization of Web Bundles, an emerging Web technology, in Web archiving; presented by Sawood Alam in Session 8 of the IIPC WAC 2021.
Recording: https://youtu.be/lQX9v9V0FRQ
Supporting Web Archiving via Web Packaging (Sawood Alam)
We describe challenges related to web archiving, replaying archived web resources, and verifying their authenticity. We show that Web Packaging has significant potential to help address these challenges and identify areas in which changes are needed in order to fully realize that potential.
Position Paper: https://arxiv.org/abs/1906.07104
Presented in the Internet Architecture Board's ESCAPE 2019 Workshop (Exploring Synergy between Content Aggregation and the Publisher Ecosystem)
https://www.iab.org/activities/workshops/escape-workshop/
MementoMap Framework for Flexible and Adaptive Web Archive Profiling (Sawood Alam)
In this work we propose MementoMap, a flexible and adaptive framework to efficiently summarize the holdings of a web archive. We describe a simple yet extensible file format suitable for MementoMaps. We use the complete index of arquivo.pt, comprising 5B mementos (archived web pages/files), to understand the nature and shape of its holdings, and generate MementoMaps with varying amounts of detail from its HTML pages that have an HTTP status code of 200 OK. Additionally, we design a single-pass, memory-efficient, and parallelization-friendly algorithm to compact a large MementoMap into a smaller one, along with an in-file binary search method for efficient lookup. We analyze more than three years of MemGator (a Memento aggregator) logs to understand the response behavior of 14 public web archives, and evaluate MementoMaps by measuring their Accuracy using 3.3M unique URIs from those logs. We find that a MementoMap of less than 1.5% Relative Cost (as compared to the comprehensive listing of all the unique original URIs) can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive while maintaining 100% Recall (i.e., zero false negatives).
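The in-file binary search can be sketched as follows. The file layout here is a simplified, SURT-like stand-in for the actual MementoMap format; the point is that lookup seeks by byte offset and realigns to line boundaries, so it never loads the whole map into memory:

```python
import io

# Sorted, line-oriented map: "<SURT-like key> <count>" per line (simplified).
DATA = b"""com,cnn)/ 120
com,example)/ 3
org,archive)/ 4500
pt,arquivo)/ 78
"""

def _line_at(f, offset):
    """Return the first complete line starting at or after `offset`."""
    f.seek(offset)
    if offset:
        f.readline()  # discard the (possibly partial) line we landed in
    return f.readline()

def lookup(f, size, key):
    """O(log n) seeks, O(1) memory: bisect on byte offsets."""
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        line = _line_at(f, mid)
        if line and line.split(b" ")[0] < key:
            lo = mid + 1
        else:
            hi = mid
    line = _line_at(f, lo)
    if line and line.split(b" ")[0] == key:
        return line.decode().rstrip()
    return None

f = io.BytesIO(DATA)
assert lookup(f, len(DATA), b"org,archive)/") == "org,archive)/ 4500"
assert lookup(f, len(DATA), b"net,example)/") is None
```

The same seek-and-realign trick works on an on-disk file object, which is what makes multi-gigabyte MementoMaps searchable without an external index.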
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving (Sawood Alam)
InterPlanetary Wayback (IPWB) facilitates permanence and collaboration in web archives by disseminating the contents of WARC files into the InterPlanetary File System (IPFS) network. IPFS is a peer-to-peer, content-addressable file system that inherently allows deduplication and facilitates opt-in replication. IPWB splits the headers and payloads of WARC response records before disseminating them into IPFS to leverage deduplication, builds a CDXJ index with references to the returned IPFS hashes, and recombines the headers and payloads from IPFS at replay time. We also explore the possibility of an index-free, fully decentralized collaborative web archiving system as the next step.
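The split-and-deduplicate idea can be illustrated with a plain dictionary standing in for IPFS (real IPWB uses IPFS content identifiers and a CDXJ index; the names here are our own):

```python
import hashlib

store = {}  # stands in for IPFS: content hash -> bytes

def put(blob):
    h = hashlib.sha256(blob).hexdigest()
    store[h] = blob  # IPFS would deduplicate identical content automatically
    return h

def index_record(raw_http_response):
    """Split an HTTP response (from a WARC record) into header and payload,
    storing each by content hash."""
    header, _, payload = raw_http_response.partition(b"\r\n\r\n")
    return {"header": put(header), "payload": put(payload)}

def replay(entry):
    """Recombine header and payload at replay time."""
    return store[entry["header"]] + b"\r\n\r\n" + store[entry["payload"]]

resp = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
entry1 = index_record(resp)
# A later capture with different headers but an identical payload:
resp2 = b"HTTP/1.1 200 OK\r\nDate: later\r\n\r\n<html>hi</html>"
entry2 = index_record(resp2)
assert entry1["payload"] == entry2["payload"]  # payload stored only once
assert replay(entry1) == resp
```

Because headers change between captures far more often than payloads do, splitting them is what unlocks most of the deduplication.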
The Memento Protocol and Research Issues With Web Archiving (Michael Nelson)
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
www.cs.odu.edu/~mln/
University of Virginia Colloquium
2016-09-12
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages (Michael Nelson)
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@WebSciDL, @phonedude_mln
With:
ODU: Michele C. Weigle, Mohamed Aturban
Los Alamos National Laboratory: Herbert Van de Sompel, Martin Klein
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript (Michael Nelson)
Justin F. Brunelle
Michele C. Weigle
Michael L. Nelson
Web Science and Digital Libraries Research Group
Old Dominion University
@WebSciDL
IIPC 2016
Reykjavik, Iceland, April 11, 2016
Web Archiving Activities of ODU’s Web Science and Digital Library Research G... (Michael Nelson)
Michael L. Nelson
@phonedude_mln
Michele C. Weigle
@weiglemc
National Symposium on Web Archiving Interoperability
2017-02-21
Many projects joint with LANL
Funding from NSF, IMLS, NEH, and AMF
Keynote talk presented at Web Archiving and Digital Libraries (WADL) 2018
June 6, 2018 - Fort Worth, TX
Michele C. Weigle (@weiglemc)
Web Science and Digital Libraries (WS-DL) Research Group (@WebSciDL)
Old Dominion University
Norfolk, VA
For ocean science researchers, the success of data discovery revolves around the capability to ask complex questions of data centers like BCO-DMO. Our ability to accurately respond depends on our system's capability to understand the question, interpret its relevancy to what we know, and return those results in a way a human can digest.
As the needs for responding to the grand challenges of science become more interdisciplinary, data discovery will become more dependent on information from a variety of sources to enable researchers to reliably access the data they need. To do this effectively, all data and their metadata require context, cooperation and semantic interoperability.
This talk explores the current landscape of data discovery, the questions researchers ask of our software, how bad our software is at responding, and how Linked Data is a viable solution for improving those responses.
Watch this presentation at: https://www.youtube.com/watch?v=wEllMpcNQFg
https://austin2014.drupal.org/session/linked-data-drupal-oceanographic-data-management
http://www.bco-dmo.org
http://lod.bco-dmo.org/sparql
To the Rescue of the Orphans of Scholarly Communication (Martin Klein)
presentation at CNI Spring 2017 meeting
Herbert Van de Sompel
http://orcid.org/0000-0002-0715-6126
Michael L. Nelson
http://orcid.org/0000-0003-3749-8116
Martin Klein
http://orcid.org/0000-0003-0130-2097
Extended version of slides presented at the "404/File Not Found" symposium held at Georgetown University on October 24, 2014 (see http://www.law.georgetown.edu/library/404/). The presentation provides a brief overview of the link/reference rot problem and then discusses three complementary strategies to combat it: proactively capturing web resources that are linked from a seed collection; referencing the captures by means of annotated links; and accessing the captures using Memento infrastructure.
The evolution of the Web should move forward in an upward spiral that cycles between guiding values, engineering, and science. Guiding values should comprise social values as well as system principles that further the stabilization and growth of the Web. Principles I will talk about include social inclusion, connectedness, and fairness. Example efforts improve Web access for the disabled, critically assess Web structures and Web growth, and try to transfer knowledge about previously found patterns of Web growth to analogous cases.
Profiling Web Archival Voids for Memento Routing (Sawood Alam)
Slides of the paper presentation, "Profiling Web Archival Voids for Memento Routing", for JCDL 2021.
Authors: Sawood Alam, Michele C. Weigle, Michael L. Nelson
Preprint: https://arxiv.org/abs/2108.03311
Recording: https://youtu.be/ImJWkndNoS8
MementoMap: An Archive Profile Dissemination Framework (Sawood Alam)
We introduce MementoMap, a framework by which web archives (or third parties) can express and disseminate their holdings as archive profiles. The framework allows arbitrary, flexible, and dynamic levels of detail in its entries to fit the needs of archives of different scales. This enables Memento aggregators to significantly reduce wasted traffic to web archives.
Mathematics & Computer Science Seminar
Emory University
October 2, 2009
Martin Klein & Michael L. Nelson
Department of Computer Science
Old Dominion University
Norfolk VA
Leveraging Wikipedia as a Hub for Data Integration: the Remixing Archival Metadata Project (RAMP)
Timothy A. Thompson, Metadata Librarian (Spanish/Portuguese Specialty), Princeton University Library
Aqua Browser Implementation at Oklahoma State University (youthelectronix)
On Wednesday, November 7th, Dr. Anne Prestamo discussed "AquaBrowser Implementation at Oklahoma State University Library" as part of a program on Next Generation Catalogs held at the University of Massachusetts at Amherst and co-sponsored by the Five Colleges' Librarians Council and Simmons College Graduate School of Library and Information Science (GSLIS).
Search Interfaces on the Web: Querying and Characterizing, PhD dissertation (Denis Shestakov)
Full text of my PhD dissertation, "Search Interfaces on the Web: Querying and Characterizing", defended at the ICT-Building, Turku, Finland, on 12.06.2008.
Thesis contributions:
* New methods for deep Web characterization
* Estimating the scale of a national segment of the Web
* Building a publicly available dataset describing >200 web databases on the Russian Web
* Designing and implementing the I-Crawler, a system for automatic finding and classifying search interfaces
* Technique for recognizing and analyzing JavaScript-rich and non-HTML searchable forms
* Introducing a data model for representing search interfaces and result pages
* New user-friendly and expressive form query language for querying search interfaces and extracting data from result pages
* Designing and implementing a prototype system for querying web databases
* Bibliography with over 110 references to publications in the area of deep Web
Thomas Delerm and Adrien Di Mascio from Logibal will explain the value of web semantics in modern web applications for making the best use of your data.
They'll give the recipes that make Jahia an appropriate CMS for the semantic and linked data web, a.k.a. "Web 3.0".
Online Collections Crawlability for Libraries, Archives, and Museums (mherbison)
The Goal is Crawlability.
Allow and encourage webcrawlers to access everything on your website that you want users to be able to find.
(1) If webcrawlers can’t get to your stuff...
(2) Search engines won’t index your stuff...
(3) Your stuff won’t turn up in users’ web searches...
(4) Users won’t find your stuff!
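Step (1) can be sanity-checked programmatically: confirm that your robots.txt actually admits crawlers to the collection pages and advertises a sitemap. A small sketch using Python's standard urllib.robotparser (the rules and paths below are hypothetical examples):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a collections site: block only the search
# results, allow everything else, and advertise a sitemap for indexing.
robots_txt = """\
User-agent: *
Disallow: /search
Allow: /
Sitemap: https://example.org/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Collection item pages are crawlable; dynamic search pages are not.
assert rp.can_fetch("*", "https://example.org/collections/item/42")
assert not rp.can_fetch("*", "https://example.org/search/advanced")
assert rp.site_maps() == ["https://example.org/sitemap.xml"]
```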
A presentation on mashing up Twitter Annotations with the Semantic Web. June 24, 2010 at the Semantic Technology Conference, San Francisco (SemTech 2010).
CDX Summary: Web Archival Collection Insights (Sawood Alam)
Large web archival collections are often opaque about their holdings. We created an open-source tool, CDX Summary, to generate statistical reports based on URIs, hosts, TLDs, paths, query parameters, status codes, media types, dates, and times present in the CDX index of a collection of WARC files. Our tool also surfaces a configurable number of potentially good random memento samples from the collection for visual inspection, quality assurance, representative thumbnail generation, etc. The tool generates both human- and machine-readable reports with varying levels of detail for different use cases. Furthermore, we implemented a Web Component that can render generated JSON summaries in HTML documents. Early exploration of CDX insights on Wayback Machine collections helped us improve our crawl operations.
Venue: TPDL 2022
Recording: https://www.youtube.com/watch?v=K5i3XShqW6A
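The kind of aggregation such a tool performs can be sketched over simplified CDX-like lines (the four-field layout below is a minimal stand-in for illustration, not the tool's actual input format):

```python
from collections import Counter

# Simplified CDX-like lines: "<SURT> <timestamp> <media type> <status>".
cdx_lines = [
    "com,example)/ 20210102030405 text/html 200",
    "com,example)/about 20210507080910 text/html 200",
    "com,example)/logo.png 20210507080911 image/png 200",
    "org,archive)/ 20220101000000 text/html 301",
]

status, media, tlds = Counter(), Counter(), Counter()
for line in cdx_lines:
    surt, ts, mime, code = line.split()
    status[code] += 1           # status code distribution
    media[mime] += 1            # media type distribution
    tlds[surt.split(",")[0]] += 1  # TLD distribution from the SURT key

assert status["200"] == 3
assert media.most_common(1)[0] == ("text/html", 3)
assert tlds == {"com": 3, "org": 1}
```

A single pass over the index yields every distribution at once, which is why this scales to large collections.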
Video Archiving and Playback in the Wayback Machine (Sawood Alam)
At the Internet Archive (IA) we collect static and dynamic lists of seeds from various sources (like Save Page Now, Wikipedia EventStream, Cloudflare, etc.) for archiving. Some of these seeds include web pages with videos on them. Those URLs are curated based on certain criteria to identify potential videos that should be archived or excluded. Candidate video page URLs for archiving are placed in a queue (currently using Kafka) to be consumed by a separate process.

We maintain a persistent database of videos we have already archived, which is used both for status tracking and as a seen-check system to avoid duplicate downloads of large media files that usually do not change. We use youtube-dl (or one of its forks) to download videos and their metadata. We archive the container HTML page, associated video metadata, any transcriptions, thumbnails, and at least one of the many video files with different resolutions and formats. These pieces are stored in separate WARC records (some with “response” type and others as “metadata”).

Some popular video streaming services do not have static links to embedded video files, which makes it difficult to identify and serve video files corresponding to their container HTML pages on archival replay. To glue related pieces together for replay we are currently using a key-value store, but we are exploring ways to get away without an additional index. We use a custom video player and perform the necessary rewriting in the container HTML page for a more reliable video playback experience.

We create a daily summary of the metadata of videos we have archived and load it into a custom-built Video Archiving Insights dashboard to identify any issues or biases, which serves as a feedback loop for quality assurance and to enhance our curation criteria and archiving strategies. We are always looking for ways to improve the system at scale, as well as means to interoperate.
Recording: youtube.com/watch?v=6MiYKOq_DKo
A brief introduction to the WARC file format used for long-term Web archival preservation. These slides were initially prepared for a guest lecture in the CS 531 Web Server Design (Fall 2018) course at Old Dominion University.
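The anatomy of a WARC response record can be shown by assembling one by hand: a WARC/1.0 version line, named WARC headers, a blank line, the captured HTTP message, and a trailing blank-line terminator. This is a teaching sketch; production code would use a library such as warcio rather than concatenating bytes:

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(target_uri, http_bytes):
    """Build a minimal WARC 'response' record around a raw HTTP response."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_bytes))),  # length of the HTTP block
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    return head.encode() + b"\r\n" + http_bytes + b"\r\n\r\n"

http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
record = warc_response_record("https://example.com/", http)
assert record.startswith(b"WARC/1.0\r\n")
assert b"WARC-Target-URI: https://example.com/" in record
```

Concatenating many such records (optionally gzip-compressed per record) is all a .warc file is.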
MemGator - A Memento Aggregator CLI and Server in Go (Sawood Alam)
MemGator - A Portable Concurrent Memento Aggregator CLI and Server Written in Go.
The corresponding poster can be found at http://www.cs.odu.edu/~salam/presentations/memgator-jcdl16-poster.pdf
Avoiding Zombies in Archival Replay Using ServiceWorker (Sawood Alam)
Live-leakage (zombie resource) is an issue in archival replay of web pages. This work proposes a mechanism to avoid such live-leakage using ServiceWorker. This work was presented in WADL 2017 on June 22 in Toronto, Ontario, Canada.
Client-side Reconstruction of Composite Mementos Using ServiceWorker (Sawood Alam)
Live-leakage (zombie resource) is an issue in archival replay of web pages. This work proposes a mechanism to avoid such live-leakage using ServiceWorker. This work was presented in JCDL 2017 on June 20 in Toronto, Ontario, Canada.
Introducing Web Archiving and WSDL Research Group (Sawood Alam)
My talk introducing Web archiving and the Web Science and Digital Libraries Research Group to invited students from India attending a summer workshop at Old Dominion University, Norfolk, VA.
A talk given to final year B.Tech. Computer Science students at Jamia Millia Islamia, New Delhi, India with the intent of spreading awareness about web archiving and digital preservation and motivating the students for research.
HTTP Mailbox - Asynchronous RESTful Communication (Sawood Alam)
A Thesis Presentation to the Faculty of Old Dominion University in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science
1. Summarize Your Archival Holdings With MementoMap
Sawood Alam, Internet Archive
Michael L. Nelson, Old Dominion University
Michele C. Weigle, Old Dominion University
Daniel Gomes, Arquivo.pt
IIPC Web Archiving Conference, June 16, 2021
#MementoMap
@ibnesayeed
3. Cross-Archive Memento Lookup With MemGator

$ memgator -f cdxj http://si.edu/ | grep -v "^!" | cut -d'/' -f3 | sort | uniq -c | sort -nr
13263 web.archive.org
3590 wayback.archive-it.org
1202 web.archive.bibalex.org
651 webarchive.loc.gov
321 arquivo.pt
32 wayback.vefsafn.is
11 web.archive.org.au
3 archive.is
1 www.webarchive.org.uk
1 swap.stanford.edu
1 perma.cc

$ memgator -f cdxj http://odu.edu/ | grep -v "^!" | cut -d'/' -f3 | sort | uniq -c | sort -nr
3071 web.archive.org
796 wayback.archive-it.org
751 web.archive.bibalex.org
99 webarchive.loc.gov
26 arquivo.pt
2 archive.is
1 wayback.vefsafn.is

Although there are 13k+ mementos in IA, there are also mementos in 10 other public web archives. ODU is less popular, but there are mementos in 7 different web archives.

https://github.com/oduwsdl/MemGator
4. @ibnesayeed
Who Would Have Thought to Look Up odu.edu Mementos in the Icelandic Web Archive?
http://wayback.vefsafn.is/wayback/20100810032449/http://odu.edu/
5. @ibnesayeed
Prevalence of Sample Query URI Sets in Archives

Sample (1M URIs Each)   In Archive-It   In UKWA   In Stanford   Union {AIT, UK, SU}
DMOZ                    4.097%          3.594%    0.034%        7.575%
MementoProxy            4.182%          0.408%    0.046%        4.527%
IAWayback               3.716%          0.519%    0.039%        4.165%
UKWayback               0.108%          0.034%    0.002%        0.134%
Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
6. @ibnesayeed
Why Aggregate Small Archives?
● Wayback Machine does not cover everything
● Archives often have unique mementos (small overlap)
● Linguistic and geolocation diversity
● High-quality curated collections
● Restricted resources and private archives
13. @ibnesayeed
MemGator Log Responses From Various Archives
93% of the requests made from MemGator to upstream archives were wasteful.
Only about one third of the requests to the largest web archive (IA) were a hit.
14. @ibnesayeed
Aggregation Is Great, But Broadcasting Is Wasteful
What do we want? Aggregate all archives, large or small.
What’s the problem? Broadcasting is wasteful and problematic.
What’s the solution? Selectively poll archives that are likely to return good results for a lookup URI.
How to identify those? Profile web archives.
How to profile archives? The MementoMap Framework.
Sawood Alam, “MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing”, Doctoral Dissertation, ODU, 2020
15. @ibnesayeed
If Only Archives Could Tell What to Ask Them For
● Websites advertise their holdings using sitemap.xml, why can’t archives?
○ Archives have billions or even trillions of URI-Ms
○ Such exhaustive lists would go stale very quickly
● How about robots.txt?
○ It is compact, but it is an exclusion format; it does not tell what the site has
○ It assumes a single domain, and its patterns are for paths (not the domain name)
● How about well-known URIs?
○ Good for automated discovery of domain-specific metadata resources
● How about combining these ideas?
○ Introducing MementoMap!
19. @ibnesayeed
What is Archived in Arquivo.pt?
What is Accessed from MemGator?
2B URI-Rs that have 1-9 mementos each in Arquivo.pt were never requested from ODU’s MemGator server.
43 URI-Rs were requested thousands of times each, but had zero mementos in Arquivo.pt.
45 URI-Rs had tens of mementos each that were requested hundreds of times.
20. @ibnesayeed
What is Archived in Arquivo.pt?
What is Accessed from MemGator?
Blind spot of a usage-based profile.
Blind spot of a content-based profile.
21. @ibnesayeed
Who Bears the Cost of Bad Routing Decisions?
                                       Actual
                                       Present in the Archive   Not in the Archive
Predicted  Routed to the Archive       True Positive (TP)       False Positive (FP)
           Not Routed to the Archive   False Negative (FN)      True Negative (TN)

FP: Wasteful (infrastructure suffers)
FN: Disuse (users suffer)
22. @ibnesayeed
URI Canonicalization and SURT
https://news.bbc.co.uk/images/Logo.png?width=200&height=80&rotate=90%C2%B0#top
http://www.news.BBC.co.uk/images/Logo.png?width=200&height=80&rotate=90%c2%b0#top
http://www.news.bbc.co.uk/images/Logo.png?rotate=90%c2%B0&width=200&height=80
http://NEWS.BBC.CO.UK:80//images//Logo.png?height=80&width=200&rotate=90%c2%b0#top
Canonicalization: news.bbc.co.uk/images/Logo.png?height=80&rotate=90%C2%B0&width=200
SURT: uk,co,bbc,news,)/images/logo.png?height=80&rotate=90%c2%b0&width=200
24. @ibnesayeed
SURT Representation With Wildcard
Original SURTs did not have wildcards; we introduced them for dynamic profiling.
In practice, the common “http://(” prefix is removed.
25. @ibnesayeed
Shape of URI Key Tree of Arquivo.pt
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
26. @ibnesayeed
A MementoMap Example
!context ["http://oduwsdl.github.io/contexts/ukvs"]
!id {uri: "http://archive.example.org/"}
!fields {keys: ["surt"], values: ["frequency"]}
!meta {type: "MementoMap", name: "A Test Web Archive", year: 1996}
!meta {updated_at: "2018-09-03T13:27:52Z"}
* 54321/20000
com,* 10000+
org,arxiv)/ 100
org,arxiv)/* 2500~/900
org,arxiv)/pdf/* 0
uk,co,bbc)/images/* 300+/20-
https://github.com/oduwsdl/ORS/blob/master/ukvs.md
Goodbye HmPn/DLim static profiling policies, thanks to our SURT with wildcard.
27. @ibnesayeed
MementoMap
https://github.com/oduwsdl/MementoMap
$ mementomap
Usage: mementomap [-h] {generate,compact,lookup,batchlookup} ...
Positional Arguments:
{generate,compact,lookup,batchlookup}
generate Generate a MementoMap from a sorted file with the
first columns as SURT (e.g., CDX/CDXJ)
compact Compact a large MementoMap file into a small one
lookup Search for a URI/SURT into a MementoMap
batchlookup Search for a list of URIs/SURTs into a MementoMap
Optional Arguments:
-h, --help Show this help message and exit
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
28. @ibnesayeed
Processed Lines vs. Compacted MementoMap Growth
Input (sorted SURTs):
com,example)/a/1/x
com,example)/a/2
com,example)/a/3
com,example)/b/1
com,example)/b/2
com,example)/c/1

After one compaction pass:
com,example)/a/*
com,example)/b/1
com,example)/b/2
com,example)/c/1

After a further pass:
com,example)/*
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
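One way to read the example above: whenever a prefix accumulates more distinct children than some threshold, its subtree rolls up into a wildcard. A minimal sketch of a single such pass; the function, its `depth` parameter, and the threshold of 2 are illustrative assumptions, not the actual MementoMap CLI algorithm (which tunes compaction with `--hcf`/`--pcf` factors):

```python
from collections import defaultdict

def compact_once(surts, depth, max_children=2):
    """One compaction pass over sorted SURT keys.

    If a prefix of `depth` path segments has more than `max_children`
    distinct children, replace its whole subtree with "prefix/*".
    """
    children = defaultdict(set)
    for s in surts:
        host, _, path = s.partition(")/")
        segs = path.split("/") if path else []
        if len(segs) > depth:
            prefix = (host + ")/" + "/".join(segs[:depth])).rstrip("/")
            children[prefix].add(segs[depth])
    rolled = {p for p, kids in children.items() if len(kids) > max_children}
    out = []
    for s in surts:
        host, _, path = s.partition(")/")
        segs = path.split("/") if path else []
        prefix = (host + ")/" + "/".join(segs[:depth])).rstrip("/")
        key = prefix + "/*" if prefix in rolled else s
        if key not in out:  # emit each wildcard only once
            out.append(key)
    return out
```

Running this at decreasing depths reproduces the two stages shown on the slide.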
29. @ibnesayeed
MementoMap Generation, Compaction, and Lookup
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
A 1.5% relative cost yields 60% accuracy: Arquivo.pt can save 60% of wasted traffic by publishing a 119MB summary file!
30. @ibnesayeed
Why Profile Archival Voids?
$ curl -I https://web.archive.org/web/https://quora.com/
HTTP/1.1 403 FORBIDDEN
Server: nginx/1.15.8
Date: Wed, 02 Dec 2020 20:39:33 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Server-Timing: captures_list;dur=0.150497
X-App-Server: wwwb-app58
X-ts: 403
The Internet Archive has many “*.com” domains, but it may not want to capture or replay some.
Alam et al., “Profiling Web Archival Voids for Memento Routing”, JCDL 2021
31. @ibnesayeed
Archival Voids Profiles Reduce False Positives
Holdings index:
org,arxiv)/abs/a 40
org,arxiv)/abs/b 23
org,arxiv)/abs/c 17
org,arxiv)/format/a 15
org,arxiv)/format/b 20
org,arxiv)/format/c 10
org,arxiv)/search/a 30
...

Compacted profile:
org,arxiv)/abs/* 80
org,arxiv)/format/* 45
org,arxiv)/search/* 60

Fully compacted profile:
org,arxiv)/* 185

Lookups that become false positives:
org,arxiv)/abs/d (against the compacted profile)
org,arxiv)/pdf/a, org,arxiv)/pdf/b, org,arxiv)/pdf/c (against the fully compacted profile)

Adding an archival voids entry eliminates the “pdf” false positives:
org,arxiv)/* 185
org,arxiv)/pdf/* 0

How about summarizing frequently accessed URIs an archive does not hold?
Alam et al., “Profiling Web Archival Voids for Memento Routing”, JCDL 2021
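The example above implies a lookup rule: the most specific matching key wins, and a frequency of 0 (a void entry) tells the aggregator not to poll the archive. A minimal sketch, assuming the MementoMap is held as a dict of SURT keys (possibly ending in `*`) to frequencies; this is an illustration of the routing logic, not the MementoMap CLI's lookup implementation:

```python
def lookup(mementomap, surt_key):
    """Return the most specific matching (key, frequency) pair, or None.

    A frequency of 0 marks an archival void: do not route the request.
    """
    best = None
    for key, freq in mementomap.items():
        if key.endswith("*"):
            # Wildcard key: prefix match; longer keys are more specific.
            if surt_key.startswith(key[:-1]):
                if best is None or len(key) > len(best[0]):
                    best = (key, freq)
        elif key == surt_key:
            return (key, freq)  # exact match is always most specific
    return best
```

With the slide's profile, `org,arxiv)/abs/d` matches `org,arxiv)/*` (route), while `org,arxiv)/pdf/a` matches the more specific void `org,arxiv)/pdf/* 0` (do not route).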
32. @ibnesayeed
404-Only Frequencies and Request Savings
An archival voids profile of 2.4k URIs, each accessed hundreds of times or more, could have saved about 8.4% of wasted requests.
Alam et al., “Profiling Web Archival Voids for Memento Routing”, JCDL 2021
33. @ibnesayeed
Archival Voids Recommendations
● Keep archival voids profiles separate from archival holdings
● Update them often
● Use specific keys, and only with high confidence
● Profile only resources that are in high demand
● Archives themselves are better sources of truth than external observers
Alam et al., “Profiling Web Archival Voids for Memento Routing”, JCDL 2021
34. @ibnesayeed
Dissemination and Discovery Methods
Well-known URI:
GET /.well-known/mementomap HTTP/1.1
Host: arquivo.pt

Link Header:
Link: <https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap"

Link HTML Element:
<link href="https://arquivo.pt/path/to/mementomap.ukvs" rel="mementomap">
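A client discovering a MementoMap via the Link header needs only a small amount of parsing. A minimal sketch for the single-link case shown above; the function name is illustrative, the `mementomap` link relation is the one proposed in this talk, and a full RFC 8288 parser would also handle commas inside URIs and additional parameters:

```python
import re

def mementomap_link(link_header, rel="mementomap"):
    """Extract the target URI of the given link relation from a Link header.

    Minimal RFC 8288-style parsing: splits on commas, so it assumes the
    target URIs themselves contain no commas.
    """
    for part in link_header.split(","):
        m = re.search(r'<([^>]+)>\s*;\s*rel="?([^";]+)"?', part)
        if m and m.group(2) == rel:
            return m.group(1)
    return None
```

In practice a client might try the well-known URI first and fall back to the Link header or HTML `<link>` element.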
35. @ibnesayeed
MementoMap Adoption Path
● PWA, UKWA, and NLA have shown interest
● PyWB archival replay system is open for implementation
● MemGator and LANL’s Time Travel service are interested
● Big web archives can start with publishing archival voids
○ No need to profile IA
● Archives with access restrictions can have multiple MementoMaps
● Third parties can create and publish MementoMaps of the rest of the archives while they catch up
● Coexist with the ongoing IIPC-funded Bloom filters project
36. @ibnesayeed
MementoMap Call for Adoption
🕮
MementoMap Framework (Doctoral Dissertation)
https://digitalcommons.odu.edu/computerscience_etds/129/
Unified Key Value Store (UKVS)
https://github.com/oduwsdl/ORS/blob/master/ukvs.md
⚙
MementoMap CLI
https://github.com/oduwsdl/MementoMap
MemGator
https://github.com/oduwsdl/MemGator
$ mementomap generate --hcf=4.0 --pcf=2.0 index.cdx[j] mementomap.ukvs
# Provide a sorted list of SURTs on STDIN if not using a CDX[J] index
$ scp mementomap.ukvs ${WEBHOST}:${WEBROOT}/.well-known/mementomap
# Preferably, compress the file and allow content negotiation
✉
Email: sawood@archive.org
Twitter: @ibnesayeed
IIPC Slack: #mementomap