Profiling Web Archival Voids for Memento Routing

Sawood Alam
Internet Archive, San Francisco, CA
Michael L. Nelson and Michele C. Weigle
Old Dominion University, Norfolk, VA
#MementoMap
#ArchivalVoids
@ibnesayeed
@WebSciDL
Supported in part by NSF Grant IIS-1526700
JCDL '21, September 27-30, 2021, Urbana-Champaign, Illinois
https://arxiv.org/abs/2108.03311

MemGator Log Responses From Various Archives
93% of the requests
made from MemGator
to upstream archives
were wasteful.
Only about one third
of the requests to the
largest web archive
(IA) were hits.

Who Bears the Cost of Bad Routing Decisions?
                                      Actual: Present in the Archive   Actual: Not in the Archive
Predicted: Routed to the Archive      True Positive (TP)               False Positive (FP)
Predicted: Not Routed to the Archive  False Negative (FN)              True Negative (TN)
FP: Wasteful (Infrastructure suffers)
FN: Disuse (Users suffer)
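The four routing outcomes can be tallied with a few lines of Python. This is a minimal sketch, not code from the talk: the function names `classify` and `tally` and the `(routed, present)` pair format are assumptions made for illustration.

```python
from collections import Counter

def classify(routed: bool, present: bool) -> str:
    """Label one routing decision against the archive's actual holdings."""
    if routed and present:
        return "TP"  # routed, and the archive had it
    if routed:
        return "FP"  # routed, but the archive had nothing: wasteful
    if present:
        return "FN"  # not routed, although the archive had it: disuse
    return "TN"      # not routed, and the archive had nothing

def tally(decisions):
    """Count outcomes over an iterable of (routed, present) pairs."""
    return Counter(classify(routed, present) for routed, present in decisions)
```

A high FP count signals load on the archive's infrastructure, while a high FN count signals mementos that users never get to see.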

What is Archived in Arquivo.pt?
What is Accessed from MemGator?
2B URI-Rs that have
1-9 mementos each in
Arquivo.pt were never
requested from ODU’s
MemGator server.
43 URI-Rs were
requested thousands
of times each, but
had zero mementos
in Arquivo.pt.
45 URI-Rs had tens
of mementos each
that were requested
hundreds of times.

What is Archived in Arquivo.pt?
What is Accessed from MemGator?
Blind spot of a
usage-based
profile
Blind spot of a
content-based
profile

MementoMap of Archival Holdings Profile
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
A 1.5% Relative Cost yields 60%
Accuracy.
Arquivo.pt can save 60% of
wasted traffic by publishing a
119MB summary file of their
1.8T CDX files (containing 5B
mementos of 2B URI-Rs)

Why Profile Archival Voids?
$ curl -I https://web.archive.org/web/https://quora.com/
HTTP/1.1 403 FORBIDDEN
Server: nginx/1.15.8
Date: Wed, 02 Dec 2020 20:39:33 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Server-Timing: captures_list;dur=0.150497
X-App-Server: wwwb-app58
X-ts: 403
The Internet Archive has
many “*.com” domains,
but it may not want to
capture or replay some.

Sources Accessing TimeMaps
LANL’s usage-based
Archival Holdings
profiling reduced
requests significantly.
Profiling Archival
Voids would improve
it even further.

Archival Voids Profiles Reduce False Positives
org,arxiv)/abs/a 40
org,arxiv)/abs/b 23
org,arxiv)/abs/c 17
org,arxiv)/format/a 15
org,arxiv)/format/b 20
org,arxiv)/format/c 10
org,arxiv)/search/a 30
...
org,arxiv)/abs/* 80
org,arxiv)/format/* 45
org,arxiv)/search/* 60
org,arxiv)/* 185
org,arxiv)/abs/d
False Positive org,arxiv)/pdf/a
org,arxiv)/pdf/b
org,arxiv)/pdf/c
False Positive
org,arxiv)/* 185
org,arxiv)/pdf/* 0
How about summarizing frequently
accessed URIs an archive does not hold?
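One way to read the example above is as a longest-prefix lookup over SURT keys, with a separate voids profile consulted alongside the holdings profile. The sketch below is an illustration under that reading, not the MementoMap implementation: `longest_match` and `should_route` are hypothetical names, and the rule that a void wins only when it is at least as specific as the matching holdings key is an assumption.

```python
from typing import Iterable, Optional

def longest_match(surt: str, keys: Iterable[str]) -> Optional[str]:
    """Return the longest key matching surt; keys ending in '*' match by prefix."""
    best = None
    for key in keys:
        if key.endswith("*"):
            matched = surt.startswith(key[:-1])
        else:
            matched = surt == key
        if matched and (best is None or len(key) > len(best)):
            best = key
    return best

def should_route(surt: str, holdings: Iterable[str], voids: Iterable[str]) -> bool:
    """Route only if holdings match and no equally or more specific void does."""
    held = longest_match(surt, holdings)
    if held is None:
        return False
    void = longest_match(surt, voids)
    # Assumption: a void entry overrides a holdings entry only when it is
    # at least as specific (at least as long) as the holdings key.
    return void is None or len(void) < len(held)

# Keys taken from the slide; the counts are kept only for readability.
holdings = {"org,arxiv)/abs/*": 80, "org,arxiv)/format/*": 45,
            "org,arxiv)/search/*": 60, "org,arxiv)/*": 185}
voids = {"org,arxiv)/pdf/*": 0}
```

With only the holdings profile, `org,arxiv)/pdf/a` matches `org,arxiv)/*` and is routed; adding the `org,arxiv)/pdf/*` void suppresses that request.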

Arquivo.pt Access Log Dataset

TimeMap Status Code Distribution and Fluctuations

Most Frequently Accessed URIs
Most of the traffic to
“fccn.pt” originated
from UptimeRobot
Always returned a “404 Not Found” response.

404-Only Frequencies and Request Savings
An archival voids profile of 2.4k URIs that were each accessed hundreds of
times or more could have saved about 8.4% of wasted requests.
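A profile of this kind could be derived from an access log roughly as follows. This is a hedged sketch rather than the analysis used in the talk: the `(uri, status)` record shape, the `min_requests` threshold, and the choice to measure savings against all 404 responses are assumptions for illustration.

```python
from collections import defaultdict

def build_voids_profile(records, min_requests=100):
    """Collect URIs that only ever returned 404 and were requested often enough."""
    not_found = defaultdict(int)
    seen_other_status = set()
    for uri, status in records:
        if status == 404:
            not_found[uri] += 1
        else:
            seen_other_status.add(uri)  # any other response disqualifies the URI
    return {uri: n for uri, n in not_found.items()
            if n >= min_requests and uri not in seen_other_status}

def saved_fraction(records, profile):
    """Share of 404 requests that the profile would have prevented."""
    wasted = sum(1 for _, status in records if status == 404)
    saved = sum(1 for uri, status in records if status == 404 and uri in profile)
    return saved / wasted if wasted else 0.0
```

Here `records` is expected to be a list, since it is traversed once to build the profile and again to estimate savings. Raising `min_requests` keeps the profile small and limited to resources in high demand.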

Archival Voids Recommendations
● Keep Archival Voids profiles separate from Archival Holdings
● Update often
● Use specific keys, and only with high confidence
● Profile only resources that are high in demand
● Archives themselves are better sources of truth than external observers
