Sawood Alam <@ibnesayeed>
Advisor: Michael L. Nelson
Members: Michele C. Weigle, Jian Wu, Sampath Jayarathna, and Erika F. Frydenlund
MementoMap
A Web Archive Profiling Framework
for Efficient Memento Routing
Doctoral Dissertation Defense, December 04, 2020
Old Dominion University, Norfolk, Virginia - 23529 (USA)
@ibnesayeed | @WebSciDL
Outline
● Introduction and Motivation
● Research Questions
● MementoMap Framework
● RQ1: Understanding Web Archives
○ Archival Holdings
○ Archival Voids
● RQ2: Serialization and Dissemination
● RQ3: Memento Routing
● Contributions, Future Work, and Conclusions
Introduction and Motivation
Homepage of the Smithsonian Institution (SI)
An Early Memento of SI in Wayback Machine
https://web.archive.org/web/19971210203441/http://www.si.edu/newstart.htm
A missing image
The Earliest Memento of SI in Arquivo.pt
https://arquivo.pt/wayback/19961013204418/http://www.si.edu/
List of SI Mementos in the Two Web Archives
https://web.archive.org/web/*/http://si.edu/ https://arquivo.pt/wayback/*/http://si.edu/
List of SI Mementos From an Aggregator
http://timetravel.mementoweb.org/list/19950101000000/http://si.edu/
Mementos from 7 different web archives. Arquivo.pt has the first memento.
$ curl -s https://web.archive.org/web/timemap/link/http://si.edu/
<http://www.si.edu:80/>; rel="original",
<https://web.archive.org/web/http://si.edu/>; rel="timegate",
<https://web.archive.org/web/timemap/link/http://si.edu/>;
rel="self"; type="application/link-format"; from="Fri, 02 May 1997 11:07:51 GMT",
<https://web.archive.org/web/19970502110751/http://www.si.edu:80/>;
rel="first memento"; datetime="Fri, 02 May 1997 11:07:51 GMT",
<https://web.archive.org/web/19970502110751/http://www.si.edu:80/>;
rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT",
<https://web.archive.org/web/19970502110751/http://www.si.edu:80/>;
rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT",
<https://web.archive.org/web/19970728075821/http://www.si.edu:80/>;
rel="memento"; datetime="Mon, 28 Jul 1997 07:58:21 GMT",
<https://web.archive.org/web/19971210203635/http://www.si.edu:80/>;
rel="memento"; datetime="Wed, 10 Dec 1997 20:36:35 GMT",
[...TRUNCATED...]
TimeMap of SI From Wayback Machine
Original URI (URI-R)
Memento URI (URI-M)
$ curl https://arquivo.pt/wayback/timemap/link/http://si.edu/
<https://arquivo.pt/wayback/timemap/link/http://si.edu/>;
rel="self"; type="application/link-format"; from="Sun, 13 Oct 1996 20:44:18 GMT",
<https://arquivo.pt/wayback/http://si.edu/>; rel="timegate",
<http://si.edu/>; rel="original",
<https://arquivo.pt/wayback/19961013204418mp_/http://www.si.edu/>;
rel="memento"; datetime="Sun, 13 Oct 1996 20:44:18 GMT"; collection="$root",
<https://arquivo.pt/wayback/20081025151519mp_/http://www.si.edu/>;
rel="memento"; datetime="Sat, 25 Oct 2008 15:15:19 GMT"; collection="$root",
<https://arquivo.pt/wayback/20090716053258mp_/http://www.si.edu/>;
rel="memento"; datetime="Thu, 16 Jul 2009 05:32:58 GMT"; collection="$root",
<https://arquivo.pt/wayback/20091014121540mp_/http://www.si.edu/>;
rel="memento"; datetime="Wed, 14 Oct 2009 12:15:40 GMT"; collection="$root",
<https://arquivo.pt/wayback/20100529165454mp_/http://www.si.edu/>;
rel="memento"; datetime="Sat, 29 May 2010 16:54:54 GMT"; collection="$root",
[...TRUNCATED...]
TimeMap of SI From Arquivo.pt
$ curl https://memgator.cs.odu.edu/timemap/link/http://si.edu/
<http://si.edu/>; rel="original",
<https://memgator.cs.odu.edu/timemap/link/http://si.edu/>;
rel="self"; type="application/link-format",
<https://arquivo.pt/wayback/19961013204418mp_/http://www.si.edu/>;
rel="first memento"; datetime="Sun, 13 Oct 1996 20:44:18 GMT",
<https://webarchive.loc.gov/all/19970502110751/http://www.si.edu/>;
rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT",
<https://wayback.archive-it.org/all/19970502110751/http://www.si.edu/>;
rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT",
<https://web.archive.org/web/19970502110751/http://www.si.edu:80/>;
rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT",
<https://web.archive.org/web/19970502110751/http://www.si.edu:80/>;
rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT",
<https://web.archive.org/web/19970502110751/http://www.si.edu:80/>;
rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT",
[...TRUNCATED...]
TimeMap of SI From a Memento Aggregator
$ memgator -f cdxj http://si.edu/ | grep -v "^!" | cut -d'/' -f3 | sort | uniq -c | sort -nr
13263 web.archive.org
3590 wayback.archive-it.org
1202 web.archive.bibalex.org
651 webarchive.loc.gov
321 arquivo.pt
32 wayback.vefsafn.is
11 web.archive.org.au
3 archive.is
1 www.webarchive.org.uk
1 swap.stanford.edu
1 perma.cc
$ memgator -f cdxj http://odu.edu/ | grep -v "^!" | cut -d'/' -f3 | sort | uniq -c | sort -nr
3071 web.archive.org
796 wayback.archive-it.org
751 web.archive.bibalex.org
99 webarchive.loc.gov
26 arquivo.pt
2 archive.is
1 wayback.vefsafn.is
Cross-Archive Memento Lookup With MemGator
https://github.com/oduwsdl/MemGator
Although there are 13k+ mementos in IA, there are also mementos in 10 other public web archives.
ODU is less popular, but there are mementos in 7 different web archives.
Who Would Have Thought to Look Up odu.edu Mementos in the Icelandic Web Archive?
http://wayback.vefsafn.is/wayback/20100810032449/http://odu.edu/
Why Aggregate Small Archives?
● One trillion+ mementos in IA’s Wayback Machine do not cover everything
● Archives often have unique mementos (small overlap)
● Linguistic and geolocation diversity
● High-quality curated collections
● Restricted resources and private archives
MemGator Broadcasting
MemGator Log Responses From Various Archives
93% of the requests made from MemGator to upstream archives were wasteful.
Only about one third of the requests to the largest web archive (IA) were a hit.
Aggregation Is Great, But Broadcasting Is Wasteful
What do we want? Aggregate all archives, large or small
What’s the problem? Broadcasting is wasteful and problematic
What’s the solution? Selectively poll archives that are likely to return good results for a lookup URI
How to identify those? Profile web archives
How to profile archives? MementoMap Framework
Memento Lookup Routing
Let us fix the broadcasting issue with more informed routing.
Archive Profiling Strategies
● Complete URI-R Profiling (1 URI-R = 1 Profile Key) [Sanderson et al., TPDL 2012]
○ bbc.co.uk/images/logo.png?w=90
○ cnn.com/2014/03/15/?id=128734
● TLD-Only Profiling (1 TLD = 1 Profile Key) [AlSum, et al., TPDL 2013]
○ *.com
○ *.uk
● Middle Ground
○ *.cnn.com
○ *.co.uk
○ *.bbc.co.uk
○ bbc.co.uk/images/*
We explore these strategies in this work.
Top three archives after IA produce full TimeMaps 52% of the time.
Related Work
Archive Profiling
● Sanderson et al., IIPC 2012
● AlSum et al., IJDL 2014
● Bornand et al., JCDL 2016
● Klein et al., JCDL 2019
Query Routing
● Gravano et al., SIGMOD 1997
● Callan et al., SIGMOD 1999
● Lu et al., CIKM 2003
● Meng et al., CSUR 2002
Bloom Filters
● Bloom, CACM 1970
● Majkowski, Cloudflare 2020
● Broder et al., Internet Mathematics 2003
Web Archive Searching
● Gomes et al., TempWeb 2013
● Costa et al., TempWeb 2013
● Kanhabua et al., TPDL 2016
Archival Web Coverage
● Ainsworth et al., JCDL 2011
● Alkwai et al., ToIS 2017
● SalahEldeen et al., TPDL 2012
● Kelly et al., JCDL 2018
On-Premise Indexing
● Hammer et al., IJCIS 2000
● Kumar et al., RAIT 2016
Surface Web Crawling
● WorldWideWebSize.com
● Lawrence et al., Science 1998
● Alarifi et al., SRE 2012
● Khabsa et al., PLOS ONE 2014
Deep/Hidden Web Crawling
● Raghavan et al., VLDB 2001
● Ntoulas et al., JCDL 2005
● Wu et al., ICDE 2006
● Sheng et al., VLDB 2012
Focused Crawling
● Micarelli et al., The Adaptive Web 2007
● Bergmark et al., ECDL 2002
● Li et al., WI-IAT 2012
Research Questions
● RQ1: Understanding Web Archives
a. How to Learn an Archive’s Holdings?
b. How to Learn an Archive’s Voids?
● RQ2: How to Summarize and Serialize Archival Holdings and Voids for Dissemination?
● RQ3: How to Utilize MementoMaps for Memento Routing?
RQ1a: How to Learn an Archive’s Holdings?
● Content-based Profiling
○ CDX Profiling [Alam, et al., TPDL 2015; Alam, et al., IJDL 2016]
○ Fulltext Search Profiling [Alam, et al., TPDL 2016]
● Usage-based Profiling
○ Sample URI Profiling
○ Response Cache Profiling [Bornand, et al., JCDL 2016]
RQ1b: How to Learn an Archive’s Voids?
● Content-based Profiling
○ Collection Exclusion Policies
○ Access Control Lists (ACLs)
● Usage-based Profiling
○ Archive’s Access Log Profiling
○ Aggregator’s Access Log Profiling
RQ2: How to Summarize and Serialize Archival Holdings and Voids for Dissemination?
● Generation and Compaction
● Updates and Merger
○ Incremental updates
○ Distributed generation
● Pagination
○ Small chunks for storage and transportation
○ Time- or TLD-based organization
○ Holdings and Voids segregation
● Dissemination and Discovery
RQ3: How to Utilize MementoMaps for Memento Routing?
● Inverted Index
○ Precomputed index of MementoMaps
● Routing Score Estimation
○ Rank ordering each candidate archive
○ Routing to top-k archives
○ Routing to archives with score above certain threshold
● Machine Learning-Based Classifier
○ Models for individual web archives
MementoMap Framework
MementoMap Framework Components
● Ingestion (RQ1)
○ CDX files/API
○ Fulltext search
○ Access logs
○ Sample URIs
● Summarization and Serialization (RQ2)
○ Resource constraints
○ Application-specific variants
● Memento Routing (RQ3)
○ Integration with aggregators
Evaluation Plan
● Cost
○ Time
○ Storage space
○ Network bandwidth
○ Periodic updates
● Accuracy
○ Aim for fewer false positives and false negatives in individual MementoMaps
● Freshness
○ How often do MementoMaps of a web archive need to be updated?
● Routing Efficiency
○ Accuracy of inverted index across multiple web archives
What is Archived in Arquivo.pt?
What is Accessed from MemGator?
2B URI-Rs that have 1-9 mementos each in Arquivo.pt were never requested from ODU’s MemGator server.
43 URI-Rs were requested thousands of times each, but had zero mementos in Arquivo.pt.
45 URI-Rs with tens of mementos each were requested hundreds of times.
Blind spot of a usage-based profile
Blind spot of a content-based profile
Who Bears the Cost of Bad Routing Decisions?
                                      Actual: Present in the Archive   Actual: Not in the Archive
Predicted: Routed to the Archive      True Positive (TP)               False Positive (FP)
Predicted: Not Routed to the Archive  False Negative (FN)              True Negative (TN)
FP: Wasteful (Infrastructure suffers)
FN: Disuse (Users suffer)
Recall and Accuracy
Recall = TP / (TP + FN)
How many lookup URIs are routed among URIs that are present in an archive?
Accuracy = (TP + TN) / (TP + TN + FP + FN)
How many lookup URIs are correctly routed or not routed?
We do not report Precision because it does not capture TNs, which are crucial in Memento routing.
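The two metrics can be computed directly from the confusion-matrix counts. The following is a minimal sketch; the `RoutingOutcome` class name and the example counts are made up for illustration and do not come from the evaluation.

```python
from dataclasses import dataclass


@dataclass
class RoutingOutcome:
    """Confusion-matrix counts for one archive (hypothetical container)."""
    tp: int  # routed, and the archive holds the URI
    fp: int  # routed, but the archive does not hold the URI
    fn: int  # not routed, although the archive holds the URI
    tn: int  # not routed, and the archive does not hold the URI

    def recall(self) -> float:
        # Share of held URIs that were actually routed to the archive.
        held = self.tp + self.fn
        return self.tp / held if held else 0.0

    def accuracy(self) -> float:
        # Share of all routing decisions (routed or not) that were correct.
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0


outcome = RoutingOutcome(tp=40, fp=10, fn=5, tn=945)  # made-up counts
print(f"recall={outcome.recall():.3f} accuracy={outcome.accuracy():.3f}")
```

Because true negatives dominate when an archive holds only a small share of the lookup URIs, Accuracy rewards correctly skipping an archive, which Precision would ignore.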
RQ1a: How to Learn an Archive’s Holdings?
URI Canonicalization and SURT
https://news.bbc.co.uk/images/Logo.png?width=200&height=80&rotate=90%C2%B0#top
http://www.news.BBC.co.uk/images/Logo.png?width=200&height=80&rotate=90%c2%b0#top
http://www.news.bbc.co.uk/images/Logo.png?rotate=90%c2%B0&width=200&height=80
http://NEWS.BBC.CO.UK:80//images//Logo.png?height=80&width=200&rotate=90%c2%b0#top
Canonicalization: news.bbc.co.uk/images/Logo.png?height=80&rotate=90%C2%B0&width=200
SURT: uk,co,bbc,news,)/images/logo.png?height=80&rotate=90%c2%b0&width=200
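A simplified sketch of how the four URI variants can collapse into one SURT key is given here. It is not the canonicalizer used by any archive: the `simple_surt` function is hypothetical, it drops every port (not only default ones), and it ignores many rules that production canonicalizers apply.

```python
import re
from urllib.parse import urlsplit


def simple_surt(uri: str) -> str:
    """Approximate SURT key: reversed host, lowercased path, sorted query."""
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = (parts.hostname or "").lower()  # hostname drops the port
    if host.startswith("www."):
        host = host[4:]
    reversed_host = ",".join(reversed(host.split("."))) + ",)"
    # Collapse repeated slashes and lowercase the path.
    path = re.sub(r"/{2,}", "/", parts.path or "/").lower()
    # Lowercase and sort query parameters; the fragment is discarded.
    query = "&".join(sorted(parts.query.lower().split("&"))) if parts.query else ""
    return reversed_host + path + ("?" + query if query else "")


variants = [
    "https://news.bbc.co.uk/images/Logo.png?width=200&height=80&rotate=90%C2%B0#top",
    "http://www.news.BBC.co.uk/images/Logo.png?width=200&height=80&rotate=90%c2%b0#top",
    "http://www.news.bbc.co.uk/images/Logo.png?rotate=90%c2%B0&width=200&height=80",
    "http://NEWS.BBC.CO.UK:80//images//Logo.png?height=80&width=200&rotate=90%c2%b0#top",
]
keys = {simple_surt(v) for v in variants}
print(keys)
```

All four variants yield the single key `uk,co,bbc,news,)/images/logo.png?height=80&rotate=90%c2%b0&width=200`, which is what makes SURT keys suitable for sorting and prefix matching.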
CDX/CDXJ Summarization
http://archive.org/web/researcher/cdx_file_format.php
URI-Key Generation and Static Profiling Policies
● HmPn Policy (maximum “m” host segments and “n” path segments)
○ H3P0 (3 host and 0 path segments): uk,co,bbc,)/
○ HxP1 (all host segments and 1 path segment): uk,co,bbc,news,)/images
● DLim Policy (“RegisteredDomain[#SubDomain[/#Paths[/#Queries[/PathInitial]]]]”)
○ DDom (up to domain name): uk,co,bbc,)/
○ DIni (up to path initial): uk,co,bbc,)/1/2/3/i
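The HmPn policy can be illustrated with a short sketch that truncates a SURT to at most m host segments and n path segments. The `hmpn_key` function and its parameters are hypothetical names chosen for this example; passing `None` for the host limit stands in for the “x” (all segments) case.

```python
from typing import Optional


def hmpn_key(surt: str, max_host: Optional[int], max_path: int) -> str:
    """Truncate a SURT to at most max_host host and max_path path segments."""
    host_part, _, rest = surt.partition(")")
    host_segments = [s for s in host_part.split(",") if s]
    path = rest.split("?", 1)[0]  # query parameters are not part of the key
    path_segments = [s for s in path.split("/") if s]
    # Slicing with None keeps every host segment (the "Hx" case).
    kept_host = host_segments[:max_host]
    kept_path = path_segments[:max_path]
    return ",".join(kept_host) + ",)/" + "/".join(kept_path)


surt = "uk,co,bbc,news,)/images/logo.png?height=80&rotate=90%c2%b0&width=200"
print(hmpn_key(surt, 3, 0))     # H3P0
print(hmpn_key(surt, None, 1))  # HxP1
```

The two calls produce `uk,co,bbc,)/` and `uk,co,bbc,news,)/images`, matching the H3P0 and HxP1 examples.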
Archives Dataset
Archive      URI-Rs   URI-Ms   Index Size
Archive-It   1.9B     5.3B     1.8TB
UKWA         0.7B     1.7B     0.5TB
Stanford     12M      25M      8.3GB
Sample Query URI Sets
Sample (1M URIs Each)   In Archive-It   In UKWA   In Stanford   Union {AIT, UK, SU}
DMOZ                    4.097%          3.594%    0.034%        7.575%
MementoProxy            4.182%          0.408%    0.046%        4.527%
IAWayback               3.716%          0.519%    0.039%        4.165%
UKWayback               0.108%          0.034%    0.002%        0.134%
Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
CDX Size vs URI-M (UKWA 10 Years)
Alpha: 175 bytes per CDX line
Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
URI-M vs URI-R (UKWA 10 Years)
Gamma: 2.46 K: 2.686
Beta: 0.911
Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
Relative Space Cost (UKWA 7 Years)
Phi: 8.5e-07 -- 0.70583
Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
Time Cost (UKWA 7 Years)
Tau: 5.7e-05 -- 6.2e-05
CDX: 45GB
URI-Ms: 181M
URI-Rs: 96M
Time: 3 hours
Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
Resource Requirement
Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
Cost vs. Accuracy: UKWA
Archive-It and Stanford archives have similar trends.
Profile Policy Groups: Cost vs. Accuracy
Group   Policies                       Relative Cost        Accuracy
G1      H1P0/TLD                       Bound by # of TLDs   ≈ 0.01
G2      H3P0, DDom, DSub, DPth, DQry   < 0.01               ≈ 0.78
G3      DIni                           ≈ 2 * G2             ≈ 0.88
G4      HxP1                           ≈ 5 * G3             ≈ 0.94
G5      Higher HmPn                    0.4 - 0.7            Not Explored
G6      URIR                           1.0                  1.0
Collecting CDX Is Difficult
https://memgator.cs.odu.edu/
MemGator Service at ODU currently aggregates 16 web archives, but we have CDX data only from 4.
However, some of these archives have fulltext search support, so we can learn about their holdings.
Who Knows Term Frequency for Estonian Nouns?
https://en.wiktionary.org/wiki/Category:Estonian_nouns
Fulltext Search Profiling
Top Nouns: time, year, people, way, man, day, thing, child, mr, government
Random Dict: analogies, unbolt, consonant, coils, stolidly, cigar, decrepit, rhododendron, cannibal, honeydew
Dynamic Words Discovery:
the ‫وﻛﺎﻟﺔ‬ war
angry ‫أﻧﺑﺎء‬ the
arab ‫اﻟﻌرﺑﻲ‬ middle
news ‫اﻟﻐﺎﺿب‬ east
service on arabic
a politics poetry
source war art
Random Searcher Model (RSM)
● Search for a word
● Collect resulting links
○ http://jeffreyhill.typepad.com/english/
○ http://www.nc-net.info/english.php
○ http://english.aljazeera.net/
○ http://twitter.com/AJEnglish
○ https://vimeo.com/248815105
○ http://www.bridge.edu/
○ http://www.wordreference.com/
○ https://www.facebook.com/aljazeera
○ http://www.elizabethangardens.org/
○ ...
● Load a random result link
● Extract words from the page
○ Teaching Resources Adjunct Toolkit NC NET Academy PD Planning Tools Regional Centers Campus Liaisons Nontraditional Careers College Tech Prep NC ACCESS Co op Education Green Technology You are here NC NET Teaching Resources Discipline Specific English English Self Paced Modules Writing Across the Curriculum NC NET Western Center Incorporating Visuals in Workplace Documents Sections 1 2 Wake Tech Community College Incorporating Visuals in Workplace Documents Section 3 Wake Tech Community College All self paced modules can be accessed through the NC NET Blackboard server Log in with the user name faculty and the password nc net Once connected you can view the courses by topic or alphabetically by title English Webliography North Carolina Community College System 2012
● Select a word to search
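The Random Searcher Model loop (search for a word, collect links, load a random result, extract words, select the next word) can be sketched as follows. This is an illustrative outline only: `random_searcher`, the injected `search` and `extract_words` callables, and the toy index are all hypothetical, and the fallback to the seed word on a dead end is an assumption rather than a documented behavior of any RSM mode.

```python
import random
from typing import Callable, Iterable, Optional, Set


def random_searcher(
    seed_word: str,
    search: Callable[[str], Iterable[str]],
    extract_words: Callable[[str], Iterable[str]],
    max_searches: int,
    rng: Optional[random.Random] = None,
) -> Set[str]:
    """Sample an archive's holdings through its fulltext search interface."""
    rng = rng or random.Random()
    discovered: Set[str] = set()
    word = seed_word
    for _ in range(max_searches):
        links = list(search(word))         # search for a word
        discovered.update(links)           # collect resulting links
        if not links:
            word = seed_word               # dead end: restart (assumption)
            continue
        page = rng.choice(links)           # load a random result link
        words = list(extract_words(page))  # extract words from the page
        word = rng.choice(words) if words else seed_word  # select next word
    return discovered


# Toy stand-ins for a search API and a page fetcher (made-up data).
FAKE_INDEX = {
    "english": ["http://a.example/", "http://b.example/"],
    "news": ["http://c.example/"],
}
FAKE_PAGES = {
    "http://a.example/": ["news", "english"],
    "http://b.example/": ["english"],
    "http://c.example/": ["news"],
}
found = random_searcher(
    "english",
    lambda w: FAKE_INDEX.get(w, []),
    lambda u: FAKE_PAGES.get(u, []),
    max_searches=50,
    rng=random.Random(7),
)
print(sorted(found))
```

Injecting the search and extraction functions keeps the loop independent of any particular archive's API, so the same sketch can be exercised against fake data.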
RSM Modes Costs
Mode               HTTP Cost              Remarks
Static             C                      Suitable for specialized collections with known top keywords
PopularityBiased   2C                     Human-like model, but costly
EqualOpportunity   2C                     Human-like model, but costly
Conservative       C + 𝛿 (where 𝛿 << C)   Suitable for any collection; works without any supplementary materials and with very little overhead
“C” is the cost/number of search queries.
Search Needed vs. Coverage
100% in 11K searches
100% in 27K searches
100% in 337K searches
100% in 1.9M searches
Alam et al., “Web Archive Profiling Through Fulltext Search”, TPDL 2016
Accuracy, Recall, and Coverage (10-100%)
Panels: DMOZ, IA Wayback, UK Wayback, Memento Proxy
Low Accuracy (high FP) => Archives & Aggregator suffer
Low Recall (high FN) => Users suffer
Alam et al., “Web Archive Profiling Through Fulltext Search”, TPDL 2016
RQ1b: How to Learn an Archive’s Voids?
Why Profile Archival Voids?
$ curl -I https://web.archive.org/web/https://quora.com/
HTTP/1.1 403 FORBIDDEN
Server: nginx/1.15.8
Date: Wed, 02 Dec 2020 20:39:33 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Server-Timing: captures_list;dur=0.150497
X-App-Server: wwwb-app58
X-ts: 403
The Internet Archive has many “*.com” domains, but it may not want to capture or replay some.
Archival Voids Profiles Reduce False Positives
org,arxiv)/abs/a 40
org,arxiv)/abs/b 23
org,arxiv)/abs/c 17
org,arxiv)/format/a 15
org,arxiv)/format/b 20
org,arxiv)/format/c 10
org,arxiv)/search/a 30
...
org,arxiv)/abs/* 80
org,arxiv)/format/* 45
org,arxiv)/search/* 60
org,arxiv)/* 185
org,arxiv)/abs/d
False Positive org,arxiv)/pdf/a
org,arxiv)/pdf/b
org,arxiv)/pdf/c
False Positive
org,arxiv)/* 185
org,arxiv)/pdf/* 0
How about summarizing frequently accessed URIs an archive does not hold?
Arquivo.pt Access Log Dataset
Most Frequently Accessed URIs
Most of the traffic to “fccn.pt” originated from UptimeRobot.
Always returned a “404 Not Found” response.
404-Only Frequencies and Request Savings
An archival voids profile of 2.4k URIs that were each accessed hundreds of times or more could have saved about 8.4% of wasted requests.
Archival Voids Recommendations
● Keep archival voids profiles separate from archival holdings
● Update often
● Use specific keys with only high confidence
● Profile only resources that are high in demand
● Archives themselves are better sources of truth than external observers
RQ2: How to Summarize and Serialize Archival Holdings and Voids for Dissemination?
If Only Archives Could Tell What to Ask Them For
● Websites advertise their holdings using sitemap.xml, why can’t archives?
○ Archives have billions or even trillions of URI-Ms
○ Such exhaustive lists would go stale very quickly
● How about robots.txt?
○ It is compact, but it is an exclusion format; it does not tell what the site has
○ It assumes a single domain; patterns are for paths (not the domain name)
● How about well-known URIs?
○ Good for automated discovery of domain-specific metadata resources
● How about combining these ideas?
○ Introducing MementoMap!
A MementoMap Example
!context ["http://oduwsdl.github.io/contexts/ukvs"]
!id {uri: "http://archive.example.org/"}
!fields {keys: ["surt"], values: ["frequency"]}
!meta {type: "MementoMap", name: "A Test Web Archive", year: 1996}
!meta {updated_at: "2018-09-03T13:27:52Z"}
* 54321/20000
com,* 10000+
org,arxiv)/ 100
org,arxiv)/* 2500~/900
org,arxiv)/pdf/* 0
uk,co,bbc)/images/* 300+/20-
https://github.com/oduwsdl/ORS/blob/master/ukvs.md
Goodbye HmPn/DLim static profiling policies, thanks to our SURT with wildcard.
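To show how wildcard keys might be consulted, here is a minimal longest-prefix lookup over the example entries. It is a sketch under assumptions: the `lookup` and `matches` helpers are hypothetical, frequency values are kept as opaque strings, and a linear scan replaces whatever index structure a real implementation would use.

```python
from typing import Dict, Optional, Tuple

# Data lines from the example MementoMap (values kept as raw strings).
MEMENTOMAP: Dict[str, str] = {
    "*": "54321/20000",
    "com,*": "10000+",
    "org,arxiv)/": "100",
    "org,arxiv)/*": "2500~/900",
    "org,arxiv)/pdf/*": "0",
    "uk,co,bbc)/images/*": "300+/20-",
}


def matches(key: str, surt: str) -> bool:
    """A wildcard key matches by prefix; any other key must match exactly."""
    if key.endswith("*"):
        return surt.startswith(key[:-1])
    return surt == key


def lookup(surt: str, mementomap: Dict[str, str] = MEMENTOMAP) -> Optional[Tuple[str, str]]:
    """Return the most specific matching key and its value, if any."""
    candidates = [k for k in mementomap if matches(k, surt)]
    if not candidates:
        return None
    # Longest prefix wins; on a tie, an exact key beats a wildcard key.
    best = max(candidates, key=lambda k: (len(k.rstrip("*")), not k.endswith("*")))
    return best, mementomap[best]


print(lookup("org,arxiv)/pdf/1234.5678"))
print(lookup("org,arxiv)/abs/1234.5678"))
print(lookup("com,example)/"))
```

The PDF lookup resolves to `org,arxiv)/pdf/*` with value `0`, so a router could skip this archive for that URI even though the broader `org,arxiv)/*` key reports holdings.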
SURT Representation With Wildcard
Original SURTs did not have wildcards. We introduced them for dynamic profiling.
In practice, the common “http://(” prefix is removed.
Arquivo.pt Index Statistics
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
The Internet Archive is about 150 times bigger than Arquivo.pt.
Top Arquivo.pt TLDs
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
Arquivo.pt was created to archive sites of interest to the Portuguese people.
One third of Arquivo.pt consists of pages outside *.pt, hence routing based on top TLDs would miss a lot.
Who Would Have Thought Arquivo.pt Has 10K+ .онлайн Sites?
“.онлайн” (encoded as “xn--80asehdb”) is an IDN gTLD which means “.online”
Cumulative Growth of URI-Ms and URI-Rs in Arquivo.pt
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
50% of mementos were captured in the last two active years alone.
Shape of HxPx Key Tree of Arquivo.pt
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
Incremental Children Reduction Rate
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
Processed Lines vs. Compacted MementoMap Growth
com,example)/a/1/x
com,example)/a/2
com,example)/a/3
com,example)/b/1
com,example)/b/2
com,example)/c/1
com,example)/a/*
com,example)/b/1
com,example)/b/2
com,example)/c/1
com,example)/*
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
MementoMap Generation, Compaction, and Lookup
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
1.5% Relative Cost yields 60% Accuracy.
Arquivo.pt can save 60% wasted traffic by publishing a 119MB summary file!
Dissemination and Discovery Methods
Well-known URI
GET /.well-known/mementomap HTTP/1.1
Host: arquivo.pt
Link Header
Link: <https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap"
Link HTML Element
<link href="https://arquivo.pt/path/to/mementomap.ukvs" rel="mementomap">
RQ3: How to Utilize MementoMaps for Memento Routing?
Putting Archival Holdings and Voids to Work
● Combine MementoMaps (holdings and/or voids) of various web archives
● Create an inverted index for efficient cross-archive lookup
● Define scores based on reported data in MementoMaps
● Predict whether a lookup URI should be routed to an archive
How Densely Is a URI Subtree Archived?
* /10000000000+
edu,odu)/* /10000~
Where to go?
https://www.odu.edu/academics/graduation-commencement/graduation/FAQs
* /10000+
edu,odu)/* /1
Density Score
A normalized score to assess how actively an archive is capturing a given URI space.
Lookup URI: org,example)/a/b/c/d/e
* /1000000
org,example)/a/b/c/* 200/100
µ = log(1 + 100) / log(1 + 1000000) ≈ 0.334
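The Density Score arithmetic can be checked with a few lines of code. The `density_score` function name is hypothetical; the formula and the two frequencies (100 under the matching key, 1000000 under the root key) are taken from the worked example. Since the score is a ratio of logarithms, the choice of base does not affect the result.

```python
import math


def density_score(key_frequency: int, total_frequency: int) -> float:
    """Log-scaled share of an archive's holdings under the matching URI key."""
    return math.log(1 + key_frequency) / math.log(1 + total_frequency)


mu = density_score(100, 1_000_000)
print(round(mu, 3))  # 0.334
```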
How Close Are a Lookup URI and Its Corresponding URI Key?
Where to go?
https://www.odu.edu/academics/graduation-commencement/graduation/FAQs
edu,odu)/academics/graduation-commencement/* /100
edu,odu)/* /100
Closeness Score
A normalized score to assess how closely the longest URI Key prefix matches with the lookup URI.
URI Key: org,example)/a/b/c/* (L_k = 5)
Lookup URI: org,example)/a/b/c/d/e (L_l = 7)
χ = min(log(1 + 5), 1) / (1 + 7 − 5) ≈ 0.259
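The Closeness Score example can be reproduced with the sketch below. Two points are assumptions inferred from the numbers rather than stated in the source: the logarithm is taken to base 10 (the only common base that yields 0.259), and lengths are counted as host segments plus path segments, ignoring the trailing wildcard. The function names are hypothetical.

```python
import math


def segment_count(surt: str) -> int:
    """Count host and path segments of a SURT, ignoring a trailing wildcard."""
    host, _, path = surt.partition(")")
    host_segments = [s for s in host.split(",") if s]
    path_segments = [s for s in path.split("/") if s and s != "*"]
    return len(host_segments) + len(path_segments)


def closeness_score(uri_key: str, lookup_surt: str) -> float:
    """Score how closely the matching URI key approaches the lookup URI."""
    l_k = segment_count(uri_key)      # length of the URI key
    l_l = segment_count(lookup_surt)  # length of the lookup URI
    # Base-10 logarithm is an assumption that reproduces the 0.259 example.
    return min(math.log10(1 + l_k), 1.0) / (1 + l_l - l_k)


chi = closeness_score("org,example)/a/b/c/*", "org,example)/a/b/c/d/e")
print(round(chi, 3))  # 0.259
```

Under these assumptions the score grows as the key gets longer and as the gap between key and lookup URI shrinks.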
How Likely Is the Lookup URI to Be Found in Each of the Archives?
Which ones to select?
https://www.odu.edu/academics/graduation-commencement/graduation/FAQs
[Figure: routing scores (0.00 to 1.00) of four archives against a cut-off threshold]
Routing Score
A normalized score to assess the likelihood of finding the lookup URI in an archive.
ρ = µ * χ ≈ 0.334 * 0.259 ≈ 0.087
Relative Routing Score
Weighted Relative Routing Score
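Putting the worked numbers together, the Routing Score is the product of the Density and Closeness scores. The sketch below recomputes both factors from the example values (frequency 100 of 1000000, key length 5, lookup length 7); the base-10 logarithm in the closeness term is an assumption inferred from the 0.259 value, and the function name is hypothetical.

```python
import math


def routing_score(key_freq: int, total_freq: int, key_len: int, lookup_len: int) -> float:
    """Product of the density and closeness scores for one archive."""
    density = math.log(1 + key_freq) / math.log(1 + total_freq)
    # Base-10 logarithm is an assumption matching the worked example.
    closeness = min(math.log10(1 + key_len), 1.0) / (1 + lookup_len - key_len)
    return density * closeness


rho = routing_score(100, 1_000_000, 5, 7)
print(round(rho, 3))  # 0.087
```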
MementoMaps of Different Archives
Inverted Index of MementoMaps
Inverted Index Lookup Result
Not profiled (default routing score)
Archival Void (“0” routing score)
Relative Routing Scores of all archives add up to “1.0”
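One way to obtain relative scores that add up to 1.0 is to divide each archive's routing score by the sum over all archives, and then select archives by rank or by a cut-off threshold. This is a sketch of that idea only: the function names, archive labels, and scores are made up, and the weighting scheme of the weighted variant is not covered.

```python
from typing import Dict, List


def relative_scores(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize routing scores so they sum to 1.0 across archives."""
    total = sum(scores.values())
    if total == 0:
        return {archive: 0.0 for archive in scores}
    return {archive: score / total for archive, score in scores.items()}


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Pick the k best-scoring archives, skipping voids (score 0)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [a for a in ranked if scores[a] > 0][:k]


def above_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Pick every archive whose score reaches the cut-off threshold."""
    return [a for a, s in scores.items() if s >= threshold]


# Made-up routing scores for four hypothetical archives.
raw = {"archive-a": 0.087, "archive-b": 0.030, "archive-c": 0.0, "archive-d": 0.003}
rel = relative_scores(raw)
print(rel)
print(top_k(rel, 2))
print(above_threshold(rel, 0.2))
```

With these numbers both selection policies route the lookup to `archive-a` and `archive-b`, while the void (`archive-c`) is never polled.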
Web Archive Routing Evaluation Dataset
Created from the MemGator access logs
Archival Collection Diffusion
Simple TLD-based Memento routing is insufficient.
Baseline Routing
Large Recall values and small request costs of top-1 and top-2 policies show the effectiveness of our heuristic Routing Score.
Cut-off Threshold Routing
Inclusion of Voids profiles improves Savings more prominently than Accuracy due to frequency bias.
Poor prevalence of sample URIs in UKWA and Stanford hurts their scores.
Machine Learning-Based Routing
The classifier is biased towards Accuracy at the cost of poor Recall due to the poor prevalence of positive cases in the dataset.
Contributions, Future Work, and Conclusions
Tools Contributions in the Web Archiving Ecosystem
Open-Source Tools/Scripts
❖ https://github.com/oduwsdl/ipwb
❖ https://github.com/oduwsdl/Reconstructive
❖ https://github.com/oduwsdl/MemGator
❖ https://github.com/oduwsdl/archive_profiler
❖ https://github.com/oduwsdl/accesslog-parser
❖ https://github.com/oduwsdl/MementoMap
❖ https://github.com/oduwsdl/ORS
❖ https://jekyll.github.io/classifier-reborn/
Contributions: Algorithms
● Random Searcher Model (RSM)
○ Utilize fulltext search interface to sample archival holdings
○ Supports multiple modes of operation
○ Discovery: 10% => Accuracy: 0.8, Recall: 0.9
● MementoMap Generation, Compaction, and Merger
○ Consumes a sorted list of URIs in SURT format
○ Allows configuration options to control compaction
○ Linear, single-pass, small constant memory footprint (irrespective of the input size)
○ Accuracy > 0.6, Recall: 1.0, Relative Cost < 1.5%
Contributions: Terminologies and Metrics
● URI Key: Extended SURT with wildcard support to describe subtrees of the URI space in the form of URI prefixes
● Archival Holdings: A measure to describe holdings of an archive
● Archival Voids: A measure to describe what an archive is missing
● Relative Cost: The ratio of the number of URI keys used to describe summarized holdings of an archive over the total number of unique URI-Rs in the archive
● Frequency Score: A means to represent the number of URI-Ms and/or URI-Rs under a URI key
● Density Score: A normalized score derived from the frequency score to describe the archiving activity under a URI key
● Closeness Score: A normalized score to describe how similar or different two URI keys are
● Routing Score: A normalized score to represent how likely it is that an archive has a URI
Contributions: Publications
RQ1
● TPDL 2015
● TCDL 2015
● IJDL 2016
● TPDL 2016
● JCDL 2017 🥇
● JCDL 2018 🥇
● JCDL 2019
● WADL 2019
● ESCAPE 2019
● WADL 2020
● JCDL 2021
● IJDL 2021
RQ2
● TPDL 2015
● TCDL 2015
● IJDL 2016
● JCDL 2016
● TPDL 2016
● TPDL 2016
● JCDL 2017
● TCDL 2017
● JCDL 2018
● JCDL 2019 ⭐
● IJDL 2021
● RFC
● RFC
RQ3
● JCDL 2016
● TCDL 2017
● ESCAPE 2019
● JCDL 2021
● IJDL 2021
🥇 Best paper/poster award
⭐ Best paper nomination
Italics = Planned/in progress
Future Work
● URI Keys for collection seed summarization and collection diversity measure
● Archival Voids to assess crawl jobs
● Explore UKVS file format use cases in web archiving and beyond
● Profiling other dimensions
○ Datetime
○ Language
● Standards for cross-archive collection exploration
● Alternate approaches to URI subtree rollups
● Other ML and Neural Net-based techniques for Memento routing
● Hierarchical network Memento routing
● Adoption of MementoMap framework by archives and aggregators
Future Work: MementoMap Adoption Path
● PWA, UKWA, and NLA have shown interest
● PyWB archival replay system is open for implementation
● MemGator and LANL’s Time Travel service are interested
● Big web archives can start with publishing archival voids
○ No need to profile IA
● Archives with access restrictions can have multiple
MementoMaps
● Third parties can create and publish MementoMaps of the
rest of the archives while they catch up
106
107
<salam@cs.odu.edu> <sawood@archive.org>
@ibnesayeed | @WebSciDL
● Introduced challenges in Memento aggregation
○ Broadcasting can be evil
○ Profiling is desired for collection understanding and better Memento routing
○ Aggregation is useful, even for the small web archives
● MementoMap framework addresses three research questions
○ RQ1: How to learn about a web archive’s holdings and voids?
■ Holdings: CDX and fulltext search profiling
■ Voids: Access log profiling
○ RQ2: How to summarize and serialize archival holdings and voids?
■ MementoMap/UKVS serialization and dissemination
○ RQ3: How to utilize MementoMaps for informed Memento routing?
■ Inverted Index, Routing Score, Classifier
Conclusions
108
Over 96% Accuracy with 89% Recall or 68% Accuracy with 99% Recall
via a 120MB MementoMap file for an archive with 2B+ unique URI-Rs.

MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing

  • 1. Sawood Alam <@ibnesayeed> Advisor: Michael L. Nelson Members: Michele C. Weigle, Jian Wu, Sampath Jayarathna, and Erika F. Frydenlund MementoMap A Web Archive Profiling Framework for Efficient Memento Routing Doctoral Dissertation Defense, December 04, 2020 Old Dominion University, Norfolk, Virginia - 23529 (USA)
  • 2. @ibnesayeed | @WebSciDL ● Introduction and Motivation ● Research Questions ● MementoMap Framework ● RQ1: Understanding Web Archives ○ Archival Holdings ○ Archival Voids ● RQ2: Serialization and Dissemination ● RQ3: Memento Routing ● Contributions, Future Work, and Conclusions Outline 2
  • 4. @ibnesayeed | @WebSciDL 4 Homepage of the Smithsonian Institution (SI)
  • 5. @ibnesayeed | @WebSciDL 5 An Early Memento of SI in Wayback Machine https://web.archive.org/web/19971210203441/http://www.si.edu/newstart.htm A missing image
  • 6. @ibnesayeed | @WebSciDL 6 The Earliest Memento of SI in Arquivo.pt https://arquivo.pt/wayback/19961013204418/http://www.si.edu/
  • 7. @ibnesayeed | @WebSciDL 7 List of SI Mementos in the Two Web Archives https://web.archive.org/web/*/http://si.edu/ https://arquivo.pt/wayback/*/http://si.edu/
  • 8. @ibnesayeed | @WebSciDL 8 List of SI Mementos From an Aggregator http://timetravel.mementoweb.org/list/19950101000000/http://si.edu/ Mementos from 7 different web archives. Arquivo.pt has the first memento.
  • 9. @ibnesayeed | @WebSciDL 9 $ curl -s https://web.archive.org/web/timemap/link/http://si.edu/ <http://www.si.edu:80/>; rel="original", <https://web.archive.org/web/http://si.edu/>; rel="timegate", <https://web.archive.org/web/timemap/link/http://si.edu/>; rel="self"; type="application/link-format"; from="Fri, 02 May 1997 11:07:51 GMT", <https://web.archive.org/web/19970502110751/http://www.si.edu:80/>; rel="first memento"; datetime="Fri, 02 May 1997 11:07:51 GMT", <https://web.archive.org/web/19970502110751/http://www.si.edu:80/>; rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT", <https://web.archive.org/web/19970502110751/http://www.si.edu:80/>; rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT", <https://web.archive.org/web/19970728075821/http://www.si.edu:80/>; rel="memento"; datetime="Mon, 28 Jul 1997 07:58:21 GMT", <https://web.archive.org/web/19971210203635/http://www.si.edu:80/>; rel="memento"; datetime="Wed, 10 Dec 1997 20:36:35 GMT", [...TRUNCATED...] TimeMap of SI From Wayback Machine Original URI (URI-R) Memento URI (URI-M)
  • 10. @ibnesayeed | @WebSciDL 10 $ curl https://arquivo.pt/wayback/timemap/link/http://si.edu/ <https://arquivo.pt/wayback/timemap/link/http://si.edu/>; rel="self"; type="application/link-format"; from="Sun, 13 Oct 1996 20:44:18 GMT", <https://arquivo.pt/wayback/http://si.edu/>; rel="timegate", <http://si.edu/>; rel="original", <https://arquivo.pt/wayback/19961013204418mp_/http://www.si.edu/>; rel="memento"; datetime="Sun, 13 Oct 1996 20:44:18 GMT"; collection="$root", <https://arquivo.pt/wayback/20081025151519mp_/http://www.si.edu/>; rel="memento"; datetime="Sat, 25 Oct 2008 15:15:19 GMT"; collection="$root", <https://arquivo.pt/wayback/20090716053258mp_/http://www.si.edu/>; rel="memento"; datetime="Thu, 16 Jul 2009 05:32:58 GMT"; collection="$root", <https://arquivo.pt/wayback/20091014121540mp_/http://www.si.edu/>; rel="memento"; datetime="Wed, 14 Oct 2009 12:15:40 GMT"; collection="$root", <https://arquivo.pt/wayback/20100529165454mp_/http://www.si.edu/>; rel="memento"; datetime="Sat, 29 May 2010 16:54:54 GMT"; collection="$root", [...TRUNCATED...] TimeMap of SI From Arquivo.pt
  • 11. @ibnesayeed | @WebSciDL 11 $ curl https://memgator.cs.odu.edu/timemap/link/http://si.edu/ <http://si.edu/>; rel="original", <https://memgator.cs.odu.edu/timemap/link/http://si.edu/>; rel="self"; type="application/link-format", <https://arquivo.pt/wayback/19961013204418mp_/http://www.si.edu/>; rel="first memento"; datetime="Sun, 13 Oct 1996 20:44:18 GMT", <https://webarchive.loc.gov/all/19970502110751/http://www.si.edu/>; rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT", <https://wayback.archive-it.org/all/19970502110751/http://www.si.edu/>; rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT", <https://web.archive.org/web/19970502110751/http://www.si.edu:80/>; rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT", <https://web.archive.org/web/19970502110751/http://www.si.edu:80/>; rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT", <https://web.archive.org/web/19970502110751/http://www.si.edu:80/>; rel="memento"; datetime="Fri, 02 May 1997 11:07:51 GMT", [...TRUNCATED...] TimeMap of SI From a Memento Aggregator
  • 12. @ibnesayeed | @WebSciDL 12 $ memgator -f cdxj http://si.edu/ | grep -v "^!" | cut -d'/' -f3 | sort | uniq -c | sort -nr 13263 web.archive.org 3590 wayback.archive-it.org 1202 web.archive.bibalex.org 651 webarchive.loc.gov 321 arquivo.pt 32 wayback.vefsafn.is 11 web.archive.org.au 3 archive.is 1 www.webarchive.org.uk 1 swap.stanford.edu 1 perma.cc $ memgator -f cdxj http://odu.edu/ | grep -v "^!" | cut -d'/' -f3 | sort | uniq -c | sort -nr 3071 web.archive.org 796 wayback.archive-it.org 751 web.archive.bibalex.org 99 webarchive.loc.gov 26 arquivo.pt 2 archive.is 1 wayback.vefsafn.is Cross-Archive Memento Lookup With MemGator Although there are 13k+ mementos in IA, there are also mementos in 10 other public web archives. https://github.com/oduwsdl/MemGator ODU is less popular, but there are mementos in 7 different web archives.
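The shell pipeline on this slide (drop lines starting with `!`, take the third `/`-separated field, then count and sort) can be approximated in Python. This is a rough sketch rather than part of MemGator: the function name is hypothetical, and it assumes each data line contains a memento URI whose host is the third `/`-separated field, as the `cut -d'/' -f3` step implies.

```python
from collections import Counter


def count_mementos_by_host(cdxj_lines):
    """Count data lines per archive host, most frequent first.

    Roughly mirrors: grep -v "^!" | cut -d'/' -f3 | sort | uniq -c | sort -nr
    Lines with fewer than three '/'-separated fields are skipped here,
    which differs slightly from cut's behavior.
    """
    hosts = Counter()
    for line in cdxj_lines:
        if line.startswith("!"):  # metadata lines, as filtered by grep -v "^!"
            continue
        fields = line.split("/")
        if len(fields) >= 3:
            hosts[fields[2]] += 1
    return hosts.most_common()
```

It could be fed the lines of a saved `memgator -f cdxj` output file, for example `count_mementos_by_host(open("si.cdxj"))`, where `si.cdxj` is a hypothetical file name.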
  • 13. @ibnesayeed | @WebSciDL Who Would Have Thought to Look Up odu.edu Mementos in the Icelandic Web Archive? 13 http://wayback.vefsafn.is/wayback/20100810032449/http://odu.edu/
  • 14. @ibnesayeed | @WebSciDL Why Aggregate Small Archives? ● One trillion+ mementos in IA’s Wayback Machine do not cover everything ● Archives often have unique mementos (small overlap) ● Linguistic and geolocation diversity ● High-quality curated collections ● Restricted resources and private archives 14
  • 21. @ibnesayeed | @WebSciDL MemGator Log Responses From Various Archives 21 93% of the requests made from MemGator to upstream archives were wasteful. Only about one third of the requests to the largest web archive (IA) were a hit.
  • 22. @ibnesayeed | @WebSciDL Aggregation Is Great, But Broadcasting Is Wasteful 22 What do we want? Aggregate all archives, large or small What’s the problem? Broadcasting is wasteful and problematic What’s the solution? Selectively poll archives that are likely to return good results for a lookup URI How to identify those? Profile web archives How to profile archives? MementoMap Framework
  • 23. @ibnesayeed | @WebSciDL Memento Lookup Routing 23 Let us fix the broadcasting issue with a more informed routing.
  • 24. @ibnesayeed | @WebSciDL Archive Profiling Strategies ● Complete URI-R Profiling (1 URI-R = 1 Profile Key) [Sanderson et al., TPDL 2012] ○ bbc.co.uk/images/logo.png?w=90 ○ cnn.com/2014/03/15/?id=128734 ● TLD-Only Profiling (1 TLD = 1 Profile Key) [AlSum, et al., TPDL 2013] ○ *.com ○ *.uk ● Middle Ground ○ *.cnn.com ○ *.co.uk ○ *.bbc.co.uk ○ bbc.co.uk/images/* 24 We explore these strategies in this work. Top three archives after IA produce full TimeMaps 52% of the time.
  • 25. @ibnesayeed | @WebSciDL Related Work 25 Archive Profiling ● Sanderson et al., IIPC 2012 ● AlSum et al., IJDL 2014 ● Bornand et al., JCDL 2016 ● Klein et al., JCDL 2019 Query Routing ● Gravano et al., SIGMOD 1997 ● Callan et al., SIGMOD 1999 ● Lu et al., CIKM 2003 ● Meng et al., CSUR 2002 Bloom Filters ● Bloom, CACM 1970 ● Majkowski, Cloudflare 2020 ● Broder et al., Internet Mathematics 2003 Web Archive Searching ● Gomes et al., TempWeb 2013 ● Costa et al., TempWeb 2013 ● Kanhabua et al., TPDL 2016 Archival Web Coverage ● Ainsworth et al., JCDL 2011 ● Alkwai et al., ToIS 2017 ● SalahEldeen et al., TPDL 2012 ● Kelly et al., JCDL 2018 On-Premise Indexing ● Hammer et al., IJCIS 2000 ● Kumar et al., RAIT 2016 Surface Web Crawling ● WorldWideWebSize.com ● Lawrence et al., Science 1998 ● Alarifi et al., SRE 2012 ● Khabsa et al., PLOS ONE 2014 Deep/Hidden Web Crawling ● Raghavan et al., VLDB 2001 ● Ntoulas et al., JCDL 2005 ● Wu et al., ICDE 2006 ● Sheng et al., VLDB 2012 Focused Crawling ● Micarelli et al., The Adaptive Web 2007 ● Bergmark et al., ECDL 2002 ● Li et al., WI-IAT 2012
  • 27. @ibnesayeed | @WebSciDL ● RQ1: Understanding Web Archives a. How to Learn an Archive’s Holdings? b. How to Learn an Archive’s Voids? ● RQ2: How to Summarize and Serialize Archival Holdings and Voids for Dissemination? ● RQ3: How to Utilize MementoMaps for Memento Routing? Research Questions 27
  • 28. @ibnesayeed | @WebSciDL RQ1a: How to Learn an Archive’s Holdings? ● Content-based Profiling ○ CDX Profiling [Alam, et al., TPDL 2015; Alam, et al., IJDL 2016] ○ Fulltext Search Profiling [Alam, et al., TPDL 2016] ● Usage-based Profiling ○ Sample URI Profiling ○ Response Cache Profiling [Bornand, et al., JCDL 2016] 28
  • 29. @ibnesayeed | @WebSciDL RQ1b: How to Learn an Archive’s Voids? ● Content-based Profiling ○ Collection Exclusion Policies ○ Access Control Lists (ACLs) ● Usage-based Profiling ○ Archive’s Access Log Profiling ○ Aggregator’s Access Log Profiling 29
  • 30. @ibnesayeed | @WebSciDL RQ2: How to Summarize and Serialize Archival Holdings and Voids for Dissemination? ● Generation and Compaction ● Updates and Merger ○ Incremental updates ○ Distributed generation ● Pagination ○ Small chunks for storage and transportation ○ Time- or TLD-based organization ○ Holdings and Voids segregation ● Dissemination and Discovery 30
  • 31. @ibnesayeed | @WebSciDL RQ3: How to Utilize MementoMaps for Memento Routing? ● Inverted Index ○ Precomputed index of MementoMaps ● Routing Score Estimation ○ Rank ordering each candidate archive ○ Routing to top-k archives ○ Routing to archives with score above certain threshold ● Machine Learning-Based Classifier ○ Models for individual web archives 31
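The two routing-score options listed on this slide, routing to the top-k archives or to every archive above a threshold, could look like the sketch below. It assumes a routing score per archive has already been estimated. The function names, archive names, and score values are hypothetical and not taken from the dissertation.

```python
def rank_archives(scores):
    """Order archive names by descending score, breaking ties by name."""
    return sorted(scores, key=lambda archive: (-scores[archive], archive))


def route_top_k(scores, k):
    """Route a lookup to the k highest-scoring candidate archives."""
    return rank_archives(scores)[:k]


def route_above_threshold(scores, threshold):
    """Route a lookup to every archive scoring at or above the threshold."""
    return [a for a in rank_archives(scores) if scores[a] >= threshold]


# Hypothetical routing scores for a single lookup URI.
scores = {"archive-a": 0.9, "archive-b": 0.2, "archive-c": 0.6}
print(route_top_k(scores, 2))
print(route_above_threshold(scores, 0.5))
```

Top-k bounds the number of upstream requests per lookup, while a threshold lets that number vary with how confident the scores are.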
  • 33. @ibnesayeed | @WebSciDL MementoMap Framework Components ● Ingestion (RQ1) ○ CDX files/API ○ Fulltext search ○ Access logs ○ Sample URIs ● Summarization and Serialization (RQ2) ○ Resource constraints ○ Application-specific variants ● Memento Routing (RQ3) ○ Integration with aggregators 33
  • 34. @ibnesayeed | @WebSciDL Evaluation Plan ● Cost ○ Time ○ Storage space ○ Network bandwidth ○ Periodic updates ● Accuracy ○ Targeting for fewer false positives and false negatives in individual MementoMap ● Freshness ○ How often MementoMaps of a web archive need to be updated? ● Routing Efficiency ○ Accuracy of inverted index across multiple web archives 34
  • 35. @ibnesayeed | @WebSciDL What is Archived in Arquivo.pt? What is Accessed from MemGator? 35 2B URI-Rs that have 1-9 mementos each in Arquivo.pt were never requested from ODU’s MemGator server. 43 URI-Rs were requested thousands of times each, but had zero mementos in Arquivo.pt. 45 URI-Rs had tens of mementos each that were requested hundreds of times.
  • 36. @ibnesayeed | @WebSciDL What is Archived in Arquivo.pt? What is Accessed from MemGator? 36 Blind spot of a usage-based profile Blind spot of a content-based profile
  • 37. @ibnesayeed | @WebSciDL Who Bears the Cost of Bad Routing Decisions? 37
    Predicted \ Actual | Present in the Archive | Not in the Archive
    Routed to the Archive | True Positive (TP) | False Positive (FP)
    Not Routed to the Archive | False Negative (FN) | True Negative (TN)
    FP: Wasteful (Infrastructure suffers) FN: Disuse (Users suffer)
  • 38. @ibnesayeed | @WebSciDL Recall and Accuracy 38 Recall = TP / (TP + FN): How many lookup URIs are routed among URIs that are present in an archive? Accuracy = (TP + TN) / All: How many lookup URIs are correctly routed or not routed? We do not report Precision because it does not capture TNs, which are crucial in Memento routing.
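The two formulas on this slide translate directly into code. The sketch below uses made-up confusion-matrix counts purely for illustration. They are not results from the dissertation.

```python
def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN): share of held URIs that were routed."""
    held = tp + fn
    return tp / held if held else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / All: share of correct routing decisions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Illustrative counts for 100 lookups against one archive.
tp, fn, fp, tn = 8, 2, 5, 85
print(recall(tp, fn), accuracy(tp, tn, fp, fn))
```

With counts like these, where most lookups are true negatives, accuracy rewards correctly skipping the archive, which is the behavior the slide says Precision would fail to capture.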
  • 39. RQ1a: How to Learn an Archive’s Holdings? 39
  • 40. @ibnesayeed | @WebSciDL URI Canonicalization and SURT 40 https://news.bbc.co.uk/images/Logo.png?width=200&height=80&rotate=90%C2%B0#top http://www.news.BBC.co.uk/images/Logo.png?width=200&height=80&rotate=90%c2%b0#top http://www.news.bbc.co.uk/images/Logo.png?rotate=90%c2%B0&width=200&height=80 http://NEWS.BBC.CO.UK:80//images//Logo.png?height=80&width=200&rotate=90%c2%b0#top news.bbc.co.uk/images/Logo.png?height=80&rotate=90%C2%B0&width=200 uk,co,bbc,news,)/images/logo.png?height=80&rotate=90%c2%b0&width=200 Canonicalization SURT
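The slide's examples suggest a set of normalization rules: ignore the scheme, a leading `www.`, the port, and the fragment, collapse repeated slashes, sort the query parameters, lowercase, and reverse the host labels. The sketch below applies those inferred rules with the standard library only. It is a simplified illustration under those assumptions, not the canonicalizer used in this work or a complete SURT implementation, and `simple_surt` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit


def simple_surt(uri: str) -> str:
    """Return a simplified SURT-style key for a URI."""
    if "://" not in uri:
        uri = "http://" + uri  # allow scheme-less input
    parts = urlsplit(uri)  # the fragment is split off and never used
    host = (parts.hostname or "").lower()  # .hostname excludes any port
    if host.startswith("www."):
        host = host[4:]
    path = (re.sub(r"/{2,}", "/", parts.path) or "/").lower()
    key = ",".join(reversed(host.split("."))) + ",)" + path
    if parts.query:
        # Sort parameters so that their order does not matter.
        key += "?" + "&".join(sorted(parts.query.lower().split("&")))
    return key


print(simple_surt("http://NEWS.BBC.CO.UK:80//images//Logo.png"
                  "?height=80&width=200&rotate=90%c2%b0#top"))
```

Note that this sketch drops every port, not only default ones, and lowercases the whole path and query in a single step rather than producing the intermediate canonical form shown on the slide.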
  • 41. @ibnesayeed | @WebSciDL CDX/CDXJ Summarization
      http://archive.org/web/researcher/cdx_file_format.php
  • 42. @ibnesayeed | @WebSciDL URI-Key Generation and Static Profiling Policies
      ● HmPn Policy (maximum “m” host segments and “n” path segments)
        ○ H3P0 (3 host and 0 path segments): uk,co,bbc,)/
        ○ HxP1 (all host and 1 path segments): uk,co,bbc,news,)/images
      ● DLim Policy (“RegisteredDomain[#SubDomain[/#Paths[/#Queries[/PathInitial]]]]”)
        ○ DDom (up to domain name): uk,co,bbc,)/
        ○ DIni (up to path initial): uk,co,bbc,)/1/2/3/i
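The HmPn policy can be sketched as truncating a SURT to at most “m” host segments and “n” path segments. The helper below is a hypothetical illustration (its name and signature are assumptions, and the DLim policies are not covered); it reproduces the two HmPn examples from the slide.

```python
from typing import Optional


def hmpn_key(surt: str, max_host: Optional[int], max_path: int) -> str:
    """Truncate a SURT to an HmPn URI key; max_host=None means all host segments (Hx)."""
    host, _, rest = surt.partition(")")
    host_segments = [s for s in host.split(",") if s]
    if max_host is not None:
        host_segments = host_segments[:max_host]
    path = rest.split("?", 1)[0]  # query parameters are not part of the key
    path_segments = [s for s in path.split("/") if s][:max_path]
    return ",".join(host_segments) + ",)/" + "/".join(path_segments)


if __name__ == "__main__":
    surt = "uk,co,bbc,news,)/images/logo.png?height=80&rotate=90%c2%b0&width=200"
    print(hmpn_key(surt, 3, 0))     # H3P0
    print(hmpn_key(surt, None, 1))  # HxP1
```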
  • 43. @ibnesayeed | @WebSciDL Archives Dataset
      Archive       URI-Rs   URI-Ms   Index Size
      Archive-It    1.9B     5.3B     1.8TB
      UKWA          0.7B     1.7B     0.5TB
      Stanford      12M      25M      8.3GB
  • 44. @ibnesayeed | @WebSciDL Sample Query URI Sets
      Sample (1M URIs Each)   In Archive-It   In UKWA   In Stanford   Union {AIT, UK, SU}
      DMOZ                    4.097%          3.594%    0.034%        7.575%
      MementoProxy            4.182%          0.408%    0.046%        4.527%
      IAWayback               3.716%          0.519%    0.039%        4.165%
      UKWayback               0.108%          0.034%    0.002%        0.134%
      Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
  • 45. @ibnesayeed | @WebSciDL CDX Size vs URI-M (UKWA 10 Years)
      Alpha: 175 bytes per CDX line
      Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
  • 46. @ibnesayeed | @WebSciDL URI-M vs URI-R (UKWA 10 Years)
      Gamma: 2.46, K: 2.686, Beta: 0.911
      Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
  • 47. @ibnesayeed | @WebSciDL Relative Space Cost (UKWA 7 Years)
      Phi: 8.5e-07 -- 0.70583
      Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
  • 48. @ibnesayeed | @WebSciDL Time Cost (UKWA 7 Years)
      Tau: 5.7e-05 -- 6.2e-05
      CDX: 45GB, URI-Ms: 181M, URI-Rs: 96M, Time: 3 hours
      Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
  • 49. @ibnesayeed | @WebSciDL Resource Requirement
      Alam et al., “Web Archive Profiling Through CDX Summarization”, IJDL 2016
  • 50. @ibnesayeed | @WebSciDL Cost vs. Accuracy: UKWA 50 Archive-It and Stanford archives have similar trends.
  • 51. @ibnesayeed | @WebSciDL Profile Policy Groups: Cost vs. Accuracy
      Group   Policies                         Relative Cost         Accuracy
      G1      H1P0/TLD                         Bound by # of TLDs    ≈ 0.01
      G2      H3P0, DDom, DSub, DPth, DQry     < 0.01                ≈ 0.78
      G3      DIni                             ≈ 2 * G2              ≈ 0.88
      G4      HxP1                             ≈ 5 * G3              ≈ 0.94
      G5      Higher HmPn                      0.4 - 0.7             Not Explored
      G6      URIR                             1.0                   1.0
  • 52. @ibnesayeed | @WebSciDL Collecting CDX Is Difficult
      https://memgator.cs.odu.edu/
      The MemGator service at ODU currently aggregates 16 web archives, but we have CDX data from only 4. However, some of these archives have fulltext search support, so we can learn about their holdings.
  • 53. @ibnesayeed | @WebSciDL Who Knows Term Frequency for Estonian Nouns? 53 https://en.wiktionary.org/wiki/Category:Estonian_nouns
  • 54. @ibnesayeed | @WebSciDL Fulltext Search Profiling
      ● Top Nouns: time, year, people, way, man, day, thing, child, mr, government
      ● Random Dict: analogies, unbolt, consonant, coils, stolidly, cigar, decrepit, rhododendron, cannibal, honeydew
      ● Dynamic Words Discovery: the, وكالة, war, angry, أنباء, the, arab, العربي, middle, news, الغاضب, east, service, on, arabic, a, politics, poetry, source, war, art
  • 55. @ibnesayeed | @WebSciDL Random Searcher Model (RSM): Search for a word
  • 56. @ibnesayeed | @WebSciDL Random Searcher Model (RSM): Collect resulting links
        http://jeffreyhill.typepad.com/english/
        http://www.nc-net.info/english.php
        http://english.aljazeera.net/
        http://twitter.com/AJEnglish
        https://vimeo.com/248815105
        http://www.bridge.edu/
        http://www.wordreference.com/
        https://www.facebook.com/aljazeera
        http://www.elizabethangardens.org/
        ...
  • 57. @ibnesayeed | @WebSciDL Random Searcher Model (RSM): Load a random result link
  • 58. @ibnesayeed | @WebSciDL Random Searcher Model (RSM): Extract words from the page
        Teaching Resources Adjunct Toolkit NC NET Academy PD Planning Tools Regional Centers Campus Liaisons Nontraditional Careers College Tech Prep NC ACCESS Co op Education Green Technology You are here NC NET Teaching Resources Discipline Specific English English Self Paced Modules Writing Across the Curriculum NC NET Western Center Incorporating Visuals in Workplace Documents Sections 1 2 Wake Tech Community College Incorporating Visuals in Workplace Documents Section 3 Wake Tech Community College All self paced modules can be accessed through the NC NET Blackboard server Log in with the user name faculty and the password nc net Once connected you can view the courses by topic or alphabetically by title English Webliography North Carolina Community College System 2012
  • 59. @ibnesayeed | @WebSciDL Random Searcher Model (RSM): Select a word to search
  • 60. @ibnesayeed | @WebSciDL Random Searcher Model (RSM)
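The search, collect, load, extract, and select steps of the RSM slides can be sketched as a loop. This is a hedged illustration only: `search` and `fetch_words` are hypothetical callables standing in for an archive's fulltext search interface and a page word extractor, the stopping rules are assumptions, and the sketch does not reproduce the specific RSM modes.

```python
import random
from typing import Callable, List, Optional, Set


def random_searcher(search: Callable[[str], List[str]],
                    fetch_words: Callable[[str], List[str]],
                    seed_word: str,
                    max_searches: int,
                    rng: Optional[random.Random] = None) -> Set[str]:
    """Sample an archive's holdings by repeatedly searching for discovered words."""
    rng = rng or random.Random()
    discovered: Set[str] = set()
    word = seed_word
    for _ in range(max_searches):
        links = search(word)               # search for a word
        discovered.update(links)           # collect resulting links
        if not links:
            break                          # assumption: stop on an empty result
        page_words = fetch_words(rng.choice(links))  # load a random result link
        if not page_words:
            break                          # assumption: stop on a wordless page
        word = rng.choice(page_words)      # select a word to search next
    return discovered


if __name__ == "__main__":
    # Toy in-memory "archive" (made-up data).
    index = {"english": ["u1", "u2"], "news": ["u2", "u3"]}
    pages = {"u1": ["english", "news"], "u2": ["news"], "u3": ["english"]}
    found = random_searcher(lambda w: index.get(w, []),
                            lambda u: pages.get(u, []),
                            "english", 20, random.Random(1))
    print(sorted(found))
```

Each iteration here costs one search request plus one page fetch; passing the two callables in keeps the loop testable without network access.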
  • 61. @ibnesayeed | @WebSciDL RSM Modes Costs
      Mode               HTTP Cost                 Remarks
      Static             C                         Suitable for specialized collections with known top keywords
      PopularityBiased   2C                        Human-like model, but costly
      EqualOpportunity   2C                        Human-like model, but costly
      Conservative       C + 𝛿 (where 𝛿 << C)      Suitable for any collection and works without any supplementary materials with very little overhead
      “C” is the cost/number of search queries.
  • 62. @ibnesayeed | @WebSciDL Search Needed vs. Coverage 62 100% in 11K searches 100% in 27K searches 100% in 337K searches 100% in 1.9M searches Alam et al., “Web Archive Profiling Through Fulltext Search”, TPDL 2016
  • 63. @ibnesayeed | @WebSciDL Accuracy, Recall, and Coverage (10-100%)
      Panels: DMOZ, IA Wayback, UK Wayback, Memento Proxy
      ● Low Accuracy (high FP) => Archives & Aggregator suffer
      ● Low Recall (high FN) => Users suffer
      Alam et al., “Web Archive Profiling Through Fulltext Search”, TPDL 2016
  • 64. 64 RQ1b: How to Learn an Archive’s Voids?
  • 65. @ibnesayeed | @WebSciDL Why Profile Archival Voids?
        $ curl -I https://web.archive.org/web/https://quora.com/
        HTTP/1.1 403 FORBIDDEN
        Server: nginx/1.15.8
        Date: Wed, 02 Dec 2020 20:39:33 GMT
        Content-Type: text/html; charset=utf-8
        Connection: keep-alive
        Server-Timing: captures_list;dur=0.150497
        X-App-Server: wwwb-app58
        X-ts: 403
      The Internet Archive has many “*.com” domains, but it may not want to capture or replay some.
  • 66. @ibnesayeed | @WebSciDL Archival Voids Profiles Reduce False Positives
      Detailed keys:
        org,arxiv)/abs/a 40
        org,arxiv)/abs/b 23
        org,arxiv)/abs/c 17
        org,arxiv)/format/a 15
        org,arxiv)/format/b 20
        org,arxiv)/format/c 10
        org,arxiv)/search/a 30
        ...
      Rolled up by path:
        org,arxiv)/abs/* 80
        org,arxiv)/format/* 45
        org,arxiv)/search/* 60
        Lookup of org,arxiv)/abs/d: False Positive
      Rolled up by domain:
        org,arxiv)/* 185
        Lookup of org,arxiv)/pdf/a, org,arxiv)/pdf/b, org,arxiv)/pdf/c: False Positive
      With an archival void:
        org,arxiv)/* 185
        org,arxiv)/pdf/* 0
      How about summarizing frequently accessed URIs an archive does not hold?
  • 67. @ibnesayeed | @WebSciDL Arquivo.pt Access Log Dataset 67
  • 68. @ibnesayeed | @WebSciDL Most Frequently Accessed URIs
      ● Most of the traffic to “fccn.pt” originated from UptimeRobot.
      ● Always returned a “404 Not Found” response.
  • 69. @ibnesayeed | @WebSciDL 404-Only Frequencies and Request Savings
      An archival voids profile of 2.4K URIs that were each accessed hundreds of times or more could have saved about 8.4% of wasted requests.
  • 70. @ibnesayeed | @WebSciDL Archival Voids Recommendations 70 ● Keep archival voids profiles separate from archival holdings ● Update often ● Use specific keys with only high confidence ● Profile only resources that are high in demand ● Archives themselves are better sources of truth than external observers
  • 71. RQ2: How to Summarize and Serialize Archival Holdings and Voids for Dissemination? 71
  • 72. @ibnesayeed | @WebSciDL If Only Archives Could Tell What to Ask Them For
      ● Websites advertise their holdings using sitemap.xml, why can’t archives?
        ○ Archives have billions or even trillions of URI-Ms
        ○ Such exhaustive lists would go stale very quickly
      ● How about robots.txt?
        ○ It is compact, but it is an exclusion format; it does not tell what the site has
        ○ It assumes a single domain; patterns are for paths (not the domain name)
      ● How about well-known URIs?
        ○ Good for automated discovery of domain-specific metadata resources
      ● How about combining these ideas?
        ○ Introducing MementoMap!
  • 73. @ibnesayeed | @WebSciDL A MementoMap Example
        !context ["http://oduwsdl.github.io/contexts/ukvs"]
        !id {uri: "http://archive.example.org/"}
        !fields {keys: ["surt"], values: ["frequency"]}
        !meta {type: "MementoMap", name: "A Test Web Archive", year: 1996}
        !meta {updated_at: "2018-09-03T13:27:52Z"}
        * 54321/20000
        com,* 10000+
        org,arxiv)/ 100
        org,arxiv)/* 2500~/900
        org,arxiv)/pdf/* 0
        uk,co,bbc)/images/* 300+/20-
      https://github.com/oduwsdl/ORS/blob/master/ukvs.md
      Goodbye HmPn/DLim static profiling policies, thanks to our SURT with wildcard.
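One way to read such a file is to look up a SURT and take the most specific matching key. The sketch below is an assumption-laden illustration, not the MementoMap tool's implementation: `parse_entries` and `lookup` are hypothetical names, lines starting with `!` are simply skipped, values are kept as opaque strings, and the lookup is a plain linear scan.

```python
from typing import Dict, Optional, Tuple

MEMENTOMAP = """\
!context ["http://oduwsdl.github.io/contexts/ukvs"]
!id {uri: "http://archive.example.org/"}
* 54321/20000
com,* 10000+
org,arxiv)/ 100
org,arxiv)/* 2500~/900
org,arxiv)/pdf/* 0
uk,co,bbc)/images/* 300+/20-
"""


def parse_entries(text: str) -> Dict[str, str]:
    """Map each URI key to its raw frequency value, skipping '!' metadata lines."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        key, _, value = line.partition(" ")
        entries[key] = value.strip()
    return entries


def lookup(entries: Dict[str, str], surt: str) -> Optional[Tuple[str, str]]:
    """Return the (key, value) whose key matches `surt` most specifically."""
    best = None
    for key, value in entries.items():
        if key.endswith("*"):
            prefix, exact = key[:-1], False
            matched = surt.startswith(prefix)
        else:
            prefix, exact = key, True
            matched = surt == key
        rank = (len(prefix), exact)  # longer prefix wins; exact key breaks ties
        if matched and (best is None or rank > best[0]):
            best = (rank, key, value)
    return None if best is None else (best[1], best[2])


if __name__ == "__main__":
    entries = parse_entries(MEMENTOMAP)
    print(lookup(entries, "org,arxiv)/pdf/1234"))
```

In this example a lookup under `org,arxiv)/pdf/` resolves to the `0` entry, so a router could skip the archive for that URI.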
  • 74. @ibnesayeed | @WebSciDL SURT Representation With Wildcard
      ● Original SURTs did not have wildcards. We introduced them for dynamic profiling.
      ● In practice the common “http://(” prefix is removed.
  • 75. @ibnesayeed | @WebSciDL Arquivo.pt Index Statistics
      The Internet Archive is about 150 times bigger than Arquivo.pt.
      Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
  • 76. @ibnesayeed | @WebSciDL Top Arquivo.pt TLDs
      ● Arquivo.pt was created to archive sites of interest to the Portuguese people.
      ● One third of Arquivo.pt consists of pages other than *.pt, hence routing based on top TLDs would miss a lot.
      Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
  • 77. @ibnesayeed | @WebSciDL Who Would Have Thought Arquivo.pt Has 10K+ .онлайн Sites? 77 “.онлайн” (encoded as “xn--80asehdb”) is an IDN gTLD which means “.online”
  • 78. @ibnesayeed | @WebSciDL Cumulative Growth of URI-Ms and URI-Rs in Arquivo.pt
      50% of the mementos were captured in the last two active years alone.
      Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
  • 79. @ibnesayeed | @WebSciDL Shape of HxPx Key Tree of Arquivo.pt
      Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
  • 80. @ibnesayeed | @WebSciDL Incremental Children Reduction Rate
      Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
  • 81. @ibnesayeed | @WebSciDL Processed Lines vs. Compacted MementoMap Growth
      Input keys:
        com,example)/a/1/x
        com,example)/a/2
        com,example)/a/3
        com,example)/b/1
        com,example)/b/2
        com,example)/c/1
      After rolling up “a”:
        com,example)/a/*
        com,example)/b/1
        com,example)/b/2
        com,example)/c/1
      After rolling up the domain:
        com,example)/*
      Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
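The rollup illustrated on this slide can be sketched as grouping keys by a shared path prefix and replacing a group with a wildcard key once it has enough members. This is an in-memory illustration under assumed parameters (`depth` and `threshold` are hypothetical names and values); it is not the single-pass compaction algorithm of the MementoMap tool and it ignores frequency values.

```python
from typing import Dict, List, Tuple


def rollup(keys: List[str], depth: int, threshold: int) -> List[str]:
    """Replace groups of keys sharing their first `depth` path segments with 'prefix*'."""
    groups: Dict[Tuple[str, str], List[str]] = {}  # dicts preserve insertion order
    for key in keys:
        host, _, path = key.partition(")/")
        segments = [s for s in path.split("/") if s]
        if len(segments) > depth:
            parent = host + ")/" + "".join(s + "/" for s in segments[:depth])
            groups.setdefault(("group", parent), []).append(key)
        else:
            groups.setdefault(("single", key), []).append(key)  # too short to roll up
    compacted: List[str] = []
    for (kind, name), members in groups.items():
        if kind == "group" and len(members) >= threshold:
            compacted.append(name + "*")
        else:
            compacted.extend(members)
    return compacted


if __name__ == "__main__":
    keys = ["com,example)/a/1/x", "com,example)/a/2", "com,example)/a/3",
            "com,example)/b/1", "com,example)/b/2", "com,example)/c/1"]
    step1 = rollup(keys, depth=1, threshold=3)
    step2 = rollup(step1, depth=0, threshold=4)
    print(step1)
    print(step2)
```

A lower threshold yields a smaller but less precise map, which is the cost versus accuracy trade-off the surrounding slides discuss.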
  • 82. @ibnesayeed | @WebSciDL MementoMap Generation, Compaction, and Lookup
      ● 1.5% Relative Cost yields 60% Accuracy.
      ● Arquivo.pt can save 60% wasted traffic by publishing a 119MB summary file!
      Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
  • 83. @ibnesayeed | @WebSciDL Dissemination and Discovery Methods
      ● Well-known URI:
          GET /.well-known/mementomap HTTP/1.1
          Host: arquivo.pt
      ● Link Header:
          Link: <https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap"
      ● Link HTML Element:
          <link href="https://arquivo.pt/path/to/mementomap.ukvs" rel="mementomap">
  • 84. RQ3: How to Utilize MementoMaps for Memento Routing? 84
  • 85. @ibnesayeed | @WebSciDL Putting Archival Holdings and Voids to Work 85 ● Combine MementoMaps (holdings and/or voids) of various web archives ● Create an inverted index for efficient cross-archive lookup ● Define scores based on reported data in MementoMaps ● Predict whether a lookup URI should be routed to an archive
  • 86. @ibnesayeed | @WebSciDL How Densely Is a URI Subtree Archived?
      Where to go? https://www.odu.edu/academics/graduation-commencement/graduation/FAQs
        * /10000000000+    edu,odu)/* /10000~
        * /10000+          edu,odu)/* /1
  • 87. @ibnesayeed | @WebSciDL Density Score
      A normalized score to assess how actively an archive is capturing a given URI space.
        Lookup URI: org,example)/a/b/c/d/e
        MementoMap entries: * /1000000 and org,example)/a/b/c/* 200/100
        µ = log(1 + 100) / log(1 + 1000000) ≈ 0.334
  • 88. @ibnesayeed | @WebSciDL How Close Are a Lookup URI and Its Corresponding URI Key?
      Where to go? https://www.odu.edu/academics/graduation-commencement/graduation/FAQs
        edu,odu,)/academics/graduation-commencement/* /100
        edu,odu)/* /100
  • 89. @ibnesayeed | @WebSciDL Closeness Score
      A normalized score to assess how closely the longest URI Key prefix matches with the lookup URI.
        URI Key: org,example)/a/b/c/* (Lk = 5)
        Lookup URI: org,example)/a/b/c/d/e (Ll = 7)
        χ = min(log(1 + 5), 1) / (1 + 7 − 5) ≈ 0.259
  • 90. @ibnesayeed | @WebSciDL How Likely Is the Lookup URI to Be Found in Each of the Archives?
      Which ones to select? https://www.odu.edu/academics/graduation-commencement/graduation/FAQs
      [Figure: four score panels, each on a 0.00 to 1.00 scale, with a Cut-off Threshold line]
  • 91. @ibnesayeed | @WebSciDL Routing Score
      A normalized score to assess the likelihood of finding the lookup URI in an archive.
        ρ = µ * χ ≈ 0.334 * 0.259 ≈ 0.087
      Also shown: Relative Routing Score and Weighted Relative Routing Score
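The worked numbers on the Density, Closeness, and Routing Score slides can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function and parameter names are made up, and the closeness formula uses a base-10 logarithm because that is the base that reproduces the slide's 0.259 (the density ratio is the same in any base).

```python
import math


def density(key_freq: int, root_freq: int) -> float:
    """µ = log(1 + key frequency) / log(1 + root '*' frequency)."""
    return math.log(1 + key_freq) / math.log(1 + root_freq)


def closeness(lookup_len: int, key_len: int) -> float:
    """χ = min(log10(1 + Lk), 1) / (1 + Ll - Lk); base 10 is an inferred assumption."""
    return min(math.log10(1 + key_len), 1.0) / (1 + lookup_len - key_len)


def routing_score(key_freq: int, root_freq: int,
                  lookup_len: int, key_len: int) -> float:
    """ρ = µ * χ."""
    return density(key_freq, root_freq) * closeness(lookup_len, key_len)


if __name__ == "__main__":
    # Values from the slides: frequency 100 under the key, 1000000 under "*",
    # lookup length Ll = 7 and matched key length Lk = 5.
    mu = density(100, 1_000_000)
    chi = closeness(7, 5)
    rho = routing_score(100, 1_000_000, 7, 5)
    print(round(mu, 3), round(chi, 3), round(rho, 3))
```

A key that matches the lookup URI exactly has Ll = Lk, so the denominator of χ becomes 1 and only the `min(…, 1)` term limits the score.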
  • 92. @ibnesayeed | @WebSciDL MementoMaps of Different Archives 92
  • 93. @ibnesayeed | @WebSciDL Inverted Index of MementoMaps 93
  • 94. @ibnesayeed | @WebSciDL Inverted Index Lookup Result
      ● Not profiled (default routing score)
      ● Archival Void (“0” routing score)
      ● Relative Routing Scores of all archives add up to “1.0”
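One plausible way to obtain relative scores that add up to 1.0 is to divide each archive's routing score by the sum over all archives. The sketch below assumes exactly that normalization; the function name, the use of `None` for unprofiled archives, the default score of 0.05, and the sample values are all hypothetical and are not taken from the dissertation.

```python
from typing import Dict, Optional


def relative_routing_scores(scores: Dict[str, Optional[float]],
                            default: float = 0.05) -> Dict[str, float]:
    """Normalize per-archive routing scores so that they sum to 1.0.

    None means "not profiled" and is replaced by `default`;
    0.0 means an archival void and stays 0.0 after normalization.
    """
    filled = {name: default if s is None else s for name, s in scores.items()}
    total = sum(filled.values())
    if total == 0:
        return {name: 0.0 for name in filled}
    return {name: s / total for name, s in filled.items()}


if __name__ == "__main__":
    # Made-up routing scores for three archives.
    print(relative_routing_scores({"archive-a": 0.087, "archive-b": 0.0, "archive-c": None}))
```

A cut-off threshold can then be applied to these relative scores to decide which archives receive the lookup request.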
  • 95. @ibnesayeed | @WebSciDL Web Archive Routing Evaluation Dataset 95 Created from the MemGator access logs
  • 96. @ibnesayeed | @WebSciDL Archival Collection Diffusion 96 Simple TLD-based Memento routing is insufficient.
  • 97. @ibnesayeed | @WebSciDL Baseline Routing 97 Large Recall values and small request costs of top-1 and top-2 policies show the effectiveness of our heuristic Routing Score.
  • 98. @ibnesayeed | @WebSciDL Cut-off Threshold Routing
      ● Inclusion of Voids profiles improves Savings more prominently than Accuracy due to frequency bias.
      ● Poor prevalence of sample URIs in UKWA and Stanford hurts their scores.
  • 99. @ibnesayeed | @WebSciDL Machine Learning-Based Routing 99 Classifier is biased towards Accuracy at the cost of poor Recall due to poor prevalence of positive cases in the dataset.
  • 100. Contributions, Future Work, and Conclusions 100
  • 101. @ibnesayeed | @WebSciDL Tools Contributions in the Web Archiving Ecosystem
      Open-Source Tools/Scripts
        ❖ https://github.com/oduwsdl/ipwb
        ❖ https://github.com/oduwsdl/Reconstructive
        ❖ https://github.com/oduwsdl/MemGator
        ❖ https://github.com/oduwsdl/archive_profiler
        ❖ https://github.com/oduwsdl/accesslog-parser
        ❖ https://github.com/oduwsdl/MementoMap
        ❖ https://github.com/oduwsdl/ORS
        ❖ https://jekyll.github.io/classifier-reborn/
  • 102. @ibnesayeed | @WebSciDL Contributions: Algorithms
      ● Random Searcher Model (RSM)
        ○ Utilizes a fulltext search interface to sample archival holdings
        ○ Supports multiple modes of operation
        ○ Discovery: 10% => Accuracy: 0.8, Recall: 0.9
      ● MementoMap Generation, Compaction, and Merger
        ○ Consumes a sorted list of URIs in SURT format
        ○ Allows configuration options to control compaction
        ○ Linear, single-pass, small constant memory footprint (irrespective of the input size)
        ○ Accuracy > 0.6, Recall: 1.0, Relative Cost < 1.5%
  • 103. @ibnesayeed | @WebSciDL Contributions: Terminologies and Metrics
      ● URI Key: Extended SURT with wildcard support to describe subtrees of the URI space in the form of URI prefixes
      ● Archival Holdings: A measure to describe holdings of an archive
      ● Archival Voids: A measure to describe what an archive is missing
      ● Relative Cost: The ratio of the number of URI keys used to describe summarized holdings of an archive over the total number of unique URI-Rs in the archive
      ● Frequency Score: A means to represent the number of URI-Ms and/or URI-Rs under a URI key
      ● Density Score: A normalized score derived from the frequency score to describe the archiving activity under a URI key
      ● Closeness Score: A normalized score to describe how similar or different two URI keys are
      ● Routing Score: A normalized score to represent how likely it is that an archive has a URI
  • 104. @ibnesayeed | @WebSciDL Contributions: Publications
      ● RQ1: TPDL 2015, TCDL 2015, IJDL 2016, TPDL 2016, JCDL 2017 🥇, JCDL 2018 🥇, JCDL 2019, WADL 2019, ESCAPE 2019, WADL 2020, JCDL 2021, IJDL 2021
      ● RQ2: TPDL 2015, TCDL 2015, IJDL 2016, JCDL 2016, TPDL 2016, TPDL 2016, JCDL 2017, TCDL 2017, JCDL 2018, JCDL 2019 ⭐, IJDL 2021, RFC, RFC
      ● RQ3: JCDL 2016, TCDL 2017, ESCAPE 2019, JCDL 2021, IJDL 2021
      Legend: 🥇 Best paper/poster award; ⭐ Best paper nomination; Italics = Planned/in progress
  • 105. @ibnesayeed | @WebSciDL Future Work
      ● URI Keys for collection seed summarization and collection diversity measure
      ● Archival Voids to assess crawl jobs
      ● Explore UKVS file format use cases in web archiving and beyond
      ● Profiling other dimensions
        ○ Datetime
        ○ Language
      ● Standards for cross-archive collection exploration
      ● Alternate approaches to URI subtree rollups
      ● Other ML and Neural Net-based techniques for Memento routing
      ● Hierarchical network Memento routing
      ● Adoption of MementoMap framework by archives and aggregators
  • 106. @ibnesayeed | @WebSciDL Future Work: MementoMap Adoption Path
      ● PWA, UKWA, and NLA have shown interest
      ● PyWB archival replay system is open for implementation
      ● MemGator and LANL’s Time Travel service are interested
      ● Big web archives can start with publishing archival voids
        ○ No need to profile IA
      ● Archives with access restrictions can have multiple MementoMaps
      ● Third parties can create and publish MementoMaps of the rest of the archives while they catch up
  • 108. @ibnesayeed | @WebSciDL Conclusions
      ● Introduced challenges in Memento aggregation
        ○ Broadcasting can be evil
        ○ Profiling is desired for collection understanding and better Memento routing
        ○ Aggregation is useful, even for the small web archives
      ● MementoMap framework addresses three research questions
        ○ RQ1: How to learn about a web archive’s holdings and voids?
          ■ Holdings: CDX and fulltext search profiling
          ■ Voids: Access log profiling
        ○ RQ2: How to summarize and serialize archival holdings and voids?
          ■ MementoMap/UKVS serialization and dissemination
        ○ RQ3: How to utilize MementoMaps for informed Memento routing?
          ■ Inverted Index, Routing Score, Classifier
      Over 96% Accuracy with 89% Recall or 68% Accuracy with 99% Recall via a 120MB MementoMap file for an archive with 2B+ unique URI-Rs.