SlideShare a Scribd company logo
Sawood Alam
Internet Archive, San Francisco, CA
Michael L. Nelson and Michele C. Weigle
Old Dominion University, Norfolk, VA
Profiling Web Archival Voids
for Memento Routing
#MementoMap
#ArchivalVoids
@ibnesayeed
@WebSciDL
Supported in part by NSF Grant IIS-1526700
JCDL '21, September 27-30, 2021, Urbana-Champaign, Illinois
https://arxiv.org/abs/2108.03311
@ibnesayeed | @WebSciDL
MemGator Log Responses From Various Archives
2
93% of the requests
made from MemGator
to upstream archives
were wasteful.
Only about one third
of the requests to the
largest web archive
(IA) were a hit.
@ibnesayeed | @WebSciDL
Who Bears the Cost of Bad Routing Decisions?
3
Actual
Present in the Archive Not in the Archive
Predicted
Routed to the
Archive
True Positive (TP) False Positive (FP)
Not Routed to
the Archive
False Negative (FN) True Negative (TN)
FP: Wasteful (Infrastructure suffers)
FN: Disuse (Users suffer)
@ibnesayeed | @WebSciDL
What is Archived in Arquivo.pt?
What is Accessed from MemGator?
4
2B URI-Rs that have
1-9 mementos each in
Arquivo.pt were never
requested from ODU’s
MemGator server.
43 URI-Rs were
requested thousands
of times each, but
had zero mementos
in Arquivo.pt.
45 URI-Rs had tens
of mementos each
that were requested
hundreds of times.
@ibnesayeed | @WebSciDL
What is Archived in Arquivo.pt?
What is Accessed from MemGator?
5
Blind spot of a
usage-based
profile
Blind spot of a
content-based
profile
@ibnesayeed | @WebSciDL
MementoMap of Archival Holdings Profile
6
Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019
https://arxiv.org/abs/1905.12607
1.5% Relative Cost yields 60%
Accuracy.
Arquivo.pt can save 60%
wasted traffic by publishing a
119MB summary file of their
1.8T CDX files (containing 5B
mementos of 2B URI-Rs)
@ibnesayeed | @WebSciDL
Why Profile Archival Voids?
7
$ curl -I https://web.archive.org/web/https://quora.com/
HTTP/1.1 403 FORBIDDEN
Server: nginx/1.15.8
Date: Wed, 02 Dec 2020 20:39:33 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Server-Timing: captures_list;dur=0.150497
X-App-Server: wwwb-app58
X-ts: 403
The Internet Archive has
many “*.com” domains,
but it may not want to
capture or replay some.
@ibnesayeed | @WebSciDL
Sources Accessing TimeMaps
8
LANL’s usage-based
Archival Holdings
profiling reduced
requests significantly.
Profiling Archival
Voids would improve
it even further.
@ibnesayeed | @WebSciDL
Archival Voids Profiles Reduce False Positives
9
org,arxiv)/abs/a 40
org,arxiv)/abs/b 23
org,arxiv)/abs/c 17
org,arxiv)/format/a 15
org,arxiv)/format/b 20
org,arxiv)/format/c 10
org,arxiv)/search/a 30
...
org,arxiv)/abs/* 80
org,arxiv)/format/* 45
org,arxiv)/search/* 60
org,arxiv)/* 185
org,arxiv)/abs/d
False Positive org,arxiv)/pdf/a
org,arxiv)/pdf/b
org,arxiv)/pdf/c
False Positive
org,arxiv)/* 185
org,arxiv)/pdf/* 0
How about summarizing frequently
accessed URIs an archive does not hold?
@ibnesayeed | @WebSciDL
Arquivo.pt Access Log Dataset
10
@ibnesayeed | @WebSciDL
TimeMap Status Code Distribution and Fluctuations
11
@ibnesayeed | @WebSciDL
Most Frequently Accessed URIs
12
Most of the traffic to
“fccn.pt” is originated
from the UptimeRobot
Always returned a “404 Not Found” response.
@ibnesayeed | @WebSciDL
404-Only Frequencies and Request Savings
13
An archival voids profile of 2.4k URIs, that were accessed hundreds of
times each or more, could have saved about 8.4% of wasted requests.
@ibnesayeed | @WebSciDL
Archival Voids Recommendations
14
● Keep Archival Voids profiles separate from Archival Holdings
● Update often
● Use specific keys with only high confidence
● Profile only resources that are high in demand
● Archives themselves are better sources of truth than external observers
https://arxiv.org/abs/2108.03311

More Related Content

What's hot

InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
Sawood Alam
 
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsScripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Justin Brunelle
 
Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content Language
Michael Nelson
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Michael Nelson
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web Archives
Sawood Alam
 
Proposal for Text Mining PubAg
Proposal for Text Mining PubAgProposal for Text Mining PubAg
Proposal for Text Mining PubAg
Jake Lever
 
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolEvaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Michael Nelson
 
URI Disambiguation in the Context of Linked Data
URI Disambiguation in the Context of Linked DataURI Disambiguation in the Context of Linked Data
URI Disambiguation in the Context of Linked Databutest
 
Of Cataloging & Context
Of Cataloging & ContextOf Cataloging & Context
Of Cataloging & Contextcharper
 
RSS feeds using Millennium data
RSS feeds using Millennium dataRSS feeds using Millennium data
RSS feeds using Millennium data
Andrew Preater
 
DHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository InteroperabilityDHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository Interoperability
Access Innovations, Inc.
 
A Framework for Verifying the Fixity of Archived Web Resources
A Framework for Verifying the Fixity of Archived Web ResourcesA Framework for Verifying the Fixity of Archived Web Resources
A Framework for Verifying the Fixity of Archived Web Resources
maturban
 
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...
Gaurav Vaidya
 
RDM#2- The Distributed Web
RDM#2- The Distributed WebRDM#2- The Distributed Web
RDM#2- The Distributed Web
David Dias
 
Evolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic WebEvolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic Web
Ankit Solanki
 
Node.js Interactive
Node.js InteractiveNode.js Interactive
Node.js Interactive
David Dias
 
A Framework for Aggregating Private and Public Web Archives
A Framework for Aggregating Private and Public Web ArchivesA Framework for Aggregating Private and Public Web Archives
A Framework for Aggregating Private and Public Web Archives
jcdl2018
 
FamilySearch Developer Webinar
FamilySearch Developer WebinarFamilySearch Developer Webinar
FamilySearch Developer Webinar
David Hale
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
Justin Brunelle
 
Fetch It! A Custom Voyager service for Holds/Retrieval without using reporter
Fetch It! A Custom Voyager service for Holds/Retrieval without using reporterFetch It! A Custom Voyager service for Holds/Retrieval without using reporter
Fetch It! A Custom Voyager service for Holds/Retrieval without using reporter
Ray Schwartz
 

What's hot (20)

InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
 
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsScripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
 
Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content Language
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web Archives
 
Proposal for Text Mining PubAg
Proposal for Text Mining PubAgProposal for Text Mining PubAg
Proposal for Text Mining PubAg
 
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolEvaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
 
URI Disambiguation in the Context of Linked Data
URI Disambiguation in the Context of Linked DataURI Disambiguation in the Context of Linked Data
URI Disambiguation in the Context of Linked Data
 
Of Cataloging & Context
Of Cataloging & ContextOf Cataloging & Context
Of Cataloging & Context
 
RSS feeds using Millennium data
RSS feeds using Millennium dataRSS feeds using Millennium data
RSS feeds using Millennium data
 
DHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository InteroperabilityDHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository Interoperability
 
A Framework for Verifying the Fixity of Archived Web Resources
A Framework for Verifying the Fixity of Archived Web ResourcesA Framework for Verifying the Fixity of Archived Web Resources
A Framework for Verifying the Fixity of Archived Web Resources
 
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...
 
RDM#2- The Distributed Web
RDM#2- The Distributed WebRDM#2- The Distributed Web
RDM#2- The Distributed Web
 
Evolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic WebEvolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic Web
 
Node.js Interactive
Node.js InteractiveNode.js Interactive
Node.js Interactive
 
A Framework for Aggregating Private and Public Web Archives
A Framework for Aggregating Private and Public Web ArchivesA Framework for Aggregating Private and Public Web Archives
A Framework for Aggregating Private and Public Web Archives
 
FamilySearch Developer Webinar
FamilySearch Developer WebinarFamilySearch Developer Webinar
FamilySearch Developer Webinar
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
 
Fetch It! A Custom Voyager service for Holds/Retrieval without using reporter
Fetch It! A Custom Voyager service for Holds/Retrieval without using reporterFetch It! A Custom Voyager service for Holds/Retrieval without using reporter
Fetch It! A Custom Voyager service for Holds/Retrieval without using reporter
 

Similar to Profiling Web Archival Voids for Memento Routing

Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
Sawood Alam
 
Linked Data, Cultural Heritage & the Karma Mapping Software
Linked Data, Cultural Heritage & the Karma Mapping SoftwareLinked Data, Cultural Heritage & the Karma Mapping Software
Linked Data, Cultural Heritage & the Karma Mapping Software
Pedro Szekely
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Michael Nelson
 
Aqua Browser Implementation at Oklahoma State University
Aqua Browser Implementation at Oklahoma State UniversityAqua Browser Implementation at Oklahoma State University
Aqua Browser Implementation at Oklahoma State University
youthelectronix
 
November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...
November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...
November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...
National Information Standards Organization (NISO)
 
Optimizing Archival Replay by Eliminating Unnecessary Traffic to Web Archives
Optimizing Archival Replay by Eliminating Unnecessary Traffic to Web ArchivesOptimizing Archival Replay by Eliminating Unnecessary Traffic to Web Archives
Optimizing Archival Replay by Eliminating Unnecessary Traffic to Web Archives
Kritika Garg
 
JCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive ProfilingJCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive Profiling
Sawood Alam
 
Rewrapping your data – Providing a tech savvy interface to your data
Rewrapping your data – Providing a tech savvy interface to your dataRewrapping your data – Providing a tech savvy interface to your data
Rewrapping your data – Providing a tech savvy interface to your data
Markus Sköld
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital PreservationMat Kelly
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages
Michael Nelson
 
Descriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsDescriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory Institutions
E. Murphy
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive Profiling
Sawood Alam
 
Search Interfaces on the Web: Querying and Characterizing, PhD dissertation
Search Interfaces on the Web: Querying and Characterizing, PhD dissertationSearch Interfaces on the Web: Querying and Characterizing, PhD dissertation
Search Interfaces on the Web: Querying and Characterizing, PhD dissertation
Denis Shestakov
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
EUCLID project
 
Linked Data past, present and futures
Linked Datapast, present and futuresLinked Datapast, present and futures
Linked Data past, present and futures
Pierre-Yves Vandenbussche, Ph.D.
 
Why libraries should embrace Linked Data
Why libraries should embrace Linked DataWhy libraries should embrace Linked Data
Why libraries should embrace Linked Data
eby
 
Synchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesSynchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web Pages
Michael Nelson
 
Mezi snem a realitou. Otevřená data českého webového archivu.
Mezi snem a realitou. Otevřená data českého webového archivu.Mezi snem a realitou. Otevřená data českého webového archivu.
Mezi snem a realitou. Otevřená data českého webového archivu.
Webarchive of National Library of the Czech Republic
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the Internet
IRJET Journal
 

Similar to Profiling Web Archival Voids for Memento Routing (20)

Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
 
Linked Data, Cultural Heritage & the Karma Mapping Software
Linked Data, Cultural Heritage & the Karma Mapping SoftwareLinked Data, Cultural Heritage & the Karma Mapping Software
Linked Data, Cultural Heritage & the Karma Mapping Software
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
 
Aqua Browser Implementation at Oklahoma State University
Aqua Browser Implementation at Oklahoma State UniversityAqua Browser Implementation at Oklahoma State University
Aqua Browser Implementation at Oklahoma State University
 
November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...
November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...
November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...
 
Optimizing Archival Replay by Eliminating Unnecessary Traffic to Web Archives
Optimizing Archival Replay by Eliminating Unnecessary Traffic to Web ArchivesOptimizing Archival Replay by Eliminating Unnecessary Traffic to Web Archives
Optimizing Archival Replay by Eliminating Unnecessary Traffic to Web Archives
 
JCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive ProfilingJCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive Profiling
 
Rewrapping your data – Providing a tech savvy interface to your data
Rewrapping your data – Providing a tech savvy interface to your dataRewrapping your data – Providing a tech savvy interface to your data
Rewrapping your data – Providing a tech savvy interface to your data
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages
 
Descriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsDescriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory Institutions
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive Profiling
 
Search Interfaces on the Web: Querying and Characterizing, PhD dissertation
Search Interfaces on the Web: Querying and Characterizing, PhD dissertationSearch Interfaces on the Web: Querying and Characterizing, PhD dissertation
Search Interfaces on the Web: Querying and Characterizing, PhD dissertation
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
Linked Data past, present and futures
Linked Datapast, present and futuresLinked Datapast, present and futures
Linked Data past, present and futures
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
Why libraries should embrace Linked Data
Why libraries should embrace Linked DataWhy libraries should embrace Linked Data
Why libraries should embrace Linked Data
 
Synchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesSynchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web Pages
 
Mezi snem a realitou. Otevřená data českého webového archivu.
Mezi snem a realitou. Otevřená data českého webového archivu.Mezi snem a realitou. Otevřená data českého webového archivu.
Mezi snem a realitou. Otevřená data českého webového archivu.
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the Internet
 

More from Sawood Alam

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web Pages
Sawood Alam
 
CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection Insights
Sawood Alam
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
Sawood Alam
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web Packaging
Sawood Alam
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File Format
Sawood Alam
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in Go
Sawood Alam
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to Containerization
Sawood Alam
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorker
Sawood Alam
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Sawood Alam
 
Web Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchWeb Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext Search
Sawood Alam
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
Sawood Alam
 
TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web Archives
Sawood Alam
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web Archives
Sawood Alam
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Sawood Alam
 
Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015
Sawood Alam
 
Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015
Sawood Alam
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
Sawood Alam
 
HTTP Mailbox - Asynchronous RESTful Communication
HTTP Mailbox - Asynchronous RESTful CommunicationHTTP Mailbox - Asynchronous RESTful Communication
HTTP Mailbox - Asynchronous RESTful Communication
Sawood Alam
 

More from Sawood Alam (18)

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web Pages
 
CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection Insights
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web Packaging
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File Format
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in Go
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to Containerization
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorker
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorker
 
Web Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchWeb Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext Search
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web Archives
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web Archives
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
 
Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015
 
Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
HTTP Mailbox - Asynchronous RESTful Communication
HTTP Mailbox - Asynchronous RESTful CommunicationHTTP Mailbox - Asynchronous RESTful Communication
HTTP Mailbox - Asynchronous RESTful Communication
 

Recently uploaded

Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
muralinath2
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
pablovgd
 
platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
muralinath2
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
muralinath2
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
muralinath2
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
ossaicprecious19
 
filosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptxfilosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptx
IvanMallco1
 
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
NathanBaughman3
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
yusufzako14
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
Sérgio Sacani
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
muralinath2
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
kumarmathi863
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
Sérgio Sacani
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
Areesha Ahmad
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
sonaliswain16
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
DiyaBiswas10
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
Areesha Ahmad
 

Recently uploaded (20)

Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
filosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptxfilosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptx
 
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 

Profiling Web Archival Voids for Memento Routing

  • 1. Sawood Alam Internet Archive, San Francisco, CA Michael L. Nelson and Michele C. Weigle Old Dominion University, Norfolk, VA Profiling Web Archival Voids for Memento Routing #MementoMap #ArchivalVoids @ibnesayeed @WebSciDL Supported in part by NSF Grant IIS-1526700 JCDL '21, September 27-30, 2021, Urbana-Champaign, Illinois https://arxiv.org/abs/2108.03311
  • 2. @ibnesayeed | @WebSciDL MemGator Log Responses From Various Archives 2 93% of the requests made from MemGator to upstream archives were wasteful. Only about one third of the requests to the largest web archive (IA) were a hit.
  • 3. @ibnesayeed | @WebSciDL Who Bears the Cost of Bad Routing Decisions? 3 Actual Present in the Archive Not in the Archive Predicted Routed to the Archive True Positive (TP) False Positive (FP) Not Routed to the Archive False Negative (FN) True Negative (TN) FP: Wasteful (Infrastructure suffers) FN: Disuse (Users suffer)
  • 4. @ibnesayeed | @WebSciDL What is Archived in Arquivo.pt? What is Accessed from MemGator? 4 2B URI-Rs that have 1-9 mementos each in Arquivo.pt were never requested from ODU’s MemGator server. 43 URI-Rs were requested thousands of times each, but had zero mementos in Arquivo.pt. 45 URI-Rs had tens of mementos each that were requested hundreds of times.
  • 5. @ibnesayeed | @WebSciDL What is Archived in Arquivo.pt? What is Accessed from MemGator? 5 Blind spot of a usage-based profile Blind spot of a content-based profile
  • 6. @ibnesayeed | @WebSciDL MementoMap of Archival Holdings Profile 6 Alam et al., “MementoMap Framework for Flexible and Adaptive Web Archive Profiling”, JCDL 2019 https://arxiv.org/abs/1905.12607 1.5% Relative Cost yields 60% Accuracy. Arquivo.pt can save 60% wasted traffic by publishing a 119MB summary file of their 1.8T CDX files (containing 5B mementos of 2B URI-Rs)
  • 7. @ibnesayeed | @WebSciDL Why Profile Archival Voids? 7 $ curl -I https://web.archive.org/web/https://quora.com/ HTTP/1.1 403 FORBIDDEN Server: nginx/1.15.8 Date: Wed, 02 Dec 2020 20:39:33 GMT Content-Type: text/html; charset=utf-8 Connection: keep-alive Server-Timing: captures_list;dur=0.150497 X-App-Server: wwwb-app58 X-ts: 403 The Internet Archive has many “*.com” domains, but it may not want to capture or replay some.
  • 8. @ibnesayeed | @WebSciDL Sources Accessing TimeMaps 8 LANL’s usage-based Archival Holdings profiling reduced requests significantly. Profiling Archival Voids would improve it even further.
  • 9. @ibnesayeed | @WebSciDL Archival Voids Profiles Reduce False Positives 9 org,arxiv)/abs/a 40 org,arxiv)/abs/b 23 org,arxiv)/abs/c 17 org,arxiv)/format/a 15 org,arxiv)/format/b 20 org,arxiv)/format/c 10 org,arxiv)/search/a 30 ... org,arxiv)/abs/* 80 org,arxiv)/format/* 45 org,arxiv)/search/* 60 org,arxiv)/* 185 org,arxiv)/abs/d False Positive org,arxiv)/pdf/a org,arxiv)/pdf/b org,arxiv)/pdf/c False Positive org,arxiv)/* 185 org,arxiv)/pdf/* 0 How about summarizing frequently accessed URIs an archive does not hold?
  • 10. @ibnesayeed | @WebSciDL Arquivo.pt Access Log Dataset 10
  • 11. @ibnesayeed | @WebSciDL TimeMap Status Code Distribution and Fluctuations 11
  • 12. @ibnesayeed | @WebSciDL Most Frequently Accessed URIs 12 Most of the traffic to “fccn.pt” is originated from the UptimeRobot Always returned a “404 Not Found” response.
  • 13. @ibnesayeed | @WebSciDL 404-Only Frequencies and Request Savings 13 An archival voids profile of 2.4k URIs, that were accessed hundreds of times each or more, could have saved about 8.4% of wasted requests.
  • 14. @ibnesayeed | @WebSciDL Archival Voids Recommendations 14 ● Keep Archival Voids profiles separate from Archival Holdings ● Update often ● Use specific keys with only high confidence ● Profile only resources that are high in demand ● Archives themselves are better sources of truth than external observers https://arxiv.org/abs/2108.03311