
MementoMap Framework for Flexible and Adaptive Web Archive Profiling

In this work we propose MementoMap, a flexible and adaptive framework to efficiently summarize the holdings of a web archive. We describe a simple, yet extensible, file format suitable for MementoMap. We used the complete index of Arquivo.pt, comprising 5B mementos (archived web pages/files), to understand the nature and shape of its holdings. We generated MementoMaps with varying amounts of detail from its HTML pages that have an HTTP status code of 200 OK. Additionally, we designed a single-pass, memory-efficient, and parallelization-friendly algorithm to compact a large MementoMap into a small one, and an in-file binary search method for efficient lookup. We analyzed more than three years of MemGator (a Memento aggregator) logs to understand the response behavior of 14 public web archives. We evaluated MementoMaps by measuring their Accuracy using 3.3M unique URIs from MemGator logs. We found that a MementoMap of less than 1.5% Relative Cost (as compared to the comprehensive listing of all the unique original URIs) can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive while maintaining 100% Recall (i.e., zero false negatives).
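
The in-file binary search mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout (one "key value" pair per line, LF-terminated, sorted by the raw bytes of the key) and the function name `lookup` are assumptions made for illustration only. The idea is to bisect on byte offsets and realign to the next line start after each seek, so the file never has to be loaded into memory.

```python
import os
import tempfile


def lookup(path, key):
    """Binary-search a sorted text file for `key`; return its value or None.

    Assumed layout (hypothetical): one "key value" record per line,
    LF-terminated, sorted by the raw bytes of the key.
    """
    target = key.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.fstat(f.fileno()).st_size
        while lo < hi:
            mid = (lo + hi) // 2
            if mid > 0:
                # Step back one byte and consume the rest of that line,
                # which lands on the first line start at or after `mid`.
                f.seek(mid - 1)
                f.readline()
            else:
                f.seek(0)
            start = f.tell()
            line = f.readline()
            if not line or start >= hi:
                hi = mid  # no record starts in [mid, hi)
                continue
            found, _, value = line.rstrip(b"\n").partition(b" ")
            if found == target:
                return value.decode("utf-8")
            if found < target:
                lo = start + len(line)  # continue after this record
            else:
                hi = mid  # continue before this record
    return None


if __name__ == "__main__":
    # Made-up records, already sorted by key.
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"alpha 3\nbravo 10\ncharlie 1\ndelta 7\n")
    print(lookup(tmp.name, "charlie"))
    print(lookup(tmp.name, "echo"))
    os.remove(tmp.name)
```

Each probe costs one seek and at most two short reads, so the number of probes grows with the logarithm of the file size rather than with the number of records.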

MementoMap Framework
for Flexible and Adaptive
Web Archive Profiling
Sawood Alam, Michele C. Weigle, and Michael L. Nelson
Old Dominion University, Norfolk, VA, USA
Fernando Melo, Daniel Bicho, and Daniel Gomes
FCT: Arquivo.pt, Lisbon, Portugal
@ibnesayeed @WebSciDL @PT_WebArchive
Supported by NSF Grant IIS-1526700
JCDL '19, June 4, 2019, Urbana-Champaign, Illinois
$ memgator -a archives.json -f cdxj example.com \
> | grep -v "^!" | cut -d '/' -f 3 | sort | uniq -c | sort -nr
198014 web.archive.org
13548 wayback.archive-it.org
1191 webarchive.loc.gov
1044 swap.stanford.edu
953 arquivo.pt
525 wayback.vefsafn.is
225 perma-archives.org
221 archive.md
23 www.webarchive.org.uk
$ memgator -a archives.json -f cdxj jcdl.org \
> | grep -v "^!" | cut -d '/' -f 3 | sort | uniq -c | sort -nr
410 web.archive.org
2 www.webarchive.org.uk
2 arquivo.pt
1 archive.md
Cross-archive Memento Lookup With MemGator
https://github.com/oduwsdl/MemGator
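
The per-archive tally produced by the shell pipeline above can also be sketched in Python. This is an illustrative equivalent, not part of MemGator: the function name `count_by_host` and the sample lines are made up, and the sketch assumes, as the pipeline does, that metadata lines start with "!" and that the archive host is the third "/"-separated field of each remaining line. Unlike `cut`, it skips lines with fewer than three fields.

```python
from collections import Counter


def count_by_host(lines):
    """Tally lines by their third "/"-separated field, skipping "!" lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("!"):
            continue  # same filter as: grep -v "^!"
        fields = line.rstrip("\n").split("/")
        if len(fields) >= 3:
            counts[fields[2]] += 1  # same field as: cut -d '/' -f 3
    return counts.most_common()  # highest count first, like: sort -nr


# Made-up sample input, for illustration only.
SAMPLE_LINES = [
    '!context ["http://tools.ietf.org/html/rfc7089"]',
    '20180101000000 {"uri": "https://web.archive.org/web/20180101000000/http://example.com/"}',
    '20190101000000 {"uri": "https://arquivo.pt/wayback/20190101000000/http://example.com/"}',
    '20190604000000 {"uri": "https://web.archive.org/web/20190604000000/http://example.com/"}',
]

for host, count in count_by_host(SAMPLE_LINES):
    print(count, host)
```

To process real aggregator output, the same function could be given `sys.stdin` instead of the sample list.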
Memento Aggregator

  • 9. Broadcasting is Evil

      From: Michael Nelson [mailto:mln@cs.odu.edu]
      Sent: Wednesday, December 02, 2015 12:33 PM
      To: Jones, Gina
      Cc: Rourke, Patrick; Grotke, Abigail
      Subject: Re: WebSciDL

      Hi Gina, I'll investigate. memgator is software that one of my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not actually be scraping HTML pages.
      regards, Michael

      On Wed, 2 Dec 2015, Jones, Gina wrote:
      > Hi Michael, we have a slight configuration issue with the current OW
      > set up for our webarchives. I think, from looking at the logs, that
      > "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.
      > Do you know who is running this scraper? It's not part of memento is it?
      >
      > Gina Jones
      > Web Archiving Team
      > Library of Congress

      From: Ilya Kreymer <ikreymer@gmail.com>
      Date: Wed, 2 Dec 2015 10:33:56 -0800
      Subject: high traffic on oldweb!
      To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam <ibnesayeed@gmail.com>

      Hi Herbert, Sawood,
      Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily. I am thinking that the ability to remove source archives quickly is an important aspect of an aggregator.
      Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;)
      Ilya

      Broadcasting is wasteful; both clients and archives suffer!
  • 10. Memento Lookup Routing: Let’s fix the broadcasting issue with more informed routing.
  • 11. MemGator Log Responses from Various Archives: 93% of the requests made from MemGator to upstream archives were wasteful.
  • 12. What is Archived in Arquivo.pt? What is Accessed from MemGator? Figure labels: blind spot of a content-based profile; blind spot of a usage-based profile.
  • 13. If Only Archives Could Tell When to Ask Them
      ● Websites advertise their holdings using sitemap.xml, so why can’t archives?
          ○ Archives have billions or even hundreds of billions of URI-Ms
          ○ Such exhaustive lists would go stale very quickly
      ● How about robots.txt?
          ○ It is compact, but it is an exclusion format; it does not tell what the site has
          ○ It assumes a single domain; patterns are for paths (not the domain name)
      ● How about combining the two ideas?
          ○ Introducing MementoMap!
  • 14. A MementoMap Example
      https://github.com/oduwsdl/ORS/blob/master/ukvs.md

      !context ["http://oduwsdl.github.io/contexts/ukvs"]
      !id {uri: "http://archive.example.org/"}
      !fields {keys: ["surt"], values: ["frequency"]}
      !meta {type: "MementoMap", name: "A Test Web Archive", year: 1996}
      !meta {updated_at: "2018-09-03T13:27:52Z"}
      * 54321/20000
      com,* 10000+
      org,arxiv)/ 100
      org,arxiv)/* 2500~/900
      org,arxiv)/pdf/* 0
      uk,co,bbc)/images/* 300+/20-
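To make the layout of the example on slide 14 concrete, here is a minimal reading sketch. It assumes only what the example itself shows: lines starting with "!" are metadata, and every other line is a key followed by a space and a value. The function name `parse_mementomap` is hypothetical, and the sketch deliberately leaves header values and frequency values as raw strings rather than guessing at their full grammar.

```python
# The example from slide 14, reproduced as a string for illustration.
EXAMPLE = """\
!context ["http://oduwsdl.github.io/contexts/ukvs"]
!id {uri: "http://archive.example.org/"}
!fields {keys: ["surt"], values: ["frequency"]}
!meta {type: "MementoMap", name: "A Test Web Archive", year: 1996}
!meta {updated_at: "2018-09-03T13:27:52Z"}
* 54321/20000
com,* 10000+
org,arxiv)/ 100
org,arxiv)/* 2500~/900
org,arxiv)/pdf/* 0
uk,co,bbc)/images/* 300+/20-
"""


def parse_mementomap(lines):
    """Split lines into (name, raw_value) headers and (key, raw_value) entries."""
    headers, entries = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("!"):
            # Metadata line: "!name value"; the value is kept unparsed.
            name, _, value = line[1:].partition(" ")
            headers.append((name, value))
        else:
            # Data line: lookup key, one space, then the value field(s).
            key, _, value = line.partition(" ")
            entries.append((key, value))
    return headers, entries


if __name__ == "__main__":
    headers, entries = parse_mementomap(EXAMPLE.splitlines())
    for key, value in entries:
        print(f"{key!r:28} -> {value!r}")
```

Header values are kept as strings because the example uses unquoted object keys, which a strict JSON parser would reject.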
  • 15. SURTs Representation with Wildcard: Original SURTs did not have wildcards. In practice the common “http://(” prefix is removed.
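A rough sketch of how a URI maps to the prefix-less SURT keys seen in the MementoMap example (such as `org,arxiv)/pdf/...`) is given below. It is an assumption-laden simplification: the function name `to_surt` is hypothetical, and real SURT canonicalization typically applies further normalization (for instance of ports, query strings, or a leading "www") that is omitted here.

```python
from urllib.parse import urlsplit


def to_surt(uri):
    """Return a simplified SURT key: reversed host labels, ')', then the path."""
    # Allow scheme-less input such as "example.com/a" (an assumption).
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = (parts.hostname or "").lower()
    # "arxiv.org" becomes "org,arxiv"
    host_key = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    key = host_key + ")" + path
    if parts.query:
        key += "?" + parts.query
    return key


if __name__ == "__main__":
    for uri in ["http://arxiv.org/pdf/1234",
                "http://bbc.co.uk/images/logo.png",
                "example.com"]:
        print(uri, "->", to_surt(uri))
```

Reversing the host labels puts the most general part of the name first, so keys for the same TLD or domain sort next to each other and can be summarized by a single wildcard key.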
  • 16. Arquivo.pt Index Statistics: The Internet Archive is about 150 times bigger than Arquivo.pt.
  • 17. Top Arquivo.pt TLDs: Arquivo.pt was created to archive sites of interest to Portuguese people. Over time web archives collect many things they didn’t intend to and miss a lot they would have liked to archive.
  • 18. Who Would Have Thought Arquivo.pt Has 10K+ .онлайн Sites? “.онлайн” (encoded as “xn--80asehdb”) is an IDN gTLD which means “.online”.
  • 19. Distribution of URI-Ms over URI-Rs in Arquivo.pt: 70% of mementos belong to only 30% of URI-Rs.
  • 20. URI-M vs. URI-R Summary of Arquivo.pt
  • 21. Yearly URI-Rs, URI-Ms, and Status Codes in Arquivo.pt: Early years of data came from various other archives. The last two years are still in the embargo period.
  • 22. Cumulative Growth of URI-Ms and URI-Rs in Arquivo.pt: 50% of mementos were captured in the last two active years alone.
  • 23. Most Archived URI-Rs in Arquivo.pt: Arquivo is obsessed with transparent single pixel images and corner graphics.
  • 24. Unique Items With Exact Host and Path Depths: Where do we draw the line? 5+ or 10+ deep?
  • 25. HxPx Host and Path Depth Statistics of Arquivo.pt
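The host and path depth notion behind the HxPx statistics can be illustrated with a small sketch. It assumes that host depth counts the comma-separated labels of a SURT key, that path depth counts its non-empty path segments, and that a key deeper than the chosen limits is cut back and closed with a wildcard. The function names `depths` and `truncate` are hypothetical and are not taken from the MementoMap code.

```python
def _split(surt):
    """Split a prefix-less SURT key into host labels and path segments."""
    host, _, path = surt.partition(")")
    host_segs = [s for s in host.split(",") if s]
    path_segs = [s for s in path.split("/") if s]
    return host_segs, path_segs


def depths(surt):
    """Return (host_depth, path_depth) of a SURT key."""
    host_segs, path_segs = _split(surt)
    return len(host_segs), len(path_segs)


def truncate(surt, max_host, max_path):
    """Cut a key back to at most max_host labels and max_path segments."""
    host_segs, path_segs = _split(surt)
    if len(host_segs) > max_host:
        # Too many host labels: keep the leading ones, wildcard the rest.
        return ",".join(host_segs[:max_host]) + ",*"
    if len(path_segs) > max_path:
        # Too many path segments: keep the leading ones, wildcard the rest.
        return ",".join(host_segs) + ")/" + "/".join(path_segs[:max_path] + ["*"])
    return surt


if __name__ == "__main__":
    key = "org,arxiv)/pdf/1234/v2"
    print(depths(key))          # host and path depth of the key
    print(truncate(key, 2, 1))  # limited to 2 host labels and 1 path segment
```

Choosing smaller limits gives a coarser but more compact summary, which is the trade-off behind the "5+ or 10+ deep?" question on slide 24.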
  • 26. Shape of HxPx Key Tree of Arquivo.pt
  • 29. Processed Lines vs. Compacted MementoMap Growth
      Input keys:
          com,example)/a/1/x
          com,example)/a/2
          com,example)/a/3
          com,example)/b/1
          com,example)/b/2
          com,example)/c/1
      After rolling up the keys under /a:
          com,example)/a/*
          com,example)/b/1
          com,example)/b/2
          com,example)/c/1
      After rolling up the whole host:
          com,example)/*
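The roll-up shown on slide 29 can be reproduced with a simplified sketch. This is not the authors' compaction algorithm: it assumes sorted input, a fixed path depth at which to group keys, and a hypothetical `threshold` on the number of sibling keys that triggers replacement with a wildcard. The names `parent_key` and `compact` are made up for illustration.

```python
from itertools import groupby


def parent_key(surt, depth):
    """Return the key prefix made of the host and the first `depth` path segments."""
    host, _, path = surt.partition(")")
    segs = [s for s in path.split("/") if s]
    return host + ")/" + "".join(s + "/" for s in segs[:depth])


def compact(sorted_keys, depth, threshold):
    """Replace each run of `threshold` or more keys sharing a prefix with prefix + '*'."""
    out = []
    # groupby streams over the sorted input once, holding one group at a time.
    for prefix, group in groupby(sorted_keys, key=lambda k: parent_key(k, depth)):
        members = list(group)
        if len(members) >= threshold:
            out.append(prefix + "*")
        else:
            out.extend(members)
    return out


KEYS = [
    "com,example)/a/1/x",
    "com,example)/a/2",
    "com,example)/a/3",
    "com,example)/b/1",
    "com,example)/b/2",
    "com,example)/c/1",
]

if __name__ == "__main__":
    step1 = compact(KEYS, depth=1, threshold=3)
    step2 = compact(step1, depth=0, threshold=4)
    print(step1)
    print(step2)
```

Because the input is sorted, keys that share a prefix are mostly adjacent, so a single streaming pass is enough; edge cases such as a key without a trailing path segment sorting apart from its siblings are ignored in this sketch.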
  • 30. MementoMap Generation, Compaction, and Lookup: 1.5% Relative Cost yields 60% Accuracy. Arquivo.pt can save 60% of wasted traffic by publishing a 119 MB summary file!
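A lookup against a compacted MementoMap has to fall back from the exact key to progressively broader wildcard keys. The sketch below illustrates that idea using the entries from the slide 14 example. It is a stand-in, not the described method: it performs a binary search over an in-memory sorted list with `bisect` rather than searching within the file, and the order in which wildcard candidates are tried is an assumption. The names `candidate_keys` and `lookup` are hypothetical.

```python
from bisect import bisect_left

# Entries from the slide 14 example, sorted by key for binary search.
ENTRIES = sorted({
    "*": "54321/20000",
    "com,*": "10000+",
    "org,arxiv)/": "100",
    "org,arxiv)/*": "2500~/900",
    "org,arxiv)/pdf/*": "0",
    "uk,co,bbc)/images/*": "300+/20-",
}.items())
KEYS = [k for k, _ in ENTRIES]
VALUES = [v for _, v in ENTRIES]


def candidate_keys(surt):
    """Yield the exact key, then wildcard keys from most to least specific."""
    host, _, path = surt.partition(")")
    host_segs = host.split(",")
    path_segs = [s for s in path.split("/") if s]
    yield surt
    # Path wildcards: drop trailing path segments one at a time.
    for i in range(max(len(path_segs) - 1, 0), -1, -1):
        yield host + ")/" + "".join(s + "/" for s in path_segs[:i]) + "*"
    # Host wildcards: drop trailing host labels one at a time.
    for i in range(len(host_segs) - 1, 0, -1):
        yield ",".join(host_segs[:i]) + ",*"
    yield "*"


def lookup(sorted_keys, values, surt):
    """Return (matched_key, value) for the most specific matching entry, or None."""
    for key in candidate_keys(surt):
        i = bisect_left(sorted_keys, key)
        if i < len(sorted_keys) and sorted_keys[i] == key:
            return key, values[i]
    return None


if __name__ == "__main__":
    for q in ["org,arxiv)/pdf/1234", "org,arxiv)/abs/1234", "com,example)/"]:
        print(q, "->", lookup(KEYS, VALUES, q))
```

In this example a lookup under `org,arxiv)/pdf/` matches the entry whose value is `0`, which is how an aggregator could decide not to query the archive for such URIs.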
  • 31. Dissemination and Discovery Methods
      Well-known URI:
          GET /.well-known/mementomap HTTP/1.1
          Host: arquivo.pt
      Link header:
          Link: <https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap"
      Link HTML element:
          <link href="https://arquivo.pt/path/to/mementomap.ukvs" rel="mementomap">
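On the client side, the discovery methods on slide 31 amount to building the well-known URI for an archive or picking the `rel="mementomap"` target out of a Link header. The sketch below shows both without making any network requests. Note that "mementomap" is proposed in these slides as a well-known suffix and link relation, so this should not be read as describing a registered standard; the function names are hypothetical and the Link header parsing is a deliberately simple regular-expression approach that ignores edge cases such as commas inside quoted parameters.

```python
import re
from urllib.parse import urljoin

# Matches "<target>" followed by its ";"-separated parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[^,;]+)*)')


def well_known_uri(archive_base):
    """Build the proposed well-known MementoMap URI for an archive base URI."""
    return urljoin(archive_base, "/.well-known/mementomap")


def find_mementomap(link_header):
    """Return the target of the first link whose rel includes 'mementomap'."""
    for target, params in LINK_RE.findall(link_header):
        m = re.search(r'rel\s*=\s*"?([^";]+)"?', params)
        # A rel value may hold several space-separated relation types.
        if m and "mementomap" in m.group(1).split():
            return target
    return None


if __name__ == "__main__":
    print(well_known_uri("https://arquivo.pt/wayback/"))
    header = '<https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap"'
    print(find_mementomap(header))
```

A client could try the Link header of an ordinary archive response first and fall back to the well-known URI, although the slides do not prescribe an order.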
  • 32. Future Work
      ● Generate MementoMap on the whole index, not a sample
      ● Generate blacklists by processing access logs
      ● Incorporate MementoMap in replay systems
      ● Encourage archives and aggregators to adopt it
  • 33. Conclusions
      ● Described MementoMap, a flexible and efficient archive profiling framework
      ● Analyzed the complete index of Arquivo.pt to understand the nature of web archives
      ● Evaluated MementoMap against Arquivo.pt’s index
      ● Save 60% of the wasted MemGator traffic with 1.5% cost (a 119 MB file)
      ● Proposed “mementomap” as a well-known URI suffix as well as a link relation for dissemination of MementoMap
      ● Implemented a single-pass, memory-efficient, and parallelization-friendly MementoMap generation/compaction algorithm
      ● Open-sourced the implementation
          ○ https://github.com/oduwsdl/MementoMap