MementoMap
An Archive Profile Dissemination Framework
Sawood Alam, Michele C. Weigle, and Michael L. Nelson
Old Dominion University, Norfolk, VA, USA
@ibnesayeed @WebSciDL
Supported by NSF Grant IIS-1526700
WADL '19, June 6, 2019, Urbana-Champaign, Illinois
$ memgator -a archives.json -f cdxj example.com \
> | grep -v "^!" | cut -d '/' -f 3 | sort | uniq -c | sort -nr
198014 web.archive.org
13548 wayback.archive-it.org
1191 webarchive.loc.gov
1044 swap.stanford.edu
953 arquivo.pt
525 wayback.vefsafn.is
225 perma-archives.org
221 archive.md
23 www.webarchive.org.uk
$ memgator -a archives.json -f cdxj jcdl.org \
> | grep -v "^!" | cut -d '/' -f 3 | sort | uniq -c | sort -nr
410 web.archive.org
2 www.webarchive.org.uk
2 arquivo.pt
1 archive.md
Cross-archive Memento Lookup With MemGator
https://github.com/oduwsdl/MemGator
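The shell pipeline above can be mirrored in a few lines of Python. This is only a sketch: it assumes, as the pipeline itself does, that header lines start with "!" and that the archive host is the third "/"-separated field of each data line. The function name count_archive_hosts is hypothetical and is not part of MemGator.

```python
from collections import Counter


def count_archive_hosts(cdxj_text):
    """Count mementos per archive host, like grep | cut | sort | uniq -c."""
    hosts = Counter()
    for line in cdxj_text.splitlines():
        # Skip blank lines and "!"-prefixed header lines (grep -v "^!").
        if not line.strip() or line.startswith("!"):
            continue
        # Third "/"-separated field is the host (cut -d '/' -f 3).
        fields = line.split("/")
        if len(fields) >= 3:
            hosts[fields[2]] += 1
    # Highest counts first (sort | uniq -c | sort -nr).
    return hosts.most_common()
```

Feeding it the saved output of a memgator CDXJ request should give the same per-archive counts as the pipeline, as (host, count) pairs.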
Memento Aggregator
Broadcasting is Evil
From: Michael Nelson [mailto:mln@cs.odu.edu]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the
traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP
addr from where you're seeing the traffic? I presume the requests are for Memento
TimeMaps? It should not being actually scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.
> Do you know who is running this scraper? Itʼs not part of memento is it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam
<ibnesayeed@gmail.com>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has
gotten really high, and also I was asked to remove an archive due to the traffic it was
causing temporarily..
I am thinking that ability to remove source archives quickly is an important aspect of an
aggregator.
Sawood: Hopefully yours will support something like this so I don't need to restart the
container to change the archivelist ;)
Ilya
Broadcasting is wasteful, both clients & archives suffer!
Memento Lookup Routing
Let’s fix the broadcasting issue
with more informed routing.
MemGator Log Responses from Various Archives
93% of the requests
made from MemGator
to upstream archives
were wasteful.
What is Archived in Arquivo.pt?
What is Accessed from MemGator?
Blind spot of a
content-based profile
Blind spot of a
usage-based profile
If Only Archives Could Tell When to Ask Them
● Websites advertise their holdings using sitemap.xml; why can’t archives?
○ Archives have billions or even hundreds of billions of URI-Ms
○ Such exhaustive lists would go stale very quickly
● How about robots.txt?
○ It is compact, but it is an exclusion format; it does not tell what the site has
○ It assumes a single domain; patterns are for paths (not the domain name)
● How about combining the two ideas?
○ Introducing MementoMap!
A MementoMap Example
!context ["http://oduwsdl.github.io/contexts/ukvs"]
!id {uri: "http://archive.example.org/"}
!fields {keys: ["surt"], values: ["frequency"]}
!meta {type: "MementoMap", name: "A Test Web Archive", year: 1996}
!meta {updated_at: "2018-09-03T13:27:52Z"}
* 54321/20000
com,* 10000+
org,arxiv)/ 100
org,arxiv)/* 2500~/900
org,arxiv)/pdf/* 0
uk,co,bbc)/images/* 300+/20-
+ for a lower boundary
- for an upper boundary
~ for an approximate value
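A minimal Python sketch of how a client might read the data lines of the example above. It assumes, which the slide does not state explicitly, that each value holds one or two "/"-separated counts and that the optional suffix on each count follows the legend (+ lower bound, - upper bound, ~ approximate). The names Count, parse_count, and parse_line are hypothetical and are not taken from the MementoMap implementation.

```python
import re
from typing import NamedTuple


class Count(NamedTuple):
    value: int
    bound: str  # "exact", "lower", "upper", or "approx"


# Suffix meanings taken from the legend on the slide.
SUFFIXES = {"": "exact", "+": "lower", "-": "upper", "~": "approx"}
COUNT_RE = re.compile(r"^(\d+)([+\-~]?)$")


def parse_count(token):
    """Parse a single count such as "10000+" or "2500~"."""
    match = COUNT_RE.match(token)
    if match is None:
        raise ValueError(f"unrecognized count: {token!r}")
    return Count(int(match.group(1)), SUFFIXES[match.group(2)])


def parse_line(line):
    """Return (surt_key, [Count, ...]), or None for blank and "!" header lines."""
    line = line.strip()
    if not line or line.startswith("!"):
        return None
    # The key never contains a space, so the first space separates key and value.
    key, _, value = line.partition(" ")
    return key, [parse_count(token) for token in value.strip().split("/")]
```

For instance, parse_line("uk,co,bbc)/images/* 300+/20-") yields the key "uk,co,bbc)/images/*" with a lower-bounded 300 and an upper-bounded 20.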
Unified Key Value Store (UKVS)
https://github.com/oduwsdl/ORS/blob/master/ukvs.md
UKVS Optional Fields vs. Key Columns
UKVS Use Cases
● MementoMap
● CDX File/Server
● Archive ACL
● Archive Fixity
● Extended TimeMap
● … and many more
SURTs Representation with Wildcard
Original SURTs did not have wildcards.
In practice the common “http://(” prefix
is removed.
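As a rough illustration of the prefix-less SURT form used as MementoMap keys, the Python sketch below reverses the host labels and appends the path. It is a simplification under stated assumptions: real SURT canonicalization applies further normalization rules that are skipped here, and to_surt is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def to_surt(url):
    """Turn a URL into a simplified SURT key without the "http://(" prefix."""
    # Accept scheme-less input such as "example.com/a" by assuming http.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    labels = [label for label in host.split(".") if label]
    # Reverse the host labels: arxiv.org becomes org,arxiv
    key = ",".join(reversed(labels)) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key
```

With this helper, http://arxiv.org/pdf/1234 maps to org,arxiv)/pdf/1234, which a wildcard key such as org,arxiv)/pdf/* would cover.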
Arquivo.pt Index Statistics
The Internet Archive is
about 150 times bigger
than Arquivo.pt.
Shape of HxPx Key Tree of Arquivo.pt
Who Would Have Thought
Arquivo.pt has 10K+ .онлайн Sites?
“.онлайн”
(encoded as “xn--80asehdb”)
is an IDN gTLD which means
“.online”
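The encoded form can be reproduced with Python's built-in "idna" codec, shown here as a small sketch. The helper names to_ascii_label and to_unicode_label are hypothetical; the codec implements the older IDNA 2003 rules, which is sufficient for this label.

```python
def to_ascii_label(label):
    """Encode an internationalized label to its ASCII (Punycode) form."""
    return label.encode("idna").decode("ascii")


def to_unicode_label(ascii_label):
    """Decode an "xn--" label back to its Unicode form."""
    return ascii_label.encode("ascii").decode("idna")


# The Cyrillic gTLD from the slide and its round trip.
print(to_ascii_label("онлайн"))
print(to_unicode_label("xn--80asehdb"))
```

An index keyed on SURTs stores the ASCII form, so a report on top-level domains needs the decoding step to show readable names.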
Most Archived URI-Rs in Arquivo.pt
Arquivo is obsessed with transparent single pixel images and corner graphics.
Processed Lines vs. Compacted MementoMap Growth
com,example)/a/1/x
com,example)/a/2
com,example)/a/3
com,example)/b/1
com,example)/b/2
com,example)/c/1
com,example)/a/*
com,example)/b/1
com,example)/b/2
com,example)/c/1
com,example)/*
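The three stages listed above (six keys, then four, then one) can be reproduced with a simple roll-up step, sketched below in Python. This is an illustration only, not the authors' single-pass algorithm: the function name rollup, the fixed depth argument, and the threshold of three children are all assumptions chosen to match the example.

```python
def rollup(keys, depth, threshold):
    """Collapse keys sharing a host and the first `depth` path segments.

    A group with at least `threshold` members is replaced by one wildcard
    key; smaller groups are kept unchanged.
    """
    groups = {}
    for key in keys:
        host, _, path = key.partition(")/")
        prefix = (host, tuple(path.split("/")[:depth]))
        groups.setdefault(prefix, []).append(key)

    compacted = []
    for (host, segments), members in groups.items():
        if len(members) >= threshold:
            compacted.append(host + ")/" + "/".join(segments + ("*",)))
        else:
            compacted.extend(members)
    return compacted


keys = [
    "com,example)/a/1/x", "com,example)/a/2", "com,example)/a/3",
    "com,example)/b/1", "com,example)/b/2", "com,example)/c/1",
]
stage2 = rollup(keys, depth=1, threshold=3)    # only the /a/ group collapses
stage3 = rollup(stage2, depth=0, threshold=3)  # everything under the host collapses
```

Each roll-up trades lookup precision for a smaller file, which is the trade-off behind the relative cost and accuracy figures reported in the talk.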
MementoMap Generation, Compaction, and Lookup
1.5% Relative Cost yields 60% Accuracy.
Arquivo.pt can save 60% of wasted traffic
by publishing a 119 MB summary file!
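One way an aggregator could use such a summary for routing is a most-specific-match lookup, sketched below in Python with the keys from the earlier MementoMap example. It assumes that the longest matching key wins and that a value of 0 means the archive should not be asked; the names candidates, lookup, and EXAMPLE_MAP are hypothetical.

```python
# Keys and values copied from the MementoMap example shown earlier.
EXAMPLE_MAP = {
    "*": "54321/20000",
    "com,*": "10000+",
    "org,arxiv)/": "100",
    "org,arxiv)/*": "2500~/900",
    "org,arxiv)/pdf/*": "0",
    "uk,co,bbc)/images/*": "300+/20-",
}


def candidates(surt):
    """Yield lookup keys for a SURT, from most to least specific."""
    host, _, path = surt.partition(")/")
    yield surt
    # Path wildcards: org,arxiv)/pdf/* and then org,arxiv)/*
    segments = path.split("/")
    for i in range(len(segments) - 1, -1, -1):
        yield host + ")/" + "/".join(segments[:i] + ["*"])
    # Host wildcards: org,* and finally the catch-all *
    labels = host.split(",")
    for i in range(len(labels) - 1, 0, -1):
        yield ",".join(labels[:i] + ["*"])
    yield "*"


def lookup(mementomap, surt):
    """Return the (key, value) of the most specific matching entry, or None."""
    for key in candidates(surt):
        if key in mementomap:
            return key, mementomap[key]
    return None
```

Under these assumptions, lookup(EXAMPLE_MAP, "org,arxiv)/pdf/1234") resolves to the org,arxiv)/pdf/* entry with value 0, so the aggregator could skip this archive for that request.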
Dissemination and Discovery Methods
Well-known URI
GET /.well-known/mementomap HTTP/1.1
Host: arquivo.pt
Link Header
Link: <https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap"
Link HTML Element
<link href="https://arquivo.pt/path/to/mementomap.ukvs" rel="mementomap">
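A client-side sketch of two of these discovery methods in Python, without any network calls. It assumes the proposed "mementomap" well-known suffix and link relation from the slide; the helper names well_known_url and find_mementomap_link are hypothetical, and the Link header parsing is deliberately simple (it does not handle commas inside quoted parameter values).

```python
import re
from urllib.parse import urljoin

# One link-value: <target> followed by zero or more ";param" parts.
LINK_RE = re.compile(r"<([^>]*)>\s*((?:;\s*[^,;]+)*)")
REL_RE = re.compile(r'rel\s*=\s*"?([^";]+)"?')


def well_known_url(archive_base):
    """Build the proposed well-known MementoMap URL for an archive."""
    return urljoin(archive_base, "/.well-known/mementomap")


def find_mementomap_link(link_header):
    """Return the target of the first link with rel="mementomap", or None."""
    for match in LINK_RE.finditer(link_header):
        target, params = match.group(1), match.group(2)
        rel = REL_RE.search(params)
        # A rel value may list several space-separated relation types.
        if rel and "mementomap" in rel.group(1).split():
            return target
    return None
```

A client could try the well-known URL first and fall back to the Link header of any response it receives from the archive.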
Future Work
● Generate blacklists by processing access logs
● Incorporate MementoMap in replay systems
● Encourage archives and aggregators to adopt it
● Encourage use of UKVS in other archival and non-archival contexts
Conclusions
● Described MementoMap, a flexible and efficient archive profiling framework
● Analyzed the complete index of Arquivo.pt to understand the nature of web archives
● Evaluated MementoMap against Arquivo.pt’s index
● Showed that 60% of the wasted MemGator traffic can be saved at 1.5% relative cost (a 119 MB file)
● Proposed “mementomap” as a well-known URI suffix as well as a link relation
for dissemination of MementoMap
● Implemented a single-pass, memory-efficient, and parallelization-friendly
MementoMap generation/compaction algorithm
● Open-sourced the implementation
○ https://github.com/oduwsdl/MementoMap