MementoMap: An Archive Profile Dissemination Framework

We introduce MementoMap, a framework to express and disseminate holdings of web archives (archive profiles) by themselves or third parties. The framework allows arbitrary, flexible, and dynamic levels of details in its entries that fit the needs of archives of different scales. This enables Memento aggregators to significantly reduce wasted traffic to web archives.

Published in: Internet
  1. 1. MementoMap An Archive Profile Dissemination Framework Sawood Alam, Michele C. Weigle, and Michael L. Nelson Old Dominion University, Norfolk, VA, USA @ibnesayeed @WebSciDL Supported by NSF Grant IIS-1526700 WADL '19, June 6, 2019, Urbana-Champaign, Illinois
  2. 2. @ibnesayeed 2
     $ memgator -a archives.json -f cdxj example.com | grep -v "^!" | cut -d '/' -f 3 | sort | uniq -c | sort -nr
     198014 web.archive.org
     13548 wayback.archive-it.org
     1191 webarchive.loc.gov
     1044 swap.stanford.edu
     953 arquivo.pt
     525 wayback.vefsafn.is
     225 perma-archives.org
     221 archive.md
     23 www.webarchive.org.uk
     $ memgator -a archives.json -f cdxj jcdl.org | grep -v "^!" | cut -d '/' -f 3 | sort | uniq -c | sort -nr
     410 web.archive.org
     2 www.webarchive.org.uk
     2 arquivo.pt
     1 archive.md
     Cross-archive Memento Lookup With MemGator https://github.com/oduwsdl/MemGator
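
The pipeline above tallies mementos per archive host. A rough Python equivalent is sketched below. It mirrors the shell steps literally (skip lines starting with `!`, take the third `/`-separated field, count). The sample lines are invented for illustration, and `count_hosts` is a hypothetical helper name, not part of MemGator.

```python
from collections import Counter

def count_hosts(lines):
    """Tally the third '/'-separated field of every line not starting with '!'."""
    counts = Counter()
    for line in lines:
        if line.startswith("!"):   # grep -v "^!": skip metadata lines
            continue
        fields = line.split("/")   # cut -d '/' -f 3
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts.most_common()    # sort | uniq -c | sort -nr

# Invented sample input shaped like a CDXJ TimeMap (assumption).
sample = [
    '!context ["http://tools.ietf.org/html/rfc7089"]',
    '20020120142510 {"uri": "http://web.archive.org/web/20020120142510/http://example.com/"}',
    '20050101000000 {"uri": "http://web.archive.org/web/20050101000000/http://example.com/"}',
    '20100101000000 {"uri": "http://arquivo.pt/wayback/20100101000000/http://example.com/"}',
]

for host, n in count_hosts(sample):
    print(n, host)
```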
  3. 3. @ibnesayeed Memento Aggregator 3
  4. 4. @ibnesayeed Memento Aggregator 4
  5. 5. @ibnesayeed Memento Aggregator 5
  6. 6. @ibnesayeed Memento Aggregator 6
  7. 7. @ibnesayeed Memento Aggregator 7
  8. 8. @ibnesayeed Memento Aggregator 8
  9. 9. @ibnesayeed Broadcasting is Evil 9
     From: Michael Nelson [mailto:mln@cs.odu.edu]
     Sent: Wednesday, December 02, 2015 12:33 PM
     To: Jones, Gina
     Cc: Rourke, Patrick; Grotke, Abigail
     Subject: Re: WebSciDL
     Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not being actually scraping HTML pages. regards, Michael
     On Wed, 2 Dec 2015, Jones, Gina wrote:
     > Hi Michael, we have a slight configuration issue with the current OW
     > set up for our webarchives. I think, from looking at the logs, that
     > "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.
     > Do you know who is running this scraper? It's not part of memento is it?
     >
     > Gina Jones
     > Web Archiving Team
     > Library of Congress
     From: Ilya Kreymer <ikreymer@gmail.com>
     Date: Wed, 2 Dec 2015 10:33:56 -0800
     Subject: high traffic on oldweb!
     To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam <ibnesayeed@gmail.com>
     Hi Herbert, Sawood, Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily.. I am thinking that ability to remove source archives quickly is an important aspect of an aggregator. Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;) Ilya
     Broadcasting is wasteful, both clients & archives suffer!
  10. 10. @ibnesayeed Memento Lookup Routing 10 Let’s fix the broadcasting issue with a more informed routing.
  11. 11. @ibnesayeed MemGator Log Responses from Various Archives 11 93% of the requests made from MemGator to upstream archives were wasteful.
  12. 12. @ibnesayeed What is Archived in Arquivo.pt? What is Accessed from MemGator? 12 Blind spot of a content-based profile Blind spot of a usage-based profile
  13. 13. @ibnesayeed If Only Archives Could Tell When to Ask Them
      ● Websites advertise their holdings using sitemap.xml, why can’t archives?
        ○ Archives have billions or even hundreds of billions of URI-Ms
        ○ Such exhaustive lists would go stale very quickly
      ● How about robots.txt?
        ○ It is compact, but it is an exclusion format; it does not tell what the site has
        ○ It assumes a single domain; patterns are for paths (not the domain name)
      ● How about combining the two ideas?
        ○ Introducing MementoMap!
  14. 14. @ibnesayeed A MementoMap Example 14
      !context ["http://oduwsdl.github.io/contexts/ukvs"]
      !id {uri: "http://archive.example.org/"}
      !fields {keys: ["surt"], values: ["frequency"]}
      !meta {type: "MementoMap", name: "A Test Web Archive", year: 1996}
      !meta {updated_at: "2018-09-03T13:27:52Z"}
      * 54321/20000
      com,* 10000+
      org,arxiv)/ 100
      org,arxiv)/* 2500~/900
      org,arxiv)/pdf/* 0
      uk,co,bbc)/images/* 300+/20-
      + for a lower boundary
      - for an upper boundary
      ~ for an approximate value
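
To make the example's line structure concrete, here is a minimal parsing sketch. It assumes one entry per line, `!`-prefixed header lines, and a whitespace-separated key and value. The slide does not say what the two `/`-separated numbers stand for, so the code keeps them as an ordered list without naming them. `parse_value` and `parse_mementomap` are hypothetical names, not the API of the MementoMap tool.

```python
import re

def parse_value(text):
    """Split a value such as '2500~/900' into [(2500, '~'), (900, '')]."""
    parts = []
    for part in text.split("/"):
        m = re.fullmatch(r"(\d+)([+~-]?)", part)
        if not m:
            raise ValueError(f"unrecognized value: {part!r}")
        # Qualifier: '+' lower boundary, '-' upper boundary, '~' approximate.
        parts.append((int(m.group(1)), m.group(2)))
    return parts

def parse_mementomap(lines):
    """Return (header_lines, {key: parsed_value}) for a MementoMap-like text."""
    headers, entries = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("!"):
            headers.append(line)          # kept verbatim, not interpreted
        else:
            key, value = line.split(None, 1)
            entries[key] = parse_value(value)
    return headers, entries

example = """\
!id {uri: "http://archive.example.org/"}
!fields {keys: ["surt"], values: ["frequency"]}
* 54321/20000
com,* 10000+
org,arxiv)/ 100
org,arxiv)/* 2500~/900
org,arxiv)/pdf/* 0
uk,co,bbc)/images/* 300+/20-
""".splitlines()

headers, entries = parse_mementomap(example)
print(entries["org,arxiv)/*"])
```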
  15. 15. @ibnesayeed Unified Key Value Store (UKVS) 15 https://github.com/oduwsdl/ORS/blob/master/ukvs.md
  16. 16. @ibnesayeed UKVS Optional Fields vs. Key Columns 16
  17. 17. @ibnesayeed UKVS Use Cases
      ● MementoMap
      ● CDX File/Server
      ● Archive ACL
      ● Archive Fixity
      ● Extended TimeMap
      ● … and many more
  18. 18. @ibnesayeed SURTs Representation with Wildcard 18 Original SURTs did not have wildcards. In practice the common “http://(” prefix is removed.
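
As a rough illustration of the key format, the sketch below turns a URL into a SURT-style key with the common “http://(” prefix dropped, as the slide describes. It is deliberately simplified: production SURT canonicalizers apply further normalization (for example, many strip a leading `www`), which this sketch does not attempt. `to_surt` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

def to_surt(url):
    """Reverse the host labels, join them with commas, then append ')' and the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key

print(to_surt("http://arxiv.org/pdf/1234"))
print(to_surt("http://example.com"))
```

A wildcard key such as `org,arxiv)/pdf/*` then covers every key that shares the prefix before the `*`.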
  19. 19. @ibnesayeed Arquivo.pt Index Statistics 19 The Internet Archive is about 150 times bigger than Arquivo.pt.
  20. 20. @ibnesayeed Shape of HxPx Key Tree of Arquivo.pt 20
  21. 21. @ibnesayeed Who Would have Thought Arquivo.pt has 10K+ .онлайн Sites? 21 “.онлайн” (encoded as “xn--80asehdb”) is an IDN gTLD which means “.online”
  22. 22. @ibnesayeed Most Archived URI-Rs in Arquivo.pt 22 Arquivo is obsessed with transparent single pixel images and corner graphics.
  23. 23. @ibnesayeed Processed Lines vs. Compacted MementoMap Growth 23
      com,example)/a/1/x
      com,example)/a/2
      com,example)/a/3
      com,example)/b/1
      com,example)/b/2
      com,example)/c/1

      com,example)/a/*
      com,example)/b/1
      com,example)/b/2
      com,example)/c/1

      com,example)/*
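
The three key lists above show the same holdings at increasing levels of compaction. The sketch below reproduces that progression with a simple roll-up rule: group keys by their path prefix at a chosen depth and replace any group with at least `min_children` members by a single wildcard key. This is an illustrative simplification, not the authors' single-pass algorithm. `prefix_at_depth`, `rollup`, and both parameters are hypothetical.

```python
from collections import defaultdict

def prefix_at_depth(key, depth):
    """Return the key's prefix covering `depth` path segments, or None if too shallow."""
    host, sep, path = key.partition(")/")
    if not sep:
        return None                      # not in 'host)/path' form
    segments = path.split("/")
    if len(segments) <= depth:
        return None
    return host + ")/" + "".join(s + "/" for s in segments[:depth])

def rollup(keys, depth, min_children):
    """Collapse groups of keys sharing a prefix into one 'prefix*' wildcard key."""
    groups = defaultdict(list)
    for key in keys:
        groups[prefix_at_depth(key, depth)].append(key)
    out = []
    for prefix, members in groups.items():
        if prefix is not None and len(members) >= min_children:
            out.append(prefix + "*")
        else:
            out.extend(members)
    return sorted(out)

stage1 = [
    "com,example)/a/1/x", "com,example)/a/2", "com,example)/a/3",
    "com,example)/b/1", "com,example)/b/2", "com,example)/c/1",
]
stage2 = rollup(stage1, depth=1, min_children=3)
stage3 = rollup(stage2, depth=0, min_children=3)
print(stage2)
print(stage3)
```

Each roll-up shrinks the map at the price of precision: after the last step the map only says that the archive holds something under example.com.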
  24. 24. @ibnesayeed MementoMap Generation, Compaction, and Lookup 24 A 1.5% relative cost yields 60% accuracy. Arquivo.pt can save 60% of wasted traffic by publishing a 119 MB summary file!
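
A lookup against such a map can be sketched as a most-specific-match search: try the exact key, then progressively broader wildcard keys. The code below is an assumption about how an aggregator might consult a MementoMap, not the authors' implementation. It uses the entries from the earlier "A MementoMap Example" slide with the values left as raw strings, and `candidate_keys` and `lookup` are hypothetical names.

```python
def candidate_keys(surt):
    """Yield the exact key, then wildcard keys from most to least specific."""
    host, _, path = surt.partition(")/")
    yield surt
    segments = path.split("/") if path else []
    for i in range(len(segments) - 1, -1, -1):    # shorten the path
        yield host + ")/" + "".join(s + "/" for s in segments[:i]) + "*"
    labels = host.split(",")
    for i in range(len(labels) - 1, 0, -1):       # shorten the host
        yield ",".join(labels[:i]) + ",*"
    yield "*"

def lookup(mementomap, surt):
    """Return (matched_key, value) for the most specific match, or None."""
    for key in candidate_keys(surt):
        if key in mementomap:
            return key, mementomap[key]
    return None

# Entries copied from the example slide; values kept as raw strings.
mm = {
    "*": "54321/20000",
    "com,*": "10000+",
    "org,arxiv)/": "100",
    "org,arxiv)/*": "2500~/900",
    "org,arxiv)/pdf/*": "0",
    "uk,co,bbc)/images/*": "300+/20-",
}

print(lookup(mm, "org,arxiv)/pdf/1234"))
print(lookup(mm, "org,arxiv)/abs/1"))
```

Under this reading, a match on `org,arxiv)/pdf/*` with a value of 0 would let an aggregator skip the archive for that URI instead of broadcasting the request.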
  25. 25. @ibnesayeed Dissemination and Discovery Methods 25
      Well-known URI:
      GET /.well-known/mementomap HTTP/1.1
      Host: arquivo.pt
      Link Header:
      Link: <https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap"
      Link HTML Element:
      <link href="https://arquivo.pt/path/to/mementomap.ukvs" rel="mementomap">
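
On the client side, discovery could look roughly like the sketch below: build the well-known URI for an archive, or pull the `rel="mementomap"` target out of a `Link` header value. The regular expression is a simplification and not a complete Web Linking parser. The `/path/to/mementomap.ukvs` URL is the slide's own placeholder, and `well_known_url` and `find_mementomap` are hypothetical helper names. No network request is made.

```python
import re
from urllib.parse import urljoin

WELL_KNOWN_PATH = "/.well-known/mementomap"   # suffix proposed on the slide

def well_known_url(archive_base):
    """Build the well-known MementoMap URI for an archive's base URL."""
    return urljoin(archive_base, WELL_KNOWN_PATH)

def find_mementomap(link_header):
    """Return the target of the first link whose rel includes 'mementomap'."""
    for target, params in re.findall(r"<([^>]*)>([^<]*)", link_header):
        m = re.search(r'\brel\s*=\s*"?([^";,]+)"?', params)
        if m and "mementomap" in m.group(1).lower().split():
            return target
    return None

header = '<https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap"'
print(well_known_url("https://arquivo.pt/"))
print(find_mementomap(header))
```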
  26. 26. @ibnesayeed Future Work
      ● Generate blacklists by processing access logs
      ● Incorporate MementoMap in replay systems
      ● Encourage archives and aggregators to adopt it
      ● Encourage use of UKVS in other archival and non-archival contexts
  27. 27. @ibnesayeed Conclusions
      ● Described MementoMap - a flexible and efficient archive profiling framework
      ● Analyzed complete index of Arquivo.pt to understand nature of web archives
      ● Evaluated MementoMap against Arquivo.pt’s index
      ● Save 60% of the wasted MemGator traffic with 1.5% cost (a 119 MB file)
      ● Proposed “mementomap” as a well-known URI suffix as well as a link relation for dissemination of MementoMap
      ● Implemented a single-pass, memory-efficient, and parallelization-friendly MementoMap generation/compaction algorithm
      ● Open-sourced the implementation
        ○ https://github.com/oduwsdl/MementoMap
