This presentation discusses profiling web archives so that a Memento aggregator can route requests efficiently. It examines different profiling policies and their cost and precision trade-offs. Profiles were generated for three archives, and the relationships among CDX size, URI-Ms, URI-Rs, and URI-Keys were examined. Policies that achieved up to 80% routing accuracy at less than 1% relative cost while maintaining 0.9 recall are identified. Future work on language profiles, combination profiles, and other uses of archive profiles is also discussed.
Web Archive Profiling for Efficient Memento Aggregation
1. Web Archive Profiling for Efficient Memento Aggregation
Sawood Alam
Old Dominion University, Norfolk, Virginia - 23529
Advisor: Michael L. Nelson
Doctoral Consortium JCDL’16
June 19, 2016
Supported in part by the International Internet Preservation Consortium (IIPC)
11. From: Michael Nelson [mailto:mln@cs.odu.edu]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one my students wrote,
but I suspect the traffic you're seeing is b/c it is deployed in
http://oldweb.today/ can you share the IP addr from where you're seeing the
traffic? I presume the requests are for Memento TimeMaps? It should not
being actually scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues
> on our wayback.
> Do you know who is running this scraper? Itʼs not part of memento is it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam
<ibnesayeed@gmail.com>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL aggregator,
as the traffic has gotten really high, and also I was asked to remove an
archive due to the traffic it was causing temporarily..
I am thinking that ability to remove source archives quickly is an
important aspect of an aggregator.
Sawood: Hopefully yours will support something like this so I don't need
to restart the container to change the archivelist ;)
Ilya
Broadcasting is Bad
12. Availability and Overlap
● Archives are sparse
● Broadcasting is wasteful, both clients and archives suffer
16. Why Do Small Archives Matter?
● 400B+ web pages at the Internet Archive (IA) do not cover everything
● The top three archives after IA produce a full TimeMap 52% of the time (AlSum et al., TPDL 2013)
● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
● Censorship
18. Research Questions
● What do individual web archives hold?
● How much do we need to know about an archive's holdings?
● What is the optimal level of summarization for better accuracy and increased freshness?
● What are the various ways to learn about an archive's holdings?
● How can archive profiles be stored and updated so that they scale efficiently?
19. Archive Profile
● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other purposes
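To make the "predicts presence" and "query routing" points concrete, here is a minimal sketch of how an aggregator might consult profiles before contacting any archive. Everything in it is an assumption for illustration: the archive names, the `PROFILES` contents, the `uri_key` and `route` helpers, and the simplified key format (reversed host labels, without real SURT canonicalization) are hypothetical and are not taken from MemGator or the published profile format.

```python
from urllib.parse import urlsplit

def uri_key(uri, host_segments=3):
    """Hypothetical key: reversed host labels, truncated to `host_segments`."""
    host = urlsplit(uri).hostname or ""
    labels = host.lower().split(".")[::-1]  # www.bbc.co.uk -> uk, co, bbc, www
    return ",".join(labels[:host_segments]) + ")/"

# Hypothetical profiles: archive name -> set of keys the archive is known to hold.
PROFILES = {
    "archive-a": {"uk,co,bbc)/", "com,example)/"},
    "archive-b": {"com,example)/"},
    "archive-c": {"org,wikipedia,en)/"},
}

def route(uri, profiles=PROFILES):
    """Return only the archives whose profile contains the key for `uri`."""
    key = uri_key(uri)
    return sorted(name for name, keys in profiles.items() if key in keys)

print(route("http://www.bbc.co.uk/news"))  # ['archive-a']
print(route("http://example.com/a"))       # ['archive-a', 'archive-b']
print(route("http://nowhere.test/"))       # []
```

With broadcasting, each of these lookups would send a request to all three archives; with the profile lookup, only the archives listed in the result are contacted.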
36. Cost vs Accuracy
Group  Policies                      Cost                Accuracy
G1     H1P0/TLD                      Bound by # of TLDs  ≈ 0.01
G2     H3P0, DDom, DSub, DPth, DQry  < 0.01              ≈ 0.78
G3     DIni                          ≈ 2 * G2            ≈ 0.88
G4     HxP1                          ≈ 5 * G3            ≈ 0.94
G5     Higher HmPn                   0.4 -- 0.7          Not Explored
G6     URIR                          1.0                 1.0
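The policy names in the table can be illustrated with a small sketch. It assumes that HmPn means "keep at most m host segments and n path segments of a reversed-host key" (so H1P0 keeps only the TLD and HxP1 keeps every host segment plus one path segment), and that relative cost is the number of unique keys divided by the number of unique URI-Rs. Both readings are assumptions that the slide does not spell out; `hmpn_key`, `relative_cost`, and the sample URIs are hypothetical, and the key format skips real SURT canonicalization.

```python
from urllib.parse import urlsplit

def hmpn_key(uri, m=None, n=0):
    """Keep at most m host segments (None means all, i.e. 'Hx') and n path segments."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower().split(".")[::-1]  # reversed host labels
    path = [seg for seg in parts.path.split("/") if seg]
    if m is not None:
        host = host[:m]
    return ",".join(host) + ")/" + "/".join(path[:n])

def relative_cost(uris, m=None, n=0):
    """Assumed cost measure: unique keys divided by unique URI-Rs."""
    unique_uris = set(uris)
    keys = {hmpn_key(u, m, n) for u in unique_uris}
    return len(keys) / len(unique_uris)

# Hypothetical sample of URI-Rs.
SAMPLE = [
    "http://www.bbc.co.uk/news/world",
    "http://www.bbc.co.uk/sport",
    "http://news.bbc.co.uk/",
    "http://example.com/a",
]

print(hmpn_key(SAMPLE[0], m=1, n=0))     # uk)/                  (H1P0)
print(hmpn_key(SAMPLE[0], m=3, n=0))     # uk,co,bbc)/           (H3P0)
print(hmpn_key(SAMPLE[0], m=None, n=1))  # uk,co,bbc,www)/news   (HxP1)
print(relative_cost(SAMPLE, m=3, n=0))   # 0.5
```

Under these assumptions, shorter keys collapse more URI-Rs into one entry, which lowers cost but makes the profile less specific; this is the trade-off the table summarizes.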
37. Work Plan
✓ Baseline Profiling Through CDX Files
✓ Profile Serialization
✓ Fulltext Search Profiling
✓ Sample URI Dataset
➢ Instrumenting Memento Aggregator
➢ Multidimensional Profiling
38. Publications
TPDL15 Web Archive Profiling Through CDX Summarization
TCDL15 Profiling Web Archives - For Efficient Memento Query Routing
IJDL16 Web Archive Profiling Through CDX Summarization
JCDL16 Poster: MemGator - A Portable Concurrent Memento Aggregator
TPDL16 Web Archive Profiling Through Fulltext Search
RFC Object Resource Stream (ORS) and CDX-JSON (CDXJ) Formats
C4LJ MemGator - A Portable Concurrent Memento Aggregator Architecture
JCDL17 Scalable, Maintainable, and Extensible Web Archive Profile Serialization for Efficient Lookup
JCDL17 URI, Time, and Language Profiling from Live Archives via URI Sampling and Fulltext Search
SIGIR17 Memento Aggregator Routing Based on Probability Distribution of Memento Availability with Archive Profiles
IJDL17 Archive X-Ray - Web Archive Profiling for Efficient Memento Aggregation
39. Future Work
● Language profiles
● Evaluation of combination profiles, such as URI-Key along with Datetime
● Use archive profiles to generate a rank-ordered list of archives
● Profiles for uses other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog)
40. Conclusions
● Generated profiles with different policies for three archives
● Examined cost-precision tradeoffs of various policies
● Related CDX Size, URI-M, URI-R, and URI-Key
● Gained up to 80% routing accuracy with <1% relative cost while maintaining 0.9 recall