The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
1. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
@mart1nkle1n
TPDL, Oslo, Norway, September 10 2019
Martin Klein
Los Alamos National Laboratory
martinklein0815@gmail.com
@mart1nkle1n
with
Harihar Shankar (98point6)
Lyudmila Balakireva (LANL)
Herbert Van de Sompel (DANS)
2. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
A major challenge in web archiving:
Scale vs. Quality
3. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
IA’s Scale!
https://twitter.com/brewster_kahle/status/1016003169589981184
4. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
IA’s Scale!!
https://twitter.com/brewster_kahle/status/1118172506777509890
5. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
IA’s Scale!!!
https://twitter.com/brewster_kahle/status/1139700494748663809
6. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
IA’s Scale!!!!
https://twitter.com/brewster_kahle/status/1170820482104348672
7. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Fidelity?
http://web.archive.org/web/*/http://cnn.com
8. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Fidelity?
http://web.archive.org/web/20190808041346/https://www.cnn.com/
9. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Fidelity?
https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
10. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Webrecorder’s Fidelity!
https://webrecorder.io/martinklein/tpdl_test_collection/20190417221002/https://www.cnn.com/
11. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Webrecorder’s Fidelity!!
https://twitter.com/ianmilligan1/status/1136703505442324481
https://twitter.com/MellonFdn/status/1138811967060267011
12. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Webrecorder’s Scale?
https://twitter.com/mart1nkle1n/status/1136705116738904067
13. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Scale vs. Quality
• Crawler-based approaches scale well
• Crawling quality is not always as desired
• Human-driven approaches often result in great quality
• Not necessarily designed for (web) scale
14. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Scale vs. Quality
Memento Tracer
http://tracer.mementoweb.org
15. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Memento Tracer Framework
http://tracer.mementoweb.org
Inspired by:
• LOCKSS
  • Same automated approach for resources of a class
• Webrecorder
  • Manual recording of web resources
• Various attempts aimed at automating interactions/behaviors
  • E.g., Brozzler, Browsertrix
16. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Memento Tracer Framework
http://tracer.mementoweb.org
17. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Memento Tracer Implementation
• Client-side:
  • Tracer Chrome extension leveraging Selenium IDE
  • JSON-formatted Trace for download
• Server-side:
  • Stormcrawler
  • Selenium (Chrome) with Tracer plug-in
  • WarcProxy
  • File-system storage for WARC files
http://stormcrawler.net/
https://www.seleniumhq.org/projects/webdriver/
https://github.com/odie5533/WarcProxy
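The slide only states that a Trace is JSON-formatted and downloadable; it does not show the schema. The following is a minimal sketch of how such a document could be written and read back, assuming a hypothetical layout with a `uri_pattern` field and a list of `actions`. The field names, the action kinds, and the CSS selectors are all illustrative assumptions, not the actual Memento Tracer format.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical action names; the real Tracer vocabulary may differ.
ALLOWED_KINDS = {"click", "click_all", "repeat_click"}


@dataclass
class Action:
    kind: str       # what to do, e.g. "click"
    selector: str   # CSS selector of the element(s) to act on (illustrative)


@dataclass
class Trace:
    uri_pattern: str                          # class of pages the trace applies to
    actions: list = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts nested Action objects into plain dictionaries.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Trace":
        raw = json.loads(text)
        actions = [Action(**item) for item in raw["actions"]]
        for action in actions:
            if action.kind not in ALLOWED_KINDS:
                raise ValueError(f"unknown action kind: {action.kind}")
        return cls(uri_pattern=raw["uri_pattern"], actions=actions)


# Round trip: what a client could export and a server could load.
trace = Trace(
    uri_pattern="^https://github[.]com/[^/]+/[^/]+/?$",
    actions=[Action("click", "a.download-zip"), Action("click_all", "div.files a")],
)
restored = Trace.from_json(trace.to_json())
```

Validating the action kinds on load is one possible design choice: a server-side component would then reject a malformed trace before any browser session starts.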
18. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Tracer Interactions
https://github.com/mementoweb/memento_extensions
22. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Tracer Interactions
https://www.slideshare.net/martinklein0815/evaluating-memento-service-optimizations
23. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Current Memento Tracer Capabilities
• Single clicks/links
• All links in an area
• Repeated clicks on links, with a stop condition
  • Slides
  • Pagination
• Nested traces, i.e., “trace in a trace”
  • The trace for portal A follows a link to portal B, then executes the trace for portal B
• Identification of the page/portal for which a trace exists by URI (pattern)
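Two of the capabilities listed above can be illustrated without a browser: choosing a trace by URI pattern, and repeating a click until a stop condition holds. This is a sketch only. The registry, the trace names, and the `next_link` callback are hypothetical stand-ins; in a real deployment the "click" would be a browser interaction rather than a dictionary lookup.

```python
import re

# Hypothetical registry: URI pattern -> name of the trace to execute.
TRACES = {
    r"^https://github\.com/[^/]+/[^/]+/?$": "github-repository",
    r"^https://www\.slideshare\.net/[^/]+/[^/]+/?$": "slideshare-deck",
}


def find_trace(uri):
    """Return the name of the first trace whose pattern matches the URI."""
    for pattern, name in TRACES.items():
        if re.match(pattern, uri):
            return name
    return None


def repeat_click(start, next_link, max_clicks=1000):
    """Follow 'next' links until none remains, a page repeats, or the cap is hit.

    next_link(page) stands in for clicking the "next" control and returns the
    page reached, or None when the control is absent (the stop condition).
    """
    visited = [start]
    current = start
    while len(visited) <= max_clicks:
        nxt = next_link(current)
        if nxt is None or nxt in visited:
            break
        visited.append(nxt)
        current = nxt
    return visited


# Simulated three-slide deck: each slide points to the next one.
DECK = {"slide-1": "slide-2", "slide-2": "slide-3", "slide-3": None}
slides = repeat_click("slide-1", DECK.get)
```

The `max_clicks` cap and the check for already visited pages are defensive additions, so that a page whose "next" control never disappears cannot keep the loop running.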
24. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Memento Tracer Benefits
• Scalability
  • A trace created once applies to all web resources of the same class
  • Traces are shared via a repository (edits, versioning)
• Quality
  • A trace serves as the set of instructions for a browser-based capture framework
  • The resource boundary is explicit
• Tradeoff
  • Quality vs. performance
25. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Evaluation of Scalability & Quality
• Dataset consisting of GitHub repositories and Slideshare slide decks
  • 17,646 GitHub repositories (via changelog.com)
  • 12,280 Slideshare decks (via Explore feature)
• Archival goals:
  • GitHub: capture all repository files and the ZIP file
  • Slideshare: capture all slides and notes
• Quality evaluation:
  • Compare against Webrecorder
• Scalability evaluation:
  • Large number of high-quality captures
  • Compare against the crawl time of a common crawler
26. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Quality
• Not a trivial dimension to evaluate!
• Decision to evaluate by the number of URIs in the live web version vs. the archived snapshot
  • Based on manually generated snapshots with Webrecorder
  • Random sample of 100 repositories and 100 slide decks
• Expectation:
  • 100% of URIs from the live web in the archived snapshot
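The URI-based quality measure described above can be sketched as a simple set comparison. This is an illustration under assumptions, not the evaluation code behind the slides: the function names and the example.org URIs are made up, and URIs are compared as exact strings, whereas a real comparison may need normalization.

```python
def uri_coverage(live_uris, archived_uris):
    """Fraction of live-web URIs that also appear in the archived snapshot."""
    live = set(live_uris)
    if not live:
        return 1.0  # nothing was expected, so nothing is missing
    return len(live & set(archived_uris)) / len(live)


def missing_uris(live_uris, archived_uris):
    """URIs seen on the live web but absent from the archived snapshot."""
    return sorted(set(live_uris) - set(archived_uris))


# Made-up example: four URIs observed live, three of them archived.
live = [
    "https://example.org/repo",
    "https://example.org/repo/README.md",
    "https://example.org/repo/src/main.py",
    "https://example.org/repo/archive.zip",
]
archived = [
    "https://example.org/repo",
    "https://example.org/repo/README.md",
    "https://example.org/repo/src/main.py",
    "https://example.org/favicon.ico",  # extra URI, does not affect coverage
]

coverage = uri_coverage(live, archived)
gaps = missing_uris(live, archived)
```

A coverage of 1.0 corresponds to the stated expectation that 100% of the live-web URIs are present in the archived snapshot.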
27. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Quality
[Figure: two panels, 100 GitHub repositories and 100 Slideshare decks]
28. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Quality at Scale
[Figure: two panels, 17,646 GitHub repositories and 12,280 Slideshare decks]
29. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Cost of Quality at Scale
• Runtime difference between Memento Tracer and a common web crawler for the same number of URIs
• Plus 20 seconds per URI, on average
• Faster than previous approaches, and discovers many more URIs
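To get a feel for what 20 extra seconds per URI means at the size of the dataset, here is a back-of-the-envelope calculation. It rests on two assumptions that the slides do not confirm: that "per URI" refers to each seed URI (one repository or one slide deck), and that captures are spread evenly over a number of parallel workers. The `workers` parameter is hypothetical.

```python
GITHUB_REPOS = 17_646        # from the evaluation dataset
SLIDESHARE_DECKS = 12_280    # from the evaluation dataset
EXTRA_SECONDS_PER_URI = 20   # average overhead reported on the slide


def extra_runtime_hours(n_uris, extra_seconds=EXTRA_SECONDS_PER_URI, workers=1):
    """Added runtime in hours, assuming perfectly even parallelism."""
    return n_uris * extra_seconds / workers / 3600


total_uris = GITHUB_REPOS + SLIDESHARE_DECKS         # 29,926 seed URIs
sequential = extra_runtime_hours(total_uris)          # one worker
parallel = extra_runtime_hours(total_uris, workers=10)

print(f"{total_uris} URIs: about {sequential:.1f} extra hours sequentially, "
      f"about {parallel:.1f} with 10 workers")
```

Under these assumptions the overhead is roughly 166 hours when run sequentially, which is why the cost is framed as a tradeoff against quality rather than as negligible.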
30. Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Takeaways
• Memento Tracer aims to find a balance between quality and scale
• Human in the loop; benefits from patterns of web resources
• Experiments provide indicators of high quality, reliability, and scale
• There is a cost: slower than simple crawlers
• Optimizations are possible; further potential and limitations remain to be explored