Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
@mart1nkle1n
TPDL, Oslo, Norway, September 10 2019
Martin Klein
Los Alamos National Laboratory
martinklein0815@gmail.com
@mart1nkle1n
with
Harihar Shankar (98point6)
Lyudmila Balakireva (LANL)
Herbert Van de Sompel (DANS)
The Memento Tracer Framework:
Balancing Quality and Scalability
for Web Archiving
A major challenge in web archiving:
Scale vs. Quality
IA’s Scale!
https://twitter.com/brewster_kahle/status/1016003169589981184
IA’s Scale!!
https://twitter.com/brewster_kahle/status/1118172506777509890
IA’s Scale!!!
https://twitter.com/brewster_kahle/status/1139700494748663809
IA’s Scale!!!!
https://twitter.com/brewster_kahle/status/1170820482104348672
Fidelity?
http://web.archive.org/web/*/http://cnn.com
Fidelity?
http://web.archive.org/web/20190808041346/https://www.cnn.com/
Fidelity?
https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
Webrecorder’s Fidelity!
https://webrecorder.io/martinklein/tpdl_test_collection/20190417221002/https://www.cnn.com/
Webrecorder’s Fidelity!!
https://twitter.com/ianmilligan1/status/1136703505442324481
https://twitter.com/MellonFdn/status/1138811967060267011
Webrecorder’s Scale?
https://twitter.com/mart1nkle1n/status/1136705116738904067
Scale vs. Quality
• Crawler-based approaches scale well
• Crawling quality is not always as desired
• Human-driven approaches often result in great quality
• Not necessarily designed for (web) scale
Memento Tracer
http://tracer.mementoweb.org
Memento Tracer Framework
http://tracer.mementoweb.org
Inspired by:
• LOCKSS
• Same automated approach for resources of a class
• Webrecorder
• Manual recording of web resources
• Various attempts aimed at automating interactions/behaviors
• E.g., Brozzler, Browsertrix
Memento Tracer Implementation
• Client-side:
• Tracer Chrome extension leveraging Selenium IDE
• JSON-formatted Trace for download
• Server-side:
• Stormcrawler
• Selenium (Chrome) with Tracer plug-in
• WarcProxy
• file-system storage for WARC files
http://stormcrawler.net/
https://www.seleniumhq.org/projects/webdriver/
https://github.com/odie5533/WarcProxy
Tracer Interactions
https://github.com/mementoweb/memento_extensions
Tracer Interactions
https://www.slideshare.net/martinklein0815/evaluating-memento-service-optimizations
Current Memento Tracer Capabilities
• Single clicks/links
• All links in an area
• Repeated click on links, with stop condition
• Slides
• Pagination
• Nested traces, i.e., “trace in a trace”
• Trace for portal A → follow link to portal B → execute trace for portal B
• Identification of page/portal for which a trace exists by URI (pattern)
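Two of the capabilities listed above, selecting a trace by URI pattern and repeating a click until a stop condition holds, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Trace class, the regular expressions, the action strings, and the helper names are hypothetical and do not reflect the actual Tracer JSON format or code.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical stand-in for a recorded trace (not the real Tracer schema)."""
    name: str
    uri_pattern: str                      # regex describing a class of pages
    actions: list = field(default_factory=list)


def find_trace(uri, traces):
    """Return the first trace whose URI pattern matches the whole URI, else None."""
    for trace in traces:
        if re.fullmatch(trace.uri_pattern, uri):
            return trace
    return None


def repeat_click(click, should_stop, max_clicks=1000):
    """Call click() until should_stop() is true; return how many clicks were made.

    max_clicks is a safety limit so that a stop condition that never fires
    cannot loop forever.
    """
    clicks = 0
    while clicks < max_clicks and not should_stop():
        click()
        clicks += 1
    return clicks


# Illustrative patterns only (assumptions, not taken from the Tracer project).
TRACES = [
    Trace("github-repo", r"https://github\.com/[^/]+/[^/]+/?",
          ["click all links in the file list", "click ZIP download"]),
    Trace("slideshare-deck", r"https://www\.slideshare\.net/[^/]+/[^/]+/?",
          ["click 'next slide' until the last slide"]),
]

if __name__ == "__main__":
    match = find_trace("https://github.com/mementoweb/memento_extensions", TRACES)
    print(match.name if match else "no trace available")

    # Simulate paging through a 5-slide deck with a stop condition.
    deck = {"slide": 1, "last": 5}
    n = repeat_click(lambda: deck.update(slide=deck["slide"] + 1),
                     lambda: deck["slide"] >= deck["last"])
    print("clicks needed:", n)
```

The pattern lookup is what lets one trace serve every resource of a class; the bounded loop mirrors the "repeated click, with stop condition" item.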
Memento Tracer Benefits
• Scalability
• Trace created once is applicable to all web resources of the same class
• Traces shared via repository (edits, versioning)
• Quality
• Trace used as set of instructions for browser-based capture framework
• Resource boundary explicit
• Tradeoff
• Quality vs performance
Evaluation of Scalability & Quality
• Dataset made of GitHub repositories and Slideshare slide decks
• 17,646 GitHub repositories (via changelog.com)
• 12,280 Slideshare decks (via Explore feature)
• Archival goals:
• GitHub: get all repository files and ZIP file
• Slideshare: get all slides and notes
• Quality eval:
• Compare against Webrecorder
• Scalability eval:
• Large amount of high-quality captures
• Compare against crawl time of common crawler
Quality
• Not a trivial dimension to evaluate!
• Decision to evaluate by number of URIs in live web version vs. archived snapshot
• Based on manually generated snapshots with Webrecorder
• Random sample of 100 repos and slide decks
• Expectation:
• 100% of URIs from live web in archived snapshot
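The URI-based quality measure described above can be sketched as a simple set comparison. This is a minimal illustration, assuming each capture is available as a plain list of URIs; the function name and the sample URIs are hypothetical, and the authors' actual evaluation code is not shown in the slides.

```python
def uri_coverage(live_uris, archived_uris):
    """Percentage of live-web URIs that also appear in the archived snapshot."""
    live = set(live_uris)
    if not live:
        raise ValueError("live_uris must contain at least one URI")
    archived = set(archived_uris)
    return 100.0 * len(live & archived) / len(live)


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    live = [
        "https://example.org/repo",
        "https://example.org/repo/README.md",
        "https://example.org/repo/src/main.py",
        "https://example.org/repo/archive.zip",
    ]
    archived = [
        "https://example.org/repo",
        "https://example.org/repo/README.md",
        "https://example.org/repo/archive.zip",
        "https://example.org/unrelated.css",   # extra URIs do not raise the score
    ]
    print(f"{uri_coverage(live, archived):.1f}% of live URIs archived")
```

Under this measure, the stated expectation of 100% means that every URI seen on the live web is also present in the archived snapshot.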
Quality
[Charts: 100 @ GitHub; 100 @ Slideshare]
Quality at Scale
[Charts: 17,646 @ GitHub; 12,280 @ Slideshare]
Cost of Quality at Scale
• Runtime difference between Memento Tracer and common web crawler for the same number of URIs
• Plus 20 seconds per URI, on average
• Faster than previous approaches, discovers many more URIs
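To get a feel for what the reported overhead means at the scale of the dataset, the back-of-the-envelope calculation below multiplies it out. Two assumptions are mine and not stated in the slides: that the 20 seconds apply once per seed resource (each repository or slide deck), and that resources are processed one after another with no parallelism.

```python
# Figures taken from the slides.
GITHUB_REPOS = 17_646
SLIDESHARE_DECKS = 12_280
EXTRA_SECONDS_PER_URI = 20      # reported average overhead per URI

# Assumption: one overhead unit per seed resource, processed sequentially.
total_resources = GITHUB_REPOS + SLIDESHARE_DECKS
extra_seconds = total_resources * EXTRA_SECONDS_PER_URI
extra_hours = extra_seconds / 3600
extra_days = extra_hours / 24

if __name__ == "__main__":
    print(f"resources:     {total_resources}")
    print(f"extra seconds: {extra_seconds}")
    print(f"extra hours:   {extra_hours:.1f}")
    print(f"extra days:    {extra_days:.1f}")
```

Under these assumptions the extra cost comes to roughly 166 hours, or about a week of sequential runtime; running captures in parallel would shorten the wall-clock time accordingly.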
Takeaways
• Memento Tracer aims at finding a balance between quality and scale
• Human in the loop, benefits from patterns of web resources
• Experiments provide indicators for high quality, reliability, scale
• Cost involved, slower than simple crawlers
• Optimizations possible, further potential and limitations to be explored