Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
Martin Klein
Lyudmila Balakireva
Harihar Shankar
Research Library
Los Alamos National Laboratory
https://arxiv.org/abs/1906.00058
Memento Time Travel
http://timetravel.mementoweb.org/
Memento Time Travel
http://timetravel.mementoweb.org/list/20160214085934/http://jcdl.org
Memento Time Travel
https://arquivo.pt/wayback/20160515103313/http://www.jcdl.org/
How does this work?
Memento Aggregator (simplistic view)
Memento Aggregator
LANL Memento Aggregator - Problem
• As the number of archives grows, sending requests to each archive for every incoming request is not feasible
  • Response times
  • Memento infrastructure load
  • Load on distributed archives
What if…
• We could predict whether or not to issue a request to a specific archive?
  • By merely looking at the requested URI-R
  • A binary classifier per archive
• We could train the classifiers using cached data?
• That would be pretty neat, indeed:
  • Retrain classifiers as web archive collections evolve
  • Not dependent on external data
  • Querying classifiers probably way faster (msec) than polling archives (sec)
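The routing idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the archive names, the `select_archives` helper, and the lambda "classifiers" are hypothetical stand-ins for trained models, not the aggregator's actual code.

```python
from typing import Callable, Dict, List

# Hypothetical classifier type: takes a URI-R and returns True if the
# archive is predicted to hold at least one memento for it.
Classifier = Callable[[str], bool]


def select_archives(uri_r: str, classifiers: Dict[str, Classifier]) -> List[str]:
    """Return the names of the archives whose classifier says 'poll me'."""
    return [name for name, predict in classifiers.items() if predict(uri_r)]


# Toy stand-ins for trained per-archive models (assumptions for illustration).
classifiers = {
    "archive_a": lambda uri: True,  # broad collection: always poll
    "archive_pt": lambda uri: ".pt/" in uri or uri.endswith(".pt"),
    "archive_uk": lambda uri: ".uk/" in uri or uri.endswith(".uk"),
}

selected = select_archives("http://www.example.pt/noticias", classifiers)
print(selected)
```

Only the archives in `selected` would receive a request; the others are skipped, which is where the savings in requests and response time would come from.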
We can! Published @ JCDL 2016
https://doi.org/10.1145/2910896.2910899
• ML models based on simple URI features
  • Character count, n-grams, domain
• Common ML algorithms used per archive
  • Logistic Regression, Multinomial Bayes, SVM
• Optimized for
  • Prediction time, not training time
  • Reduction of false positive rate
Results:
• Requests per URI-R: 3.96 vs 11
• Response time: 2.16s vs 3.08s
• Recall: 0.847
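To make "simple URI features" concrete, here is a minimal pure-Python sketch of a per-archive classifier. It assumes character trigrams, the domain, and a coarse length bucket as features, and uses a small multinomial naive Bayes with add-one smoothing. The class name, the feature encoding, and the four training URIs are illustrative assumptions, not the published models or their training data.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def uri_features(uri_r, n=3):
    """Tokens from a URI-R: character n-grams, the domain, a length bucket."""
    parts = urlsplit(uri_r)
    text = (parts.netloc + parts.path).lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return grams + ["domain=" + parts.netloc.lower(),
                    "len=" + str(len(uri_r) // 20)]


class ArchiveClassifier:
    """Multinomial naive Bayes with add-one smoothing, for ONE archive."""

    def fit(self, uris, labels):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = Counter(labels)
        for uri, label in zip(uris, labels):
            self.counts[label].update(uri_features(uri))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, feats, label):
        total = sum(self.counts[label].values()) + len(self.vocab)
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for f in feats:
            if f in self.vocab:  # features never seen in training are ignored
                score += math.log((self.counts[label][f] + 1) / total)
        return score

    def predict(self, uri_r):
        feats = uri_features(uri_r)
        return self._log_score(feats, True) > self._log_score(feats, False)


# Hypothetical cached lookups: did this archive hold the URI-R or not?
cached_uris = [
    "http://www.publico.pt/politica", "http://www.sapo.pt/noticias",
    "http://www.example.com/news", "http://www.example.org/about",
]
held = [True, True, False, False]

model = ArchiveClassifier().fit(cached_uris, held)
print(model.predict("http://www.rtp.pt/noticias"))
print(model.predict("http://www.example.net/contact"))
```

Prediction is a handful of dictionary lookups per feature, which fits the stated goal of optimizing for prediction time rather than training time.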
In Production…
Questions to Ask
• How effective is the cache?
  • What is the hit/miss ratio? Does it vary for different Memento services?
  • Is the cache freshness period appropriate?
• How effective is the ML process?
  • What are the recall and the false positive rate?
  • Do we need to retrain the models? How often?
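The cache questions above can be illustrated with a small sketch of a cache that has a freshness period and counts hits and misses. The `FreshnessCache` class, its `max_age` parameter, and the injected clock are assumptions made for this example; the slides do not describe the aggregator's cache implementation.

```python
import time


class FreshnessCache:
    """Cache whose entries count as fresh for max_age seconds (hypothetical)."""

    def __init__(self, max_age, clock=time.time):
        self.max_age = max_age
        self.clock = clock
        self.entries = {}  # uri_r -> (stored_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, uri_r):
        entry = self.entries.get(uri_r)
        if entry is not None and self.clock() - entry[0] <= self.max_age:
            self.hits += 1
            return entry[1]
        # Absent or stale entries both count as a miss.
        self.misses += 1
        return None

    def put(self, uri_r, value):
        self.entries[uri_r] = (self.clock(), value)

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# A fake clock makes the freshness behaviour easy to follow.
now = [0.0]
cache = FreshnessCache(max_age=60, clock=lambda: now[0])
cache.get("http://jcdl.org")                 # miss: nothing cached yet
cache.put("http://jcdl.org", ["memento-1"])
cache.get("http://jcdl.org")                 # hit: entry is still fresh
now[0] = 120.0
cache.get("http://jcdl.org")                 # miss: entry is older than max_age
print(cache.hit_ratio())
```

A longer `max_age` raises the hit ratio but risks serving results that no longer reflect what the archives hold, which is the trade-off behind the freshness question.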
Evaluation
• Memento Aggregator currently covers
  • 23 web archives
  • 17 with native Memento support
  • 6 with by-proxy Memento support
• Analysis of log files
  • Recorded between July 4th 2017 and October 17th 2018
  • > 11m requests in total
  • Approx. 2.6m requests against machine learning process
  • Results in 2.6m lookups to populate cache
  • Used as “truth” to assess ML prediction
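Using the cache-populating lookups as "truth", recall and false positive rate follow from a standard confusion-matrix count. The sketch below assumes one boolean per (URI-R, archive) pair; the function name and the six sample values are made up for illustration and are not taken from the logs.

```python
def recall_and_fpr(predicted, truth):
    """Compare classifier predictions with what the archives actually held."""
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p and t)          # polled, memento found
    fn = sum(1 for p, t in pairs if not p and t)      # skipped, memento missed
    fp = sum(1 for p, t in pairs if p and not t)      # polled, nothing there
    tn = sum(1 for p, t in pairs if not p and not t)  # skipped, nothing there
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, fpr


# Hypothetical predictions vs. lookup results for six (URI-R, archive) pairs.
predicted = [True, True, False, True, False, False]
truth = [True, False, True, True, False, False]

recall, fpr = recall_and_fpr(predicted, truth)
print(recall, fpr)
```

A false negative means a memento goes unreported, while a false positive only costs one unnecessary request, so the two metrics capture different kinds of error.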
Cache Hit/Miss Rate
[Chart; annotation: services mostly driven by humans, humans, machines, machines]
Recall
• 0.847 (JCDL 2016 result) vs. 0.727 (this evaluation)
False Positives
Dynamic Web Archives
Takeaways
• Memento Aggregator cache is very effective
  • ~ 60% of requests served from cache
  • Human-driven services benefit the most
• Machine learning process saves!
  • Requests & time, at an acceptable recall level
  • Recall: 0.727
  • Re-training seems necessary, frequency TBD
Optimization
• ML model trained on archival holdings, not usage logs/cache
  • Beneficial for new archives
• Neural network classifier, based on simple URI features, shows promising results
