Evaluating Memento Service Optimizations


The document discusses optimizations made to Memento services to improve efficiency. It describes using a cache that achieved a hit rate of around 60% and machine learning models to predict whether archives would have versions of requested URIs. The machine learning reduced the number of requests by around 40% while maintaining over 70% recall. The optimizations saved significant time and resources while maintaining an acceptable level of effectiveness.

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
Martin Klein
Lyudmila Balakireva
Harihar Shankar
Research Library
Los Alamos National Laboratory
https://arxiv.org/abs/1906.00058

Memento Time Travel
http://timetravel.mementoweb.org/
http://timetravel.mementoweb.org/list/20160214085934/http://jcdl.org

Memento Time Travel
https://arquivo.pt/wayback/20160515103313/http://www.jcdl.org/

How does this work?
Memento Aggregator (simplistic view)

Memento Aggregator

LANL Memento Aggregator - Problem
• As the number of archives grows, sending requests to each archive for every incoming request is not feasible
  • Response times
  • Memento infrastructure load
  • Load on distributed archives
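
To make the cost of the problem concrete, here is a minimal sketch of a naive aggregator that polls every archive for each incoming URI-R. The archive names and the `poll` stub are hypothetical placeholders and do not reflect the actual LANL implementation; the count of 23 archives comes from the Evaluation slide.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical archive identifiers; the deck reports 23 covered archives.
ARCHIVES = [f"archive-{i}" for i in range(23)]

def poll(archive, uri_r):
    """Stand-in for an HTTP lookup; pretend every 5th archive holds a memento."""
    return int(archive.split("-")[1]) % 5 == 0

def naive_lookup(uri_r):
    """Poll every archive in parallel; return archives with a hit and the request count."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda a: poll(a, uri_r), ARCHIVES))
    hits = [a for a, found in zip(ARCHIVES, results) if found]
    return hits, len(ARCHIVES)

hits, requests_sent = naive_lookup("http://www.jcdl.org/")
```

Every incoming request costs one outgoing request per archive, whether or not the archive holds anything, so load grows linearly with the number of archives.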

What if…
• We could predict whether or not to issue a request to a specific archive?
  • By merely looking at the requested URI-R
  • A binary classifier per archive
• We could train the classifiers using cached data?
• That would be pretty neat, indeed:
  • Retrain classifiers as web archive collections evolve
  • Not dependent on external data
  • Querying classifiers probably way faster (msec) than polling archives (sec)
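
The idea of one binary classifier per archive can be sketched as follows. The archive names and the hand-written rules are hypothetical stand-ins for trained models; the point is only the routing step, where a URI-R is sent to an archive only if that archive's classifier predicts a hit.

```python
from typing import Callable, Dict, List

# One binary classifier per archive: URI-R -> "worth polling?".
# These lambdas are illustrative rules, not trained models.
Classifier = Callable[[str], bool]

CLASSIFIERS: Dict[str, Classifier] = {
    "archive-pt": lambda uri: ".pt/" in uri or uri.endswith(".pt"),
    "archive-uk": lambda uri: ".uk/" in uri or uri.endswith(".uk"),
    "archive-general": lambda uri: True,  # broad-coverage archive: always poll
}

def select_archives(uri_r: str) -> List[str]:
    """Return only the archives whose classifier predicts a memento for uri_r."""
    return [name for name, predict in CLASSIFIERS.items() if predict(uri_r)]

selected = select_archives("http://www.example.pt/page")
```

Here the URI-R is routed to two of the three archives, so one outgoing request is saved.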

We can! Published @ JCDL 2016
https://doi.org/10.1145/2910896.2910899
• ML models based on simple URI features
  • Character count, n-grams, domain
• Common ML algorithms used per archive
  • Logistic Regression, Multinomial Bayes, SVM
• Optimized for
  • Prediction time, not training time
  • Reduction of false positive rate
Results:
• Requests per URI-R: 3.96 vs 11
• Response time: 2.16s vs 3.08s
• Recall: 0.847
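
A minimal sketch of extracting the three kinds of URI features named above. The function name and the choice of character trigrams are assumptions made for illustration; the slide does not say which n-gram type or size was used.

```python
from collections import Counter
from urllib.parse import urlsplit

def uri_features(uri_r: str, n: int = 3) -> dict:
    """Extract character count, domain, and character n-gram counts from a URI-R."""
    parts = urlsplit(uri_r)
    text = uri_r.lower()
    # Sliding window of length n over the whole URI string (assumed n-gram type).
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {
        "char_count": len(uri_r),
        "domain": parts.hostname or "",
        "ngrams": ngrams,
    }

features = uri_features("http://www.jcdl.org/")
```

Features like these are cheap to compute from the string alone, which fits the stated goal of optimizing for prediction time.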

In Production…

Questions to Ask
• How effective is the cache?
  • What is the hit/miss ratio? Does it vary for different Memento services?
  • Is the cache freshness period appropriate?
• How effective is the ML process?
  • What is the recall and the false positive rate?
  • Do we need to retrain the models? How often?
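
The cache questions can be illustrated with a small cache that expires entries after a freshness period and tracks hits and misses. The `TTLCache` class, its parameters, and the 10-second period are hypothetical; the deck does not describe how the production cache is implemented.

```python
import time

class TTLCache:
    """Tiny cache whose entries expire after a freshness period (in seconds)."""

    def __init__(self, freshness, clock=time.monotonic):
        self.freshness = freshness
        self.clock = clock
        self.store = {}  # uri_r -> (stored_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, uri_r):
        entry = self.store.get(uri_r)
        if entry is not None and self.clock() - entry[0] < self.freshness:
            self.hits += 1
            return entry[1]
        self.misses += 1  # absent or stale entries both count as misses
        return None

    def put(self, uri_r, value):
        self.store[uri_r] = (self.clock(), value)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# A fake clock keeps the example deterministic.
now = [0.0]
cache = TTLCache(freshness=10.0, clock=lambda: now[0])
cache.get("http://www.jcdl.org/")          # miss: nothing cached yet
cache.put("http://www.jcdl.org/", ["m1"])
cache.get("http://www.jcdl.org/")          # hit: entry is still fresh
now[0] = 11.0
cache.get("http://www.jcdl.org/")          # miss: entry is older than 10 s
```

A longer freshness period raises the hit rate but risks serving results that no longer match what the archives hold, which is the trade-off behind the freshness question.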

Evaluation
• Memento Aggregator currently covers
  • 23 web archives
  • 17 with native Memento support
  • 6 with by-proxy Memento support
• Analysis of log files
  • Recorded between July 4th, 2017 and October 17th, 2018
  • > 11m requests in total
  • Approx. 2.6m requests against the machine learning process
  • Results in 2.6m lookups to populate the cache
  • Used as “truth” to assess ML prediction
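
Given predictions and the lookup results used as "truth", recall and false positive rate follow from the usual confusion-matrix counts. The sketch below uses made-up toy data, not values from the logs, and the function name is hypothetical.

```python
def recall_and_fpr(predicted, truth):
    """Compute recall and false positive rate from parallel lists of booleans.

    predicted[i]: the classifier said "poll this archive"
    truth[i]:     the archive actually held a memento (per the cache lookup)
    """
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p and t)
    fn = sum(1 for p, t in pairs if not p and t)
    fp = sum(1 for p, t in pairs if p and not t)
    tn = sum(1 for p, t in pairs if not p and not t)
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, fpr

# Toy data for illustration only.
predicted = [True, True, False, False, True, False]
truth = [True, False, True, False, True, False]
recall, fpr = recall_and_fpr(predicted, truth)
```

Recall measures how many archives that really hold a memento are still polled; the false positive rate measures how many unnecessary requests are still sent.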

Cache Hit/Miss Rate
[Figure: cache hit/miss rate for four services, labeled as mostly driven by humans, humans, machines, and machines]

Recall
• JCDL 2016 evaluation: 0.847
• This evaluation: 0.727

False Positives

Dynamic Web Archives

Takeaways
• Memento Aggregator cache is very effective
  • ~60% of requests served from cache
  • Human-driven services benefit the most
• Machine learning process saves!
  • Requests and time, at an acceptable recall level
  • Recall: 0.727
  • Re-training seems necessary, frequency TBD
Optimization
• ML model trained on archival holdings, not usage logs/cache
  • Beneficial for new archives
• Neural network classifier, based on simple URI features, shows promising results
