TrendMachine: Temporal Resilience of Web Pages
@WaybackMachine
IIPC Web Archiving Conference (WAC), May 03, 2023, Online
Sawood Alam (Internet Archive)
Mark Graham (Internet Archive)
Kritika Garg (Old Dominion University)
Michele C. Weigle (Old Dominion University)
Michael L. Nelson (Old Dominion University)
Dietrich Ayala (Protocol Labs)
@WebSciDL @ProtocolLabs
Supported in part by Protocol Labs and Filecoin Foundation
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 2
Research Question
How healthy has a web page been throughout its lifetime?
Temporal and Spatial Landscape of Archival Analysis
Long Duration, Single Webpage
● TMVis
● Wayback Machine Changes
● TrendMachine
Long Duration, Webpage Collection
● MementoMap
● CDX Summary
● Archives Unleashed Toolkit
Short Duration, Single Webpage
● Memento Damage
● Archival ACID Test
● Reconstructive
Short Duration, Webpage Collection
● Warrick
● Wayback Machine Downloader
● Video Archiving Insights
Modeling Web Page Health: Linear vs. S-Curve
Sigmoid Function for Web Page Resilience
Spread: How far up or down the value can go from its starting position?
Shift: How soon any significant change in the value can begin?
Slope: How quickly the value reaches close to the maximum change?
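The three parameters can be read as knobs on a generic logistic curve. The sketch below is only an illustration under that assumption; the slides do not give the exact formula, and the function name `sigmoid`, the `start` argument, and this particular parameterization are hypothetical.

```python
import math

def sigmoid(x, spread=1.0, shift=0.0, slope=1.0, start=0.0):
    """Generic logistic curve (assumed form, for illustration only).

    spread: how far the value can move away from `start`
    shift:  x position of the midpoint, i.e. how late the change begins
    slope:  steepness, i.e. how quickly the value nears start + spread
    """
    # Clamp the exponent so math.exp cannot overflow for extreme inputs.
    z = max(-700.0, min(700.0, slope * (x - shift)))
    return start + spread / (1.0 + math.exp(-z))
```

With the defaults, `sigmoid(0)` sits at the midpoint of the curve; a larger `shift` delays the rise, and a larger `slope` makes the rise more abrupt.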
TrendMachine: Composite Sigmoid Parameters of Resilience
TrendMachine: Overview
Code: https://github.com/internetarchive/trendmachine
Demo: https://trendmachine.sawood-dev.us.archive.org/
TrendMachine: Temporal Distribution of Archiving Activities
The page is archived anywhere from zero or one time to tens of thousands of times in a single day.
Specimen Selection Algorithm
PRIORITY = ["2xx", "4xx", "5xx", "3xx"]
FOREACH st OF PRIORITY
  IF st IN statuses(day)
    specimen = statuses(day).match(st)[0]
    BREAK
DAY1 DAY2 DAY3 DAY4
4xx 3xx 5xx 3xx
3xx 3xx 3xx 5xx
2xx 3xx 5xx 3xx
5xx 4xx 5xx
2xx 4xx
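Because 3xx has the lowest priority, a day gets a 3xx specimen only when every capture of that day is a redirect. A 3xx specimen therefore usually suggests that the URL is redirecting to somewhere other than a variation of the same URL.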
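The selection rule above can be sketched in Python as follows. This is a minimal illustration, not the project's code: the record shape (a dict with a `status` key) and the helper names `status_class` and `select_specimen` are assumptions.

```python
PRIORITY = ["2xx", "4xx", "5xx", "3xx"]

def status_class(code):
    """Map a status code such as 404 or "404" to its class, e.g. "4xx"."""
    return f"{str(code)[0]}xx"

def select_specimen(captures):
    """Return the first capture of the highest-priority status class.

    `captures` is one day's captures in chronological order, each a dict
    with at least a "status" key (a hypothetical record shape). Returns
    None when no capture matches any class in PRIORITY.
    """
    for st in PRIORITY:
        for capture in captures:
            if status_class(capture["status"]) == st:
                return capture
    return None
```

For a day with captures `301, 404, 200, 200`, the first `200` capture becomes the specimen; a day with only `301` and `302` captures yields the first of those.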
Filling Missing Observations
Policy DAY1 DAY2 DAY3 DAY4 DAY5 DAY6
Identical 2xx 2xx 2xx 4xx 2xx
Closest 2xx 2xx 2xx 4xx 4xx 2xx
Forward 2xx 2xx 2xx 2xx 4xx 2xx
Backward 2xx 2xx 4xx 4xx 4xx 2xx
ANY 2xx 2xx
Do not fill the gap if the status codes before and after are not identical.
Do not fill the gap if it is larger than a configured threshold.
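One way to read the four policies is sketched below. It is a hedged illustration only: the function name `fill_gaps`, the `max_gap` parameter, the tie-breaking rule for "closest", and the decision to leave leading and trailing gaps alone are all assumptions, and the sample input is one possible reading of the table above (observations on days 1, 2, 5, and 6).

```python
def fill_gaps(days, policy="identical", max_gap=None):
    """Fill None entries in a list of daily status classes.

    policy:  "identical", "closest", "forward", or "backward"
    max_gap: gaps longer than this many days stay unfilled (None = no limit)
    Leading and trailing gaps are never filled in this sketch.
    """
    if policy not in ("identical", "closest", "forward", "backward"):
        raise ValueError(f"unknown policy: {policy}")
    filled = list(days)
    i, n = 0, len(days)
    while i < n:
        if days[i] is not None:
            i += 1
            continue
        start = i
        while i < n and days[i] is None:
            i += 1
        end = i  # the gap is days[start:end]
        if start == 0 or end == n:
            continue
        if max_gap is not None and end - start > max_gap:
            continue
        before, after = days[start - 1], days[end]
        for j in range(start, end):
            if policy == "identical":
                filled[j] = before if before == after else None
            elif policy == "forward":
                filled[j] = before
            elif policy == "backward":
                filled[j] = after
            else:  # closest; ties go to the earlier observation (arbitrary)
                filled[j] = before if (j - start + 1) <= (end - j) else after
    return filled

observed = ["2xx", "2xx", None, None, "4xx", "2xx"]
```

Calling `fill_gaps(observed, "forward")` copies the last status before the gap into it, `"backward"` copies the next one, `"closest"` splits the gap between the two neighbors, and `"identical"` leaves this gap open because its neighbors differ.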
TrendMachine: TimeMap Status Codes vs. Daily Specimens
Most of the self-redirect 3xx observations (HTTP/HTTPS or WWW/Apex domain) are eliminated in daily specimens.
About one third of the days since the first observation have no captures, some of which are filled using a filling policy.
TrendMachine: Resilience
● The Resilience score is calculated by applying a sigmoid function to the status codes of daily specimens
● The score starts at 0.5 and is normalized between 0 and 1
● After the first few observations, the Wayback Machine did not archive the page for several months in 2002
● Towards the end of 2002, the Resilience score went up slowly due to infrequent archiving
● In 2003 “wikipedia.org” started to redirect to “en.wikipedia.org”
● After 2005, the Resilience of the Wikipedia home page has mostly been stable and high
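To make the idea of a sigmoid-based score concrete, here is a deliberately simplified sketch. It is not TrendMachine's actual formula, which the slides do not spell out. The assumed model keeps a latent value that starts at 0 (so the score starts at 0.5), moves up for each 2xx specimen and down for any other status class, and is squashed into the range 0 to 1 by a logistic function. The name `resilience_curve` and the `step` and `limit` parameters are hypothetical.

```python
import math

def resilience_curve(specimens, step=0.1, limit=10.0):
    """Return one score per day under an assumed, simplified model.

    specimens: daily status classes such as "2xx", or None for days
               with no specimen (the latent value is left unchanged).
    step:      how far one observation moves the latent value.
    limit:     the latent value is clamped to [-limit, limit].
    """
    latent = 0.0  # logistic(0) == 0.5, the initial score
    scores = []
    for status in specimens:
        if status is not None:
            latent += step if status == "2xx" else -step
            latent = max(-limit, min(limit, latent))
        scores.append(1.0 / (1.0 + math.exp(-latent)))
    return scores
```

Under this model a long run of 2xx specimens pushes the score toward 1, an extended redirect or error period pulls it back toward 0, and days without specimens leave it flat.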
TrendMachine: Fixity
● Fixity score (normalized) is calculated using a sigmoid function on content digests of daily specimens
● Content digest reported in CDX can be sensitive to Content-Encoding, resulting in false alarms, even when the underlying content remains unchanged
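The Content-Encoding sensitivity is easy to reproduce: hashing the bytes as transferred gives a different digest for a gzip-encoded response than for the same content sent unencoded. The sketch below assumes a Base32-encoded SHA-1 digest, as commonly seen in CDX records; the sample body and the helper name `b32_sha1` are made up for illustration.

```python
import base64
import gzip
import hashlib

def b32_sha1(payload: bytes) -> str:
    """Base32-encoded SHA-1 of a byte string."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")

body = b"<html><body>Hello, archive!</body></html>"
gzipped = gzip.compress(body, mtime=0)  # same content with gzip encoding

digest_identity = b32_sha1(body)
digest_gzip = b32_sha1(gzipped)

# The digests differ although the decoded content is byte-for-byte equal.
content_unchanged = gzip.decompress(gzipped) == body
digests_differ = digest_identity != digest_gzip
```

A Fixity calculation that compares such digests directly would register a change here, even though `content_unchanged` is true.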
TrendMachine: Chaos
● Chaos score (normalized) is calculated using a technique inspired by run-length encoding on all status codes of the CDX data: consecutive duplicates are removed in the numerator
● An alternate sliding-window calculation is performed on the last N observations, because the score becomes insensitive to recent changes on large TimeMaps
● A high Chaos along with a high Resilience is often an indication of canonical redirects (e.g., adoption of HTTPS and/or consolidation of WWW and Apex domain)
\[ \text{Chaos} = \frac{|\,\text{2xx}, \text{3xx}, \text{2xx}\,|}{|\,\text{2xx}, \text{2xx}, \text{2xx}, \text{3xx}, \text{3xx}, \text{2xx}\,|} = \frac{3}{6} = 0.5 \]
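The worked example above can be reproduced with a few lines of Python. This is a sketch of the ratio as described on the slide, with an optional sliding window; the function name `chaos`, the `window` parameter, and the value returned for an empty sequence are assumptions.

```python
from itertools import groupby

def chaos(statuses, window=None):
    """Ratio of runs of identical status codes to total observations.

    If `window` (a positive integer) is given, only the last `window`
    observations are used. Returns 0.0 for an empty sequence, which is
    an arbitrary choice in this sketch.
    """
    if window is not None:
        statuses = statuses[-window:]
    if not statuses:
        return 0.0
    # groupby collapses consecutive duplicates into one group per run.
    runs = sum(1 for _ in groupby(statuses))
    return runs / len(statuses)

example = ["2xx", "2xx", "2xx", "3xx", "3xx", "2xx"]
score = chaos(example)
```

On a long history such as 98 stable `2xx` observations followed by `3xx, 2xx`, the full-sequence score stays small, while `window=4` makes the recent flip clearly visible.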
TrendMachine: Status Code Transitions
● Large numbers along the major diagonal indicate status code stability for extended periods of time
● Large numbers in non-diagonal cells suggest frequent changes in the Resilience curve
● Web pages with a high Resilience score for extended periods usually exhibit large numbers in the top-left cell (2xx -> 2xx)
● A large number in the 3xx -> 3xx cell usually indicates extended periods of redirection to other URLs (e.g., URL restructuring, login wall, domain change, and parked domain)
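A transition matrix of this kind can be built by counting consecutive pairs of status classes. The following is a minimal sketch: the function name `transition_matrix` and the row and column order after 2xx are assumptions (the slide only states that 2xx -> 2xx is the top-left cell).

```python
from collections import Counter

CLASSES = ["2xx", "3xx", "4xx", "5xx"]  # order after "2xx" is assumed

def transition_matrix(statuses):
    """Count transitions between consecutive status classes.

    Returns a nested dict where matrix[src][dst] is the number of times
    a `src` observation was immediately followed by a `dst` observation.
    """
    counts = Counter(zip(statuses, statuses[1:]))
    return {src: {dst: counts[(src, dst)] for dst in CLASSES} for src in CLASSES}

matrix = transition_matrix(["2xx", "2xx", "2xx", "3xx", "3xx", "2xx"])
```

Here `matrix["2xx"]["2xx"]` counts the stable stretch on the diagonal, while `matrix["2xx"]["3xx"]` and `matrix["3xx"]["2xx"]` count the off-diagonal flips.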
TrendMachine: Compare First and Last Mementos
TrendMachine: Live Web Page With Headers
Potential Use Cases
● Detect points of interest in a large TimeMap
● Sample captures/mementos from TimeMaps for visual summarization
● Detect archival sinks (like login pages, paywalls, and misconfigured redirects)
● Detect poor-quality pages like soft 404s and parked domains
● Detect potential link rot (and fix the affected links when possible, like in a wiki page)
● Optimize crawl jobs by minimizing wasteful downloads and maximizing coverage
● Archival quality assurance
● Cluster pages of a large archival collection in different categories
Future Work
● Report heuristics-based archival summary by combining various scores
● Report/embed captures/mementos that can be points of interest
● Calculate Fixity using less-sensitive digests (e.g., SimHash)
● Calculate Chaos after applying convolutions to smooth out alternate changes
● Allow alternate web page health models (not just Sigmoid functions)
● Deploy in production by integrating with Wayback Machine
Summary
Code: https://github.com/internetarchive/trendmachine
Demo: https://trendmachine.sawood-dev.us.archive.org/
● A mathematical model to quantify temporal health of a web page
● Resilience, Fixity, Chaos, Distributions, Transitions, etc. reports
● An interactive portal with configuration options for experiments
● An evolving open-source codebase and demo deployment

More Related Content

What's hot

Kelainan Patologis Pada Antenatal Care
Kelainan Patologis Pada Antenatal CareKelainan Patologis Pada Antenatal Care
Kelainan Patologis Pada Antenatal CareDokter Tekno
 
Kb 4 monitoring dan evaluasi
Kb 4 monitoring dan evaluasiKb 4 monitoring dan evaluasi
Kb 4 monitoring dan evaluasipjj_kemenkes
 
[Retail & CPG Day 2019] 리테일/소비재 부문의 고객 경험 강화를 위한 기술변화 방향과 고객 사례 (ZIGZAG) - 김선...
[Retail & CPG Day 2019] 리테일/소비재 부문의 고객 경험 강화를 위한 기술변화 방향과 고객 사례 (ZIGZAG) - 김선...[Retail & CPG Day 2019] 리테일/소비재 부문의 고객 경험 강화를 위한 기술변화 방향과 고객 사례 (ZIGZAG) - 김선...
[Retail & CPG Day 2019] 리테일/소비재 부문의 고객 경험 강화를 위한 기술변화 방향과 고객 사례 (ZIGZAG) - 김선...Amazon Web Services Korea
 
ECS to EKS 마이그레이션 경험기 - 유용환(Superb AI) :: AWS Community Day Online 2021
ECS to EKS 마이그레이션 경험기 - 유용환(Superb AI) :: AWS Community Day Online 2021ECS to EKS 마이그레이션 경험기 - 유용환(Superb AI) :: AWS Community Day Online 2021
ECS to EKS 마이그레이션 경험기 - 유용환(Superb AI) :: AWS Community Day Online 2021AWSKRUG - AWS한국사용자모임
 
Advanced networking on AWS | AWS Floor28
Advanced networking on AWS | AWS Floor28Advanced networking on AWS | AWS Floor28
Advanced networking on AWS | AWS Floor28Amazon Web Services
 
[Retail & CPG Day 2019] 마켓컬리 서비스 AWS 이관 및 최적화 여정 - 임상석, 마켓컬리 개발 리더
[Retail & CPG Day 2019] 마켓컬리 서비스 AWS 이관 및 최적화 여정 - 임상석, 마켓컬리 개발 리더[Retail & CPG Day 2019] 마켓컬리 서비스 AWS 이관 및 최적화 여정 - 임상석, 마켓컬리 개발 리더
[Retail & CPG Day 2019] 마켓컬리 서비스 AWS 이관 및 최적화 여정 - 임상석, 마켓컬리 개발 리더Amazon Web Services Korea
 
MANAJEMEN DAN PENDOKUMENTASIAN ASUHAN KEBIDANAN PADA NY.”R” DENGAN PERDARAHA...
MANAJEMEN DAN PENDOKUMENTASIAN ASUHAN KEBIDANAN PADA  NY.”R” DENGAN PERDARAHA...MANAJEMEN DAN PENDOKUMENTASIAN ASUHAN KEBIDANAN PADA  NY.”R” DENGAN PERDARAHA...
MANAJEMEN DAN PENDOKUMENTASIAN ASUHAN KEBIDANAN PADA NY.”R” DENGAN PERDARAHA...Warnet Raha
 
AWS 기반 소프트웨어 서비스(SaaS) -김용우 솔루션즈 아키텍트 :: AWS 파트너 테크시프트 세미나
AWS 기반 소프트웨어 서비스(SaaS) -김용우 솔루션즈 아키텍트 :: AWS 파트너 테크시프트 세미나 AWS 기반 소프트웨어 서비스(SaaS) -김용우 솔루션즈 아키텍트 :: AWS 파트너 테크시프트 세미나
AWS 기반 소프트웨어 서비스(SaaS) -김용우 솔루션즈 아키텍트 :: AWS 파트너 테크시프트 세미나 Amazon Web Services Korea
 
Pp obstipasi
Pp obstipasiPp obstipasi
Pp obstipasiGepy Gbu
 
Modul 6 kb 2 pengaturan suhu, metabolisme, glukosa, perubahan sistem gastro...
Modul 6 kb 2   pengaturan suhu, metabolisme, glukosa, perubahan sistem gastro...Modul 6 kb 2   pengaturan suhu, metabolisme, glukosa, perubahan sistem gastro...
Modul 6 kb 2 pengaturan suhu, metabolisme, glukosa, perubahan sistem gastro...pjj_kemenkes
 
Manajemen icu
Manajemen icuManajemen icu
Manajemen icuMaf ID
 
ASPEK PERLINDUNGAN HUKUM
 ASPEK PERLINDUNGAN HUKUM ASPEK PERLINDUNGAN HUKUM
ASPEK PERLINDUNGAN HUKUMDiandr
 
dokumentasi kebidanan sistem pengumpulan data rekam medik dan sistem dokument...
dokumentasi kebidanan sistem pengumpulan data rekam medik dan sistem dokument...dokumentasi kebidanan sistem pengumpulan data rekam medik dan sistem dokument...
dokumentasi kebidanan sistem pengumpulan data rekam medik dan sistem dokument...Hikmah Ifayanti
 
BBLR (BAYI BERAT LAHIR RENDAH)
BBLR (BAYI BERAT LAHIR RENDAH)BBLR (BAYI BERAT LAHIR RENDAH)
BBLR (BAYI BERAT LAHIR RENDAH)Nenggar Sesanti
 
Kebutuhan Dasar Ibu Masa Nifas
Kebutuhan Dasar Ibu Masa NifasKebutuhan Dasar Ibu Masa Nifas
Kebutuhan Dasar Ibu Masa Nifaspjj_kemenkes
 
Bentuk program menjaga mutu perspektif
Bentuk program menjaga mutu perspektifBentuk program menjaga mutu perspektif
Bentuk program menjaga mutu perspektifBayu Fijrie
 
Implementasi Telemedicine di Indonesia
Implementasi Telemedicine di IndonesiaImplementasi Telemedicine di Indonesia
Implementasi Telemedicine di IndonesiaStefanus Nofa
 

What's hot (20)

Kelainan Patologis Pada Antenatal Care
Kelainan Patologis Pada Antenatal CareKelainan Patologis Pada Antenatal Care
Kelainan Patologis Pada Antenatal Care
 
Kb 4 monitoring dan evaluasi
Kb 4 monitoring dan evaluasiKb 4 monitoring dan evaluasi
Kb 4 monitoring dan evaluasi
 
[Retail & CPG Day 2019] 리테일/소비재 부문의 고객 경험 강화를 위한 기술변화 방향과 고객 사례 (ZIGZAG) - 김선...
[Retail & CPG Day 2019] 리테일/소비재 부문의 고객 경험 강화를 위한 기술변화 방향과 고객 사례 (ZIGZAG) - 김선...[Retail & CPG Day 2019] 리테일/소비재 부문의 고객 경험 강화를 위한 기술변화 방향과 고객 사례 (ZIGZAG) - 김선...
[Retail & CPG Day 2019] 리테일/소비재 부문의 고객 경험 강화를 위한 기술변화 방향과 고객 사례 (ZIGZAG) - 김선...
 
PPT LTA KEBIDANAN
PPT LTA KEBIDANANPPT LTA KEBIDANAN
PPT LTA KEBIDANAN
 
ECS to EKS 마이그레이션 경험기 - 유용환(Superb AI) :: AWS Community Day Online 2021
ECS to EKS 마이그레이션 경험기 - 유용환(Superb AI) :: AWS Community Day Online 2021ECS to EKS 마이그레이션 경험기 - 유용환(Superb AI) :: AWS Community Day Online 2021
ECS to EKS 마이그레이션 경험기 - 유용환(Superb AI) :: AWS Community Day Online 2021
 
Advanced networking on AWS | AWS Floor28
Advanced networking on AWS | AWS Floor28Advanced networking on AWS | AWS Floor28
Advanced networking on AWS | AWS Floor28
 
[Retail & CPG Day 2019] 마켓컬리 서비스 AWS 이관 및 최적화 여정 - 임상석, 마켓컬리 개발 리더
[Retail & CPG Day 2019] 마켓컬리 서비스 AWS 이관 및 최적화 여정 - 임상석, 마켓컬리 개발 리더[Retail & CPG Day 2019] 마켓컬리 서비스 AWS 이관 및 최적화 여정 - 임상석, 마켓컬리 개발 리더
[Retail & CPG Day 2019] 마켓컬리 서비스 AWS 이관 및 최적화 여정 - 임상석, 마켓컬리 개발 리더
 
MANAJEMEN DAN PENDOKUMENTASIAN ASUHAN KEBIDANAN PADA NY.”R” DENGAN PERDARAHA...
MANAJEMEN DAN PENDOKUMENTASIAN ASUHAN KEBIDANAN PADA  NY.”R” DENGAN PERDARAHA...MANAJEMEN DAN PENDOKUMENTASIAN ASUHAN KEBIDANAN PADA  NY.”R” DENGAN PERDARAHA...
MANAJEMEN DAN PENDOKUMENTASIAN ASUHAN KEBIDANAN PADA NY.”R” DENGAN PERDARAHA...
 
Makalah bayi meninggal mendadak
Makalah bayi meninggal mendadakMakalah bayi meninggal mendadak
Makalah bayi meninggal mendadak
 
AWS 기반 소프트웨어 서비스(SaaS) -김용우 솔루션즈 아키텍트 :: AWS 파트너 테크시프트 세미나
AWS 기반 소프트웨어 서비스(SaaS) -김용우 솔루션즈 아키텍트 :: AWS 파트너 테크시프트 세미나 AWS 기반 소프트웨어 서비스(SaaS) -김용우 솔루션즈 아키텍트 :: AWS 파트너 테크시프트 세미나
AWS 기반 소프트웨어 서비스(SaaS) -김용우 솔루션즈 아키텍트 :: AWS 파트너 테크시프트 세미나
 
Usia Lanjut
Usia Lanjut Usia Lanjut
Usia Lanjut
 
Pp obstipasi
Pp obstipasiPp obstipasi
Pp obstipasi
 
Modul 6 kb 2 pengaturan suhu, metabolisme, glukosa, perubahan sistem gastro...
Modul 6 kb 2   pengaturan suhu, metabolisme, glukosa, perubahan sistem gastro...Modul 6 kb 2   pengaturan suhu, metabolisme, glukosa, perubahan sistem gastro...
Modul 6 kb 2 pengaturan suhu, metabolisme, glukosa, perubahan sistem gastro...
 
Manajemen icu
Manajemen icuManajemen icu
Manajemen icu
 
ASPEK PERLINDUNGAN HUKUM
 ASPEK PERLINDUNGAN HUKUM ASPEK PERLINDUNGAN HUKUM
ASPEK PERLINDUNGAN HUKUM
 
dokumentasi kebidanan sistem pengumpulan data rekam medik dan sistem dokument...
dokumentasi kebidanan sistem pengumpulan data rekam medik dan sistem dokument...dokumentasi kebidanan sistem pengumpulan data rekam medik dan sistem dokument...
dokumentasi kebidanan sistem pengumpulan data rekam medik dan sistem dokument...
 
BBLR (BAYI BERAT LAHIR RENDAH)
BBLR (BAYI BERAT LAHIR RENDAH)BBLR (BAYI BERAT LAHIR RENDAH)
BBLR (BAYI BERAT LAHIR RENDAH)
 
Kebutuhan Dasar Ibu Masa Nifas
Kebutuhan Dasar Ibu Masa NifasKebutuhan Dasar Ibu Masa Nifas
Kebutuhan Dasar Ibu Masa Nifas
 
Bentuk program menjaga mutu perspektif
Bentuk program menjaga mutu perspektifBentuk program menjaga mutu perspektif
Bentuk program menjaga mutu perspektif
 
Implementasi Telemedicine di Indonesia
Implementasi Telemedicine di IndonesiaImplementasi Telemedicine di Indonesia
Implementasi Telemedicine di Indonesia
 

Similar to TrendMachine: Temporal Resilience of Web Pages

Geographic Distribution for Global Web Application Performance
Geographic Distribution for Global Web Application PerformanceGeographic Distribution for Global Web Application Performance
Geographic Distribution for Global Web Application Performancekkjjkevin03
 
Building Cloud-Native Applications in MiCADO - MiCADO webinar No.2/4 - 09/2019
Building Cloud-Native Applications in MiCADO - MiCADO webinar No.2/4 - 09/2019Building Cloud-Native Applications in MiCADO - MiCADO webinar No.2/4 - 09/2019
Building Cloud-Native Applications in MiCADO - MiCADO webinar No.2/4 - 09/2019Project COLA
 
WSA: Scaling Web Service to Handle Millions of Requests per Second
WSA: Scaling Web Service to Handle Millions of Requests per SecondWSA: Scaling Web Service to Handle Millions of Requests per Second
WSA: Scaling Web Service to Handle Millions of Requests per SecondWebStackAcademy
 
Performance-driven front-end development
Performance-driven front-end developmentPerformance-driven front-end development
Performance-driven front-end developmentyouth Overturn
 
WordPress Cluster for Enterprise High-Availability and On-Demand Scaling
WordPress Cluster for Enterprise High-Availability and On-Demand ScalingWordPress Cluster for Enterprise High-Availability and On-Demand Scaling
WordPress Cluster for Enterprise High-Availability and On-Demand ScalingJelastic Multi-Cloud PaaS
 
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web ArchiveLessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web ArchiveKritika Garg
 
MySQL Schema Design in Practice
MySQL Schema Design in PracticeMySQL Schema Design in Practice
MySQL Schema Design in PracticeJaime Crespo
 
Monitoring web application response times^lj a hybrid approach for windows
Monitoring web application response times^lj a hybrid approach for windowsMonitoring web application response times^lj a hybrid approach for windows
Monitoring web application response times^lj a hybrid approach for windowsMark Friedman
 
Why is this ASP.NET web app running slowly?
Why is this ASP.NET web app running slowly?Why is this ASP.NET web app running slowly?
Why is this ASP.NET web app running slowly?Mark Friedman
 
Targeting Mobile Platform with MVC 4.0
Targeting Mobile Platform with MVC 4.0Targeting Mobile Platform with MVC 4.0
Targeting Mobile Platform with MVC 4.0Mayank Srivastava
 
Introduction to WSO2 Storage Server
Introduction to WSO2 Storage Server Introduction to WSO2 Storage Server
Introduction to WSO2 Storage Server WSO2
 
Majid_Jalili_SRC_2014
Majid_Jalili_SRC_2014Majid_Jalili_SRC_2014
Majid_Jalili_SRC_2014Majid Jalili
 
Private cloud with vmware
Private cloud with vmwarePrivate cloud with vmware
Private cloud with vmwareAnton An
 
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...DataStax Academy
 
Docker в автоматизации тестирования
Docker в автоматизации тестированияDocker в автоматизации тестирования
Docker в автоматизации тестированияCOMAQA.BY
 
Understanding the Top Four Use Cases for IoT
Understanding the Top Four Use Cases for IoTUnderstanding the Top Four Use Cases for IoT
Understanding the Top Four Use Cases for IoTVoltDB
 

Similar to TrendMachine: Temporal Resilience of Web Pages (20)

Big datainmemory pub
Big datainmemory pubBig datainmemory pub
Big datainmemory pub
 
Geographic Distribution for Global Web Application Performance
Geographic Distribution for Global Web Application PerformanceGeographic Distribution for Global Web Application Performance
Geographic Distribution for Global Web Application Performance
 
Introduction to ASP.NET MVC
Introduction to ASP.NET MVCIntroduction to ASP.NET MVC
Introduction to ASP.NET MVC
 
Building Cloud-Native Applications in MiCADO - MiCADO webinar No.2/4 - 09/2019
Building Cloud-Native Applications in MiCADO - MiCADO webinar No.2/4 - 09/2019Building Cloud-Native Applications in MiCADO - MiCADO webinar No.2/4 - 09/2019
Building Cloud-Native Applications in MiCADO - MiCADO webinar No.2/4 - 09/2019
 
WSA: Scaling Web Service to Handle Millions of Requests per Second
WSA: Scaling Web Service to Handle Millions of Requests per SecondWSA: Scaling Web Service to Handle Millions of Requests per Second
WSA: Scaling Web Service to Handle Millions of Requests per Second
 
Performance-driven front-end development
Performance-driven front-end developmentPerformance-driven front-end development
Performance-driven front-end development
 
WordPress Cluster for Enterprise High-Availability and On-Demand Scaling
WordPress Cluster for Enterprise High-Availability and On-Demand ScalingWordPress Cluster for Enterprise High-Availability and On-Demand Scaling
WordPress Cluster for Enterprise High-Availability and On-Demand Scaling
 
IT Resilience Technical
IT Resilience TechnicalIT Resilience Technical
IT Resilience Technical
 
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web ArchiveLessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
 
MySQL Schema Design in Practice
MySQL Schema Design in PracticeMySQL Schema Design in Practice
MySQL Schema Design in Practice
 
Monitoring web application response times^lj a hybrid approach for windows
Monitoring web application response times^lj a hybrid approach for windowsMonitoring web application response times^lj a hybrid approach for windows
Monitoring web application response times^lj a hybrid approach for windows
 
Why is this ASP.NET web app running slowly?
Why is this ASP.NET web app running slowly?Why is this ASP.NET web app running slowly?
Why is this ASP.NET web app running slowly?
 
Targeting Mobile Platform with MVC 4.0
Targeting Mobile Platform with MVC 4.0Targeting Mobile Platform with MVC 4.0
Targeting Mobile Platform with MVC 4.0
 
Introduction to WSO2 Storage Server
Introduction to WSO2 Storage Server Introduction to WSO2 Storage Server
Introduction to WSO2 Storage Server
 
Majid_Jalili_SRC_2014
Majid_Jalili_SRC_2014Majid_Jalili_SRC_2014
Majid_Jalili_SRC_2014
 
Private cloud with vmware
Private cloud with vmwarePrivate cloud with vmware
Private cloud with vmware
 
Web Performance Optimization
Web Performance OptimizationWeb Performance Optimization
Web Performance Optimization
 
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
 
Docker в автоматизации тестирования
Docker в автоматизации тестированияDocker в автоматизации тестирования
Docker в автоматизации тестирования
 
Understanding the Top Four Use Cases for IoT
Understanding the Top Four Use Cases for IoTUnderstanding the Top Four Use Cases for IoT
Understanding the Top Four Use Cases for IoT
 

More from Sawood Alam

CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsSawood Alam
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineSawood Alam
 
Profiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingProfiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingSawood Alam
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesSawood Alam
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSawood Alam
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingSawood Alam
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSawood Alam
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkSawood Alam
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesSawood Alam
 
Archive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkArchive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkSawood Alam
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingSawood Alam
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File FormatSawood Alam
 
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingInterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingSawood Alam
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoSawood Alam
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationSawood Alam
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerSawood Alam
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerSawood Alam
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingSawood Alam
 
Introducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupIntroducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupSawood Alam
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesSawood Alam
 

More from Sawood Alam (20)

CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection Insights
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
 
Profiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingProfiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento Routing
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMap
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web Packaging
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination Framework
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web Archives
 
Archive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkArchive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification Framework
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File Format
 
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingInterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in Go
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to Containerization
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorker
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorker
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive Profiling
 
Introducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupIntroducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research Group
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
 


TrendMachine: Temporal Resilience of Web Pages

  • 1. TrendMachine: Temporal Resilience of Web Pages. @WaybackMachine. IIPC Web Archiving Conference (WAC), May 03, 2023, Online. Authors: Sawood Alam (Internet Archive), Mark Graham (Internet Archive), Kritika Garg (Old Dominion University), Michele C. Weigle (Old Dominion University), Michael L. Nelson (Old Dominion University), Dietrich Ayala (Protocol Labs). @WebSciDL @ProtocolLabs. Supported in part by Protocol Labs and Filecoin Foundation.
  • 2. Research Question: How healthy has a web page been throughout its lifetime?
  • 3. Temporal and Spatial Landscape of Archival Analysis. [Figure: tools plotted on two axes, Long Duration to Short Duration and Single Webpage to Webpage Collection] ● TMVis ● Wayback Machine Changes ● TrendMachine ● MementoMap ● CDX Summary ● Archives Unleashed Toolkit ● Memento Damage ● Archival ACID Test ● Reconstructive ● Warrick ● Wayback Machine Downloader ● Video Archiving Insights
  • 4. Modeling Web Page Health: Linear vs. S-Curve
  • 5. Sigmoid Function for Web Page Resilience
      ● Spread: How far up or down can the value go from its starting position?
      ● Shift: How soon can any significant change in the value begin?
      ● Slope: How quickly does the value get close to the maximum change?
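The slide names the three parameters but does not give a formula. As a rough sketch only, one common way to expose Spread, Shift, and Slope is a scaled and translated logistic function. The function name `sigmoid`, the `start` argument, and this exact parameterization are assumptions for illustration, not TrendMachine's actual implementation.

```python
import math


def sigmoid(x: float, spread: float = 1.0, shift: float = 0.0,
            slope: float = 1.0, start: float = 0.0) -> float:
    """Hypothetical parameterized logistic curve (not taken from TrendMachine).

    start  -- value the curve approaches as x goes to minus infinity
    spread -- total change from `start` (negative values move the curve down)
    shift  -- x position of the midpoint, where half of `spread` is reached
    slope  -- steepness around the midpoint
    """
    z = slope * (x - shift)
    # Evaluate the logistic in a numerically stable way for large |z|.
    if z >= 0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        s = e / (1.0 + e)
    return start + spread * s


# At the midpoint (x == shift) exactly half of the spread has been applied.
midpoint = sigmoid(5.0, spread=-0.4, shift=5.0, slope=2.0, start=0.5)
```

In this form, a larger `shift` delays the onset of change, and a larger `slope` makes the transition from the starting value to `start + spread` more abrupt.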
  • 6. TrendMachine: Composite Sigmoid Parameters of Resilience
  • 7. TrendMachine: Overview. Code: https://github.com/internetarchive/trendmachine Demo: https://trendmachine.sawood-dev.us.archive.org/
  • 8. TrendMachine: Temporal Distribution of Archiving Activities. On a single day, the page may be archived zero times or once, or as many as tens of thousands of times.
  • 9. Specimen Selection Algorithm
      PRIORITY = ["2xx", "4xx", "5xx", "3xx"]
      FOREACH st OF PRIORITY
        IF st IN statuses(day)
          specimen = statuses(day).match(st)[0]
          BREAK
      [Figure: example status codes of the captures on DAY1 through DAY4: 4xx 3xx 5xx 3xx 3xx 3xx 3xx 5xx 2xx 3xx 5xx 3xx 5xx 4xx 5xx 2xx 4xx]
      3xx has the lowest priority, so a day gets a 3xx specimen only when none of its captures is a 2xx, 4xx, or 5xx. A 3xx specimen therefore usually suggests that the URL is redirecting to somewhere other than a variation of the same URL.
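The pseudocode above can be sketched in Python roughly as follows. The capture record shape (a dict with `timestamp` and `status` keys) and the helper names `status_class` and `select_specimen` are hypothetical choices for this illustration. Only the priority order comes from the slide.

```python
from typing import Dict, List, Optional

# Priority order from the slide: prefer 2xx, then 4xx, then 5xx, and 3xx last.
PRIORITY = ["2xx", "4xx", "5xx", "3xx"]


def status_class(code: str) -> str:
    """Map a status code such as "404" to its class "4xx" (hypothetical helper)."""
    return code[:1] + "xx"


def select_specimen(day_captures: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the first capture of the highest-priority status class for one day.

    Returns None when the day has no capture in any of the four classes.
    """
    for wanted in PRIORITY:
        for capture in day_captures:
            if status_class(capture["status"]) == wanted:
                return capture
    return None


# Example day with a redirect, a server error, and a successful capture.
day = [
    {"timestamp": "20030101000000", "status": "301"},
    {"timestamp": "20030101120000", "status": "503"},
    {"timestamp": "20030101180000", "status": "200"},
]
specimen = select_specimen(day)  # picks the 200 capture
```

Because the outer loop walks the priority list, a redirect is chosen only when the day contains nothing but redirects, which matches the observation about 3xx specimens on the slide.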
  • 10. Filling Missing Observations
      Policy: DAY1 DAY2 DAY3 DAY4 DAY5 DAY6
      Identical: 2xx 2xx 2xx 4xx 2xx
      Closest: 2xx 2xx 2xx 4xx 4xx 2xx
      Forward: 2xx 2xx 2xx 2xx 4xx 2xx
      Backward: 2xx 2xx 4xx 4xx 4xx 2xx
      ANY: 2xx 2xx
      ● Do not fill the gap if the status codes before and after are not identical.
      ● Do not fill the gap if it is larger than a configured threshold.
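A minimal sketch of how the four named filling policies might work on a list of daily specimens, with `None` marking a day without captures. The function name `fill_gaps`, the `max_gap` default, the tie-breaking rule for "closest" (ties go to the earlier observation), and the choice to leave leading and trailing gaps unfilled are all assumptions made for illustration. The slides do not specify these details.

```python
from typing import List, Optional

POLICIES = {"identical", "closest", "forward", "backward"}


def fill_gaps(days: List[Optional[str]], policy: str = "identical",
              max_gap: int = 30) -> List[Optional[str]]:
    """Fill runs of missing days (None) between two observed days.

    Gaps longer than `max_gap` days, and gaps at either end of the
    series, are left unfilled (assumed behavior).
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    filled = list(days)
    n = len(days)
    i = 0
    while i < n:
        if days[i] is not None:
            i += 1
            continue
        start = i  # first missing day of this gap
        while i < n and days[i] is None:
            i += 1
        end = i  # index just past the last missing day
        before = days[start - 1] if start > 0 else None
        after = days[end] if end < n else None
        if before is None or after is None or end - start > max_gap:
            continue
        for j in range(start, end):
            if policy == "identical":
                # Fill only when both neighbors agree.
                filled[j] = before if before == after else None
            elif policy == "forward":
                filled[j] = before
            elif policy == "backward":
                filled[j] = after
            else:  # "closest": compare distance to each neighboring observation
                filled[j] = before if j - (start - 1) <= end - j else after
    return filled


observed = ["2xx", "2xx", None, None, "4xx", "2xx"]
forward = fill_gaps(observed, "forward")
```

The "identical" policy is the most conservative of the four, since it never invents a status code that is contradicted by either neighboring observation.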
  • 11. TrendMachine: TimeMap Status Codes vs. Daily Specimens
      ● Most of the self-redirect 3xx observations (HTTP/HTTPS or WWW/Apex domain) are eliminated in daily specimens.
      ● About one third of the days since the first observation have no captures; some of these gaps are filled using a filling policy.
  • 12. TrendMachine: Resilience
      ● Resilience score is calculated using the Sigmoid function on status codes of daily specimens
      ● The score starts at an initial value of 0.5 and is normalized between 0 and 1
      ● After the first few observations, the Wayback Machine did not archive the page for several months in 2002
      ● Towards the end of 2002, the Resilience score went up slowly due to infrequent archiving
      ● In 2003 “wikipedia.org” started to redirect to “en.wikipedia.org”
      ● After 2005, Resilience of the Wikipedia home page has mostly been stable and high
  • 13. TrendMachine: Fixity
      ● Fixity score (normalized) is calculated using the Sigmoid function on content digests of daily specimens
      ● Content digest reported in CDX can be sensitive to Content-Encoding, resulting in false alarms, even when the underlying content remains unchanged
  • 14. TrendMachine: Chaos
      ● Chaos score (normalized) is calculated using a Run-Length Encoding inspired technique on all status codes of the CDX data: the number of status codes left after removing consecutive duplicates (numerator) is divided by the total number of status codes (denominator)
      ● An alternate sliding-window calculation is performed on the last N observations, as the score becomes insensitive to recent changes on large TimeMaps
      ● A high Chaos along with a high Resilience is often an indication of canonical redirects (e.g., adoption of HTTPS and/or consolidation of WWW and Apex domain)
      Chaos = |2xx, 3xx, 2xx| / |2xx, 2xx, 2xx, 3xx, 3xx, 2xx| = 3 / 6 = 0.5
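The ratio in the example above can be sketched in a few lines of Python. The function name `chaos` and the `window` parameter are assumptions for illustration. The sketch computes only the raw ratio shown on the slide, and any further normalization that TrendMachine may apply is not reproduced here.

```python
from itertools import groupby
from typing import Optional, Sequence


def chaos(statuses: Sequence[str], window: Optional[int] = None) -> float:
    """Ratio of runs of identical status codes to the total number of codes.

    If `window` is given, only the last `window` observations are used,
    mirroring the sliding-window variant mentioned on the slide.
    """
    if window:
        statuses = statuses[-window:]
    if not statuses:
        return 0.0
    # groupby collapses each run of consecutive duplicates into one group.
    runs = sum(1 for _ in groupby(statuses))
    return runs / len(statuses)


codes = ["2xx", "2xx", "2xx", "3xx", "3xx", "2xx"]
score = chaos(codes)            # 3 runs over 6 codes
recent = chaos(codes, window=3)  # 2 runs over the last 3 codes
```

On a long, mostly stable TimeMap the denominator dominates, which is why the windowed variant is more responsive to recent changes.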
  • 15. TrendMachine: Status Code Transitions
      ● Large numbers along the major diagonal indicate status code stability for extended periods of time
      ● Large numbers in non-diagonal cells suggest frequent changes in the Resilience curve
      ● Web pages with a high Resilience score for extended periods usually exhibit large numbers in the top-left cell (2xx -> 2xx)
      ● A large number in the 3xx -> 3xx cell usually indicates extended periods of redirection to other URLs (e.g., URL restructuring, login wall, domain change, and parked domain)
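A transition matrix of this kind can be built by counting consecutive pairs of status classes. The sketch below is illustrative only: the name `transition_matrix` and the row and column order beyond the 2xx -> 2xx top-left cell are assumptions.

```python
from collections import Counter
from typing import List, Sequence

# Row/column order is assumed; the slide only fixes 2xx -> 2xx as the top-left cell.
CLASSES = ["2xx", "3xx", "4xx", "5xx"]


def transition_matrix(statuses: Sequence[str]) -> List[List[int]]:
    """Count transitions between consecutive status classes.

    Cell [i][j] holds how often CLASSES[i] was immediately followed by CLASSES[j].
    """
    pairs = Counter(zip(statuses, statuses[1:]))
    return [[pairs[(src, dst)] for dst in CLASSES] for src in CLASSES]


codes = ["2xx", "2xx", "2xx", "3xx", "3xx", "2xx"]
matrix = transition_matrix(codes)
# Diagonal cells count stable periods; off-diagonal cells count changes.
stable = sum(matrix[i][i] for i in range(len(CLASSES)))
```

For a sequence of n observations the cells sum to n - 1, so comparing the diagonal total against that sum gives a quick sense of how stable the page has been.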
  • 16. TrendMachine: Compare First and Last Mementos
  • 17. TrendMachine: Live Web Page With Headers
  • 18. Potential Use Cases
      ● Detect points of interest in a large TimeMap
      ● Sample captures/mementos from TimeMaps for visual summarization
      ● Detect archival sinks (like login pages, paywalls, and misconfigured redirects)
      ● Detect poor-quality pages like Soft-404 and parked domains
      ● Detect potential link-rot (and fix it when possible, like in a wiki page)
      ● Optimize crawl jobs by minimizing wasteful downloads and maximizing coverage
      ● Archival quality assurance
      ● Cluster pages of a large archival collection in different categories
  • 19. Future Work
      ● Report heuristics-based archival summary by combining various scores
      ● Report/embed captures/mementos that can be points of interest
      ● Calculate Fixity using less-sensitive digests (e.g., SimHash)
      ● Calculate Chaos after applying convolutions to smooth out alternate changes
      ● Allow alternate web page health models (not just Sigmoid functions)
      ● Deploy in production by integrating with Wayback Machine
  • 20. Summary
      Code: https://github.com/internetarchive/trendmachine
      Demo: https://trendmachine.sawood-dev.us.archive.org/
      ● A mathematical model to quantify the temporal health of a web page
      ● Reports on Resilience, Fixity, Chaos, Distributions, Transitions, etc.
      ● An interactive portal with configuration options for experiments
      ● An evolving open-source codebase and demo deployment