TrendMachine: Temporal Resilience of Web Pages
IIPC Web Archiving Conference (WAC), May 03, 2023, Online
Sawood Alam (Internet Archive)
Mark Graham (Internet Archive)
Kritika Garg (Old Dominion University)
Michele C. Weigle (Old Dominion University)
Michael L. Nelson (Old Dominion University)
Dietrich Ayala (Protocol Labs)
@WaybackMachine @WebSciDL @ProtocolLabs
Supported in part by Protocol Labs and Filecoin Foundation
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 2
Research Question
How healthy has a web page been throughout its lifetime?
Temporal and Spatial Landscape of Archival Analysis
Long Duration, Single Webpage
● TMVis
● Wayback Machine Changes
● TrendMachine
Long Duration, Webpage Collection
● MementoMap
● CDX Summary
● Archives Unleashed Toolkit
Short Duration, Single Webpage
● Memento Damage
● Archival ACID Test
● Reconstructive
Short Duration, Webpage Collection
● Warrick
● Wayback Machine Downloader
● Video Archiving Insights
Modeling Web Page Health: Linear vs. S-Curve
Sigmoid Function for Web Page Resilience
Spread: How far up or down the value can go from its starting position?
Shift: How soon any significant change in the value can begin?
Slope: How quickly the value reaches close to the maximum change?
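As a rough illustration of how the three parameters shape an S-curve, here is a minimal sketch based on a logistic function. The function name `health_curve`, its signature, and its default values are assumptions made for this example and are not taken from the TrendMachine codebase.

```python
import math

def health_curve(x, start=0.5, spread=0.5, shift=5.0, slope=1.0):
    """Logistic S-curve that moves from `start` towards `start + spread`.

    spread: total change from the starting value (negative to go down)
    shift:  the x position at which half of the change has happened
    slope:  steepness of the transition around `shift`
    """
    return start + spread / (1.0 + math.exp(-slope * (x - shift)))

for x in (0, 5, 10):
    print(x, round(health_curve(x), 3))
```

A larger `shift` delays the onset of change, and a larger `slope` compresses the transition into fewer steps.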
TrendMachine: Composite Sigmoid Parameters of Resilience
TrendMachine: Overview
Code: https://github.com/internetarchive/trendmachine
Demo: https://trendmachine.sawood-dev.us.archive.org/
TrendMachine: Temporal Distribution of Archiving Activities
The page is archived as few as one or zero times and as many as tens of thousands of times in a single day.
Specimen Selection Algorithm
PRIORITY = ["2xx", "4xx", "5xx", "3xx"]
FOREACH st OF PRIORITY
    IF st IN statuses(day)
        specimen = statuses(day).match(st)[0]
        BREAK
DAY1 DAY2 DAY3 DAY4
4xx 3xx 5xx 3xx
3xx 3xx 3xx 5xx
2xx 3xx 5xx 3xx
5xx 4xx 5xx
2xx 4xx
A 3xx specimen usually suggests that the URL is redirecting to somewhere other than a variation of the same URL.
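The selection pseudocode above could be written in Python roughly as follows. This is a sketch only: the helper names `status_class` and `select_specimen`, and the choice to represent a day as a list of status codes in capture order, are assumptions for illustration.

```python
PRIORITY = ["2xx", "4xx", "5xx", "3xx"]

def status_class(code):
    """Map a status code such as 404 or "404" to its class, e.g. "4xx"."""
    return f"{str(code)[0]}xx"

def select_specimen(day_statuses):
    """Return the first capture of the highest-priority class seen that day."""
    for wanted in PRIORITY:
        for code in day_statuses:
            if status_class(code) == wanted:
                return code
    return None  # no captures that day, or only classes outside PRIORITY

print(select_specimen([404, 301, 200, 503, 200]))  # 200
print(select_specimen([301, 302]))                 # 301
```

Because 3xx is last in the priority list, a redirect becomes the specimen only when the day has no 2xx, 4xx, or 5xx capture.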
Filling Missing Observations
Policy DAY1 DAY2 DAY3 DAY4 DAY5 DAY6
Identical 2xx 2xx 2xx 4xx 2xx
Closest 2xx 2xx 2xx 4xx 4xx 2xx
Forward 2xx 2xx 2xx 2xx 4xx 2xx
Backward 2xx 2xx 4xx 4xx 4xx 2xx
ANY 2xx 2xx
Do not fill the gap if the status codes before and after are not identical.
Do not fill the gap if it is larger than a configured threshold.
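The filling policies can be sketched as below. The function name `fill_gaps`, the `max_gap` parameter, the use of `None` for a day without captures, and the tie-breaking rule for "closest" are all assumptions made for this example; the input sequence is likewise chosen for illustration.

```python
POLICIES = {"identical", "closest", "forward", "backward"}

def fill_gaps(days, policy="identical", max_gap=3):
    """Fill runs of None in `days` (one specimen per day) using `policy`.

    Leading and trailing gaps are left alone, as are gaps longer than
    `max_gap` days. Ties in "closest" go to the later observation, which
    is an arbitrary choice made for this sketch.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    filled = list(days)
    i, n = 0, len(days)
    while i < n:
        if days[i] is not None:
            i += 1
            continue
        j = i
        while j < n and days[j] is None:
            j += 1  # j is the first observed day after the gap (or n)
        if i > 0 and j < n and (j - i) <= max_gap:
            before, after = days[i - 1], days[j]
            for k in range(i, j):
                if policy == "forward":
                    filled[k] = before
                elif policy == "backward":
                    filled[k] = after
                elif policy == "closest":
                    filled[k] = before if (k - (i - 1)) < (j - k) else after
                elif before == after:  # "identical"
                    filled[k] = before
        i = j
    return filled

days = ["2xx", "2xx", None, None, "4xx", "2xx"]
for policy in ("identical", "closest", "forward", "backward"):
    print(policy, fill_gaps(days, policy))
```

With this input the "identical" policy leaves the gap open because the neighbors differ (2xx before, 4xx after), while the other three policies fill both missing days.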
TrendMachine: TimeMap Status Codes vs. Daily Specimens
Most of the self-redirect 3xx observations (HTTP/HTTPS or WWW/Apex domain) are eliminated in daily specimens.
About one third of the days since the first observation have no captures, some of which are filled using a filling policy.
TrendMachine: Resilience
● Resilience score is calculated using Sigmoid function on status codes of daily specimens
● Initial value of 0.5 and normalized between 0 and 1
● After the first few observations, the Wayback Machine did not archive the page (wikipedia.org) for several months in 2002
● Towards the end of 2002, Resilience score went up slowly due to infrequent archiving
● In 2003 “wikipedia.org” started to redirect to “en.wikipedia.org”
● After 2005, Resilience of the Wikipedia home page has mostly been stable and high
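One way to turn a series of daily specimens into a score that starts at 0.5 and stays between 0 and 1 is sketched below. The slides state only that a Sigmoid function is applied to the status codes of daily specimens; the update rule here (a fixed `step` up for 2xx, down for anything else, clamped to `limit`) and the name `resilience_series` are assumptions for illustration, not the TrendMachine implementation.

```python
import math

def resilience_series(specimens, step=0.5, limit=6.0):
    """Return one score in (0, 1) per day from daily status classes.

    Assumed update rule: a 2xx day moves the position up by `step`, any
    other observed class moves it down, and a missing day (None) leaves
    it unchanged. The position is clamped to [-limit, limit] so that a
    long stable run cannot make the score unresponsive to later changes.
    """
    position = 0.0  # sigmoid(0) == 0.5, the initial value
    scores = []
    for status in specimens:
        if status is not None:
            position += step if status == "2xx" else -step
            position = max(-limit, min(limit, position))
        scores.append(1.0 / (1.0 + math.exp(-position)))
    return scores

history = ["2xx", "2xx", "2xx", None, "3xx", "3xx", "2xx"]
print([round(s, 3) for s in resilience_series(history)])
```

Treating 3xx specimens as unhealthy is also an assumption here; it follows the earlier note that a 3xx specimen usually means the URL redirects somewhere other than a variation of itself.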
TrendMachine: Fixity
● Fixity score (normalized) is calculated using Sigmoid function on content digests of daily specimens
● Content digest reported in CDX can be sensitive to Content-Encoding, resulting in false alarms, even when the underlying content remains unchanged
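The Content-Encoding sensitivity can be reproduced with a small experiment. This sketch assumes the digest is a Base32-encoded SHA-1 computed over the bytes as transferred; the helper name `sha1_base32` and the sample payload are made up for the example.

```python
import base64
import gzip
import hashlib

def sha1_base32(payload: bytes) -> str:
    """Base32-encoded SHA-1 of a payload."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")

body = b"<html><body>Hello, archive!</body></html>"
gzipped = gzip.compress(body, mtime=0)  # fixed mtime keeps the bytes reproducible

print(sha1_base32(body))
print(sha1_base32(gzipped))               # differs from the line above
print(gzip.decompress(gzipped) == body)   # True
```

The decoded content is identical in both cases, yet the two digests differ, which a digest-based Fixity score would register as a change.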
TrendMachine: Chaos
● Chaos score (normalized) is calculated using a Run-Length Encoding inspired technique on all status codes of the CDX data in which consecutive duplicates are removed in the numerator
● An alternate sliding-window calculation is performed on the last N observations as the score becomes insensitive to recent changes on large TimeMaps
● A high Chaos along with a high Resilience is often an indication of canonical redirects (e.g., adoption of HTTPS and/or consolidation of WWW and Apex domain)
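$$\text{Chaos} = \frac{|\,\text{2xx},\ \text{3xx},\ \text{2xx}\,|}{|\,\text{2xx},\ \text{2xx},\ \text{2xx},\ \text{3xx},\ \text{3xx},\ \text{2xx}\,|} = \frac{3}{6} = 0.5$$
(The numerator is the same sequence with consecutive duplicates removed.)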
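The run-counting idea behind the worked example can be sketched as follows. The function name `chaos` and the `window` parameter are assumptions for illustration, and the normalization mentioned on the slide is not specified there, so it is left out.

```python
from itertools import groupby

def chaos(statuses, window=None):
    """Number of runs of identical statuses divided by total observations.

    If `window` is given, only the last `window` observations are used.
    """
    if window is not None:
        statuses = statuses[-window:]
    if not statuses:
        return 0.0
    runs = sum(1 for _ in groupby(statuses))  # one group per run
    return runs / len(statuses)

timemap = ["2xx", "2xx", "2xx", "3xx", "3xx", "2xx"]
print(chaos(timemap))            # 0.5, as in the worked example
print(chaos(timemap, window=3))  # uses only 3xx, 3xx, 2xx
```

On a long TimeMap the denominator dominates, so a handful of recent flips barely moves the full-history score; restricting the calculation to the last N observations keeps it responsive.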
TrendMachine: Status Code Transitions
● Large numbers along the major diagonal indicate status code stability for extended periods of time
● Large numbers in non-diagonal cells suggest frequent changes in Resilience curve
● Web pages with high Resilience score for extended periods usually exhibit large numbers in the top-left cell (2xx -> 2xx)
● A large number in the 3xx -> 3xx cell usually indicates extended periods of redirection to other URLs (e.g., URL restructuring, login wall, domain change, and parked domain)
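A transition matrix of this kind can be built by counting pairs of consecutive specimens. In this sketch the names `transition_counts` and `as_matrix`, the row and column order beyond the 2xx-first cell, and the decision to skip days without a specimen are assumptions for illustration.

```python
from collections import Counter

CLASSES = ["2xx", "3xx", "4xx", "5xx"]

def transition_counts(specimens):
    """Count transitions between consecutive daily specimens."""
    observed = [s for s in specimens if s is not None]  # skip empty days
    return Counter(zip(observed, observed[1:]))

def as_matrix(counts):
    """Rows are the source class, columns the destination class."""
    return [[counts[(src, dst)] for dst in CLASSES] for src in CLASSES]

specimens = ["2xx", "2xx", "2xx", "3xx", "3xx", "2xx"]
for src, row in zip(CLASSES, as_matrix(transition_counts(specimens))):
    print(src, row)
```

Counts on the diagonal correspond to days on which the status class stayed the same, and off-diagonal counts correspond to changes.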
TrendMachine: Compare First and Last Mementos
TrendMachine: Live Web Page With Headers
Potential Use Cases
● Detect points of interest in a large TimeMap
● Sample captures/mementos from TimeMaps for visual summarization
● Detect archival sinks (like login pages, paywalls, and misconfigured redirects)
● Detect poor-quality pages like Soft-404 and parked domains
● Detect potential link rot (and fix the affected links when possible, as in a wiki page)
● Optimize crawl jobs by minimizing wasteful downloads and maximizing coverage
● Archival quality assurance
● Cluster pages of a large archival collection in different categories
Future Work
● Report heuristics-based archival summary by combining various scores
● Report/embed captures/mementos that can be points of interest
● Calculate Fixity using less-sensitive digests (e.g., SimHash)
● Calculate Chaos after applying convolutions to smooth out alternate changes
● Allow alternate web page health models (not just Sigmoid functions)
● Deploy in production by integrating with Wayback Machine
Summary
Code: https://github.com/internetarchive/trendmachine
Demo: https://trendmachine.sawood-dev.us.archive.org/
● A mathematical model to quantify temporal health of a web page
● Reports on Resilience, Fixity, Chaos, Distributions, Transitions, etc.
● An interactive portal with configuration options for experiments
● An evolving open-source codebase and demo deployment

TrendMachine: Temporal Resilience of Web Pages

  • 1. TrendMachine: Temporal Resilience of Web Pages. @WaybackMachine. IIPC Web Archiving Conference (WAC), May 03, 2023, Online. Authors: Sawood Alam (Internet Archive), Mark Graham (Internet Archive), Kritika Garg (Old Dominion University), Michele C. Weigle (Old Dominion University), Michael L. Nelson (Old Dominion University), Dietrich Ayala (Protocol Labs). @WebSciDL @ProtocolLabs. Supported in part by Protocol Labs and Filecoin Foundation.
  • 2. Research Question: How healthy has a web page been throughout its lifetime?
  • 3. Temporal and Spatial Landscape of Archival Analysis. Axes: Long Duration vs. Short Duration; Single Webpage vs. Webpage Collection. Tools plotted: TMVis, Wayback Machine Changes, TrendMachine, MementoMap, CDX Summary, Archives Unleashed Toolkit, Memento Damage, Archival ACID Test, Reconstructive, Warrick, Wayback Machine Downloader, Video Archiving Insights.
  • 4. Modeling Web Page Health: Linear vs. S-Curve
  • 5. Sigmoid Function for Web Page Resilience
      ● Spread: How far up or down can the value go from its starting position?
      ● Shift: How soon can any significant change in the value begin?
      ● Slope: How quickly does the value get close to the maximum change?
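The three parameters on slide 5 can be illustrated with a generic logistic curve. The sketch below is only an assumption about how Spread, Shift, and Slope might map onto such a curve; the slides do not give TrendMachine's exact formula, and the function name and default values are hypothetical.

```python
import math

def sigmoid(x, spread=1.0, shift=0.0, slope=1.0):
    """Generic logistic curve (illustrative, not TrendMachine's own formula).

    spread: total vertical range the value can cover
    shift:  x position of the midpoint, where change is fastest
    slope:  steepness of the curve around the midpoint
    """
    return spread / (1.0 + math.exp(-slope * (x - shift)))

# Compare a gentle curve with a steeper one at a few sample points.
for x in (-4, 0, 4):
    print(x, round(sigmoid(x), 3), round(sigmoid(x, slope=3.0), 3))
```

With the defaults, the curve passes through 0.5 at x = 0; a larger slope pushes values away from the midpoint toward the extremes more quickly.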
  • 6. TrendMachine: Composite Sigmoid Parameters of Resilience
  • 7. TrendMachine: Overview. Code: https://github.com/internetarchive/trendmachine Demo: https://trendmachine.sawood-dev.us.archive.org/
  • 8. TrendMachine: Temporal Distribution of Archiving Activities. A page may be archived as few as zero or one times, or as many as tens of thousands of times, in a single day.
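The daily distribution described on slide 8 can be derived by grouping capture timestamps by calendar day. This is a minimal sketch that assumes 14-digit `YYYYMMDDhhmmss` timestamps as found in CDX records; the function name and sample values are made up for illustration.

```python
from collections import Counter

def captures_per_day(timestamps):
    """Count captures per calendar day from 14-digit CDX-style timestamps."""
    # The first eight digits (YYYYMMDD) identify the day.
    return Counter(ts[:8] for ts in timestamps)

# Made-up timestamps for illustration only.
sample = ["20230501000000", "20230501120000", "20230503093000"]
counts = captures_per_day(sample)
print(counts["20230501"], counts["20230502"], counts["20230503"])
```

Days without captures are simply absent from the counter and read as zero.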
  • 9. Specimen Selection Algorithm
      PRIORITY = ["2xx", "4xx", "5xx", "3xx"]
      FOREACH st OF PRIORITY
        IF st IN statuses(day)
          specimen = statuses(day).match(st)[0]
          BREAK
      [Figure: example status codes captured on DAY1 through DAY4]
      A 3xx specimen usually suggests that the URL is redirecting to somewhere other than a variation of the same URL.
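The pseudocode on slide 9 can be sketched in Python as follows. The priority order is taken from the slide; the helper names and the choice of numeric status codes as input are assumptions made for this example.

```python
PRIORITY = ["2xx", "4xx", "5xx", "3xx"]  # order taken from the slide

def status_class(code):
    """Map a numeric HTTP status such as 301 to a class label such as '3xx'."""
    return f"{int(code) // 100}xx"

def select_specimen(day_statuses):
    """Return the first capture of the highest-priority class seen that day."""
    for wanted in PRIORITY:
        for code in day_statuses:
            if status_class(code) == wanted:
                return code
    return None  # no capture with a 2xx to 5xx status on this day

print(select_specimen([301, 503, 200, 302]))
print(select_specimen([301, 302]))
```

Because 3xx has the lowest priority, a redirect is chosen as the daily specimen only when every capture that day is a redirect.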
  • 10. Filling Missing Observations
      Rows list the status codes shown for DAY1 to DAY6 under each policy; empty cells are omitted.
      ● Identical: 2xx 2xx 2xx 4xx 2xx
      ● Closest: 2xx 2xx 2xx 4xx 4xx 2xx
      ● Forward: 2xx 2xx 2xx 2xx 4xx 2xx
      ● Backward: 2xx 2xx 4xx 4xx 4xx 2xx
      ● ANY: 2xx 2xx
      Do not fill the gap if the status codes before and after are not identical. Do not fill the gap if it is larger than a configured threshold.
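The filling policies on slide 10 might be implemented along these lines. This is a sketch under stated assumptions: the function name, the `max_gap` parameter, the handling of leading and trailing gaps (left unfilled), and the tie-break for Closest (the earlier observation wins) are not given on the slide. The example input is likewise an assumption, chosen so that the Closest, Forward, and Backward results agree with the rows shown on the slide.

```python
def fill_gaps(days, policy="identical", max_gap=None):
    """Fill None entries that lie between two observed days."""
    if policy not in {"identical", "closest", "forward", "backward"}:
        raise ValueError(f"unknown policy: {policy}")
    filled = list(days)
    n = len(days)
    i = 0
    while i < n:
        if days[i] is not None:
            i += 1
            continue
        start = i
        while i < n and days[i] is None:
            i += 1
        end = i  # the gap covers days[start:end]
        if start == 0 or end == n:
            continue  # leading or trailing gap: left unfilled (assumption)
        if max_gap is not None and end - start > max_gap:
            continue  # gap exceeds the configured threshold
        before, after = days[start - 1], days[end]
        if policy == "identical" and before != after:
            continue  # neighbours disagree, so the gap stays empty
        for k in range(start, end):
            if policy == "forward":
                filled[k] = before
            elif policy == "backward":
                filled[k] = after
            elif policy == "closest":
                # Distance to each neighbour; ties go to the earlier one.
                filled[k] = before if (k - start + 1) <= (end - k) else after
            else:  # identical, and both neighbours agree
                filled[k] = before
    return filled

observed = ["2xx", "2xx", None, None, "4xx", "2xx"]  # assumed example input
for name in ("identical", "closest", "forward", "backward"):
    print(name, fill_gaps(observed, name))
```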
  • 11. TrendMachine: TimeMap Status Codes vs. Daily Specimens. Most of the self-redirect 3xx observations (HTTP/HTTPS or WWW/Apex domain) are eliminated in daily specimens. About one third of the days since the first observation have no captures, some of which are filled using a filling policy.
  • 12. TrendMachine: Resilience
      ● The Resilience score is calculated by applying a Sigmoid function to the status codes of daily specimens
      ● It starts at an initial value of 0.5 and is normalized between 0 and 1
      ● After the first few observations, the Wayback Machine did not archive the page for several months in 2002
      ● Towards the end of 2002, the Resilience score went up slowly due to infrequent archiving
      ● In 2003 “wikipedia.org” started to redirect to “en.wikipedia.org”
      ● After 2005, Resilience of the Wikipedia home page has mostly been stable and high
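One way to picture a score that starts at 0.5 and stays between 0 and 1 is to walk a position along a logistic curve, one step per daily specimen. The update rule below is purely hypothetical: treating 2xx as healthy, every other status as unhealthy, and unfilled days as neutral is an assumption, as are the function name and the `step` parameter. The slides do not specify how TrendMachine updates the score.

```python
import math

def resilience_curve(specimens, step=0.1):
    """Return one score per day, starting from 0.5 (illustrative only)."""
    x = 0.0  # position on the curve; the logistic value at 0 is 0.5
    scores = []
    for status in specimens:
        if status == "2xx":
            x += step  # assumed: a healthy day moves the score up
        elif status is not None:
            x -= step  # assumed: any other status moves it down
        # None (an unfilled day) leaves the position unchanged.
        scores.append(1.0 / (1.0 + math.exp(-x)))
    return scores

print([round(s, 3) for s in resilience_curve(["2xx", "2xx", None, "3xx", "2xx"])])
```

Under this rule, days with no specimen leave the score unchanged, so sparse archiving makes the score move slowly.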
  • 13. TrendMachine: Fixity
      ● Fixity score (normalized) is calculated using a Sigmoid function on the content digests of daily specimens
      ● The content digest reported in CDX can be sensitive to Content-Encoding, resulting in false alarms even when the underlying content remains unchanged
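The Content-Encoding sensitivity noted above can be reproduced with a small experiment: hashing the same body once as-is and once gzip-compressed yields different digests. The sketch assumes the digest is a Base32-encoded SHA-1 of the stored payload bytes, which is the form commonly seen in CDX digest fields; the helper name and sample body are hypothetical.

```python
import base64
import gzip
import hashlib

def cdx_style_digest(payload: bytes) -> str:
    """Base32-encoded SHA-1 of the given bytes (assumed CDX digest form)."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")

body = b"<html><body>Hello, archive!</body></html>"
identity_digest = cdx_style_digest(body)
gzip_digest = cdx_style_digest(gzip.compress(body))

# Same content, different encodings, different digests.
print(identity_digest == gzip_digest)
# Decoding before hashing restores agreement.
print(cdx_style_digest(gzip.decompress(gzip.compress(body))) == identity_digest)
```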
  • 14. TrendMachine: Chaos
      ● Chaos score (normalized) is calculated using a Run-Length Encoding inspired technique on all status codes of the CDX data, in which consecutive duplicates are removed in the numerator
      ● An alternate sliding-window calculation is performed on the last N observations, as the score becomes insensitive to recent changes on large TimeMaps
      ● A high Chaos along with a high Resilience is often an indication of canonical redirects (e.g., adoption of HTTPS and/or consolidation of WWW and Apex domain)
      Chaos = |2xx, 3xx, 2xx| / |2xx, 2xx, 2xx, 3xx, 3xx, 2xx| = 3 / 6 = 0.5
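The ratio on slide 14 can be computed by counting runs of identical status codes and dividing by the total number of observations. This is a minimal sketch; the function name and the `window` parameter used for the sliding-window variant are assumptions.

```python
from itertools import groupby

def chaos(statuses, window=None):
    """Number of runs of identical statuses divided by the number of statuses."""
    if window is not None:
        statuses = statuses[-window:]  # keep only the last `window` observations
    if not statuses:
        return 0.0
    runs = sum(1 for _ in groupby(statuses))  # consecutive duplicates collapse
    return runs / len(statuses)

example = ["2xx", "2xx", "2xx", "3xx", "3xx", "2xx"]
print(chaos(example))            # the example from the slide
print(chaos(example, window=3))  # only the last three observations
```

A sequence that alternates on every observation scores 1.0, while a long constant sequence approaches 0.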
  • 15. TrendMachine: Status Code Transitions
      ● Large numbers along the major diagonal indicate status code stability for extended periods of time
      ● Large numbers in non-diagonal cells suggest frequent changes in the Resilience curve
      ● Web pages with a high Resilience score for extended periods usually exhibit large numbers in the top-left cell (2xx -> 2xx)
      ● A large number in the 3xx -> 3xx cell usually indicates extended periods of redirection to other URLs (e.g., URL restructuring, login wall, domain change, and parked domain)
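A transition matrix like the one described above can be built by counting adjacent pairs in a status sequence. The sketch below assumes rows are the "from" class and columns the "to" class, ordered so that 2xx -> 2xx lands in the top-left cell; whether TrendMachine counts transitions over daily specimens or over all captures is not stated on the slide, and the names here are hypothetical.

```python
from collections import Counter

CLASSES = ["2xx", "3xx", "4xx", "5xx"]  # row and column order (assumed)

def transition_matrix(statuses):
    """Count transitions between consecutive status classes."""
    counts = Counter(zip(statuses, statuses[1:]))
    # Rows are the "from" class, columns are the "to" class.
    return [[counts[(a, b)] for b in CLASSES] for a in CLASSES]

matrix = transition_matrix(["2xx", "2xx", "2xx", "3xx", "3xx", "2xx"])
for label, row in zip(CLASSES, matrix):
    print(label, row)
```

The diagonal cells count days on which the status class stayed the same, and the off-diagonal cells count changes.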
  • 16. TrendMachine: Compare First and Last Mementos
  • 17. TrendMachine: Live Web Page With Headers
  • 18. Potential Use Cases
      ● Detect points of interest in a large TimeMap
      ● Sample captures/mementos from TimeMaps for visual summarization
      ● Detect archival sinks (like login pages, paywalls, and misconfigured redirects)
      ● Detect poor-quality pages like Soft-404 and parked domains
      ● Detect potential link rot (and fix it when possible, as in a wiki page)
      ● Optimize crawl jobs by minimizing wasteful downloads and maximizing coverage
      ● Archival quality assurance
      ● Cluster pages of a large archival collection into different categories
  • 19. Future Work
      ● Report heuristics-based archival summary by combining various scores
      ● Report/embed captures/mementos that can be points of interest
      ● Calculate Fixity using less-sensitive digests (e.g., SimHash)
      ● Calculate Chaos after applying convolutions to smooth out alternate changes
      ● Allow alternate web page health models (not just Sigmoid functions)
      ● Deploy in production by integrating with Wayback Machine
  • 20. Summary. Code: https://github.com/internetarchive/trendmachine Demo: https://trendmachine.sawood-dev.us.archive.org/
      ● A mathematical model to quantify the temporal health of a web page
      ● Reports on Resilience, Fixity, Chaos, Distributions, Transitions, etc.
      ● An interactive portal with configuration options for experiments
      ● An evolving open-source codebase and demo deployment