The document describes TrendMachine, a tool for analyzing the temporal resilience of web pages over time. It uses sigmoid functions to model metrics like resilience, fixity, and chaos based on archival data for a page. TrendMachine generates interactive reports on the health, status code transitions, and archiving activity for a page throughout its lifetime. The tool is open source and its developers hope to integrate it more fully with the Wayback Machine to help prioritize crawls and identify archival issues.
Application of bradford's law of scattering to the literature of stellar Ghouse Modin Mamdapur
The present paper tests one of the important bibliometric laws of Bradford's Law of scattering for the literature related to ‘stellar physics’ for the period 1988–2013 as available in the Web of Science Core Collection database. A total of 2738 articles related to Stellar Physics published in journals in English language during the study period are retrieved. Data are analysed with respect to year-wise growth of articles, relative growth rate and doubling time of literature. The 2738 articles are scattered in 188 journals. A list of ranked journals was prepared and it was found that the Astrophysical Journal with 895 articles is the most productive journal publishing Stellar Physics literature followed by Monthly Notices of the Royal Astronomical Society with 507 articles and Astronomy and Astrophysics with 380 articles. In this study, theoretical aspects of Bradford's Law of Scattering are tested and found that the data do not fit to the present sample. The Leimkuhler model is tested and found to fit the data for the Bradford Multiplier (k) at 11.65. The Bradford law is also tested through graphical formulation by drawing the Bradford bibliograph and is found to confirm all the three characteristics.
Slides: Taking an Active Approach to Data GovernanceDATAVERSITY
A Look at How Riot Games Implemented Non-Invasive Data Governance
Riot Games created and runs “League of Legends,” the world’s most-played PC game and most viewed eSport — and is now transforming to become a multi-title publisher. To keep pace with this transformation and support a growing player base of millions, Riot Games is taking a page from Bob Seiner’s book, “Non-Invasive Data Governance: The Path of Least Resistance and Greatest Success” and leveraging the Alation Data Catalog to help guide accurate, well-governed analysis.
Bob Seiner will join Riot Games’ Chris Kudelka, Technical Product Manager, and Michael Leslie, Senior Data Governance Architect, and Alation’s John Wills, VP of Professional Service, for an inside look at Data Governance at one of the world’s leading gaming companies.
Join this webinar to learn:
• How Riot Games is implementing Non-Invasive Data Governance
• How this new approach to Data Governance helps to drive the business
• How the Alation Data Catalog helps Riot Games create the foundation for guiding accurate, well-governed data use
Introduction to Persistent Identifiers| www.eudat.eu | EUDAT
| www.eudat.eu | What are persistent identifiers? Why use persistent identifiers? Different persistent identifier systems; The HANDLE system; EPIC PID system; Policies; Use cases
Ver 2 July 2017
Application of bradford's law of scattering to the literature of stellar Ghouse Modin Mamdapur
The present paper tests one of the important bibliometric laws of Bradford's Law of scattering for the literature related to ‘stellar physics’ for the period 1988–2013 as available in the Web of Science Core Collection database. A total of 2738 articles related to Stellar Physics published in journals in English language during the study period are retrieved. Data are analysed with respect to year-wise growth of articles, relative growth rate and doubling time of literature. The 2738 articles are scattered in 188 journals. A list of ranked journals was prepared and it was found that the Astrophysical Journal with 895 articles is the most productive journal publishing Stellar Physics literature followed by Monthly Notices of the Royal Astronomical Society with 507 articles and Astronomy and Astrophysics with 380 articles. In this study, theoretical aspects of Bradford's Law of Scattering are tested and found that the data do not fit to the present sample. The Leimkuhler model is tested and found to fit the data for the Bradford Multiplier (k) at 11.65. The Bradford law is also tested through graphical formulation by drawing the Bradford bibliograph and is found to confirm all the three characteristics.
Slides: Taking an Active Approach to Data GovernanceDATAVERSITY
A Look at How Riot Games Implemented Non-Invasive Data Governance
Riot Games created and runs “League of Legends,” the world’s most-played PC game and most viewed eSport — and is now transforming to become a multi-title publisher. To keep pace with this transformation and support a growing player base of millions, Riot Games is taking a page from Bob Seiner’s book, “Non-Invasive Data Governance: The Path of Least Resistance and Greatest Success” and leveraging the Alation Data Catalog to help guide accurate, well-governed analysis.
Bob Seiner will join Riot Games’ Chris Kudelka, Technical Product Manager, and Michael Leslie, Senior Data Governance Architect, and Alation’s John Wills, VP of Professional Service, for an inside look at Data Governance at one of the world’s leading gaming companies.
Join this webinar to learn:
• How Riot Games is implementing Non-Invasive Data Governance
• How this new approach to Data Governance helps to drive the business
• How the Alation Data Catalog helps Riot Games create the foundation for guiding accurate, well-governed data use
Introduction to Persistent Identifiers| www.eudat.eu | EUDAT
| www.eudat.eu | What are persistent identifiers? Why use persistent identifiers? Different persistent identifier systems; The HANDLE system; EPIC PID system; Policies; Use cases
Ver 2 July 2017
Building Cloud-Native Applications in MiCADO - MiCADO webinar No.2/4 - 09/2019Project COLA
2/4 Webinar: How to Automate Deployment and Orchestration of Application (MiCADO introduction)
This part of the webinar provides information on how to develop cloud-native applications in MiCADO. It was presented by Jay DesLauriers (University of Westminster). The webinar took place on the 26th of September 2019. If you would like to have more information visit: https://micado-scale.eu
MiCADO is open-source and a highly customisable multi-cloud orchestration and auto-scaling framework for Docker containers, orchestrated by Kubernetes.
Developed by Project COLA funded by the European Commission (grant agreement no: 731574). https://project-cola.eu
WSA: Scaling Web Service to Handle Millions of Requests per SecondWebStackAcademy
In this presentation of Venkata Ramana Gollamudi, Principal Architect at Flipkart, as our esteemed speaker. With his expertise and experience, he'll guide us through the intricate process of “Scaling web service to handle millions of requests per second”.
Key Takeaways:
Understanding the specific challenges of scaling for Indian users
Strategies to ensure service availability, reduce latency, and maintain reliability
Leveraging tools like ELBs, K8s, message queues, and NoSQL databases effectively
Don't miss this unparalleled opportunity to gain invaluable insights and elevate your expertise in scaling web services to new heights.
We eagerly await your participation in this transformative webinar. Let's embark on the journey to mastering scalability together!
Website availability and performance have a direct business impact. Find out what kind of configurations can make WordPress sites scalable, highly available, bulletproof against any downtimes or malware attacks. Avoid a bad user experience caused by the downtime with a help of pre-packaged WordPress cluster that includes automated server scaling and database replication, built-in HTTP/3 ready CDN and SSL, integrated WAF and layer-7 anti-DDoS filtering, as well as a set of other features required for high availability and security of your sites.
More details based on the presentation are covered in the webinar https://youtu.be/NPyx2VBbUos
WordPress Cluster installation guide https://jelastic.com/blog/wordpress-hosting-enterprise-high-availability-auto-scaling/
WordPress Standalone Installation Guide https://jelastic.com/blog/wordpress-hosting-standalone-container/
How to migrate to Jelastic WordPress hosting https://jelastic.com/blog/migrate-wordpress-site/
Send a request to get access to Jelastic WordPress cluster https://jelastic.com/managed-auto-scalable-clusters-for-business/#wordpress
In this tutorial we will take Mediawiki, one of the most popular open-source wiki applications, and we will use it to discuss schema and scalability design problems we suffered on Wikipedia. We will discuss how they were solved (or how we plan to solve them).
Forget about basic normalization theory or performance promises made by vendors in an ideal world. All topics discussed will be based on *actual problems* found when trying to handle the infrastructure of one of the top 10 most popular websites.
* Target audience: Developers using MySQL from any programming language; or system administrators, devops and architects in charge of the data model of its application
* Requirements: Basic SQL. Ability to read PHP. Basic familiarity with how wikis/Wikipedia works.
* Session dynamic: You are expected to contribute actively- it will not be a lecture format
* Topics (practical cases):
- Case #0: Pages and revisions
- Case #2: Supporting 290 languages
- Case #3: An abnormal denormalization
- Case #4: Key-value system
- Case #5: Revisions and deletions
- Case #6: What's links here
- Case #7: A large table
- Case #8: Anecdotes: The ghost tables and Timestamps
- Case #9: Slots
https://www.percona.com/live/plam16/sessions/mysql-schema-design-practice
Why is this ASP.NET web app running slowly?Mark Friedman
This presentation attempts to make assumptions used in popular web performance tools like YSlow and webpagetest explicit. It also looks at the NavTiming API and explores ways to capture RUM measurements and correlate them with server-side metrics.
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...DataStax Academy
MetricsHub is a monitoring and scalability service for public clouds, allowing companies to continuously gather data from their systems and auto-scale their deployments to optimize service costs. Taking advantage of Cassandra rapid ingestion rates, reliable replication model, and easiness of deployment, Metrics Hub can handle billions of datapoints per day. During this session, you will learn about the architecture supporting this service, which combines the power of the PaaS + IaaS on the Windows Azure platform.
Docker в последняя время набрал огромную популярность как инструмент разработчиков и DevOps-специалистов, но все ещё не так активно используется для автоматизированного тестирования. Во время воркшопа я поделюсь несколькими сценариями, когда Docker может помочь автоматизировать то что ранее считалось непригодным к автоматизации. Также, мы попробуем создать свой собственный образ и запустить несколько контейнеров используя docker-compose.
Understanding the Top Four Use Cases for IoTVoltDB
Dheeraj Remella, Director of Solutions Architecture at VoltDB explains how to use real-time data to make more informed decisions, decrease event to action latency and limit downtime. He’ll discuss four in-production customer case studies and highlight how implementing VoltDB has dramatically increased competitive advantage for avoiding unplanned downtime, asset tracking, utilities - smart meter management, and fleet management
CDX Summary: Web Archival Collection InsightsSawood Alam
Large web archival collections are often opaque about their holdings. We created an open-source tool called, CDX Summary, to generate statistical reports based on URIs, hosts, TLDs, paths, query parameters, status codes, media types, date and time, etc. present in the CDX index of a collection of WARC files. Our tool also surfaces a configurable number of potentially good random memento samples from the collection for visual inspection, quality assurance, representative thumbnails generation, etc. The tool generates both human and machine readable reports with varying levels of details for different use cases. Furthermore, we implemented a Web Component that can render generated JSON summaries in HTML documents. Early exploration of CDX insights on Wayback Machine collections helped us improve our crawl operations.
Venue: TPDL 2022
Recording: https://www.youtube.com/watch?v=K5i3XShqW6A
Video Archiving and Playback in the Wayback MachineSawood Alam
At the Internet Archive (IA) we collect static and dynamic lists of seeds from various sources (like Save Page Now, Wikipedia EventStream, Cloudflare, etc.) for archiving. Some of these seeds include web pages with videos on them. Those URLs are curated based on certain criteria to identify potential videos that should be archived or excluded. Candidate video page URLs for archiving are placed in a queue (currently using Kafka) to be consumed by a separate process. We maintain a persistent database of videos we have already archived, which is used both for status tracking as well as a seen-check system to avoid duplicate downloads of large media files that usually do not change. We use youtube-dl (or one of its forks) to download videos and their metadata. We archive the container HTML page, associated video metadata, any transcriptions, thumbnails, and at least one of the many video files with different resolutions and formats. These pieces are stored in separate WARC records (some with “response” type and others as “metadata”). Some popular video streaming services do not have static links to embed video files, which makes it difficult to identify and serve video files corresponding to their container HTML pages on archival replay. To glue related pieces together for replay we are currently using a key-value store, but exploring ways to get away with an additional index. We are using a custom video player and perform necessary rewriting in the container HTML page for a more reliable video playback experience. We create a daily summary of metadata of videos that we have archived and load it in a custom-built Video Archiving Insights dashboard to identify any issues or biases, which are utilized as a feedback loop for quality assurance and to enhance our curation criteria and archiving strategies. We are always looking forward to ways to improve the system that works at scale as well as means to interoperate.
Recording: youtube.com/watch?v=6MiYKOq_DKo
More Related Content
Similar to TrendMachine: Temporal Resilience of Web Pages
Building Cloud-Native Applications in MiCADO - MiCADO webinar No.2/4 - 09/2019Project COLA
2/4 Webinar: How to Automate Deployment and Orchestration of Application (MiCADO introduction)
This part of the webinar provides information on how to develop cloud-native applications in MiCADO. It was presented by Jay DesLauriers (University of Westminster). The webinar took place on the 26th of September 2019. If you would like to have more information visit: https://micado-scale.eu
MiCADO is open-source and a highly customisable multi-cloud orchestration and auto-scaling framework for Docker containers, orchestrated by Kubernetes.
Developed by Project COLA funded by the European Commission (grant agreement no: 731574). https://project-cola.eu
WSA: Scaling Web Service to Handle Millions of Requests per SecondWebStackAcademy
In this presentation of Venkata Ramana Gollamudi, Principal Architect at Flipkart, as our esteemed speaker. With his expertise and experience, he'll guide us through the intricate process of “Scaling web service to handle millions of requests per second”.
Key Takeaways:
Understanding the specific challenges of scaling for Indian users
Strategies to ensure service availability, reduce latency, and maintain reliability
Leveraging tools like ELBs, K8s, message queues, and NoSQL databases effectively
Don't miss this unparalleled opportunity to gain invaluable insights and elevate your expertise in scaling web services to new heights.
We eagerly await your participation in this transformative webinar. Let's embark on the journey to mastering scalability together!
Website availability and performance have a direct business impact. Find out what kind of configurations can make WordPress sites scalable, highly available, bulletproof against any downtimes or malware attacks. Avoid a bad user experience caused by the downtime with a help of pre-packaged WordPress cluster that includes automated server scaling and database replication, built-in HTTP/3 ready CDN and SSL, integrated WAF and layer-7 anti-DDoS filtering, as well as a set of other features required for high availability and security of your sites.
More details based on the presentation are covered in the webinar https://youtu.be/NPyx2VBbUos
WordPress Cluster installation guide https://jelastic.com/blog/wordpress-hosting-enterprise-high-availability-auto-scaling/
WordPress Standalone Installation Guide https://jelastic.com/blog/wordpress-hosting-standalone-container/
How to migrate to Jelastic WordPress hosting https://jelastic.com/blog/migrate-wordpress-site/
Send a request to get access to Jelastic WordPress cluster https://jelastic.com/managed-auto-scalable-clusters-for-business/#wordpress
In this tutorial we will take Mediawiki, one of the most popular open-source wiki applications, and we will use it to discuss schema and scalability design problems we suffered on Wikipedia. We will discuss how they were solved (or how we plan to solve them).
Forget about basic normalization theory or performance promises made by vendors in an ideal world. All topics discussed will be based on *actual problems* found when trying to handle the infrastructure of one of the top 10 most popular websites.
* Target audience: Developers using MySQL from any programming language; or system administrators, devops and architects in charge of the data model of its application
* Requirements: Basic SQL. Ability to read PHP. Basic familiarity with how wikis/Wikipedia works.
* Session dynamic: You are expected to contribute actively- it will not be a lecture format
* Topics (practical cases):
- Case #0: Pages and revisions
- Case #2: Supporting 290 languages
- Case #3: An abnormal denormalization
- Case #4: Key-value system
- Case #5: Revisions and deletions
- Case #6: What's links here
- Case #7: A large table
- Case #8: Anecdotes: The ghost tables and Timestamps
- Case #9: Slots
https://www.percona.com/live/plam16/sessions/mysql-schema-design-practice
Why is this ASP.NET web app running slowly?Mark Friedman
This presentation attempts to make assumptions used in popular web performance tools like YSlow and webpagetest explicit. It also looks at the NavTiming API and explores ways to capture RUM measurements and correlate them with server-side metrics.
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...DataStax Academy
MetricsHub is a monitoring and scalability service for public clouds, allowing companies to continuously gather data from their systems and auto-scale their deployments to optimize service costs. Taking advantage of Cassandra rapid ingestion rates, reliable replication model, and easiness of deployment, Metrics Hub can handle billions of datapoints per day. During this session, you will learn about the architecture supporting this service, which combines the power of the PaaS + IaaS on the Windows Azure platform.
Docker в последняя время набрал огромную популярность как инструмент разработчиков и DevOps-специалистов, но все ещё не так активно используется для автоматизированного тестирования. Во время воркшопа я поделюсь несколькими сценариями, когда Docker может помочь автоматизировать то что ранее считалось непригодным к автоматизации. Также, мы попробуем создать свой собственный образ и запустить несколько контейнеров используя docker-compose.
Understanding the Top Four Use Cases for IoTVoltDB
Dheeraj Remella, Director of Solutions Architecture at VoltDB explains how to use real-time data to make more informed decisions, decrease event to action latency and limit downtime. He’ll discuss four in-production customer case studies and highlight how implementing VoltDB has dramatically increased competitive advantage for avoiding unplanned downtime, asset tracking, utilities - smart meter management, and fleet management
Similar to TrendMachine: Temporal Resilience of Web Pages (20)
CDX Summary: Web Archival Collection InsightsSawood Alam
Large web archival collections are often opaque about their holdings. We created an open-source tool called, CDX Summary, to generate statistical reports based on URIs, hosts, TLDs, paths, query parameters, status codes, media types, date and time, etc. present in the CDX index of a collection of WARC files. Our tool also surfaces a configurable number of potentially good random memento samples from the collection for visual inspection, quality assurance, representative thumbnails generation, etc. The tool generates both human and machine readable reports with varying levels of details for different use cases. Furthermore, we implemented a Web Component that can render generated JSON summaries in HTML documents. Early exploration of CDX insights on Wayback Machine collections helped us improve our crawl operations.
Venue: TPDL 2022
Recording: https://www.youtube.com/watch?v=K5i3XShqW6A
Video Archiving and Playback in the Wayback MachineSawood Alam
At the Internet Archive (IA) we collect static and dynamic lists of seeds from various sources (like Save Page Now, Wikipedia EventStream, Cloudflare, etc.) for archiving. Some of these seeds include web pages with videos on them. Those URLs are curated based on certain criteria to identify potential videos that should be archived or excluded. Candidate video page URLs for archiving are placed in a queue (currently using Kafka) to be consumed by a separate process. We maintain a persistent database of videos we have already archived, which is used both for status tracking as well as a seen-check system to avoid duplicate downloads of large media files that usually do not change. We use youtube-dl (or one of its forks) to download videos and their metadata. We archive the container HTML page, associated video metadata, any transcriptions, thumbnails, and at least one of the many video files with different resolutions and formats. These pieces are stored in separate WARC records (some with “response” type and others as “metadata”). Some popular video streaming services do not have static links to embed video files, which makes it difficult to identify and serve video files corresponding to their container HTML pages on archival replay. To glue related pieces together for replay we are currently using a key-value store, but exploring ways to get away with an additional index. We are using a custom video player and perform necessary rewriting in the container HTML page for a more reliable video playback experience. We create a daily summary of metadata of videos that we have archived and load it in a custom-built Video Archiving Insights dashboard to identify any issues or biases, which are utilized as a feedback loop for quality assurance and to enhance our curation criteria and archiving strategies. We are always looking forward to ways to improve the system that works at scale as well as means to interoperate.
Recording: youtube.com/watch?v=6MiYKOq_DKo
Profiling Web Archival Voids for Memento RoutingSawood Alam
Slides of the paper presentation, "Profiling Web Archival Voids for Memento Routing", for JCDL 2021.
Authors: Sawood Alam, Michele C. Weigle, Michael L. Nelson
Preprint: https://arxiv.org/abs/2108.03311
Recording: https://youtu.be/ImJWkndNoS8
Readying Web Archives to Consume and Leverage Web BundlesSawood Alam
Potential utilization of the emerging Web technology, Web Bundles, in Web archiving, presented at the IIPC WAC 2021 in Session 8 by Sawood Alam.
Recording: https://youtu.be/lQX9v9V0FRQ
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingSawood Alam
Topic: Doctoral Dissertation Defense
Title: MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
Student: Sawood Alam
University: Old Dominion University
Date: Friday, December 4, 2020
Supporting Web Archiving via Web PackagingSawood Alam
We describe challenges related to web archiving, replaying archived web resources, and verifying their authenticity. We show that Web Packaging has significant potential to help address these challenges and identify areas in which changes are needed in order to fully realize that potential.
Position Paper: https://arxiv.org/abs/1906.07104
Presented in the Internet Architecture Board's ESCAPE 2019 Workshop (Exploring Synergy between Content Aggregation and the Publisher Ecosystem)
https://www.iab.org/activities/workshops/escape-workshop/
MementoMap: An Archive Profile Dissemination FrameworkSawood Alam
We introduce MementoMap, a framework to express and disseminate holdings of web archives (archive profiles) by themselves or third parties. The framework allows arbitrary, flexible, and dynamic levels of details in its entries that fit the needs of archives of different scales. This enables Memento aggregators to significantly reduce wasted traffic to web archives.
Impact of HTTP Cookie Violations in Web ArchivesSawood Alam
Certain HTTP Cookies on certain sites can be a source of content bias in archival crawls. Accommodating Cookies at crawl time, but not utilizing them at replay time may cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store Cookies with short expiration time and archival replay systems account for values in the Vary header along with URIs.
Archive Assisted Archival Fixity Verification FrameworkSawood Alam
The number of public and private web archives has increased, and we implicitly trust content delivered by these archives. Fixity is checked to ensure an archived resource has remained unaltered since the time it was captured. Some web archives do not allow users to access fixity information and, more importantly, even if fixity information is available, it is provided by the same archive from which the archived resources are requested. In this research, we propose two approaches, namely Atomic and Block, to establish and check fixity of archived resources.
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingSawood Alam
In this work we propose MementoMap, a flexible and adaptive framework to efficiently summarize holdings of a web archive. We described a simple, yet extensible, file format suitable for MementoMap. We used the complete index of the arquivo.pt comprising 5B mementos (archived web pages/files) to understand the nature and shape of its holdings. We generated MementoMaps with varying amount of detail from its HTML pages that have an HTTP status code of 200 OK. Additionally, we designed a single-pass, memory-efficient, and parallelization-friendly algorithm to compact a large MementoMap into a small one and an in-file binary search method for efficient lookup. We analyzed more than three years of MemGator (a Memento aggregator) logs to understand the response behavior of 14 public web archives. We evaluated MementoMaps by measuring their Accuracy using 3.3M unique URIs from MemGator logs. We found that a MementoMap of less than 1.5% Relative Cost (as compared to the comprehensive listing of all the unique original URIs) can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive while maintaining 100% Recall (i.e., zero false negatives).
A brief introduction to WARC File Format used for long-term Web Archival preservation. These slides were initially prepared to give a guest lecture in the CS 531 Web Server Design (Fall 2018) course at Old Dominion University.
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingSawood Alam
InterPlanetary Wayback (IPWB) facilitates permanence and collaboration in web archives by disseminating the contents of WARC files into the InterPlanetary File System (IPFS) network. IPFS is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. IPWB splits the header and payload of WARC response records before disseminating into IPFS to leverage the deduplication, builds a CDXJ index with references to the IPFS hashes returns, and combines the headers and payload from IPFS at the time of replay. We also explore the possibility of an index-free, fully decentralized collaborative web archiving system as the next step.
MemGator - A Memento Aggregator CLI and Server in GoSawood Alam
MemGator - A Portable Concurrent Memento Aggregator CLI and Server Written in Go.
The corresponding poster can be found at http://www.cs.odu.edu/~salam/presentations/memgator-jcdl16-poster.pdf
Avoiding Zombies in Archival Replay Using ServiceWorkerSawood Alam
Live-leakage (zombie resource) is an issue in archival replay of web pages. This work proposes a mechanism to avoid such live-leakage using ServiceWorker. This work was presented in WADL 2017 on June 22 in Toronto, Ontario, Canada.
Client-side Reconstruction of Composite Mementos Using ServiceWorkerSawood Alam
Live-leakage (zombie resource) is an issue in archival replay of web pages. This work proposes a mechanism to avoid such live-leakage using ServiceWorker. This work was presented in JCDL 2017 on June 20 in Toronto, Ontario, Canada.
Introducing Web Archiving and WSDL Research GroupSawood Alam
My talk to introduce Web Archiving and the Web Science and Digital Libraries Research Group to some invited students from India for a summer workshop in Old Dominion University, Norfolk, VA
This 7-second Brain Wave Ritual Attracts Money To You.!nirahealhty
Discover the power of a simple 7-second brain wave ritual that can attract wealth and abundance into your life. By tapping into specific brain frequencies, this technique helps you manifest financial success effortlessly. Ready to transform your financial future? Try this powerful ritual and start attracting money today!
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBrad Spiegel Macon GA
Brad Spiegel Macon GA’s journey exemplifies the profound impact that one individual can have on their community. Through his unwavering dedication to digital inclusion, he’s not only bridging the gap in Macon but also setting an example for others to follow.
1.Wireless Communication System_Wireless communication is a broad term that i...JeyaPerumal1
Wireless communication involves the transmission of information over a distance without the help of wires, cables or any other forms of electrical conductors.
Wireless communication is a broad term that incorporates all procedures and forms of connecting and communicating between two or more devices using a wireless signal through wireless communication technologies and devices.
Features of Wireless Communication
The evolution of wireless technology has brought many advancements with its effective features.
The transmitted distance can be anywhere between a few meters (for example, a television's remote control) and thousands of kilometers (for example, radio communication).
Wireless communication can be used for cellular telephony, wireless access to the internet, wireless home networking, and so on.
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesSanjeev Rampal
Talk presented at Kubernetes Community Day, New York, May 2024.
Technical summary of Multi-Cluster Kubernetes Networking architectures with focus on 4 key topics.
1) Key patterns for Multi-cluster architectures
2) Architectural comparison of several OSS/ CNCF projects to address these patterns
3) Evolution trends for the APIs of these projects
4) Some design recommendations & guidelines for adopting/ deploying these solutions.
# Internet Security: Safeguarding Your Digital World
In the contemporary digital age, the internet is a cornerstone of our daily lives. It connects us to vast amounts of information, provides platforms for communication, enables commerce, and offers endless entertainment. However, with these conveniences come significant security challenges. Internet security is essential to protect our digital identities, sensitive data, and overall online experience. This comprehensive guide explores the multifaceted world of internet security, providing insights into its importance, common threats, and effective strategies to safeguard your digital world.
## Understanding Internet Security
Internet security encompasses the measures and protocols used to protect information, devices, and networks from unauthorized access, attacks, and damage. It involves a wide range of practices designed to safeguard data confidentiality, integrity, and availability. Effective internet security is crucial for individuals, businesses, and governments alike, as cyber threats continue to evolve in complexity and scale.
### Key Components of Internet Security
1. **Confidentiality**: Ensuring that information is accessible only to those authorized to access it.
2. **Integrity**: Protecting information from being altered or tampered with by unauthorized parties.
3. **Availability**: Ensuring that authorized users have reliable access to information and resources when needed.
## Common Internet Security Threats
Cyber threats are numerous and constantly evolving. Understanding these threats is the first step in protecting against them. Some of the most common internet security threats include:
### Malware
Malware, or malicious software, is designed to harm, exploit, or otherwise compromise a device, network, or service. Common types of malware include:
- **Viruses**: Programs that attach themselves to legitimate software and replicate, spreading to other programs and files.
- **Worms**: Standalone malware that replicates itself to spread to other computers.
- **Trojan Horses**: Malicious software disguised as legitimate software.
- **Ransomware**: Malware that encrypts a user's files and demands a ransom for the decryption key.
- **Spyware**: Software that secretly monitors and collects user information.
### Phishing
Phishing is a social engineering attack that aims to steal sensitive information such as usernames, passwords, and credit card details. Attackers often masquerade as trusted entities in email or other communication channels, tricking victims into providing their information.
### Man-in-the-Middle (MitM) Attacks
MitM attacks occur when an attacker intercepts and potentially alters communication between two parties without their knowledge. This can lead to the unauthorized acquisition of sensitive information.
### Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) Attacks
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC
Ellisha Heppner, Grant Management Lead, presented an update on APNIC Foundation to the PNG DNS Forum held from 6 to 10 May, 2024 in Port Moresby, Papua New Guinea.
1. TrendMachine:
Temporal Resilience of Web Pages
@WaybackMachine
IIPC Web Archiving Conference (WAC), May 03, 2023, Online
Sawood Alam
Mark Graham
Kritika Garg
Michele C. Weigle
Michael L. Nelson
Dietrich Ayala
Internet Archive
Internet Archive
Old Dominion University
Old Dominion University
Old Dominion University
Protocol Labs
@WebSciDL @ProtocolLabs
Supported in part by Protocol Labs and Filecoin Foundation
2. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 2
Research Question
How healthy has a web page been
throughout its lifetime?
3. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 3
Temporal and Spatial Landscape of Archival Analysis
Long Duration
Single
Webpage
● TMVis
● Wayback Machine Changes
● TrendMachine
● MementoMap
● CDX Summary
● Archives Unleashed Toolkit
Webpage
Collection
● Memento Damage
● Archival ACID Test
● Reconstructive
● Warrick
● Wayback Machine Downloader
● Video Archiving Insights
Short Duration
4. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 4
Modeling Web Page Health: Linear vs. S-Curve
5. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 5
Sigmoid Function for Web Page Resilience
Spread: How far up or down the value can go from its starting position?
Shift: How soon any significant change in the value can begin?
Slope: How quickly the value reaches close to the maximum change?
6. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 6
TrendMachine: Composite Sigmoid Parameters of Resilience
7. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 7
TrendMachine: Overview
Code: https://github.com/internetarchive/trendmachine
Demo: https://trendmachine.sawood-dev.us.archive.org/
8. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 8
TrendMachine: Temporal Distribution of Archiving Activities
The page is archived
as few as one or zero
times and as many as
tens of thousands of
times in a single day.
9. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 9
Specimen Selection Algorithm
PRIORITY = ["2xx", "4xx", "5xx", "3xx"]
FOREACH st OF PRIORITY
IF st IN statuses(day)
specimen = statuses(day).match(st)[0]
BREAK
DAY1 DAY2 DAY3 DAY4
4xx 3xx 5xx 3xx
3xx 3xx 3xx 5xx
2xx 3xx 5xx 3xx
5xx 4xx 5xx
2xx 4xx
A 3xx specimen usually suggests that the URL is
redirecting to somewhere other than a variation of
the same URL.
10. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 10
Filling Missing Observations
Policy DAY1 DAY2 DAY3 DAY4 DAY5 DAY6
Identical 2xx 2xx 2xx 4xx 2xx
Closest 2xx 2xx 2xx 4xx 4xx 2xx
Forward 2xx 2xx 2xx 2xx 4xx 2xx
Backward 2xx 2xx 4xx 4xx 4xx 2xx
ANY 2xx 2xx
Do not fill the gap if the
status codes before and
after are not identical.
Do not fill the gap if it is
larger than a configured
threshold.
11. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 11
TrendMachine: TimeMap Status Codes vs. Daily Specimens
Most of the self-redirect 3xx observations
(HTTP/HTTPS or WWW/Apex domain) are
eliminated in daily specimens.
About one third of the days since the first
observation have no captures, of which
some are filled using a filling policy.
12. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 12
TrendMachine: Resilience
● Resilience score is calculated using Sigmoid function on status codes of daily specimens
● Initial value of 0.5 and normalized between 0 and 1
● After the first few observations, Wayback Machine did not archive it for several months in 2002
● Towards the end of 2002, Resilience score went up slowly due to infrequent archiving
● In 2003 “wikipedia.org” started to redirect to “en.wikipedia.org”
● After 2005, Resilience of the Wikipedia home page has mostly been stable and high
13. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 13
TrendMachine: Fixity
● Fixity score (normalized) is calculated using Sigmoid
function on content digests of daily specimens
● Content digest reported in CDX can be sensitive to
Content-Encoding, resulting in false alarms, even
when the underlying content remains unchanged
14. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 14
TrendMachine: Chaos
● Chaos score (normalized) is calculated using a Run-Length Encoding inspired technique on all
status codes of the CDX data in which consecutive duplicates are removed in the numerator
● An alternate sliding-window calculation is performed on the last N observations as the score
becomes insensitive to recent changes on large TimeMaps
● A high Chaos along with a high Resilience is often an indication of canonical redirects (e.g.,
adoption of HTTPS and/or consolidation of WWW and Apex domain)
Chaos =
| 2xx, 2xx, 2xx, 3xx, 3xx, 2xx |
=
3
= 0.5
| 2xx, 2xx, 2xx, 3xx, 3xx, 2xx | 6
15. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 15
TrendMachine: Status Code Transitions
● Large numbers along the major diagonal
indicate status code stability for extended
periods of time
● Large numbers in non-diagonal cells suggest
frequent changes in Resilience curve
● Web pages with high Resilience score for
extended periods usually exhibit large numbers
in the top-left cell (2xx -> 2xx)
● A large number in the 3xx -> 3xx cell usually
indicates extended periods of redirection to
other URLs (e.g., URL restructuring, login wall,
domain change, and parked domain)
16. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 16
TrendMachine: Compare First and Last Mementos
17. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 17
TrendMachine: Live Web Page With Headers
18. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 18
Potential Use Cases
● Detect points of interest in a large TimeMap
● Sample captures/mementos from TimeMaps for visual summarization
● Detect archival sinks (like login pages, paywalls, and misconfigured redirects)
● Detect poor-quality pages like Soft-404 and parked domains
● Detect potential link-rot (and fix them when possible, like in a wiki page)
● Optimize crawl jobs by minimizing wasteful downloads and maximizing coverage
● Archival quality assurance
● Cluster pages of a large archival collection in different categories
19. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 19
Future Work
● Report heuristics-based archival summary by combining various scores
● Report/embed captures/mementos that can be points of interest
● Calculate Fixity using less-sensitive digests (e.g., SimHash)
● Calculate Chaos after applying convolutions to smooth out alternate changes
● Allow alternate web page health models (not just Sigmoid functions)
● Deploy in production by integrating with Wayback Machine
20. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 20
Summary
Code: https://github.com/internetarchive/trendmachine
Demo: https://trendmachine.sawood-dev.us.archive.org/
A mathematical model
to quantify temporal
health of a web page
Resilience, Fixity,
Chaos, Distributions,
Transitions, etc. reports
An interactive portal with
configuration options for
experiments
An evolving
open-source codebase
and demo deployment