Similar to Lessons Learned From the Longitudinal Sampling of a Large Web Archive (20)
Tracking the Performance of the Web with HTTP Archive - Rick Viscomi
Have you ever thought about how your site’s performance compares to the web as a whole? Or maybe you’re curious how popular a particular web feature is. How much is too much JavaScript? The HTTP Archive has been keeping track of how the web is built since 2010. It enables you to find answers to questions about the state of the web past and present. In this talk we’ll explore how the HTTP Archive works, some of the ways people are using this dataset, and sneak a peek at things to come.
Rick Viscomi (@rick_viscomi) is an engineer with Google's developer relations team, focusing on web transparency and maintaining the HTTP Archive. In a past life Rick helped make YouTube fast and co-authored the O'Reilly book "Using WebPageTest".
Fluent 2018: Tracking Performance of the Web with HTTP Archive - Paul Calvano
Have you ever thought about how your site’s performance compares to the web as a whole? Or maybe you’re curious how popular a particular web feature is. How much is too much JavaScript? The HTTP Archive has been keeping track of how the web is built since 2010. It enables you to find answers to questions about the state of the web past and present.
Paul Calvano explores how the HTTP Archive works, how people are using this dataset, and some ways that Akamai has leveraged data within the HTTP Archive to help its customers.
The webinar will present the SemaGrow demonstrator “Web Crawler + AgroTagger”, in order to collect feedback, ideas and comments about the status of the development and how the demonstrator helps to overcome data problems.
SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission, aiming at developing algorithms, infrastructures and methodologies to cope with large data volumes and real time performance.
In this context, FAO is providing a component that can be used to crawl the Web, giving meaning to discovered resources by using the AgroTagger, which can assign AGROVOC URIs to resources gathered by a Web crawler.
The demonstrator is publicly available at https://github.com/agrisfao/agrotagger.
Python Web Scraper for ACM and Google Scholar.pptx - ASIMKHAN840563
The Python Web Scraper for ACM and Google Scholar is a powerful tool designed to automate the process of data extraction from two prominent platforms in the academic and research community. By leveraging web scraping techniques, this scraper enables users to efficiently gather and analyze a wide range of information, including research papers, conference proceedings, and academic publications.
Slides from a webinar on webware presented by Mike Qaissaunee and Gordon F. Snyder, Jr. (both of nctt.org). The webinar was hosted by MATEC NetWorks (http://www.matecnetworks.org/) and delivered via Elluminate. Visit MATEC NetWorks to watch the webinar.
BlogForever Crawler: Techniques and algorithms to harvest modern weblogs Pres... - Vangelis Banos
Blogs are a dynamic communication medium which has been widely established on the web. The BlogForever project has developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents a key component of the BlogForever platform, the web crawler. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple and robust algorithm to generate extraction rules based on string matching, using the blog's web feed in conjunction with blog hypertext. This approach leads to a scalable blog data extraction process. Furthermore, we show how we integrate a web browser into the web harvesting process in order to support data extraction from blogs with JavaScript-generated content.
API analytics with Redis and Google Bigquery. NoSQL matters edition - javier ramirez
At teowaki we have a system for API usage analytics using Redis as a fast intermediate store and BigQuery as a big data backend. As a result, we can launch aggregated queries on our traffic/usage data in a few seconds and look for usage patterns that wouldn't be obvious otherwise. In this session I will speak about the alternatives we evaluated and how we are using Redis and BigQuery to solve our problem.
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS... - Amazon Web Services
AWS hosts a variety of public data sets that anyone can access for free. Previously, large data sets such as satellite imagery or genomic data have required hours or days to locate, download, customize, and analyze. When data is made publicly available on AWS, anyone can analyze any volume of data without downloading or storing it themselves. In this session, the AWS Open Data Team shares tips and tricks, patterns and anti-patterns, and tools to help you effectively stage your data for analysis in the cloud.
Code for Startup MVP (Ruby on Rails) Session 1 - Henry S
First session on learning to code for startup MVPs using Ruby on Rails. This session covers web architecture and Git/GitHub, and builds a real Rails app that is deployed to Heroku at the end.
Thanks,
Henry
Case Study for Ego-centric Citation Network - Mike Taylor
A patent citation network research tool used to build and analyze the technology landscape through ego-centric and social citation networks. Visit us for more.
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary ... - Robert Meusel
Promoted by major search engines, schema.org has become a widely adopted standard for marking up structured data in HTML web pages. In this paper, we use a series of large-scale Web crawls to analyze the evolution and adoption of schema.org over time. The availability of data from different points in time for both the schema and the websites deploying data allows for a new kind of empirical analysis of standards adoption, which has not been possible before. To conduct our analysis, we compare different versions of the schema.org vocabulary to the data that was deployed on hundreds of thousands of Web pages at different points in time. We measure both top-down adoption (i.e., the extent to which changes in the schema are adopted by data providers) as well as bottom-up evolution (i.e., the extent to which the actually deployed data drives changes in the schema). Our empirical analysis shows that both processes can be observed.
Readying Web Archives to Consume and Leverage Web Bundles - Sawood Alam
Potential utilization of the emerging Web technology, Web Bundles, in Web archiving, presented at the IIPC WAC 2021 in Session 8 by Sawood Alam.
Recording: https://youtu.be/lQX9v9V0FRQ
EUBra-BIGSEA: Cloud services with QoS guarantees for Big Data analytics - EUBra BIGSEA
Presentation given by Ignacio Blanquer, EUBra-BIGSEA EU coordinator at the Digital Infrastructures for Research conference held in Krakow, Poland, from 28th to 30th September 2016. Presentation overview available at http://www.digitalinfrastructures.eu/content/eubra-bigsea-cloud-services-qos-guarantees-big-data-analytics
How to Optimize Your Drupal Site with Structured Content - Acquia
With the advent of real-time marketing technologies and design methodologies like atomic design, web pages are no longer just "pages" - they are collections of modular, dynamic data that can be rearranged according to the context of the user.
To provide optimized user experiences, marketers and publishers need to enrich websites with additional structure (taxonomy and metadata). By adding metadata, content becomes machine-understandable, which leads to better interoperability, SEO, and accessibility.
Structured content is also one of the foundations of real-time personalization; by tagging and describing content with metadata, personalization engines like Acquia Lift can provide more relevant content to individual users.
In this webinar, we will discuss:
- How to further enrich your Drupal website with structure
- Taxonomy best practices for dynamic content and how to configure auto-tagging in your Drupal site
- How to leverage Microdata and the schema.org vocabulary to improve SEO through rich results
- How to improve the social shareability of your content through the use of Twitter Cards and OpenGraph tags
- Why Drupal 8 is the best CMS platform for managing structured content
Migration Best-Practices: Successfully re-launching your website - SMX New Yo... - Bastian Grimm
My talk from SMX 2017 in New York covering best practices on how to successfully navigate the various types of migrations (protocol migrations, frontend migrations, etc.) from an SEO perspective.
Enterprise guide to building a Data Mesh - Sion Smith
Making Data Mesh simple, open source, and available to all: without vendor lock-in, without complex tooling, and with an approach centered around 'specifications', existing tools, and a baked-in 'domain' model.
At teowaki we have a system for API usage analytics, with Redis as a fast intermediate store and BigQuery as a big data backend. As a result, we can launch aggregated queries on our traffic/usage data in just a few seconds, and we can look for usage patterns that wouldn't be obvious otherwise. In this session I will talk about how we entered the Big Data world, which alternatives we evaluated, and how we are using Redis and BigQuery to solve our problem.
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
1. Lessons Learned From the Longitudinal Sampling of a Large Web Archive
Kritika Garg¹ (@Kritika_Garg), Sawood Alam², Michele C. Weigle¹, Michael L. Nelson¹, Corentin Barreau², Mark Graham², Dietrich Ayala³
2023 IIPC Web Archiving Conference (WAC), May 3, 2023
¹ Web Science & Digital Libraries Research Group, Old Dominion University, Norfolk, Virginia, USA (@WebSciDL)
² Wayback Machine, Internet Archive, San Francisco, California, USA (@internetarchive)
³ Protocol Labs, San Francisco, California, USA (@protocollabs)
2. Overview
We documented the strategies and lessons learned from sampling the archived web, collecting 27.3 million URLs with 3.8 billion archived pages across each of the 26 years of the Internet Archive's existence, from 1996 to 2021.
3. The motivation for this work was to obtain a "representative sample of the web" that could be used to revisit fundamental questions regarding the web, such as "how long does a web page last?" The commonly cited answers are "44-100 days on average", all of which come from research dating back to 1996-2003.
https://www.washingtonpost.com/archive/politics/2003/11/24/on-the-web-research-work-proves-ephemeral/959c882f-9ad0-4b36-88cd-fb7411db118d/
http://web.archive.org/web/19970215093036/http://www.sciam.com:80/0397issue/0397kahle.html
http://web.archive.org/web/19971011050140/http://www.archive.org/sciam_article.html
4. Curated representative sample using the archived web
Initial goal: a dataset of 25M URLs (1M URLs for each year of the Internet Archive).
- 285 million URLs: sampled from IA's ZipNum index file, which contains every 6000th line of the CDX index. These include URLs of embedded resources, such as images, CSS, and JavaScript.
- 92 million URLs: filtered the URLs for HTML pages to limit our samples to web pages; also filtered out any invalid URLs and likely URL aliases, and upsampled URLs from early years.
- 27 million URLs: reduced the number of domains with a single URL and downsampled URLs of over-represented domains.
5. Archive index contains all kinds of URLs
We sampled 285M URLs from IA's ZipNum index file of August 2021, which contains every 6000th line of the CDX index. These include URLs of embedded resources such as JavaScript, CSS, images, and robots.txt, alongside HTML. For example:
https://brs53.dx.am/scripts/jquery.min.js
https://wam.ae/js/ar/markets.js
https:///?dn=renunciationguide.com&flrdr=yes&nxte=css
https://*/robots.txt
https://174.127.81.0/t/87/3/15/4-320x240.jpg
https://174.127.81.0/t/87/73/25/1-320x240.jpg
https://mf.ag/2121_de.gif?exp=24559886473100
https://127.0.0.1/bb1750.html
https://163.30.44.17/principal_test
https://notiche.com.ar/index.php?limitstart=42
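As a rough illustration of that 1-in-6000 view, the sketch below takes every Nth line of a CDX file. In practice the ZipNum secondary index provides this directly, so no full scan is needed; the function and file names here are hypothetical, not the authors' pipeline.

import itertools

def every_nth_cdx_line(path, n=6000):
    # Yield every nth line of a CDX index file (a sketch of the
    # 1-in-6000 sample that the ZipNum index provides directly).
    with open(path) as cdx:
        yield from itertools.islice(cdx, 0, None, n)

# Hypothetical usage:
# sample = [line.rstrip("\n") for line in every_nth_cdx_line("index.cdx")]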
6. Filtered the URLs for likely HTML pages based on extensions
To limit our samples to web pages, we filtered the URLs to 107M likely HTML pages (based on trailing slash and filename extensions).
Heuristic: example URL
- trailing slash/no ext: https://www.youtube.com/
- .do: http://example.com/register.do
- .php[0-9]: https://notiche.com.ar/index.php
- .aspx: https://cigaroasis.asia/contact.aspx
- .cgi: https://0009.ir/cgi-sys/suspendedpage.cgi
- .pl: https://007thunderballpoker.com/11-5g-suited-poker-chip/pai-gow-poker-rules.pl
- .asp: https://0000028.cnelc.com/productshop/newpro.asp
- .jsp: https://006bai.net/404.jsp
- .cfm: https://001ok.com/adventure_nz.cfm?nft=1&p=4&t=4
- .[a-z]html: https://city-sat.asia/thread28004.html
- .htm: http://1st-international.com:80/profiles/16/PersonalBO893.htm
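A minimal sketch of this extension-based heuristic as a URL classifier follows; the exact regular expressions are assumptions, since the slides list only the heuristics and their example URLs.

import re

# Approximation of the slide's heuristics, applied to the URL path
# after stripping any query string and fragment.
HTML_RE = re.compile(
    r"(/$)"                                   # trailing slash
    r"|(/[^/.]+$)"                            # no filename extension
    r"|(\.(do|aspx|cgi|pl|asp|jsp|cfm|htm)$)"
    r"|(\.php[0-9]?$)"
    r"|(\.[a-z]?html$)",
    re.IGNORECASE,
)

def looks_like_html(url: str) -> bool:
    # Heuristically classify a URL as a likely HTML page.
    path = url.split("#", 1)[0].split("?", 1)[0]
    return bool(HTML_RE.search(path))

print(looks_like_html("https://www.youtube.com/"))                   # True
print(looks_like_html("https://brs53.dx.am/scripts/jquery.min.js"))  # False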
7. Datetime of the first archive and MIME type using IA's CDX
We collected the first CDX entry for each of the 107M likely HTML pages to determine the memento datetime of the first archive and the MIME type of the URL.
CDX output format: surt timestamp original-URL mimetype statuscode digest length
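Reading those two fields out of a CDX line is a simple split; a sketch (the sample line below is illustrative, not from the dataset):

from collections import namedtuple

# The seven space-separated fields named in the output format above.
CDXEntry = namedtuple("CDXEntry",
    ["surt", "timestamp", "original_url", "mimetype", "statuscode", "digest", "length"])

def parse_cdx_line(line: str) -> CDXEntry:
    return CDXEntry(*line.split(" ", 6))

entry = parse_cdx_line("com,example)/ 19961022060431 http://example.com/ text/html 200 EXAMPLEDIGEST 2042")
print(entry.timestamp, entry.mimetype)  # memento datetime of first archive, MIME type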
8. 86.12% of URLs (92M) were correctly predicted as HTML
Accuracy of each heuristic used to predict HTML:
- trailing slash/no ext: 83.7%
- .do: 85.1%
- .php[0-9]: 88.7%
- .aspx: 90.0%
- .cgi: 90.1%
- .pl: 91.8%
- .asp: 93.7%
- .jsp: 93.7%
- .cfm: 96.7%
- .[a-z]html: 97.8%
- .htm: 98.3%
(Figure: MIME-type distribution of the 107M likely HTML URLs)
9. Significant increase in web and archiving capacity over time
We grouped the 92 million URLs with "text/html" MIME types based on the year each was first archived.
- The years 2001-2021 each exceed 1M URLs and require downsampling; 1996-2000 each have fewer than 1M URLs and require upsampling.
- 2021 has only 8 months of data, as the index file used is from August 2021.
- 1996 has partial data, as IA's Wayback Machine started in October 1996.
10. Increase in deep links archived over the years; extracted roots from deep links to upsample earlier years
We identified ~20M domains in the 92M sample with no root URLs. We extracted hostnames to form root URLs and then added these missing root URLs to our sample. For example:
https://reddit.com/r/argentina/comments/1ruebz/cient%c3%adficos_chubutensesi → https://reddit.com/
Upsampling allowed us to populate URLs in the early years, which are of more interest for our study.
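A minimal sketch of that root-URL extraction, using urllib's standard parser (the helper name is ours, not the authors'):

from urllib.parse import urlparse

def to_root_url(deep_link: str) -> str:
    # Keep only the scheme and hostname to form the root URL.
    parsed = urlparse(deep_link)
    return f"{parsed.scheme}://{parsed.netloc}/"

print(to_root_url("https://reddit.com/r/argentina/comments/1ruebz/cient%c3%adficos_chubutensesi"))
# https://reddit.com/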
11. Long tail distribution; 70% of the domains have just one URL
(Figure: distribution of the number of URLs per domain in the 2016 sample)
For example, in the 2016 sample, 1.7M domains (79%) have just a single URL:
0000-00-00.com
00000000000.cn
000000008.com
blumen-konzelmann.de
ip-37-187-129.eu
jdpiao.com
jsygzh.com
kkradnik.com
sayyum.com
schuimrubbergigant.nl
spd-wuppertal-katernberg.de
tokelezea.com
zzzzy.com
zzzzyyyyggggtest1.com
zzzzz7.com
Some of this long tail features domains that are likely not part of most users' experience, although we can't be sure for foreign sites. We kept only 10% of domains with a single URL for yearly samples with longer tails, as sketched below.
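A sketch of that 10% retention step, assuming a uniform random choice over single-URL domains (the slides do not specify how the 10% were selected):

import random

def downsample_singletons(urls_by_domain, keep_fraction=0.1, seed=42):
    # urls_by_domain: dict mapping domain -> list of sampled URLs.
    rng = random.Random(seed)
    singletons = [d for d, urls in urls_by_domain.items() if len(urls) == 1]
    kept = set(rng.sample(singletons, int(len(singletons) * keep_fraction)))
    # Keep all multi-URL domains, plus the retained fraction of singletons.
    return {d: urls for d, urls in urls_by_domain.items()
            if len(urls) > 1 or d in kept}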
12. Popular domains (e.g., amazon.com, yahoo.com) are over-represented
We clustered the early years (1996-2000) to reach 1M URLs.
1996-2000 (domain: no. of URLs):
amazon.com: 16.8K, yahoo.com: 13.5K, geocities.com: 12.1K, infospace.com: 9.6K, aol.com: 5.2K, tripod.com: 2.8K, msn.com: 2.8K, wunderground.com: 2.8K, excite.com: 2.7K, surfers-paradise.com: 2.6K
2001 (domain: no. of URLs):
yahoo.com: 11.7K, geocities.com: 6.6K, free.fr: 3.3K, tripod.com: 3.2K, amazon.com: 2.0K, angelfire.com: 2.0K, hypermart.net: 1.9K, homestead.com: 1.8K, sun.com: 1.8K, sina.com.cn: 1.8K
2002 (domain: no. of URLs):
yahoo.com: 11.4K, geocities.com: 5.7K, 2ch.net: 4.3K, amazon.com: 3.2K, daum.net: 3.2K, free.fr: 3.0K, sohu.com: 2.3K, infoseek.co.jp: 2.0K, sina.com.cn: 2.0K, yahoo.co.jp: 1.9K
13. Logarithmic-scale downsampling to reduce over-sampled domains
We don't require 1.3 million GitHub URLs! Having a lot of URLs for a single domain eventually has diminishing returns; we can get sufficient coverage of github.com with a smaller quantity.
1.3M URLs for github.com reduced to 234 URLs:
https://github.com/adelcambre
https://github.com/akitaonrails/i18n_demo_app/tree/master
https://github.com/alx
https://github.com/anotherjesse/s3/watchers
https://github.com/280north/cappuccino/issues
https://github.com/aaronrussell/gh_repo_recommender
https://github.com/00amy/intelligent-tutoring-system
https://github.com/00lenon/thediamondknight
https://github.com/01045972746/tensor-example
…
3 URLs for peaceinspire.com stay as they are:
https://peaceinspire.com/2007/07/28
https://peaceinspire.com/song-lyrics/english-songs
https://peaceinspire.com/2008/09/01/give-the-lord-your-heart
sample_urls = min(N, K * log(N) + C), where N is the number of URLs sharing a domain:
- C: include up to C URLs from the same domain
- log(N) + C: beyond C, sample URLs on a log scale
- K * log(N) + C: sample K multiples of the log scale to relax the downsampling
- min(N, K * log(N) + C): ensure we never sample more URLs than are available under a domain
14. Logarithmic-scale downsampling to reach around 1M URLs for each year
We applied this technique to every domain in the yearly sample, adjusting the parameters K and C to get almost 1M URLs in total for each year while ensuring fairness in domain representation: sample_urls = min(N, K * log(N) + C), where N is the number of URLs sharing a domain.
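A sketch of this downsampling rule in code. The values of K and C, the natural-log base, and the uniform random selection are assumptions; the slides give only the formula.

import math
import random

def sample_size(n: int, k: float, c: float) -> int:
    # min(N, K*log(N) + C): per-domain sample count on a log scale.
    return min(n, int(k * math.log(n) + c))

def downsample_domain(urls, k=16.2, c=5.0, seed=0):
    # Randomly keep sample_size(...) URLs from one domain's URL list.
    rng = random.Random(seed)
    return rng.sample(urls, sample_size(len(urls), k, c))

print(sample_size(1_300_000, 16.2, 5))  # 233: close to the ~234 github.com URLs kept above
print(sample_size(3, 16.2, 5))          # 3: small domains stay as they are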
15. Downsampling largely flattened out the large discrepancy in the number of URLs per domain
We have 7M unique domains in the 27.3M URL sample. Our formula does not strictly maintain the ordering: yahoo.com, which was in 8th place before downsampling, is now ranked 1st. The two rankings are highly correlated (99.4%).
Top 20 domains before downsampling (No. / Domain / URLs):
1 google.com 1.8M
2 github.com 1.3M
3 reddit.com 1.1M
4 youtube.com 866.6K
5 tumblr.com 685.3K
6 wordpress.com 577.4K
7 blogspot.com 521.2K
8 yahoo.com 456.9K
9 facebook.com 308.3K
10 instagram.com 245.3K
11 bebo.com 229.7K
12 amazon.com 200.5K
13 url.cn 196.2K
14 webshots.com 191.6K
15 twitpic.com 160.5K
16 wikipedia.org 147.7K
17 webs.com 140.6K
18 verizon.net 140.1K
19 qq.com 139.0K
20 hyves.nl 124.6K
Top 20 domains after downsampling (No. / Domain / R-URLs):
1 yahoo.com 489
2 blogspot.com 463
3 google.com 459
4 amazon.com 453
5 wikipedia.org 439
6 house.gov 417
7 msn.com 401
8 yahoo.co.jp 396
9 wordpress.com 394
10 cnn.com 391
11 ca.gov 387
12 ebay.com 383
13 go.com 382
14 amazon.de 382
15 microsoft.com 380
16 senate.gov 379
17 sina.com.cn 378
18 amazon.co.uk 377
19 nih.gov 376
20 amazon.co.jp 376
16. Dereferencing and downloading content from the archive for 27.3M URLs is expensive
Each URL (e.g., http://facebook.com) dereferences to a TimeMap, which lists its mementos.
- 27.3M TimeMaps: 1.4TB of storage; cost to fetch: ~22 days (0.07s/TimeMap)
- 27.3M URLs yield 3.93B total mementos; cost to download: ~172 yrs (1.40s/memento)
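The back-of-envelope arithmetic behind those cost estimates, with per-item rates taken from the slide:

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

timemaps = 27_300_000
mementos = 3_930_000_000

print(timemaps * 0.07 / SECONDS_PER_DAY)   # ~22.1 days to fetch all TimeMaps
print(mementos * 1.40 / SECONDS_PER_YEAR)  # ~174 years, in line with the slide's ~172 yrs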
17. Over time, root URLs tend to collect more mementos than deep links
Animation: https://observablehq.com/d/21b995649a9d3b33#cell-3
18. Over time, root URLs tend to collect more mementos than deep links
Animation: https://observablehq.com/d/21b995649a9d3b33#cell-3
- Later years have fewer root URLs; early years have fewer deep links.
- For all yearly samples, deep links fall below the diagonal except in 2021, because some deep links have more than 100M mementos.
19. Some deep links are not crawled or only rarely crawled
http://web.archive.org/web/20050523203823/http://www.msnbc.com/
http://web.archive.org/web/20050523203823/http://www.msnbc.com/id/7954620/
www.msnbc.com/id/7954620/ exists on the live web (even if it redirects), but no live web page links to it (it is not indexed in Google). The page is not archived even though we discovered it from a memento.
$ curl -ILks http://www.msnbc.com/id/7954620/ | grep -iE "^HTTP|^location:"
HTTP/1.1 301 Moved Permanently
Location: https://www.msnbc.com/id/7954620/
HTTP/1.1 301 Moved Permanently
Location: http://www.nbcnews.com/id/7954620/
HTTP/1.1 301 Moved Permanently
Location: https://www.nbcnews.com/id/7954620/
HTTP/1.1 301 Moved Permanently
Location: https://www.nbcnews.com/id/wbna7954620
HTTP/1.1 200 OK
20. Most root URLs first discovered in the early years (1996-2002) are still linked on the live web and are still being crawled by IA
○ This seems less true for root URLs discovered post-2002.
○ This could be due to domain drop catching, which gives the appearance that a URL is alive:
$ curl -i http://www.aggressivecars.com/
HTTP/1.1 302 Found
content-length: 0
date: Thu, 21 Apr 2022 00:10:12 GMT
location: https://www.hugedomains.com/domain_profile.cfm?d=aggressivecars.com
21. Starting around 2016, most-archived URLs no longer correlate with user experiences (e.g., Yahoo, Wikipedia) but are now service/framework URLs
1996 (URL: no. of mementos): root URLs are the most popular, and they appear to be part of the standard user experience.
- https://bloomberg.com/ : 3.8M
- https://genealogy.com/ : 3.6M
- https://royalkona.com/ : 2.3M
- https://msn.com/ : 2.3M
- https://fma.com/ : 1.9M
2005 (URL: no. of mementos): deep links start appearing among the most popular URLs.
- https://youtube.com/ : 2.7M
- https://tu06.com/ : 1.3M
- https://fasthorses.biz/login.aspx : 1.3M
- https://ameriplanhealth.com/members.aspx : 1.1M
- https://wap.lunarstorm.se/log/log_outside.aspx : 750.0K
2019 (URL: no. of mementos): these URLs are not part of the standard user experience!
- https://securelb.imodules.com/s/1858/bp/interior.aspx?cid=1063&gid=2&pgid=418&sid=1858 : 32.6M
- https://cognac.fr/?cookie_accepted=false : 5.6M
- https://yastatic.net/safeframe-bundles/0.69/1-1-0/render.html : 5.4M
- https://fbsbx.com/captcha/recaptcha/iframe?compact=0&referer=https://www.facebook.com : 3.6M
- https://sarahdaisy.com/cgi-sys/suspendedpage.cgi : 3.4M
22. Summary
We employed various sampling strategies to curate our representative sample of the web. The final dataset contains TimeMaps of 27.3 million URLs comprising 3.8 billion archived pages from 1996 to 2021.
Challenges and lessons learned:
1. The archive's index contains more than HTML pages. We correctly predicted 86% of HTML pages using extensions.
2. Web and archiving capacity have increased significantly over time, so we had fewer URLs in the early years.
3. The percentage of deep links archived, compared to root URLs, has increased over the years.
4. Our initial sample was dominated by the long tail (domains with only one URL), so we reduced it by 90%.
5. Popular domains such as Yahoo, Amazon, and Twitter were over-represented, so we applied domain-based logarithmic-scale downsampling.
6. Dereferencing and downloading content for 27.3M URLs (and their ~4B mementos) is computationally and storage expensive.
7. Root URLs tend to collect more mementos than deep links. Most root URLs from the early years are still crawled by IA.
8. Popular URLs in IA after 2016 are no longer end-user HTML pages but include service/framework URLs.