● Lessons Learned From the Longitudinal Sampling of a Large Web Archive ● IIPC WAC 2023 ● @Kritika_Garg, @WebSciDL, @internetarchive @protocollabs
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
Kritika Garg (1), Sawood Alam (2), Michele C. Weigle (1), Michael L. Nelson (1), Corentin Barreau (2), Mark Graham (2), Dietrich Ayala (3)
2023 IIPC Web Archiving Conference (WAC)
May 3, 2023
(1) Web Science & Digital Libraries Research Group, Old Dominion University, Norfolk, Virginia, USA (@WebSciDL)
(2) Wayback Machine, Internet Archive, San Francisco, California, USA (@internetarchive)
(3) Protocol Labs, San Francisco, California, USA (@protocollabs)
We documented the strategies and lessons learned from
sampling the archived web: we collected 27.3 million URLs
with 3.8 billion archived pages, covering each of the 26 years
of the Internet Archive's existence, from 1996 to 2021.
Overview
1996 1996 2003
https://www.washingtonpost.com/archive/politics/2003/11/24/on-the-web-research-work-proves-ephemeral/959c882f-9ad0-4b36-88cd-fb7411db118d/
The motivation for this work was to obtain a "representative sample of the web" that could be used to
revisit fundamental questions regarding the web, such as "how long does a web page last?" The
commonly cited answer is "44-100 days on average", but the figures behind it all come from research
that dates back to 1996-2003.
http://web.archive.org/web/19970215093036/http://www.sciam.com:80/0397issue/0397kahle.html
http://web.archive.org/web/19971011050140/http://www.archive.org/sciam_article.html
Curated representative sample using the archived web
Initial Goal: Dataset of 25M URLs (1M URLs for each year of Internet Archive)
285 million URLs: Sampled 285M URLs from IA's ZipNum index file, which contains every 6000th line of the CDX index. These include URLs of embedded resources, such as images, CSS, and JavaScript.
92 million URLs: Filtered the URLs for HTML pages to limit our sample to web pages. Also filtered out invalid URLs and likely URL aliases.
Upsampled URLs from early years.
27 million URLs: Reduced the number of domains with a single URL. Downsampled URLs of over-represented domains.
https://brs53.dx.am/scripts/jquery.min.js
https://wam.ae/js/ar/markets.js
https:///?dn=renunciationguide.com&flrdr=yes&nxte=css
https://*/robots.txt
https://*/robots.txt
https://174.127.81.0/t/87/3/15/4-320x240.jpg
https://174.127.81.0/t/87/73/25/1-320x240.jpg
https://mf.ag/2121_de.gif?exp=24559886473100
https://127.0.0.1/bb1750.html
https://163.30.44.17/principal_test
https://notiche.com.ar/index.php?limitstart=42
Archive index contains all kinds of URLs
We sampled 285M URLs from IA's ZipNum index file of August 2021, which contains every 6000th line of the
CDX index. The sample therefore includes URLs of embedded resources, such as images, CSS, and JavaScript.
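The "every 6000th line" sampling can be illustrated with a minimal sketch. It assumes the index is available as an iterable of text lines; the function name `sample_every_nth` is hypothetical and this is not the Internet Archive's actual ZipNum tooling.

```python
from itertools import islice
from typing import Iterable, Iterator


def sample_every_nth(lines: Iterable[str], step: int = 6000) -> Iterator[str]:
    """Return an iterator over lines 0, step, 2*step, ... of the input.

    Assumption: the first line of each block of `step` lines is the one kept.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    # islice streams the input, so the full index never has to fit in memory.
    return islice(lines, 0, None, step)
```

Because the input is streamed, the same function could be pointed at an open file handle for a large index file.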
Resource types among the example URLs: JavaScript, CSS, images, robots.txt, HTML
Filtered the URLs for likely HTML pages based on extensions
To limit our samples to web pages, we filtered the URLs to 107M likely HTML pages
(based on trailing slash and filename extensions).
Heuristic              Example URL
trailing slash/no ext  https://www.youtube.com/
.do                    http://example.com/register.do
.php[0-9]              https://notiche.com.ar/index.php
.aspx                  https://cigaroasis.asia/contact.aspx
.cgi                   https://0009.ir/cgi-sys/suspendedpage.cgi
.pl                    https://007thunderballpoker.com/11-5g-suited-poker-chip/pai-gow-poker-rules.pl
.asp                   https://0000028.cnelc.com/productshop/newpro.asp
.jsp                   https://006bai.net/404.jsp
.cfm                   https://001ok.com/adventure_nz.cfm?nft=1&p=4&t=4
.[a-z]html             https://city-sat.asia/thread28004.html
.htm                   http://1st-international.com:80/profiles/16/PersonalBO893.htm
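The heuristics in the table could be expressed roughly as follows. This is a sketch under assumptions: `.php[0-9]` is read as "php optionally followed by one digit" and `.[a-z]html` as "html optionally preceded by one letter"; the function name `is_likely_html` is hypothetical and the authors' actual filter may differ.

```python
import re
from urllib.parse import urlsplit

# Extensions from the table; the optional digit/letter readings are assumptions.
_HTML_EXT = re.compile(
    r"\.(do|php[0-9]?|aspx|cgi|pl|asp|jsp|cfm|[a-z]?html|htm)$",
    re.IGNORECASE,
)


def is_likely_html(url: str) -> bool:
    """Guess from the URL path alone whether the URL points to an HTML page."""
    path = urlsplit(url).path  # the query string and fragment are ignored
    if path == "" or path.endswith("/"):
        return True  # root URL or trailing slash
    last_segment = path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return True  # no filename extension
    return bool(_HTML_EXT.search(last_segment))
```

Only the path is inspected, so a URL such as `https://001ok.com/adventure_nz.cfm?nft=1&p=4&t=4` is judged by `.cfm` and not by its query string.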
Datetime of the first archive and MIME type using IA's CDX
output: surt timestamp original-URL mimetype statuscode digest length
We collected the first CDX entry for each of the 107M likely HTML pages to determine the datetime of
its first archived copy and the MIME type of the URL.
Fields used: Memento datetime of the first archive (timestamp) and MIME type (mimetype)
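A space-separated CDX line with the seven fields listed above can be split into named fields as in this sketch. The sample line used in the test is made up for illustration, and `CdxEntry`, `parse_cdx_line`, and `memento_datetime` are hypothetical names; the 14-digit `YYYYMMDDhhmmss` timestamp format matches the Wayback URLs shown earlier in the deck.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CdxEntry(NamedTuple):
    surt: str
    timestamp: str
    original_url: str
    mimetype: str
    statuscode: str
    digest: str
    length: str


def parse_cdx_line(line: str) -> CdxEntry:
    """Split one space-separated CDX line into its seven fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    return CdxEntry(*fields)


def memento_datetime(entry: CdxEntry) -> datetime:
    """Convert the 14-digit CDX timestamp to a timezone-aware UTC datetime."""
    parsed = datetime.strptime(entry.timestamp, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)
```

Taking the first entry returned for a URL then yields both values the slide mentions: the datetime of the first archived copy and the recorded MIME type.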
86.12% of the URLs (92M) were correctly predicted as HTML
Accuracy of each heuristic used to predict HTML:
trailing slash/no ext 83.7%
.do 85.1%
.php[0-9] 88.7%
.aspx 90.0%
.cgi 90.1%
.pl 91.8%
.asp 93.7%
.jsp 93.7%
.cfm 96.7%
.[a-z]html 97.8%
.htm 98.3%
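Per-heuristic accuracies like those above could be computed along these lines. This is a sketch assuming each URL has already been labelled with the heuristic that matched it and the MIME type recorded in its first CDX entry; `heuristic_accuracy` is a hypothetical name and the toy records in the test are not the study's data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def heuristic_accuracy(records: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each heuristic to the fraction of its URLs whose MIME type is text/html.

    `records` yields (heuristic, mimetype) pairs, one per URL.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for heuristic, mimetype in records:
        totals[heuristic] += 1
        if mimetype.strip().lower() == "text/html":
            hits[heuristic] += 1
    return {h: hits[h] / totals[h] for h in totals}
```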
MIME-type distribution of 107M likely HTML URLs
Significant increase in web and archiving capacity over time
We grouped the 92 million URLs with "text/html" MIME types by the year in which each was first archived.
2001-2021: exceed 1M URLs and require downsampling.
1996-2000: fewer than 1M URLs and require upsampling.
2021 has only 8 months of data because the index file used is from August 2021.
1996 has partial data because IA's Wayback Machine started in October 1996.
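The grouping step can be sketched as follows, assuming each URL's first-archive time is available as a 14-digit CDX timestamp. The per-year target of 1M comes from the stated goal; `urls_per_year` and `sampling_action` are hypothetical names.

```python
from collections import Counter
from typing import Iterable

TARGET_PER_YEAR = 1_000_000  # goal stated in the deck: about 1M URLs per year


def urls_per_year(first_timestamps: Iterable[str]) -> Counter:
    """Count URLs by the year of their first archived copy (YYYYMMDDhhmmss)."""
    return Counter(int(ts[:4]) for ts in first_timestamps)


def sampling_action(count: int, target: int = TARGET_PER_YEAR) -> str:
    """Say whether a yearly bucket needs upsampling or downsampling."""
    if count > target:
        return "downsample"
    if count < target:
        return "upsample"
    return "keep"
```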
Increase in deep links archived over the years;
Extracted root from deep links to upsample earlier years
We identified ~20M domains in the 92M sample with no root URLs. We extracted their hostnames to form root URLs
and then added these missing root URLs to our sample.
For example:
https://reddit.com/r/argentina/comments/1ruebz/cient%c3%adficos_chubutensesi → https://reddit.com/
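Forming a root URL from a deep link can be sketched with the standard library. The helper names `root_url` and `missing_root_urls` are hypothetical, and the sketch assumes absolute URLs with a scheme and host.

```python
from typing import Iterable, Set
from urllib.parse import urlsplit


def root_url(url: str) -> str:
    """Return scheme://host[:port]/ for an absolute URL."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not an absolute URL with a host: {url!r}")
    return f"{parts.scheme}://{parts.netloc}/"


def missing_root_urls(sample: Iterable[str]) -> Set[str]:
    """Root URLs implied by the sample's deep links but absent from the sample."""
    urls = set(sample)
    return {root_url(u) for u in urls} - urls
```

Applied to the example above, `root_url` maps the reddit.com deep link to `https://reddit.com/`.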
Upsampling allowed us to populate URLs in the early years, which hold more interest for our study.
Long tail distribution; 70% of the domains have just one URL
Distribution of Number of URLs for each domain in 2016 sample
For example, in the 2016 sample, 1.7M domains (79%) have just a single URL:
0000-00-00.com
00000000000.cn
000000008.com
blumen-konzelmann.de
ip-37-187-129.eu
jdpiao.com
jsygzh.com
kkradnik.com
sayyum.com
schuimrubbergigant.nl
spd-wuppertal-katernberg.de
tokelezea.com
zzzzy.com
zzzzyyyyggggtest1.com
zzzzz7.com
Part of the long tail consists of domains that are likely not part of most users' experience,
although we can't be sure for foreign sites.
We kept only 10% of the domains with a single URL for yearly samples with longer tails.
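Keeping 10% of the single-URL domains might look like the sketch below. Uniform random selection with a fixed seed is an assumption, since the slides do not say how the 10% were chosen; `thin_single_url_domains` is a hypothetical name.

```python
import random
from typing import Dict, List


def thin_single_url_domains(
    urls_by_domain: Dict[str, List[str]],
    keep_fraction: float = 0.10,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Keep all multi-URL domains and a random fraction of single-URL domains."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    singles = sorted(d for d, urls in urls_by_domain.items() if len(urls) == 1)
    keep_count = round(len(singles) * keep_fraction)
    kept = set(rng.sample(singles, keep_count))
    return {
        domain: urls
        for domain, urls in urls_by_domain.items()
        if len(urls) > 1 or domain in kept
    }
```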
Popular domains (e.g., amazon.com, yahoo.com)
are over-represented
1996-2000
Domain No. of URLs
amazon.com 16.8K
yahoo.com 13.5K
geocities.com 12.1K
infospace.com 9.6K
aol.com 5.2K
tripod.com 2.8K
msn.com 2.8K
wunderground.com 2.8K
excite.com 2.7K
surfers-paradise.com 2.6K
2001
Domain No. of URLs
yahoo.com 11.7K
geocities.com 6.6K
free.fr 3.3K
tripod.com 3.2K
amazon.com 2.0K
angelfire.com 2.0K
hypermart.net 1.9K
homestead.com 1.8K
sun.com 1.8K
sina.com.cn 1.8K
2002
Domain No. of URLs
yahoo.com 11.4K
geocities.com 5.7K
2ch.net 4.3K
amazon.com 3.2K
daum.net 3.2K
free.fr 3.0K
sohu.com 2.3K
infoseek.co.jp 2.0K
sina.com.cn 2.0K
yahoo.co.jp 1.9K
We clustered the early years (1996-2000) to reach 1M URLs.
logarithmic-scale downsampling to reduce over-sampled domains
https://github.com/adelcambre
https://github.com/akitaonrails/i18n_demo_app/tree/master
https://github.com/alx
https://github.com/anotherjesse/s3/watchers
https://github.com/280north/cappuccino/issues
https://github.com/aaronrussell/gh_repo_recommender
https://github.com/00amy/intelligent-tutoring-system
https://github.com/00lenon/thediamondknight
https://github.com/01045972746/tensor-example
…
1.3M URLs for github.com reduced to 234 URLs
https://peaceinspire.com/2007/07/28
https://peaceinspire.com/song-lyrics/english-songs
https://peaceinspire.com/2008/09/01/give-the-lord-your-heart
3 URLs for peaceinspire.com stay as they are
We don't require 1.3 million github URLs!
Having a lot of URLs for a single domain eventually has
diminishing returns. It is sufficient to say that we can have
coverage of github.com with a smaller quantity.
sample_urls = min(N, K * log(N) + C)
N = number of URLs sharing a domain
Term                     Purpose
C                        Include up to C URLs from the same domain
log(N) + C               Beyond C, sample URLs on a log scale
K * log(N) + C           Sample K multiples of the log scale to relax the downsampling
min(N, K * log(N) + C)   Ensure that the samples are not more than the available URLs under a domain
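The formula can be sketched directly in code. Two things are assumptions here: the logarithm base (natural log is used below; the slides do not state it) and the values of K and C, which the slides do not give. `sample_size` and `downsample_domain` are hypothetical names, and random selection within a domain is also an assumption.

```python
import math
import random
from typing import List


def sample_size(n: int, k: float, c: float) -> int:
    """min(N, K * log(N) + C), truncated to a whole number of URLs."""
    if n <= 0:
        return 0
    return min(n, int(k * math.log(n) + c))  # natural log is an assumption


def downsample_domain(urls: List[str], k: float, c: float, seed: int = 0) -> List[str]:
    """Randomly pick sample_size(len(urls)) URLs from one domain."""
    size = sample_size(len(urls), k, c)
    return random.Random(seed).sample(urls, size)
```

With any positive K and C of at least 1, a domain with only a few URLs keeps all of them, while a domain with millions of URLs is cut to a few hundred at most, which is the behaviour the github.com and peaceinspire.com examples illustrate.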
logarithmic-scale downsampling to reach around 1M URLs for each year
We applied this technique to every domain in the yearly
sample. We adjusted the parameters K and C to get almost
1M URLs in total for each year while ensuring fairness in
the domain representation.
sample_urls = min(N, K * log(N) + C)
N = number of URLs sharing a domain
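One way to adjust K toward a yearly total is a binary search, since the total never decreases as K grows. The slides only say the parameters were adjusted, so the search procedure, the natural log, and the names `total_sample` and `tune_k` are all assumptions in this sketch.

```python
import math
from typing import Iterable, List


def total_sample(counts: Iterable[int], k: float, c: float) -> int:
    """Total URLs kept across domains under min(N, K * log(N) + C)."""
    return sum(min(n, int(k * math.log(n) + c)) for n in counts if n > 0)


def tune_k(counts: List[int], target: int, c: float,
           k_max: float = 1e6, iterations: int = 60) -> float:
    """Approximate the largest K whose total stays at or below `target`."""
    low, high = 0.0, k_max
    for _ in range(iterations):
        mid = (low + high) / 2
        if total_sample(counts, mid, c) > target:
            high = mid  # too many URLs kept: shrink K
        else:
            low = mid   # within budget: try a larger K
    return low
```

Here `counts` holds the number of URLs per domain for one yearly sample, and C is held fixed while K is searched; a real run would likely tune both.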
Downsampling largely flattened out the large discrepancy in the number of
URLs per domain
Top 20 domains before downsampling
No. Domain URLs
1 google.com 1.8M
2 github.com 1.3M
3 reddit.com 1.1M
4 youtube.com 866.6K
5 tumblr.com 685.3K
6 wordpress.com 577.4K
7 blogspot.com 521.2K
8 yahoo.com 456.9K
9 facebook.com 308.3K
10 instagram.com 245.3K
11 bebo.com 229.7K
12 amazon.com 200.5K
13 url.cn 196.2K
14 webshots.com 191.6K
15 twitpic.com 160.5K
16 wikipedia.org 147.7K
17 webs.com 140.6K
18 verizon.net 140.1K
19 qq.com 139.0K
20 hyves.nl 124.6K
Top 20 domains after downsampling
No. Domain R-URLs
1 yahoo.com 489
2 blogspot.com 463
3 google.com 459
4 amazon.com 453
5 wikipedia.org 439
6 house.gov 417
7 msn.com 401
8 yahoo.co.jp 396
9 wordpress.com 394
10 cnn.com 391
11 ca.gov 387
12 ebay.com 383
13 go.com 382
14 amazon.de 382
15 microsoft.com 380
16 senate.gov 379
17 sina.com.cn 378
18 amazon.co.uk 377
19 nih.gov 376
20 amazon.co.jp 376
Our formula does not strictly preserve the ordering: yahoo.com, which was in 8th place before
downsampling, is now ranked 1st. The two rankings are nonetheless highly correlated (99.4%).
We have 7M unique domains in the 27.3M URL sample.
Dereferencing and downloading content from archive for
27.3M URLs is expensive
URL (e.g., http://facebook.com) → TimeMap → Mementos
27.3M URLs
27.3M TimeMaps: 1.4 TB of storage; cost: ~22 days (0.07 s/TimeMap)
3.93B total mementos: cost to download: ~172 years (1.40 s/memento)
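The cost estimates follow from multiplying item counts by per-item times. The sketch below redoes that arithmetic assuming strictly sequential requests and a 365-day year; both are assumptions, and the inputs are the rounded figures from the slide.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # 365-day year is an assumption


def sequential_cost_seconds(items: float, seconds_per_item: float) -> float:
    """Total time if every item is fetched one after another."""
    return items * seconds_per_item


# Rounded inputs taken from the slide.
timemap_days = sequential_cost_seconds(27.3e6, 0.07) / SECONDS_PER_DAY
memento_years = sequential_cost_seconds(3.93e9, 1.40) / SECONDS_PER_YEAR

print(f"TimeMaps: {timemap_days:.1f} days")
print(f"Mementos: {memento_years:.1f} years")
```

With these rounded inputs the TimeMap estimate comes to about 22 days and the memento estimate to about 174 years, close to the ~172 years on the slide; the small gap presumably reflects rounding in the per-memento time or the memento count.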
Animation: https://observablehq.com/d/21b995649a9d3b33#cell-3
Over time, root URLs tend to collect more mementos than deep links
Later years have fewer root URLs
Early years have fewer deep links
For all yearly samples, deep links are below the diagonal except for 2021.
This is because of some deep links that have more than 100M mementos
Some deep links not crawled/rarely crawled
http://web.archive.org/web/20050523203823/http://www.msnbc.com/
http://web.archive.org/web/20050523203823/http://www.msnbc.com/id/7954620/
www.msnbc.com/id/7954620/ exists on the live web (even if it redirects), but
no live web page links to it (not indexed in Google). The page is not archived
even though we discovered it from a memento.
$ curl -ILks http://www.msnbc.com/id/7954620/ |
grep -iE "^HTTP|^location:"
HTTP/1.1 301 Moved Permanently
Location: https://www.msnbc.com/id/7954620/
HTTP/1.1 301 Moved Permanently
Location: http://www.nbcnews.com/id/7954620/
HTTP/1.1 301 Moved Permanently
Location: https://www.nbcnews.com/id/7954620/
HTTP/1.1 301 Moved Permanently
Location: https://www.nbcnews.com/id/wbna7954620
HTTP/1.1 200 OK
Most root URLs first discovered in the early years (1996-2002) are
still linked on the live web and are still being crawled by IA
○ Seems less true for root URLs discovered post-2002
○ This could be due to domain drop catching, which gives the appearance that the URL is alive.
$ curl -i http://www.aggressivecars.com/
HTTP/1.1 302 Found
content-length: 0
date: Thu, 21 Apr 2022 00:10:12 GMT
location: https://www.hugedomains.com/domain_profile.cfm?d=aggressivecars.com
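A redirect like the one above could be flagged offline by comparing the `Location` host against a list of domain-marketplace hosts. This is only a sketch: the set below contains just the one host seen in the example, a real list would need curation, and `redirects_to_marketplace` is a hypothetical name.

```python
from urllib.parse import urlsplit

# Only hugedomains.com appears in the example; any other entry would be an assumption.
MARKETPLACE_HOSTS = frozenset({"hugedomains.com"})


def _host(url: str) -> str:
    """Lower-cased hostname with a leading 'www.' removed."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def redirects_to_marketplace(original_url: str, location: str,
                             marketplaces=MARKETPLACE_HOSTS) -> bool:
    """True if the redirect target is a known marketplace on a different host."""
    target = _host(location)
    return target != _host(original_url) and target in marketplaces
```

A positive result suggests, but does not prove, that the domain was drop-caught: the URL responds, yet the response is unrelated to the originally archived site.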
Starting around 2016, most-archived URLs no longer correlate with user
experiences (e.g., Yahoo, Wikipedia) but are now service/framework URLs
2019
URL  No. of mementos
https://securelb.imodules.com/s/1858/bp/interior.aspx?cid=1063&gid=2&pgid=418&sid=1858  32.6M
https://cognac.fr/?cookie_accepted=false  5.6M
https://yastatic.net/safeframe-bundles/0.69/1-1-0/render.html  5.4M
https://fbsbx.com/captcha/recaptcha/iframe?compact=0&referer=https://www.facebook.com  3.6M
https://sarahdaisy.com/cgi-sys/suspendedpage.cgi  3.4M
These URLs are not part of the standard user experience!
2005
URL  No. of mementos
https://youtube.com/  2.7M
https://tu06.com/  1.3M
https://fasthorses.biz/login.aspx  1.3M
https://ameriplanhealth.com/members.aspx  1.1M
https://wap.lunarstorm.se/log/log_outside.aspx  750.0K
Deep links start appearing among the most popular URLs
1996
URL  No. of mementos
https://bloomberg.com/  3.8M
https://genealogy.com/  3.6M
https://royalkona.com/  2.3M
https://msn.com/  2.3M
https://fma.com/  1.9M
Root URLs are the most popular, which seems to be part of the standard user experience
Summary
We employed various sampling strategies to curate our representative sample of the web. The final dataset contains TimeMaps of 27.3 million URLs
comprising 3.8 billion archived pages from 1996 to 2021.
Challenges and Lessons Learned:
1. The archive's index contains more than HTML pages. We correctly predicted 86% of HTML pages using extensions.
2. Web and archiving capacity have significantly increased over time, so we had fewer URLs in the early years.
3. The percentage of deep links archived, compared to root URLs, has increased over the years.
4. Our initial sample was dominated by the long tail (domains with only 1 URL), so we reduced it by 90%.
5. Popular domains such as Yahoo, Amazon, and Twitter were over-represented. We applied domain-based logarithmic-scale downsampling.
6. Dereferencing and downloading content for 27.3M URLs (and their ~4B mementos) is expensive in both computation and storage.
7. Root URLs tend to collect more mementos than deep links. Most root URLs from the early years are still crawled by IA.
8. Popular URLs in IA after 2016 are no longer end-user HTML pages but include service/framework URLs.

 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 

Lessons Learned From the Longitudinal Sampling of a Large Web Archive

  • 1. ● Lessons Learned From the Longitudinal Sampling of a Large Web Archive ● IIPC WAC 2023 ● @Kritika_Garg, @WebSciDL, @internetarchive @protocollabs
    Kritika Garg (1), Sawood Alam (2), Michele C. Weigle (1), Michael L. Nelson (1), Corentin Barreau (2), Mark Graham (2), Dietrich Ayala (3)
    2023 IIPC Web Archiving Conference (WAC), May 3, 2023
    (1) Web Science & Digital Libraries Research Group, Old Dominion University, Norfolk, Virginia, USA (@WebSciDL)
    (2) Wayback Machine, Internet Archive, San Francisco, California, USA (@internetarchive)
    (3) Protocol Labs, San Francisco, California, USA (@protocollabs)
  • 2. Overview
    We documented the strategies and lessons learned from sampling the archived web. The sample contains 27.3 million URLs with 3.8 billion archived pages and covers each of the 26 years of the Internet Archive's existence, from 1996 to 2021.
  • 3. The motivation for this work was to obtain a "representative sample of the web" that could be used to revisit fundamental questions regarding the web, such as "how long does a web page last?" The commonly cited answers range from 44 to 100 days on average, and all of them come from research that dates back to 1996-2003.
    Sources shown on the slide (labeled 1996, 1996, and 2003):
      https://www.washingtonpost.com/archive/politics/2003/11/24/on-the-web-research-work-proves-ephemeral/959c882f-9ad0-4b36-88cd-fb7411db118d/
      http://web.archive.org/web/19970215093036/http://www.sciam.com:80/0397issue/0397kahle.html
      http://web.archive.org/web/19971011050140/http://www.archive.org/sciam_article.html
  • 4. Curated representative sample using the archived web
    Initial goal: a dataset of 25M URLs (1M URLs for each year of the Internet Archive)
    285 million URLs: Sampled 285M URLs from IA's ZipNum index file, which contains every 6000th line of the CDX index. These include URLs of embedded resources, such as images, CSS, and JavaScript.
    92 million URLs: Filtered the URLs for HTML pages to limit our sample to web pages. Also filtered out invalid URLs and likely URL aliases. Upsampled URLs from the early years.
    27 million URLs: Reduced the number of domains with a single URL. Downsampled URLs of over-represented domains.
  • 5. Archive index contains all kinds of URLs
    We sampled 285M URLs from IA's August 2021 ZipNum index file, which contains every 6000th line of the CDX index. The CDX index also covers embedded resources, such as images, CSS, and JavaScript, so the sample includes them.
    Example URLs from the sample (the slide labels them as JavaScript, CSS, images, robots.txt, and HTML):
      https://brs53.dx.am/scripts/jquery.min.js
      https://wam.ae/js/ar/markets.js
      https:///?dn=renunciationguide.com&flrdr=yes&nxte=css
      https://*/robots.txt
      https://*/robots.txt
      https://174.127.81.0/t/87/3/15/4-320x240.jpg
      https://174.127.81.0/t/87/73/25/1-320x240.jpg
      https://mf.ag/2121_de.gif?exp=24559886473100
      https://127.0.0.1/bb1750.html
      https://163.30.44.17/principal_test
      https://notiche.com.ar/index.php?limitstart=42
  • 6. Filtered the URLs for likely HTML pages based on extensions
    To limit our sample to web pages, we filtered the URLs down to 107M likely HTML pages (based on a trailing slash and filename extensions).
    Heuristic               Example URL
    trailing slash/no ext   https://www.youtube.com/
    .do                     http://example.com/register.do
    .php[0-9]               https://notiche.com.ar/index.php
    .aspx                   https://cigaroasis.asia/contact.aspx
    .cgi                    https://0009.ir/cgi-sys/suspendedpage.cgi
    .pl                     https://007thunderballpoker.com/11-5g-suited-poker-chip/pai-gow-poker-rules.pl
    .asp                    https://0000028.cnelc.com/productshop/newpro.asp
    .jsp                    https://006bai.net/404.jsp
    .cfm                    https://001ok.com/adventure_nz.cfm?nft=1&p=4&t=4
    .[a-z]html              https://city-sat.asia/thread28004.html
    .htm                    http://1st-international.com:80/profiles/16/PersonalBO893.htm
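The extension heuristic above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `is_likely_html` is hypothetical, and treating the digit after `php` and the letter before `html` as optional is an assumption made so that the slide's own example URLs match.

```python
import re
from urllib.parse import urlsplit

# Extensions taken from the heuristic table. The optional digit after "php"
# and the optional letter before "html" are assumptions, not from the slides.
_HTML_EXT = re.compile(
    r"\.(do|php[0-9]?|aspx|cgi|pl|asp|jsp|cfm|[a-z]?html|htm)$",
    re.IGNORECASE,
)

def is_likely_html(url: str) -> bool:
    """Guess whether a URL points to an HTML page from its path alone."""
    path = urlsplit(url).path          # query string and fragment are ignored
    if path == "" or path.endswith("/"):
        return True                    # trailing slash (or bare host)
    last_segment = path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return True                    # no filename extension
    return bool(_HTML_EXT.search(last_segment))
```

For instance, `is_likely_html("https://001ok.com/adventure_nz.cfm?nft=1&p=4&t=4")` is true because only the path is inspected, while a `.js` or `.jpg` URL is rejected.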
  • 7. Datetime of the first archive and MIME type using IA's CDX
    CDX output fields: surt timestamp original-URL mimetype statuscode digest length
    We collected the first CDX entry for each of the 107M likely HTML pages to determine when the URL was first archived (the Memento datetime of the first archive, from the timestamp field) and its MIME type (from the mimetype field).
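As a rough sketch of how the two fields of interest could be read from one CDX line, the following splits a line into the seven fields listed above. The class and function names are hypothetical, and the sample line in the usage note is made up for illustration; it is not real archive data.

```python
from datetime import datetime
from typing import NamedTuple

class CdxEntry(NamedTuple):
    surt: str
    timestamp: datetime   # Memento datetime of the capture
    original_url: str
    mimetype: str
    statuscode: str       # kept as text in case a field holds a non-numeric placeholder
    digest: str
    length: str

def parse_cdx_line(line: str) -> CdxEntry:
    """Split a space-separated CDX line in the field order given on the slide."""
    surt, ts, url, mime, status, digest, length = line.split()
    # CDX timestamps are 14-digit strings of the form YYYYMMDDhhmmss.
    return CdxEntry(surt, datetime.strptime(ts, "%Y%m%d%H%M%S"),
                    url, mime, status, digest, length)
```

With a made-up line such as `"com,example)/ 19961231235959 http://example.com/ text/html 200 HYPOTHETICALDIGEST 1234"`, the parsed entry's `timestamp.year` is 1996 and its `mimetype` is `text/html`.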
  • 8. 86.12% of URLs (92M) were correctly predicted as HTML
    Accuracy of each heuristic used to predict HTML:
    Heuristic               Accuracy
    trailing slash/no ext   83.7%
    .do                     85.1%
    .php[0-9]               88.7%
    .aspx                   90.0%
    .cgi                    90.1%
    .pl                     91.8%
    .asp                    93.7%
    .jsp                    93.7%
    .cfm                    96.7%
    .[a-z]html              97.8%
    .htm                    98.3%
    [Figure: MIME-type distribution of the 107M likely HTML URLs]
  • 9. Significant increase in web and archiving capacity over time
    We grouped the 92 million URLs with a "text/html" MIME type by the year in which each was first archived.
    ○ 2001-2021 exceed 1M URLs each and require downsampling.
    ○ 1996-2000 have fewer than 1M URLs each and require upsampling.
    ○ 2021 has only 8 months of data because the index file used is from August 2021.
    ○ 1996 has partial data because IA's Wayback Machine started in October 1996.
  • 10. Increase in deep links archived over the years; extracted roots from deep links to upsample earlier years
    We identified the ~20M domains in the 92M sample that had no root URL. We extracted their hostnames to form root URLs and then added these missing root URLs to our sample. For example:
      https://reddit.com/r/argentina/comments/1ruebz/cient%c3%adficos_chubutensesi → https://reddit.com/
    Upsampling allowed us to populate URLs in the early years, which hold more interest for our study.
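The root-extraction step can be illustrated with the standard library. This is a minimal sketch under stated assumptions: the function names are hypothetical, the scheme of the deep link is kept, and any explicit port is dropped, none of which is confirmed by the slides.

```python
from urllib.parse import urlsplit

def root_url(url: str) -> str:
    """Form the root URL (scheme://hostname/) of a deep link."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"cannot derive a root URL from {url!r}")
    # Assumption: keep the scheme, drop any port, path, query, and fragment.
    return f"{parts.scheme}://{parts.hostname}/"

def missing_roots(urls: list[str]) -> list[str]:
    """Root URLs implied by the sample but not already present in it."""
    present = set(urls)
    return sorted({root_url(u) for u in urls} - present)
```

Applied to the example above, `root_url` maps the reddit.com deep link to `https://reddit.com/`.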
  • 11. Long-tail distribution; 70% of the domains have just one URL
    [Figure: Distribution of the number of URLs per domain in the 2016 sample]
    For example, in the 2016 sample, 1.7M domains (79%) have just a single URL:
      0000-00-00.com, 00000000000.cn, 000000008.com, blumen-konzelmann.de, ip-37-187-129.eu, jdpiao.com, jsygzh.com, kkradnik.com, sayyum.com, schuimrubbergigant.nl, spd-wuppertal-katernberg.de, tokelezea.com, zzzzy.com, zzzzyyyyggggtest1.com, zzzzz7.com
    The long tail includes domains that are likely not part of most users' experience, although we cannot be sure for foreign sites.
    For yearly samples with longer tails, we kept only 10% of the domains with a single URL.
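One way to thin the long tail is to group URLs by domain and keep a random 10% of the single-URL domains. The sketch below is illustrative only: the function name, the fixed seed, and the use of the full hostname as the "domain" are assumptions, and the slides do not say how the retained 10% were chosen.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit

def thin_long_tail(urls, keep_fraction=0.10, seed=0):
    """Keep all multi-URL domains but only a fraction of single-URL domains."""
    by_domain = defaultdict(list)
    for url in urls:
        # Assumption: the hostname stands in for the domain.
        by_domain[urlsplit(url).hostname].append(url)

    singles = sorted(d for d, group in by_domain.items() if len(group) == 1)
    rng = random.Random(seed)  # seeded so the selection is reproducible
    kept = set(rng.sample(singles, round(len(singles) * keep_fraction)))

    return [url
            for domain, group in by_domain.items()
            if len(group) > 1 or domain in kept
            for url in group]
```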
  • 12. Popular domains (e.g., amazon.com, yahoo.com) are over-represented
    Clustered the early years (1996-2000) to reach 1M URLs.
    1996-2000
    Domain                 No. of URLs
    amazon.com             16.8K
    yahoo.com              13.5K
    geocities.com          12.1K
    infospace.com          9.6K
    aol.com                5.2K
    tripod.com             2.8K
    msn.com                2.8K
    wunderground.com       2.8K
    excite.com             2.7K
    surfers-paradise.com   2.6K
    2001
    Domain                 No. of URLs
    yahoo.com              11.7K
    geocities.com          6.6K
    free.fr                3.3K
    tripod.com             3.2K
    amazon.com             2.0K
    angelfire.com          2.0K
    hypermart.net          1.9K
    homestead.com          1.8K
    sun.com                1.8K
    sina.com.cn            1.8K
    2002
    Domain                 No. of URLs
    yahoo.com              11.4K
    geocities.com          5.7K
    2ch.net                4.3K
    amazon.com             3.2K
    daum.net               3.2K
    free.fr                3.0K
    sohu.com               2.3K
    infoseek.co.jp         2.0K
    sina.com.cn            2.0K
    yahoo.co.jp            1.9K
  • 13. Logarithmic-scale downsampling to reduce over-sampled domains
    We don't require 1.3 million github URLs! Having a lot of URLs for a single domain eventually has diminishing returns. It is sufficient to say that we can have coverage of github.com with a smaller quantity.
    sample_urls = min(N, K * log(N) + C), where N = number of URLs sharing a domain
      C                        Include up to C URLs from the same domain
      log(N) + C               Beyond C, sample URLs on a log scale
      K * log(N) + C           Sample K multiples of the log scale to relax the downsampling
      min(N, K * log(N) + C)   Ensure that the samples are not more than the available URLs under a domain
    1.3M URLs for github.com reduced to 234 URLs:
      https://github.com/adelcambre
      https://github.com/akitaonrails/i18n_demo_app/tree/master
      https://github.com/alx
      https://github.com/anotherjesse/s3/watchers
      https://github.com/280north/cappuccino/issues
      https://github.com/aaronrussell/gh_repo_recommender
      https://github.com/00amy/intelligent-tutoring-system
      https://github.com/00lenon/thediamondknight
      https://github.com/01045972746/tensor-example
      …
    3 URLs for peaceinspire.com stay as they are:
      https://peaceinspire.com/2007/07/28
      https://peaceinspire.com/song-lyrics/english-songs
      https://peaceinspire.com/2008/09/01/give-the-lord-your-heart
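The formula translates directly into code. The sketch below is a hedged illustration rather than the authors' implementation: the slides do not state the logarithm base or the values of K and C, so the natural logarithm and the sample parameters used here are assumptions, as are the function names.

```python
import math
import random

def sample_size(n: int, k: float, c: float) -> int:
    """Number of URLs to keep for a domain with n URLs: min(N, K*log(N) + C)."""
    if n <= 0:
        return 0
    # Assumption: natural logarithm; the slides do not specify the base.
    return min(n, int(k * math.log(n) + c))

def downsample_domain(urls: list[str], k: float, c: float, seed: int = 0) -> list[str]:
    """Randomly keep sample_size(len(urls)) URLs from one domain."""
    rng = random.Random(seed)  # seeded for reproducibility
    return rng.sample(urls, sample_size(len(urls), k, c))
```

With the illustrative values K = 10 and C = 50, a domain with 3 URLs keeps all 3, while a domain with 1.3 million URLs keeps only 190. The slide's figure of 234 for github.com reflects the authors' own (unstated) parameters.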
  • 14. Logarithmic-scale downsampling to reach around 1M URLs for each year
    We applied this technique to every domain in each yearly sample. We adjusted the parameters K and C to get almost 1M URLs in total for each year while ensuring fairness in the domain representation.
    sample_urls = min(N, K * log(N) + C), where N = number of URLs sharing a domain
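The slides do not describe how K and C were adjusted. One plausible approach, shown here purely as an assumption, is to fix C and bisect on K, since the yearly total never decreases as K grows. The function names, the search bounds, and the natural logarithm are all hypothetical choices.

```python
import math

def total_after_downsampling(domain_counts, k, c):
    """Total URLs kept in a yearly sample for the given K and C."""
    return sum(min(n, int(k * math.log(n) + c)) for n in domain_counts if n > 0)

def tune_k(domain_counts, target, c, lo=0.0, hi=1000.0, iterations=50):
    """Bisect for the largest K in [lo, hi] whose total stays at or below target."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if total_after_downsampling(domain_counts, mid, c) > target:
            hi = mid   # too many URLs kept: tighten the downsampling
        else:
            lo = mid   # still within budget: relax the downsampling
    return lo
```

Here `domain_counts` would be the per-domain URL counts for one year and `target` roughly 1,000,000.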
  • 15. Downsampling largely flattened out the large discrepancy in the number of URLs per domain
    We have 7M unique domains in the 27.3M URL sample.
    Our formula does not strictly maintain the ordering. For example, yahoo.com, which was in 8th place before downsampling, is ranked 1st afterwards. The two rankings are nonetheless highly correlated (99.4%).
    Top 20 domains before downsampling
    No.  Domain          URLs
    1    google.com      1.8M
    2    github.com      1.3M
    3    reddit.com      1.1M
    4    youtube.com     866.6K
    5    tumblr.com      685.3K
    6    wordpress.com   577.4K
    7    blogspot.com    521.2K
    8    yahoo.com       456.9K
    9    facebook.com    308.3K
    10   instagram.com   245.3K
    11   bebo.com        229.7K
    12   amazon.com      200.5K
    13   url.cn          196.2K
    14   webshots.com    191.6K
    15   twitpic.com     160.5K
    16   wikipedia.org   147.7K
    17   webs.com        140.6K
    18   verizon.net     140.1K
    19   qq.com          139.0K
    20   hyves.nl        124.6K
    Top 20 domains after downsampling
    No.  Domain          R-URLs
    1    yahoo.com       489
    2    blogspot.com    463
    3    google.com      459
    4    amazon.com      453
    5    wikipedia.org   439
    6    house.gov       417
    7    msn.com         401
    8    yahoo.co.jp     396
    9    wordpress.com   394
    10   cnn.com         391
    11   ca.gov          387
    12   ebay.com        383
    13   go.com          382
    14   amazon.de       382
    15   microsoft.com   380
    16   senate.gov      379
    17   sina.com.cn     378
    18   amazon.co.uk    377
    19   nih.gov         376
    20   amazon.co.jp    376
  • 16. Dereferencing and downloading content from the archive for 27.3M URLs is expensive
    URL (e.g., http://facebook.com) → TimeMap → Mementos
    TimeMaps: 27.3M TimeMaps for 27.3M URLs; 1.4TB storage for TimeMaps; cost: ~22 days (0.07s/TimeMap)
    Mementos: 3.93B total mementos; cost to download: ~172 yrs (1.40s/memento)
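The cost estimates follow from multiplying the item counts by the per-item times, assuming strictly sequential requests. A quick check, with the 365-day year and the function name as assumptions:

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: 365-day year

def sequential_cost_seconds(items: float, seconds_per_item: float) -> float:
    """Total time if every item is fetched one after another."""
    return items * seconds_per_item

# 27.3M TimeMaps at 0.07 s each
timemap_days = sequential_cost_seconds(27.3e6, 0.07) / SECONDS_PER_DAY

# 3.93B mementos at 1.40 s each
memento_years = sequential_cost_seconds(3.93e9, 1.40) / SECONDS_PER_YEAR

print(f"TimeMaps: about {timemap_days:.1f} days")
print(f"Mementos: about {memento_years:.0f} years")
```

The TimeMap figure comes out at roughly 22 days, matching the slide. The memento figure lands a little above the slide's ~172 years, a gap that is presumably due to rounding in the quoted per-memento time.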
  • 17. Over time, root URLs tend to collect more mementos than deep links
    Animation: https://observablehq.com/d/21b995649a9d3b33#cell-3
  • 18. Over time, root URLs tend to collect more mementos than deep links
    Animation: https://observablehq.com/d/21b995649a9d3b33#cell-3
    ○ Later years have fewer root URLs.
    ○ Early years have fewer deep links.
    ○ For all yearly samples except 2021, deep links are below the diagonal. The 2021 exception is caused by some deep links that have more than 100M mementos.
  • 19. Some deep links are not crawled or are rarely crawled
    http://web.archive.org/web/20050523203823/http://www.msnbc.com/
    http://web.archive.org/web/20050523203823/http://www.msnbc.com/id/7954620/
    www.msnbc.com/id/7954620/ exists on the live web (even if it redirects), but no live web page links to it (it is not indexed in Google). The page is not archived even though we discovered it from a memento.
      $ curl -ILks http://www.msnbc.com/id/7954620/ | grep -iE "^HTTP|^location:"
      HTTP/1.1 301 Moved Permanently
      Location: https://www.msnbc.com/id/7954620/
      HTTP/1.1 301 Moved Permanently
      Location: http://www.nbcnews.com/id/7954620/
      HTTP/1.1 301 Moved Permanently
      Location: https://www.nbcnews.com/id/7954620/
      HTTP/1.1 301 Moved Permanently
      Location: https://www.nbcnews.com/id/wbna7954620
      HTTP/1.1 200 OK
  • 20. Most root URLs first discovered in the early years (1996-2002) are still linked on the live web and are still being crawled by IA
    ○ This seems less true for root URLs discovered after 2002.
    ○ This could be due to domain drop catching, which gives the appearance that a URL is alive.
      $ curl -i http://www.aggressivecars.com/
      HTTP/1.1 302 Found
      content-length: 0
      date: Thu, 21 Apr 2022 00:10:12 GMT
      location: https://www.hugedomains.com/domain_profile.cfm?d=aggressivecars.com
  • 21. Starting around 2016, most-archived URLs no longer correlate with user experiences (e.g., Yahoo, Wikipedia) but are now service/framework URLs
    1996: Root URLs are the most popular, which seems to be part of the standard user experience.
    URL                                               No. of mementos
    https://bloomberg.com/                            3.8M
    https://genealogy.com/                            3.6M
    https://royalkona.com/                            2.3M
    https://msn.com/                                  2.3M
    https://fma.com/                                  1.9M
    2005: Deep links start appearing among the most popular URLs.
    URL                                               No. of mementos
    https://youtube.com/                              2.7M
    https://tu06.com/                                 1.3M
    https://fasthorses.biz/login.aspx                 1.3M
    https://ameriplanhealth.com/members.aspx          1.1M
    https://wap.lunarstorm.se/log/log_outside.aspx    750.0K
    2019: These URLs are not part of the standard user experience!
    URL                                               No. of mementos
    https://securelb.imodules.com/s/1858/bp/interior.aspx?cid=1063&gid=2&pgid=418&sid=1858    32.6M
    https://cognac.fr/?cookie_accepted=false          5.6M
    https://yastatic.net/safeframe-bundles/0.69/1-1-0/render.html    5.4M
    https://fbsbx.com/captcha/recaptcha/iframe?compact=0&referer=https://www.facebook.com    3.6M
    https://sarahdaisy.com/cgi-sys/suspendedpage.cgi  3.4M
  • 22. Summary
    We employed various sampling strategies to curate our representative sample of the web. The final dataset contains TimeMaps of 27.3 million URLs comprising 3.8 billion archived pages from 1996 to 2021.
    Challenges and lessons learned:
    1. The archive's index contains more than HTML pages. We correctly predicted 86% of HTML pages using extensions.
    2. Web and archiving capacity have increased significantly over time, so we had fewer URLs in the early years.
    3. The percentage of deep links archived, compared to root URLs, has increased over the years.
    4. Our initial sample was dominated by the long tail (domains with only 1 URL), so we reduced it by 90%.
    5. Popular domains such as Yahoo, Amazon, and Twitter were over-represented. We applied domain-based logarithmic-scale downsampling.
    6. Dereferencing and downloading content for 27.3M URLs (and their ~4B mementos) is expensive in both computation and storage.
    7. Root URLs tend to collect more mementos than deep links. Most root URLs from the early years are still crawled by IA.
    8. Popular URLs in IA after 2016 are no longer end-user HTML pages but include service/framework URLs.