Lessons Learned From the Longitudinal Sampling of a Large Web Archive
1. ● Lessons Learned From the Longitudinal Sampling of a Large Web Archive ● IIPC WAC 2023 ● @Kritika_Garg, @WebSciDL, @internetarchive @protocollabs
Kritika Garg¹, Sawood Alam², Michele C. Weigle¹, Michael L. Nelson¹, Corentin Barreau², Mark Graham², Dietrich Ayala³

2023 IIPC Web Archiving Conference (WAC)
May 3, 2023

¹ Web Science & Digital Libraries Research Group, Old Dominion University, Norfolk, Virginia, USA (@WebSciDL)
² Wayback Machine, Internet Archive, San Francisco, California, USA (@internetarchive)
³ Protocol Labs, San Francisco, California, USA (@protocollabs)
Overview

We documented the strategies and lessons learned from sampling the archived web: we collected roughly one million URLs for each of the 26 years of the Internet Archive's existence (1996-2021), for a total of 27.3 million URLs with 3.8 billion archived pages.
The motivation for this work was to obtain a "representative sample of the web" that could be used to revisit fundamental questions about the web, such as "how long does a web page last?" The commonly cited answers, "44-100 days on average", all come from research dating back to 1996-2003.

https://www.washingtonpost.com/archive/politics/2003/11/24/on-the-web-research-work-proves-ephemeral/959c882f-9ad0-4b36-88cd-fb7411db118d/
http://web.archive.org/web/19970215093036/http://www.sciam.com:80/0397issue/0397kahle.html
http://web.archive.org/web/19971011050140/http://www.archive.org/sciam_article.html
Curated a representative sample using the archived web

Initial goal: a dataset of 25M URLs (1M URLs for each year of the Internet Archive)

285 million URLs: Sampled from IA's ZipNum index file, which contains every 6000th line of the CDX index. These include URLs of embedded resources, such as images, CSS, and JavaScript.

92 million URLs: Filtered the URLs for HTML pages to limit our samples to web pages. Also filtered out invalid URLs and likely URL aliases, and upsampled URLs from the early years.

27 million URLs: Reduced the number of domains with a single URL and downsampled URLs of over-represented domains.
The archive index contains all kinds of URLs

We sampled 285M URLs from IA's ZipNum index file of August 2021, which contains every 6000th line of the CDX index and includes URLs of embedded resources, such as JavaScript, CSS, images, robots.txt, and HTML. For example:

https://brs53.dx.am/scripts/jquery.min.js
https://wam.ae/js/ar/markets.js
https:///?dn=renunciationguide.com&flrdr=yes&nxte=css
https://*/robots.txt
https://174.127.81.0/t/87/3/15/4-320x240.jpg
https://174.127.81.0/t/87/73/25/1-320x240.jpg
https://mf.ag/2121_de.gif?exp=24559886473100
https://127.0.0.1/bb1750.html
https://163.30.44.17/principal_test
https://notiche.com.ar/index.php?limitstart=42
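The systematic-sampling idea behind the ZipNum index (keeping every 6000th line of the full CDX index) can be sketched as follows; the function name and interface are illustrative, not IA's actual tooling:

```python
def systematic_sample(lines, step=6000):
    """Yield every `step`-th line of an index, starting with the first.

    This mirrors how IA's ZipNum index summarizes the CDX index by
    keeping every 6000th line.
    """
    for i, line in enumerate(lines):
        if i % step == 0:
            yield line
```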
Filtered the URLs for likely HTML pages based on extensions

To limit our samples to web pages, we filtered the URLs down to 107M likely HTML pages (based on trailing slashes and filename extensions).
Heuristic Example URL
trailing slash/no ext https://www.youtube.com/
.do http://example.com/register.do
.php[0-9] https://notiche.com.ar/index.php
.aspx https://cigaroasis.asia/contact.aspx
.cgi https://0009.ir/cgi-sys/suspendedpage.cgi
.pl https://007thunderballpoker.com/11-5g-suited-poker-chip/pai-gow-poker-rules.pl
.asp https://0000028.cnelc.com/productshop/newpro.asp
.jsp https://006bai.net/404.jsp
.cfm https://001ok.com/adventure_nz.cfm?nft=1&p=4&t=4
.[a-z]html https://city-sat.asia/thread28004.html
.htm http://1st-international.com:80/profiles/16/PersonalBO893.htm
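The heuristics in the table above could be approximated like this; the exact patterns used in the study may differ, so treat this as a sketch:

```python
import re
from urllib.parse import urlparse

# Approximate re-implementation of the extension heuristics in the table
# above; the exact patterns used in the study may differ.
HTML_EXT = re.compile(r"\.(do|php[0-9]?|aspx|cgi|pl|asp|jsp|cfm|[a-z]?html|htm)$",
                      re.IGNORECASE)

def is_likely_html(url: str) -> bool:
    """True if the URL looks like an HTML page: trailing slash,
    no extension in the last path segment, or an HTML-ish extension."""
    path = urlparse(url).path
    last_segment = path.rsplit("/", 1)[-1]
    if path.endswith("/") or "." not in last_segment:
        return True
    return bool(HTML_EXT.search(last_segment))
```

Note that the query string is ignored, so `index.php?limitstart=42` still matches the `.php` heuristic.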
Datetime of the first archive and MIME type using IA's CDX

We collected the first CDX entry for each of the 107M likely HTML pages to determine the memento datetime of the first archive and the MIME type of each URL.

CDX output format: surt timestamp original-URL mimetype statuscode digest length
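A minimal sketch of reading that information from a CDX entry, assuming the space-separated field order shown above (real CDX files can declare variant field orders in their header):

```python
from collections import namedtuple
from datetime import datetime

# Field names follow the CDX output shown above:
# surt timestamp original-URL mimetype statuscode digest length
CDXEntry = namedtuple("CDXEntry",
                      "surt timestamp original mimetype statuscode digest length")

def parse_cdx_line(line: str) -> CDXEntry:
    """Parse one space-separated CDX index line."""
    return CDXEntry(*line.split())

def first_capture_info(entry: CDXEntry):
    """Memento datetime of the first archive plus the MIME type."""
    return datetime.strptime(entry.timestamp, "%Y%m%d%H%M%S"), entry.mimetype
```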
86.12% of the URLs (92M) were correctly predicted as HTML

Accuracy of each heuristic used to predict HTML:
trailing slash/no ext 83.7%
.do 85.1%
.php[0-9] 88.7%
.aspx 90.0%
.cgi 90.1%
.pl 91.8%
.asp 93.7%
.jsp 93.7%
.cfm 96.7%
.[a-z]html 97.8%
.htm 98.3%
MIME-type distribution of 107M likely HTML URLs
Significant increase in web and archiving capacity over time

We grouped the 92 million URLs with "text/html" MIME types by the year each was first archived.

1996-2000: fewer than 1M URLs per year; these years require upsampling.
2001-2021: more than 1M URLs per year; these years require downsampling.
1996 has partial data, as IA's Wayback Machine started in October 1996.
2021 has only 8 months of data, as the index file used is from August 2021.
Increase in deep links archived over the years;
extracted roots from deep links to upsample earlier years

We identified the ~20M domains in the 92M sample with no root URLs. We extracted hostnames to form root URLs and then added these missing root URLs to our sample.

For example:
https://reddit.com/r/argentina/comments/1ruebz/cient%c3%adficos_chubutensesi → https://reddit.com/

Upsampling allowed us to populate URLs in the early years, which hold more interest for our study.
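The root-URL extraction above can be sketched with the standard library; the function name is illustrative:

```python
from urllib.parse import urlparse

def root_url(deep_link: str) -> str:
    """Derive the root URL (scheme + hostname + '/') from a deep link,
    as in the reddit.com example above. Note this drops any port and
    lowercases the hostname."""
    parts = urlparse(deep_link)
    return f"{parts.scheme}://{parts.hostname}/"
```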
Long-tail distribution; 70% of the domains have just one URL

Distribution of the number of URLs per domain in the 2016 sample: 1.7M domains (79%) have just a single URL. For example:

0000-00-00.com
00000000000.cn
000000008.com
blumen-konzelmann.de
ip-37-187-129.eu
jdpiao.com
jsygzh.com
kkradnik.com
sayyum.com
schuimrubbergigant.nl
spd-wuppertal-katernberg.de
tokelezea.com
zzzzy.com
zzzzyyyyggggtest1.com
zzzzz7.com

Some of this long tail features domains that are likely not part of most users' experience, although we can't be sure for foreign sites.

We kept only 10% of the domains with a single URL for yearly samples with longer tails.
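The "keep only 10% of single-URL domains" step might look like the sketch below; the `domain_to_urls` mapping and function name are hypothetical, not the study's actual data structure:

```python
import random

def downsample_single_url_domains(domain_to_urls, keep_fraction=0.1, seed=2016):
    """Keep every multi-URL domain, but only a random `keep_fraction`
    of the domains contributing a single URL to a yearly sample.

    `domain_to_urls` maps domain -> list of sampled URLs (hypothetical
    structure for illustration)."""
    rng = random.Random(seed)  # fixed seed for reproducible samples
    singles = sorted(d for d, urls in domain_to_urls.items() if len(urls) == 1)
    kept_singles = set(rng.sample(singles, round(len(singles) * keep_fraction)))
    return {d: urls for d, urls in domain_to_urls.items()
            if len(urls) > 1 or d in kept_singles}
```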
Popular domains (e.g., amazon.com, yahoo.com) are over-represented

Top domains by number of URLs (we clustered the early years, 1996-2000, to reach 1M URLs):

1996-2000:
amazon.com 16.8K
yahoo.com 13.5K
geocities.com 12.1K
infospace.com 9.6K
aol.com 5.2K
tripod.com 2.8K
msn.com 2.8K
wunderground.com 2.8K
excite.com 2.7K
surfers-paradise.com 2.6K

2001:
yahoo.com 11.7K
geocities.com 6.6K
free.fr 3.3K
tripod.com 3.2K
amazon.com 2.0K
angelfire.com 2.0K
hypermart.net 1.9K
homestead.com 1.8K
sun.com 1.8K
sina.com.cn 1.8K

2002:
yahoo.com 11.4K
geocities.com 5.7K
2ch.net 4.3K
amazon.com 3.2K
daum.net 3.2K
free.fr 3.0K
sohu.com 2.3K
infoseek.co.jp 2.0K
sina.com.cn 2.0K
yahoo.co.jp 1.9K
Logarithmic-scale downsampling to reduce over-sampled domains

We don't require 1.3 million github URLs! Having a lot of URLs for a single domain eventually has diminishing returns; we can get sufficient coverage of github.com with far fewer URLs.

1.3M URLs for github.com reduced to 234 URLs:
https://github.com/adelcambre
https://github.com/akitaonrails/i18n_demo_app/tree/master
https://github.com/alx
https://github.com/anotherjesse/s3/watchers
https://github.com/280north/cappuccino/issues
https://github.com/aaronrussell/gh_repo_recommender
https://github.com/00amy/intelligent-tutoring-system
https://github.com/00lenon/thediamondknight
https://github.com/01045972746/tensor-example
…

3 URLs for peaceinspire.com stay as they are:
https://peaceinspire.com/2007/07/28
https://peaceinspire.com/song-lyrics/english-songs
https://peaceinspire.com/2008/09/01/give-the-lord-your-heart

sample_urls = min(N, K * log(N) + C)

N: number of URLs sharing a domain
C: include up to C URLs from the same domain
log(N): beyond C, sample URLs on a log scale
K: multiply the log term by K to relax the downsampling
min(N, …): ensure we never sample more URLs than are available under a domain
Logarithmic-scale downsampling to reach around 1M URLs for each year

We applied this technique to every domain in each yearly sample. We adjusted the parameters K and C to reach roughly 1M URLs in total for each year while ensuring fairness in domain representation.

sample_urls = min(N, K * log(N) + C), where N = number of URLs sharing a domain
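The formula is straightforward to implement; the K and C defaults below are illustrative stand-ins, since the study tuned them per year:

```python
import math

def sample_size(n: int, k: float = 16.0, c: int = 10) -> int:
    """How many URLs to keep for a domain with n URLs:
    min(N, K * log(N) + C).

    The k and c defaults here are illustrative; the study adjusted
    them per year so the yearly totals came to roughly 1M URLs."""
    if n <= 0:
        return 0
    return min(n, math.ceil(k * math.log(n) + c))
```

With these illustrative parameters, a domain with 1.3M URLs (like github.com) keeps only a couple of hundred URLs, while a 3-URL domain (like peaceinspire.com) keeps all 3, because min(N, …) caps the result at the available URLs.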
Downsampling largely flattened out the large discrepancy in the number of URLs per domain

Top 20 domains before downsampling (URLs):
1 google.com 1.8M
2 github.com 1.3M
3 reddit.com 1.1M
4 youtube.com 866.6K
5 tumblr.com 685.3K
6 wordpress.com 577.4K
7 blogspot.com 521.2K
8 yahoo.com 456.9K
9 facebook.com 308.3K
10 instagram.com 245.3K
11 bebo.com 229.7K
12 amazon.com 200.5K
13 url.cn 196.2K
14 webshots.com 191.6K
15 twitpic.com 160.5K
16 wikipedia.org 147.7K
17 webs.com 140.6K
18 verizon.net 140.1K
19 qq.com 139.0K
20 hyves.nl 124.6K

Top 20 domains after downsampling (R-URLs):
1 yahoo.com 489
2 blogspot.com 463
3 google.com 459
4 amazon.com 453
5 wikipedia.org 439
6 house.gov 417
7 msn.com 401
8 yahoo.co.jp 396
9 wordpress.com 394
10 cnn.com 391
11 ca.gov 387
12 ebay.com 383
13 go.com 382
14 amazon.de 382
15 microsoft.com 380
16 senate.gov 379
17 sina.com.cn 378
18 amazon.co.uk 377
19 nih.gov 376
20 amazon.co.jp 376

Our formula does not strictly maintain the ordering: yahoo.com, which was in 8th place before downsampling, is now ranked 1st. The two rankings are highly correlated (99.4%).

We have 7M unique domains in the 27.3M URL sample.
Dereferencing and downloading content from the archive for 27.3M URLs is expensive

Each URL (e.g., http://facebook.com) resolves to a TimeMap, which lists all of its mementos.

27.3M TimeMaps: 1.4TB of storage; cost to collect: ~22 days (0.07s/TimeMap).
27.3M URLs: 3.93B total mementos; cost to download: ~172 yrs (1.40s/memento).
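A back-of-the-envelope check of these cost figures, using the per-item timings from the slide (0.07s per TimeMap, 1.40s per memento); the exact quoted totals depend on rounding of the inputs:

```python
# Sanity-check the collection-cost estimates from the slide.
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

timemap_days = 27_300_000 * 0.07 / SECONDS_PER_DAY
memento_years = 3_930_000_000 * 1.40 / SECONDS_PER_YEAR

print(f"TimeMaps: ~{timemap_days:.0f} days")    # ~22 days
print(f"Mementos: ~{memento_years:.0f} years")  # on the order of 170+ years
```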
Over time, root URLs tend to collect more mementos than deep links

Animation: https://observablehq.com/d/21b995649a9d3b33#cell-3
Early years have fewer deep links; later years have fewer root URLs.

For all yearly samples, deep links are below the diagonal except for 2021, because some 2021 deep links have more than 100M mementos.
Some deep links are not crawled or are rarely crawled

http://web.archive.org/web/20050523203823/http://www.msnbc.com/
http://web.archive.org/web/20050523203823/http://www.msnbc.com/id/7954620/

www.msnbc.com/id/7954620/ exists on the live web (even if it redirects), but no live web page links to it (it is not indexed in Google). The page is not archived even though we discovered it from a memento.

$ curl -ILks http://www.msnbc.com/id/7954620/ | grep -iE "^HTTP|^location:"
HTTP/1.1 301 Moved Permanently
Location: https://www.msnbc.com/id/7954620/
HTTP/1.1 301 Moved Permanently
Location: http://www.nbcnews.com/id/7954620/
HTTP/1.1 301 Moved Permanently
Location: https://www.nbcnews.com/id/7954620/
HTTP/1.1 301 Moved Permanently
Location: https://www.nbcnews.com/id/wbna7954620
HTTP/1.1 200 OK
Most root URLs first discovered in the early years (1996-2002) are still linked on the live web and are still being crawled by IA
○ This seems less true for root URLs discovered post-2002.
○ This could be due to domain drop-catching, which gives the appearance that a URL is alive.

$ curl -i http://www.aggressivecars.com/
HTTP/1.1 302 Found
content-length: 0
date: Thu, 21 Apr 2022 00:10:12 GMT
location: https://www.hugedomains.com/domain_profile.cfm?d=aggressivecars.com
Starting around 2016, the most-archived URLs no longer correlate with user experiences (e.g., Yahoo, Wikipedia) but are now service/framework URLs

1996 (root URLs are the most popular, and they seem to be part of the standard user experience):
URL — No. of mementos
https://bloomberg.com/ 3.8M
https://genealogy.com/ 3.6M
https://royalkona.com/ 2.3M
https://msn.com/ 2.3M
https://fma.com/ 1.9M

2005 (deep links start appearing among the most popular URLs):
https://youtube.com/ 2.7M
https://tu06.com/ 1.3M
https://fasthorses.biz/login.aspx 1.3M
https://ameriplanhealth.com/members.aspx 1.1M
https://wap.lunarstorm.se/log/log_outside.aspx 750.0K

2019 (these URLs are not part of the standard user experience!):
https://securelb.imodules.com/s/1858/bp/interior.aspx?cid=1063&gid=2&pgid=418&sid=1858 32.6M
https://cognac.fr/?cookie_accepted=false 5.6M
https://yastatic.net/safeframe-bundles/0.69/1-1-0/render.html 5.4M
https://fbsbx.com/captcha/recaptcha/iframe?compact=0&referer=https://www.facebook.com 3.6M
https://sarahdaisy.com/cgi-sys/suspendedpage.cgi 3.4M
Summary

We employed various sampling strategies to curate our representative sample of the web. The final dataset contains TimeMaps of 27.3 million URLs comprising 3.8 billion archived pages from 1996 to 2021.

Challenges and Lessons Learned:
1. The archive's index contains more than HTML pages. We correctly predicted 86% of HTML pages using extensions.
2. Web and archiving capacity have significantly increased over time, so we had fewer URLs in the early years.
3. The percentage of deep links archived, compared to root URLs, has increased over the years.
4. Our initial sample was dominated by the long tail (domains with only 1 URL), so we reduced it by 90%.
5. Popular domains such as Yahoo, Amazon, and Twitter were over-represented, so we applied domain-based logarithmic-scale downsampling.
6. Dereferencing and downloading content from 27.3M URLs (and their ~4B mementos) carries significant computational and storage expense.
7. Root URLs tend to collect more mementos than deep links. Most root URLs from the early years are still crawled by IA.
8. The most popular URLs in IA after 2016 are no longer end-user HTML pages but include service/framework URLs.