2. Why
• We want to know more about “National Collections” and “National
Webs”
• Many Web Archives are not accessible through the Internet
• Often only aggregate statistics are published (e.g. XX PB in the archive)
• Little information per country-code TLD (ccTLD)
• Little information on overlap between archives
• WARC & CDX files are too big
• Need for “lowest common denominator” data that is still “rich enough”
5. Possible alternative sources
• Or another search engine index that holds information about the count and size of MIME types per domain
6. What is the summary file?
• A file with an entry per 2nd level domain and summary info per year
about the number of files and their size:
• HTML
• CSS
• Images
• PDF
• Video
• Audio
• JavaScript
• JSON (JavaScript Object Notation)
• Fonts
• HTTP vs HTTPS (secure Web)
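A minimal sketch of how such a summary could be built, assuming input records of the form (domain, year, MIME type, scheme, size in bytes). The mapping tables, the "other" fallback category, and the function names simplify_mime and summarize are illustrative assumptions, not the format or code actually used for the summary file.

```python
from collections import defaultdict

# Hypothetical mapping from server-reported MIME types to the categories
# listed on the slide. The exact mapping is an assumption.
EXACT = {
    "text/html": "html",
    "application/xhtml+xml": "html",
    "text/css": "css",
    "application/pdf": "pdf",
    "application/javascript": "javascript",
    "text/javascript": "javascript",
    "application/x-javascript": "javascript",
    "application/json": "json",
}
PREFIX = {"image/": "image", "video/": "video", "audio/": "audio", "font/": "font"}


def simplify_mime(mime):
    """Reduce a raw MIME string to one coarse category ("other" if unknown)."""
    mime = (mime or "").split(";")[0].strip().lower()
    if mime in EXACT:
        return EXACT[mime]
    for prefix, category in PREFIX.items():
        if mime.startswith(prefix):
            return category
    return "other"


def summarize(records):
    """Aggregate (domain, year, mime, scheme, size) records.

    Returns {(domain, year): {category_or_scheme: {"count": n, "bytes": b}}}.
    """
    summary = defaultdict(lambda: defaultdict(lambda: {"count": 0, "bytes": 0}))
    for domain, year, mime, scheme, size in records:
        # Each capture is counted once under its MIME category
        # and once under its scheme (http or https).
        for key in (simplify_mime(mime), scheme):
            cell = summary[(domain, year)][key]
            cell["count"] += 1
            cell["bytes"] += size
    return summary


if __name__ == "__main__":
    demo = [
        ("example.lu", 2019, "text/html; charset=UTF-8", "https", 1000),
        ("example.lu", 2019, "image/png", "https", 5000),
    ]
    for key, cells in summarize(demo).items():
        print(key, dict(cells))
```

Counting each capture under both its category and its scheme keeps one flat set of counters per domain and year, at the cost of the two groups summing to the same total.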
11. First letter frequency in IA 2nd-level domains vs French words
[Figure: bar chart. X-axis: first letter, A to Z. Y-axis: 0 to 12. Series: French wordlist, .fr domains]
12. First letter frequency in IA 2nd-level domains vs French words
[Figure: bar chart. X-axis: first letter, A to Z. Y-axis: -4 to 6]
14. Data source Luxembourg Web Archive
• Established in 2016
• CDXJ files on disk
• Run programs locally
15. Data source Internet Archive
• CDX Server API at:
http://web.archive.org/cdx/search/cdx?url=lu
• Download using:
https://github.com/ikreymer/cdx-index-client
• Data downloaded for:
               lu      dk      be      fr     frl      nl
should have    41136   67843   71202   311813  42      230871
actually have  41136   42037   71202   303282  42      205147
missing (%)    0.00%   38.04%  0.00%   2.74%   0.00%   11.14%
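The "missing (%)" row follows from the two rows above it as (should have - actually have) / should have. A short check using the slide's own figures; the function name missing_percent is only illustrative.

```python
TLDS = ["lu", "dk", "be", "fr", "frl", "nl"]
SHOULD_HAVE = [41136, 67843, 71202, 311813, 42, 230871]
ACTUALLY_HAVE = [41136, 42037, 71202, 303282, 42, 205147]


def missing_percent(expected, actual):
    """Share of the expected items that were not downloaded, in percent."""
    return 100.0 * (expected - actual) / expected


if __name__ == "__main__":
    for tld, expected, actual in zip(TLDS, SHOULD_HAVE, ACTUALLY_HAVE):
        print(f"{tld:>4} {missing_percent(expected, actual):6.2f}%")
```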
16. Data source Common crawl
• Hosted on Amazon S3
• Recipe at:
https://groups.google.com/g/common-crawl/c/3QmQjFA_3y4/m/vTbhGqIBBQAJ
• Download CDX / CDXJ and process locally
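For local processing, each CDXJ line can be split into a SURT key, a 14-digit timestamp, and a JSON block. The sketch below assumes that layout and the JSON keys url, mime and length; the sample line is made up for illustration, and parse_cdxj_line is a hypothetical helper, not part of any downloaded tool.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into the fields needed for a per-year summary."""
    surt, timestamp, payload = line.rstrip("\n").split(" ", 2)
    fields = json.loads(payload)
    return {
        "surt": surt,
        "year": int(timestamp[:4]),  # timestamp starts with YYYY
        "url": fields.get("url"),
        "mime": fields.get("mime"),  # may be absent in some records
        "length": int(fields.get("length") or 0),
    }


# Made-up sample line, for illustration only.
SAMPLE = (
    'lu,example)/ 20190102030405 '
    '{"url": "https://example.lu/", "mime": "text/html", '
    '"status": "200", "length": "1234"}'
)

if __name__ == "__main__":
    print(parse_cdxj_line(SAMPLE))
```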
19. Further process .summary
• host_year_total.py
• overlap.py
2nd level domain      Year   File Count   Bytes
alvestedetocht.frl    2015   2            3750
alvestedetocht.frl    2016   108          483354679

Year   Common Crawl   Internet Archive   CC & IA
2019   469            620                1180
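A rough sketch of the two post-processing steps, assuming rows of (host, year, file count, bytes) and one set of hosts per archive for a given year. This is not the code of host_year_total.py or overlap.py. Whether the slide's per-archive columns are exclusive counts is not stated, so the "only" counts below are an assumption.

```python
from collections import defaultdict


def host_year_totals(rows):
    """Sum file counts and bytes per (host, year).

    rows: iterable of (host, year, file_count, size_bytes).
    """
    totals = defaultdict(lambda: [0, 0])
    for host, year, count, size in rows:
        totals[(host, year)][0] += count
        totals[(host, year)][1] += size
    return {key: tuple(value) for key, value in totals.items()}


def overlap(cc_hosts, ia_hosts):
    """Count hosts seen only in one archive and hosts seen in both."""
    return {
        "cc_only": len(cc_hosts - ia_hosts),
        "ia_only": len(ia_hosts - cc_hosts),
        "both": len(cc_hosts & ia_hosts),
    }


if __name__ == "__main__":
    rows = [
        ("alvestedetocht.frl", 2015, 2, 3750),
        ("alvestedetocht.frl", 2016, 100, 400),
        ("alvestedetocht.frl", 2016, 8, 600),
    ]
    print(host_year_totals(rows))
    print(overlap({"a.lu", "b.lu"}, {"b.lu", "c.lu", "d.lu"}))
```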
20. .lu overlap between Internet Archive, Common Crawl and Luxembourg Web archive in terms of hosts
[Figure: chart. X-axis: year, 1996 to 2020. Y-axis: number of hosts, 0 to 80000. Series: webarchive.lu; webarchive.lu AND InternetArchive; webarchive.lu AND InternetArchive AND CommonCrawl; webarchive.lu AND CommonCrawl; InternetArchive; InternetArchive AND CommonCrawl; CommonCrawl]
21. .fr overlap between Internet Archive, Common Crawl and Luxembourg Web archive in terms of hosts
[Figure: chart. X-axis: year, 1993 to 2020. Y-axis: number of hosts, 0 to 2500000. Series: lufr; lufr AND iafr; lufr AND iafr AND ccfr; lufr AND ccfr; iafr; iafr AND ccfr; ccfr]
22. Related work
• Internet Archive metadata service
https://github.com/jeffersonbailey/web-archive-apis-workshop
curl "https://web.archive.org/__wb/search/metadata?q=tld:lu"
• Sawood Alam et al.: characterization of web archive holdings for Memento aggregators
https://netpreserve.org/resources/IIPC_project-Archive_profiling-final_report.pdf
https://github.com/oduwsdl/MementoMap
• Shine and SolrWayback can probably answer questions like this for a single archive
23. Limitations
• Only one developer/tester! Bugs…
• MIME
  • MIME types are reported by the server but are not necessarily correct
  • Some Common Crawl data has no MIME information
  • No canonical way to “simplify” MIME types
  • Maybe missing interesting categories?
• Domains / Hosts
  • Tradeoff between size of summary file and details (e.g. www.ic.ac.uk)
  • IDN (xn--p1ai -> рф -> ru)
• Overlap analysis
  • Very crude
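The domain and IDN points can be illustrated with a small sketch using only the Python standard library. Both helper names are hypothetical. The "last two labels" rule is deliberately naive, to show why hosts such as www.ic.ac.uk are a problem: a real implementation would need something like a public suffix list.

```python
def decode_idn(host):
    """Decode a Punycode (xn--) host name to its Unicode form."""
    return host.encode("ascii").decode("idna")


def naive_second_level(host):
    """Keep only the last two labels of a host name.

    Naive on purpose: it ignores multi-label public suffixes such as ac.uk.
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


if __name__ == "__main__":
    print(decode_idn("xn--p1ai"))
    print(naive_second_level("www.example.lu"))
    # Collapses to the suffix "ac.uk", losing the institution "ic".
    print(naive_second_level("www.ic.ac.uk"))
```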