Webarchive CDX summary
WARCNet WG1 - Comparing entire web domains
Aarhus virtual meeting 21.4.2021
yves.maurer@bnl.etat.lu
@yvesmaurer
github.com/ymaurer
Why
• We want to know more about “National Collections” and “National
Webs”
• Many Web Archives are not accessible through the Internet
• Often the only published statistic is “XX PB in the archive”
• Not a lot of information on country-code TLDs (ccTLDs)
• Little info on overlap between archives
• WARC & CDX are too big
• Need for “low common denominator data” that is still “rich enough”
Size comparison (wlu)
• WARC: ~ 0.5 million files, 84 TB
• CDXJ: ~ 0.5 million files, 1.6 TB (about 50x smaller than WARC)
• Summary file: 1 file, 245 MB (about 7000x smaller than CDXJ)
Detailed info & Code
https://github.com/ymaurer/cdx-summarize
Many CDX files -> cdx-summarize.py (several runs) -> combine-summary.py -> .summary JSON file
Possible alternative sources
Or any other search engine index that holds information about the count & size of MIME types per domain
What is the summary file?
• A file with an entry per 2nd level domain and summary info per year
about the number of files and their size:
• HTML
• CSS
• Images
• PDF
• Video
• Audio
• Javascript
• JSON (Javascript Object Notation)
• Fonts
• HTTP vs HTTPS (secure Web)
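A minimal sketch of how such per-domain, per-year counters could be built from index records. This is not the code of cdx-summarize.py: the MIME mapping, the category names beyond html, image and pdf, and the helper names `simplify_mime` and `summarize` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from MIME type to summary category; the real
# cdx-summarize.py may group types differently.
EXACT = {
    "text/html": "html",
    "application/xhtml+xml": "html",
    "text/css": "css",
    "application/pdf": "pdf",
    "application/javascript": "js",
    "text/javascript": "js",
    "application/json": "json",
}
PREFIX = {"image/": "image", "video/": "video", "audio/": "audio", "font/": "font"}


def simplify_mime(mime):
    """Reduce a raw MIME type to one coarse category, or 'other'."""
    mime = mime.split(";")[0].strip().lower()  # drop "; charset=..." parameters
    if mime in EXACT:
        return EXACT[mime]
    for prefix, category in PREFIX.items():
        if mime.startswith(prefix):
            return category
    return "other"


def summarize(records):
    """Aggregate (host, year, mime, size) tuples into per-domain, per-year counters."""
    summary = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for host, year, mime, size in records:
        # Naive 2nd-level domain: keep the last two labels of the host name.
        domain = ".".join(host.lower().split(".")[-2:])
        category = simplify_mime(mime)
        summary[domain][year]["n_" + category] += 1  # number of files
        summary[domain][year]["s_" + category] += size  # size in bytes
    return summary


# Hypothetical records, not taken from any archive.
records = [
    ("www.bnl.lu", "2003", "text/html; charset=utf-8", 2000),
    ("bnl.lu", "2003", "text/html", 1000),
    ("www.bnl.lu", "2003", "image/png", 500),
]
result = summarize(records)
print(dict(result["bnl.lu"]["2003"]))
```

Both hosts collapse into the single entry bnl.lu, which is what keeps the summary small.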
Summary file example
bnl.lu {
"2002":
{"n_html":175,"n_image":0,"n_pdf":0, ...
"s_html":52634,"s_image":0,"s_pdf":0, ...},
"2003":
{"n_html":639,"n_image":44,"n_pdf":30, ... ,
"s_html":1295481,"s_image":295235,"s_pdf":3071214, ...}
}
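A sketch of reading one such entry, assuming each entry is stored on a single line as the domain name, a space, and a JSON object. The one-line layout and the helper name `parse_summary_line` are assumptions; the sample line keeps only the HTML fields shown above.

```python
import json


def parse_summary_line(line):
    """Split 'domain {json}' into the domain name and its per-year dict."""
    domain, _, payload = line.partition(" ")
    return domain, json.loads(payload)


# Hypothetical complete line; the slide elides most fields with "...".
line = (
    'bnl.lu {"2002": {"n_html": 175, "s_html": 52634}, '
    '"2003": {"n_html": 639, "s_html": 1295481}}'
)
domain, years = parse_summary_line(line)
for year in sorted(years):
    print(domain, year, years[year]["n_html"], years[year]["s_html"])
```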
Example:
Average size of files
[Figure: Average size of HTML files (s_html / n_html) in bytes, per year 2015 to 2021, Luxembourg Web Archive]
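The ratio s_html / n_html can be sketched as follows for a single domain, using the bnl.lu values from the summary file example. The helper name `average_size` is hypothetical, and an archive-wide average would first sum the counters over all domains.

```python
def average_size(stats, kind="html"):
    """Average size in bytes for one category, or None if no files were counted."""
    count = stats.get("n_" + kind, 0)
    size = stats.get("s_" + kind, 0)
    return size / count if count else None


# Values taken from the bnl.lu example entry (HTML fields only).
bnl = {
    "2002": {"n_html": 175, "s_html": 52634},
    "2003": {"n_html": 639, "s_html": 1295481},
}
for year, stats in sorted(bnl.items()):
    print(year, round(average_size(stats)))
```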
Example:
Using the domain names
[Figure: First letter frequency, A to Z, in IA 2nd-level domains vs French words. Series: French wordlist, .fr domains]
[Figure: First letter frequency in IA 2nd-level domains vs French words, per letter A to Z, on an axis from -4 to 6]
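A sketch of how a first-letter distribution could be computed from a list of names and compared with a second list. The sample names are invented, and `first_letter_frequency` is a hypothetical helper, not part of the published scripts.

```python
import string
from collections import Counter


def first_letter_frequency(names):
    """Return {letter: percentage} over the names that start with a letter a-z."""
    counts = Counter(
        name[0].lower()
        for name in names
        if name and name[0].lower() in string.ascii_lowercase
    )
    total = sum(counts.values())
    if not total:
        return {}
    return {letter: 100.0 * counts[letter] / total for letter in string.ascii_lowercase}


# Invented sample data; names starting with a digit are skipped.
domains = ["bnl.fr", "banque.fr", "avion.fr", "1001.fr"]
words = ["avion", "arbre", "banque"]

domain_freq = first_letter_frequency(domains)
word_freq = first_letter_frequency(words)

# Difference in percentage points, letter by letter.
difference = {letter: domain_freq[letter] - word_freq[letter] for letter in domain_freq}
print(round(difference["a"], 2), round(difference["b"], 2))
```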
Data sources used
Luxembourg Web Archive
Data source Luxembourg Web Archive
• Established in 2016
• CDXJ files on disk
• Run programs locally
Data source Internet Archive
• CDX Server API at:
http://web.archive.org/cdx/search/cdx?url=lu
Download using:
https://github.com/ikreymer/cdx-index-client
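A sketch of building query URLs for the CDX Server endpoint above without sending any request. `matchType` and `showNumPages` are parameters of the public CDX Server API, but whether the download described here used them is an assumption; `cdx_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(tld, **params):
    """Build a CDX Server query URL for one TLD; no request is sent."""
    query = {"url": tld}
    query.update(params)
    return CDX_ENDPOINT + "?" + urlencode(query)


# The plain query shown on the slide.
print(cdx_query_url("lu"))
# Assumed variant: match the whole domain and ask only for the page count.
print(cdx_query_url("lu", matchType="domain", showNumPages="true"))
```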
Data downloaded for:
                lu      dk      be      fr      frl     nl
should have     41136   67843   71202   311813  42      230871
actually have   41136   42037   71202   303282  42      205147
missing (%)     0.00%   38.04%  0.00%   2.74%   0.00%   11.14%
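The "missing (%)" row follows from the two rows above it as (should have - actually have) / should have. A short check with the numbers from the table; the function name `missing_percent` is only for illustration.

```python
expected = {"lu": 41136, "dk": 67843, "be": 71202, "fr": 311813, "frl": 42, "nl": 230871}
actual = {"lu": 41136, "dk": 42037, "be": 71202, "fr": 303282, "frl": 42, "nl": 205147}


def missing_percent(should_have, actually_have):
    """Share of the expected data that was not downloaded, in percent."""
    return 100.0 * (should_have - actually_have) / should_have


for tld in expected:
    print(f"{tld}: {missing_percent(expected[tld], actual[tld]):.2f}%")
```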
Data source Common crawl
• Hosted on Amazon S3
• Recipe at:
https://groups.google.com/g/common-crawl/c/3QmQjFA_3y4/m/vTbhGqIBBQAJ
• Download CDX / CDXJ and process locally
[Figure: Number of domains per size archived in ccTLD .fr. X-axis: size in archive in bytes (logarithmic, 100 bytes to 100 GB). Y-axis: number of domains. Series: IA, commoncrawl]
[Figure: .fr bytes vs number of years presence of domain in Internet Archive. X-axis: number of years in IA archive. Y-axis: number of compressed bytes (logarithmic)]
Further process .summary
• host_year_total.py
• overlap.py
2nd level domain     Year  File Count  Bytes
alvestedetocht.frl   2015  2           3750
alvestedetocht.frl   2016  108         483354679
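A sketch of how rows like these could be derived from one summary entry, by summing all n_ counters into a file count and all s_ counters into bytes. This is an assumption about what host_year_total.py does, not its actual code. The sample uses only the bnl.lu fields visible in the summary file example, and it assumes that every n_ and s_ field is a MIME category counter.

```python
def host_year_totals(domain, years):
    """Yield (domain, year, file_count, total_bytes) rows, one per year."""
    for year in sorted(years):
        stats = years[year]
        # Assumes all n_* / s_* fields are MIME category counters; any
        # protocol counters would have to be excluded to avoid double counting.
        files = sum(value for key, value in stats.items() if key.startswith("n_"))
        size = sum(value for key, value in stats.items() if key.startswith("s_"))
        yield domain, year, files, size


# Only the fields shown in the bnl.lu example; elided fields are left out.
bnl = {
    "2002": {"n_html": 175, "n_image": 0, "n_pdf": 0,
             "s_html": 52634, "s_image": 0, "s_pdf": 0},
    "2003": {"n_html": 639, "n_image": 44, "n_pdf": 30,
             "s_html": 1295481, "s_image": 295235, "s_pdf": 3071214},
}
rows = list(host_year_totals("bnl.lu", bnl))
for row in rows:
    print(*row)
```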
Year  Common Crawl  Internet Archive  CC & IA
2019  469           620               1180
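One possible reading of such an overlap table is sketched below: per year, count the domains found only in the first archive, only in the second, and in both. The domain sets are invented, and `overlap_by_year` is a hypothetical helper rather than the logic of overlap.py.

```python
def overlap_by_year(a, b):
    """Map each year to (only in a, only in b, in both) domain counts."""
    rows = {}
    for year in sorted(set(a) | set(b)):
        in_a = a.get(year, set())
        in_b = b.get(year, set())
        rows[year] = (len(in_a - in_b), len(in_b - in_a), len(in_a & in_b))
    return rows


# Invented domain sets per year for two archives.
cc = {"2019": {"a.lu", "b.lu", "c.lu"}}
ia = {"2018": {"a.lu"}, "2019": {"b.lu", "c.lu", "d.lu", "e.lu"}}

table = overlap_by_year(cc, ia)
for year, (only_cc, only_ia, both) in table.items():
    print(year, only_cc, only_ia, both)
```

This compares domain names only, so it says nothing about whether the two archives hold the same pages.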
[Figure: .lu overlap between Internet Archive, Common Crawl and Luxembourg Web archive in terms of hosts, per year 1996 to 2020. Series: webarchive.lu; webarchive.lu AND InternetArchive; webarchive.lu AND InternetArchive AND CommonCrawl; webarchive.lu AND CommonCrawl; InternetArchive; InternetArchive AND CommonCrawl; CommonCrawl]
[Figure: .fr overlap between Internet Archive, Common Crawl and Luxembourg Web archive in terms of hosts, per year 1993 to 2020. Series: lufr; lufr AND iafr; lufr AND iafr AND ccfr; lufr AND ccfr; iafr; iafr AND ccfr; ccfr]
Related work
• Internet Archive metadata service
https://github.com/jeffersonbailey/web-archive-apis-workshop
curl "https://web.archive.org/__wb/search/metadata?q=tld:lu"
• Sawood Alam et al.: characterization of web archive holdings for Memento aggregators
https://netpreserve.org/resources/IIPC_project-Archive_profiling-final_report.pdf
https://github.com/oduwsdl/MementoMap
• Shine and SOLRWayback can probably answer questions like this for a single archive
Limitations
• Only one developer/tester! Bugs…
• MIME
  • MIME types are reported by the server but are not necessarily correct
  • Some Common Crawl data has no MIME information
  • No canonical way to “simplify” MIME types
  • Maybe missing interesting categories?
• Domains / Hosts
  • Tradeoff between size of summary file and details (e.g. www.ic.ac.uk)
  • IDN (xn--p1ai -> рф -> ru)
• Overlap analysis
  • Very crude
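The two domain-related limitations can be illustrated with a short sketch. It assumes the 2nd-level domain is taken naively as the last two labels of the host name, which may not be how the published scripts do it, and it uses Python's built-in `idna` codec to decode the punycode label. Both helper names are hypothetical.

```python
def naive_second_level(host):
    """Keep the last two labels of a host name (an assumed, naive rule)."""
    return ".".join(host.lower().split(".")[-2:])


def idn_to_unicode(label):
    """Decode an ASCII punycode label with Python's built-in idna codec."""
    return label.encode("ascii").decode("idna")


# Under the naive rule, every host below ac.uk collapses into one entry.
print(naive_second_level("www.ic.ac.uk"))
print(naive_second_level("www.bnl.lu"))

# The punycode TLD from the slide, in its Unicode form.
print(idn_to_unicode("xn--p1ai"))
```

Splitting on a public suffix list instead would keep ic.ac.uk separate, at the cost of a larger summary file.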

Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operational
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
prashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Professionprashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Profession
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 

Maurer Presentation - WARCnet Spring Meeting 2021

  • 1. Webarchive CDX summary
      WARCNet WG1 - Comparing entire web domains
      Aarhus virtual meeting 21.4.2021
      yves.maurer@bnl.etat.lu
      @yvesmaurer
      github.com/ymaurer
  • 2. Why
      • We want to know more about “National Collections” and “National Webs”
      • Many web archives are not accessible through the Internet
      • Often the only available statistic is “XX PB in archive”
      • Not a lot of information on country-code TLDs
      • Little info on overlap between archives
      • WARC & CDX are too big
      • Need for “low common denominator data” that is still “rich enough”
  • 3. Size comparison (wlu)
      • WARC: ~ 0.5 million, 84 TB
      • CDXJ: ~ 0.5 million, 1.6 TB (50x smaller than WARC)
      • Summary file: 1 file, 245 MB (7000x smaller than CDXJ)
  • 4. Detailed info & Code
      https://github.com/ymaurer/cdx-summarize
      [Diagram: groups of CDX files are each processed by cdx-summarize.py; the results are merged by combine-summary.py into a .summary JSON file]
  • 5. Possible alternative sources
      • Or other search engine index which holds information about count & size of MIME types per domain
  • 6. What is the summary file?
      • A file with an entry per 2nd level domain and summary info per year about the number of files and their size:
          • HTML
          • CSS
          • Images
          • PDF
          • Video
          • Audio
          • Javascript
          • JSON (Javascript Object Notation)
          • Fonts
          • HTTP vs HTTPS (secure Web)
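The categories above imply some mapping from reported MIME types to a small set of buckets. The deck itself notes that there is no canonical way to do this, so the following is only a minimal Python sketch under stated assumptions: the function name `simplify_mime`, the bucket names and the exact MIME lists are illustrative and are not taken from cdx-summarize.

```python
# Hypothetical MIME-to-category mapping; bucket names are assumptions.
JS_TYPES = {"application/javascript", "text/javascript", "application/x-javascript"}
FONT_TYPES = {"application/font-woff", "application/vnd.ms-fontobject"}

def simplify_mime(mime):
    """Map a server-reported MIME type to a coarse category."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    base = (mime or "").split(";")[0].strip().lower()
    if base in ("text/html", "application/xhtml+xml"):
        return "html"
    if base == "text/css":
        return "css"
    if base == "application/pdf":
        return "pdf"
    if base in JS_TYPES:
        return "js"
    if base == "application/json":
        return "json"
    if base.startswith("font/") or base in FONT_TYPES:
        return "font"
    for prefix in ("image", "video", "audio"):
        if base.startswith(prefix + "/"):
            return prefix
    # Missing or unrecognised types end up in a catch-all bucket.
    return "other"

for example in ("text/html; charset=UTF-8", "image/png", None):
    print(example, "->", simplify_mime(example))
```

A catch-all bucket keeps the counts complete even when the server reports no MIME type or an incorrect one.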
  • 7. Summary file example
      bnl.lu {
      "2002":
      {"n_html":175,"n_image":0,"n_pdf":0, ...
      "s_html":52634,"s_image":0,"s_pdf":0, ...},
      "2003":
      {"n_html":639,"n_image":44,"n_pdf":30, ... ,
      "s_html":1295481,"s_image":295235,"s_pdf":3071214, ...}
      }
  • 9. [Chart: Average size of HTML files (s_html / n_html), Luxembourg Web Archive; x-axis: year, 2015 to 2021; y-axis: average size in bytes, 0 to 25000]
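The ratio s_html / n_html can be computed directly from one summary entry. A minimal Python sketch, assuming each entry is a 2nd level domain followed by a space and a JSON object keyed by year, as in the bnl.lu example; the helper name `average_sizes` is illustrative, and the sample line keeps only the HTML fields shown on the slide.

```python
import json

def average_sizes(line, kind="html"):
    """Return (domain, {year: average bytes per file}) for one summary entry.

    Assumes the layout "<domain> <JSON object keyed by year>"; this is
    inferred from the slide example, not from the cdx-summarize code.
    """
    domain, _, payload = line.partition(" ")
    years = json.loads(payload)
    averages = {}
    for year, stats in years.items():
        count = stats.get("n_" + kind, 0)
        size = stats.get("s_" + kind, 0)
        if count:  # skip years without any file of this kind
            averages[year] = size / count
    return domain, averages

# Reduced sample built from the values on the slide.
sample = ('bnl.lu {"2002": {"n_html": 175, "s_html": 52634}, '
          '"2003": {"n_html": 639, "s_html": 1295481}}')
domain, averages = average_sizes(sample)
for year in sorted(averages):
    print(domain, year, round(averages[year]))
```

Years with a zero count are skipped rather than reported as zero, so they do not drag a plotted average down.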
  • 11. [Chart: First letter frequency in IA 2nd-level domains vs French words; series: French wordlist, .fr domains; x-axis: first letter, A to Z; y-axis: 0 to 12]
  • 12. [Chart: First letter frequency in IA 2nd-level domains vs French words; x-axis: first letter, A to Z; y-axis: -4 to 6]
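A first-letter comparison of this kind needs only the domain names from the summary file and a word list. The sketch below is one possible way to compute it in Python, assuming both charts are based on percentages and that the second chart shows the per-letter difference; the function name and the tiny sample lists are hypothetical, not data from the talk.

```python
from collections import Counter
import string

LETTERS = set(string.ascii_uppercase)

def first_letter_percent(names):
    """Percentage of names starting with each letter A-Z."""
    firsts = (name[0].upper() for name in names if name)
    counts = Counter(ch for ch in firsts if ch in LETTERS)
    total = sum(counts.values())
    return {ch: (100.0 * counts[ch] / total if total else 0.0)
            for ch in string.ascii_uppercase}

# Hypothetical inputs, for illustration only.
domains = ["alpha.fr", "beta.fr", "avion.fr", "chat.fr"]
words = ["arbre", "bateau", "chien", "cheval"]

domain_pct = first_letter_percent(domains)
word_pct = first_letter_percent(words)
# Positive values: letter is more frequent in domain names than in words.
diff = {ch: domain_pct[ch] - word_pct[ch] for ch in string.ascii_uppercase}
print({ch: value for ch, value in diff.items() if value})
```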
  • 14. Data source: Luxembourg Web Archive
      • Established in 2016
      • CDXJ files on disk
      • Run programs locally
  • 15. Data source: Internet Archive
      • CDX Server API at: http://web.archive.org/cdx/search/cdx?url=lu
      • Download using: https://github.com/ikreymer/cdx-index-client
      • Data downloaded for:
                          lu       dk       be       fr        frl      nl
          should have     41136    67843    71202    311813    42       230871
          actually have   41136    42037    71202    303282    42       205147
          missing (%)     0.00%    38.04%   0.00%    2.74%     0.00%    11.14%
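The "missing (%)" row follows from the two rows above it as (should have - actually have) / should have. A short Python check using the counts from the slide; the variable and function names are illustrative.

```python
# Counts copied from the slide; the unit is as reported there.
should_have = {"lu": 41136, "dk": 67843, "be": 71202,
               "fr": 311813, "frl": 42, "nl": 230871}
actually_have = {"lu": 41136, "dk": 42037, "be": 71202,
                 "fr": 303282, "frl": 42, "nl": 205147}

def missing_percent(expected, actual):
    """Share of the expected items that were not downloaded, in percent."""
    return 100.0 * (expected - actual) / expected

missing = {tld: missing_percent(should_have[tld], actually_have[tld])
           for tld in should_have}
for tld, pct in missing.items():
    print(f"{tld}: {pct:.2f}%")
```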
  • 16. Data source: Common Crawl
      • Hosted on Amazon S3
      • Recipe at: https://groups.google.com/g/common-crawl/c/3QmQjFA_3y4/m/vTbhGqIBBQAJ
      • Download CDX / CDXJ and process locally
  • 17. [Chart: Number of domains per size archived in ccTLD .fr; series: IA, commoncrawl; x-axis: size in archive in bytes (logarithmic), 100 bytes to 100 GB; y-axis: number of domains, 0 to 200000]
  • 18. [Chart: .fr bytes vs number of years presence of domain in Internet Archive; x-axis: number of years in IA archive, 0 to 25; y-axis: number of compressed bytes (logarithmic), 1 to 1E+09]
  • 19. Further process .summary
      • host_year_total.py
      • overlap.py

      2nd level domain      Year    File Count    Bytes
      alvestedetocht.frl    2015    2             3750
      alvestedetocht.frl    2016    108           483354679

      Year    Common Crawl    Internet Archive    CC & IA
      2019    469             620                 1180
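An overlap count of this kind reduces to set operations on the hosts each archive holds for a given year. The following Python sketch shows one way to do it; it is not the code of overlap.py, the function name and host lists are hypothetical, and it makes no claim about how the columns in the table above are defined.

```python
def overlap_counts(hosts_a, hosts_b):
    """Count hosts seen only in A, only in B, and in both archives."""
    a, b = set(hosts_a), set(hosts_b)
    return {"only_a": len(a - b), "only_b": len(b - a), "both": len(a & b)}

# Hypothetical host lists for a single year; not real archive data.
common_crawl = ["example.frl", "alvestedetocht.frl", "foo.frl"]
internet_archive = ["alvestedetocht.frl", "bar.frl"]

counts = overlap_counts(common_crawl, internet_archive)
print(counts)
```

Comparing plain host names ignores how much of each host was captured, which is one reason such an overlap measure stays crude.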
  • 20. [Chart: .lu overlap between Internet Archive, Common Crawl and Luxembourg Web archive in terms of hosts; x-axis: year, 1996 to 2020; y-axis: 0 to 80000; series: webarchive.lu, webarchive.lu AND InternetArchive, webarchive.lu AND InternetArchive AND CommonCrawl, webarchive.lu AND CommonCrawl, InternetArchive, InternetArchive AND CommonCrawl, CommonCrawl]
  • 21. [Chart: .fr overlap between Internet Archive, Common Crawl and Luxembourg Web archive in terms of hosts; x-axis: year, 1993 to 2020; y-axis: 0 to 2500000; series: lufr, lufr AND iafr, lufr AND iafr AND ccfr, lufr AND ccfr, iafr, iafr AND ccfr, ccfr]
  • 22. Related work
      • Internet Archive metadata service
          https://github.com/jeffersonbailey/web-archive-apis-workshop
          curl "https://web.archive.org/__wb/search/metadata?q=tld:lu"
      • Sawood Alam et al. characterization of webarchive holdings for memento aggregator
          https://netpreserve.org/resources/IIPC_project-Archive_profiling-final_report.pdf
          https://github.com/oduwsdl/MementoMap
      • Shine, SOLRWayback can probably answer questions like this for a single archive
  • 23. Limitations
      • Only one developer/tester! Bugs…
      • MIME
          • MIME types are reported by server but not necessarily correct
          • Some common-crawl data has no MIME information
          • No canonical way to “simplify” MIME types
          • Maybe missing interesting categories?
      • Domains / Hosts
          • Tradeoff between size of summary file and details (e.g. www.ic.ac.uk)
          • IDN (xn--p1ai -> рф -> ru)
      • Overlap analysis
          • Very crude
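The IDN chain xn--p1ai -> рф -> ru can be reproduced with Python's built-in "idna" codec, which converts between the ASCII (Punycode) and Unicode forms of a label. The sketch below is only an illustration: the function names and the one-entry mapping from an internationalized TLD to its ASCII country code are hypothetical, and a real tool would need a fuller table.

```python
# Hypothetical mapping from internationalized ccTLDs to ASCII ccTLDs.
IDN_TLD_TO_CCTLD = {"рф": "ru"}

def to_unicode(label):
    """Decode an ASCII (xn--...) label to Unicode via the built-in idna codec."""
    return label.encode("ascii").decode("idna")

def canonical_tld(tld):
    """Return the ASCII ccTLD that a (possibly IDN) TLD is grouped under."""
    decoded = to_unicode(tld)
    return IDN_TLD_TO_CCTLD.get(decoded, decoded)

print(to_unicode("xn--p1ai"))
print(canonical_tld("xn--p1ai"))
print(canonical_tld("lu"))
```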