The WARCnet Code Book of web archive data formats
1. WG5: The WARCnet Code Book of web archive data formats
June 2022, London
Sharon Healy, Karin de Wild, Niels Brügger, Peter Webster, Márton Németh, Vladimir Tybin
2. The aim of Working Group 5 is to discuss and formulate possible data formats
and thereby create a shared language between web archiving institutions and
research communities.
Projects:
1. A shared data vocabulary to request archived Web data
2. A glossary with terms used in web archive research (WG3)
5. SolrWayback
SolrWayback is a search engine that can retrieve data from WARC files.
• Free text search in all resources (HTML pages, PDFs, metadata for different media types, URLs, etc.)
• CSV export of search results (with custom field selection).
• Image search (similar to Google Images).
• Visualization of search results such as:
- Interactive network graph (ingoing/outgoing)
- Statistics over time (e.g. size, number of incoming and outgoing links, etc.)
6. Ulrich Have (in an email when WG5 was established):
“a standard data format would be interesting as a kind of
future requirements document for researcher-ready-data”
7. Niels Brügger, a systematic description of data formats for web archive studies
8. Actions:
Existing data vocabularies
• Web archives
• Controlled vocabularies (schema.org, Wikidata, CIDOC-CRM, Dublin Core, etc.)
Datathons
• Identify data requests
• Identify terms / variables
• Write / improve definitions
12. Request #1
CDX (listing of all the resources within the Web archive)
• Domain
• Host
• Full resource URL
• Crawl date
• Hash
• Resource format (PDF, HTML, etc.)
• Link to instance in Wayback
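As an illustration only, the fields of Request #1 could be delivered as one CSV row per captured resource. The sketch below is not a WARCnet specification: the class name `CdxRequestRow`, the helper names, the 14-digit timestamp format and the Wayback base URL are all assumptions made for the example, and the domain is passed in rather than derived, since deriving it reliably needs a public suffix list.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from urllib.parse import urlsplit


@dataclass
class CdxRequestRow:
    """One row of a Request #1 export (hypothetical field names)."""
    domain: str
    host: str
    url: str
    crawl_date: str        # assumed 14-digit timestamp, e.g. "20220601120000"
    content_hash: str
    resource_format: str
    wayback_link: str


def make_row(url, crawl_date, content_hash, resource_format, domain, wayback_base):
    """Build a row; the host is read from the URL, the domain is supplied."""
    host = urlsplit(url).hostname or ""
    return CdxRequestRow(
        domain=domain,
        host=host,
        url=url,
        crawl_date=crawl_date,
        content_hash=content_hash,
        resource_format=resource_format,
        # Assumed link pattern: <base>/<timestamp>/<url>; archives may differ.
        wayback_link=f"{wayback_base}/{crawl_date}/{url}",
    )


def to_csv(rows):
    """Serialise rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(CdxRequestRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()
```

A fixed column order like this is one way to make the same request repeatable across archives, though each institution would need to confirm which fields it can actually supply.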
13. Request #2
Seeds and crawl policies
• Seed URL
• Crawl frequency (daily, weekly, etc.)
• Capped? (yes/no)
• First crawl date
• Last crawl date (or ongoing)
• Crawl depth
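The Request #2 fields could be modelled as a small record, sketched here under stated assumptions: the class name `SeedPolicy`, the field types, and the convention that an empty last crawl date means "ongoing" are illustrative choices, not part of the Code Book.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SeedPolicy:
    """Seed and crawl policy for one seed (hypothetical field names)."""
    seed_url: str
    crawl_frequency: str          # e.g. "daily", "weekly"
    capped: bool                  # True if the crawl was size- or time-capped
    first_crawl: date
    last_crawl: Optional[date]    # None is used here to mean "ongoing"
    crawl_depth: int

    def is_ongoing(self):
        """A seed with no last crawl date is still being crawled."""
        return self.last_crawl is None

    def active_on(self, day):
        """True if the seed was being crawled on the given day."""
        if day < self.first_crawl:
            return False
        return self.last_crawl is None or day <= self.last_crawl
```

Recording "ongoing" explicitly, rather than leaving a blank, lets a researcher check whether a gap in captures reflects the crawl policy or a change on the live web.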
14. Request #3
Links
• Source URL (full)
• Source URL (host)
• Source URL (domain)
• Source file format (.html, etc.)
• Target URL (full)
• Target Host
• Target Domain
• Capture Date
• Link to source resource in Wayback
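To show how rows for Request #3 might be derived from a single archived HTML page, here is a minimal sketch using only the Python standard library. The function name `link_rows` and the dictionary keys are hypothetical. The sketch covers only `<a href>` links, and it leaves out the domain columns and the Wayback link, which depend on a public suffix list and on the archive's own URL scheme.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def link_rows(source_url, capture_date, html):
    """Return one dict per outgoing link found in the page."""
    collector = _AnchorCollector()
    collector.feed(html)
    source_host = urlsplit(source_url).hostname or ""
    rows = []
    for href in collector.hrefs:
        target_url = urljoin(source_url, href)   # resolve relative links
        target_host = urlsplit(target_url).hostname
        if not target_host:
            continue  # skip mailto:, javascript: and similar non-web targets
        rows.append({
            "source_url": source_url,
            "source_host": source_host,
            "target_url": target_url,
            "target_host": target_host,
            "capture_date": capture_date,
        })
    return rows
```

Rows in this shape can be aggregated by source and target host to build the kind of link graph a researcher might request.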
23. Next steps:
• Add terms to the reference list in Zotero
• Add definitions to the reference list in Zotero
• Select terms for a glossary for early career researchers using Web archives