SlideShare a Scribd company logo
1 of 23
WG5: The WARCnet Code Book of web archive data formats
June 2022, London
Sharon Healy, Karin de Wild, Niels Brügger, Peter Webster, Márton Németh, Vladimir Tybin
The aim of Working Group 5 is to discuss and formulate possible data formats
and thereby create a shared language between web archiving institutions and
research communities.
Projects:
1. A shared data vocabulary to request archived Web data
2. A glossary with terms used in web archive research (WG3)
Project 1:
Shared data vocabulary for requesting
archived Web data
Solr wayback
The Solr wayback is a search engine that can retrieve data from WARC’s.
Solr wayback
The Solr wayback is a search engine that can retrieve data from WARC’s.
• Free text search in all resources (HTML pages, PDFs, metadata for different media types, URLs, etc.)
• CSV export of search results (with custom field selection).
• Image search (similar to google images).
• Visualization of search results such as:
- Interactive network graph (ingoing/outgoing)
- Statistics over time (e.g. size, number of in and out going links, etc)
Ulrich Have (in an email when WG5 was established):
“a standard data format would be interesting as a kind of
future requirements document for researcher-ready-data”
Niels Brügger, a systematic description of data formats for web archive studies
Actions:
Existing data vocabularies
• Web archives
• Controlled vocabularies (schema.org, Wiki data, CIDOC-CRM, Dublin Core, etc.)
Datathons
• Identify data requests
• Identify terms / variables
• Write / improve definitions
Wiki data
Interoperability
Using controlled vocabularies offers the potential to request and interlink (machine-readable) data across digital heritage
collections.
Request #1
CDX (listing of all the resources within the Web archive)
• Domain
• Host
• Full resource URL
• Crawl date
• Hash
• Resource format (PDF/html etc)
• Link to instance in Wayback
Request #2
Seeds and crawl policies
• seed URL
• crawl frequency (daily, weekly etc)
• capped? (yes/no)
• first crawl date
• last crawl date (or ongoing)
• crawl depth
Request #3
Links
• Source URL (full)
• Source URL (host)
• Source URL (domain)
• Source File Format (.html etc)
• Target URL (full)
• Target Host
• Target Domain
• Capture Date
• Link to source resource in Wayback
Request #4
Webpages
To do.
Request #5
Metadata on a dataset
• Web archive
• Provenance
(based on W3C-PROV)
Next steps:
Collect existing data vocabularies
• Web archives
• Controlled vocabularies (schema.org, Wiki data, CIDOC-CRM, Dublin Core, etc.)
Datathons
• Identify data requests
• Identify terms / variables
• Write / improve definitions
Project 2:
Glossary for Web archival studies
Next steps:
• Add terms to the reference list in Zotero
• Add definitions to the reference list in Zotero
• Select terms for a glossary for early career researchers using Web archives

More Related Content

Similar to The WARCnet Code Book of web archive data formats

Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...
Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...
Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...Brigitte Jörg
 
Analyzing Web Archives
Analyzing Web ArchivesAnalyzing Web Archives
Analyzing Web Archivesvinaygo
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
 
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...OpenAIRE
 
Alphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata Matters
Alphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata MattersAlphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata Matters
Alphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata MattersNew York University
 
Semantic Web (IS 535 presentation) by ITRL students Deborah Ratliff and Maril...
Semantic Web (IS 535 presentation) by ITRL students Deborah Ratliff and Maril...Semantic Web (IS 535 presentation) by ITRL students Deborah Ratliff and Maril...
Semantic Web (IS 535 presentation) by ITRL students Deborah Ratliff and Maril...cmitch41
 
DataCite and its DOI infrastructure - IASSIST 2013
DataCite and its DOI infrastructure - IASSIST 2013DataCite and its DOI infrastructure - IASSIST 2013
DataCite and its DOI infrastructure - IASSIST 2013Frauke Ziedorn
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosEUCLID project
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Anita de Waard
 
RDA-WDS Publishing Data Interest Group
RDA-WDS Publishing Data Interest GroupRDA-WDS Publishing Data Interest Group
RDA-WDS Publishing Data Interest GroupAnita de Waard
 
PIDs and DOI registration with DataCite - IATUL Workshop 2013
PIDs and DOI registration with DataCite - IATUL Workshop 2013PIDs and DOI registration with DataCite - IATUL Workshop 2013
PIDs and DOI registration with DataCite - IATUL Workshop 2013Frauke Ziedorn
 
Cloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application DevelopmentCloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application DevelopmentPeter Haase
 
Expanding Machine-Readable Access Methods for Collections
Expanding Machine-Readable Access Methods for CollectionsExpanding Machine-Readable Access Methods for Collections
Expanding Machine-Readable Access Methods for CollectionsAmerican Art Collaborative
 
Towards embedded Markup of Learning Resources on the Web
Towards embedded Markup of Learning Resources on the WebTowards embedded Markup of Learning Resources on the Web
Towards embedded Markup of Learning Resources on the WebStefan Dietze
 
Open Data - Principles and Techniques
Open Data - Principles and TechniquesOpen Data - Principles and Techniques
Open Data - Principles and TechniquesBernhard Haslhofer
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...Ian Foster
 
Linked Open Data in Romania
Linked Open Data in RomaniaLinked Open Data in Romania
Linked Open Data in RomaniaVlad Posea
 

Similar to The WARCnet Code Book of web archive data formats (20)

Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...
Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...
Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...
 
Analyzing Web Archives
Analyzing Web ArchivesAnalyzing Web Archives
Analyzing Web Archives
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
 
Alphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata Matters
Alphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata MattersAlphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata Matters
Alphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata Matters
 
Semantic Web (IS 535 presentation) by ITRL students Deborah Ratliff and Maril...
Semantic Web (IS 535 presentation) by ITRL students Deborah Ratliff and Maril...Semantic Web (IS 535 presentation) by ITRL students Deborah Ratliff and Maril...
Semantic Web (IS 535 presentation) by ITRL students Deborah Ratliff and Maril...
 
DataCite and its DOI infrastructure - IASSIST 2013
DataCite and its DOI infrastructure - IASSIST 2013DataCite and its DOI infrastructure - IASSIST 2013
DataCite and its DOI infrastructure - IASSIST 2013
 
Linked Data Basics
Linked Data BasicsLinked Data Basics
Linked Data Basics
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
 
Data mining Introduction
Data mining IntroductionData mining Introduction
Data mining Introduction
 
RDA-WDS Publishing Data Interest Group
RDA-WDS Publishing Data Interest GroupRDA-WDS Publishing Data Interest Group
RDA-WDS Publishing Data Interest Group
 
PIDs and DOI registration with DataCite - IATUL Workshop 2013
PIDs and DOI registration with DataCite - IATUL Workshop 2013PIDs and DOI registration with DataCite - IATUL Workshop 2013
PIDs and DOI registration with DataCite - IATUL Workshop 2013
 
Cloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application DevelopmentCloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application Development
 
Expanding Machine-Readable Access Methods for Collections
Expanding Machine-Readable Access Methods for CollectionsExpanding Machine-Readable Access Methods for Collections
Expanding Machine-Readable Access Methods for Collections
 
Towards embedded Markup of Learning Resources on the Web
Towards embedded Markup of Learning Resources on the WebTowards embedded Markup of Learning Resources on the Web
Towards embedded Markup of Learning Resources on the Web
 
Open Data - Principles and Techniques
Open Data - Principles and TechniquesOpen Data - Principles and Techniques
Open Data - Principles and Techniques
 
Dm1.1
Dm1.1Dm1.1
Dm1.1
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
 
Linked Open Data in Romania
Linked Open Data in RomaniaLinked Open Data in Romania
Linked Open Data in Romania
 

More from WARCnet

Gauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptxGauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptxWARCnet
 
Gauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptxGauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptxWARCnet
 
2022 Visit Royal Danish Library Ditte Laursen.pdf
2022 Visit Royal Danish Library Ditte Laursen.pdf2022 Visit Royal Danish Library Ditte Laursen.pdf
2022 Visit Royal Danish Library Ditte Laursen.pdfWARCnet
 
20221015 introduction to panel Ditte Laursen.pdf
20221015 introduction to panel  Ditte Laursen.pdf20221015 introduction to panel  Ditte Laursen.pdf
20221015 introduction to panel Ditte Laursen.pdfWARCnet
 
WARCnet_2022.pptx
WARCnet_2022.pptxWARCnet_2022.pptx
WARCnet_2022.pptxWARCnet
 
WARCnet conference - Mapping social media archiving initiatives.pptx
WARCnet conference - Mapping social media archiving initiatives.pptxWARCnet conference - Mapping social media archiving initiatives.pptx
WARCnet conference - Mapping social media archiving initiatives.pptxWARCnet
 
Warcnet 2022_final.pptx
Warcnet 2022_final.pptxWarcnet 2022_final.pptx
Warcnet 2022_final.pptxWARCnet
 
Maemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdf
Maemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdfMaemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdf
Maemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdfWARCnet
 
Hegarty-WARCNet2022-slides.pdf
Hegarty-WARCNet2022-slides.pdfHegarty-WARCNet2022-slides.pdf
Hegarty-WARCNet2022-slides.pdfWARCnet
 
20221018_Panel_Covid_WARCnet_closing_conference.pdf
20221018_Panel_Covid_WARCnet_closing_conference.pdf20221018_Panel_Covid_WARCnet_closing_conference.pdf
20221018_Panel_Covid_WARCnet_closing_conference.pdfWARCnet
 
Millward - We cannot put this off any longer - upload.pptx
Millward - We cannot put this off any longer - upload.pptxMillward - We cannot put this off any longer - upload.pptx
Millward - We cannot put this off any longer - upload.pptxWARCnet
 
Balbi_Keynote_AarhusWARCnet.pptx
Balbi_Keynote_AarhusWARCnet.pptxBalbi_Keynote_AarhusWARCnet.pptx
Balbi_Keynote_AarhusWARCnet.pptxWARCnet
 
Reporting from a Short-Term Network Stay at the BnF and INA
Reporting from a Short-Term Network Stay at the BnF and INAReporting from a Short-Term Network Stay at the BnF and INA
Reporting from a Short-Term Network Stay at the BnF and INAWARCnet
 
Post WARCnet
Post WARCnetPost WARCnet
Post WARCnetWARCnet
 
Web scraping using semi-automated browsing
 Web scraping using semi-automated browsing Web scraping using semi-automated browsing
Web scraping using semi-automated browsingWARCnet
 
Working Group 6 discussion
Working Group 6 discussionWorking Group 6 discussion
Working Group 6 discussionWARCnet
 
WG5: A data wrangling experiment
WG5: A data wrangling experimentWG5: A data wrangling experiment
WG5: A data wrangling experimentWARCnet
 
What’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collectionsWhat’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collectionsWARCnet
 
Working Group 2 on transnational events
Working Group 2 on transnational eventsWorking Group 2 on transnational events
Working Group 2 on transnational eventsWARCnet
 
Web Archive Research Skills and Tools Survey (WARST)
 Web Archive Research Skills and Tools Survey (WARST) Web Archive Research Skills and Tools Survey (WARST)
Web Archive Research Skills and Tools Survey (WARST)WARCnet
 

More from WARCnet (20)

Gauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptxGauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptx
 
Gauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptxGauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptx
 
2022 Visit Royal Danish Library Ditte Laursen.pdf
2022 Visit Royal Danish Library Ditte Laursen.pdf2022 Visit Royal Danish Library Ditte Laursen.pdf
2022 Visit Royal Danish Library Ditte Laursen.pdf
 
20221015 introduction to panel Ditte Laursen.pdf
20221015 introduction to panel  Ditte Laursen.pdf20221015 introduction to panel  Ditte Laursen.pdf
20221015 introduction to panel Ditte Laursen.pdf
 
WARCnet_2022.pptx
WARCnet_2022.pptxWARCnet_2022.pptx
WARCnet_2022.pptx
 
WARCnet conference - Mapping social media archiving initiatives.pptx
WARCnet conference - Mapping social media archiving initiatives.pptxWARCnet conference - Mapping social media archiving initiatives.pptx
WARCnet conference - Mapping social media archiving initiatives.pptx
 
Warcnet 2022_final.pptx
Warcnet 2022_final.pptxWarcnet 2022_final.pptx
Warcnet 2022_final.pptx
 
Maemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdf
Maemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdfMaemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdf
Maemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdf
 
Hegarty-WARCNet2022-slides.pdf
Hegarty-WARCNet2022-slides.pdfHegarty-WARCNet2022-slides.pdf
Hegarty-WARCNet2022-slides.pdf
 
20221018_Panel_Covid_WARCnet_closing_conference.pdf
20221018_Panel_Covid_WARCnet_closing_conference.pdf20221018_Panel_Covid_WARCnet_closing_conference.pdf
20221018_Panel_Covid_WARCnet_closing_conference.pdf
 
Millward - We cannot put this off any longer - upload.pptx
Millward - We cannot put this off any longer - upload.pptxMillward - We cannot put this off any longer - upload.pptx
Millward - We cannot put this off any longer - upload.pptx
 
Balbi_Keynote_AarhusWARCnet.pptx
Balbi_Keynote_AarhusWARCnet.pptxBalbi_Keynote_AarhusWARCnet.pptx
Balbi_Keynote_AarhusWARCnet.pptx
 
Reporting from a Short-Term Network Stay at the BnF and INA
Reporting from a Short-Term Network Stay at the BnF and INAReporting from a Short-Term Network Stay at the BnF and INA
Reporting from a Short-Term Network Stay at the BnF and INA
 
Post WARCnet
Post WARCnetPost WARCnet
Post WARCnet
 
Web scraping using semi-automated browsing
 Web scraping using semi-automated browsing Web scraping using semi-automated browsing
Web scraping using semi-automated browsing
 
Working Group 6 discussion
Working Group 6 discussionWorking Group 6 discussion
Working Group 6 discussion
 
WG5: A data wrangling experiment
WG5: A data wrangling experimentWG5: A data wrangling experiment
WG5: A data wrangling experiment
 
What’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collectionsWhat’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collections
 
Working Group 2 on transnational events
Working Group 2 on transnational eventsWorking Group 2 on transnational events
Working Group 2 on transnational events
 
Web Archive Research Skills and Tools Survey (WARST)
 Web Archive Research Skills and Tools Survey (WARST) Web Archive Research Skills and Tools Survey (WARST)
Web Archive Research Skills and Tools Survey (WARST)
 

Recently uploaded

Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 

Recently uploaded (20)

Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 

The WARCnet Code Book of web archive data formats

  • 1. WG5: The WARCnet Code Book of web archive data formats June 2022, London Sharon Healy, Karin de Wild, Niels Brügger, Peter Webster, Márton Németh, Vladimir Tybin
  • 2. The aim of Working Group 5 is to discuss and formulate possible data formats and thereby create a shared language between web archiving institutions and research communities. Projects: 1. A shared data vocabulary to request archived Web data 2. A glossary with terms used in web archive research (WG3)
  • 3. Project 1: Shared data vocabulary for requesting archived Web data
  • 4. Solr wayback The Solr wayback is a search engine that can retrieve data from WARC’s.
  • 5. Solr wayback The Solr wayback is a search engine that can retrieve data from WARC’s. • Free text search in all resources (HTML pages, PDFs, metadata for different media types, URLs, etc.) • CSV export of search results (with custom field selection). • Image search (similar to google images). • Visualization of search results such as: - Interactive network graph (ingoing/outgoing) - Statistics over time (e.g. size, number of in and out going links, etc)
  • 6. Ulrich Have (in an email when WG5 was established): “a standard data format would be interesting as a kind of future requirements document for researcher-ready-data”
  • 7. Niels Brügger, a systematic description of data formats for web archive studies
  • 8. Actions: Existing data vocabularies • Web archives • Controlled vocabularies (schema.org, Wiki data, CIDOC-CRM, Dublin Core, etc.) Datathons • Identify data requests • Identify terms / variables • Write / improve definitions
  • 9.
  • 11. Interoperability Using controlled vocabularies offers the potential to request and interlink (machine-readable) data across digital heritage collections.
  • 12. Request #1 CDX (listing of all the resources within the Web archive) • Domain • Host • Full resource URL • Crawl date • Hash • Resource format (PDF/html etc) • Link to instance in Wayback
  • 13. Request #2 Seeds and crawl policies • seed URL • crawl frequency (daily, weekly etc) • capped? (yes/no) • first crawl date • last crawl date (or ongoing) • crawl depth
  • 14. Request #3 Links • Source URL (full) • Source URL (host) • Source URL (domain) • Source File Format (.html etc) • Target URL (full) • Target Host • Target Domain • Capture Date • Link to source resource in Wayback
  • 16. Request #5 Metadata on a dataset • Web archive • Provenance (based on W3C-PROV)
  • 17. Next steps: Collect existing data vocabularies • Web archives • Controlled vocabularies (schema.org, Wiki data, CIDOC-CRM, Dublin Core, etc.) Datathons • Identify data requests • Identify terms / variables • Write / improve definitions
  • 18. Project 2: Glossary for Web archival studies
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. Next steps: • Add terms to the reference list in Zotero • Add definitions to the reference list in Zotero • Select terms for a glossary for early career researchers using Web archives