SlideShare a Scribd company logo
1 of 12
VERSITET
NIELS BRÜGGER
AARHUS
UNIVERSITY 4 NOVEMBER 2021
UNI
A RESEARCHER DRIVEN DATA
DESCRIPTION FOR THE ARCHIVED
WEB: WHY AND HOW?
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
AGENDA
›the challenge
›the idea
›possible approach
›where to start?
›possible workplan
›a rough idea — to get us started
2
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
THE CHALLENGE
›a researcher approaches a web archive to have
archived web material extracted
›how to formulate her/his demand for what to have
extracted in a consistent and systematic way,
irrespective of how the web collection is organised or
labelled?
›based on my experience there exist no recognised
standards for doing this
›there exist various practices among researchers and
web archives, but not many reflections on a shared
vocabulary
3
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
THE IDEA
To start a debate about:
›whether a shared vocabulary for archived web data
is needed at all
›how this can be done (in case it is needed)
›and, eventually, what it could look like
›Ulrich Have, in an email when WG5 was established:
“a standard data format would be interesting as a
kind of future requirements document for researcher-
ready data”
4
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
POSSIBLE APPROACH
›not concerned with how different web archives
organise their collections, which data fields and
formats they use, how they label their data, etc.
›rather how we as a research community can
formulate our data needs in a consistent and
systematic way
›then it is up to the web archives to translate this to
whatever formats they may use
›should start with an existing research community,
WARCnet is a good candidate
5
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
POSSIBLE APPROACH
The situation now:
›loose, unsystematic, and uncoordinated descriptions
›no alignment between countries
›we need something more systematic to ensure the
best interoperability possible between material
extracted from different collections
›the seedlists we received in WG2 related to COVID-
19 collections illustrated this
6
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
7
Total size of data: 12,6MB
Immediate reflexion: it all looks quite nice, despite the differences, since most are in csv format, and all collections (except for INA) have one field in common: Exact URL.
File format Documentation Domain Name Exact URL Page title TLD Actor categories Curator Selected on date Status Frequency Theme Supplementary
archiving
Keyword Supplementary info URL history Country of
publication
Language Archiving date Twitter ID Tweet text Provenance**
DK xlsx URL/Link [named tabs] Dato ååååmmdd Webrecorder.io Noter DK
FR BnF csv word URL de d√©part Collecte Fréquence Thème Mots clés Informations descriptives compl√©mentaires
Historique des URL FR_BnF
IIPC xlsx word url Title Top-Level Domain Website Type Description Country of publication
Language IIPC
LUX*** xlsx Domains Seeds Title TLD Group Topic LUX
NL_ALL**** xls Webadres/url Naam website Selectiedatum Status Speciale webcollectie
NL xlsx url NL
UK csv field_url title author created field_crawl_frequency nid UK
FR INA JSON word part of JSON part of JSON part of JSON FR_INA
Hungary html URL NÉV [headings] HUN
-- and more mentioned in the file 'Summary of datasets'
* The data format suggested by me; in case another term is used for the same field in the original dataset, this term is mentioned in the table
** A new column with the acronym of each collection, to be added in each collection before merging them; allows us to back-track each entry in the dataset
*** The included columns only apply for the first sheet, 'Websites', for the other sheets only the exact URL is included (with no heading); in the compiled version of LUX sheet names were added in the column 'Actor categories'
**** All collected web domains of the archive
-2020 January February March April May June July Etc.
DK 13 March
FR BnF 1 February 31 July
IIPC ?
LUX ?
NL KB 29 February
UK 21 January
FR INA ?
Hungary ?
About
Datasets to Internal Datathon, WARCnet, WG2, 28 Jan 2021 (revised 18 Mar)
Data format*
Periods covered
Total size of data: 12,6MB
Immediate reflexion: it all looks quite nice, despite the differences, since most are in csv format, and all collections (except for INA) have one field in common: Exact URL.
File format Documentation Domain Name Exact URL Page title TLD Actor categories Curator Selected on date Status Frequency
DK xlsx URL/Link [named tabs] Dato ååååmmdd
FR BnF csv word URL de d√©part Collecte Fréquence
IIPC xlsx word url Title Top-Level Domain Website Type
LUX*** xlsx Domains Seeds Title TLD Group
NL_ALL**** xls Webadres/url Naam website Selectiedatum Status
NL xlsx url
UK csv field_url title author created field_crawl_freque
FR INA JSON word
Hungary html URL NÉV [headings]
-- and more mentioned in the file 'Summary of datasets'
* The data format suggested by me; in case another term is used for the same field in the original dataset, this term is mentioned in the table
** A new column with the acronym of each collection, to be added in each collection before merging them; allows us to back-track each entry in the dataset
*** The included columns only apply for the first sheet, 'Websites', for the other sheets only the exact URL is included (with no heading); in the compiled version of LUX sheet names were
**** All collected web domains of the archive
DK
FR BnF
IIPC
About
Datasets to Internal Datathon, WARCnet, WG2
ons(exceptforINA) haveonefieldincommon:ExactURL.
Actorcategories Curator Selectedondate Status Frequency Theme Supplementary
archiving
Keyword Supplementaryinfo URLhistory Countryof
publication
Language Archivingdate TwitterID Tweettext Provenance**
[namedtabs] Datoååååmmdd Webrecorder.io Noter DK
Collecte Fréquence Thème Mots clés Informationsdescriptives compl√©mentaires
HistoriquedesURL FR_BnF
WebsiteType Description Countryofpublication
Language IIPC
Group Topic LUX
Selectiedatum Status Specialewebcollectie
NL
author created field_crawl_frequency nid UK
partof JSON partofJSONpartof JSONFR_INA
[headings] HUN
erm is mentionedinthetable
DatasetstoInternalDatathon,WARCnet,WG2,28Jan2021(revised18Mar)
Dataformat*
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
WHERE TO START?
›start with researcher needs?
›start with existing controlled vocabularies (in the
broadest sense of the word)?
My suggestion:
›start with researchers reflecting on how their need for
data can be formulated, and then check existing
controlled vocabularies if they can be used, in part or
in total
8
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
POSSIBLE WORKPLAN
1. map which descriptors researchers would like to
have, if possible based on concrete research
projects
2. map what is already available out there (at web
archives, APIs, ISO standards, controlled
vocabularies, etc.)
3. discuss how the two match, in particular with the
peculiarities of the archived web and existing web
archives in mind
9
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
A ROUGH IDEA — TO GET US
STARTED
›first column is the name of the data field
›second column is description of the data
›following columns are possible 'translations' to
existing vocabularies
10
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
11
Examples of possible combinations:
link_source_domain,link_target_domain,link_count,link_start,link_end
› Reads: domain where link starts, domain where link points to, number of
links, date/time when first link identified, date/time when last link is
identified.
link_source_domain,link_target_URL,link_count
› Reads: same as above, but where links point to is now specified as
individual URLs and not domains.
files_size_domain,files_num_domain
› Reads: size of files per domain, number of files per domain
VERSITET
NIELS BRÜGGER
AARHUS
UNIVERSITY 4 NOVEMBER 2021
UNI
A RESEARCHER DRIVEN DATA
DESCRIPTION FOR THE ARCHIVED
WEB: WHY AND HOW?

More Related Content

What's hot

Tuesday 5 May: Definition and Representation of National Web Domains across W...
Tuesday 5 May: Definition and Representation of National Web Domains across W...Tuesday 5 May: Definition and Representation of National Web Domains across W...
Tuesday 5 May: Definition and Representation of National Web Domains across W...WARCnet
 
Tuesday 5 May: The Shapes of Archives and Memory, Helle Strandgaard Jensen
Tuesday 5 May: The Shapes of Archives and Memory, Helle Strandgaard JensenTuesday 5 May: The Shapes of Archives and Memory, Helle Strandgaard Jensen
Tuesday 5 May: The Shapes of Archives and Memory, Helle Strandgaard JensenWARCnet
 
From WG2 Datathon to AWAC2. Exploring IIPC special COVID collection thanks to...
From WG2 Datathon to AWAC2. Exploring IIPC special COVID collection thanks to...From WG2 Datathon to AWAC2. Exploring IIPC special COVID collection thanks to...
From WG2 Datathon to AWAC2. Exploring IIPC special COVID collection thanks to...WARCnet
 
Presentatie for "Studiemiddag Linked Data Archieven"
Presentatie for "Studiemiddag Linked Data Archieven"Presentatie for "Studiemiddag Linked Data Archieven"
Presentatie for "Studiemiddag Linked Data Archieven"Victor de Boer
 
lodlam summit session browsable linked data
lodlam summit session browsable linked datalodlam summit session browsable linked data
lodlam summit session browsable linked dataEnno Meijers
 
20170501 Distributed Network of Digital Heritage Information
20170501  Distributed Network of Digital Heritage Information20170501  Distributed Network of Digital Heritage Information
20170501 Distributed Network of Digital Heritage InformationEnno Meijers
 
DSpace for Cultural Heritage: adding support for images visualization,audio/v...
DSpace for Cultural Heritage: adding support for images visualization,audio/v...DSpace for Cultural Heritage: adding support for images visualization,audio/v...
DSpace for Cultural Heritage: adding support for images visualization,audio/v...Andrea Bollini
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGChris Ewing
 
COAR Venice 2017 Next Generation Repository Session: What can be done, right ...
COAR Venice 2017 Next Generation Repository Session: What can be done, right ...COAR Venice 2017 Next Generation Repository Session: What can be done, right ...
COAR Venice 2017 Next Generation Repository Session: What can be done, right ...Andrea Bollini
 
Methodological Guidelines for Publishing Linked Data
Methodological Guidelines for Publishing Linked DataMethodological Guidelines for Publishing Linked Data
Methodological Guidelines for Publishing Linked DataBoris Villazón-Terrazas
 
Flagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertierFlagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertierFlagis VZW
 
Registration / Certification Interoperability Architecture (overlay peer-review)
Registration / Certification Interoperability Architecture (overlay peer-review)Registration / Certification Interoperability Architecture (overlay peer-review)
Registration / Certification Interoperability Architecture (overlay peer-review)Herbert Van de Sompel
 
Europeana and Schema.org - DC2013
Europeana and Schema.org - DC2013Europeana and Schema.org - DC2013
Europeana and Schema.org - DC2013Antoine Isaac
 
Open Science Days 2014 - Becker - Repositories and Linked Data
Open Science Days 2014 - Becker - Repositories and Linked DataOpen Science Days 2014 - Becker - Repositories and Linked Data
Open Science Days 2014 - Becker - Repositories and Linked DataPascal-Nicolas Becker
 
Open Access of Research Data - The Present and Future Situation in Germany
Open Access of Research Data - The Present and Future Situation in GermanyOpen Access of Research Data - The Present and Future Situation in Germany
Open Access of Research Data - The Present and Future Situation in Germanyariadnenetwork
 
AddressingHistory - Crowdsourcing historical data and maps
AddressingHistory - Crowdsourcing historical data and mapsAddressingHistory - Crowdsourcing historical data and maps
AddressingHistory - Crowdsourcing historical data and mapsEDINA, University of Edinburgh
 

What's hot (20)

Tuesday 5 May: Definition and Representation of National Web Domains across W...
Tuesday 5 May: Definition and Representation of National Web Domains across W...Tuesday 5 May: Definition and Representation of National Web Domains across W...
Tuesday 5 May: Definition and Representation of National Web Domains across W...
 
Tuesday 5 May: The Shapes of Archives and Memory, Helle Strandgaard Jensen
Tuesday 5 May: The Shapes of Archives and Memory, Helle Strandgaard JensenTuesday 5 May: The Shapes of Archives and Memory, Helle Strandgaard Jensen
Tuesday 5 May: The Shapes of Archives and Memory, Helle Strandgaard Jensen
 
From WG2 Datathon to AWAC2. Exploring IIPC special COVID collection thanks to...
From WG2 Datathon to AWAC2. Exploring IIPC special COVID collection thanks to...From WG2 Datathon to AWAC2. Exploring IIPC special COVID collection thanks to...
From WG2 Datathon to AWAC2. Exploring IIPC special COVID collection thanks to...
 
Presentatie for "Studiemiddag Linked Data Archieven"
Presentatie for "Studiemiddag Linked Data Archieven"Presentatie for "Studiemiddag Linked Data Archieven"
Presentatie for "Studiemiddag Linked Data Archieven"
 
lodlam summit session browsable linked data
lodlam summit session browsable linked datalodlam summit session browsable linked data
lodlam summit session browsable linked data
 
20170501 Distributed Network of Digital Heritage Information
20170501  Distributed Network of Digital Heritage Information20170501  Distributed Network of Digital Heritage Information
20170501 Distributed Network of Digital Heritage Information
 
DSpace for Cultural Heritage: adding support for images visualization,audio/v...
DSpace for Cultural Heritage: adding support for images visualization,audio/v...DSpace for Cultural Heritage: adding support for images visualization,audio/v...
DSpace for Cultural Heritage: adding support for images visualization,audio/v...
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIG
 
Linking knowledge spaces
Linking knowledge spacesLinking knowledge spaces
Linking knowledge spaces
 
COAR Venice 2017 Next Generation Repository Session: What can be done, right ...
COAR Venice 2017 Next Generation Repository Session: What can be done, right ...COAR Venice 2017 Next Generation Repository Session: What can be done, right ...
COAR Venice 2017 Next Generation Repository Session: What can be done, right ...
 
Linked Data
Linked DataLinked Data
Linked Data
 
Methodological Guidelines for Publishing Linked Data
Methodological Guidelines for Publishing Linked DataMethodological Guidelines for Publishing Linked Data
Methodological Guidelines for Publishing Linked Data
 
Flagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertierFlagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertier
 
Registration / Certification Interoperability Architecture (overlay peer-review)
Registration / Certification Interoperability Architecture (overlay peer-review)Registration / Certification Interoperability Architecture (overlay peer-review)
Registration / Certification Interoperability Architecture (overlay peer-review)
 
Wikidata
WikidataWikidata
Wikidata
 
Europeana and Schema.org - DC2013
Europeana and Schema.org - DC2013Europeana and Schema.org - DC2013
Europeana and Schema.org - DC2013
 
Open Science Days 2014 - Becker - Repositories and Linked Data
Open Science Days 2014 - Becker - Repositories and Linked DataOpen Science Days 2014 - Becker - Repositories and Linked Data
Open Science Days 2014 - Becker - Repositories and Linked Data
 
Open Access of Research Data - The Present and Future Situation in Germany
Open Access of Research Data - The Present and Future Situation in GermanyOpen Access of Research Data - The Present and Future Situation in Germany
Open Access of Research Data - The Present and Future Situation in Germany
 
Linked data tooling XML
Linked data tooling XMLLinked data tooling XML
Linked data tooling XML
 
AddressingHistory - Crowdsourcing historical data and maps
AddressingHistory - Crowdsourcing historical data and mapsAddressingHistory - Crowdsourcing historical data and maps
AddressingHistory - Crowdsourcing historical data and maps
 

Similar to A researcher driven data description for the archived web: Why and how?

SWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebSWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebPascal-Nicolas Becker
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosEUCLID project
 
The web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedThe web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedSören Auer
 
RDTF Metadata Guidelines: an update
RDTF Metadata Guidelines: an updateRDTF Metadata Guidelines: an update
RDTF Metadata Guidelines: an updateAndy Powell
 
Web of science,Scopus,bibtex,latex
Web of science,Scopus,bibtex,latexWeb of science,Scopus,bibtex,latex
Web of science,Scopus,bibtex,latexKhushalRathore10
 
Ifla swsig meeting - Puerto Rico - 20110817
Ifla swsig meeting - Puerto Rico - 20110817Ifla swsig meeting - Puerto Rico - 20110817
Ifla swsig meeting - Puerto Rico - 20110817Figoblog
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAnkur Biswas
 
Elns and research data management case study of RSpace at the University of ...
Elns and research data management  case study of RSpace at the University of ...Elns and research data management  case study of RSpace at the University of ...
Elns and research data management case study of RSpace at the University of ...rmacneil88
 
The Missing Link-The Evolving Current State of Linked Data for Serials-Fallgren
The Missing Link-The Evolving Current State of Linked Data for Serials-FallgrenThe Missing Link-The Evolving Current State of Linked Data for Serials-Fallgren
The Missing Link-The Evolving Current State of Linked Data for Serials-FallgrenNASIG
 
Change Tracking in Knowledge Organization Systems with skos-history
Change Tracking in Knowledge Organization Systems with skos-historyChange Tracking in Knowledge Organization Systems with skos-history
Change Tracking in Knowledge Organization Systems with skos-historyJoachim Neubert
 
KOS evolution in Linked Data
KOS evolution in Linked DataKOS evolution in Linked Data
KOS evolution in Linked DataJoachim Neubert
 
Presentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membasePresentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membaseArdak Shalkarbayuli
 
Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Jane Stevenson
 
Making topic maps from Subject Headings for linking and organizing
Making topic maps from Subject Headings for linking and organizingMaking topic maps from Subject Headings for linking and organizing
Making topic maps from Subject Headings for linking and organizingLars Marius Garshol
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Anja Jentzsch
 
Illuminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data SupportIlluminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data SupportPascal-Nicolas Becker
 

Similar to A researcher driven data description for the archived web: Why and how? (20)

SWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebSWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic Web
 
20110728 datalift-rpi-troy
20110728 datalift-rpi-troy20110728 datalift-rpi-troy
20110728 datalift-rpi-troy
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
The web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedThe web of interlinked data and knowledge stripped
The web of interlinked data and knowledge stripped
 
RDTF Metadata Guidelines: an update
RDTF Metadata Guidelines: an updateRDTF Metadata Guidelines: an update
RDTF Metadata Guidelines: an update
 
Web of science,Scopus,bibtex,latex
Web of science,Scopus,bibtex,latexWeb of science,Scopus,bibtex,latex
Web of science,Scopus,bibtex,latex
 
Ifla swsig meeting - Puerto Rico - 20110817
Ifla swsig meeting - Puerto Rico - 20110817Ifla swsig meeting - Puerto Rico - 20110817
Ifla swsig meeting - Puerto Rico - 20110817
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web Technology
 
Spotlight
SpotlightSpotlight
Spotlight
 
Web of Data Usage Mining
Web of Data Usage MiningWeb of Data Usage Mining
Web of Data Usage Mining
 
Elns and research data management case study of RSpace at the University of ...
Elns and research data management  case study of RSpace at the University of ...Elns and research data management  case study of RSpace at the University of ...
Elns and research data management case study of RSpace at the University of ...
 
The Missing Link-The Evolving Current State of Linked Data for Serials-Fallgren
The Missing Link-The Evolving Current State of Linked Data for Serials-FallgrenThe Missing Link-The Evolving Current State of Linked Data for Serials-Fallgren
The Missing Link-The Evolving Current State of Linked Data for Serials-Fallgren
 
Change Tracking in Knowledge Organization Systems with skos-history
Change Tracking in Knowledge Organization Systems with skos-historyChange Tracking in Knowledge Organization Systems with skos-history
Change Tracking in Knowledge Organization Systems with skos-history
 
KOS evolution in Linked Data
KOS evolution in Linked DataKOS evolution in Linked Data
KOS evolution in Linked Data
 
Presentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membasePresentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membase
 
Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011
 
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early AdoptersApril 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
 
Making topic maps from Subject Headings for linking and organizing
Making topic maps from Subject Headings for linking and organizingMaking topic maps from Subject Headings for linking and organizing
Making topic maps from Subject Headings for linking and organizing
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
 
Illuminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data SupportIlluminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data Support
 

More from WARCnet

Gauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptxGauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptxWARCnet
 
Gauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptxGauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptxWARCnet
 
2022 Visit Royal Danish Library Ditte Laursen.pdf
2022 Visit Royal Danish Library Ditte Laursen.pdf2022 Visit Royal Danish Library Ditte Laursen.pdf
2022 Visit Royal Danish Library Ditte Laursen.pdfWARCnet
 
20221015 introduction to panel Ditte Laursen.pdf
20221015 introduction to panel  Ditte Laursen.pdf20221015 introduction to panel  Ditte Laursen.pdf
20221015 introduction to panel Ditte Laursen.pdfWARCnet
 
WARCnet_2022.pptx
WARCnet_2022.pptxWARCnet_2022.pptx
WARCnet_2022.pptxWARCnet
 
WARCnet conference - Mapping social media archiving initiatives.pptx
WARCnet conference - Mapping social media archiving initiatives.pptxWARCnet conference - Mapping social media archiving initiatives.pptx
WARCnet conference - Mapping social media archiving initiatives.pptxWARCnet
 
Warcnet 2022_final.pptx
Warcnet 2022_final.pptxWarcnet 2022_final.pptx
Warcnet 2022_final.pptxWARCnet
 
Maemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdf
Maemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdfMaemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdf
Maemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdfWARCnet
 
Hegarty-WARCNet2022-slides.pdf
Hegarty-WARCNet2022-slides.pdfHegarty-WARCNet2022-slides.pdf
Hegarty-WARCNet2022-slides.pdfWARCnet
 
20221018_Panel_Covid_WARCnet_closing_conference.pdf
20221018_Panel_Covid_WARCnet_closing_conference.pdf20221018_Panel_Covid_WARCnet_closing_conference.pdf
20221018_Panel_Covid_WARCnet_closing_conference.pdfWARCnet
 
Millward - We cannot put this off any longer - upload.pptx
Millward - We cannot put this off any longer - upload.pptxMillward - We cannot put this off any longer - upload.pptx
Millward - We cannot put this off any longer - upload.pptxWARCnet
 
Balbi_Keynote_AarhusWARCnet.pptx
Balbi_Keynote_AarhusWARCnet.pptxBalbi_Keynote_AarhusWARCnet.pptx
Balbi_Keynote_AarhusWARCnet.pptxWARCnet
 
Reporting from a Short-Term Network Stay at the BnF and INA
Reporting from a Short-Term Network Stay at the BnF and INAReporting from a Short-Term Network Stay at the BnF and INA
Reporting from a Short-Term Network Stay at the BnF and INAWARCnet
 
Post WARCnet
Post WARCnetPost WARCnet
Post WARCnetWARCnet
 
The WARCnet Code Book of web archive data formats
The WARCnet Code Book of web archive data formatsThe WARCnet Code Book of web archive data formats
The WARCnet Code Book of web archive data formatsWARCnet
 
Web scraping using semi-automated browsing
 Web scraping using semi-automated browsing Web scraping using semi-automated browsing
Web scraping using semi-automated browsingWARCnet
 
Working Group 6 discussion
Working Group 6 discussionWorking Group 6 discussion
Working Group 6 discussionWARCnet
 
What’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collectionsWhat’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collectionsWARCnet
 
Working Group 2 on transnational events
Working Group 2 on transnational eventsWorking Group 2 on transnational events
Working Group 2 on transnational eventsWARCnet
 
Whose Archives? Reflections on ethics and the cultural significance of web ar...
Whose Archives? Reflections on ethics and the cultural significance of web ar...Whose Archives? Reflections on ethics and the cultural significance of web ar...
Whose Archives? Reflections on ethics and the cultural significance of web ar...WARCnet
 

More from WARCnet (20)

Gauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptxGauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptx
 
Gauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptxGauditz & Kunze, Web archives as research data FINAL.pptx
Gauditz & Kunze, Web archives as research data FINAL.pptx
 
2022 Visit Royal Danish Library Ditte Laursen.pdf
2022 Visit Royal Danish Library Ditte Laursen.pdf2022 Visit Royal Danish Library Ditte Laursen.pdf
2022 Visit Royal Danish Library Ditte Laursen.pdf
 
20221015 introduction to panel Ditte Laursen.pdf
20221015 introduction to panel  Ditte Laursen.pdf20221015 introduction to panel  Ditte Laursen.pdf
20221015 introduction to panel Ditte Laursen.pdf
 
WARCnet_2022.pptx
WARCnet_2022.pptxWARCnet_2022.pptx
WARCnet_2022.pptx
 
WARCnet conference - Mapping social media archiving initiatives.pptx
WARCnet conference - Mapping social media archiving initiatives.pptxWARCnet conference - Mapping social media archiving initiatives.pptx
WARCnet conference - Mapping social media archiving initiatives.pptx
 
Warcnet 2022_final.pptx
Warcnet 2022_final.pptxWarcnet 2022_final.pptx
Warcnet 2022_final.pptx
 
Maemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdf
Maemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdfMaemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdf
Maemura_WARCnet_Developing Datasheets for Archived Web Datasets.pdf
 
Hegarty-WARCNet2022-slides.pdf
Hegarty-WARCNet2022-slides.pdfHegarty-WARCNet2022-slides.pdf
Hegarty-WARCNet2022-slides.pdf
 
20221018_Panel_Covid_WARCnet_closing_conference.pdf
20221018_Panel_Covid_WARCnet_closing_conference.pdf20221018_Panel_Covid_WARCnet_closing_conference.pdf
20221018_Panel_Covid_WARCnet_closing_conference.pdf
 
Millward - We cannot put this off any longer - upload.pptx
Millward - We cannot put this off any longer - upload.pptxMillward - We cannot put this off any longer - upload.pptx
Millward - We cannot put this off any longer - upload.pptx
 
Balbi_Keynote_AarhusWARCnet.pptx
Balbi_Keynote_AarhusWARCnet.pptxBalbi_Keynote_AarhusWARCnet.pptx
Balbi_Keynote_AarhusWARCnet.pptx
 
Reporting from a Short-Term Network Stay at the BnF and INA
Reporting from a Short-Term Network Stay at the BnF and INAReporting from a Short-Term Network Stay at the BnF and INA
Reporting from a Short-Term Network Stay at the BnF and INA
 
Post WARCnet
Post WARCnetPost WARCnet
Post WARCnet
 
The WARCnet Code Book of web archive data formats
The WARCnet Code Book of web archive data formatsThe WARCnet Code Book of web archive data formats
The WARCnet Code Book of web archive data formats
 
Web scraping using semi-automated browsing
 Web scraping using semi-automated browsing Web scraping using semi-automated browsing
Web scraping using semi-automated browsing
 
Working Group 6 discussion
Working Group 6 discussionWorking Group 6 discussion
Working Group 6 discussion
 
What’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collectionsWhat’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collections
 
Working Group 2 on transnational events
Working Group 2 on transnational eventsWorking Group 2 on transnational events
Working Group 2 on transnational events
 
Whose Archives? Reflections on ethics and the cultural significance of web ar...
Whose Archives? Reflections on ethics and the cultural significance of web ar...Whose Archives? Reflections on ethics and the cultural significance of web ar...
Whose Archives? Reflections on ethics and the cultural significance of web ar...
 

Recently uploaded

miladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptxmiladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptxCarrieButtitta
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸mathanramanathan2005
 
The Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationThe Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationNathan Young
 
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.comSaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.comsaastr
 
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...漢銘 謝
 
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxAnne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxnoorehahmad
 
James Joyce, Dubliners and Ulysses.ppt !
James Joyce, Dubliners and Ulysses.ppt !James Joyce, Dubliners and Ulysses.ppt !
James Joyce, Dubliners and Ulysses.ppt !risocarla2016
 
SBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSebastiano Panichella
 
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Krijn Poppe
 
Genshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptxGenshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptxJohnree4
 
call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@vikas rana
 
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSimulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSebastiano Panichella
 
PHYSICS PROJECT BY MSC - NANOTECHNOLOGY
PHYSICS PROJECT BY MSC  - NANOTECHNOLOGYPHYSICS PROJECT BY MSC  - NANOTECHNOLOGY
PHYSICS PROJECT BY MSC - NANOTECHNOLOGYpruthirajnayak525
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...marjmae69
 
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.KathleenAnnCordero2
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxmavinoikein
 
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxGenesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxFamilyWorshipCenterD
 
Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Escort Service
 
The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringSebastiano Panichella
 

Recently uploaded (20)

miladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptxmiladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptx
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸
 
The Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationThe Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism Presentation
 
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.comSaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
 
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
 
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxAnne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
 
James Joyce, Dubliners and Ulysses.ppt !
James Joyce, Dubliners and Ulysses.ppt !James Joyce, Dubliners and Ulysses.ppt !
James Joyce, Dubliners and Ulysses.ppt !
 
SBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation Track
 
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
 
Genshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptxGenshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptx
 
call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@
 
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSimulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
 
PHYSICS PROJECT BY MSC - NANOTECHNOLOGY
PHYSICS PROJECT BY MSC  - NANOTECHNOLOGYPHYSICS PROJECT BY MSC  - NANOTECHNOLOGY
PHYSICS PROJECT BY MSC - NANOTECHNOLOGY
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
 
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
 
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptx
 
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxGenesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
 
Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170
 
The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software Engineering
 

A researcher driven data description for the archived web: Why and how?

  • 1. VERSITET NIELS BRÜGGER AARHUS UNIVERSITY 4 NOVEMBER 2021 UNI A RESEARCHER DRIVEN DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
  • 2. AARHUS UNIVERSITY A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 AGENDA ›the challenge ›the idea ›possible approach ›where to start? ›possible workplan ›a rough idea — to get us started 2
  • 3. AARHUS UNIVERSITY A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 THE CHALLENGE ›a researcher approaches a web archive to have archived web material extracted ›how to formulate her/his demand for what to have extracted in a consistent and systematic way, irrespective of how the web collection is organised or labelled? ›based on my experience there exist no recognised standards for doing this ›there exist various practices among researchers and web archives, but not many reflections on a shared vocabulary 3
  • 4. AARHUS UNIVERSITY A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 THE IDEA To start a debate about: ›whether a shared vocabulary for archived web data is needed at all ›how this can be done (in case it is needed) ›and, eventually, what it could look like ›Ulrich Have, in an email when WG5 was established: “a standard data format would be interesting as a kind of future requirements document for researcher- ready data” 4
  • 5. AARHUS UNIVERSITY A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 POSSIBLE APPROACH ›not concerned with how different web archives organise their collections, which data fields and formats they use, how they label their data, etc. ›rather how we as a research community can formulate our data needs in a consistent and systematic way ›then it is up to the web archives to translate this to whatever formats they may use ›should start with an existing research community, WARCnet is a good candidate 5
  • 6. AARHUS UNIVERSITY A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 POSSIBLE APPROACH The situation now: ›loose, unsystematic, and uncoordinated descriptions ›no alignment between countries ›we need something more systematic to ensure the best interoperability possible between material extracted from different collections ›the seedlists we received in WG2 related to COVID- 19 collections illustrated this 6
  • 7. AARHUS UNIVERSITY A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 7 Total size of data: 12,6MB Immediate reflexion: it all looks quite nice, despite the differences, since most are in csv format, and all collections (except for INA) have one field in common: Exact URL. File format Documentation Domain Name Exact URL Page title TLD Actor categories Curator Selected on date Status Frequency Theme Supplementary archiving Keyword Supplementary info URL history Country of publication Language Archiving date Twitter ID Tweet text Provenance** DK xlsx URL/Link [named tabs] Dato ååååmmdd Webrecorder.io Noter DK FR BnF csv word URL de d√©part Collecte Fréquence Thème Mots clés Informations descriptives compl√©mentaires Historique des URL FR_BnF IIPC xlsx word url Title Top-Level Domain Website Type Description Country of publication Language IIPC LUX*** xlsx Domains Seeds Title TLD Group Topic LUX NL_ALL**** xls Webadres/url Naam website Selectiedatum Status Speciale webcollectie NL xlsx url NL UK csv field_url title author created field_crawl_frequency nid UK FR INA JSON word part of JSON part of JSON part of JSON FR_INA Hungary html URL NÉV [headings] HUN -- and more mentioned in the file 'Summary of datasets' * The data format suggested by me; in case another term is used for the same field in the original dataset, this term is mentioned in the table ** A new column with the acronym of each collection, to be added in each collection before merging them; allows us to back-track each entry in the dataset *** The included columns only apply for the first sheet, 'Websites', for the other sheets only the exact URL is included (with no heading); in the compiled version of LUX sheet names were added in the column 'Actor categories' **** All collected web domains of the archive -2020 January February March April May June July Etc. DK 13 March FR BnF 1 February 31 July IIPC ? LUX ? NL KB 29 February UK 21 January FR INA ? Hungary ? About Datasets to Internal Datathon, WARCnet, WG2, 28 Jan 2021 (revised 18 Mar) Data format* Periods covered Total size of data: 12,6MB Immediate reflexion: it all looks quite nice, despite the differences, since most are in csv format, and all collections (except for INA) have one field in common: Exact URL. File format Documentation Domain Name Exact URL Page title TLD Actor categories Curator Selected on date Status Frequency DK xlsx URL/Link [named tabs] Dato ååååmmdd FR BnF csv word URL de d√©part Collecte Fréquence IIPC xlsx word url Title Top-Level Domain Website Type LUX*** xlsx Domains Seeds Title TLD Group NL_ALL**** xls Webadres/url Naam website Selectiedatum Status NL xlsx url UK csv field_url title author created field_crawl_freque FR INA JSON word Hungary html URL NÉV [headings] -- and more mentioned in the file 'Summary of datasets' * The data format suggested by me; in case another term is used for the same field in the original dataset, this term is mentioned in the table ** A new column with the acronym of each collection, to be added in each collection before merging them; allows us to back-track each entry in the dataset *** The included columns only apply for the first sheet, 'Websites', for the other sheets only the exact URL is included (with no heading); in the compiled version of LUX sheet names were **** All collected web domains of the archive DK FR BnF IIPC About Datasets to Internal Datathon, WARCnet, WG2 ons(exceptforINA) haveonefieldincommon:ExactURL. Actorcategories Curator Selectedondate Status Frequency Theme Supplementary archiving Keyword Supplementaryinfo URLhistory Countryof publication Language Archivingdate TwitterID Tweettext Provenance** [namedtabs] Datoååååmmdd Webrecorder.io Noter DK Collecte Fréquence Thème Mots clés Informationsdescriptives compl√©mentaires HistoriquedesURL FR_BnF WebsiteType Description Countryofpublication Language IIPC Group Topic LUX Selectiedatum Status Specialewebcollectie NL author created field_crawl_frequency nid UK partof JSON partofJSONpartof JSONFR_INA [headings] HUN erm is mentionedinthetable DatasetstoInternalDatathon,WARCnet,WG2,28Jan2021(revised18Mar) Dataformat*
  • 8. AARHUS UNIVERSITY A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 WHERE TO START? ›start with researcher needs? ›start with existing controlled vocabularies (in the broadest sense of the word)? My suggestion: ›start with researchers reflecting on how their need for data can be formulated, and then check existing controlled vocabularies if they can be used, in part or in total 8
  • 9. AARHUS UNIVERSITY A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 POSSIBLE WORKPLAN 1. map which descriptors researchers would like to have, if possible based on concrete research projects 2. map what is already available out there (at web archives, APIs, ISO standards, controlled vocabularies, etc.) 3. discuss how the two match, in particular with the peculiarities of the archived web and existing web archives in mind 9
  • 10. AARHUS UNIVERSITY A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 A ROUGH IDEA — TO GET US STARTED ›first column is the name of the data field ›second column is description of the data ›following columns are possible 'translations' to existing vocabularies 10
  • 11. AARHUS UNIVERSITY A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 11 Examples of possible combinations: link_source_domain,link_target_domain,link_count,link_start,link_end › Reads: domain where link starts, domain where link points to, number of links, date/time when first link identified, date/time when last link is identified. link_source_domain,link_target_URL,link_count › Reads: same as above, but where links point to is now specified as individual URLs and not domains. files_size_domain,files_num_domain › Reads: size of files per domain, number of files per domain
  • 12. VERSITET NIELS BRÜGGER AARHUS UNIVERSITY 4 NOVEMBER 2021 UNI A RESEARCHER DRIVEN DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?