VERSITET
NIELS BRÜGGER
AARHUS
UNIVERSITY 4 NOVEMBER 2021
UNI
A RESEARCHER DRIVEN DATA
DESCRIPTION FOR THE ARCHIVED
WEB: WHY AND HOW?
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
AGENDA
›the challenge
›the idea
›possible approach
›where to start?
›possible workplan
›a rough idea — to get us started
2
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
THE CHALLENGE
›a researcher approaches a web archive to have
archived web material extracted
›how to formulate her/his demand for what to have
extracted in a consistent and systematic way,
irrespective of how the web collection is organised or
labelled?
›based on my experience there exist no recognised
standards for doing this
›there exist various practices among researchers and
web archives, but not many reflections on a shared
vocabulary
3
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
THE IDEA
To start a debate about:
›whether a shared vocabulary for archived web data
is needed at all
›how this can be done (in case it is needed)
›and, eventually, what it could look like
›Ulrich Have, in an email when WG5 was established:
“a standard data format would be interesting as a
kind of future requirements document for researcher-
ready data”
4
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
POSSIBLE APPROACH
›not concerned with how different web archives
organise their collections, which data fields and
formats they use, how they label their data, etc.
›rather how we as a research community can
formulate our data needs in a consistent and
systematic way
›then it is up to the web archives to translate this to
whatever formats they may use
›should start with an existing research community,
WARCnet is a good candidate
5
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
POSSIBLE APPROACH
The situation now:
›loose, unsystematic, and uncoordinated descriptions
›no alignment between countries
›we need something more systematic to ensure the
best interoperability possible between material
extracted from different collections
›the seedlists we received in WG2 related to COVID-
19 collections illustrated this
6
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
7
Total size of data: 12,6MB
Immediate reflexion: it all looks quite nice, despite the differences, since most are in csv format, and all collections (except for INA) have one field in common: Exact URL.
File format Documentation Domain Name Exact URL Page title TLD Actor categories Curator Selected on date Status Frequency Theme Supplementary
archiving
Keyword Supplementary info URL history Country of
publication
Language Archiving date Twitter ID Tweet text Provenance**
DK xlsx URL/Link [named tabs] Dato ååååmmdd Webrecorder.io Noter DK
FR BnF csv word URL de d√©part Collecte Fréquence Thème Mots clés Informations descriptives compl√©mentaires
Historique des URL FR_BnF
IIPC xlsx word url Title Top-Level Domain Website Type Description Country of publication
Language IIPC
LUX*** xlsx Domains Seeds Title TLD Group Topic LUX
NL_ALL**** xls Webadres/url Naam website Selectiedatum Status Speciale webcollectie
NL xlsx url NL
UK csv field_url title author created field_crawl_frequency nid UK
FR INA JSON word part of JSON part of JSON part of JSON FR_INA
Hungary html URL NÉV [headings] HUN
-- and more mentioned in the file 'Summary of datasets'
* The data format suggested by me; in case another term is used for the same field in the original dataset, this term is mentioned in the table
** A new column with the acronym of each collection, to be added in each collection before merging them; allows us to back-track each entry in the dataset
*** The included columns only apply for the first sheet, 'Websites', for the other sheets only the exact URL is included (with no heading); in the compiled version of LUX sheet names were added in the column 'Actor categories'
**** All collected web domains of the archive
-2020 January February March April May June July Etc.
DK 13 March
FR BnF 1 February 31 July
IIPC ?
LUX ?
NL KB 29 February
UK 21 January
FR INA ?
Hungary ?
About
Datasets to Internal Datathon, WARCnet, WG2, 28 Jan 2021 (revised 18 Mar)
Data format*
Periods covered
Total size of data: 12,6MB
Immediate reflexion: it all looks quite nice, despite the differences, since most are in csv format, and all collections (except for INA) have one field in common: Exact URL.
File format Documentation Domain Name Exact URL Page title TLD Actor categories Curator Selected on date Status Frequency
DK xlsx URL/Link [named tabs] Dato ååååmmdd
FR BnF csv word URL de d√©part Collecte Fréquence
IIPC xlsx word url Title Top-Level Domain Website Type
LUX*** xlsx Domains Seeds Title TLD Group
NL_ALL**** xls Webadres/url Naam website Selectiedatum Status
NL xlsx url
UK csv field_url title author created field_crawl_freque
FR INA JSON word
Hungary html URL NÉV [headings]
-- and more mentioned in the file 'Summary of datasets'
* The data format suggested by me; in case another term is used for the same field in the original dataset, this term is mentioned in the table
** A new column with the acronym of each collection, to be added in each collection before merging them; allows us to back-track each entry in the dataset
*** The included columns only apply for the first sheet, 'Websites', for the other sheets only the exact URL is included (with no heading); in the compiled version of LUX sheet names were
**** All collected web domains of the archive
DK
FR BnF
IIPC
About
Datasets to Internal Datathon, WARCnet, WG2
ons(exceptforINA) haveonefieldincommon:ExactURL.
Actorcategories Curator Selectedondate Status Frequency Theme Supplementary
archiving
Keyword Supplementaryinfo URLhistory Countryof
publication
Language Archivingdate TwitterID Tweettext Provenance**
[namedtabs] Datoååååmmdd Webrecorder.io Noter DK
Collecte Fréquence Thème Mots clés Informationsdescriptives compl√©mentaires
HistoriquedesURL FR_BnF
WebsiteType Description Countryofpublication
Language IIPC
Group Topic LUX
Selectiedatum Status Specialewebcollectie
NL
author created field_crawl_frequency nid UK
partof JSON partofJSONpartof JSONFR_INA
[headings] HUN
erm is mentionedinthetable
DatasetstoInternalDatathon,WARCnet,WG2,28Jan2021(revised18Mar)
Dataformat*
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
WHERE TO START?
›start with researcher needs?
›start with existing controlled vocabularies (in the
broadest sense of the word)?
My suggestion:
›start with researchers reflecting on how their need for
data can be formulated, and then check existing
controlled vocabularies if they can be used, in part or
in total
8
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
POSSIBLE WORKPLAN
1. map which descriptors researchers would like to
have, if possible based on concrete research
projects
2. map what is already available out there (at web
archives, APIs, ISO standards, controlled
vocabularies, etc.)
3. discuss how the two match, in particular with the
peculiarities of the archived web and existing web
archives in mind
9
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
A ROUGH IDEA — TO GET US
STARTED
›first column is the name of the data field
›second column is description of the data
›following columns are possible 'translations' to
existing vocabularies
10
AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
11
Examples of possible combinations:
link_source_domain,link_target_domain,link_count,link_start,link_end
› Reads: domain where link starts, domain where link points to, number of
links, date/time when first link identified, date/time when last link is
identified.
link_source_domain,link_target_URL,link_count
› Reads: same as above, but where links point to is now specified as
individual URLs and not domains.
files_size_domain,files_num_domain
› Reads: size of files per domain, number of files per domain
VERSITET
NIELS BRÜGGER
AARHUS
UNIVERSITY 4 NOVEMBER 2021
UNI
A RESEARCHER DRIVEN DATA
DESCRIPTION FOR THE ARCHIVED
WEB: WHY AND HOW?

A researcher driven data description for the archived web: Why and how?

  • 1.
    VERSITET NIELS BRÜGGER AARHUS UNIVERSITY 4NOVEMBER 2021 UNI A RESEARCHER DRIVEN DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
  • 2.
    AARHUS UNIVERSITY A DATA DESCRIPTIONFOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 AGENDA ›the challenge ›the idea ›possible approach ›where to start? ›possible workplan ›a rough idea — to get us started 2
  • 3.
    AARHUS UNIVERSITY A DATA DESCRIPTIONFOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 THE CHALLENGE ›a researcher approaches a web archive to have archived web material extracted ›how to formulate her/his demand for what to have extracted in a consistent and systematic way, irrespective of how the web collection is organised or labelled? ›based on my experience there exist no recognised standards for doing this ›there exist various practices among researchers and web archives, but not many reflections on a shared vocabulary 3
  • 4.
    AARHUS UNIVERSITY A DATA DESCRIPTIONFOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 THE IDEA To start a debate about: ›whether a shared vocabulary for archived web data is needed at all ›how this can be done (in case it is needed) ›and, eventually, what it could look like ›Ulrich Have, in an email when WG5 was established: “a standard data format would be interesting as a kind of future requirements document for researcher- ready data” 4
  • 5.
    AARHUS UNIVERSITY A DATA DESCRIPTIONFOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 POSSIBLE APPROACH ›not concerned with how different web archives organise their collections, which data fields and formats they use, how they label their data, etc. ›rather how we as a research community can formulate our data needs in a consistent and systematic way ›then it is up to the web archives to translate this to whatever formats they may use ›should start with an existing research community, WARCnet is a good candidate 5
  • 6.
    AARHUS UNIVERSITY A DATA DESCRIPTIONFOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 POSSIBLE APPROACH The situation now: ›loose, unsystematic, and uncoordinated descriptions ›no alignment between countries ›we need something more systematic to ensure the best interoperability possible between material extracted from different collections ›the seedlists we received in WG2 related to COVID- 19 collections illustrated this 6
  • 7.
    AARHUS UNIVERSITY A DATA DESCRIPTIONFOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 7 Total size of data: 12,6MB Immediate reflexion: it all looks quite nice, despite the differences, since most are in csv format, and all collections (except for INA) have one field in common: Exact URL. File format Documentation Domain Name Exact URL Page title TLD Actor categories Curator Selected on date Status Frequency Theme Supplementary archiving Keyword Supplementary info URL history Country of publication Language Archiving date Twitter ID Tweet text Provenance** DK xlsx URL/Link [named tabs] Dato ååååmmdd Webrecorder.io Noter DK FR BnF csv word URL de d√©part Collecte Fréquence Thème Mots clés Informations descriptives compl√©mentaires Historique des URL FR_BnF IIPC xlsx word url Title Top-Level Domain Website Type Description Country of publication Language IIPC LUX*** xlsx Domains Seeds Title TLD Group Topic LUX NL_ALL**** xls Webadres/url Naam website Selectiedatum Status Speciale webcollectie NL xlsx url NL UK csv field_url title author created field_crawl_frequency nid UK FR INA JSON word part of JSON part of JSON part of JSON FR_INA Hungary html URL NÉV [headings] HUN -- and more mentioned in the file 'Summary of datasets' * The data format suggested by me; in case another term is used for the same field in the original dataset, this term is mentioned in the table ** A new column with the acronym of each collection, to be added in each collection before merging them; allows us to back-track each entry in the dataset *** The included columns only apply for the first sheet, 'Websites', for the other sheets only the exact URL is included (with no heading); in the compiled version of LUX sheet names were added in the column 'Actor categories' **** All collected web domains of the archive -2020 January February March April May June July Etc. DK 13 March FR BnF 1 February 31 July IIPC ? LUX ? NL KB 29 February UK 21 January FR INA ? Hungary ? About Datasets to Internal Datathon, WARCnet, WG2, 28 Jan 2021 (revised 18 Mar) Data format* Periods covered Total size of data: 12,6MB Immediate reflexion: it all looks quite nice, despite the differences, since most are in csv format, and all collections (except for INA) have one field in common: Exact URL. File format Documentation Domain Name Exact URL Page title TLD Actor categories Curator Selected on date Status Frequency DK xlsx URL/Link [named tabs] Dato ååååmmdd FR BnF csv word URL de d√©part Collecte Fréquence IIPC xlsx word url Title Top-Level Domain Website Type LUX*** xlsx Domains Seeds Title TLD Group NL_ALL**** xls Webadres/url Naam website Selectiedatum Status NL xlsx url UK csv field_url title author created field_crawl_freque FR INA JSON word Hungary html URL NÉV [headings] -- and more mentioned in the file 'Summary of datasets' * The data format suggested by me; in case another term is used for the same field in the original dataset, this term is mentioned in the table ** A new column with the acronym of each collection, to be added in each collection before merging them; allows us to back-track each entry in the dataset *** The included columns only apply for the first sheet, 'Websites', for the other sheets only the exact URL is included (with no heading); in the compiled version of LUX sheet names were **** All collected web domains of the archive DK FR BnF IIPC About Datasets to Internal Datathon, WARCnet, WG2 ons(exceptforINA) haveonefieldincommon:ExactURL. Actorcategories Curator Selectedondate Status Frequency Theme Supplementary archiving Keyword Supplementaryinfo URLhistory Countryof publication Language Archivingdate TwitterID Tweettext Provenance** [namedtabs] Datoååååmmdd Webrecorder.io Noter DK Collecte Fréquence Thème Mots clés Informationsdescriptives compl√©mentaires HistoriquedesURL FR_BnF WebsiteType Description Countryofpublication Language IIPC Group Topic LUX Selectiedatum Status Specialewebcollectie NL author created field_crawl_frequency nid UK partof JSON partofJSONpartof JSONFR_INA [headings] HUN erm is mentionedinthetable DatasetstoInternalDatathon,WARCnet,WG2,28Jan2021(revised18Mar) Dataformat*
  • 8.
    AARHUS UNIVERSITY A DATA DESCRIPTIONFOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 WHERE TO START? ›start with researcher needs? ›start with existing controlled vocabularies (in the broadest sense of the word)? My suggestion: ›start with researchers reflecting on how their need for data can be formulated, and then check existing controlled vocabularies if they can be used, in part or in total 8
  • 9.
    AARHUS UNIVERSITY A DATA DESCRIPTIONFOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 POSSIBLE WORKPLAN 1. map which descriptors researchers would like to have, if possible based on concrete research projects 2. map what is already available out there (at web archives, APIs, ISO standards, controlled vocabularies, etc.) 3. discuss how the two match, in particular with the peculiarities of the archived web and existing web archives in mind 9
  • 10.
    AARHUS UNIVERSITY A DATA DESCRIPTIONFOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 A ROUGH IDEA — TO GET US STARTED ›first column is the name of the data field ›second column is description of the data ›following columns are possible 'translations' to existing vocabularies 10
  • 11.
    AARHUS UNIVERSITY A DATA DESCRIPTIONFOR THE ARCHIVED WEB: WHY AND HOW? NIELS BRÜGGER 4 NOVEMBER 2021 11 Examples of possible combinations: link_source_domain,link_target_domain,link_count,link_start,link_end › Reads: domain where link starts, domain where link points to, number of links, date/time when first link identified, date/time when last link is identified. link_source_domain,link_target_URL,link_count › Reads: same as above, but where links point to is now specified as individual URLs and not domains. files_size_domain,files_num_domain › Reads: size of files per domain, number of files per domain
  • 12.
    VERSITET NIELS BRÜGGER AARHUS UNIVERSITY 4NOVEMBER 2021 UNI A RESEARCHER DRIVEN DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?