A researcher driven data description for the archived web: Why and how?

VERSITET
NIELS BRÜGGER
AARHUS
UNIVERSITY 4 NOVEMBER 2021
UNI
A RESEARCHER DRIVEN DATA
DESCRIPTION FOR THE ARCHIVED
WEB: WHY AND HOW?

AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
AGENDA
›the challenge
›the idea
›possible approach
›where to start?
›possible workplan
›a rough idea — to get us started
2

AARHUS
UNIVERSITY
NIELS BRÜGGER
4 NOVEMBER 2021
THE CHALLENGE
›a researcher approaches a web archive to have
archived web material extracted
›how to formulate her/his demand for what to have
extracted in a consistent and systematic way,
irrespective of how the web collection is organised or
labelled?
›based on my experience there exist no recognised
standards for doing this
›there exist various practices among researchers and
web archives, but not many reflections on a shared
vocabulary
3

AARHUS
UNIVERSITY
NIELS BRÜGGER
4 NOVEMBER 2021
THE IDEA
To start a debate about:
›whether a shared vocabulary for archived web data
is needed at all
›how this can be done (in case it is needed)
›and, eventually, what it could look like
›Ulrich Have, in an email when WG5 was established:
“a standard data format would be interesting as a
kind of future requirements document for researcher-
ready data”
4

AARHUS
UNIVERSITY
NIELS BRÜGGER
4 NOVEMBER 2021
POSSIBLE APPROACH
›not concerned with how different web archives
organise their collections, which data fields and
formats they use, how they label their data, etc.
›rather how we as a research community can
formulate our data needs in a consistent and
systematic way
›then it is up to the web archives to translate this to
whatever formats they may use
›should start with an existing research community,
WARCnet is a good candidate
5

AARHUS
UNIVERSITY
NIELS BRÜGGER
4 NOVEMBER 2021
POSSIBLE APPROACH
The situation now:
›loose, unsystematic, and uncoordinated descriptions
›no alignment between countries
›we need something more systematic to ensure the
best interoperability possible between material
extracted from different collections
›the seedlists we received in WG2 related to COVID-
19 collections illustrated this
6

AARHUS
UNIVERSITY
NIELS BRÜGGER
4 NOVEMBER 2021
7
Total size of data: 12,6MB
Immediate reflexion: it all looks quite nice, despite the differences, since most are in csv format, and all collections (except for INA) have one field in common: Exact URL.
File format Documentation Domain Name Exact URL Page title TLD Actor categories Curator Selected on date Status Frequency Theme Supplementary
archiving
Keyword Supplementary info URL history Country of
publication
Language Archiving date Twitter ID Tweet text Provenance**
DK xlsx URL/Link [named tabs] Dato ååååmmdd Webrecorder.io Noter DK
FR BnF csv word URL de d√©part Collecte Fréquence Thème Mots clés Informations descriptives compl√©mentaires
Historique des URL FR_BnF
IIPC xlsx word url Title Top-Level Domain Website Type Description Country of publication
Language IIPC
LUX*** xlsx Domains Seeds Title TLD Group Topic LUX
NL_ALL**** xls Webadres/url Naam website Selectiedatum Status Speciale webcollectie
NL xlsx url NL
UK csv field_url title author created field_crawl_frequency nid UK
FR INA JSON word part of JSON part of JSON part of JSON FR_INA
Hungary html URL NÉV [headings] HUN
-- and more mentioned in the file 'Summary of datasets'
* The data format suggested by me; in case another term is used for the same field in the original dataset, this term is mentioned in the table
** A new column with the acronym of each collection, to be added in each collection before merging them; allows us to back-track each entry in the dataset
*** The included columns only apply for the first sheet, 'Websites', for the other sheets only the exact URL is included (with no heading); in the compiled version of LUX sheet names were added in the column 'Actor categories'
**** All collected web domains of the archive
-2020 January February March April May June July Etc.
DK 13 March
FR BnF 1 February 31 July
IIPC ?
LUX ?
NL KB 29 February
UK 21 January
FR INA ?
Hungary ?
About
Datasets to Internal Datathon, WARCnet, WG2, 28 Jan 2021 (revised 18 Mar)
Data format*
Periods covered
Total size of data: 12,6MB
Immediate reflexion: it all looks quite nice, despite the differences, since most are in csv format, and all collections (except for INA) have one field in common: Exact URL.
File format Documentation Domain Name Exact URL Page title TLD Actor categories Curator Selected on date Status Frequency
DK xlsx URL/Link [named tabs] Dato ååååmmdd
FR BnF csv word URL de d√©part Collecte Fréquence
IIPC xlsx word url Title Top-Level Domain Website Type
LUX*** xlsx Domains Seeds Title TLD Group
NL_ALL**** xls Webadres/url Naam website Selectiedatum Status
NL xlsx url
UK csv field_url title author created field_crawl_freque
FR INA JSON word
Hungary html URL NÉV [headings]
-- and more mentioned in the file 'Summary of datasets'
* The data format suggested by me; in case another term is used for the same field in the original dataset, this term is mentioned in the table
** A new column with the acronym of each collection, to be added in each collection before merging them; allows us to back-track each entry in the dataset
*** The included columns only apply for the first sheet, 'Websites', for the other sheets only the exact URL is included (with no heading); in the compiled version of LUX sheet names were
**** All collected web domains of the archive
DK
FR BnF
IIPC
About
Datasets to Internal Datathon, WARCnet, WG2
ons(exceptforINA) haveonefieldincommon:ExactURL.
Actorcategories Curator Selectedondate Status Frequency Theme Supplementary
archiving
Keyword Supplementaryinfo URLhistory Countryof
publication
Language Archivingdate TwitterID Tweettext Provenance**
[namedtabs] Datoååååmmdd Webrecorder.io Noter DK
Collecte Fréquence Thème Mots clés Informationsdescriptives compl√©mentaires
HistoriquedesURL FR_BnF
WebsiteType Description Countryofpublication
Language IIPC
Group Topic LUX
Selectiedatum Status Specialewebcollectie
NL
author created field_crawl_frequency nid UK
partof JSON partofJSONpartof JSONFR_INA
[headings] HUN
erm is mentionedinthetable
DatasetstoInternalDatathon,WARCnet,WG2,28Jan2021(revised18Mar)
Dataformat*

AARHUS
UNIVERSITY
NIELS BRÜGGER
4 NOVEMBER 2021
WHERE TO START?
›start with researcher needs?
›start with existing controlled vocabularies (in the
broadest sense of the word)?
My suggestion:
›start with researchers reflecting on how their need for
data can be formulated, and then check existing
controlled vocabularies if they can be used, in part or
in total
8

AARHUS
UNIVERSITY
NIELS BRÜGGER
4 NOVEMBER 2021
POSSIBLE WORKPLAN
1. map which descriptors researchers would like to
have, if possible based on concrete research
projects
2. map what is already available out there (at web
archives, APIs, ISO standards, controlled
vocabularies, etc.)
3. discuss how the two match, in particular with the
peculiarities of the archived web and existing web
archives in mind
9

AARHUS
UNIVERSITY
NIELS BRÜGGER
4 NOVEMBER 2021
A ROUGH IDEA — TO GET US
STARTED
›first column is the name of the data field
›second column is description of the data
›following columns are possible 'translations' to
existing vocabularies
10

AARHUS
UNIVERSITY
NIELS BRÜGGER
4 NOVEMBER 2021
11
Examples of possible combinations:
link_source_domain,link_target_domain,link_count,link_start,link_end
› Reads: domain where link starts, domain where link points to, number of
links, date/time when first link identified, date/time when last link is
identified.
link_source_domain,link_target_URL,link_count
› Reads: same as above, but where links point to is now specified as
individual URLs and not domains.
files_size_domain,files_num_domain
› Reads: size of files per domain, number of files per domain

A researcher driven data description for the archived web: Why and how?

More Related Content

What's hot

Similar to A researcher driven data description for the archived web: Why and how?

More from WARCnet

Recently uploaded

A researcher driven data description for the archived web: Why and how?