2. AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
AGENDA
›the challenge
›the idea
›possible approach
›where to start?
›possible workplan
›a rough idea — to get us started
2
3. AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
THE CHALLENGE
›a researcher approaches a web archive to have
archived web material extracted
›how to formulate her/his demand for what to have
extracted in a consistent and systematic way,
irrespective of how the web collection is organised or
labelled?
›based on my experience there exist no recognised
standards for doing this
›there exist various practices among researchers and
web archives, but not many reflections on a shared
vocabulary
3
4. AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
THE IDEA
To start a debate about:
›whether a shared vocabulary for archived web data
is needed at all
›how this can be done (in case it is needed)
›and, eventually, what it could look like
›Ulrich Have, in an email when WG5 was established:
“a standard data format would be interesting as a
kind of future requirements document for researcher-
ready data”
4
5. AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
POSSIBLE APPROACH
›not concerned with how different web archives
organise their collections, which data fields and
formats they use, how they label their data, etc.
›rather how we as a research community can
formulate our data needs in a consistent and
systematic way
›then it is up to the web archives to translate this to
whatever formats they may use
›should start with an existing research community,
WARCnet is a good candidate
5
6. AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
POSSIBLE APPROACH
The situation now:
›loose, unsystematic, and uncoordinated descriptions
›no alignment between countries
›we need something more systematic to ensure the
best interoperability possible between material
extracted from different collections
›the seedlists we received in WG2 related to COVID-
19 collections illustrated this
6
8. AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
WHERE TO START?
›start with researcher needs?
›start with existing controlled vocabularies (in the
broadest sense of the word)?
My suggestion:
›start with researchers reflecting on how their need for
data can be formulated, and then check existing
controlled vocabularies if they can be used, in part or
in total
8
9. AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
POSSIBLE WORKPLAN
1. map which descriptors researchers would like to
have, if possible based on concrete research
projects
2. map what is already available out there (at web
archives, APIs, ISO standards, controlled
vocabularies, etc.)
3. discuss how the two match, in particular with the
peculiarities of the archived web and existing web
archives in mind
9
10. AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
A ROUGH IDEA — TO GET US
STARTED
›first column is the name of the data field
›second column is description of the data
›following columns are possible 'translations' to
existing vocabularies
10
11. AARHUS
UNIVERSITY
A DATA DESCRIPTION FOR THE ARCHIVED WEB: WHY AND HOW?
NIELS BRÜGGER
4 NOVEMBER 2021
11
Examples of possible combinations:
link_source_domain,link_target_domain,link_count,link_start,link_end
› Reads: domain where link starts, domain where link points to, number of
links, date/time when first link identified, date/time when last link is
identified.
link_source_domain,link_target_URL,link_count
› Reads: same as above, but where links point to is now specified as
individual URLs and not domains.
files_size_domain,files_num_domain
› Reads: size of files per domain, number of files per domain