The document summarizes a datathon conducted using various COVID-19 datasets from different European web archives. The goals were to 1) create a sandbox for exploring the data, 2) conduct initial analysis to see what could be achieved, and 3) document the process. Different institutions provided different types of datasets, including seedlists, tweets, and derived datasets. Challenges included restrictions on sharing raw data and representing large collections. Preliminary analysis identified potential research questions and ways to study web archives, collections, and the pandemic response.
1. Exploring COVID datasets through an internal Datathon
Aarhus 21 April 2021
Nicola Bingham (British Library)
Karin De Wild (Leiden University)
Susan Aasman (University of Groningen)
2. Introduction
Working Group 2 focuses on transnational events, both unforeseen and predictable.
Sub-projects included researching European Web Archive Collections on Covid.
Practical exercises: getting hands-on with the collections to look at usability.
Lockdown prevented travel, prompting requests for remote access for a datathon in January 2021.
Aims
To develop a test bed to evaluate what could be done with heterogeneous datasets
To create a transnational corpus and to explore various issues such as copyright, legal deposit,
tools and methods.
Essentially, we had three goals:
1- To create a sandbox for practical exploration of the data.
2- To conduct a first round of analysis to get tangible results of what could be achieved with the
data and how we could build a shared corpus.
3- To document the process of working with the datasets, with a view to feeding back to web
archiving institutions.
3. Where are the datasets? How did we ask for them? What did we ask for?
Access to collections
Access to collections varies between archiving institutions.
▪ Openly accessible datasets, e.g. Bibliothèque nationale du
Luxembourg
▪ Public API access, e.g. Arquivo.pt: https://github.com/arquivo/pwa-technologies/wiki
▪ Some archives e.g. the Royal Library of Denmark allow researchers
to access the datasets but with specific conditions and restrictions
on sharing the data.
▪ Pre-prepared datasets, e.g. the UK Web Archive has several secondary datasets available to download, such as a Geoindex of the JISC UK Web Domain Dataset (1996-2010).
What to ask for?
Not straightforward: the raw data? Derived datasets? Metadata?
4. Negotiations, concerns and issues from the archivists
HOW
▪ Using contacts and networks for requests
▪ Privileged relationships built over several years with these institutions, helped by the fact that they are also participants in the WARCnet project.
▪ Negotiating and refining requests for data: clarify what was needed and what
our intentions were.
▪ Enquiries by web archivists about our precise needs and research questions,
in order to try to select the relevant data.
AT THE ARCHIVE
▪ Process of sharing data is relatively underdeveloped in archiving institutions
▪ Lack of clarity on what can and cannot be done with the data
▪ Questions of format
▪ Legal Deposit Restrictions and accessibility of collections.
EXAMPLE: The UK Web Archive
▪ Unable to share raw WARC files
▪ Legal Deposit restrictions prevent this
▪ Difficulty in extracting a subset of thematically grouped WARCs.
EXAMPLE: IIPC CDG Collection
▪ No Legal Deposit, but…
▪ The Coronavirus collection was 3.6 TB, so it was challenging to find a sample that would be truly "representative" of the whole collection.
REQUESTS FROM ARCHIVISTS
▪ Deletion of metadata/seedlists after the datathon
▪ Documented outputs to be shared with the archiving institution or consortium that had put the seed lists together.
5. THE COLLECTIONS
▪ IIPC Content Development Group/Archive-it
▪ UK Web Archive
▪ Bibliothèque nationale de France
▪ Bibliothèque nationale du Luxembourg
▪ Det Kongelige Bibliotek | Royal Danish Library
▪ Koninklijke Bibliotheek | Nationale Bibliotheek van Nederland
▪ National Library of Hungary.
FORMAT OF THE DATA
With the exception of the dataset from INA (a selection of Tweet data in JSON format), all datasets were seedlists in Excel or CSV.
DOCUMENTATION
In some cases they were provided with minimal information, while in other
cases, such as that of the BnF, they arrived with substantial documentation and
contextual information (statistics, description of the whole COVID collection, etc)
STORAGE
Secure Dropbox folder
7. INA DATASET
▪ Focussed collection of hashtags containing the words “covid” or “vaccine”
▪ 61 Tweets extracted from a much larger dataset by INA
▪ Tweets collected through the Twitter public API in the JSON Lines format.
▪ Provides the actual content, the text of the Tweets, so different from the seedlists.
▪ The JSON Lines format gives access to all the metadata: timestamp, id, local info (attributes and tweet text).
▪ Good documentation + interview with INA staff.
▪ ISSUES: combining this dataset with the other seedlists
Example of a tweet in JSON format
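As a rough illustration of how tweets in JSON Lines format can be read, the following Python sketch parses one JSON object per line. The field names (id, created_at, text, entities.hashtags) are assumed to follow the Twitter API v1.1 tweet object, and the two sample records are invented for illustration; they are not taken from the INA dataset.

```python
import json

# Hypothetical two-line JSON Lines sample. The values are invented;
# the field names are assumed to follow the Twitter API v1.1 tweet object.
SAMPLE_JSONL = """\
{"id": 1, "created_at": "Mon Jan 04 10:00:00 +0000 2021", "text": "Stay safe #covid", "entities": {"hashtags": [{"text": "covid"}]}}
{"id": 2, "created_at": "Tue Jan 05 11:30:00 +0000 2021", "text": "First dose today #vaccine", "entities": {"hashtags": [{"text": "vaccine"}]}}
"""


def parse_tweets(jsonl_text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    tweets = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        hashtags = record.get("entities", {}).get("hashtags", [])
        tweets.append({
            "id": record.get("id"),
            "created_at": record.get("created_at"),
            "text": record.get("text", ""),
            "hashtags": [tag["text"].lower() for tag in hashtags],
        })
    return tweets


tweets = parse_tweets(SAMPLE_JSONL)
for tweet in tweets:
    print(tweet["id"], tweet["hashtags"], tweet["text"])
```

Because each line is an independent record, a large file can be processed line by line without loading it into memory in one go.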
12. Second-level domain names (SLD)
The IF function is used to keep the second-level domain names of a selection of websites:
=IF(COUNTIF(Lookup!A:A;D2);K2;"")
Argument:
• If the domain name is found within the Lookup table (“Lookup!A:A”);
• then give it the value in column K (“K2”);
• otherwise give it the empty value (“”).
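For readers working outside Excel, the same filter can be sketched in Python. This is only a rough counterpart of the IF/COUNTIF formula: the set selected_domains stands in for Lookup!A:A, the tuples stand in for columns D and K, and all sample values are hypothetical.

```python
def keep_if_selected(domain, sld, selected_domains):
    """Return the SLD when the domain is in the selection, else an empty string."""
    # Lowercasing imitates COUNTIF's case-insensitive matching.
    return sld if domain.strip().lower() in selected_domains else ""


# Hypothetical stand-ins for Lookup!A:A and for columns D (domain) and K (SLD).
selected_domains = {"example.org", "example.nl"}
rows = [("example.org", "example"), ("Example.NL", "example"), ("other.com", "other")]

kept = [keep_if_selected(domain, sld, selected_domains) for domain, sld in rows]
print(kept)
```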
13. Remove duplicates
Visual Basic Code (Developer > Visual Basic or shortcut “Alt+F11”):
Sub RemoveDuplicates()
    'UpdatebyExtendoffice20160918
    'Blanks every cell that repeats the value directly above it,
    'so the column should be sorted before running the macro.
    Dim xRow As Long
    Dim xCol As Long
    Dim xrg As Range
    Dim xl As Long
    On Error Resume Next
    Set xrg = Application.InputBox("Select a range:", "Kutools for Excel", _
        ActiveWindow.RangeSelection.AddressLocal, , , , , 8)
    If xrg Is Nothing Then Exit Sub 'the dialog was cancelled
    xRow = xrg.Rows.Count + xrg.Row - 1
    xCol = xrg.Column
    'MsgBox xRow & ":" & xCol
    Application.ScreenUpdating = False
    For xl = xRow To 2 Step -1
        If Cells(xl, xCol) = Cells(xl - 1, xCol) Then
            Cells(xl, xCol) = ""
        End If
    Next xl
    Application.ScreenUpdating = True
End Sub
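The macro's logic can be sketched in Python as follows. This is a hedged counterpart, not the code used in the datathon: like the macro, it only blanks a value when it equals the value directly above it, so it assumes the column has been sorted first. The function name and sample values are hypothetical.

```python
def blank_repeats(values):
    """Blank every value that equals the one directly above it."""
    result = list(values)
    # Walk bottom-up, as the macro does, so each comparison still sees
    # the original (not yet blanked) value in the row above.
    for i in range(len(result) - 1, 0, -1):
        if result[i] == result[i - 1]:
            result[i] = ""
    return result


column = ["bl.uk", "bl.uk", "bl.uk", "bnf.fr", "kb.nl", "kb.nl"]
print(blank_repeats(column))
```

On unsorted input, repeated values that are not adjacent are left in place, which matches the macro's behaviour.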
15. Top Level Domains (TLD)
To extract the top-level domain from the domain names:
=RIGHT(C2;LEN(C2)-SEARCH("$";SUBSTITUTE(C2;".";"$";LEN(C2)-LEN(SUBSTITUTE(C2;".";"")))))
Argument:
• Count the periods within the URL: LEN(C2)-LEN(SUBSTITUTE(C2;".";"")).
• Substitute the last period with a character that is rarely found within a URL, in this example "$": SUBSTITUTE(C2;".";"$";n), where n is the number of periods.
• Find the position of that character with SEARCH("$"; ...).
• The RIGHT() function extracts the characters after the "$", i.e. the top-level domain.
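The same extraction is shorter in a scripting language. The Python sketch below returns the text after the last period of a domain name; the function name is hypothetical, and the lowercasing and trimming are additions that the spreadsheet formula does not perform.

```python
def top_level_domain(domain):
    """Return the text after the last period of a domain name."""
    cleaned = domain.strip().rstrip(".").lower()
    if "." not in cleaned:
        return ""  # no period to split on
    # rsplit with maxsplit=1 splits once, at the last period.
    return cleaned.rsplit(".", 1)[1]


for name in ["www.bl.uk", "example.com", "localhost"]:
    print(name, "->", repr(top_level_domain(name)))
```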
16. Top Level Domains (TLD)
Top-level domains can give information about the intended use of the
website.
IANA (Internet Assigned Numbers Authority) groups:
• Generic top-level domains (gTLD), historically the generic domain
names that are now sponsored by designated organizations (.com).
• Country code top-level domains (ccTLD), generally used or reserved
for a specific country (.uk, .nl).
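A rough way to separate the two groups is to rely on the fact that country code top-level domains are two-letter codes. The Python sketch below applies only that heuristic; it does not consult the IANA list, it does not handle internationalized country code domains, and the function name and labels are hypothetical.

```python
def classify_tld(tld):
    """Heuristic: two ASCII letters means ccTLD, anything else is left as 'gTLD or other'."""
    tld = tld.strip().lstrip(".").lower()
    if not tld:
        return "unknown"
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        return "ccTLD"
    return "gTLD or other"


for tld in [".uk", "nl", "com", "org"]:
    print(tld, "->", classify_tld(tld))
```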
17. Geographical data
Data from Wikipedia was scraped and pasted into a new sheet tab named “Lookup”.
Remove unintended whitespace (this removes the first space in the cell):
=SUBSTITUTE($V4;" ";"";1)
Add the country to the TLD in the dataset:
=IF(INDEX(Lookup!Y:Y;MATCH($G2;Lookup!U:U;0))=0;"";INDEX(Lookup!Y:Y;MATCH($G2;Lookup!U:U;0)))
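The lookup step can be sketched in Python with a dictionary that plays the role of the “Lookup” sheet. The three entries below are a hypothetical excerpt, the keys are assumed to be stored without a leading period, and, unlike the INDEX/MATCH formula, an unmatched TLD returns an empty string instead of an error.

```python
# Hypothetical excerpt of the lookup table (TLD -> country).
TLD_TO_COUNTRY = {
    "uk": "United Kingdom",
    "nl": "Netherlands",
    "dk": "Denmark",
}


def country_for_tld(tld, lookup=TLD_TO_COUNTRY):
    """Return the country for a TLD, or an empty string when it is not listed."""
    key = tld.replace(" ", "").lstrip(".").lower()  # drop stray spaces and a leading period
    return lookup.get(key, "")


for tld in [" uk", ".nl", "com"]:
    print(repr(tld), "->", repr(country_for_tld(tld)))
```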
20. What can one study with these data?
A first step in exploring what is available, retrievable and searchable through European web archives
(1) Web archives archiving out of their ccTLD
(2) The types of actors
(3) New event-specific websites
21. (1) How can we create an entry point for a researcher into European COVID collections? Why might datasets be useful as a guide?
(2) Can this table highlight the different methods of creating COVID collections in European countries and, more generally, the practices of web archiving collections, as well as their noises and silences?
(3) From a cultural and governance perspective, could we combine web archiving institutions’ experience, governance and practices with the reality of the datasets we receive to demonstrate how web archives have politics?
Some preliminary conclusions with regards to the study of heritagization and
web archives, considering inclusiveness, values & practices
22. Other datasets and initiatives carried out by researchers on the Covid pandemic
▪ Twitter collection by Frédéric Clavert https://www.c2dh.uni.lu/data/covid19fr-un-pays-confine-sur-twitter
▪ News Media Tweet Dataset from Universitat Autonoma de Barcelona, https://arxiv.org/abs/2004.01791
▪ Archive-It Collections (https://archive-it.org/explore?q=COVID).
Further resources
▪ The COVID 19 Data portal, https://www.covid19dataportal.org
▪ A Journal of the Plague Year, https://covid-19archive.org/s/archive/collecting/item/2410
▪ The University of Southern California’s COVID tweet dataset, https://github.com/echen102/COVID-19-TweetIDs
▪ Geolocated tweets from QCRI, Qatar, https://crisisnlp.qcri.org/covid19
▪ Twitter covid19 stream, https://developer.twitter.com/en/docs/labs/covid19-stream/overview
23. And then finally, slide 23!
what type of research questions
did we start with?
24. Between data-driven science and research-driven questions
“If the question of the priority of the egg over the hen
or the hen over the egg troubles you, it is because
you assume that the animals were originally what
they are now. What madness!”
Denis Diderot, The Dream of d'Alembert, 1769 (our translation).
25. (1) Women, gender and COVID: what can be found within this collection (e.g. domestic violence, care and homeschooling)?
(2) How can we identify private lockdown journals, individual traces of daily life and other online expressions that give insight into how people deal with Covid in their everyday lives?
(3) Can we trace public support for, or opposition to, lockdown?
(4) How was the "school at home" debate conducted on the Web?
(5) How can we identify fake news, conspiracy theories and other Covid-related controversies within these big data?
(6) Is it possible to perform a visual analysis of what medical-scientific communication on Covid-19 looks like (and what types of visual communication are used, e.g. graphs, virus visuals and the many uses of color)?
(7) The pandemic seriously affected museums around the world and the Web became a prominent channel for their communication. How did museum websites evolve during the COVID-19 pandemic?
26. “The chicken is only an egg’s way for making another egg!” Richard Dawkins
Natural partners:
historians and
archivists