Exploring COVID datasets through an internal Datathon
Aarhus 21 April 2021
Nicola Bingham (British Library)
Karin De Wild (Leiden University)
Susan Aasman (University of Groningen)
Introduction
Working Group 2 focuses on transnational events, both unforeseen and predictable.
Sub-projects included researching European Web Archive Collections on Covid.
Practical exercises: getting hands-on with the collections to assess their usability.
Lockdown prevented travel, prompting requests for remote access for a datathon in January 2021.
Aims
To develop a test bed to evaluate what could be done with heterogeneous datasets
To create a transnational corpus and to explore various issues such as copyright, legal deposit,
tools and methods.
Essentially, we had three goals:
1- To create a sandbox for practical exploration of the data.
2- To conduct a first round of analysis to get tangible results of what could be achieved with the
data and how we could build a shared corpus.
3- To document the process of working with the datasets, with a view to feeding back to web
archiving institutions.
Where are the datasets? How did we ask for them? What did we ask for?
Access to collections
Access to collections varies between archiving institutions.
▪ Openly accessible datasets, e.g. Bibliothèque nationale du
Luxembourg
▪ Public API access, e.g. Arquivo.pt: https://github.com/arquivo/pwa-technologies/wikisome.
▪ Some archives e.g. the Royal Library of Denmark allow researchers
to access the datasets but with specific conditions and restrictions
on sharing the data.
▪ Pre-prepared datasets, e.g. the UK Web Archive has several
secondary datasets available to download e.g. a Geoindex of the
JISC UK Web Domain Dataset (1996-2010).
What to ask for?
Not straightforward – the raw data? Derived datasets? Metadata?
Negotiations, concerns and issues from the archivists
HOW
▪ Using contacts and networks for requests
▪ Privileged relationships built over several years with these institutions, helped by
the fact that they are also participants in the WARCnet project.
▪ Negotiating and refining requests for data: clarify what was needed and what
our intentions were.
▪ Enquiries by web archivists about our precise needs and research questions,
in order to try to select the relevant data.
AT THE ARCHIVE
▪ Process of sharing data is relatively underdeveloped in archiving institutions
▪ Lack of clarity on what can and cannot be done with the data
▪ Questions of format
▪ Legal Deposit Restrictions and accessibility of collections.
EXAMPLE: The UK Web Archive
▪ Unable to share raw WARC files
▪ Legal Deposit restrictions prevent this
▪ Difficulty in extracting a subset of thematically grouped WARCs.
EXAMPLE: IIPC CDG Collection
▪ No Legal Deposit, but…
▪ The Coronavirus collection was 3.6 TB, so it was challenging to find a sample
that would be truly "representative" of the whole collection.
REQUESTS FROM ARCHIVISTS
▪ Deletion of metadata/seedlists after the datathon
▪ Documented outputs to be shared with the archiving institution or consortium
that had put the seed lists together.
THE COLLECTIONS
▪ IIPC Content Development Group/Archive-It
▪ UK Web Archive
▪ Bibliothèque nationale de France
▪ Bibliothèque nationale du Luxembourg
▪ Det Kongelige Bibliotek | Royal Danish Library
▪ Koninklijke Bibliotheek | Nationale Bibliotheek van Nederland
▪ National Library of Hungary.
FORMAT OF THE DATA
With the exception of the dataset from INA (a selection of Tweet data in JSON
format), all datasets were seedlists in Excel or CSV format.
DOCUMENTATION
In some cases the datasets were provided with minimal information, while in other
cases, such as that of the BnF, they arrived with substantial documentation and
contextual information (statistics, description of the whole COVID collection, etc)
STORAGE
Secure Dropbox folder
INA DATASET
▪ Focussed collection of hashtags containing the words “covid” or “vaccine”
▪ 61 Tweets extracted from a much larger dataset by INA
▪ Tweets collected through the Twitter public API in the JSON Lines format.
▪ Provides the actual content, the text of the Tweets, so different from the seedlists.
▪ The JSON Lines format gives access to all the metadata: timestamp, id, local info (attributes and tweet text).
▪ Good documentation + interview with INA staff.
▪ ISSUES: combining this dataset with the other seedlists
Example of a tweet in JSON format
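As a rough illustration of how a JSON Lines file of tweets could be read, the Python sketch below parses one record per line and pulls out a few fields. The field names (`created_at`, `id_str`, `text`) follow the classic Twitter API v1.1 tweet object and are an assumption about the INA export; the two sample records are invented for illustration and are not taken from the dataset.

```python
import json

# Hypothetical sample: two invented records in JSON Lines form
# (one complete JSON object per line).
SAMPLE_JSONL = """\
{"created_at": "Mon Jan 11 10:00:00 +0000 2021", "id_str": "1", "text": "Stay safe #covid"}
{"created_at": "Mon Jan 11 10:05:00 +0000 2021", "id_str": "2", "text": "Booked my #vaccine slot"}
"""


def parse_jsonl(text):
    """Return one dict per non-empty line of a JSON Lines string."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def hashtags(tweet_text):
    """Extract lower-cased hashtags (without the '#') from the tweet text."""
    return [w[1:].lower() for w in tweet_text.split()
            if w.startswith("#") and len(w) > 1]


tweets = parse_jsonl(SAMPLE_JSONL)
for t in tweets:
    print(t["id_str"], t["created_at"], hashtags(t["text"]))
```

Because each line is an independent JSON object, a large file can also be processed line by line without loading it all into memory.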
Deconstruct the URL
https://www.acl.lu/en-us/news/voyages-loisirs/voyages-et-transports
Includes:
● Domain name (www.acl.lu)
● Second level, here called the second-level domain name (“en-us”); in standard URL terminology this is the first path segment
● Other levels (“news”, etc.)
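The same deconstruction can be sketched outside Excel. The Python snippet below uses the standard library's `urllib.parse.urlsplit` to separate the host name from the successive path levels; the function name `deconstruct` is hypothetical and this is only one possible way to do it, not the method used in the datathon.

```python
from urllib.parse import urlsplit


def deconstruct(url):
    """Split a URL into its host name and the list of path levels."""
    parts = urlsplit(url)
    # Drop empty segments produced by leading or trailing slashes.
    levels = [seg for seg in parts.path.split("/") if seg]
    return parts.hostname, levels


host, levels = deconstruct(
    "https://www.acl.lu/en-us/news/voyages-loisirs/voyages-et-transports"
)
print(host)    # www.acl.lu
print(levels)  # ['en-us', 'news', 'voyages-loisirs', 'voyages-et-transports']
```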
Second-level domain names (SLD)
The IF function is used to keep the second-level domain names of a selection of websites:
=IF(COUNTIF(Lookup!A:A;D2);K2;"")
Argument:
• If the domain name is found within the Lookup table (“Lookup!A:A”);
• then give it the value in column K (“K2”);
• otherwise give it the value “” (an empty string).
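The logic of this formula can be mirrored in a few lines of Python. This is a minimal sketch, assuming the lookup column is available as a set of domain names; the function name, the sample rows and the selected domains are all hypothetical. It reproduces the keep-or-blank logic only, not every detail of Excel's COUNTIF (which also accepts wildcards).

```python
def keep_if_selected(domain, value, selected_domains):
    """Return `value` when `domain` is in the lookup set, else an empty string."""
    # Compare in lower case, since domain names are case-insensitive.
    lookup = {d.lower() for d in selected_domains}
    return value if domain.lower() in lookup else ""


# Hypothetical rows: (domain name from column D, second level from column K).
rows = [("www.acl.lu", "en-us"), ("example.org", "blog")]
selected = {"www.acl.lu"}  # stand-in for the Lookup!A:A column

kept = [keep_if_selected(domain, level, selected) for domain, level in rows]
print(kept)  # ['en-us', '']
```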
Remove duplicates
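Visual Basic code (Developer > Visual Basic, or shortcut “Alt+F11”):
Sub RemoveDuplicates()
    'UpdatebyExtendoffice20160918
    'Blanks every cell that equals the cell directly above it,
    'so the selected column must be sorted before running the macro.
    Dim xRow As Long
    Dim xCol As Long
    Dim xrg As Range
    Dim xl As Long
    On Error Resume Next
    Set xrg = Application.InputBox("Select a range:", "Kutools for Excel", _
        ActiveWindow.RangeSelection.AddressLocal, , , , , 8)
    On Error GoTo 0
    If xrg Is Nothing Then Exit Sub 'the user cancelled the dialog
    xRow = xrg.Rows.Count + xrg.Row - 1 'last row of the selection
    xCol = xrg.Column
    'MsgBox xRow & ":" & xCol
    Application.ScreenUpdating = False
    'Work bottom-up so each comparison still sees the original value above
    For xl = xRow To 2 Step -1
        If Cells(xl, xCol) = Cells(xl - 1, xCol) Then
            Cells(xl, xCol) = ""
        End If
    Next xl
    Application.ScreenUpdating = True
End Sub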
Note: Each domain name now appears only once. The macro compares adjacent rows, so the column must be sorted first.
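For comparison, the same clean-up can be sketched in Python. Unlike the macro, this version remembers every value already seen, so it does not rely on the column being sorted; the function name and the sample values are hypothetical.

```python
def blank_repeats(values):
    """Keep the first occurrence of each value and blank out later repeats."""
    seen = set()
    result = []
    for value in values:
        if value in seen:
            result.append("")      # repeat: leave the cell empty
        else:
            seen.add(value)
            result.append(value)   # first occurrence: keep it
    return result


column = ["www.acl.lu", "www.acl.lu", "www.bnl.lu", "www.acl.lu"]
print(blank_repeats(column))  # ['www.acl.lu', '', 'www.bnl.lu', '']
```

Blanking rather than deleting keeps every row in place, which matches the macro's behaviour and preserves alignment with the other columns.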
Top Level Domains (TLD)
To extract the top-level domain from the domain names:
=RIGHT(C2;LEN(C2)-SEARCH("$";SUBSTITUTE(C2;".";"$";LEN(C2)-LEN(SUBSTITUTE(C2;".";"")))))
Argument:
• Count the periods in the domain name: LEN(C2)-LEN(SUBSTITUTE(C2;".";"")).
• Substitute the last period with a character that is rarely found in a URL, in this
example “$”: SUBSTITUTE(C2;".";"$";…).
• Find the position of that character: SEARCH("$";…).
• The RIGHT() function then extracts the characters after the “$”, i.e. the top-level domain.
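The steps above amount to "take everything after the last period". A minimal Python sketch of that idea follows; the function name is hypothetical, and where the Excel formula returns an error for a value without a period, this version returns an empty string instead.

```python
def top_level_domain(domain):
    """Return the text after the last period, or '' if there is no period."""
    _, separator, tld = domain.rpartition(".")
    return tld if separator else ""


print(top_level_domain("www.acl.lu"))  # lu
print(top_level_domain("www.bl.uk"))   # uk
```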
Top Level Domains (TLD)
Top-level domains can give information about the intended use of the
website.
IANA (Internet Assigned Numbers Authority) groups:
• Generic top-level domains (gTLD), historically the generic domain
names that are now sponsored by designated organizations (.com).
• Country code top-level domains (ccTLD), generally used or reserved
for a specific country (.uk, .nl).
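A first rough split between these groups can be automated, because two-letter ASCII top-level domains are reserved for country codes. The Python sketch below applies only that rule; the function name is hypothetical, and everything that is not a two-letter code is simply labelled "other" rather than being assigned to a specific IANA group.

```python
def tld_group(tld):
    """Label a TLD as 'ccTLD' if it is a two-letter ASCII code, else 'other'."""
    tld = tld.lower().lstrip(".")
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        return "ccTLD"
    return "other"


for example in [".uk", ".nl", ".com"]:
    print(example, tld_group(example))
```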
Geographical data
Data from Wikipedia was scraped and pasted into a new sheet tab named “Lookup”.
Remove unintended whitespaces:
=SUBSTITUTE($V4;" ";"";1).
Add the country to the TLD in the dataset:
=IF(INDEX(Lookup!Y:Y;MATCH($G2;Lookup!U:U;0))=0;"";INDEX(Lookup!Y:Y;MATCH($G2;Lookup!U:U;0)))
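The INDEX/MATCH lookup corresponds to a dictionary lookup in Python. The sketch below is illustrative only: the small mapping stands in for the “Lookup” sheet and lists a handful of country codes rather than the scraped Wikipedia table, and the function name is hypothetical. Where the Excel formula returns #N/A for a TLD that is missing from the lookup column, this version returns an empty string.

```python
# Hypothetical stand-in for the "Lookup" sheet (TLD column -> country column).
CCTLD_TO_COUNTRY = {
    "lu": "Luxembourg",
    "uk": "United Kingdom",
    "nl": "Netherlands",
    "fr": "France",
    "dk": "Denmark",
    "hu": "Hungary",
}


def country_for(tld):
    """Return the country for a TLD, or '' when the TLD is not in the lookup."""
    # Strip stray whitespace and a leading dot, then compare in lower case.
    key = tld.strip().lstrip(".").lower()
    return CCTLD_TO_COUNTRY.get(key, "")


print(country_for("lu"))     # Luxembourg
print(country_for(" .UK "))  # United Kingdom
```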
What can one study with these data?
A first step in exploring what is available, retrievable and searchable through European
web archives
(1) Web archives archiving out of their ccTLD
(2) The types of actors
(3) New event-specific websites
(1) How can we create an entry point for researchers into European COVID collections? How
might datasets help to guide them?
(2) Can this table highlight several methods of creating COVID collections in European
countries and more generally the practices of web archiving collections as well as their noises
and silences?
(3) From a cultural and governance perspective, could we combine web archiving
institutions’ experience, governance and practices with the reality of the datasets we get, to
demonstrate how web archives have politics?
Some preliminary conclusions with regards to the study of heritagization and
web archives, considering inclusiveness, values & practices
Other datasets and initiatives carried out by researchers on the Covid pandemic
▪ Twitter collection by Frédéric Clavert https://www.c2dh.uni.lu/data/covid19fr-un-pays-confine-sur-twitter
▪ News Media Tweet Dataset from Universitat Autonoma de Barcelona, https://arxiv.org/abs/2004.01791
▪ Archive-It Collections (https://archive-it.org/explore?q=COVID).
Further resources
▪ The COVID 19 Data portal, https://www.covid19dataportal.org
▪ A Journal of the Plague Year, https://covid-19archive.org/s/archive/collecting/item/2410
▪ The University of Southern California’s COVID tweet dataset, https://github.com/echen102/COVID-19-TweetIDs
▪ Geolocated tweets from QCRI, Qatar, https://crisisnlp.qcri.org/covid19
▪ Twitter covid19 stream, https://developer.twitter.com/en/docs/labs/covid19-stream/overview
And then finally, slide 23!
What type of research questions did we start with?
Between data-driven science and research-driven questions
“If the question of the priority of the egg over the hen
or the hen over the egg troubles you, it is because
you assume that the animals were originally what
they are now. What madness!”
Denis Diderot, The Dream of d'Alembert, 1769 (our translation).
(1) Women, Gender and COVID within this collection (e.g., domestic violence, care and homeschooling, etc.)?
(2) How to identify private journals of lockdowns, individual traces of daily life, different online expressions that
give insight into the ways people deal with Covid in their everyday life?
(3) Can we trace public support for, or opposition to, lockdown?
(4) How was the "school at home" debate conducted on the Web?
(5) How to identify fake news, conspiracy theories and other covid-related controversies within these big data?
(6) Is it possible to perform a visual analysis of what medical-scientific communication on Covid-19 looks
like (and what types of visual communication are used, e.g. graphs, virus visuals and the many types of color)?
(7) The pandemic seriously affected museums around the world and the Web became a prominent channel for
their communication. How did museum websites evolve during the COVID-19 pandemic?
“The chicken is only an egg’s
way for making another
egg”!, Richard Dawkins
Natural partners:
historians and
archivists

Bingham, De Wild & Aasman Presentation

  • 1. Exploring COVID datasets through an internal Datathon Aarhus 21 April 2021 Nicola Bingham (British Library) Karin De Wild (Leiden University) Susan Aasman (University of Groningen)
  • 2. Introduction Working Group 2 focuses on transnational events; unforeseen and predictable. Sub-projects included researching European Web Archive Collections on Covid. Practical exercises getting hands on with Collections to look at usability. Lock down prevented travel prompting requests for remote access in a datathon January 2021 Aims To develop a test bed to evaluate what could be done with heterogeneous datasets To create a transnational corpus and to explore various issues such as copyright, legal deposit, tools and methods. Essentially, we had three goals: 1- To create a sandbox for practical exploration of the data. 2- To conduct a first round of analysis to get tangible results of what could be achieved with the data and how we could build a shared corpus. 3- To document the process of working with the datasets, with a view to feeding back to web archiving institutions.
  • 3. Where are the datasets? How did we ask for them? What did we ask for? Access to collections Access to collections varies between archiving institution. ▪ Openly accessible datasets, e.g. Bibliothèque nationale du Luxembourg ▪ Public API access, e.g. Arquivo.pt: https://github.com/arquivo/pwa- technologies/wikisome. ▪ Some archives e.g. the Royal Library of Denmark allow researchers to access the datasets but with specific conditions and restrictions on sharing the data. ▪ Pre prepared datasets e.g. the UK Web Archive has several secondary datasets available to download e.g. a Geoindex of the JISC UK Web Domain Dataset (1996-2010). What to ask for? Not straight forward – the raw data? derived datasets? Metadata?
  • 4. Negotiations, concerns and issues from the archivists HOW ▪ Using contacts and networks for requests ▪ Privileged relations created for several years with these institutions, the fact that they are also participants in the WARCnet project. ▪ Negotiating and refining requests for data: clarify what was needed and what our intentions were. ▪ Enquiries by web archivists about our precise needs and research questions, in order to try to select the relevant data. AT THE ARCHIVE ▪ Process of sharing data is relatively underdeveloped in archiving institutions ▪ Lack of clarity on what can and cannot be done with the data ▪ Questions of format ▪ Legal Deposit Restrictions and accessibility of collections. EXAMPLE: The UK Web Archive ▪ Unable to share raw WARC files ▪ Legal Deposit restrictions prevent this ▪ Difficulty in extrapolating a subset of thematically grouped WARCs. EXAMPLE: IIPC CDG Collection ▪ No Legal Deposit, but… ▪ The Coronavirus collection was 3.6TB therefore challenging to find a sample that would be truly "representative" of the whole collection. REQUESTS FROM ARCHIVISTS ▪ Deletion of metadata/seedlists after the datathon ▪ Documented outputs to be shared with the archiving institution/or consortium that had put the seed lists together.
  • 5. THE COLLECTIONS
    ▪ IIPC Content Development Group/Archive-It
    ▪ UK Web Archive
    ▪ Bibliothèque nationale de France
    ▪ Bibliothèque nationale du Luxembourg
    ▪ Det Kongelige Bibliotek | Royal Danish Library
    ▪ Koninklijke Bibliotheek | Nationale Bibliotheek van Nederland
    ▪ National Library of Hungary.
    FORMAT OF THE DATA
    With the exception of the dataset from INA (a selection of Tweet data in JSON format), all datasets were seedlists in Excel or CSV.
    DOCUMENTATION
    In some cases the datasets were provided with minimal information, while in other cases, such as that of the BnF, they arrived with substantial documentation and contextual information (statistics, description of the whole COVID collection, etc.).
    STORAGE
    Secure Dropbox folder
  • 7. INA DATASET
    ▪ Focussed collection of hashtags containing the words “covid” or “vaccine”
    ▪ 61 Tweets extracted from a much larger dataset by INA
    ▪ Tweets collected through the Twitter public API in the JSON Lines format.
    ▪ Provides the actual content, the text of the Tweets, and so differs from the seedlists.
    ▪ The JSON Lines format gives access to all the metadata: timestamp, id, local info (attributes and tweet text).
    ▪ Good documentation + interview with INA staff.
    ▪ ISSUES: combining this dataset with the other seedlists
    [Figure: example of a tweet in JSON format]
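The JSON Lines format mentioned on this slide stores one JSON object per line, so a file can be read record by record. The following is a minimal sketch in Python; the field names (`id`, `created_at`, `text`) and the two sample tweets are hypothetical and are not taken from the INA dataset.

```python
import json

def read_json_lines(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records

# Hypothetical sample; the field names are illustrative, not INA's schema.
sample = (
    '{"id": "1", "created_at": "2020-03-17T12:00:00Z", "text": "Stay home #covid"}\n'
    '{"id": "2", "created_at": "2020-12-27T09:30:00Z", "text": "First #vaccine doses"}\n'
)

tweets = read_json_lines(sample)
# Keep only tweets whose text mentions the hashtag of interest.
covid_tweets = [t for t in tweets if "#covid" in t["text"].lower()]
print(len(tweets), len(covid_tweets))
```

Because each tweet carries its own text and metadata, this structure differs from a seedlist, which only holds URLs; combining the two would probably need a shared key such as a date or a domain.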
  • 10. Deconstruct the URL
    https://www.acl.lu/en-us/news/voyages-loisirs/voyages-et-transports
    Includes:
    ● Domain name (www.acl.lu)
    ● Second level (“en-us”): the first segment of the path. (In DNS terms, the second-level domain of www.acl.lu is “acl”.)
    ● Other levels (“news”, etc.)
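The same decomposition can be sketched outside Excel. The example below uses Python's standard `urllib.parse` module; the function name `deconstruct_url` is our own and only illustrates the idea of splitting a seed URL into its domain name and path levels.

```python
from urllib.parse import urlparse

def deconstruct_url(url):
    """Split a URL into its domain name and its list of path levels."""
    parsed = urlparse(url)
    # Drop empty segments produced by leading or trailing slashes.
    levels = [segment for segment in parsed.path.split("/") if segment]
    return parsed.netloc, levels

domain, levels = deconstruct_url(
    "https://www.acl.lu/en-us/news/voyages-loisirs/voyages-et-transports"
)
print(domain)     # www.acl.lu
print(levels[0])  # en-us
print(levels[1:]) # ['news', 'voyages-loisirs', 'voyages-et-transports']
```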
  • 12. Second-level domain names (SLD)
    The IF function is used to keep the second-level domain names of a selection of websites:
    =IF(COUNTIF(Lookup!A:A;D2);K2;"")
    Argument:
    • If the domain name in D2 is found within the Lookup table (“Lookup!A:A”);
    • then give it the value in column K (“K2”);
    • otherwise give it the empty value “”.
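The IF/COUNTIF logic can be expressed as a membership test. This is a sketch only: the row values and the lookup list are hypothetical, and unlike COUNTIF the comparison here is case-sensitive and does not interpret wildcards.

```python
def keep_if_listed(rows, lookup):
    """For each (domain, second_level) row, keep second_level only when
    the domain appears in the lookup list; otherwise return ""."""
    lookup_set = set(lookup)  # set membership is fast for long lists
    return [second if domain in lookup_set else "" for domain, second in rows]

# Hypothetical rows: (column D, column K) pairs.
rows = [
    ("www.acl.lu", "en-us"),
    ("www.example.org", "blog"),
]
selected_sites = ["www.acl.lu"]  # stands in for Lookup!A:A

print(keep_if_listed(rows, selected_sites))  # ['en-us', '']
```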
  • 13. Remove duplicates
    Visual Basic code (Developer > Visual Basic, or shortcut “Alt+F11”):
      Sub RemoveDuplicates()
          'UpdatebyExtendoffice20160918
          'Blanks every cell that repeats the value directly above it,
          'so the column should be sorted before running the macro.
          Dim xRow As Long
          Dim xCol As Long
          Dim xrg As Range
          Dim xl As Long
          On Error Resume Next
          Set xrg = Application.InputBox("Select a range:", "Kutools for Excel", _
              ActiveWindow.RangeSelection.AddressLocal, , , , , 8)
          On Error GoTo 0
          If xrg Is Nothing Then Exit Sub 'the user cancelled the dialog
          xRow = xrg.Rows.Count + xrg.Row - 1 'last row of the selection
          xCol = xrg.Column
          'MsgBox xRow & ":" & xCol
          Application.ScreenUpdating = False
          'Work from the bottom up so each cell is compared with an unchanged cell above it.
          For xl = xRow To 2 Step -1
              If Cells(xl, xCol) = Cells(xl - 1, xCol) Then
                  Cells(xl, xCol) = ""
              End If
          Next xl
          Application.ScreenUpdating = True
      End Sub
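For readers working outside Excel, the behaviour of the macro can be sketched in Python. The function name and the sample values are our own; like the macro, the sketch only blanks a value that repeats the one directly before it, so it assumes the list is already sorted.

```python
def blank_adjacent_duplicates(values):
    """Return a copy of values in which every item equal to the item
    directly before it is replaced by an empty string."""
    result = list(values)
    # Walk from the bottom up, as the macro does, so each item is
    # compared with a neighbour that has not been blanked yet.
    for i in range(len(result) - 1, 0, -1):
        if result[i] == result[i - 1]:
            result[i] = ""
    return result

# Hypothetical sorted column of domain names.
domains = ["acl.lu", "acl.lu", "bl.uk", "bl.uk", "bl.uk", "kb.nl"]
print(blank_adjacent_duplicates(domains))
# ['acl.lu', '', 'bl.uk', '', '', 'kb.nl']
```

On unsorted data, repeated values that are not adjacent are left in place, which is why sorting first matters.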
  • 14. Note: Each domain name now appears only once.
  • 15. Top Level Domains (TLD)
    To extract the top-level domain from the domain names:
    =RIGHT(C2;LEN(C2)-SEARCH("$";SUBSTITUTE(C2;".";"$";LEN(C2)-LEN(SUBSTITUTE(C2;".";"")))))
    Argument:
    • Count the periods in the domain name: LEN(C2)-LEN(SUBSTITUTE(C2;".";"")).
    • Substitute the last period with a character that is rarely found in a URL, in this example “$”: SUBSTITUTE(C2;".";"$";n), where n is the number of periods.
    • Find the position of that character: SEARCH("$";…).
    • The RIGHT() function extracts the characters after the “$”.
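The formula keeps everything after the last period of the domain name. A minimal Python sketch of the same idea follows; the function name is our own, and returning an empty string for a name without a period is our assumption (the Excel formula would return an error in that case).

```python
def top_level_domain(domain):
    """Return the text after the last period of a domain name,
    or "" when the name contains no period."""
    host = domain.strip().rstrip(".")  # tolerate spaces and a trailing dot
    if "." not in host:
        return ""
    # rsplit with maxsplit=1 cuts at the last period only.
    return host.rsplit(".", 1)[1].lower()

print(top_level_domain("www.acl.lu"))  # lu
print(top_level_domain("www.bl.uk"))   # uk
```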
  • 16. Top Level Domains (TLD)
    Top-level domains can indicate the intended use or origin of a website. IANA (Internet Assigned Numbers Authority) groups include:
    • Generic top-level domains (gTLD), which are not tied to a country (.com); some of them are sponsored by designated organizations.
    • Country code top-level domains (ccTLD), generally used or reserved for a specific country (.uk, .nl).
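  • 17. Geographical data
    Data from Wikipedia was scraped and pasted into a new sheet tab named “Lookup”.
    Remove unintended whitespace (this removes the first space in the cell):
    =SUBSTITUTE($V4;" ";"";1)
    Add the country to the TLD in the dataset:
    =IF(INDEX(Lookup!Y:Y;MATCH($G2;Lookup!U:U;0))=0;"";INDEX(Lookup!Y:Y;MATCH($G2;Lookup!U:U;0)))
    The formula looks up the TLD in G2 in column U of the Lookup sheet and returns the matching entry from column Y, or an empty string if that entry is blank.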
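The INDEX/MATCH lookup corresponds to a dictionary lookup. The sketch below is illustrative: the three-entry table stands in for the scraped Wikipedia data, the stray spaces are hypothetical, and the helper names are our own.

```python
def clean_key(raw):
    """Normalise a TLD cell: trim whitespace, drop the leading dot, lowercase."""
    return raw.strip().lstrip(".").lower()

def country_for_tld(tld, lookup):
    """Return the country for a TLD, or "" when the TLD is not in the table."""
    return lookup.get(clean_key(tld), "")

# Hypothetical scraped rows with unintended whitespace.
raw_lookup = {" .lu": "Luxembourg", ".uk ": "United Kingdom", ".nl": "Netherlands"}
lookup = {clean_key(key): country for key, country in raw_lookup.items()}

print(country_for_tld("lu", lookup))   # Luxembourg
print(country_for_tld("com", lookup))  # prints an empty line
```

Unlike the spreadsheet formula, which returns an error when MATCH finds nothing, this sketch returns an empty string for unknown TLDs such as generic ones.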
  • 20. What can one study with these data?
    A first step in exploring what is available, retrievable and searchable through European web archives:
    (1) Web archives archiving outside their ccTLD
    (2) The types of actors
    (3) New event-specific websites
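Point (1) could be approached by counting the top-level domains in a seedlist and measuring the share that falls outside the archive's national ccTLD. The sketch below only illustrates that idea; the seed URLs are hypothetical and the function names are our own.

```python
from collections import Counter
from urllib.parse import urlparse

def tld_counts(seed_urls):
    """Count how often each top-level domain occurs in a list of seed URLs."""
    counts = Counter()
    for url in seed_urls:
        host = urlparse(url).hostname or ""
        if "." in host:
            counts[host.rsplit(".", 1)[1]] += 1
    return counts

def share_outside(seed_urls, national_tld):
    """Fraction of seeds whose TLD differs from the national ccTLD."""
    counts = tld_counts(seed_urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - counts[national_tld] / total

# Hypothetical seedlist for an archive whose national ccTLD is .lu.
seeds = [
    "https://www.acl.lu/en-us/news",
    "https://example.lu/covid",
    "https://example.com/covid",
    "https://example.org/",
]
print(tld_counts(seeds))
print(share_outside(seeds, "lu"))
```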
  • 21. (1) How can we create an entry point for researchers into European COVID collections? How can datasets help guide them?
    (2) Can this table highlight the different methods of creating COVID collections in European countries and, more generally, the practices of web archiving collections as well as their noise and silences?
    (3) From a cultural and governance perspective, could we combine web archiving institutions’ experience, governance and practices with the reality of the datasets we receive, to demonstrate how web archives have politics?
    Some preliminary conclusions with regard to the study of heritagization and web archives, considering inclusiveness, values & practices
  • 22. Other datasets and initiatives carried out by researchers on the Covid pandemic
    ▪ Twitter collection by Frédéric Clavert, https://www.c2dh.uni.lu/data/covid19fr-un-pays-confine-sur-twitter
    ▪ News Media Tweet Dataset from Universitat Autònoma de Barcelona, https://arxiv.org/abs/2004.01791
    ▪ Archive-It Collections, https://archive-it.org/explore?q=COVID
    Further resources
    ▪ The COVID-19 Data Portal, https://www.covid19dataportal.org
    ▪ A Journal of the Plague Year, https://covid-19archive.org/s/archive/collecting/item/2410
    ▪ The University of Southern California’s COVID tweet dataset, https://github.com/echen102/COVID-19-TweetIDs
    ▪ Geolocated tweets from QCRI, Qatar, https://crisisnlp.qcri.org/covid19
    ▪ Twitter covid19 stream, https://developer.twitter.com/en/docs/labs/covid19-stream/overview
  • 23. And then finally, slide 23! What type of research questions did we start with?
  • 24. Between data-driven science and research-driven questions
    “If the question of the priority of the egg over the hen or the hen over the egg troubles you, it is because you assume that the animals were originally what they are now. What madness!” Denis Diderot, The Dream of d'Alembert, 1769 (our translation).
  • 25. (1) Women, gender and COVID within this collection (e.g. domestic violence, care and homeschooling, etc.)?
    (2) How to identify private journals of lockdowns, individual traces of daily life, and different online expressions that give insight into the ways people deal with Covid in their everyday life?
    (3) Can we trace public support for or opposition to lockdown?
    (4) How was the "school at home" debate conducted on the Web?
    (5) How to identify fake news, conspiracy theories and other Covid-related controversies within these big data?
    (6) Is it possible to perform a visual analysis of what medical-scientific types of communication on Covid-19 look like (and what type of visual communication is used, e.g. graphs, virus visuals and the many types of color)?
    (7) The pandemic seriously affected museums around the world, and the Web became a prominent channel for their communication. How did museum websites evolve during the COVID-19 pandemic?
  • 26. “The chicken is only an egg’s way for making another egg!” Richard Dawkins
    Natural partners: historians and archivists