Digital Scholarship with Newspaper Collections

The Past, Present and Future of Digital Scholarship with Newspaper Collections
DH2019, Utrecht, July 2019

• Short Project Presentations:
  • Living with Machines
  • impresso - Media Monitoring of the Past
  • Building the digitisation of periodical collections with users (NewsEye)
• Overview Papers:
  • Digital Editions of Serials and media historians: an overview
  • Towards a Critical Framework for Digital Newspaper Scholarship
• Q&A

Living with Machines
Dr Mia Ridge, British Library, Co-Investigator
Paper authors/project team: Mia Ridge, Giovanni Colavizza, with Ruth Ahnert, Claire
Austin, David Beavan, Kaspar Beelens, Mariona Coll Ardanuy, Adam Farquhar, Emma
Griffin, James Hetherington, Jon Lawrence, Katie McDonough, Barbara McGillivray,
André Piza, Daniel van Strien, Giorgia Tolfo, Alan Wilson, Daniel Wilson.

Project vision
• We aim to facilitate new historical findings about the impact of
technology on the lives of ordinary people during the Industrial
Revolution / long nineteenth century (c. 1780 – 1918)
Or
• Applying new methods to questions about the past to explore the
future of collaboration between data science, history and digital
humanities
Or
• Challenging library professionals, data scientists and historians to
‘radically collaborate’ and learn from and with each other

Why newspapers?
• Large digitised corpus available if requested
• Opportunity to tackle the challenges of working at scale:
operational, methodological, organisational
• Suitable for developing innovative computational models, tools,
code, data and infrastructure reusable by other scholars and
research projects

The British Newspaper Archive
• Nearly 33 million newspaper pages
• Site by Findmypast Limited in commercial partnership with the
British Library
• BL Labs previously facilitated access for researchers to JISC-funded
digitised newspapers

British Library newspapers and periodicals
• The British Library holds 60 million issues (450 million pages, 34,000
titles) from the 17th century to today
• Majority UK/Irish (Legal Deposit from 1869), but also overseas,
especially USA, India, Africa
• New digitisation through ‘Heritage Made Digital’ and Living with
Machines projects
• 6.8% digitised (July 2019)

But what’s actually available digitally?

Courtesy Yann Ryan @lievesofgrass and @BL_MadeDigital

Copyright ‘safe date’ discussions are ongoing and... complicated

Our early work with newspapers
Research questions tackled across various Labs include:
• How bad is the OCR, really? And what effect does that have on
computational linguistic and nominal linkage methods?
• Can digitising newspaper directories help us understand the
difference in political and religious affiliations (etc.) between the
overall potential corpus and what’s currently been digitised?
• Can we use crowdsourcing tasks to reliably gather information
about industrial accidents? Can we then use the results to train
machine learning tools to find accidents at scale?
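One common way to put a number on "how bad is the OCR" is the character error rate (CER): the edit distance between a hand-corrected transcription and the OCR output, divided by the length of the transcription. The following is a minimal sketch of that general measure, not the project's own method; the function names and the sample strings are assumptions made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """CER = edit distance / number of characters in the reference text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Hypothetical example strings, not taken from any real newspaper page.
    truth = "fatal accident at the mill"
    ocr = "fatal accidcnt at tbe mill"
    print(f"CER: {character_error_rate(truth, ocr):.3f}")
```

A per-page or per-title CER of this kind needs a manually corrected sample to compare against, which is itself a cost that any such evaluation has to budget for.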

Ongoing questions
• To what extent does ‘convenience’ in digitisation and the quest for
geographical coverage affect scholarship?
• Copyright dates, short vs long runs, microfilm vs hard copy
• How do we show the impact of OCR quality on both keyword
searches and data processing at scale?
• What kinds of derived datasets would be useful to researchers?
• Planning for legacy: how do we integrate entity recognition etc.
results into discovery systems? How do we ensure interoperability?
• We can share public domain but not potentially copyrighted pages
– what effect does that have on user experience?
• How do we reconcile different ideas about ‘outputs’?
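The question above about OCR quality and keyword searches can be made concrete with a small recall measurement: of the articles whose corrected text contains a keyword, what share would a plain keyword search over the OCR text still find? This is only a sketch under stated assumptions; `keyword_recall` and the sample pairs are hypothetical, and a real study would need a corrected ground-truth sample and would likely tokenise rather than match substrings.

```python
from typing import Optional, Sequence, Tuple


def keyword_recall(
    pairs: Sequence[Tuple[str, str]], keyword: str
) -> Optional[float]:
    """Share of truly matching documents that the OCR text still matches.

    pairs holds (ground_truth_text, ocr_text) for each document.
    Returns None when no ground-truth text contains the keyword.
    """
    needle = keyword.lower()
    relevant = [(truth, ocr) for truth, ocr in pairs if needle in truth.lower()]
    if not relevant:
        return None
    found = sum(1 for _, ocr in relevant if needle in ocr.lower())
    return found / len(relevant)


if __name__ == "__main__":
    # Hypothetical (ground truth, OCR) pairs made up for illustration.
    sample = [
        ("Fatal accident at the mill", "Fatal accidcnt at the mill"),
        ("Railway accident near Leeds", "Railway accident near Leeds"),
        ("Market prices this week", "Market priccs this week"),
    ]
    print(keyword_recall(sample, "accident"))
```

Reporting recall per keyword alongside an overall error rate shows readers which searches are most affected, rather than giving a single average that hides the worst cases.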

Thank you!
Living with Machines @LivingWMachines
Sneak preview and newsletter signup:
http://livingwithmachines.ac.uk/

Dividing the work into ‘Labs’
• Sources - showing the biases in the collection and processing of sources
• Language - applying approaches from computational linguistics to corpora
including newspapers and novels
• Space and time - combining census data and event-based records to
understand urban change with spatial and temporal analyses
• Communities - a meta lab, amplifying results and engaging the public in
meaningful crowdsourcing that contributes to the project's research
• 3I (Integration, infrastructure and interfaces) - connects the IT infrastructure
with work done in the other labs and vice versa, thinking about computational
processes and integration of data science
• Data acquisition and wrangling - managing practical aspects of data ingest
including rights and data management
