The document discusses the Living with Machines project, which aims to apply computational methods and digital tools to historical newspaper collections to gain new insights. It summarizes the project's goals of facilitating collaboration between data scientists, historians, and digital humanities researchers. It also provides details on the project partners and funders, the newspaper collections involved including the British Library and British Newspaper Archive, challenges around copyright and digitization, and the project's research questions and division into specialized "Labs".
1. The Past, Present and Future of
Digital Scholarship with
Newspaper Collections
DH2019, Utrecht, July 2019
2. The Past, Present and Future of Digital
Scholarship with Newspaper Collections
• Short Project Presentations:
• Living with Machines
• impresso - Media Monitoring of the Past
• Construire avec les usagers la numérisation des collections de périodiques
(NewsEye)
• Overview Papers
• Digital Editions of Serials and media historians: an overview
• Towards a Critical Framework for Digital Newspaper Scholarship
• Q&A
3. Our Partners Our Funders
Living with Machines
Dr Mia Ridge, British Library, Co-Investigator
Paper authors/project team: Mia Ridge, Giovanni Colavizza, with Ruth Ahnert, Claire
Austin, David Beavan, Kaspar Beelen, Mariona Coll Ardanuy, Adam Farquhar, Emma
Griffin, James Hetherington, Jon Lawrence, Katie McDonough, Barbara McGillivray,
André Piza, Daniel van Strien, Giorgia Tolfo, Alan Wilson, Daniel Wilson.
4. Project vision
• We aim to facilitate new historical findings about the impact of
technology on the lives of ordinary people during the Industrial
Revolution / long nineteenth century (c. 1780 – 1918)
Or
• Applying new methods to questions about the past to explore the
future of collaboration between data science, history and digital
humanities
Or
• Challenging library professionals, data scientists and historians to
‘radically collaborate’ and learn from and with each other
5. Why newspapers?
• Large digitised corpus available if requested
• Opportunity to tackle the challenges of working at scale:
operational, methodological, organisational
• Suitable for developing innovative computational models, tools,
code, data and infrastructure reusable by other scholars and
research projects
6. The British Newspaper Archive
• Nearly 33 million newspaper pages
• Site by Findmypast Limited in commercial partnership with the
British Library
• BL Labs previously facilitated access for researchers to JISC-funded
digitised newspapers
7. British Library newspapers and periodicals
• British Library has 60 million issues (450 million pages, 34,000 titles)
from the 17th century to today
• Majority UK/Irish (Legal Deposit from 1869), but also overseas
esp. USA, India, Africa
• New digitisation through ‘Heritage Made Digital’ and Living with
Machines projects
• 6.8% digitised (July 2019)
11. Our early work with newspapers
Research questions tackled across various Labs include:
• How bad is the OCR, really? And what effect does that have on
computational linguistic and nominal linkage methods?
• Can digitising newspaper directories help us understand the
difference in political and religious affiliations (etc.) between the
overall potential corpus and what’s currently been digitised?
• Can we use crowdsourcing tasks to reliably gather information
about industrial accidents? Can we then use the results to train
machine learning tools to find accidents at scale?
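One common way to put a number on "how bad is the OCR?" is the character error rate (CER): the edit distance between a trusted transcription and the OCR output, divided by the length of the transcription. The sketch below is illustrative only and is not the project's own method; the function names and the sample strings are assumptions made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # drop a character from a
                current[j - 1] + 1,      # add a character to a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: a hand-corrected line and a noisy OCR reading of it.
    truth = "Fatal accident at the cotton mill"
    noisy = "Fatal accidcnt at the cotton rnill"
    print(f"CER: {character_error_rate(truth, noisy):.3f}")
```

A measure like this needs a sample of manually corrected text to compare against, so in practice it is usually estimated on a small sample rather than on the whole corpus.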
12. Ongoing questions
• To what extent does ‘convenience’ in digitisation and the quest for
geographical coverage affect scholarship?
• Copyright dates, short vs long runs, microfilm vs hard copy
• How do we show the impact of OCR quality on both keyword
searches and data processing at scale?
• What kinds of derived datasets would be useful to researchers?
• Planning for legacy: how do we integrate entity recognition etc.
results into discovery systems? How do we ensure interoperability?
• We can share public domain but not potentially copyrighted pages
– what effect does that have on user experience?
• How do we reconcile different ideas about ‘outputs’?
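One possible way to show the impact of OCR quality on keyword searches is to measure recall on a sample where corrected text is available: of the pages that truly contain a keyword, what share would an exact-match search over the OCR text still find? This is a minimal sketch under that assumption; `keyword_recall` and the sample pairs are hypothetical and not taken from the project.

```python
from typing import Optional, Sequence, Tuple


def keyword_recall(keyword: str,
                   pairs: Sequence[Tuple[str, str]]) -> Optional[float]:
    """Share of texts containing `keyword` in the corrected version
    that also contain it in the OCR version.

    `pairs` holds (corrected_text, ocr_text) tuples.
    Returns None when no corrected text contains the keyword.
    """
    needle = keyword.lower()
    relevant = [(truth, ocr) for truth, ocr in pairs if needle in truth.lower()]
    if not relevant:
        return None
    found = sum(1 for _, ocr in relevant if needle in ocr.lower())
    return found / len(relevant)


if __name__ == "__main__":
    # Made-up sample of (corrected, OCR) text pairs for illustration.
    sample = [
        ("Boiler explosion at the mill", "Boiler explosion at the rnill"),
        ("Fatal accident on the railway", "Fatal accidcnt on the railway"),
        ("Railway accident near Leeds", "Railway accident near Leeds"),
        ("Market prices", "Market priccs"),
    ]
    for word in ("accident", "mill", "railway"):
        print(word, keyword_recall(word, sample))
```

Reporting recall per keyword rather than a single average makes it easier to see which search terms are most affected by OCR errors.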
13. Thank you!
Living with Machines @LivingWMachines
Sneak preview and newsletter signup:
http://livingwithmachines.ac.uk/
15. Dividing the work into ‘Labs’
• Sources - showing the biases in the collection and processing of sources
• Language - applying approaches from computational linguistics to corpora
including newspapers and novels
• Space and time - combining census data and event-based records to
understand urban change with spatial and temporal analyses
• Communities - a meta lab, amplifying results and engaging the public in
meaningful crowdsourcing that contributes to the project's research
• 3I (Integration, infrastructure and interfaces) - connects the IT infrastructure
with work done in the other labs and vice-versa, thinking about computational
processes and integration of data science.
• Data acquisition and wrangling – managing practical aspects of data ingest
including rights and data management
Editor's Notes
Three half-hour sections
There are a few different ways to think about the goals of the project.
Conveniently, we already had a lot digitised; this allowed us to tackle questions of scale and truly break new ground (‘new’ allowing for all the other projects!)
Many names of researchers will be familiar to DH audiences
Our dates differ from those of Findmypast (FMP), which has different relationships with newspaper publishers and can work to a later date
Will we be able to link people, places etc. to identifiers at scale?