Enabling Complex Analysis of Large-Scale Digital Collections: Humanities Research, High Performance Computing, and transforming access to British Library Digital Collections
Talk at Digital Humanities 2016 with Melissa Terras, James Hetherington, David Beavan, Anne Welsh, Helen O'Neill, Will Finley, Oliver Duke-Williams, Adam Farquhar, and Martin Zaltz Austwick.
Abstract http://dh2016.adho.org/abstracts/2584
1. Melissa Terras, James Baker, James Hetherington, David Beavan, Martin Zaltz Austwick, Anne Welsh, Helen O'Neill, Will Finley, Oliver Duke-Williams, and Adam Farquhar
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Exceptions: quotations, embeds from external sources, logos, and marked images.
Enabling Complex Analysis of Large-Scale Digital Collections
Humanities Research, High Performance Computing, and transforming access to British Library Digital Collections
Data, code, viz: github.com/UCL-dataspring
2. Overview
Barriers to computational approaches:
● fragmentation of communities, resources, and tools;
● lack of interoperability;
● lack of technical skills
3. Method
60k books from the British Library:
● 17th to 19th century
● 224GB compressed ALTO XML
● UCL High Performance Computing
● 4 humanities researchers
● Research questions to
computational queries
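As an illustration of what turning a research question into a computational query involves at the lowest level, the following minimal sketch pulls the words out of one ALTO XML page. It is not the project's own code (that is at github.com/UCL-dataspring). It assumes only that words are stored in the CONTENT attribute of String elements, as ALTO defines; the function name alto_words and the sample page are invented for the example.

```python
import xml.etree.ElementTree as ET


def alto_words(xml_text):
    """Return the words on one ALTO page, in document order."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        # Drop any "{namespace}" prefix so the sketch is not tied to one ALTO version.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String" and "CONTENT" in element.attrib:
            words.append(element.attrib["CONTENT"])
    return words


# Invented two-line page, used only to exercise the function.
SAMPLE_PAGE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="The"/><SP/><String CONTENT="cholera"/>
    </TextLine>
    <TextLine>
      <String CONTENT="spread"/><SP/><String CONTENT="quickly"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_words(SAMPLE_PAGE))
```

On a real corpus the same function would be applied page by page, with the work spread across HPC nodes; that distribution step is outside this sketch.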
6. Results
It worked!
● Case Study 1: History of Medicine
● Case Study 2: History of Images
● Technical barriers
● Search ‘recipes’
7. Case Study 1
History of Medicine
Oliver Duke-Williams, UCL
10. Technical
Major sticking point:
● Using humanities data on HPCs
Best practice recommendations:
● Derived datasets
● Normalisations
● Documenting decisions
● Fixed/defined dataset
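One way these recommendations can fit together is to store the normalisation rules alongside the derived dataset they produced, so the decisions are documented in the data itself. The sketch below is a hypothetical illustration under that assumption: the rule names, the normalise and derive_counts functions, and the sample tokens are all invented for the example.

```python
import json
import string
from collections import Counter

# Hypothetical normalisation decisions, recorded so they travel with the output.
NORMALISATION = {"lowercase": True, "strip_punctuation": True}


def normalise(token, rules=NORMALISATION):
    """Apply the recorded normalisation rules to a single token."""
    if rules["strip_punctuation"]:
        token = token.strip(string.punctuation)
    if rules["lowercase"]:
        token = token.lower()
    return token


def derive_counts(books, rules=NORMALISATION):
    """Build per-year word counts from (year, tokens) pairs, plus the rules used."""
    counts = {}
    for year, tokens in books:
        bucket = counts.setdefault(year, Counter())
        normalised = (normalise(token, rules) for token in tokens)
        # Skip tokens that are empty after normalisation (e.g. bare punctuation).
        bucket.update(token for token in normalised if token)
    return {
        "normalisation": dict(rules),
        "counts": {year: dict(counter) for year, counter in counts.items()},
    }


# Invented sample input: two "books" with a publication year each.
books = [(1850, ["Fever,", "fever", "Cough."]), (1851, ["fever!"])]
derived = derive_counts(books)
print(json.dumps(derived, indent=2, sort_keys=True))
```

Writing the output once and analysing that file, rather than re-querying the full corpus, is what keeps the dataset fixed and defined.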
11. Generic searches:
● for all variants of a word
● that return keywords in context
traced over time
● for a word or phrase that ignore
another word or phrase
● for a word when in close proximity to a second word
● based on image metadata
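Three of the searches listed above (word variants, keywords in context, and proximity or exclusion) can be sketched in a few lines. This is a simplified illustration on an invented sentence, not the project's recipes, which are at github.com/UCL-dataspring. All function names are hypothetical, and variants are expressed here as a regular expression, which is one possible approach among several.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keywords_in_context(tokens, pattern, width=3):
    """Return (left, keyword, right) for every token fully matching the regex."""
    regex = re.compile(pattern)
    hits = []
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


def near(tokens, word, other, window=5):
    """Positions of `word` that have `other` within `window` tokens."""
    others = [i for i, token in enumerate(tokens) if token == other]
    return [
        i for i, token in enumerate(tokens)
        if token == word and any(abs(i - j) <= window for j in others)
    ]


def without(tokens, word, excluded, window=5):
    """Positions of `word` that do NOT have `excluded` within `window` tokens."""
    close = set(near(tokens, word, excluded, window))
    return [i for i, token in enumerate(tokens) if token == word and i not in close]


# Invented sample text, used only to exercise the functions.
text = ("The physician treated the fever. "
        "Physicians feared that consumption would follow the fever.")
tokens = tokenize(text)

print(keywords_in_context(tokens, r"physicians?", width=2))
print(near(tokens, "fever", "physician", window=3))
print(without(tokens, "fever", "physician", window=3))
```

Tracing results over time would add a publication year to each hit, and the image-metadata search depends on fields not described on this slide, so both are left out of the sketch.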
12. Conclusions
Recommendations for enabling
complex analysis of large-scale digital
collections in the humanities:
● 1 Invest in research software engineer capacity to deploy and maintain openly licensed large-scale digital collections from across the GLAM sector in order to facilitate research in the arts, humanities and social and historical sciences.
● 2 Invest in training library staff to run these initial queries in collaboration with humanities faculty, to support work with subsets of data that are produced, and to document and manage resulting code and derived data.
13. Special thanks to UCL Research Computing and British Library Digital Research for their hard work and support!