
Enabling Complex Analysis of Large-Scale Digital Collections: Humanities Research, High Performance Computing, and transforming access to British Library Digital Collections

Talk at Digital Humanities 2016 with Melissa Terras, James Hetherington, David Beavan, Anne Welsh, Helen O'Neill, Will Finley, Oliver Duke-Williams, Adam Farquhar, and Martin Zaltz Austwick.

Abstract http://dh2016.adho.org/abstracts/2584

Published in: Education

  1. Melissa Terras, James Baker, James Hetherington, David Beavan, Martin Zaltz Austwick, Anne Welsh, Helen O'Neill, Will Finley, Oliver Duke-Williams, and Adam Farquhar. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exceptions: quotations, embeds from external sources, logos, and marked images. Enabling Complex Analysis of Large-Scale Digital Collections: Humanities Research, High Performance Computing, and transforming access to British Library Digital Collections. Data, code, viz: github.com/UCL-dataspring
  2. Overview. Barriers to computational approaches: ● fragmentation of communities, resources, and tools; ● lack of interoperability; ● lack of technical skills
  3. Method. 60k books from the British Library: ● 17th-19th century ● 224GB compressed ALTO XML ● UCL High Performance Computing ● 4 humanities researchers ● Research questions to computational queries
  4. UCL’s Legion Cluster supercomputing facility. Photo: Tony Slade, © UCL Creative Media Services (all rights reserved)
  5. Method. 60k books from the British Library: ● 17th-19th century ● 224GB compressed ALTO XML ● UCL High Performance Computing ● 4 humanities researchers ● Research questions to computational queries
  6. Results. It worked!: ● Case Study 1: History of Medicine ● Case Study 2: History of Images ● Technical barriers ● Search ‘recipes’
  7. Case Study 1: History of Medicine. Oliver Duke-Williams, UCL
  8. Case Study 2: History of Images. Will Finley, Sheffield
  9. Case Study 2: History of Images. Will Finley, Sheffield
  10. Technical. Major sticking point: ● Using humanities data on HPCs. Best practice recommendations: ● Derived datasets ● Normalisations ● Documenting decisions ● Fixed/defined dataset
  11. Generic searches: ● for all variants of a word ● that return keywords in context traced over time ● for a word or phrase that ignore another word or phrase ● for a word when in close proximity to a second word ● based on image metadata
  12. Conclusions. Recommendations for enabling complex analysis of large-scale digital collections in the humanities: ● 1. Invest in research software engineer capacity to deploy and maintain openly licensed large-scale digital collections from across the GLAM sector in order to facilitate research in the arts, humanities and social and historical sciences. ● 2. Invest in training library staff to run these initial queries in collaboration with humanities faculty, to support work with subsets of data that are produced, and to document and manage resulting code and derived data.
  13. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exceptions: quotations, embeds from external sources, logos, and marked images. Special thanks to UCL Research Computing and British Library Digital Research for their hard work and support! Melissa Terras, James Baker, James Hetherington, David Beavan, Martin Zaltz Austwick, Anne Welsh, Helen O'Neill, Will Finley, Oliver Duke-Williams, and Adam Farquhar
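
A few of the generic searches listed on slide 11 (word variants, keywords in context, and proximity) could be sketched roughly as below. This is a minimal illustration and not the project's own code, which lives at github.com/UCL-dataspring. The function names `alto_words`, `normalise`, `kwic`, and `near` are hypothetical, the sample page is invented for the example, and the only assumption about the data is that each word of an ALTO page sits in the `CONTENT` attribute of a `String` element.

```python
import re
import xml.etree.ElementTree as ET


def alto_words(alto_xml):
    """Return the words of one ALTO page, read from String/@CONTENT."""
    root = ET.fromstring(alto_xml)
    # Compare local tag names so any ALTO namespace version is accepted.
    return [el.attrib["CONTENT"] for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "String" and "CONTENT" in el.attrib]


def normalise(word):
    """Lowercase and strip surrounding punctuation (an illustrative choice)."""
    return re.sub(r"^\W+|\W+$", "", word.lower())


def kwic(words, pattern, width=3):
    """Keyword in context: every word whose normalised form fully matches
    `pattern`, returned with up to `width` words on either side."""
    rx = re.compile(pattern)
    hits = []
    for i, w in enumerate(words):
        if rx.fullmatch(normalise(w)):
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, w, right))
    return hits


def near(words, first, second, distance=5):
    """Positions of `first` that have `second` within `distance` words."""
    norm = [normalise(w) for w in words]
    return [i for i, w in enumerate(norm)
            if w == first and second in norm[max(0, i - distance):i + distance + 1]]


# Invented sample page, for illustration only.
page = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
<Layout><Page><PrintSpace><TextBlock><TextLine>
<String CONTENT="The"/><String CONTENT="cholera"/><String CONTENT="spread"/>
<String CONTENT="through"/><String CONTENT="London."/>
</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"""

words = alto_words(page)
variant_hits = kwic(words, r"choler\w*", width=1)   # variants such as "cholera", "choleraic"
close_hits = near(words, "cholera", "london", distance=3)
```

Tracing hits over time would additionally need each book's publication year, and an exclusion search could filter `kwic` results whose context contains the unwanted phrase; both are left out here because they depend on how the collection's metadata is stored.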