
Keynote: Unexpected repurposing

Fourth annual BL Labs Symposium, 7 Nov 2016 keynote by Professor Melissa Terras: ‘Unexpected repurposing: The British Library's digital collections and UCL teaching, research and infrastructure’


  1. Unexpected Repurposing: the British Library's digital collections and UCL teaching, research and infrastructure. Professor Melissa Terras, Professor of Digital Humanities, UCL Dept of Information Studies; Director, UCL Centre for Digital Humanities. m.terras@ucl.ac.uk, @melissaterras
  2. #openglam
  3. British Library, 28th May 2008. https://web.archive.org/web/20110707135434/http://pressandpolicy.bl.uk/Press-Releases/The-British-Library-19th-Century-Book-Digitisatio Returned to the Library in 2012 and placed under a CC0 public domain licence for commercial and non-commercial use.
  4. Scanned page alongside the text generated from it by Optical Character Recognition (OCR)
  5. OCR XML generated by ABBYY FineReader
  6. https://www.flickr.com/photos/britishlibrary
  7. Image on Flickr Commons: https://goo.gl/AC43vs
  8. http://blpublicdomain.wikispaces.com/home
  9. https://historicaltexts.jisc.ac.uk/results?filter=service%7C%7Cbl&tab=date
  10. Data: what can we do with 65,000 books? 224GB compressed ALTO XML
  11. http://www0.cs.ucl.ac.uk/staff/D.Mohamedally/
  12. Staff and Students, working together
     • James Baker, Adam Farquhar
     • Melissa Terras, Dean Mohamedally, Tim Weyrich
     • Stefan Alborzpour, Stelios Georgiou, Nektaria Stavrou, Wendy Wong, Jonathan Lloyd, Meral Sahin, Divya Surendran, James Durrant, Muhammad Rafdi, Ali Sarraf
  13. Approach
     • How can we search the dataset differently?
     • Complex and multifaceted needs of humanities researchers
     • Boolean and Advanced Search
     • Microsoft Azure: 5 APIs were implemented that functionally scale to the data
     • Offering unconventional services such as bulk download of text based on metadata queries, word frequency lists, and OCR text previews
  14. github.com/BL-publicdomain/blpublicdomain
  15. picaguess.herokuapp.com, dx.doi.org/10.5281/zenodo.15980. James Baker, Tim Weyrich, Dean Mohamedally, Jonathan Lloyd, Meral Sahin, Divya Surendran
  16. http://blbigdata.herokuapp.com/ James Baker, Tim Weyrich, Dean Mohamedally, Ali Sarraf, James Durrant, Muhammad Rafdi
  17. github.com/UCL-dataspring
  18. Method
     • 65k books from the British Library: 17th - 19th century, 224GB compressed ALTO XML
     • UCL High Performance Computing
     • Support from RITS and UCLDH
     • 4 humanities researchers
     • Turn research questions into computational queries
     • Learn from the researchers about their needs, wants, desires, and method
  19. Results
  20. Taking Humanities data to HPC… https://www.flickr.com/photos/epublicist/3546059144
  21. Case Study 1: History of Medicine, Oliver Duke-Williams, UCL
  22. Case Study 2: History of Images, Will Finley, Sheffield
  23. What did this tell us?
     • Best practice recommendations:
       – Derived datasets for home use
       – Documenting decisions
       – Fixed/defined dataset
       – Normalisations
  24. Common Queries
     • searches for all variants of a word
     • searches that return keywords in context, traced over time
     • NOT searches: a word or phrase, excluding results that contain another word or phrase
     • searches for a word when in close proximity to a second word
     • searches based on image metadata
     All returned in a derived dataset, in context.
  25. Do try this at home…
     1. Invest in research software engineer capacity to deploy and maintain openly licensed large-scale digital collections from across the GLAM sector in order to facilitate research in the arts, humanities and social and historical sciences.
     2. Invest in training library staff to run these initial queries in collaboration with humanities faculty, to support work with subsets of data that are produced, and to document and manage resulting code and derived data.
  26. github.com/UCL-dataspring
  27. With thanks to
     • BL Labs and Digital Curators: James Baker, Adam Farquhar, Mahendra Mahey, Ben O’Steen, Hana Lewis
     • UCL CS Student Project Team: James Baker, Tim Weyrich, Dean Mohamedally
     • Bluclobber Project Team: James Baker, James Hetherington, David Beavan, Anne Welsh, Helen O’Neill, Will Finley, Oliver Duke-Williams, Adam Farquhar
     • UCL Research IT Services: James Hetherington, Clare Gryce, Raquel Algere
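
The slides mention word frequency lists derived from the ALTO XML. As a rough illustration only, and not the project's own code, the sketch below reads the `CONTENT` attribute of ALTO `String` elements with Python's standard library and counts normalised words. The sample fragment, the function names `alto_words` and `word_frequencies`, and the normalisation rule are assumptions made for this example; real files in the dataset may use a different ALTO schema version or layout.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, hand-written ALTO fragment for illustration only;
# it is not taken from the British Library dataset.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="The"/><SP/><String CONTENT="fever"/>
          </TextLine>
          <TextLine>
            <String CONTENT="the"/><SP/><String CONTENT="Fever,"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def alto_words(xml_text):
    """Return the OCR words of an ALTO document in reading order."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        # Compare local names so that any ALTO namespace version is accepted.
        if element.tag.rsplit("}", 1)[-1] == "String":
            content = element.get("CONTENT")
            if content:
                words.append(content)
    return words


def word_frequencies(words):
    """Count words after lower-casing and trimming surrounding punctuation."""
    normalised = (w.strip(".,;:!?\"'()[]").lower() for w in words)
    return Counter(w for w in normalised if w)


if __name__ == "__main__":
    print(word_frequencies(alto_words(SAMPLE_ALTO)).most_common())
```

Lower-casing and trimming punctuation is one example of the "normalisations" the recommendations ask to be documented: a different choice here changes the counts, so it belongs in the record kept with any derived dataset.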

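Three of the common queries listed in the slides (keywords in context traced over time, proximity searches, and NOT searches) can be sketched over plain text once it has been extracted. This is a minimal sketch under stated assumptions: the three-record corpus is invented, the function names `kwic`, `near` and `has_but_not` are hypothetical, and the tokeniser only handles unaccented letters. It does not reproduce how the UCL-dataspring code runs such queries on the HPC system.

```python
import re

# Hypothetical mini-corpus of (publication year, page text) pairs,
# invented for this example.
CORPUS = [
    (1832, "The cholera spread quickly through the town and the physician despaired."),
    (1849, "A physician wrote that cholera was carried by foul water."),
    (1866, "Cholera returned, but no physician was sent for."),
]

TOKEN = re.compile(r"[a-z]+")


def tokens(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return TOKEN.findall(text.lower())


def kwic(corpus, keyword, width=3):
    """Keyword in context: (year, left context, keyword, right context), by year."""
    hits = []
    for year, text in sorted(corpus):
        toks = tokens(text)
        for i, tok in enumerate(toks):
            if tok == keyword:
                left = " ".join(toks[max(0, i - width):i])
                right = " ".join(toks[i + 1:i + 1 + width])
                hits.append((year, left, tok, right))
    return hits


def near(corpus, first, second, distance=5):
    """Years of texts where `first` occurs within `distance` tokens of `second`."""
    years = []
    for year, text in corpus:
        toks = tokens(text)
        pos_a = [i for i, t in enumerate(toks) if t == first]
        pos_b = [i for i, t in enumerate(toks) if t == second]
        if any(abs(a - b) <= distance for a in pos_a for b in pos_b):
            years.append(year)
    return years


def has_but_not(corpus, wanted, unwanted):
    """NOT search: years of texts containing `wanted` but not `unwanted`."""
    years = []
    for year, text in corpus:
        toks = set(tokens(text))
        if wanted in toks and unwanted not in toks:
            years.append(year)
    return years


if __name__ == "__main__":
    for hit in kwic(CORPUS, "cholera", width=2):
        print(hit)
    print(near(CORPUS, "cholera", "physician"))
    print(has_but_not(CORPUS, "cholera", "water"))
```

Each function returns small tuples or lists of years rather than whole pages, which matches the idea of handing researchers a derived dataset they can work with at home.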