1. 1
@BL_Labs #OUDH @BL_DigiSchol labs@bl.uk
http://www.bl.uk/projects/british-library-labs
Funded by the Andrew W. Mellon Foundation
Mahendra Mahey
Experiment with our
Digital Collections
Mahendra Mahey
Manager of BL Labs
Digital Humanities at the Open University Research Collaboration with the BL
10:00 to 16:30, Monday 9th April 2018
BL Labs Roadshow 2018
OU, London
UK.
2. 2
Introductions
• What are your research interests?
• What do you hope to get out of the day?
3. 3
Collections – not just books!
> 180*million items
> 0.8* m serial titles
> 8* m stamps
> 14* m books
> 6* m sound recordings
> 4* m maps
> 1.6* m musical scores
> 0.3* m manuscripts
> 60* m patents
King’s Library *Estimates
6. 6
Finding Open Cultural Heritage Datasets
Collection Guides (199 as of 09/04/2018)
https://www.bl.uk/collection-guides/
Datasets about our collections
Bibliographic datasets relating to our published and
archival holdings
Datasets for content mining
Content suitable for use in text and data mining
research
Datasets for image analysis
Image collections suitable for large-scale image-
analysis-based research
Datasets from UK Web Archive
Data and API services available for accessing UK Web
Archive
Digital mapping
Geospatial data, cartographic applications, digital aerial
photography and scanned historic map materials
https://data.bl.uk
Download collections as zips, no API
Each dataset has a Digital Object Identifier (DOI)
can be referenced for research
Not all discoverable via
search engines!
7. 7
Competition
Awards
Projects
Tell us your ideas of what to do with our digital content
Show us what you have already done with our digital
content in research, artistic, commercial and learning and
teaching categories
Talk to us about working on collaborative projects
8. 8
Have you got X?
https://upload.wikimedia.org/wikipedia/commons/5/50/Real_wuerzburg.jpg
Looking for Physical Content in the British Library
9. 9
Have you got X digitised / in digital form?
http://www.yorkmix.com/wp-content/uploads/2014/04/mr-simms-sweet-shoppe-york.jpg
Looking for Digitised / Digital Content in the BL
11. 11
Openly Licensed Digital Content?
15% Openly
Licensed
Around 80%*
available online
Working through to make more open…
Though some collections will always only be available onsite due to
various reasons including legal, ethical etc
Breakdown by collection*
Manuscripts 59%
Books 9%
Maps and Views 7%
Newspapers 3%
Archives and Records 3%
Paintings, Prints and Drawings 2%
*Based on number of digitisation projects (702 as of 09/04/18)
Largest proportion of funding
Public / Private Partnership
15 %* Openly Licensed – most online
85 %* Available onsite only at the moment
*Estimates
12. 12
The Story of the Digital Collection…
Digital
Collection
Curator
Who paid for the digitisation?
Who did the digitisation?
Technology used
Born digital?
Published
Unpublished
Where is it?
Can it still be accessed?
Generates income
Reputational risk in using?
Legalities
Politics when digitised
Personalities involved
Surprises (e.g. gaps)
Descriptive information
Old format not supported
What media was the
digitisation done from?
Is there any background documentation?
No Descriptive information
Inconsistent descriptive information
Still there?
Good to know the background ‘Story’ of a Digital Collection
if you want to use it for research and draw conclusions…
13. 13
https://goo.gl/qpCLlk
https://goo.gl/wMTS3Z
• Dialogue typically:
– you are ‘lucky’ & we have the digital content
/ data relevant to your research
– we don’t have exactly what you’re looking for,
but is there anything of interest? Let’s talk…
– engagement is hard work and it’s constantly
required to maintain interest in our digital
collections!
• Artists find this dialogue easier…
• We also tend to attract researchers with ‘fuzzier’
research boundaries and possibly open to more
interdisciplinary / collaborative research
What engagement does the BL have with
researchers wanting to use our digital content?
16. 16
How do we give access to
onsite-only
Digital Collections
(85% of our Digital Collections)?
17. 17
READING
ROOM
ON
SITE
NOT
ONLINE
OPEN
British Library
£
Labs Residency Model
Challenges of access to Digital Collections
18. 18
Accessing digital collections onsite
OPEN
£
• Have to be ‘onsite’ (interpretations vary)
• Need to be ‘security cleared’ ‘trusted’ for some collections
– Hence ‘Researcher in Residence Model’
• Permission required (depending on ‘story’ of collection)
• Content could be on various media formats
(not always online)
• 5 - 20 % re-use of material for non-commercial research for
some collections, depends on agreements in place
• We are learning ‘pathways’ so that this becomes ‘everyday’ to
provide onsite access to some digital collections in the future
20. www.bl.uk
Phase 1: Exploration
• Exploration phase allows a researcher to:
• understand the data in an open-ended fashion,
• discover potential tools to work with the data,
• gain awareness of their capabilities and limitations,
• develop a firmer research query and
• gauge the costs, risks and time needed.
• Outputs of the exploration are not intended to be shareable,
beyond personal experience and key features (data size, formats, tool
successes, etc).
21. www.bl.uk
Phase 2: Query-Focussed
• “Query-Focussed”, the familiar project but due to phase 1:
• A firmer and more informed query by the researcher
• Suitable datasets already lined up
• A good idea of the initial toolset and capabilities (human and computer)
required
• Project output is outlined, and relevant reuse applications are begun.
• Clear agreements on what happens at the end of the project – data
deletion, virtual machine deletion/archiving/etc.
• Project may iterate on initial ideas, depending on researcher’s cost/risk
appetite
22. www.bl.uk
Phase 3: Wrap-up
• Wrap-up
• Work (code, notes) exported and given to researcher
• All derivative data is licenced or retained based on reuse agreements
(Access & Reuse board, etc)
• Provisions made for the project are wound-down, as agreed (derivative
data deleted after a grace period, etc)
24. 24
Why are we doing this? (1)
Working closely with and listening
to those who want to use our digital
collections and data for their work
https://goo.gl/esqpRb
25. 25
We can learn how we are and should be supporting them and
this therefore shapes the problems we work on, such as:
https://goo.gl/esqpRb
Why are we doing this? (2)
• Access to digital collections / data?
• Advice, guidance, technical
support, training
• Services, Tools and Processes?
• Many more reasons…
26. 26
Where are the gaps between what users want & what we can
give?
How do we build the bridges to overcome the gaps?
Why are we doing this? (3)
https://goo.gl/6CwCeE
27. 27
How do we help users ‘navigate’ their way through the
‘maze’ of the Library to what they want to do?
Sometimes requires understanding the culture of the organisation
https://goo.gl/62JnQT
Why are we doing this? (4)
28. 28
Working with British Library Digitised
Newspapers
• Digitised through public / private means
• Commercial products offer search interfaces for finding content manually,
but no APIs; still a useful starting point, as manual methods
can translate into computational ones
• OCR quality is not great and metadata is OK, but there is plenty of hidden material;
approaches need to take this into account, e.g. ‘Good, Bad and Ugly’ OCR
• If you want to work on digitised items at the BL, need security clearance,
carry out exploration, write letter of intent to communicate to GALE
• Can purchase drives from GALE Cengage with content (dependent on
subscription)
29. 29
Good, Bad, Ugly Image Quality / OCR
• Original image capture of newspaper images can affect the
quality of the OCR
• A poor image is very difficult to re-OCR
• Good image quality gives a much better chance for re-OCR
• Bi-tonal, Grey Scale, Colour can affect the quality of the
OCR
• Methodology of working with collection at scale needs to
acknowledge OCR and image quality
31. 31
Burney Collection
• Gathered by the Reverend Charles Burney (1757-1817)
• 700 volumes, newspapers and news pamphlets, published
in London, English provincial, Irish and Scottish papers, and
a few examples from the American colonies.
• 1271 titles
• Around 1 million digitised page images – from around 2006
from Microfilm
• OCR quality mixed, used custom XML format
• Bi-tonal
36. 36
Breakdown of titles
Title No. of Pages
PUBLIC ADVERTISER 60680
LONDON GAZETTE 44463
LONDON EVENING POST 38920
LONDON CHRONICLE 32030
GAZETTEER AND NEW DAILY ADVERTISER 31250
LLOYD'S EVENING POST 28941
ST. JAMES'S CHRONICLE OR THE BRITISH EVENING POST 28130
MORNING CHRONICLE AND LONDON ADVERTISER 27658
DAILY COURANT 25334
GENERAL EVENING POST 23500
12 TITLES WITH 10,000+ PAGES 188266
87 TITLES WITH 1,000+ PAGES 289745
216 TITLES WITH 100+ PAGES 79374
945 TITLES WITH 1 TO 100 PAGES 16816
37. 37
Example Folders
B0001ORIWEEJO - APPLEBEE'S ORIGINAL WEEKLY JOURNAL - 1715 – 1720
B0018CONTPROC - PROCEEDINGS OF THE ARMY UNDER THE COMMAND OF SIR
THOMAS FAIRFAX – 1645
B0054REPINFCH - REPORT OF THE STATE OF THE GENERAL INFIRMARY AT
CHESTOR - 1754?-1779
B0101PROCPARL - EXACT RELATION OF THE PROCEEDINGS AND TRANSACTIONS
OF THE LATE PARLIAMENT – 1654
B0277INSTRUCT - INSTRUCTOR – 1724
B1381SCOU1717 - SCOURGE (1717, REPRINT) - 1717?
38. 38
Example files
‘service’ folder contains page level images and corresponding OCR XML
BurneyB0001ORIWEEJO17151119service
39. 39
APPLEBEE'S ORIGINAL WEEKLY JOURNAL
FROM SATURDAY NOVEMBER 19 TO SATURDAY
NOVEMBER 26 1715
WO2_B0001ORIWEEJO_1715_11_19-0001.tiff
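As a rough illustration, the page-image filename above could be unpacked in Python as follows. This is only a sketch: it assumes every file follows the single pattern visible on this slide (prefix, title folder code, year, month, day, four-digit page number). The regular expression and the `parse_page_filename` helper are hypothetical, not a documented BL naming standard.

```python
import re
from datetime import date

# Pattern inferred from the one example filename on the slide (an assumption);
# real files in the collection may deviate from it.
FILENAME_RE = re.compile(
    r"^(?P<prefix>[A-Z0-9]+)_"
    r"(?P<title_code>B\d{4}[A-Z0-9]+)_"
    r"(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})"
    r"-(?P<page>\d{4})\.tiff$"
)


def parse_page_filename(name):
    """Return title code, issue date and page number, or None if no match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return {
        "prefix": match.group("prefix"),
        "title_code": match.group("title_code"),
        "issue_date": date(
            int(match.group("year")),
            int(match.group("month")),
            int(match.group("day")),
        ),
        "page": int(match.group("page")),
    }


info = parse_page_filename("WO2_B0001ORIWEEJO_1715_11_19-0001.tiff")
print(info)
```

Returning `None` for non-matching names makes it easy to log and inspect files that do not follow the assumed pattern, rather than silently misreading them.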
44. 44
Accessing digitised newspapers
onsite at the BL (JISC 1)
12 Volumes, 80TB of data
45. 45
Accessing digitised newspapers
onsite at the BL
15a
Accessing ‘service’ Copy (post processed)
and results of OCR available as XML
46. 46
Accessing digitised newspapers
onsite at the BL
Accessing ‘service’
Copy (post processed)
15b
49. 49
Metadata from BL (JISC 1 and 2)
• Title Metadata
– Title, as written
– Normalised title across all variants
– Standardised title abbreviation
– Variant titles, with associated dates
– Place of publication
– Dates of publication
– Genre, such as newspaper
– Sub-collection, such as Regional
Daily
Issue Metadata
Volume Number
Issue Number
Date as printed
Normalised date (YYYY.MM.DD)
Number of pages
The microfilm reel number
The OCR quality
Page image data
The number of the image within that issue
The filename
The spatial coordinates for the page within the
image
The degree of page skew
50. 50
Metadata from Gale (JISC 1 and 2)
• Standardised identifier
• Newspaper title
• Standardised title abbreviation
• Project codes
• Digitized collection name
• Issue number
• Date as printed
• Standardised date (Month, DD,
YYYY)
• Standardised date (YYYYMMDD)
• Day of the week
• Number of Pages
• Copyright holder
Language
Unique ID for publication
Holding Library
Citation of the physical item
Title metadata
Title as recorded in the MARC Library
Catalogue
Dates of publication
Genre, such as newspaper
Conversion credit, usually a vendor
Article
Unique ID
OCR quality
SC, or standardized category of article
Unique ID(s) of page(s)
Unique ID(s) of individual column(s)
Column number
Headline
Article type
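The BL and Gale metadata record issue dates in different normalised forms (YYYY.MM.DD and YYYYMMDD). A minimal Python sketch of bringing both into one comparable value, so that records from the two sources could be matched, might look like this. The function names are hypothetical and the sample date is borrowed from the Burney example; the exact strings found in the real metadata are an assumption.

```python
from datetime import datetime


def parse_bl_date(text):
    """Parse the BL 'Normalised date (YYYY.MM.DD)' form into a date."""
    return datetime.strptime(text, "%Y.%m.%d").date()


def parse_gale_date(text):
    """Parse the Gale 'Standardised date (YYYYMMDD)' form into a date."""
    return datetime.strptime(text, "%Y%m%d").date()


# Illustrative values only, not taken from real metadata records.
bl_date = parse_bl_date("1715.11.19")
gale_date = parse_gale_date("17151119")
print(bl_date == gale_date, bl_date.isoformat())
```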
51. 51
Samples for JISC 1
‘master’ contains high res tiff
‘service’ contains post processed tiff and OCR XML
BNWL - The Belfast News-Letter - 1871 - November 14
BNWL - The Belfast News-Letter - 1885 - September 12
DNLN - Daily News - 21 Jan 1846 - 31 Dec 1900
56. 56
Samples for JISC 2
Lancaster Gazetter, And General Advertiser For Lancashire West
Southampton Herald
Berrows Worcester Journal
A - Contains post processed files
M - Contains JP2
O - Contains ALTO XML
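For the ALTO XML in the ‘O’ folder, a minimal Python sketch of pulling the recognised text out line by line could look like the following. ALTO stores each word in the `CONTENT` attribute of a `String` element inside a `TextLine`. The sample document and its namespace (ALTO v2) are invented for illustration; the schema version of the actual JISC 2 files is an assumption, so the code matches element names regardless of namespace.

```python
import xml.etree.ElementTree as ET

# Invented sample; real files will be much larger and carry coordinates etc.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Berrows"/><SP/>
      <String CONTENT="Worcester"/><SP/>
      <String CONTENT="Journal"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Thursday"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the text of each TextLine as a space-joined string."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in element.iter()
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE_ALTO))
```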
57. 57
Previous ideas of using collection
• Bob Nicholson – finding jokes
• Katrina Navickas – Political meetings
• Hannah Murray – Black abolitionist performances
• Jennifer Batt – Finding poetry
• Surendra Singh – Finding suicide articles
• Melodee Beals – Evidence of copy and paste
• Ryan Cordell – Viral Texts
• Paul Fyfe - Snipping out images
58. 58
Victorian Meme Machine (2014)
https://goo.gl/HMqDt3
Bob Nicholson
http://victorianhumour.tumblr.com/
Bob Nicholson interviewed on
BBC Radio 4 Making History Programme:
http://goo.gl/fmV9ep
And telling jokes to the public:
http://goo.gl/xIDRhz
Bob obtained further funding from his university
Looking for more collaborations
https://www.youtube.com/watch?v=-GRgj7Q5OM0
Rob Walker, Victorian Mother-in-law Jokes
Victorian Comedy Night, 7 Nov 2016
Learnt about access paths
to digital collections
59. 59
Katrina Navickas (2015)
Political Meetings Mapper
http://politicalmeetingsmapper.co.uk
https://goo.gl/Qq78Oa
Labs Symposium 2015
https://goo.gl/BSA3be
Interview 2015
The Chartist Newspaper
http://goo.gl/vOLSnH
Chartist Monster Meeting
Chartists Walking Tour and
Re-enactment London
Learnt that domain knowledge
reduces noise
60. 60
Black Abolitionist Performances & their
Presence in Britain (2016) – Hannah-Rose Murray
Frederick
Douglass
Ellen
Craft
Josiah
Henson
Ida B
Wells
A Performance by
Joe Williams &
Martelle Edinborough
http://frederickdouglassinbritain.com/
Started to implement
Machine Learning Techniques
61. 61
Data-mining verse in 18th Century newspapers
BL Labs Project 16-17, Jennifer Batt
https://goo.gl/5Akthd
Slides courtesy Jennifer Batt
62. 62
What thoj' among ourrelves, with too much Heat, or t
W: fweutimes.wongle, wvhen we Ihould debate, W –
(A confequential Ill which Freedom drawvs, fl t
A bad Efficf, but from a noble Caufe) t
We can with univeifal Zcal advance, to
To cutb the faithlefs Arrogancccof V rance. hi
Dublin Journal, 10-14 September, 1745
Slides courtesy Jennifer Batt
63. 63
Verse: 81% lines begin with
initial capital
Prose: 52% lines begin with
initial capital
Westminster Journal 3 March 1745
Slides courtesy Jennifer Batt
Started to refine
Machine Learning Techniques
Jennifer Batt @ the BL on World Poetry Day
‘40,000’ things found…
Possibly using Gale Primary
Sources interface to see if we
can sift this data
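The initial-capital observation above can be turned into a simple feature. The Python sketch below computes the share of lines whose first alphabetic character is upper case and applies a cut-off. It is not the project's actual method: the 0.8 threshold, the function names and the sample passages are all assumptions for illustration, and the 81% / 52% figures on the slide are corpus-level averages rather than per-passage thresholds.

```python
def initial_capital_ratio(lines):
    """Fraction of lines whose first alphabetic character is upper case."""
    counted = 0
    capitalised = 0
    for line in lines:
        # Skip leading punctuation or OCR noise such as '(' before the first letter.
        first_alpha = next((ch for ch in line if ch.isalpha()), None)
        if first_alpha is None:
            continue  # blank or letter-free line
        counted += 1
        if first_alpha.isupper():
            capitalised += 1
    return capitalised / counted if counted else 0.0


def looks_like_verse(lines, threshold=0.8):
    """Hypothetical classifier: flag a passage as candidate verse."""
    return initial_capital_ratio(lines) >= threshold


# Invented passages, not taken from the newspapers.
verse = [
    "The morning breaks upon the hill,",
    "And every bird begins to sing,",
    "While dew lies heavy on the sill,",
    "(And bells across the valley ring).",
]
prose = [
    "Yesterday arrived the mail from",
    "holland, which brings advice that the",
    "army has marched towards the",
    "Frontier and is expected shortly.",
]

print(initial_capital_ratio(verse), initial_capital_ratio(prose))
```

A single feature like this produces many false positives (lists, advertisements, headings), which is consistent with the large number of candidates that then need sifting.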
64. 64
Psychiatrist’s Journey
into 19th Century Newspapers (2016)
• Dr Surendra P Singh, Consultant Psychiatrist
• To identify weekly, monthly, yearly and
longitudinal trends in suicide reporting in
terms of gender, status, sites, locations and
health in OCR text of 19th Century
Newspapers
• Used ‘R’ Open Source Stats
Package to collect ‘Suicide’ corpus
• Looking for collaborators to work on this
dataset
Use off-the-shelf tools
and remote access pathways
66. 66
Use of Overproof
OCR Correction?
Re-OCR with
ABBYY FineReader?
https://www.abbyy.com/en-gb/
http://overproof.projectcomputing.com/
RE-OCR
67. 67
Virtual Infrastructure for OCR text
OCR text ‘scraped’ from
digitised newspapers
and put in cloud
Jupyter notebook
Write python code and results
in web browser
http://jupyter.org
Access available for researchers ‘in residence’
https://www.docker.com/
http://dhbox.org/
68. 68
Working with the MS Books Collection
• Metadata
• Page level images
• OCR Text
• Flickr Commons - images snipped out and user generated
tags for images
• Artistic Works
• 19th Century Books Collection data
69. 69
65,000 digitised 19th Century books
Image: Artwork by Alicia Martin 2007 / 2008
Paid for by:
For a full list:
https://goo.gl/HqPQMS
Subjects include:
Philosophy
Poetry
History
Literature
1789 - 1876
80. 80
Optically Character Recognised (OCR)
generated Text
Scanned Page
Image on Flickr
Commons
https://goo.gl/AC43vs
81. 81
British Library Flickr Commons
https://www.flickr.com/photos/britishlibrary/
Flickr Commons has items from
Galleries, Libraries, Archives and Museums (GLAM)
(Mostly Public Domain)
83. 83
Getting an account on Flickr
•Get a Flickr / Yahoo account
(https://login.yahoo.com/account/create)
•You can then tag, organise favourites, make
your own albums and galleries from Flickr
images online or uploaded
•You get 1TB for free!
•You could reference your own Flickr account
for the competition?
84. 84
British Library Flickr Commons
Why Flickr Commons?
• Free!
• Each image has its own unique web address, easy to share
• Can Tag images
• Has Application Programming Interface (API)
Late August 2013
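Because Flickr exposes an API, searches over the collection can be scripted. The Python sketch below only builds a request URL for the public `flickr.photos.search` method and does not send it. The API key and the account identifier are placeholders you would need to supply (both are assumptions here), and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"


def build_search_url(api_key, user_id, tags, per_page=50):
    """Build a flickr.photos.search URL for one account, filtered by tags."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,        # placeholder: obtain your own key from Flickr
        "user_id": user_id,        # placeholder: the account's Flickr ID
        "tags": ",".join(tags),
        "per_page": per_page,
        "format": "json",
        "nojsoncallback": 1,       # ask for plain JSON rather than JSONP
    }
    return FLICKR_REST + "?" + urlencode(params)


url = build_search_url("YOUR_API_KEY", "ACCOUNT_ID", ["map"])
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return a JSON page of matching photos, subject to Flickr's terms and rate limits.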
85. 85
Worked better for female faces than men’s
Press
http://mechanicalcurator.tumblr.com
Posts image every 30 minutes
http://www.flickr.com/photos/britishlibrary/
1,020,418 images
need tagging!
Creative uses of images
Face recognition
Algorithms based on photos
Mechanical Curator
with an algorithmic brain
(Circles, Squares and Slanty etc)
http://goo.gl/qPPgxX
Wikimedia
Flickr Commons
Individual URL & API
Snipping out images
from 65,000 Digitised Books*
>800,000,000* views
>17,000,000* tags
https://goo.gl/FgZ4HM
Work @ BL by Ben O’Steen, Labs
and Digital Research Team
*Matt Prior - http://goo.gl/j29Tnx
Since Dec 2013
Tumblr
*Estimates
>More demand to see
physical items
86. 86
Using British Library Flickr Commons
•How do we find things in this collection?
•Remember snipped out images from books
with no description?
•Not straightforward…
87. 87
How is Flickr Commons Organised?
• Photostream
• Albums
• Faves
• Galleries
• Tags
88. 88
Flickr Photostream
https://www.flickr.com/photos/britishlibrary/
Kind of the home page for the collection!
Usually displays images with most recent activity!
89. 89
Flickr Albums
Curated by the British Library – specifically Nora McGregor
She works with the public to add images to albums or create new albums!
440 Albums as of 23/10/17 – Mostly Maps!
90. 90
Flickr Faves
Most favourited image first, in descending order
Favouriting an image requires an account
91. 91
Flickr Galleries
More useful if you have an account
You can create a Gallery of Flickr images to share with everyone
Gallery is tied to your account
92. 92
Flickr Groups
Community based – for sharing and discussing images
We might create a group for the competition – watch this space!
100. 100
Opportunities
– increasing traffic to Library services
You can purchase
a ‘High Res’ Copy
View in the
Library Item Viewer
Download .pdf
All illustrations
in book
Other illustrations in books
Published in same year
View the item in
the Library Catalogue Tags auto generated
User generated
Tag
Grouping for image
117. 117
Warning – can be large file!
It’s a PDF
You can do Ctrl F in it to find text
But health warning about OCR!
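The OCR health warning matters because an exact Ctrl F search misses misrecognised words. One possible workaround, sketched here in Python with the standard-library `difflib` module, is approximate matching against the words in the OCR text. The sample string imitates the long-s style errors seen in 18th century OCR; the `fuzzy_find` helper and the 0.8 similarity cut-off are assumptions for illustration.

```python
import re
from difflib import get_close_matches


def fuzzy_find(word, ocr_text, cutoff=0.8):
    """Return OCR tokens that closely resemble the search word."""
    tokens = sorted(set(re.findall(r"[a-z']+", ocr_text.lower())))
    return get_close_matches(word.lower(), tokens, n=3, cutoff=cutoff)


# Illustrative OCR output with typical misreadings ('f' for long s, stray 'v').
sample = "(A confequential Ill which Freedom drawvs"

print("consequential" in sample.lower())   # exact search
print(fuzzy_find("consequential", sample))  # approximate search
print(fuzzy_find("draws", sample))
```

Lowering the cut-off finds more damaged words but also returns more unrelated ones, so results still need checking against the page image.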
119. 119
Page numbers don’t always correspond!
Page numbers
Don’t always correspond
Page 132 on Flickr?
Is Page Number in PDF
In PDF of
book
Page number
in book
121. 121
Plain Text from Books?
Not working
But can be obtained from https://data.bl.uk/digbks/db14.html
122. 122
All illustrations in book / books in same year!
All the illustrations in this book
Other illustrations in books published in the same year
127. 127
Tagging a million images
Iterative Crowdsourcing
http://goo.gl/j6fxac
Cardiff University’s
Lost Visions Project
http://www.metadatagames.org/
Metadata Games
James Heald
Mario Klingemann
Chico 45
Use computational methods
Human Tagger
Top British Library Flickr Commons Taggers
18 hard core taggers
How to reward this ‘small group’ and keep them motivated?
Average for ‘crowd’ is 1 tag per person
What kind of ‘task’ can this ‘crowd’ do?
Mobile games for ‘Ships’, ‘Covers’ and ‘Portraits’ Interface for tagging
128. 128
Adding Tags!
•You have to have an account to add tags!
•Could you be the next Chico 45?
136. 136
Artistic / Creative Works
http://goo.gl/dM8ieA
Mario Klingemann (2015)
Code Artist / Curator
https://www.youtube.com/watch?v=Q3SBxO34Zlc
David Normal 2014 and 2015
Collages/Paintings & Lightboxes
http://goo.gl/bNxGZZ
Kris Hoffman (2016)
Animation for Fashion Week 2016
https://goo.gl/QilqqT
Jiayi Chong 2016 - Animation tool
https://www.facebook.com/RealmlandStory/
Paul Rand Pierce 2016
Graphic Novel on Facebook
Tragic Looking Women
44 Men who Look 44
(Notice the direction faces)
A Hat on the Ground
Spells trouble
137. 137
Imaginary Cities – BL Labs Project 16-18
Michael Takeo Magruder
An artistic exploration seeking to create provocative fictional cityscapes for the Information Age
from the British Library’s digital collection of historic urban maps
139. 139
19th Century Books Metadata
• 1.9 million records of 19th Century Books
• Used for Sample generator project
140. 140
Special Jury’s Prize (2015)
James Heald – Wikimedia and Map work
https://goo.gl/WYZCB2
http://goo.gl/HNQq5e
https://goo.gl/VPgffL
https://commons.wikimedia.org/
https://goo.gl/djtm1b
Labs Symposium (2015)
Geotagging maps
50,000 Maps
Found in Flickr 1 million
Human & Computational Tagging
& Community engagement
Geo-referencing work
https://www.bl.uk/georeferencer
141. 141
Using the Wikimedia Synoptic Index
• Created to help find all the maps in the books
• Great resource if you want to find things by place!
https://goo.gl/zuxRnG
146. 146
Alston Index
Internal Document
55-602 - Topical Index
603 - 925 - Pressmark Sequence
925 page document of BL /
British Museum Pressmarks
147. 147
Alston Index
• Internal document (not to be externally shared)
• Published in 1987 – dot matrix printed
• Refers to British Museum and British Library Pressmarks /
Shelfmarks
• Shelfmarks are used internally to identify
151. 151
Playbills
• 90,000 theatrical playbills (1660-1902) scanned for Optical
Character Recognition (OCR): approximately 107,000
individual sheets in total (including Misc), comprising 320
volumes, supplied as 320 .pdfs with copyable text in a folder called
‘PDFs’. The quality of the OCR is quite poor.
152. 152
Playbills
• Each pdf corresponds to a physical volume at the British
Library. If we examine ‘lsidyv3f9b9a08.pdf’
• On the spreadsheet – ‘MISC Data’ – corresponds to
Playbills 1
• Descriptive Data worksheet
• A collection of playbills from Drury Lane Theatre 1780-1785.
• 516 sheets, 36 cm.
156. 156
Sarah Middle - Judicial Committee of the
Privy Council Papers
http://blogs.bl.uk/digital-scholarship/2017/12/cleaning-and-visualising-privy-council-appeals-data.html
http://blogs.bl.uk/socialscience/2017/12/the-judicial-committee-of-the-privy-council-a-short-introduction-and-sources-for-research.html
http://blogs.bl.uk/digital-scholarship/2018/02/converting-privy-council-appeals-metadata-to-linked-data.html
157. 157
British National Bibliography
• http://bnb.data.bl.uk
http://thedatahub.org/dataset/bluk-bnb-basic
http://www.bl.uk/bibliographic/download.html
http://bnb.data.bl.uk/sparql
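The SPARQL endpoint listed above can, in principle, be queried from a script using the standard SPARQL protocol (an HTTP GET with a `query` parameter). The Python sketch below builds such a request and defines, but does not call, a function that would send it. Whether the endpoint is still live, and what vocabularies the BNB data uses, are not confirmed here, so the query is deliberately generic; `build_request` and `run_query` are hypothetical helpers.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BNB_ENDPOINT = "http://bnb.data.bl.uk/sparql"  # from the slide; availability is an assumption

# Generic query returning any five triples; no BNB-specific vocabulary assumed.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 5
"""


def build_request(endpoint, query):
    """Build a SPARQL-protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(endpoint, query, timeout=30):
    """Send the query and return the result bindings (needs network access)."""
    with urlopen(build_request(endpoint, query), timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


request = build_request(BNB_ENDPOINT, QUERY)
print(request.full_url)
```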
85 seconds
The picture you can see is inside the main building in London, it’s the King’s Library – King George the Third’s personal library! It is sometimes known as the ‘stack’. I walk past this every day, and I sometimes forget that the collections the British Library has are truly staggering! We currently estimate them to exceed <click>150 million items, representing every age of written civilisation and every known language. Our archives now contain the earliest surviving printed book in the world, the Diamond Sutra, written in Chinese and dating from 868 AD….
So some big numbers…
Over …<click>14 million books
<click>60 million patents
<click>8 million stamps
<click>4 million maps
<click>3 million sound recordings
<click>1.6 million music scores
<click>over 0.3 million manuscripts
<click>0.8 million serial titles (which are of course made up of many, many volumes/editions). This is where a lot of our content is, just in case you thought the numbers didn’t add up!
17 Seconds (53 Words)
<Click>The British Library is one of the largest libraries in the world <Click> with an estimated 180 million physical items, with only a small proportion being digitised. <Click>We estimate this is around 1-2%, but no one really knows exactly how much. However, increasingly more items are being stored as ‘born’ digital, such as the UK Web Archive<Click>
<click>The British Library faces many challenges of access to our Digital collections!
<click> Sometimes digital content is only available onsite due to license restrictions,
<click>or even only on a specific computer in a reading room! Technically there are very few reasons why digital content can’t be online
<click> though it might be too big or hasn’t been transferred from other digital storage media.
<click>Sometimes access is through a paywall. Finally,
<click>some content is in the happy sunny place, online, open and freely available.
The real reasons why there are challenges to accessing digital content are of course human. They require different approaches from the Library and may often involve an honest, open dialogue and negotiation with the publishers.
The Labs project has tried to address this problem by creating a ‘residency model’ for researchers to work intensively with a digital collection on-site, so as to not infringe access conditions. I will say more about this later.
21 Seconds (65 Words)
Katrina Navickas was particularly interested in the <Click>Chartist Movement, a group campaigning for the vote for working people. <Click>They were the biggest popular movement for democracy in 19th century British history, as this early picture of a huge monster meeting at Kennington Common shows. <Click>She wanted to use a combination of manual and computational methods to explore our Digitised Newspapers to find out when and where they met and plot the meetings on a map, <Click>hopefully unearthing new history.
970 files from a selection of 19th century newspaper titles from the BL corpus for us to correct using the overProof post-OCR correction software
The best way to measure the improvement made by the correction process is to compare the OCR'ed text and the automatically corrected text with a perfect correction made by a human (known as the "ground truth").
Hannah-Rose's 5 small human-corrected samples are shown as green dots. These are not only smaller than the other files, but their raw error rate is much lower at 13.3%. OverProof was measured as reducing this to 5.4%, a removal of almost 60% of errors.
The red dotted-line indicates the correction "break-even" point: the further under the line, the better the quality of the document after correction.
In the graph below, the grey line shows distribution of files across error rates before correction and the green line after correction.
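The "almost 60%" figure follows from the two error rates quoted above: (13.3 - 5.4) / 13.3 is roughly 0.594. The Python sketch below reproduces that arithmetic and shows one common way an error rate against a human "ground truth" could be computed, using character-level edit distance. The notes do not say which metric overProof actually used, so the character-level choice and the function names are assumptions.

```python
def relative_error_reduction(raw_rate, corrected_rate):
    """Share of the original errors removed by correction."""
    return (raw_rate - corrected_rate) / raw_rate


def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance to the ground truth, per ground-truth character."""
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


reduction = relative_error_reduction(13.3, 5.4)
print(f"{reduction:.1%}")
print(character_error_rate("confequential", "consequential"))
```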
Posts small illustrations taken almost at random from the digitised book corpus to a Tumblr blog.
This experiment with undirected engagement was a by-product of work to uncover the hidden wealth of illustrations within the digitised pages.
50 seconds
Here is the anatomy of a Flickr record. Importantly, we have created links to many of the Library’s services <click>so some of this lovely traffic is going back to the Library and hopefully generating more interest in our services, from downloading a pdf of the book to purchasing a high res scan of the image.
<click>Tags are added from the original book record, including the approximate page number the image came from. <click>Users of Flickr can add their own tags and, as I have mentioned, they have already started doing so.
18 Seconds (56 Words)
‘Indexing the BL 1 million & Mapping the Maps’ was led by James Heald in collaboration with others. <Click>They produced an index of 1 million 'Mechanical Curator collection' images on <Click>Wikimedia Commons from a collection of largely un-described images. <Click>This led to the discovery of 50,000 maps within the collection, partially through a map-tag-a-thon. <Click>These are now being geo-referenced. <Click>