Session5 02.tom derrick

Cross-disciplinary collaborations
to enrich access to non-Western
language material in the Cultural
Heritage sector
Tom Derrick, Nora McGregor, Dr Adi Keinan-Schoonbaert
Digital Scholarship Department, British Library

www.bl.uk 2
National library of the UK and
world’s largest library by
number of items catalogued.
c.150-200 million items stored
in London and in York.
20+ years creating digital
assets.
Digitisation is key to opening
up access
We can now do much more
than simply view these digital
objects online and must
embrace opportunities
afforded by analysing digital
collections at scale.

Our aims
• Support the British Library's mission to make our intellectual heritage
accessible to everyone “for research, inspiration and enjoyment”, particularly
our non-Western materials
• Raise awareness of our South Asian printed books and Arabic manuscript
collections among a wide and diverse audience around the world, from the
general public to computer scientists and students
• Instigate new collaborations in the computer science/recognition domain,
creating a dialogue around the challenges and opportunities for automatic
transcription of historical Arabic and Bengali texts
• Create openly licensed ground truth datasets to aid digital humanists and
researchers working on the state of the art in recognition software

Two Centuries of Indian Print
Scope of collection:
- Rare and unique South Asian printed books collection
- 1,000 Bengali books, 1713-1914
- 600 Assamese and Sylheti books being digitised in 2018/19

Challenges for OCR
• Bengali not widely or well supported by leading OCR providers
• Extensive alphabet with complex character forms
• Varied historical fonts and alphabetical reforms
• Physical defects in material
• Quality of digitised items

Bangla OCR Competition
• ICDAR (Kyoto, Nov 2017)
• PRImA Research Lab, University of Salford
• 23 institutions from 7 countries (50% from India)
• Commercial tech companies + university computer science & engineering depts

Bangla OCR Competition Process
• Selected images from the collection representing OCR challenges
• Created a ground truth training set
• Entrants trained their systems on the ground truth
• Entrants performed OCR on the full collection of images
• Results evaluated by PRImA Research Lab
• Report and poster published at the ICDAR conference, Kyoto, Nov 2017

Competition Results
www.primaresearch.org/datasets/REID2017

Current Bangla OCR Initiatives - Transkribus
• Handwritten and printed text analysis
• Collaborative platform
• 100 pages of ground truth to train an HTR engine
• Supports non-Latin scripts
www.transkribus.eu

Initial Transkribus Results
• 100 pages of ground truth transcribed by Jadavpur University
• New HTR+ engine achieved a 6% character error rate (CER) on the same set of pages
• On par with Google for OCR performance but requires lots of manual work
www.transkribus.eu
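
For readers unfamiliar with the metric, CER is commonly defined as the edit distance between the recognised text and the ground truth, divided by the length of the ground truth. The sketch below follows that common definition; it is not taken from Transkribus, and the function names `levenshtein` and `cer` are illustrative. It counts Unicode code points, so a Bengali conjunct or vowel sign counts as more than one "character"; an evaluation tool may count differently.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn hyp into ref (counted per Unicode code point)."""
    prev = list(range(len(hyp) + 1))  # distances for the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i characters reaches the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by ground truth length."""
    if not ref:
        raise ValueError("ground truth text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    ground_truth = "\u09ac\u09be\u0982\u09b2\u09be"  # Bengali word "Bangla", 5 code points
    ocr_output = "\u09ac\u09be\u0982\u09b2"          # final vowel sign missed
    print(f"CER: {cer(ground_truth, ocr_output):.0%}")
```

Under this definition, a 6% CER means roughly six character edits per hundred ground truth characters.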

Future Plans
• Evaluate ICDAR2019 OCR competition methods
• Continue training Transkribus with new transcriptions in 2019
• Facilitate OCR training workshops in South Asia

BL Arabic scientific
manuscript collections
In 2014 the British Library Qatar Foundation
Partnership launched the Qatar Digital Library
(QDL): a bilingual, online portal providing access
to digitised British Library archival materials and
manuscripts relating to Gulf history and Arabic
science.
• 600 manuscripts (215 digitised)
• 1,500 texts
• 184,000 pages
• Manuscripts produced from Spain/North Africa
to India
• Manuscripts dating from the 10th-20th centuries
• Authors dating from the 5th century BC to the
19th century

Challenges with Arabic script
Arabic script presents unique challenges for text recognition:
• Arabic script writing styles are varied
• Characters are written cursively, joined from right to left; each
may take 2 to 4 shapes, depending on context
• The shape of each of the 28 Arabic characters may change
drastically depending on its position in the word
• Non-joining characters do not connect to the following letter,
even though the script is cursive, leaving a small space within a word
• Long strokes along the baseline
• Complex combinations of ascenders, descenders, diacritics
and special notation, above or below the baseline depending
on the character, pose further challenges
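
The "2 to 4 shapes" point can be illustrated with Unicode's own character data. The sketch below is an illustration only, not part of any system evaluated here: it uses Python's standard `unicodedata` module to list the contextual forms encoded in the Arabic Presentation Forms-B block (U+FE70 to U+FEFC). The helper names `contextual_forms` and `fold_presentation_forms` are hypothetical.

```python
import unicodedata
from collections import defaultdict

CONTEXT_TAGS = {"<isolated>", "<final>", "<initial>", "<medial>"}


def contextual_forms(first: int = 0xFE70, last: int = 0xFEFC) -> dict:
    """Map each base Arabic letter to the contextual forms that
    Unicode encodes for it in the given code point range."""
    forms = defaultdict(list)
    for cp in range(first, last + 1):
        decomp = unicodedata.decomposition(chr(cp))  # e.g. "<initial> 0628"
        if not decomp:
            continue  # unassigned or no compatibility mapping
        tag, *bases = decomp.split()
        # Keep single-letter forms only; ligatures such as lam-alef
        # decompose to two base letters and are skipped.
        if tag in CONTEXT_TAGS and len(bases) == 1:
            forms[chr(int(bases[0], 16))].append(tag.strip("<>"))
    return forms


def fold_presentation_forms(text: str) -> str:
    """Fold presentation-form characters back to base letters (NFKC)."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    forms = contextual_forms()
    for letter in ("\u0627", "\u0628"):  # alef (non-joining) and beh (joining)
        print(unicodedata.name(letter), forms[letter])
```

Alef, a non-joining letter, has only isolated and final forms, while beh has all four. Text extracted from PDFs or produced by some OCR engines may contain these presentation forms rather than base letters, so folding with NFKC before comparing against ground truth is one possible normalisation step.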

RASM2018: ICFHR 2018 Competition on
Recognition of Historical Arabic Scientific
Manuscripts
In collaboration with our partners at
the Alan Turing Institute and PRImA
Research Lab, we launched a
competition as part of the 16th
International Conference on Frontiers
in Handwriting Recognition (ICFHR
2018), held August 5-8, 2018 in Niagara
Falls (USA).
The competition focused on finding an
optimal solution for accurately and
automatically transcribing historical
Arabic scientific handwritten
manuscripts, using ground truth that
we created.
A paper describing the competition and results was published in
the proceedings of ICFHR 2018:
C. Clausner, A. Antonacopoulos, N. McGregor, D. Wilson-Nunn,
"ICFHR 2018 Competition on Recognition of Historical Arabic
Scientific Manuscripts - RASM2018", Proceedings of the 16th
International Conference on Frontiers in Handwriting Recognition
(ICFHR2018), Niagara Falls, USA, August 2018, pp. 471-476.
http://www.primaresearch.org/RASM2018/

Competition challenges
[Figure: sample of transcribed Arabic manuscript text]
Challenge 1: Page Layout Analysis
Challenge 2: Text Line Polygons
Challenge 3: OCR

Collaborative Transcription
• We explored creating a ground truth dataset collaboratively and
at scale, using the collective expertise of volunteers
• We used a free and open-source platform, FromThePage, which
allowed anyone with an interest in historical Arabic manuscripts to
experience them up close
• A BL team of curatorial and translation experts produced the
first 10 pages as an example for volunteers
• It took only 18 days for 36 volunteers from around the world
to fully transcribe 85 pages
https://fromthepage.com/

Methods Evaluated
• Google Cloud Vision API
J. Walker, Y. Fujii, A.C. Popat “A Web-Based OCR Service for Documents” in
Proceedings of the 13th IAPR International Workshop on Document Analysis
Systems (DAS), Vienna, Austria, Apr. 2018
• KFCN, Ben-Gurion University of the Negev
B. Kurar and J. El-Sana, “Binarization free layout analysis for Arabic historical
documents using fully convolutional networks” in Arabic Script Analysis and
Recognition (ASAR), 2018 2nd International Workshop on. IEEE, 2018
• RDI, Cairo University
RDI-Corporation’s own Historical Arabic Handwritten/Typewritten OCR system which
has been built from different historical manuscripts
• Tesseract 3.04 + 4.0 (beta)
• ABBYY FineReader Engine 11

Results – Challenge 1: Page Layout Analysis
Success rate by method:
• Tesseract 3: 48.4%
• Tesseract 4: 54.5%
• FRE11: 40.9%
• Google: 70.6%
• KFCN: 87.9% (winner)

Results – Challenge 2: Text Line Segmentation
Success rate by method:
• Tesseract 3: 28.8%
• Tesseract 4: 44.2%
• FRE11: 43.2%
• KFCN: 67.7%
• RDI: 81.6% (winner)

Results – Challenge 3: OCR – Character Accuracy
Flex Character Accuracy by method:
• Tesseract 3: 20.93%
• Tesseract 4: 30.45%
• FRE11: 12.23%
• Google: 64.76%
• RDI: 85.44% (winner)

Future Plans for historical Arabic texts
• RASM2019 ICDAR2019 competition
• Test this material with Transkribus
• Explore external collaborations, e.g. with RDI, Transkribus and the
Open Islamicate Texts Initiative (OpenITI)

What’s next
• Integrate OCR with digital objects to make full
text searchable through a IIIF viewer
• Host all ground truth resources and make them freely
available for anyone wishing to advance the state of the
art in text recognition technology (BL
Repository, replacing data.bl.uk)
• Host all resources on the IMPACT Centre of
Competence website
• Pilot workflows to OCR our materials at scale
using the more successful methods
• Promote our fully searchable digitised items to
target audiences (e.g. researchers)

@BL_DigiSchol
@BL_IndianPrint
@BL_AdiKS
digitalresearch@bl.uk
Six Jain saints from a Jain doctrinal text,
Jaina Dharma Siddhanta Sara
bl.uk/early-indian-printed-books
primaresearch.org/REID2019/
bl.uk/projects/arabic-htr
primaresearch.org/RASM2019
