A presentation to attendees of our workshop on transcribing Arabic scientific manuscripts to create ground truth for OCR.
For more details see: https://www.eventbrite.co.uk/e/arabic-scientific-manuscripts-transcription-workshop-tickets-43303096728
About the project: http://blogs.bl.uk/digital-scholarship/2018/03/arabic-handwrittten-ocr.html
The Ground Truth: Arabic Scientific Manuscripts Workshop
1. The Ground Truth: Arabic Scientific Manuscripts Workshop
Nora McGregor
Digital Curator
@ndalyrose
2. www.bl.uk 2
Timetable
10:00 Welcome & Introduction to the project
11:00 Meet the Curators and the Manuscripts
11:30 Getting started with the platform
12:00 Lunch & Digging into transcription
14:00 Tea & Coffee
16:00 Close
3.
The British Library is the national library of the UK and, by many counts, one of the largest research libraries in the world.
By law (Legal Deposit), a copy of every print publication in the UK and Ireland must be given to the British Library by its publisher. In 2013 this was extended to digital publications.
4.
Well over 150 million items are currently stored in London and in York.
The building in St Pancras can seat 1,200 researchers at any one time across 11 reading rooms.
If you saw 5 items a day it would take you 80,000 years to see the whole collection.
Digitisation is key to opening up access.
5.
BL Arabic scientific manuscript collections
In 2012 the British Library Qatar Foundation Partnership launched the Qatar Digital Library, a bilingual online portal providing access to previously undigitised British Library archive materials relating to Gulf history and Arabic science.
• 600 manuscripts
• 1,500 texts
• 184,000 pages
• Manuscripts produced from Spain/North Africa to India
• Manuscripts dating from the 10th-20th centuries
• Authors dating from the 5th century BC to the 19th century
6.
Digital Scholarship @ British Library
Founded in 2010, the Digital Scholarship Department at the British Library supports researchers and staff in making innovative use of our digital collections and data.
We are a group of cross-disciplinary experts in digitisation, librarianship, digital history and humanities, and computer and data science, looking at how technology is transforming research and, in turn, our services.
@BL_DigiSchol
7.
The Digital Research View
• The Library has spent the last two decades creating digital assets through digitisation and preserving born-digital objects, and will continue to do so far into the future.
• We can now do much more than use technology simply to view these digital objects online, and must embrace the opportunities afforded by analysing these digital collections at scale.
10.
OCR
Optical Character Recognition (OCR) is the process of turning a picture of text into text itself: in other words, producing something like a TXT or DOC file from a scanned JPG of a printed or handwritten page.
OCR software can automatically analyse text and turn it into a form that a computer can process more easily.
http://www.explainthatstuff.com/how-ocr-works.html
11.
Text & Data Mining
Using a variety of computational techniques to derive information from and find patterns in texts and large datasets. Two common text mining (TM) tasks:
• Named-entity recognition: find and classify words in texts that might refer to names of things, such as a person or company.
• Topic modelling: a method for finding groups of words (i.e. topics) from a collection of documents that best represent the information in the collection.
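To make the first task concrete, here is a deliberately naive sketch of finding candidate names in a text. It is not a trained named-entity recogniser of the kind real text mining tools use: it assumes that any run of capitalised words is a name, so it finds candidates but does not classify them. The function name `candidate_entities` and the sample sentence are made up for illustration, and the sample is written in lower case except for the names so that sentence-initial words do not produce false hits.

```python
import re
from collections import Counter

# Toy heuristic (an assumption, not a real NER model):
# a run of capitalised words is treated as one candidate name.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def candidate_entities(text: str) -> Counter:
    """Count each run of capitalised words found in the text."""
    return Counter(match.group(0) for match in CANDIDATE.finditer(text))


# Hypothetical sample text, lower-cased apart from the names.
sample = (
    "the logs of the East India Company were transcribed by the Met Office. "
    "later the Met Office compared them with modern models."
)

counts = candidate_entities(sample)
for name, count in counts.most_common():
    print(name, count)
```

A heuristic like this only works for scripts that have capital letters. Arabic has none, which is one small reason why tools built for Latin script do not transfer directly.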
13.
18th-Century Ships' Logs + Modern Weather Forecasting
The East India Company archives include 900 ships' log-books containing daily instrumental measurements of temperature and pressure, and subjective estimates of wind speed and direction, from voyages across the Atlantic and Indian Oceans between 1789 and 1834.
The Met Office digitised and transcribed these books, providing 273,000 new weather records that offer an unprecedentedly detailed view of the weather and climate of the late eighteenth and early nineteenth centuries, which can be used to test the accuracy of their forecasting models.
14.
“West and the rest”
Buttressed by the rise of data science, faculty across humanities fields have harnessed search algorithms and optical character recognition (OCR) to conduct research on an unprecedented scale. Petabytes, not pages, are now the unit of analysis. Yet the majority of these tools only handle Latin script.
“Digital databases and text corpora – the ‘raw material’ of text mining and computational text analysis – are far more abundant for English and other Latin alphabetic scripts than they are for Chinese, Japanese, Korean, Sanskrit, Hindi, Arabic and other non-Latin orthographies,” Mullaney said. “Troves of unread primary sources lie dormant because no text mining technology exists to parse them…”
http://news.stanford.edu/thedish/2016/10/17/digital-humanities-scholars-receive-mellon-support/
https://islamicdh.org/conference2013/
https://islamicdh.org/2016/03/31/new-publication-on-islamic-digital-humanities/
15.
Challenges with Arabic script
Arabic script presents unique challenges for text recognition:
• Arabic script writing styles are varied.
• Characters are written in cursive and joined from right to left. Each may take 2 to 4 shapes, and each shape is context sensitive.
• The shape of each of the 28 Arabic characters may change drastically depending on its position in the word. Non-joining characters do not join to the following letter, so even though the script is cursive, small spaces can appear within a word.
• Long strokes run along the baseline.
• Complex combinations of ascenders, descenders, diacritics, and special notation, either above or below the baseline depending on the character, pose further challenges.
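The point about context-sensitive shapes can be checked against the Unicode character database. The sketch below is an illustration rather than part of the workshop materials: it assumes Python's standard `unicodedata` module and reads the Arabic Presentation Forms-B block (U+FE70 to U+FEFF), where each positional shape of a letter is encoded separately. The function name `contextual_forms` is made up for this example.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Set


def contextual_forms() -> Dict[str, Set[str]]:
    """Map each base Arabic letter to the positional shapes Unicode lists for it."""
    forms: Dict[str, Set[str]] = defaultdict(set)
    for code_point in range(0xFE70, 0xFF00):
        # Decompositions in this block look like "<initial> 0628".
        parts = unicodedata.decomposition(chr(code_point)).split()
        # Keep single-letter forms only; ligatures and unassigned points are skipped.
        if len(parts) == 2 and parts[0].startswith("<"):
            position = parts[0].strip("<>")
            base_letter = chr(int(parts[1], 16))
            forms[base_letter].add(position)
    return dict(forms)


forms = contextual_forms()
beh = "\u0628"   # a joining letter
alef = "\u0627"  # a non-joining letter: it never connects to the letter after it

print(unicodedata.name(beh), sorted(forms[beh]))
print(unicodedata.name(alef), sorted(forms[alef]))
```

A joining letter such as BEH has isolated, initial, medial and final shapes, while a non-joining letter such as ALEF has only isolated and final shapes. That missing connection is what produces the small gap inside a word.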
16.
Ground Truth
By knowing what the software is supposed to recognise on a page of handwritten text, researchers can both train their system to recognise the characters and test how well the system does once trained.
Most OCR systems require ground truth, essentially a set of files that hold a complete and accurate record of every element (text, line breaks etc.) of an image, in order to train and test their models.
Ground truth is the objective verification of the particular properties of a digital image, used to test the accuracy of automated image analysis processes. The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image.
This can be compared to the output of an OCR engine and used to assess the engine’s accuracy, and how important any deviation from ground truth is in that instance.
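One common way to compare OCR output with ground truth is the character error rate: the number of single-character edits needed to turn the OCR output into the ground truth, divided by the length of the ground truth. The sketch below is a minimal illustration under that assumption, not the scoring method of any particular OCR system or competition. The function names and the two sample strings are made up for this example.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance divided by the number of characters in the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


# Hypothetical line of ground truth and an OCR result with one wrong final letter.
ground_truth = "كتاب الحساب"
ocr_output = "كتاب الحسات"

cer = character_error_rate(ground_truth, ocr_output)
print(f"character error rate: {cer:.3f}")
```

The comparison here works on Unicode code points, so a real evaluation would first need to agree on how diacritics, spacing and line breaks are normalised in both the ground truth and the OCR output.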
18. OCR Competition: RASM2018
ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts
http://www.primaresearch.org/RASM2018/
The 16th International Conference on Frontiers in Handwriting Recognition
August 5 - 8, 2018 ● Niagara Falls, USA
19.
Transkribus
Transkribus is open-source software for the automated recognition, transcription, indexing and enrichment of handwritten archival documents. It relies on crowdsourcing and machine learning.
Each contribution helps train the model for automatic recognition.
Once a department of the British Museum, the Library became a separate institution in 1973 and moved into its own building in 1997.
While we acquire items through purchase or gifts, much of the collection has been built up through legal deposit.
Legal Deposit is a concept which has been part of English law since 1662.
In 2013, legal deposit was extended to cover non-print material. By law we now take in digitally published items as well, which means regular mass crawls of the entire UK web domain as well as ebooks, ejournals and the like.
https://www.bl.uk/collection-guides/the-kings-library
An example of a major digitisation project.
Earliest MS: Or 2600 (348/959)
What you can do when pictures of text turn into text itself.
https://stanfordnlp.github.io/CoreNLP/
http://www.scottbot.net/HIAL/index.html@p=19113.html
In OCR output we can locate where images might be (see Flickr). All these images are a result of mining OCR: https://www.flickr.com/photos/britishlibrary/albums
http://www.clim-past.net/8/1551/2012/cp-8-1551-2012.html
The future: automatically transcribe these historical handwritten documents and turn them into machine readable data for modern weather models.