The Ground Truth: Arabic
Scientific Manuscripts Workshop
Nora McGregor
Digital Curator
@ndalyrose
www.bl.uk 2
Timetable
10:00 Welcome & Introduction to the project
11:00 Meet the Curators and the Manuscripts
11:30 Getting started with the platform
12:00 Lunch & Digging into transcription
14:00 Tea & Coffee
16:00 Close
The British Library is the
national library of the UK
and by many counts one
of the largest research
libraries in the world.
By law (Legal Deposit) a
copy of every UK and
Ireland print publication
must be given to the
British Library by its
publishers. In 2013 this was
extended to digital publications.
Well over 150 million items
are currently stored in
London and in York.
The building in St Pancras
can seat 1,200 researchers
at any one time across 11
reading rooms.
If you saw 5 items a day it
would take you 80,000
years to see the whole
collection.
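The 80,000-year figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes exactly 150 million items, 5 items a day and 365 viewing days a year; the constant names are illustrative only.

```python
# Rough check of the "80,000 years" claim, under stated assumptions.
ITEMS = 150_000_000      # "well over 150 million", so this is a lower bound
ITEMS_PER_DAY = 5
DAYS_PER_YEAR = 365      # assumes no days off

years = ITEMS / ITEMS_PER_DAY / DAYS_PER_YEAR
print(f"About {years:,.0f} years")
```

The result lands a little above 80,000, consistent with the rounded figure on the slide.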
Digitisation is key to
opening up access.
BL Arabic scientific
manuscript collections
In 2012 the British Library Qatar
Foundation Partnership launched the Qatar
Digital Library, a bilingual online portal
providing access to previously undigitised
British Library archive materials relating to
Gulf history and Arabic science.
• 600 manuscripts
• 1,500 texts
• 184,000 pages
• Manuscripts produced from Spain/North
Africa to India
• Manuscripts dating from the 10th-20th
centuries
• Authors dating from the 5th century BC to
the 19th century
Digital Scholarship @ British Library
Founded in 2010, the Digital
Scholarship Department at British
Library supports researchers and
staff to make innovative use of our
digital collections and data.
We are a group of cross-disciplinary
experts in the areas of digitisation,
librarianship, digital history &
humanities, computer and data
science, looking at how technology is
transforming research, and in turn,
our services.
@BL_DigiSchol
The Digital Research View
• The Library has spent the last two decades creating digital assets
through digitisation and preserving born-digital objects, and will continue
to do so far into the future.
• We can now do much more than use technology simply to view
these digital objects online, and must embrace the opportunities
afforded by analysing these digital collections at scale.
The opportunities…
and challenges!
OCR
http://www.explainthatstuff.com/how-ocr-works.html
Optical Character Recognition
(OCR) is the process of turning a
picture of text into text itself—in
other words, producing something
like a TXT or DOC file from a
scanned JPG of a printed or
handwritten page.
OCR software can automatically
analyse text and turn it into a form
that a computer can process more
easily.
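To make the idea of "a picture of text into text" concrete, here is a deliberately tiny toy: it cuts a one-line bitmap into fixed-width cells and picks the closest stored template for each cell. The 3x5 glyph templates, the fixed-width segmentation and all function names are assumptions made for this sketch; real OCR engines use far more sophisticated layout analysis and trained models.

```python
# Toy template-matching OCR. '#' is ink, '.' is background.
# TEMPLATES is a hypothetical 3x5 "font" invented for this example.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
}

def split_glyphs(image, width=3, gap=1):
    """Cut a one-line image into fixed-width glyph cells."""
    step = width + gap
    count = (len(image[0]) + gap) // step
    return [[row[i * step:i * step + width] for row in image]
            for i in range(count)]

def distance(a, b):
    """Number of pixels that differ between two equally sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def recognise(image):
    """Return the text whose templates best match each glyph cell."""
    return "".join(
        min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))
        for glyph in split_glyphs(image)
    )

# A "scan" of the word LOT with one noisy pixel in the middle of the O.
scan = [
    "#...###.###",
    "#...#.#..#.",
    "#...###..#.",
    "#...#.#..#.",
    "###.###..#.",
]
print(recognise(scan))
```

The noisy pixel does not change the result because the O template is still the nearest match. Cursive, handwritten scripts break the fixed-width assumption entirely, which is one reason they are so much harder to recognise.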
Text & Data Mining
Using a variety of computational techniques to derive information from
and find patterns in texts and large datasets. Two common text mining tasks:
• Named-entity recognition: find and classify words in texts that might
refer to names of things, such as a person or company
• Topic modelling: a method for finding a group of words (i.e. a topic) from a
collection of documents that best represents the information in the
collection.
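As a minimal sketch of the first task, the snippet below does named-entity recognition by simple dictionary (gazetteer) lookup. The gazetteer entries, the labels and the function name are assumptions chosen for illustration; production systems usually rely on trained statistical models rather than a fixed list.

```python
import re

# Hypothetical gazetteer: surface form -> entity type.
GAZETTEER = {
    "British Library": "ORGANISATION",
    "Met Office": "ORGANISATION",
    "East India Company": "ORGANISATION",
    "London": "PLACE",
}

def find_entities(text):
    """Return (offset, surface form, type) for each gazetteer match, in text order."""
    hits = []
    for name, label in GAZETTEER.items():
        # Word boundaries stop "London" from matching inside a longer word.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            hits.append((match.start(), name, label))
    return sorted(hits)

sample = ("The Met Office transcribed East India Company log-books "
          "held at the British Library in London.")
for offset, name, label in find_entities(sample):
    print(offset, name, label)
```

A lookup like this only finds names it already knows, and it needs machine-readable text to begin with, which is why accurate transcription comes first.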
18th Century Ships' Logs +
Modern Weather
Forecasting
The East India Company archives include
900 log-books of ships containing daily
instrumental measurements of temperature
and pressure, and subjective estimates of
wind speed and direction, from voyages
across the Atlantic and Indian Oceans
between 1789 and 1834.
The Met Office digitised and transcribed
these books, providing 273,000 new weather
records offering an unprecedentedly detailed
view of the weather and climate of the late
eighteenth and early nineteenth centuries,
which can be used to test the accuracy of
their forecasting models.
“West and the rest”
Buttressed by the rise of data science, faculty
across humanities fields have harnessed search
algorithms and optical character recognition
(OCR) to conduct research on an unprecedented
scale. Petabytes, not pages, are now the unit of
analysis. Yet the majority of these tools only
handle Latin script.
“Digital databases and text corpora – the ‘raw
material’ of text mining and computational text
analysis – are far more abundant for English and
other Latin alphabetic scripts than they are for
Chinese, Japanese, Korean, Sanskrit, Hindi,
Arabic and other non-Latin orthographies,”
Mullaney said. Troves of unread primary sources
lie dormant because no text mining technology
exists to parse them…”
http://news.stanford.edu/thedish/2016/10/17/digital-humanities-scholars-receive-mellon-support/
https://islamicdh.org/conference2013/
https://islamicdh.org/2016/03/31/new-publication-on-islamic-digital-humanities/
Challenges with Arabic script
Arabic script presents unique challenges for text recognition:
• Arabic script writing styles are varied.
• Characters are written cursively and joined from right to left; each may take
two to four shapes, depending on context.
• The shape of each of the 28 Arabic letters may change drastically depending on
its position in the word. Non-joining letters do not connect to the letter that
follows them, even though the script is cursive, which leaves a small space
within a word.
• Long strokes run along the baseline.
• Complex combinations of ascenders, descenders, diacritics and special
notation, placed above or below the baseline depending on the character,
pose further challenges.
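The "two to four shapes" point can be seen directly in Unicode, which encodes separate presentation forms for each position a letter can occupy. The sketch below uses Python's standard `unicodedata` module to look those forms up by name; the helper name and the choice of sample letters are illustrative assumptions.

```python
import unicodedata

POSITIONS = ("ISOLATED", "INITIAL", "MEDIAL", "FINAL")

def contextual_forms(letter_name):
    """Map each position to the presentation form Unicode defines for it, if any."""
    forms = {}
    for position in POSITIONS:
        try:
            forms[position] = unicodedata.lookup(
                f"ARABIC LETTER {letter_name} {position} FORM")
        except KeyError:
            # No such form: the letter never takes this shape.
            pass
    return forms

for name in ("BEH", "ALEF", "DAL"):
    forms = contextual_forms(name)
    print(name, len(forms), sorted(forms))
```

BEH has all four forms, while ALEF and DAL have only isolated and final forms: they are non-joining letters that never connect to the letter after them, which produces the small gap within a word described above.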
Ground Truth
By knowing what the software
is supposed to recognise on a
page of handwritten text,
researchers can both train their
system to recognise the
characters as well as test how
well the system does once
trained.
Most OCR systems require
ground truth, essentially a set of
files that hold a complete
and accurate record of every
element (text, line breaks etc.) of
an image, in order to train and test
their models.
Ground truth is the objective verification of
the particular properties of a digital image,
used to test the accuracy of automated
image analysis processes. The ground truth
of an image’s text content, for instance, is
the complete and accurate record of every
character and word in the image.
This can be compared to the output of an
OCR engine and used to assess the
engine’s accuracy, and how important any
deviation from ground truth is in that
instance.
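One common way to compare OCR output with ground truth is the character error rate: the edit distance between the two strings divided by the length of the ground truth. The sketch below is a plain-Python illustration under that assumption; the function names and the sample strings are made up for the example, and it does not weigh how important each deviation is.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def character_error_rate(ground_truth, ocr_output):
    """Edit distance normalised by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)

# Illustrative strings: the OCR output gets the last letter of the first word wrong.
truth = "كتاب العلم"
output = "كتات العلم"
print(f"CER = {character_error_rate(truth, output):.2f}")
```

A lower rate means the engine's output is closer to the ground truth. The same comparison can be made at word level by splitting both strings on whitespace first.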
http://asar.ieee.tn/
OCR Competition: RASM2018
ICFHR2018 Competition on Recognition
of Historical Arabic Scientific
Manuscripts
http://www.primaresearch.org/RASM2018/
The 16th International Conference on Frontiers in Handwriting Recognition
August 5 - 8, 2018 ● Niagara Falls, USA
Transkribus
Transkribus is open-source software
for the automated recognition,
transcription, indexing and enrichment of
handwritten archival documents. It relies
on crowdsourcing and machine learning.
Each contribution
helps train the model
for automatic
recognition.
Kraken + Qanat +
Open Islamicate Texts Initiative (OpenITI)
And more, so let’s get
started!
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

The Ground Truth: Arabic Scientific Manuscripts Workshop

  • 1. The Ground Truth: Arabic Scientific Manuscripts Workshop Nora McGregor Digital Curator @ndalyrose
  • 2. www.bl.uk 2 10:00 Welcome & Introduction to the project 11:00 Meet the Curators and the Manuscripts 11:30 Getting started with the platform 12:00 Lunch & Digging into transcription 14:00 Tea & Coffee 16:00 Close Timetable
  • 3. The British Library is the national library of the UK and, by many counts, one of the largest research libraries in the world. By law (Legal Deposit), a copy of every UK and Ireland print publication must be given to the British Library by its publishers. In 2013 this was extended to digital publications.
  • 4. Well over 150 million items are currently stored in London and in York. The building in St Pancras can seat 1,200 researchers at any one time across 11 reading rooms. If you saw 5 items a day it would take you 80,000 years to see the whole collection. Digitisation is key to opening up access.
  • 5. BL Arabic scientific manuscript collections: In 2012 the British Library Qatar Foundation Partnership launched the Qatar Digital Library, a bilingual online portal providing access to previously undigitised British Library archive materials relating to Gulf history and Arabic science. • 600 manuscripts • 1,500 texts • 184,000 pages • Manuscripts produced from Spain/North Africa to India • Manuscripts dating from the 10th-20th centuries • Authors dating from the 5th century BC to the 19th century
  • 6. Digital Scholarship @ British Library: Founded in 2010, the Digital Scholarship Department at the British Library supports researchers and staff in making innovative use of our digital collections and data. We are a group of cross-disciplinary experts in the areas of digitisation, librarianship, digital history & humanities, and computer and data science, looking at how technology is transforming research and, in turn, our services. @BL_DigiSchol
  • 7. The Digital Research View: • The Library has spent the last two decades creating digital assets through digitisation and preserving born-digital objects, and will continue to do so far into the future. • We can now do much more than use technology simply to view these digital objects online, and must embrace the opportunities afforded by analysing these digital collections at scale.
  • 10. OCR: Optical Character Recognition (OCR) is the process of turning a picture of text into text itself, in other words, producing something like a TXT or DOC file from a scanned JPG of a printed or handwritten page. OCR software can automatically analyse text and turn it into a form that a computer can process more easily. Source: http://www.explainthatstuff.com/how-ocr-works.html
  • 11. Text & Data Mining: Using a variety of computational techniques to derive information from, and find patterns in, texts and large datasets. Two common text mining tasks: • Named-entity recognition: find and classify words in texts that might refer to names of things, such as a person or company • Topic modelling: a method for finding a group of words (i.e. a topic) from a collection of documents that best represents the information in the collection.
  • 13. 18th Century Ships' Logs + Modern Weather Forecasting: The East India Company archives include 900 log-books of ships containing daily instrumental measurements of temperature and pressure, and subjective estimates of wind speed and direction, from voyages across the Atlantic and Indian Oceans between 1789 and 1834. The Met Office digitised and transcribed these books, providing 273,000 new weather records offering an unprecedentedly detailed view of the weather and climate of the late eighteenth and early nineteenth centuries, which can be used to test the accuracy of their forecasting models.
  • 14. “West and the rest”: Buttressed by the rise of data science, faculty across humanities fields have harnessed search algorithms and optical character recognition (OCR) to conduct research on an unprecedented scale. Petabytes, not pages, are now the unit of analysis. Yet the majority of these tools only handle Latin script. “Digital databases and text corpora – the ‘raw material’ of text mining and computational text analysis – are far more abundant for English and other Latin alphabetic scripts than they are for Chinese, Japanese, Korean, Sanskrit, Hindi, Arabic and other non-Latin orthographies,” Mullaney said. Troves of unread primary sources lie dormant because no text mining technology exists to parse them… http://news.stanford.edu/thedish/2016/10/17/digital-humanities-scholars-receive-mellon-support/ https://islamicdh.org/conference2013/ https://islamicdh.org/2016/03/31/new-publication-on-islamic-digital-humanities/
  • 15. Challenges with Arabic script: Arabic script presents unique challenges for text recognition: • Writing styles are varied. • Characters are written in cursive and joined right to left; each may take 2 to 4 shapes, depending on context. • The shape of each of the 28 Arabic characters may change drastically depending on its position in the word. Non-joining characters do not connect to the following letter even though the script is cursive, which leaves a small space within a word. • Long strokes run along the baseline. • Complex combinations of ascenders, descenders, diacritics, and special notation, placed above or below the baseline depending on the character, pose further challenges.
  • 16. Ground Truth: By knowing what the software is supposed to recognise on a page of handwritten text, researchers can both train their system to recognise the characters and test how well the system does once trained. Most OCR systems require ground truth, essentially a set of files which hold the complete and accurate record of every element (text, line breaks etc.) of an image, in order to train and test their models. Ground truth is the objective verification of the particular properties of a digital image, used to test the accuracy of automated image analysis processes. The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image. This can be compared to the output of an OCR engine and used to assess the engine’s accuracy, and how important any deviation from ground truth is in that instance.
  • 18. OCR Competition: RASM2018 ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts http://www.primaresearch.org/RASM2018/ The 16th International Conference on Frontiers in Handwriting Recognition August 5 - 8, 2018 ● Niagara Falls, USA
  • 19. Transkribus: Transkribus is open-source software for the automated recognition, transcription, indexing and enrichment of handwritten archival documents. It relies on crowdsourcing and machine learning. Each contribution helps train the model for automatic recognition.
  • 20. Kraken + Qanat + Open Islamicate Texts Initiative (OpenITI)
  • 22. And more, so let’s get started!
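
Slide 16 describes comparing an OCR engine's output with the ground truth to assess accuracy. One common way to do this is a character error rate: the edit distance between the two texts divided by the length of the ground truth. The Python sketch below is an illustration only; the function names and the sample strings are assumptions made for this example, not part of the workshop platform or of any OCR engine mentioned in the slides.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance normalised by the ground-truth length (hypothetical helper)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


# Illustrative strings only: a four-letter word with one letter misread.
truth = "\u0643\u062a\u0627\u0628"       # kaf, teh, alef, beh
recognised = "\u0643\u0646\u0627\u0628"  # kaf, noon, alef, beh
print(character_error_rate(truth, recognised))
```

A lower rate means the output is closer to the ground truth. In practice both texts would usually be Unicode-normalised first, and how much a given deviation matters still depends on the use case, as the slide notes.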

Editor's Notes

  1. Once a department of the British Museum, the Library became a separate institution in 1973 and moved into its own building in 1997. While we acquire items through purchase or gifts, much of the collection has been built up through legal deposit. Legal Deposit is a concept which has been part of English law since 1662. In 2013, legal deposit was extended to cover non-print material, which means that by law we take in digitally published items as well: regular mass crawls of the entire UK web domain as well as ebooks, ejournals and the like. https://www.bl.uk/collection-guides/the-kings-library
  2. An example of a major digitisation project. Earliest MS: Or 2600 (348/959)
  3. https://www.youtube.com/watch?v=tp4y-_VoXdA&feature=youtu.be
  4. What you can do when pictures of text turn into text itself. https://stanfordnlp.github.io/CoreNLP/ http://www.scottbot.net/HIAL/index.html@p=19113.html
  5. With OCR we can locate where images might be; see Flickr. All these images are a result of mining OCR output: https://www.flickr.com/photos/britishlibrary/albums
  6. http://www.clim-past.net/8/1551/2012/cp-8-1551-2012.html The future: automatically transcribe these historical handwritten documents and turn them into machine readable data for modern weather models.
  7. http://www.primaresearch.org/RASM2018/
  8. http://iti-corpus.github.io/ocrmain.html http://kitab-project.org/
  9. https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/
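
The point on slide 15 that an Arabic letter may take 2 to 4 shapes depending on its position can be seen in Unicode itself, which encodes positional variants in the Arabic Presentation Forms-B block. The sketch below uses only Python's standard `unicodedata` module; the helper name `contextual_forms` is an assumption made for this illustration. Handwritten manuscripts vary far more than these encoded forms, so this shows the principle only, not the difficulty faced by a recognition system.

```python
import unicodedata

POSITIONS = ("<isolated>", "<final>", "<initial>", "<medial>")


def contextual_forms(letter: str) -> dict:
    """Map position name -> presentation-form character for one Arabic letter.

    Scans the Arabic Presentation Forms-B block (U+FE70..U+FEFF) and keeps
    code points whose compatibility decomposition is exactly this letter.
    """
    target = f"{ord(letter):04X}"
    forms = {}
    for code_point in range(0xFE70, 0xFF00):
        char = chr(code_point)
        # Decompositions look like "<initial> 0628"; empty if unassigned.
        parts = unicodedata.decomposition(char).split()
        if len(parts) == 2 and parts[0] in POSITIONS and parts[1] == target:
            forms[parts[0].strip("<>")] = char
    return forms


beh = contextual_forms("\u0628")   # joins on both sides
alef = contextual_forms("\u0627")  # does not join to the following letter

for position, char in beh.items():
    print(position, unicodedata.name(char))
print(sorted(alef))
```

Beh, which joins on both sides, has four encoded forms, while alef, one of the non-joining letters mentioned on the slide, has only an isolated and a final form.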