Wroclaw University Library 
Grażyna Piotrowicz
Wroclaw University Library: 
1. is one of the biggest academic libraries in Poland. Its collection comprises ca. 2.4 million volumes, including 0.5 million special-collection items (e.g. manuscripts, old printed books, incunabula, maps, graphic collection, music collection);
2. is a member of IFLA, CERL, IAML and Technical Committee No. 242 (Information and Documentation) of the Polish Committee for Standardization;
3. has participated in many research projects (European, international, national, etc.);
4. has a staff team with long-standing experience in the digitisation of printed items as well as the processing and presentation of digital objects;
5. has been digitising its own physical resources since 2000, launched the Digital Library of University of Wroclaw (DLUW) in 2005 and, in 2013/2014, the university repository (Repository of University of Wroclaw, RUW).
Owing to an appropriate human-resources development policy, purchases of optical and electronic equipment and computers (hardware and software), and participation in many projects, the Wroclaw University Library has at its disposal the experienced staff and technological base that enable it to cooperate within the framework of the IMPACT Centre of Competence in Digitisation.
Use Case and Tools 
To improve the digitisation workflow in DLUW, it was necessary to implement tools that could speed up and optimise the processes.
For that purpose two tools were tested. First, Scan Tailor was chosen as the post-processing tool for scanned pages. It performs operations such as page splitting, deskewing and adding/removing borders. It was applied to raw scans and produced pages ready to be printed or assembled into PDF or DjVu files.
The second was Tesseract OCR, an open-source OCR engine that, combined with the Leptonica image processing library, can read a wide variety of image formats and convert them to text in over 60 languages.
Both tools were tested while preparing presentation versions of 12 selected old printed books (16th to 18th century), all with a single-column text layout, printed in different languages (e.g. Latin, Italian, German, Romance) and in different font types (e.g. Gothic, Roman). The aim of the tests was to work out the technological line and workflow for digitisation, processing and presentation of good-quality delivery files in the DLUW. For the evaluation, ground truth in plain text format was used (5 pages from every selected document).
The evaluation was performed by: 1. comparing OCR output with ground truth and measuring the character error rate (CER); 2. comparing OCR output with ground truth and measuring the word error rate (WER); 3. comparing OCR output from different engines.
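The two error rates can be illustrated with a minimal Python sketch. It assumes the common definition of both metrics as Levenshtein edit distance divided by the length of the ground truth (in characters or in words); the slides do not say which evaluation tool or text normalisation was actually used, and the sample strings are invented for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ground_truth: str, ocr_text: str) -> float:
    """Character error rate relative to the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)


def wer(ground_truth: str, ocr_text: str) -> float:
    """Word error rate, using whitespace tokenisation (an assumption)."""
    ref_words = ground_truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(ref_words, ocr_text.split()) / len(ref_words)


# Hypothetical sample: two character substitutions, each in a different word.
truth = "in principio erat verbum"
ocr = "in princlpio erat uerbum"
print(f"CER = {cer(truth, ocr):.2%}, WER = {wer(truth, ocr):.2%}")
```

Because one wrong character spoils a whole word, WER is usually noticeably higher than CER for the same page, which matches the pattern in the figures reported later in this presentation.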
The research process was carried out on a server in the following 3 steps:
1st step: running Scan Tailor with default settings.
After Scan Tailor had finished processing, an operator had to visually check the results and manually correct wrongly processed files.
This made it possible to bring the parameters of the subsequent processing to a satisfactory level. We wanted to obtain the best possible quality of "post master" files for later OCR processing and for aesthetic digital presentation of the originals in DLUW.
2nd step: saving the manual corrections on the server. Only the files that had to be corrected by the operator were saved on the server. The remaining results of Scan Tailor's automatic operations were left unchanged. A dedicated website on the server was used to support the 2nd step.
3rd step: running Tesseract. Beforehand, the appropriate dictionaries were chosen. We used only the dictionaries shipped with Tesseract; no additional training tools were applied. Small font sizes turned out to be a major problem for Tesseract. Additionally, it does not have tools for precisely delimiting the text layout and separating it from graphic areas. As a result, it attempts text recognition on graphical objects such as frames, floral ornaments, seals, etc.
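The 3rd step can be sketched as a small Python wrapper around the Tesseract command-line tool. This is an illustration under assumptions, not the project's actual server script: the file names are hypothetical, and the language codes (`lat`, `ita`, `deu`) stand for language data files shipped with Tesseract that must be installed separately. The `hocr` argument is Tesseract's standard config name for hOCR output.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Sequence


def tesseract_command(image: Path, output_base: Path,
                      languages: Sequence[str], hocr: bool = True) -> List[str]:
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG[+LANG] [hocr]."""
    cmd = ["tesseract", str(image), str(output_base), "-l", "+".join(languages)]
    if hocr:
        cmd.append("hocr")  # write OUTPUTBASE.hocr instead of OUTPUTBASE.txt
    return cmd


def run_ocr(image: Path, output_base: Path, languages: Sequence[str]) -> None:
    """Run Tesseract on one post-processed page; requires a local installation."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(tesseract_command(image, output_base, languages), check=True)


# Hypothetical page produced by the Scan Tailor step, recognised as Latin text.
print(tesseract_command(Path("page_0001.tif"), Path("page_0001"), ["lat"]))
```

Keeping command construction separate from execution makes it easy to log or review the exact call for every page before a batch is started on the server.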
Evaluation Results 
Implementing the new solution, which integrates dispersed digitisation and data-processing steps, can significantly decrease the costs and increase the efficiency of creating digital resources in the DLUW. The tests carried out on Scan Tailor and Tesseract are of great importance for preparing and organising a technological line for data processing in the cloud. Procedures and interfaces need to be worked out that enable our staff to support the remote processes.
Scan Tailor can carry out the following tasks automatically and efficiently: splitting master files into single pages, rotating the split pages to level the text, removing margins and discarding artifacts, and generating files ready for the OCR process. The only problem is correct recognition of the text area, which means that this task cannot be completed automatically without a control step. That imperfection does not disqualify Scan Tailor, and it will be used at WUL as an important tool in the data-processing workflow.
Tesseract seems to be a very promising tool, and trials will certainly be made to implement it in support of the digitisation of selected types of library materials. It is essential, however, to refine and improve the quality of document layout analysis as well as the handling of graphical elements and small fonts.
The results of text recognition can be saved as "txt" or "hocr" files. An "hocr" file contains the recognised text, its location relative to the original image, and style information. These data are encoded as XML-style markup in an HTML or XHTML file.
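To make the structure of such a file concrete, the following Python sketch extracts each recognised word and its bounding box from a small hOCR fragment using only the standard library. The fragment is hand-written for illustration (it is not output from the project), and the parser assumes the usual hOCR convention of `ocrx_word` spans whose `title` attribute carries a `bbox x0 y0 x1 y1` property; it does not handle nested spans inside a word.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]

# Hand-written sample fragment, not real project output.
SAMPLE_HOCR = """
<div class='ocr_page' id='page_1' title='bbox 0 0 1000 1500'>
 <span class='ocr_line' id='line_1_1' title='bbox 100 200 600 240'>
  <span class='ocrx_word' id='word_1_1' title='bbox 100 200 180 240; x_wconf 91'>In</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 200 200 420 240; x_wconf 84'>principio</span>
 </span>
</div>
"""


def parse_bbox(title: str) -> Optional[BBox]:
    """Return the bbox property from an hOCR title string, if present."""
    for prop in title.split(";"):
        fields = prop.split()
        if len(fields) == 5 and fields[0] == "bbox":
            return (int(fields[1]), int(fields[2]), int(fields[3]), int(fields[4]))
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word span."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, BBox]] = []
        self._in_word = False
        self._bbox: Optional[BBox] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "span" and "ocrx_word" in (attr.get("class") or "").split():
            self._in_word = True
            self._bbox = parse_bbox(attr.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            if self._bbox is not None:
                self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


parser = HocrWordParser()
parser.feed(SAMPLE_HOCR)
for text, bbox in parser.words:
    print(text, bbox)
```

The word coordinates are what allow the recognised text to be placed as a hidden layer over the page image in a hybrid PDF or DjVu publication.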
For archiving purposes, "hocr" files seem to be a good storage format. Each "hocr" file is assigned to a specific image file. This makes it possible to correct individual pages of a document, so the correction process can be organised more flexibly. Hybrid publications (PDF, DjVu) can be created automatically by the server. "hocr" files can serve as a basis for the further preparation of electronic publications. We recognise the potential of this solution and intend to use the tools created during the project in the near future.
Additionally, while carrying out another project involving the processing of 19th-century newspapers printed in Gothic type, we observed very satisfactory OCR results from Tesseract on objects converted to 1-bit (black-and-white) images: http://www.bibliotekacyfrowa.pl/publication/59368.
We also repeated the recognition of samples from object 319708 using prepared monochrome (1-bit) files. Tesseract results: CER 7.80% and WER 19.67%, versus the Tesseract results in our final report: CER 20.58% and WER 35.56%.
It thus turned out that creating a good black-and-white image is an essential element that very positively influences the OCR results.
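The slides do not say how the 1-bit files were produced. As one possible illustration, the sketch below implements Otsu's method, a widely used global thresholding technique that picks the grey level maximising the variance between the dark (ink) and light (paper) pixel classes. The choice of Otsu's method and the tiny pixel list are assumptions made for this example only.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return threshold t (0..255) maximising between-class variance.

    Pixels with value <= t are treated as dark (ink), the rest as light (paper).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue  # no dark pixels yet
        if w_light == 0:
            break     # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels: Sequence[int]) -> List[int]:
    """Convert 8-bit grey values to 1-bit values: 0 = black, 1 = white."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 1 for p in pixels]


# Hypothetical scan line: three ink pixels followed by three paper pixels.
grey = [20, 25, 30, 200, 210, 220]
print(otsu_threshold(grey), binarize(grey))
```

A single global threshold works best on evenly lit pages; stained or unevenly exposed historical prints may need a locally adaptive method instead.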

Wroclaw university library - Grazyna Piotrowicz

  • 1. Wroclaw University Library Grażyna Piotrowicz
  • 2. Wroclaw University Library: 1. is one of the bigest academic libraries in Poland. Its collection has ca 2,4 million of volumes and in that number 0,5 million of special collections‟ items (i.e. manuscriptes, old printed books, incunabula, maps, graphic collecion, music collection, etc.); 2.is a member of : IFLA, CERL, IAML, Technical Committee No 242 (for Information and Documentation) at Polish Committee for Standardization; 3.has participated in many research projects (European, international, national, etc.); 4.has the staff team with the long-standing experience in digitisation of printed items as well as processing and then presentation of digital objects; 5.has started the digitisation of own physical resources since the year 2000 , has initiated the Digital Library of University of Wroclaw (DLUW) in 2005 and in 2013/2014 – the university repository (Repository of University of Wroclaw – RUW); Owing to the appropriate policy of human resources development, purchases of optical & electronic equipment and computers (hardware & software) as well as participation in many projects the Wroclaw University Library has at its disposal experienced staff and technological base that enable it the cooperation in the framework of the Impact Centre of Competence in Digitisation.
  • 3. Use Case and Tools: To improve the digitisation workflow in the DLUW, tools were needed that could speed up and optimise the processes. Two tools were tested for that purpose. First, Scan Tailor was chosen as the post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, adding/removing borders, etc. It was applied to raw scans and produced pages ready to be printed or assembled into PDF or DjVu files. The second was Tesseract, an open source OCR engine that, combined with the Leptonica Image Processing Library, can read a wide variety of image formats and convert them to text in over 60 languages. Both tools were tested while preparing presentation versions of 12 selected old printed books (from the 16th to the 18th century), all with a single-column text layout, printed in different languages (e.g. Latin, Italian, German, Romance) and in different font types (e.g. Gothic, Roman). The aim of the tests was to work out the technological line and workflow for digitising, processing and presenting good-quality delivery files in the DLUW. Ground truth in plain text format was used for the evaluation (5 pages from each selected document). The evaluation was performed by: 1. comparing OCR output with the ground truth and measuring the character error rate; 2. comparing OCR output with the ground truth and measuring the word error rate; 3. comparing OCR output from different engines.
  • 4. Use Case and Tools: The research process was carried out on a server in the following 3 steps. 1st step: execution of Scan Tailor with default settings. After Scan Tailor had finished processing, the operator had to check the results visually and manually correct wrongly processed files. That operation made it possible to improve the parameters of the later processing to a satisfactory level. We wanted to obtain the best quality "post master" files for later OCR processing and for aesthetic digital presentations of the originals in the DLUW. 2nd step: saving the manual corrections on the server. Only the files that had to be corrected by the operator were saved on the server. The remaining results of Scan Tailor's automated operations were left unchanged. A dedicated Web site on the server was used to support the 2nd step. 3rd step: execution of Tesseract. The appropriate dictionaries were chosen beforehand. We used only the dictionaries distributed with Tesseract, and no additional training tools were applied. It turned out that small font sizes were a serious problem for Tesseract. Additionally, it has no tools for marking the text layout precisely and separating it from graphical areas. Without such a function, Tesseract attempts text recognition on graphical objects such as frames, floral ornaments, seals, etc.
  • 5. Evaluation Results: Implementing the new solution, which integrates the dispersed digitisation and data processing steps, can significantly decrease the costs and increase the efficiency of creating digital resources in the DLUW. The tests carried out on Scan Tailor and Tesseract are of great importance for preparing and organising a technological line for data processing in the cloud. It is necessary to work out procedures and interfaces that allow our staff to support the remote processes. Scan Tailor can carry out the following tasks automatically and efficiently: splitting master files into single pages, rotating the split pages to level the text, removing margins and rejecting artifacts, and generating files prepared for the OCR process. The only problem is correct recognition of the text area. Because of that problem, this task cannot be completed automatically without a control step. That imperfection does not disqualify Scan Tailor, and it will be applied in WUL as an important tool in data processing. Tesseract seems to be a very promising tool, and trials will certainly be made to implement it in support of the digitisation of selected types of library materials. It is essential, however, to refine and improve the quality of the document layout analysis as well as the recognition of graphical elements and small fonts.
  • 6. Evaluation Results: The results of text recognition can be saved as "txt" or "hocr" files. An "hocr" file contains the following data: the recognized text, its location relative to the original image, and its style. These data are stored as XML in the form of an HTML or XHTML file. Taking into account the needs of the archiving process, "hocr" files seem to be a good format for saving the results. Each "hocr" file is assigned to a specific graphic file. In this way individual pages of a document can be corrected separately, which makes the organisation of the correction process more flexible. Hybrid publications (PDF, DjVu) can be created automatically by the server. "hocr" files can serve as a base for the further preparation of electronic publications. We noticed the potential of that solution, and we are going to use the tools created during the project in the near future. Additionally, while carrying out another project, on processing 19th-century newspapers printed in Gothic fonts, we observed very satisfactory OCR results from Tesseract on objects converted to a 1-bit (black/white) version: http://www.bibliotekacyfrowa.pl/publication/59368. We have also repeated the recognition of samples of object 319708 using prepared monochromatic (1-bit) files. The Tesseract results were CER 7.80% and WER 19.67%, compared with CER 20.58% and WER 35.56% for Tesseract in our final report. So it turned out that creating a good black-and-white image is an essential element that has a very positive influence on the OCR results.
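
The slides describe an "hocr" file as holding the recognized text together with its location on the page image. As a minimal sketch of how such a file might be read, the Python below collects each word and its bounding box. It assumes the common hOCR convention of `span` elements with class `ocrx_word` and a `title` attribute of the form `bbox x0 y0 x1 y1; ...`. The sample markup and the names `parse_bbox`, `HocrWordParser` and `extract_words` are illustrative only and are not taken from the project's own tools.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, written by hand for illustration.
SAMPLE_HOCR = (
    "<div class='ocr_page' title='bbox 0 0 800 1200'>"
    "<span class='ocr_line' title='bbox 36 90 180 118'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Wroclaw</span> "
    "<span class='ocrx_word' title='bbox 102 92 180 116; x_wconf 91'>Library</span>"
    "</span></div>"
)


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans contain no nested span elements.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

Because each "hocr" file corresponds to one page image, a per-page list like this could be corrected or re-rendered independently of the other pages.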
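
The character error rate (CER) and word error rate (WER) quoted in the slides compare OCR output with ground truth. The slides do not say which evaluation tool or exact formula was used, so the following is only a sketch of one common definition: the Levenshtein edit distance divided by the length of the ground truth, counted in characters for CER and in whitespace-separated words for WER. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ground_truth, ocr_text):
    """Character error rate: character edits / ground-truth characters."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)


def wer(ground_truth, ocr_text):
    """Word error rate: word edits / ground-truth words."""
    ref_words = ground_truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(ref_words, ocr_text.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr = "the quick brawn fox"
    print(f"CER {cer(truth, ocr):.2%}  WER {wer(truth, ocr):.2%}")
```

Under this definition one wrong character makes the whole word count as an error, which is consistent with the reported WER values being higher than the CER values.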