Converting printed documents into digital formats with OCRFeeder (LinuxTag 2011)


By Joaquim Rocha. Even with all the existing alternatives, a lot of information is still printed on paper. From historical documents to organizations that only recently started to use digital documents in their usual workflow, this information is constrained by the limitations of paper: it is hard to index and search, it risks deterioration, it has ecological implications, etc. Fortunately, state-of-the-art Optical Character Recognition (OCR) engines, including Free Software solutions, have high success rates in converting printed text into a digital format. However, these command-line tools only convert graphics into text, usually without taking into consideration the layout of the documents analyzed. This means that a regular page with a mixture of text in columns and occasional graphics will be read as if it were only text. There are some solutions that do take into account the structure of the documents besides performing OCR, but these are proprietary and commercial and usually do not offer a version for Linux.

OCRFeeder attempts to solve this problem. It automatically tries to outline the structure of the document (using its own algorithm), distinguishes between graphics and text, and performs OCR. Its main export format is ODT, but it can also export to HTML and save or load projects. It is also possible to use different system-wide OCR engines in the same document and to manually override any automatic action (for example, to correct its results). OCRFeeder is published under the GPL v3. It was designed to be used mainly in the GNOME desktop environment and is developed within GNOME's infrastructure. It stands as the only Free Software solution to provide a complete and easy-to-use graphical user interface application to convert printed documents. When used with the Orca screen reader, OCRFeeder is also a useful application for the visually impaired, since it enables a printed document to be converted and read by Orca. Thus, during 2010, the main focus of OCRFeeder's development was the improvement of its accessibility.

Currently, the main features of OCRFeeder are:
- Detection of the contents in a document's page;
- Classification of those contents (graphics or text);
- Deskewing of images;
- Import from a scanner device or PDF;
- Export to ODT or HTML;
- Manual editing of any results;
- Saving and loading of projects.

In this presentation I will explain how OCRFeeder's contents detection algorithm works, what a usual workflow to convert a document looks like, and give an overview of the main features of OCRFeeder with a demo.


OCRFeeder
Converting printed documents into digital formats
Joaquim Rocha
jrocha@igalia.com
Berlin, May 2011

What is it?
Document Analysis and Optical
Character Recognition
for GNOME

Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011

Why?
Paper has a number of problems
No applications for GNU/Linux to do
a fair job


Paper problems:
Security


CC Photo by: http://www.flickr.com/photos/badwsky/

Paper problems:
Preservation


CC Photo by: http://www.flickr.com/photos/98469445@N00/

Paper problems:
Data processing


CC Photo by: http://www.flickr.com/photos/hugovk/

Paper problems:
Ecology


CC Photo by: http://www.flickr.com/photos/pranavsingh/

No fair conversion apps for
GNU/Linux
apart from OCR engines, but...


OCR != Document Conversion
(it only deals with chars)
(does not consider the layout)
(does not distinguish contents)


What's needed is
Document Analysis and
Recognition
(conversion of documents to an
electronic format)
(first projects in the 80s)

Where were we at?
* Some closed solutions
* Only for proprietary systems
* Various prices
* still... arguable results


How


So many layouts...


CC Photo by: http://www.flickr.com/photos/uber-tuber/

Layouts vary with the type of
document
What works for detecting one won't
work for others


OCRFeeder focuses on contents,
not on layouts!


Key concept:
If a document image can be
divided in windows of 1 (content)
or 0 (not content),
then it is possible to group all the
1s and outline the contents
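
The key concept above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OCRFeeder's actual implementation: the page is assumed to be a grayscale image given as a list of pixel rows, and the function names, the window size, and the background/tolerance parameters are all hypothetical. A window counts as content (1) when any of its pixels differs noticeably from the background; adjacent 1-windows are then grouped and each group is outlined with a bounding box.

```python
from collections import deque


def content_windows(image, window, background=255, tolerance=10):
    """Divide the image into square windows and mark each one as
    1 (content) or 0 (not content). All parameters are illustrative."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    grid = []
    for top in range(0, rows, window):
        grid_row = []
        for left in range(0, cols, window):
            # A window is content if any pixel differs from the background.
            has_content = any(
                abs(image[y][x] - background) > tolerance
                for y in range(top, min(top + window, rows))
                for x in range(left, min(left + window, cols))
            )
            grid_row.append(1 if has_content else 0)
        grid.append(grid_row)
    return grid


def outline_contents(grid):
    """Group adjacent 1-windows (4-neighbourhood) and return one bounding
    box per group as (left, top, right, bottom), in window coordinates,
    with inclusive bounds."""
    seen = set()
    boxes = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell != 1 or (r, c) in seen:
                continue
            # Breadth-first search over the connected group of 1s.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, bottom, left, right = r, r, c, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < len(grid) and 0 <= nx < len(grid[ny])
                    if inside and grid[ny][nx] == 1 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return boxes
```

Multiplying a box's coordinates by the window size gives the area of the page that would then be classified as graphics or text and, for text, handed to an OCR engine.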

Recognition:
System-wide OCR engines are used
Engines are configured from the
GUI or XML files
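
As a rough illustration of configuring an engine from an XML file, the sketch below reads an engine description and builds the command line to run it on an image. The XML layout, the tag names, and the `$IMAGE` placeholder are assumptions made for this example and are not taken from the slides; the Tesseract arguments are likewise only an example.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical engine description; the real file layout may differ.
EXAMPLE_ENGINE_XML = """
<engine>
    <name>Tesseract</name>
    <engine_path>/usr/bin/tesseract</engine_path>
    <arguments>$IMAGE stdout</arguments>
</engine>
"""


def load_engine(xml_text):
    """Parse an engine description into a plain dict of tag -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def build_command(engine, image_path):
    """Return the argument list to run the engine on one image,
    replacing the (assumed) $IMAGE placeholder with the image path."""
    arguments = shlex.split(engine["arguments"])
    return [engine["engine_path"]] + [
        image_path if argument == "$IMAGE" else argument
        for argument in arguments
    ]
```

The resulting list can be passed to `subprocess.run(..., capture_output=True, text=True)` to collect the recognized text from the engine's standard output.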


The best-known free OCR engines are
detected and configured
automatically:
* Tesseract
* GOCR
* OCRAD
* Cuneiform
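
One simple way such automatic detection could work is to look for each engine's executable on the PATH. The sketch below assumes this approach; the slides do not say how OCRFeeder actually does it. The executable names are the usual command names of these engines, and the `which` parameter exists only to make the function easy to test.

```python
import shutil

# Engine display name -> usual executable name.
KNOWN_ENGINES = {
    "Tesseract": "tesseract",
    "GOCR": "gocr",
    "OCRAD": "ocrad",
    "Cuneiform": "cuneiform",
}


def detect_engines(known=KNOWN_ENGINES, which=shutil.which):
    """Return a dict of engine name -> executable path for every
    known engine whose executable is found on the PATH."""
    found = {}
    for name, executable in known.items():
        path = which(executable)
        if path is not None:
            found[name] = path
    return found
```

Calling `detect_engines()` with no arguments returns whichever of the four engines are installed on the current system.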

Exportation formats:
ODT
HTML
Plain text


User interaction:
Users can edit everything
and review the algorithm's results
So, UI can work in attended and
unattended ways
CLI only works in an unattended
mode

Demo time!


Other features:
* PDF importation
* Unpaper preprocessor
* Font style edition
* Image deskewing
* OCR results cleaning
* Project saving/loading

A11y:
* OCRFeeder is a very useful tool
for visually impaired users
* Last year, the main target of its
development was to improve a11y


Future:
* Integrate Ocropus as an
alternative analysis backend
* More exportation formats: HOCR,
PDF, etc.
* Make OCR engines' management
easier

Webpage:
http://live.gnome.org/OCRFeeder
git:
http://git.gnome.org/ocrfeeder
Bugzilla:
http://bugzilla.gnome.org
product: OCRFeeder

Manual in German:
http://wiki.ubuntuusers.de/OCRFeeder


Thank you!


