
OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

By Joaquim Rocha.

Many documents are still stored on paper, which presents problems related to preservation, flexibility and even ecology.

With the current Free Software OCR engines it is possible to get a good accuracy rate when converting printed text to digital format, but these engines only perform that basic conversion and know nothing about a document's structure and elements.

OCRFeeder presents itself as an easy-to-use solution implemented for GNOME that performs automatic content detection in pages, allows manual correction and uses the system-wide OCR engines to convert the text. It can export documents in various formats such as ODT, HTML or PDF.

This project stands as the most complete Free Software solution for converting printed documents to digital formats and competes with the proprietary alternatives.
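
As an illustration of what using "system-wide OCR engines" could involve, here is a minimal Python sketch that looks up the four engines the talk lists (Tesseract, GOCR, OCRAD, Cuneiform) on the PATH. The executable names and the `detect_engines` helper are assumptions made for this example; this is not OCRFeeder's own detection code.

```python
import shutil

# Assumed executable names for the engines named in the talk.
ENGINE_COMMANDS = {
    "Tesseract": "tesseract",
    "GOCR": "gocr",
    "OCRAD": "ocrad",
    "Cuneiform": "cuneiform",
}


def detect_engines(which=shutil.which):
    """Return {engine name: executable path} for engines found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    found = {}
    for name, command in ENGINE_COMMANDS.items():
        path = which(command)
        if path is not None:
            found[name] = path
    return found


if __name__ == "__main__":
    for name, path in detect_engines().items():
        print(f"{name}: {path}")
```

Running the script prints one line per engine installed on the machine; an empty result simply means none of the assumed executables were found.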

Published in: Technology

  1. OCRFeeder · OCR Made Easy on GNOME · Joaquim Rocha jrocha@igalia.com · July 27 2012. Code shown on the slide: static void _f_do_barnacle_install_properties(GObjectClass *gobject_class) { GParamSpec *pspec; /* Party code attribute */ pspec = g_param_spec_uint64 (F_DO_BARNACLE_CODE, "Barnacle code.", "Barnacle code", 0, G_MAXUINT64, G_MAXUINT64 /* default value */, G_PARAM_READABLE | G_PARAM_WRITABLE | G_PARAM_PRIVATE); g_object_class_install_property (gobject_class, F_DO_BARNACLE_PROP_CODE,
  2. What is it? Document Analysis and Optical Character Recognition for GNOME. Joaquim Rocha (Igalia) · OCRFeeder · GUADEC 2012
  3. Why? Paper has a number of problems. No applications for GNU/Linux to do a fair job.
  4. Paper problems: Security. CC Photo by: http://www.flickr.com/photos/badwsky/
  5. Paper problems: Preservation. CC Photo by: http://www.flickr.com/photos/98469445@N00/
  6. Paper problems: Data processing. CC Photo by: http://www.flickr.com/photos/hugovk/
  7. Paper problems: Ecology. CC Photo by: http://www.flickr.com/photos/pranavsingh/
  8. Paper problems: Accessibility. CC Photo by: http://www.flickr.com/photos/illustrator/
  9. No fair conversion apps for GNU/Linux apart from OCR engines, but...
  10. OCR != Document Conversion (it only deals with chars) (does not consider the layout) (does not distinguish contents)
  11. What's needed is Document Analysis and Recognition (conversion of documents to an electronic format) (first projects in the 80s)
  12. (no text on this slide)
  13. (no text on this slide)
  14. How it works
  15. So many layouts... CC Photo by: http://www.flickr.com/photos/uber-tuber/
  16. Layouts vary with the type of document. What works on detecting one, won't work on others.
  17. OCRFeeder focuses on contents, not on layouts!
  18. Key concept: If a document image can be divided in windows of 1 (content) or 0 (not content), then it is possible to group all the 1s and outline the contents
  19. (no text on this slide)
  20. Recognition: System-wide OCR engines are used. Engines are configured from the GUI or XML files.
  21. (no text on this slide)
  22. Most known free OCR engines are detected and configured automatically: Tesseract, GOCR, OCRAD, Cuneiform
  23. Exportation formats: ODT, HTML, Plain text, PDF
  24. User interaction: Users can edit everything and review the algorithm's results. So, UI can work in attended and unattended ways. CLI only works in an unattended mode.
  25. (no text on this slide)
  26. Demo time!
  27. Other features: PDF importation, Unpaper preprocessor, Font style edition, Image deskewing, OCR results cleaning, Project saving/loading
  28. Future: More exportation formats (HOCR, etc.); make OCR engines' management easier
  29. Webpage: http://live.gnome.org/OCRFeeder · git: http://git.gnome.org/ocrfeeder · Bugzilla: http://bugzilla.gnome.org (product: OCRFeeder)
  30. Thank you!
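
The key concept on slide 18 (mark each window as 1 or 0, then group the 1s and outline them) can be sketched in Python as below. This is only an illustration under stated assumptions, not OCRFeeder's actual algorithm: the page is assumed to be already binarized into rows of 0/1 pixels, a window counts as content when any pixel in it is set, and neighbouring windows are grouped by 4-connectivity. The names `window_grid` and `group_windows` are hypothetical.

```python
from collections import deque


def window_grid(image, size):
    """Reduce a binary image (rows of 0/1 pixels) to a grid of windows.

    Assumed rule: a window is 1 when any pixel inside it is set.
    """
    height, width = len(image), len(image[0])
    grid = []
    for top in range(0, height, size):
        row = []
        for left in range(0, width, size):
            has_ink = any(
                image[y][x]
                for y in range(top, min(top + size, height))
                for x in range(left, min(left + size, width))
            )
            row.append(1 if has_ink else 0)
        grid.append(row)
    return grid


def group_windows(grid):
    """Group 4-connected windows marked 1 and outline each group.

    Returns boxes as (left, top, right, bottom) in window units, inclusive.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over neighbouring content windows.
            queue = deque([(r, c)])
            seen[r][c] = True
            left, top, right, bottom = c, r, c, r
            while queue:
                cr, cc = queue.popleft()
                left, right = min(left, cc), max(right, cc)
                top, bottom = min(top, cr), max(bottom, cr)
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and grid[nr][nc] == 1 and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            boxes.append((left, top, right, bottom))
    return boxes


page = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1],
]
grid = window_grid(page, 2)
boxes = group_windows(grid)
```

Each box outlines one content area in window units; multiplying by the window size gives pixel coordinates. In a real tool, each outlined area would then be handed to an OCR engine or kept as an image, and the window size would need tuning per document.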