OCRFeeder (FOSDEM 2010)

By Joaquim Rocha.

OCRFeeder is a document layout analysis and optical character recognition system that I wrote for my Master's Thesis project.

As its website says, given a set of images it automatically outlines their contents, distinguishes graphics from text, and performs OCR on the latter. It can export to multiple formats, the main one being ODT.

I think this is currently the most complete and user-friendly OCR application for GNU/Linux and, of course, I wrote it to be used mainly with GNOME: it features a GUI written in PyGTK and respects, as far as I could manage, the GNOME User Interface Guidelines.

I would like to present how the application works on the inside, for example the page segmentation algorithm I created for it. I think this would be of interest to the GNOME community and to attendees of the GNOME Dev room at FOSDEM in general.


  1. OCRFeeder: Documents conversion on GNOME. Joaquim Rocha (jrocha@igalia.com), FOSDEM 2010
  2. What is it? Document Analysis and Optical Character Recognition for GNOME. Joaquim Rocha (Igalia) · OCRFeeder · FOSDEM 2010
  3. Why is it? Paper has a number of problems. No applications for GNU/Linux do a fair job.
  4. Paper problems: Security (CC photo by: http://www.flickr.com/photos/badwsky/)
  5. Paper problems: Preservation (CC photo by: http://www.flickr.com/photos/98469445@N00/)
  6. Paper problems: Data processing (CC photo by: http://www.flickr.com/photos/hugovk/)
  7. Paper problems: Ecology (CC photo by: http://www.flickr.com/photos/pranavsingh/)
  8. No fair conversion apps for GNU/Linux apart from OCR engines, but...
  9. OCR != Document Conversion (it only deals with chars) (does not consider the layout) (does not distinguish contents)
  10. What you want is Document Analysis and Recognition (conversion of documents to an electronic format) (first projects in the 80s)
  11. Where were we at?
      * Some closed solutions
      * Only for proprietary systems
      * Various prices
      * Still... arguable results
  12. How?
  13. How
  14. Base concept:
      1. Clip the contents
      2. Classify them
         2.1. They are graphics → Paste on document
         2.2. They are text → Calculate letter size; paste on document
  15. So many layouts... (CC photo by: http://www.flickr.com/photos/uber-tuber/)
  16. Layouts vary with the type of document. What works for detecting one won't work on others.
  17. OCRFeeder focuses on contents, not on layouts!
  18. Key concept: If a document image can be divided into windows of 1 (content) or 0 (not content), then it is possible to group all the 1s and outline the contents.
  19. Sliding Window Algorithm: 1. An NxN pixel window runs through the document top to bottom, left to right.
  20. Sliding Window Algorithm: 2. For every iteration, if there's a pixel inside the window which contrasts with the background, then the window gets a 1; otherwise it gets a 0.
  22. Sliding Window Algorithm: It does not check all the pixels, so performance is better.
  23. Sliding Window Algorithm: 3. After all windows have a value assigned, the ones with the value 1 are grouped.
  24. Sliding Window Algorithm: 4. Every time a set of 1s is grouped, each of its windows is reassigned the value 0 (these groups are called blocks).
  25. Sliding Window Algorithm: 5. When all windows have the value 0, the algorithm has reached the end.
  26. Block structure:
  27. Joining Blocks: Blocks are checked against each other and joined when appropriate. When no more blocks can be joined, the analysis part is finished.
  28. Recognition: System-wide OCR engines are used. Engines are configured from the GUI or XML files.
  29. Engine configuration:
      <?xml version="1.0" encoding="UTF-8"?>
      <engine>
          <name>Tesseract</name>
          <image_format>TIFF</image_format>
          <engine_path>/usr/bin/tesseract</engine_path>
          <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
      </engine>
  30. Classification: It is graphics if:
      * Text is empty
      * More than 50% of the chars are failure chars, punctuation or spaces
  31. Font Size Detection: “Measures” in pixels the size of each text line. Although it results in different sizes...
  32. Font Size Detection: The value equal to or greater than the average is chosen (results in values equal or close to the original font size).
  33. Font Size Detection: The font size is calculated in inches using the resolution (DPI) (if there's no resolution info, assume 300 DPI). The value is then divided by the size of a DTP (DeskTop Publishing) point, 1/72 inch.
  34. Exportation to ODT: Uses ODFPy (abstracts ODF creation) (just above XML)
  35. User interaction: Users can edit everything and review the algorithm's results. So the UI can work in attended and unattended ways. The CLI only works in unattended mode.
  36. ABBYY FineReader test. Nuance OmniPage test.
  37. FineReader's results. OmniPage's results.
  38. Demo time!
  39. Other features:
      * PDF importation
      * Unpaper preprocessor
      * Font style editing
      * Exportation to HTML
      * Project saving/loading
  40. Future:
      * Integrate Ocropus as an alternative analysis backend
      * More exportation formats: hOCR, txt, PDF
      * Improved a11y
      * Better integration with GNOME and other GNOME apps
  41. GNOME: Development moved to GNOME's infrastructure last month.
  42. Webpage: http://live.gnome.org/OCRFeeder
      git: http://git.gnome.org/ocrfeeder
      Bugzilla: coming soon...
  43. Thank you!
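
The sliding window steps on slides 18 to 25 can be sketched roughly as below. This is a minimal illustration, not OCRFeeder's actual code. It assumes a grayscale page stored as a list of rows of 0-255 values, a known background value, a fixed contrast threshold, and grouping by 4-neighbour flood fill. The names `window_grid`, `group_blocks`, `background` and `threshold` are hypothetical. For simplicity the sketch may inspect every pixel of an empty window, whereas the slides say the real algorithm avoids checking all the pixels.

```python
from collections import deque


def window_grid(image, n, background=255, threshold=64):
    """Give each n x n window a 1 if it holds a contrasting pixel, else 0."""
    height, width = len(image), len(image[0])
    grid = []
    for top in range(0, height, n):
        row = []
        for left in range(0, width, n):
            # any() stops at the first contrasting pixel it finds.
            has_content = any(
                abs(image[y][x] - background) > threshold
                for y in range(top, min(top + n, height))
                for x in range(left, min(left + n, width))
            )
            row.append(1 if has_content else 0)
        grid.append(row)
    return grid


def group_blocks(grid):
    """Group adjacent 1-windows into blocks, resetting each grouped window to 0.

    Returns bounding boxes (left, top, right, bottom) in window coordinates.
    4-neighbour adjacency is an assumption of this sketch.
    """
    grid = [row[:] for row in grid]  # work on a copy
    rows, cols = len(grid), len(grid[0])
    blocks = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1:
                continue
            left, top, right, bottom = c, r, c, r
            queue = deque([(r, c)])
            grid[r][c] = 0
            while queue:
                y, x = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx] == 1:
                        grid[ny][nx] = 0
                        queue.append((ny, nx))
            blocks.append((left, top, right, bottom))
    # At this point every window in the copy is 0, which is the stop condition.
    return blocks
```

Multiplying a block's window coordinates by `n` gives an approximate pixel outline of that piece of content.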
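
Slide 27 says blocks are joined "when appropriate" without giving the criterion. The sketch below assumes one plausible rule: two blocks are joined when their bounding boxes overlap or lie within a small margin of each other, and joining repeats until no pair qualifies. Both that rule and the names `boxes_touch`, `join_blocks` and `margin` are assumptions made for illustration.

```python
def boxes_touch(a, b, margin=0):
    """True if boxes (left, top, right, bottom) overlap or are within margin."""
    return (a[0] <= b[2] + margin and b[0] <= a[2] + margin
            and a[1] <= b[3] + margin and b[1] <= a[3] + margin)


def join_blocks(blocks, margin=0):
    """Merge touching blocks repeatedly until no two blocks can be joined."""
    blocks = list(blocks)
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if boxes_touch(blocks[i], blocks[j], margin):
                    a, b = blocks[i], blocks[j]
                    # Replace the pair with the box that encloses both.
                    blocks[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan, since the merged box may touch others
    return blocks
```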
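
To make the engine configuration on slide 29 more concrete, here is a hedged sketch of how such a file could be read and turned into a command line. How OCRFeeder really expands `$IMAGE` and `$FILE` is not stated on the slides. This sketch assumes plain placeholder substitution, done here with Python's `string.Template`, and the helper names `load_engine` and `build_command` are hypothetical. It only builds the command string and does not run it.

```python
import xml.etree.ElementTree as ET
from string import Template

# The configuration shown on slide 29, kept as bytes so the XML
# declaration's encoding is honoured by the parser.
ENGINE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<engine>
    <name>Tesseract</name>
    <image_format>TIFF</image_format>
    <engine_path>/usr/bin/tesseract</engine_path>
    <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>
"""


def load_engine(xml_bytes):
    """Return the engine settings as a dict keyed by element name."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: (child.text or "").strip() for child in root}


def build_command(engine, image_path, output_base):
    """Fill in the $IMAGE and $FILE placeholders (assumed semantics)."""
    arguments = Template(engine["arguments"]).substitute(
        IMAGE=image_path, FILE=output_base)
    return f'{engine["engine_path"]} {arguments}'
```

The resulting string contains `;` separators, so it would have to be run through a shell. The paths are not quoted here, so a real implementation would need to handle spaces and special characters.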
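
The classification rule on slide 30 can be written as a short predicate. The slides do not define "failure chars", so this sketch assumes they are characters an OCR engine emits when it cannot recognise a glyph. The `FAILURE_CHARS` set below is a hypothetical placeholder, as is the function name `is_graphics`. Treating whitespace-only text as empty is also an assumption.

```python
import string

# Hypothetical set of characters an engine might emit on failure.
FAILURE_CHARS = frozenset("\ufffd")


def is_graphics(text, failure_chars=FAILURE_CHARS):
    """Classify OCR output as graphics using the two rules from slide 30."""
    if not text.strip():
        return True  # rule 1: the text is empty
    noise = sum(
        1 for ch in text
        if ch in failure_chars or ch in string.punctuation or ch.isspace()
    )
    # Rule 2: more than 50% failure chars, punctuation or spaces.
    return noise / len(text) > 0.5
```

For example, `is_graphics("Hello world")` is false because only one of its eleven characters is a space, while a string made mostly of punctuation is classified as graphics.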
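
The font size arithmetic on slides 31 to 33 can be worked through as follows. The slides say that "the value equal or greater than the average is chosen" without saying which one when several qualify. This sketch assumes the smallest such value. It also reads the last step as converting inches to points at 72 points per inch. The function names are hypothetical.

```python
def choose_line_height(line_heights_px):
    """Pick the smallest measured height that is >= the average (assumption)."""
    average = sum(line_heights_px) / len(line_heights_px)
    return min(h for h in line_heights_px if h >= average)


def font_size_points(line_heights_px, dpi=None):
    """Convert the chosen pixel height to DTP points (1 point = 1/72 inch)."""
    dpi = dpi or 300  # slide 33: assume 300 DPI when there is no resolution info
    height_px = choose_line_height(line_heights_px)
    # pixels / DPI gives inches; inches * 72 gives points.
    return height_px * 72 / dpi
```

With measured line heights of 48, 50 and 52 pixels the average is 50, so 50 pixels is chosen. At the default 300 DPI that is 50 / 300 of an inch, which is 12 points.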