What is it?
Document Analysis and Optical
Character Recognition
for GNOME
Joaquim Rocha (Igalia) · OCRFeeder · FOSDEM 2010
Why is it?
Paper has a number of problems
No applications for GNU/Linux that do
a fair job
Paper problems:
Security
CC Photo by: http://www.flickr.com/photos/badwsky/
Paper problems:
Preservation
CC Photo by: http://www.flickr.com/photos/98469445@N00/
Paper problems:
Data processing
CC Photo by: http://www.flickr.com/photos/hugovk/
Paper problems:
Ecology
CC Photo by: http://www.flickr.com/photos/pranavsingh/
No fair conversion apps for
GNU/Linux
apart from OCR engines, but...
OCR != Document Conversion
(it only deals with chars)
(does not consider the layout)
(does not distinguish contents)
what you want is
Document Analysis and
Recognition
(conversion of documents to an
electronic format)
(first projects in the 80s)
Where were we at?
* Some closed solutions
* Only for proprietary systems
* Various prices
* still... arguable results
How
Base concept:
1. Clip the contents
2. Classify them
2.1. They are graphics → Paste on
document
2.2. They are text → Calculate letter
size; paste on document
So many layouts...
CC Photo by: http://www.flickr.com/photos/uber-tuber/
Layouts vary with the type of
document
What works for detecting one won't
work for others
OCRFeeder focuses on contents, not
on layouts!
Key concept:
If a document image can be
divided into windows of 1 (content)
or 0 (not content),
then it is possible to group all the
1s and outline the contents
Sliding Window Algorithm:
1. An NxN pixel window runs through
the document top to bottom, left to
right
Sliding Window Algorithm:
2. For every iteration, if there's a
pixel inside the window which
contrasts with the background,
then the window gets a 1,
otherwise it gets a 0
Sliding Window Algorithm:
It does not need to check every pixel,
so performance is better
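
Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not OCRFeeder's actual code: the function name `window_grid`, the grayscale list-of-rows image format, and the `background` and `threshold` parameters are all assumptions made for the example. Stopping at the first contrasting pixel is one plausible reading of "does not check all the pixels".

```python
def window_grid(image, n, background=255, threshold=30):
    """Return a grid of 1 (content) / 0 (not content), one cell per n x n window.

    `image` is a list of rows of grayscale values (0-255). The `background`
    value and contrast `threshold` are assumptions made for this sketch.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    grid = []
    for top in range(0, height, n):
        row = []
        for left in range(0, width, n):
            # Lazily walk the pixels of this window (clipped at the image edge).
            window = (
                image[y][x]
                for y in range(top, min(top + n, height))
                for x in range(left, min(left + n, width))
            )
            # any() stops at the first contrasting pixel, so a window that
            # holds content is rarely scanned in full.
            has_content = any(abs(p - background) > threshold for p in window)
            row.append(1 if has_content else 0)
        grid.append(row)
    return grid
```

For example, a 4x4 white image with a single dark pixel in the top-left corner and N = 2 gives a 2x2 grid where only the top-left window is 1.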
Sliding Window Algorithm:
3. After all windows
have a value assigned,
the ones with the value 1 are
grouped
Sliding Window Algorithm:
4. Every time a set of 1s is
grouped, each of its windows is
reassigned the value 0
(these groups are called blocks)
Sliding Window Algorithm:
5. When all windows have
the value 0, the algorithm
has reached the end
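
Steps 3 to 5 can be sketched as a flood fill over the window grid. The slides do not say how the 1s are grouped, so treating windows that touch (including diagonally) as one group is an assumption, as are the name `group_blocks` and the `(left, top, right, bottom)` bounding-box format in window units.

```python
def group_blocks(grid):
    """Group touching 1-windows and return one bounding box per group.

    Each box is (left, top, right, bottom) in window units. Grouped windows
    are reset to 0, so the scan ends once every window holds 0.
    """
    grid = [row[:] for row in grid]  # work on a copy, keep the caller's grid
    blocks = []
    for r in range(len(grid)):
        for c in range(len(grid[r])):
            if grid[r][c] != 1:
                continue
            # Start a new group from this window and grow it outwards.
            stack = [(r, c)]
            grid[r][c] = 0
            left, top, right, bottom = c, r, c, r
            while stack:
                y, x = stack.pop()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < len(grid) and 0 <= nx < len(grid[ny])
                                and grid[ny][nx] == 1):
                            grid[ny][nx] = 0  # step 4: reassign the value 0
                            stack.append((ny, nx))
            blocks.append((left, top, right, bottom))
    return blocks
```

Multiplying each coordinate by the window size N would map a block back to pixel coordinates in the original image.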
Joining Blocks:
Blocks are checked against each other
and joined when appropriate
When no blocks can be joined the
analysis part is finished
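
The joining loop might look like the sketch below. The slides do not define "when appropriate", so the criterion used here (bounding boxes that overlap or lie within `gap` units of each other) is purely an assumption, as are the name `join_blocks` and the `(left, top, right, bottom)` box format.

```python
def join_blocks(blocks, gap=0):
    """Repeatedly merge boxes that overlap or lie within `gap` of each other.

    Boxes are (left, top, right, bottom). The loop ends when a full pass
    finds no pair that can be joined.
    """
    blocks = list(blocks)
    joined = True
    while joined:
        joined = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                a, b = blocks[i], blocks[j]
                close = (a[0] - gap <= b[2] and b[0] - gap <= a[2]
                         and a[1] - gap <= b[3] and b[1] - gap <= a[3])
                if close:
                    # Replace the pair with the box that encloses both.
                    blocks[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del blocks[j]
                    joined = True
                    break
            if joined:
                break  # restart the scan with the updated list
    return blocks
```

A larger `gap` merges more aggressively, which in this sketch is the knob that trades separate paragraphs against fragmented ones.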
Classification:
It is graphics if:
* Text is empty
* More than 50% of the chars are
failure chars, punctuation or spaces
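
The two rules above can be sketched as a small predicate over the text an OCR engine returns for a block. The name `is_graphics` is hypothetical, and which characters count as "failure chars" depends on the engine; the Unicode replacement character used as the default here is only a placeholder assumption.

```python
import string


def is_graphics(ocr_text, failure_chars="\ufffd"):
    """Return True if a block's OCR output suggests graphics rather than text.

    `failure_chars` is a placeholder: the characters an engine emits on
    failure are engine-specific.
    """
    if not ocr_text:
        return True  # rule 1: the text is empty
    noise = sum(
        1
        for ch in ocr_text
        if ch in failure_chars or ch in string.punctuation or ch.isspace()
    )
    # Rule 2: more than 50% failure chars, punctuation or spaces.
    return noise / len(ocr_text) > 0.5
```

Note that exactly 50% noise still counts as text in this sketch, since the rule says "more than 50%".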
Font Size Detection:
“Measures” in pixels the size of
each text line
Although it results in different
sizes...
Font Size Detection:
The value equal to or greater than the
average is chosen
(results in values equal to or close to
the original font size)
Font Size Detection:
The font size is calculated in inches
using the resolution (DPI)
(if there's no resolution info, assume
300 DPI)
The value is then divided by the size of a
DTP (DeskTop Publishing) point, 1/72 inch
(that is, multiplied by 72)
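
The font size steps can be worked through in a short sketch. The name `font_size_points` is hypothetical, and reading "the value equal to or greater than the average" as the smallest measured line height that is at least the average is an assumption. The unit conversion itself is standard: 1 DTP point = 1/72 inch.

```python
def font_size_points(line_heights_px, dpi=None):
    """Estimate a font size in points from per-line heights in pixels."""
    if not line_heights_px:
        raise ValueError("at least one line height is required")
    if not dpi:
        dpi = 300  # no resolution info: assume 300 DPI
    average = sum(line_heights_px) / len(line_heights_px)
    # Assumption: choose the smallest measured height that is >= the average.
    chosen = min(
        (h for h in line_heights_px if h >= average),
        default=max(line_heights_px),
    )
    # pixels / DPI gives inches; inches * 72 gives DTP points.
    return chosen * 72 / dpi
```

For example, lines measuring 48, 50 and 52 pixels in a 300 DPI scan average 50 pixels; 50 / 300 inch times 72 gives 12 points.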
User interaction:
Users can edit everything
and review the algorithm's results
So the UI can work in attended and
unattended ways
The CLI only works in unattended
mode
Other features:
* PDF import
* Unpaper preprocessor
* Font style editing
* Export to HTML
* Project saving/loading
Future:
* Integrate Ocropus as an
alternative analysis backend
* More export formats: hOCR,
txt, PDF
* Improved a11y
* Better integration with GNOME
and other GNOME apps
GNOME:
Development moved to GNOME's
infrastructure last month