By Joaquim Rocha.
Even with all the existing alternatives, nowadays a lot of information is still printed on paper. From historical documents to places that only recently started to use digital documents in their usual workflow, this information is conditioned by the limitations of paper: hard to index and search, risk of deterioration, ecological implications, etc.
Fortunately, state-of-the-art Optical Character Recognition (OCR) engines, including Free Software solutions, have high success rates in converting printed text into a digital format. However, these command-line tools only convert graphics into text, usually without taking the layout of the analyzed documents into consideration. This means that a regular page with a mixture of text in columns and occasional graphics will be read as if it were only text. There are some solutions that do take the structure of documents into account besides performing OCR, but these are proprietary and commercial and usually do not offer a version for Linux.
OCRFeeder attempts to solve this problem. It automatically tries to outline the structure of the document (using its own algorithm), distinguish between graphics and text, and perform OCR. Its main export format is ODT, but it can also export to HTML and save or load projects. It is also possible to use different system-wide OCR engines in the same document and to manually override any automatic action (for example, to correct its results). OCRFeeder is published under the GPL v3; it was designed mainly for use in the GNOME desktop environment and is developed on GNOME's infrastructure. It stands as the only Free Software solution to provide a complete and easy-to-use graphical user interface application to convert printed documents.
When used with the Orca screen reader, OCRFeeder is also a useful application for the visually impaired, since it enables a printed document to be converted and read by Orca. Thus, during 2010, the main focus of OCRFeeder's development was the improvement of its accessibility.
Currently, the main features of OCRFeeder are:
- Detection of the contents in a document's page;
- Classification of those contents (graphics or text);
- Deskew of images;
- Import from a scanner device or PDF;
- Export to ODT or HTML;
- Manual editing of any results;
- Saving and loading of projects.
In this presentation I will explain how OCRFeeder's content detection algorithm works, describe a typical workflow for converting a document, and give an overview of OCRFeeder's main features with a demo.
8. No fair conversion apps for
GNU/Linux
apart from OCR engines, but...
Joaquim Rocha (Igalia) · OCRFeeder · LinuxTag 2011
9. OCR != Document Conversion
(it only deals with chars)
(does not consider the layout)
(does not distinguish contents)
10. What's needed is
Document Analysis and
Recognition
(conversion of documents to an
electronic format)
(first projects in the 80s)
11. Where were we at?
* Some closed solutions
* Only for proprietary systems
* Various prices
* still... arguable results
13. So many layouts...
CC Photo by: http://www.flickr.com/photos/uber-tuber/
14. Layouts vary with the type of
document
What works for detecting one won't
work for others
15. OCRFeeder focuses on contents,
not on layouts!
16. Key concept:
If a document image can be
divided in windows of 1 (content)
or 0 (not content),
then it is possible to group all the
1s and outline the contents
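The key concept above can be illustrated with a small sketch. This is not OCRFeeder's actual code: the function names, the "any non-background pixel" rule for marking a window as 1, and the use of 4-connected grouping are all assumptions made for illustration. The image is assumed to be a non-empty list of rows of grayscale values, with 255 as the background.

```python
from collections import deque


def classify_windows(image, window, background=255):
    """Return a grid holding 1 for each window with a non-background pixel.

    Hypothetical helper: the real classification rule may differ.
    """
    rows = (len(image) + window - 1) // window
    cols = (len(image[0]) + window - 1) // window
    grid = [[0] * cols for _ in range(rows)]
    for y, line in enumerate(image):
        for x, pixel in enumerate(line):
            if pixel != background:
                grid[y // window][x // window] = 1
    return grid


def outline_contents(grid, window):
    """Group 4-connected 1-windows and return their bounding boxes.

    Boxes are (left, top, right, bottom) in pixels; they may extend past
    the image edge when its size is not a multiple of the window size.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over neighbouring windows marked 1.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc),
                               (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            boxes.append((left * window, top * window,
                          (right + 1) * window, (bottom + 1) * window))
    return boxes


# A blank 8x6 page with two separate dark marks.
page = [[255] * 8 for _ in range(6)]
page[0][0] = 0
page[5][7] = 0
print(outline_contents(classify_windows(page, 2), 2))
```

Each box outlines one group of adjacent content windows. In a full pipeline, every outlined region would then be classified as graphics or text and, if text, handed to an OCR engine.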
22. User interaction:
Users can edit everything
and review the algorithm's results
So, UI can work in attended and
unattended ways
CLI only works in an unattended
mode
26. A11y:
* OCRFeeder is a very useful tool
for visually impaired users
* Last year, the main target of its
development was to improve a11y
27. Future:
* Integrate Ocropus as an
alternative analysis backend
* More export formats: hOCR,
PDF, etc.
* Make OCR engines' management
easier