OCRFeeder, documents conversion on GNOME

Igalia
Feb. 9, 2010

  1. OCRFeeder: Documents conversion on GNOME. Joaquim Rocha, jrocha@igalia.com, FOSDEM 2010 (slide background: a fragment of GObject C code that installs a property)
  2. What is it? Document Analysis and Optical Character Recognition for GNOME Joaquim Rocha (Igalia) · OCRFeeder · FOSDEM 2010
  3. Why is it? Paper has a number of problems. No applications for GNU/Linux do a fair job.
  4. Paper problems: Security. CC Photo by: http://www.flickr.com/photos/badwsky/
  5. Paper problems: Preservation. CC Photo by: http://www.flickr.com/photos/98469445@N00/
  6. Paper problems: Data processing. CC Photo by: http://www.flickr.com/photos/hugovk/
  7. Paper problems: Ecology. CC Photo by: http://www.flickr.com/photos/pranavsingh/
  8. No fair conversion apps for GNU/Linux apart from OCR engines, but...
  9. OCR != Document Conversion (it only deals with chars) (does not consider the layout) (does not distinguish contents)
  10. What you want is Document Analysis and Recognition (conversion of documents to an electronic format) (first projects in the 80s)
  11. Where were we at?
     * Some closed solutions
     * Only for proprietary systems
     * Various prices
     * Still... arguable results
  12. How?
  13. How
  14. Base concept:
     1. Clip the contents
     2. Classify them
        2.1. They are graphics → Paste on document
        2.2. They are text → Calculate letter size; paste on document
  15. So many layouts... CC Photo by: http://www.flickr.com/photos/uber-tuber/
  16. Layouts vary with the type of document. What works for detecting one won't work for others.
  17. OCRFeeder focuses on contents, not on layouts!
  18. Key concept: If a document image can be divided into windows of 1 (content) or 0 (not content), then it is possible to group all the 1s and outline the contents.
  19. Sliding Window Algorithm: 1. An NxN pixel window runs through the document top to bottom, left to right.
  20. Sliding Window Algorithm: 2. For every iteration, if there's a pixel inside the window which contrasts with the background, then the window gets a 1; otherwise it gets a 0.
  21. (Figure-only slide)
  22. Sliding Window Algorithm: It does not need to check all the pixels, which gives better performance.
  23. Sliding Window Algorithm: 3. After all windows have a value assigned, the ones with the value 1 are grouped.
  24. Sliding Window Algorithm: 4. Every time a set of 1s is grouped (these groups are called blocks), each of its windows is reassigned the value 0.
  25. Sliding Window Algorithm: 5. When all windows have the value 0, the algorithm has reached the end.
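
The five steps above can be sketched in a few lines of Python. This is only an illustration, not OCRFeeder's actual code: the function names, the grayscale list-of-rows image format, the `background` and `tolerance` parameters, and the use of 4-connected neighbours to group the 1s are all assumptions, since the slides do not say how contrast is measured or how windows are grouped.

```python
from collections import deque


def window_grid(image, n, background=255, tolerance=30):
    """Steps 1-2: mark each NxN window with 1 (content) or 0 (no content).

    `image` is a list of rows of grayscale values (an assumed format).
    """
    height, width = len(image), len(image[0])
    grid = []
    for top in range(0, height, n):
        row = []
        for left in range(0, width, n):
            # any() stops at the first contrasting pixel, so a window
            # with content is usually not scanned completely.
            has_content = any(
                abs(image[y][x] - background) > tolerance
                for y in range(top, min(top + n, height))
                for x in range(left, min(left + n, width))
            )
            row.append(1 if has_content else 0)
        grid.append(row)
    return grid


def group_blocks(grid, n):
    """Steps 3-5: group adjacent 1s into blocks, zeroing windows as we go.

    Returns bounding boxes (x0, y0, x1, y1) in pixels. A box may extend
    past the image edge by up to n - 1 pixels.
    """
    grid = [row[:] for row in grid]  # work on a copy
    blocks = []
    for r in range(len(grid)):
        for c in range(len(grid[0])):
            if grid[r][c] != 1:
                continue
            grid[r][c] = 0
            queue = deque([(r, c)])
            r0 = r1 = r
            c0 = c1 = c
            while queue:
                cr, cc = queue.popleft()
                r0, r1 = min(r0, cr), max(r1, cr)
                c0, c1 = min(c0, cc), max(c1, cc)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 1:
                        grid[nr][nc] = 0  # step 4: reassign grouped windows to 0
                        queue.append((nr, nc))
            blocks.append((c0 * n, r0 * n, (c1 + 1) * n, (r1 + 1) * n))
    return blocks
```

A larger `n` means fewer windows and coarser outlines; a smaller `n` follows the contents more closely at the cost of more iterations.
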
  26. Block structure:
  27. Joining Blocks: Blocks are checked against each other and joined when appropriate. When no more blocks can be joined, the analysis part is finished.
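
The slides do not say what makes joining "appropriate", so the following sketch picks one plausible rule purely for illustration: two blocks are joined when their bounding boxes overlap or lie within `gap` pixels of each other. The `(x0, y0, x1, y1)` box format, the `gap` parameter and both function names are hypothetical.

```python
def boxes_close(a, b, gap):
    """True if boxes a and b overlap or are at most `gap` pixels apart."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return (ax0 - gap <= bx1 and bx0 - gap <= ax1 and
            ay0 - gap <= by1 and by0 - gap <= ay1)


def join_blocks(blocks, gap=0):
    """Repeatedly merge close blocks until no pair can be joined."""
    blocks = list(blocks)
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if boxes_close(blocks[i], blocks[j], gap):
                    a, b = blocks[i], blocks[j]
                    # Replace the pair with the box that encloses both.
                    blocks[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break
    return blocks
```

The outer loop restarts after every merge because an enlarged block may now be close to blocks it could not be joined with before.
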
  28. Recognition: System-wide OCR engines are used. Engines are configured from the GUI or XML files.
  29. Engine configuration:
     <?xml version="1.0" encoding="UTF-8"?>
     <engine>
         <name>Tesseract</name>
         <image_format>TIFF</image_format>
         <engine_path>/usr/bin/tesseract</engine_path>
         <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
     </engine>
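
To illustrate how such a file could be consumed, here is a minimal sketch using only the Python standard library. It is not OCRFeeder's loader: `load_engine` and `build_command` are hypothetical names, and treating `$IMAGE` and `$FILE` as simple placeholders filled in before the command is handed to a shell is an assumption based on the `;` separators in the `<arguments>` element.

```python
import shlex
import xml.etree.ElementTree as ET
from string import Template

ENGINE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<engine>
    <name>Tesseract</name>
    <image_format>TIFF</image_format>
    <engine_path>/usr/bin/tesseract</engine_path>
    <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>"""


def load_engine(xml_bytes):
    """Return the engine settings as a dict keyed by element name."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: (child.text or "").strip() for child in root}


def build_command(engine, image_path, output_base):
    """Fill in $IMAGE and $FILE and prepend the engine path.

    The result contains ';' separators, so it would have to be run
    through a shell; paths are quoted to keep that safe.
    """
    arguments = Template(engine["arguments"]).substitute(
        IMAGE=shlex.quote(image_path),
        FILE=shlex.quote(output_base),
    )
    return "{} {}".format(shlex.quote(engine["engine_path"]), arguments)


engine = load_engine(ENGINE_XML)
print(build_command(engine, "page.tiff", "/tmp/out"))
```

The sketch only builds the command string; it does not run the engine.
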
  30. Classification: It is graphics if:
     * Text is empty
     * More than 50% of the chars are failure chars, punctuation or spaces
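
The rule above translates almost directly into code. In this sketch the function name and the `failure_chars` parameter are assumptions: which characters an engine emits on a recognition failure depends on the engine, so they are passed in by the caller rather than hard-coded.

```python
import string


def is_graphics(text, failure_chars="", threshold=0.5):
    """Classify OCR output as graphics (True) or text (False)."""
    if not text.strip():
        return True  # nothing recognised: treat the block as graphics
    noise = set(string.punctuation) | set(failure_chars)
    bad = sum(1 for ch in text if ch.isspace() or ch in noise)
    # "More than 50%": exactly half noise still counts as text.
    return bad / len(text) > threshold
```

For example, `is_graphics("Hello world")` is false because only one of its eleven characters is a space, while a string made mostly of punctuation is classified as graphics.
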
  31. Font Size Detection: "Measures" in pixels the size of each text line. However, this results in different sizes...
  32. Font Size Detection: A value equal to or greater than the average is chosen (this results in values equal or close to the original font size).
  33. Font Size Detection: The font size is calculated in inches using the resolution (DPI) (if there's no resolution info, assume 300 DPI). The value is then expressed in DTP (DeskTop Publishing) points, of which there are 72 to the inch.
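
A small worked sketch of this calculation follows. Two points are assumptions rather than facts from the slides: the selection rule is read as "the smallest measured line height that is not below the average", and the conversion is taken to be pixels divided by DPI (giving inches) times 72 points per inch. The function names are hypothetical.

```python
DEFAULT_DPI = 300  # fallback when the image carries no resolution info
POINTS_PER_INCH = 72


def pick_line_height(heights_px):
    """Pick the smallest measured height that is >= the average."""
    average = sum(heights_px) / len(heights_px)
    candidates = [h for h in heights_px if h >= average]
    # Guard against floating-point rounding leaving no candidate.
    return min(candidates) if candidates else max(heights_px)


def font_size_points(heights_px, dpi=None):
    """Convert the chosen line height from pixels to DTP points."""
    dpi = dpi or DEFAULT_DPI
    return pick_line_height(heights_px) * POINTS_PER_INCH / dpi
```

With line heights of 48, 50, 52 and 50 pixels the average is 50, so 50 pixels is chosen; at the default 300 DPI that is 50 × 72 / 300 = 12 points.
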
  34. Exportation to ODT: Uses ODFPy (abstracts ODF creation) (just above XML)
  35. User interaction: Users can edit everything and review the algorithm's results. So, the UI can work in attended and unattended ways. The CLI only works in unattended mode.
  36. ABBYY FineReader test; Nuance OmniPage test
  37. FineReader's results; OmniPage's results
  38. Demo time!
  39. Other features:
     * PDF importation
     * Unpaper preprocessor
     * Font style editing
     * Exportation to HTML
     * Project saving/loading
  40. Future:
     * Integrate OCRopus as an alternative analysis backend
     * More exportation formats: hOCR, txt, PDF
     * Improved a11y
     * Better integration with GNOME and other GNOME apps
  41. GNOME: Development moved to GNOME's infrastructure last month.
  42. Webpage: http://live.gnome.org/OCRFeeder; git: http://git.gnome.org/ocrfeeder; Bugzilla: coming soon...
  43. Thank you!