OCRFeeder, documents conversion on GNOME
The presentation of OCRFeeder for the GNOME track in FOSDEM 2010.

Published in: Technology
Transcript

  • 1. OCRFeeder: Documents conversion on GNOME. Joaquim Rocha, jrocha@igalia.com, FOSDEM 2010
  • 2. What is it? Document Analysis and Optical Character Recognition for GNOME Joaquim Rocha (Igalia) · OCRFeeder · FOSDEM 2010
  • 3. Why is it? Paper has a number of problems; no applications for GNU/Linux do a fair job
  • 4. Paper problems: Security (CC Photo by: http://www.flickr.com/photos/badwsky/)
  • 5. Paper problems: Preservation (CC Photo by: http://www.flickr.com/photos/98469445@N00/)
  • 6. Paper problems: Data processing (CC Photo by: http://www.flickr.com/photos/hugovk/)
  • 7. Paper problems: Ecology (CC Photo by: http://www.flickr.com/photos/pranavsingh/)
  • 8. No fair conversion apps for GNU/Linux apart from OCR engines, but...
  • 9. OCR != Document Conversion (it only deals with chars) (does not consider the layout) (does not distinguish contents)
  • 10. What you want is Document Analysis and Recognition (conversion of documents to an electronic format) (first projects in the 80s)
  • 11. Where were we at? * Some closed solutions * Only for proprietary systems * Various prices * Still... arguable results
  • 12. How?
  • 13. How
  • 14. Base concept: 1. Clip the contents 2. Classify them 2.1. They are graphics → Paste on document 2.2. They are text → Calculate letter size; paste on document
  • 15. So many layouts... (CC Photo by: http://www.flickr.com/photos/uber-tuber/)
  • 16. Layouts vary with the type of document. What works for detecting one won't work for others
  • 17. OCRFeeder focuses on contents, not on layouts!
  • 18. Key concept: If a document image can be divided into windows of 1 (content) or 0 (not content), then it is possible to group all the 1s and outline the contents
  • 19. Sliding Window Algorithm: 1. An NxN pixel window runs through the document top to bottom, left to right
  • 20. Sliding Window Algorithm: 2. For every iteration, if there's a pixel inside the window which contrasts with the background, then the window gets a 1, otherwise it gets a 0
  • 21. (no text on this slide)
  • 22. Sliding Window Algorithm: It does not check all the pixels, so performance is better
  • 23. Sliding Window Algorithm: 3. After all windows have a value assigned, the ones with the value 1 are grouped
  • 24. Sliding Window Algorithm: 4. Every time a set of 1s is grouped, each window in it is reassigned the value 0 (these groups are called blocks)
  • 25. Sliding Window Algorithm: 5. When all windows have the value 0, the algorithm has reached the end
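The five steps of the sliding window algorithm can be sketched roughly as follows. This is a minimal illustration, not OCRFeeder's actual code: the image is assumed to be a grayscale list of pixel rows, and the names `window_grid` and `group_blocks`, the `background` and `tolerance` parameters, and the use of 4-neighbour adjacency for grouping are all assumptions made for the example. The early exit of `any()` is one possible reading of "it does not check all the pixels".

```python
from collections import deque


def window_grid(image, n, background=255, tolerance=30):
    """Steps 1-2: slide an n x n window over the image and give each
    window a 1 if some pixel contrasts with the background, else 0.
    `background` and `tolerance` are hypothetical parameters."""
    height, width = len(image), len(image[0])
    grid = []
    for top in range(0, height, n):
        row = []
        for left in range(0, width, n):
            # any() stops at the first contrasting pixel it finds.
            has_content = any(
                abs(image[y][x] - background) > tolerance
                for y in range(top, min(top + n, height))
                for x in range(left, min(left + n, width))
            )
            row.append(1 if has_content else 0)
        grid.append(row)
    return grid


def group_blocks(grid):
    """Steps 3-5: group adjacent 1-windows into blocks, resetting each
    grouped window to 0, until every window is 0. Returns bounding boxes
    (left, top, right, bottom) in window coordinates."""
    grid = [row[:] for row in grid]  # work on a copy
    rows, cols = len(grid), len(grid[0])
    blocks = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1:
                continue
            grid[r][c] = 0
            queue = deque([(r, c)])
            left, top, right, bottom = c, r, c, r
            while queue:
                y, x = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                # Assumption: windows are grouped by 4-neighbour adjacency.
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx] == 1:
                        grid[ny][nx] = 0
                        queue.append((ny, nx))
            blocks.append((left, top, right, bottom))
    return blocks
```

For example, `group_blocks([[1, 1, 0], [0, 1, 0], [0, 0, 1]])` yields two blocks, `(0, 0, 1, 1)` and `(2, 2, 2, 2)`.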
  • 26. Block structure:
  • 27. Joining Blocks: Blocks are checked against each other and joined when appropriate. When no more blocks can be joined, the analysis part is finished
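A rough sketch of the joining step, under stated assumptions: the slides do not say when joining is "appropriate", so this example assumes two blocks are joined when their bounding boxes overlap or lie within a hypothetical `gap` of each other. The function names are illustrative only.

```python
def boxes_touch(a, b, gap=0):
    """True if boxes (left, top, right, bottom) overlap or lie within
    `gap` units of each other. The criterion itself is an assumption."""
    return (a[0] <= b[2] + gap and b[0] <= a[2] + gap and
            a[1] <= b[3] + gap and b[1] <= a[3] + gap)


def join_blocks(blocks, gap=0):
    """Repeatedly merge any two touching blocks into their common
    bounding box until no more blocks can be joined."""
    blocks = list(blocks)
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if boxes_touch(blocks[i], blocks[j], gap):
                    a, b = blocks[i], blocks[j]
                    blocks[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan with the updated list
    return blocks
```

With the default `gap`, `join_blocks([(0, 0, 10, 10), (8, 8, 20, 20), (50, 50, 60, 60)])` merges the first two boxes and leaves the third alone.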
  • 28. Recognition: System-wide OCR engines are used. Engines are configured from the GUI or XML files
  • 29. Engine configuration: <?xml version="1.0" encoding="UTF-8"?> <engine> <name>Tesseract</name> <image_format>TIFF</image_format> <engine_path>/usr/bin/tesseract</engine_path> <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments> </engine>
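To illustrate how such a configuration might be consumed, here is a small sketch that parses the XML from the slide and expands the `$IMAGE` and `$FILE` placeholders into a shell command line. How OCRFeeder itself substitutes and runs these arguments is not shown in the slides, so `load_engine`, `build_command`, and the idea of passing the result to a shell are assumptions.

```python
import shlex
import xml.etree.ElementTree as ET
from string import Template

# The engine definition exactly as shown on the slide.
ENGINE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<engine>
    <name>Tesseract</name>
    <image_format>TIFF</image_format>
    <engine_path>/usr/bin/tesseract</engine_path>
    <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>
"""


def load_engine(xml_bytes):
    """Read the child elements of <engine> into a plain dict."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: (child.text or "").strip() for child in root}


def build_command(engine, image_path, output_base):
    """Expand $IMAGE and $FILE (assumed to be simple placeholders)
    and prepend the engine path. Paths are shell-quoted."""
    arguments = Template(engine["arguments"]).substitute(
        IMAGE=shlex.quote(image_path),
        FILE=shlex.quote(output_base),
    )
    return f'{engine["engine_path"]} {arguments}'


engine = load_engine(ENGINE_XML)
command = build_command(engine, "page.tiff", "/tmp/out")
# command == "/usr/bin/tesseract page.tiff /tmp/out; cat /tmp/out.txt; rm /tmp/out.txt"
```

Under this reading, the engine writes its result to `$FILE.txt`, `cat` prints the recognized text to standard output, and `rm` removes the temporary file.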
  • 30. Classification: It is graphics if: * Text is empty * More than 50% of the chars are failure chars, punctuation or spaces
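The classification rule can be written down almost directly. In this sketch the set of "failure chars" is an assumption (the slides do not list them); the Unicode replacement character is used as a hypothetical stand-in.

```python
import string

# Hypothetical: characters an OCR engine emits when recognition fails.
FAILURE_CHARS = "\ufffd"


def is_graphics(text, failure_chars=FAILURE_CHARS):
    """A block is treated as graphics if the OCR text is empty or if more
    than 50% of its characters are failure chars, punctuation or spaces."""
    if not text:
        return True
    noise = sum(
        1 for ch in text
        if ch in failure_chars or ch in string.punctuation or ch.isspace()
    )
    return noise / len(text) > 0.5
```

For instance, `is_graphics("Hello, world")` is `False` (2 of 12 characters are noise), while `is_graphics("~~ .. a")` is `True` (6 of 7).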
  • 31. Font Size Detection: “Measures” the size of each text line in pixels, although this results in different sizes...
  • 32. Font Size Detection: The value equal to or greater than the average is chosen (results in values equal or close to the original font size)
  • 33. Font Size Detection: The font size is calculated in inches using the resolution (DPI) (if there's no resolution info, assume 300 DPI). The value is then divided by the size of a DTP (DeskTop Publishing) point, 1/72 inch (that is, multiplied by 72)
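The arithmetic behind the font size detection can be sketched as below. The conversion follows from the definition of a DTP point (1/72 inch). The selection rule is an interpretation: "the value equal to or greater than the average" is read here as the smallest measured line height that is at least the average, which is an assumption. Both function names are illustrative.

```python
def pick_line_height(line_heights_px):
    """Assumed reading of the slide: choose the smallest measured line
    height that is equal to or greater than the average."""
    average = sum(line_heights_px) / len(line_heights_px)
    return min(h for h in line_heights_px if h >= average)


def pixels_to_points(height_px, dpi=None):
    """Convert a height in pixels to DTP points.
    pixels / DPI gives inches; one point is 1/72 inch."""
    dpi = dpi or 300  # no resolution info: assume 300 DPI
    return height_px * 72 / dpi


height = pick_line_height([48, 50, 52])  # average is 50, so 50 is chosen
size = pixels_to_points(height)          # 50 * 72 / 300 = 12.0 points
```

So a text line measured at 50 pixels in a 300 DPI scan corresponds to a 12-point font.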
  • 34. Export to ODT: Uses ODFPy (abstracts ODF creation) (just above XML)
  • 35. User interaction: Users can edit everything and review the algorithm's results. So, the UI can work in attended and unattended ways; the CLI only works in unattended mode
  • 36. ABBYY FineReader test. Nuance OmniPage test
  • 37. FineReader's results. OmniPage's results
  • 38. Demo time!
  • 39. Other features: * PDF import * Unpaper preprocessor * Font style editing * Export to HTML * Project saving/loading
  • 40. Future: * Integrate OCRopus as an alternative analysis backend * More export formats: hOCR, txt, PDF * Improved a11y * Better integration with GNOME and other GNOME apps
  • 41. GNOME: Development moved to GNOME's infrastructure last month
  • 42. Webpage: http://live.gnome.org/OCRFeeder git: http://git.gnome.org/ocrfeeder Bugzilla: coming soon...
  • 43. Thank you!
