Tesseract OCR Engine - OpenFest 2009

The lecture presents the open source project Tesseract, a free OCR engine written in C++. It covers Tesseract's strengths and weaknesses and explains how to train it for a new language. The demonstration materials are available at the authors' blog: http://www.nakov.com/blog

Published in: Technology

  1. Tesseract OCR Engine. Svetlin Nakov and Veselin Kolev, BASD (Bulgarian Association of Software Developers), www.devbg.org
  2. Hot News!
      • Microsoft Corporation just announced its strategic partnership with OpenFest
          • OpenFest is upgrading to Windows 7 and MS SQL Server 2008
  3. What is OCR?
      • Stands for Optical Character Recognition
      • Extracts the text from a given image
  4. What is OCR? (2)
      • Invented by Gustav Tauschek
      • Tauschek obtained a patent on OCR
          • 1929 in Germany
          • 1935 in the USA
      • Tauschek's machine
          • Was a mechanical device
          • Used templates, a light source and a photodetector
          • When a character lined up exactly with its template and a light was directed at them, no light reached the photodetector
  5. What is OCR? (3)
      • OCR predates electronic computers!
  6. Project Tesseract
      • History of Tesseract
          • Open source OCR engine
          • Developed by HP between 1985 and 1995
          • Never used in an HP product
          • Rated highly at the Fourth Annual Test of OCR Accuracy in 1995
          • In 2005 HP transferred Tesseract to the ISRI and released it as open source
              • ISRI = Information Science Research Institute
          • The development is currently led by Google
  7. Project Tesseract (2)
      • Tesseract is an OCR engine and is NOT a complete OCR program
          • Originally intended to serve as a component of other programs
          • Works from the command line
          • Has no page layout analysis (will have soon)
          • Has no output formatting
          • Has no GUI
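
Because Tesseract is meant to be embedded in other programs and is driven from the command line, a host program can simply call the executable. The following is a minimal sketch, assuming the `tesseract` binary is on the PATH and accepts the usual `tesseract <image> <outputbase> -l <lang>` form; the file names and the `bul` language code are hypothetical.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract and return the recognized text.

    Tesseract appends ".txt" to the output base name on its own,
    so the result is read back from "<output_base>.txt".
    """
    command = build_tesseract_command(image_path, output_base, lang)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

A caller would use it as `text = run_ocr("page.tif", "page", lang="bul")`. Layout analysis, output formatting and any UI remain the responsibility of the host program.
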
  8. Tesseract Versions
      • Stable build – version 2.04
          • Has some documentation
          • Can be easily trained on a new language
          • Has memory leaks
      • Development version – 3.0 (unstable)
          • Not documented, unstable
          • Language files are not compatible (need special conversion)
  9. Demo
      • Downloading, Compiling and Running Tesseract
      • (Latest Version)
  10. How Does Tesseract Work?
      • Adaptive thresholding on the input image
      • Analyze connected components in the binary image
      • Find text lines and words
      • First pass of the recognition process
          • Attempts to recognize each word in turn
      • Satisfactory words are passed to the adaptive trainer
      • Lessons learned are employed in a second pass
          • Applied to words that were not recognized satisfactorily
      • Produce the output text
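
The first two stages listed on this slide can be illustrated with a toy example. This is only a sketch of the general ideas, not Tesseract's actual code: the local-mean threshold, the `radius` and `offset` parameters and the breadth-first flood fill are all assumptions chosen for brevity.

```python
from collections import deque


def adaptive_threshold(gray, radius=1, offset=0):
    """Mark a pixel as ink (1) if it is darker than its local mean minus offset."""
    h, w = len(gray), len(gray[0])
    binary = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the neighbourhood, clipped at the image borders.
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            binary[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return binary


def connected_components(binary):
    """Return bounding boxes (left, top, right, bottom) of 8-connected ink blobs."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one blob, growing its bounding box as we go.
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        inside = 0 <= nx < w and 0 <= ny < h
                        if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes
```

Each bounding box is a candidate character or character fragment. A real engine would then group the boxes into text lines and words before the two recognition passes.
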
  11. Training Tesseract
      • Prepare training images and .box files
          • Files: lang.tif and lang.box
          • 2.04 supports only uncompressed TIFFs
          • .box files contain characters with coordinates
      • Extract the character features
          • This produces lang.tr
          • tesseract lang.tif junk nobatch box.train
      • Perform character clustering
          • mftraining lang.tr
          • cntraining lang.tr
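
To make the "characters with coordinates" point concrete, here is a small sketch of a box-file reader. It assumes the layout described in the Tesseract training documentation: one `char left bottom right top` entry per line, with the origin in the bottom-left corner of the image, and an optional trailing page number in newer versions. The `Box` type and the helper names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(lines) -> List[Box]:
    """Parse box-file lines of the form: char left bottom right top [page]."""
    boxes = []
    for line_no, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) < 5:
            raise ValueError(f"line {line_no}: expected 5 fields, got {len(parts)}")
        left, bottom, right, top = (int(p) for p in parts[1:5])
        if right < left or top < bottom:
            raise ValueError(f"line {line_no}: inverted box coordinates")
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


def box_size(box: Box) -> Tuple[int, int]:
    """Return (width, height) of a box in pixels."""
    return box.right - box.left, box.top - box.bottom
```

A reader like this is handy for sanity-checking hand-edited box files (for example, spotting inverted or suspiciously large boxes) before running the training commands.
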
  12. Training Tesseract (2)
      • Compute the character set properties
          • isLetter, isDigit, isUpper, isPunctuation, …
          • Unicode provides this information
          • unicharset_extractor lang.box
      • Train language dictionaries
          • List of all words in the target language
              • wordlist2dawg all-words.txt lang.word-dawg
          • List of the most frequent words
              • wordlist2dawg freq-words.txt lang.freq-dawg
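
Both inputs on this slide can be prepared with a short script. The sketch below is an illustration only: it derives the character properties from Python's Unicode tables and builds the two word lists from a text corpus. The function names, the `isLower` property and the choice of tokenizer are assumptions, and the script does not write the actual unicharset or DAWG files.

```python
import re
import unicodedata
from collections import Counter


def char_properties(ch):
    """Look up the per-character properties in the Unicode database."""
    return {
        "isLetter": ch.isalpha(),
        "isDigit": ch.isdigit(),
        "isUpper": ch.isupper(),
        "isLower": ch.islower(),  # assumed to be among the elided properties
        # Unicode general categories starting with "P" are punctuation.
        "isPunctuation": unicodedata.category(ch).startswith("P"),
    }


def build_word_lists(text, top_n):
    """Return (all_words, frequent_words) extracted from a corpus.

    all_words is sorted alphabetically; frequent_words holds the top_n
    most common words, most frequent first.
    """
    # [^\W\d_]+ matches runs of letters in any script, including Cyrillic.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    all_words = sorted(counts)
    frequent_words = [word for word, _ in counts.most_common(top_n)]
    return all_words, frequent_words
```

Writing `all_words` and `frequent_words` one per line to `all-words.txt` and `freq-words.txt` would give the inputs that the `wordlist2dawg` commands on the slide expect.
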
  13. Demo
      • Training Tesseract for Bulgarian and English
      • (Bulgarian for IT Professionals)
  14. Other OCR Engines
      • OCRopus
          • Open source document analysis and OCR system
          • Also funded by Google
          • Provides much of the layout analysis functionality missing from Tesseract
          • Can use engines other than Tesseract
          • http://code.google.com/p/ocropus/
  15. Other OCR Engines (2)
      • ABBYY FineReader OCR
          • Supports a large number of features
          • Known for its high accuracy
          • Commercial
      • Microsoft Office Document Imaging (MODI)
          • Supports editing documents scanned by Microsoft Office Document Scanning
          • First introduced in MS Office XP
          • Commercial
  16. Commercial OCR vs. Tesseract
      • Commercial OCR
          • 100+ languages
          • Accuracy is good now
          • Sophisticated app with complex UI
          • Works on complex magazine pages
          • Windows mostly
          • Costs $130-$500
      • Tesseract
          • 6 languages
          • Accuracy was good in 1995
          • No UI yet
          • Page layout analysis coming soon
          • Runs on Linux, Mac, Windows and more
          • Open source – Free!
  17. Tesseract Future
      • Page layout analysis
      • More languages
      • Improve accuracy
      • Add a UI
      • Support for connected scripts (such as Arabic)
  18. Links
      • For more information see:
          • http://code.google.com/p/tesseract-ocr/
          • http://en.wikipedia.org/wiki/Optical_character_recognition
          • http://tesseract-ocr.repairfaq.org/downloads/tesseract_overview.pdf
      • Speakers
          • http://nakov.com/blog
          • http://veskokolev.blogspot.com
  19. Tesseract OCR
      • Questions?