Tesseract OCR Engine - OpenFest 2009

The lecture presents the open source project Tesseract, a free OCR engine written in C++. It covers the strengths and weaknesses of Tesseract and explains how to train it for a new language. The demonstration materials are available at the author's blog: http://www.nakov.com/blog

Published in: Technology

Transcript

  • 1. Tesseract OCR Engine Svetlin Nakov and Veselin Kolev BASD (Bulgarian Association of Software Developers) www.devbg.org
  • 2. Hot News!
    • Microsoft Corporation just announced its strategic partnership with OpenFest
      • OpenFest is upgrading to Windows 7 and MS SQL Server 2008
  • 3. What is OCR?
    • Stands for Optical Character Recognition
    • Extracts the text from a given image
  • 4. What is OCR? (2)
    • Invented by Gustav Tauschek
    • Tauschek obtained a patent on OCR
      • 1929 in Germany
      • 1935 in USA
    • Tauschek’s machine
      • Was a mechanical device
      • Used templates, a light source and a photodetector
      • When light was directed at a template that exactly matched the character, no light reached the photodetector
  • 5. What is OCR? (3)
    • OCR predates electronic computers!
  • 6. Project Tesseract
    • History of Tesseract
      • Open source OCR engine
      • Developed by HP between 1985 and 1995
      • Never used in an HP product
      • Rated highly at The Fourth Annual Test of OCR Accuracy in 1995
      • In 2005 HP transferred Tesseract to the ISRI and released it as open source
        • ISRI == Information Science Research Institute
      • The development is currently led by Google
  • 7. Project Tesseract (2)
    • Tesseract is an OCR Engine and is NOT a complete OCR program
      • Originally intended to serve as a component part of other programs
      • Works from the command line
      • Has no page layout analysis (will have soon)
      • Has no output formatting
      • Has no GUI
  • 8. Tesseract Versions
    • Stable build – version 2.04
      • Has some documentation
      • Can be easily trained for a new language
      • Has memory leaks
    • Development version – 3.0 (unstable)
      • Not documented, unstable
      • Language files are not compatible (need special conversion)
  • 9.
    • Downloading, Compiling and Running Tesseract
    • (Latest Version)
    Demo
  • 10. How Does Tesseract Work?
    • Adaptive thresholding on the input image
    • Analyze connected components in the binary image
    • Find text lines and words
    • First pass of recognition process
      • Attempts to recognize each word in turn
    • Satisfactory words are passed to adaptive trainer
    • Lessons learned are employed in a second pass
      • Used for words that were not recognized satisfactorily
    • Producing the output text
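
The first two steps on this slide (thresholding, then connected component analysis) can be illustrated with a small pure-Python sketch. It is not Tesseract's own code: the local-mean threshold, the 8-connectivity, and the names `adaptive_threshold`, `connected_components` and `page` are assumptions chosen for illustration only.

```python
from collections import deque


def adaptive_threshold(gray, radius=1, offset=10):
    """Mark a pixel as ink (1) when it is darker than its local mean minus offset.

    Assumption: a simple local-mean rule, not the method Tesseract itself uses.
    """
    h, w = len(gray), len(gray[0])
    binary = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            binary[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return binary


def connected_components(binary):
    """Return one (left, top, right, bottom) box per 8-connected ink blob."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] == 0 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            left = right = x
            top = bottom = y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        inside = 0 <= nx < w and 0 <= ny < h
                        if inside and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


# Hypothetical 7x5 grayscale page: light background (200), two dark strokes (20).
page = [
    [200, 200, 200, 200, 200, 200, 200],
    [200, 20, 200, 200, 200, 20, 200],
    [200, 20, 200, 200, 200, 20, 200],
    [200, 20, 200, 200, 200, 200, 200],
    [200, 200, 200, 200, 200, 200, 200],
]
boxes = connected_components(adaptive_threshold(page, radius=2))
```

Each resulting box is a candidate character blob; grouping such boxes into text lines and words is the next step listed on the slide and is not covered by this sketch.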
  • 11. Training Tesseract
    • Prepare training images and .box files
      • Files: lang.tif and lang.box
      • 2.04 supports only uncompressed TIFFs
      • .box files contain characters with coordinates
    • Extract the character features
      • This produces lang.tr
    • Perform character clustering
    tesseract lang.tif junk nobatch box.train
    mftraining lang.tr
    cntraining lang.tr
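
To make "characters with coordinates" concrete, here is a minimal sketch of reading a .box file. It assumes the five-field layout used by Tesseract 2.x (`symbol left bottom right top`, with the origin at the bottom-left corner of the image); later versions add a page number field, which this sketch does not handle. The names `Box`, `parse_box_lines` and `to_top_left_origin` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(lines) -> List[Box]:
    """Parse 'symbol left bottom right top' lines, skipping blank ones."""
    boxes = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split from the right so the symbol keeps whatever text precedes
        # the four integer fields.
        char, left, bottom, right, top = line.rsplit(None, 4)
        boxes.append(Box(char, int(left), int(bottom), int(right), int(top)))
    return boxes


def to_top_left_origin(box: Box, image_height: int) -> Tuple[int, int, int, int]:
    """Convert to (left, top, right, bottom) with y growing downward."""
    return (box.left, image_height - box.top,
            box.right, image_height - box.bottom)


# Hypothetical contents of lang.box for an image that is 120 pixels tall.
sample = ["T 10 80 22 100", "e 24 80 34 94", ""]
parsed = parse_box_lines(sample)
first_in_image_coords = to_top_left_origin(parsed[0], image_height=120)
```

The conversion helper is useful when checking boxes in an image viewer, since most viewers place the origin at the top-left corner.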
  • 12. Training Tesseract (2)
    • Compute the character set properties
      • isLetter, isDigit, isUpper, isPunctuation, …
      • Unicode provides this information
    • Train language dictionaries
      • List of all words in the target language
      • List of the most frequent words
    unicharset_extractor lang.box
    wordlist2dawg freq-words.txt lang.freq-dawg
    wordlist2dawg all-words.txt lang.word-dawg
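
Since Unicode already provides the character set properties, they can be derived from the Unicode general category instead of being typed by hand. The sketch below uses Python's standard `unicodedata` module; the function name `char_properties`, the `isLower` entry, and the mapping from categories to properties are assumptions for illustration, not the format Tesseract expects in its unicharset file.

```python
import unicodedata


def char_properties(ch):
    """Derive slide-style character properties from the Unicode category."""
    category = unicodedata.category(ch)  # e.g. 'Lu', 'Ll', 'Nd', 'Po'
    return {
        "isLetter": category.startswith("L"),
        "isDigit": category == "Nd",
        "isUpper": category == "Lu",
        "isLower": category == "Ll",
        "isPunctuation": category.startswith("P"),
    }


# Works the same way for Bulgarian (Cyrillic) and English (Latin) characters.
examples = {ch: char_properties(ch) for ch in "Бяb7;"}
```

Looking up `examples["Б"]` reports an uppercase letter and `examples["я"]` a lowercase one, so the same code covers both alphabets used in the training demo.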
  • 13.
    • Training Tesseract for Bulgarian and English
    • (Bulgarian for IT Professionals)
    Demo
  • 14. Other OCR Engines
    • OCRopus
      • Open source document analysis and OCR system
      • Also funded by Google
      • Provides much of the layout analysis functionality missing from Tesseract
      • Can use recognition engines other than Tesseract
      • http://code.google.com/p/ocropus/
  • 15. Other OCR Engines (2)
    • ABBYY FineReader OCR
      • Supports a large number of features
      • Known for its high accuracy
      • Commercial
    • Microsoft Office Document Imaging (MODI)
      • Supports editing documents scanned by Microsoft Office Document Scanning
      • First introduced in MS Office XP
      • Commercial
  • 16. Commercial OCR vs. Tesseract
    • Commercial OCR
      • 100+ languages
      • Accuracy is good now
      • Sophisticated app with complex UI
      • Works on complex magazine pages
      • Windows mostly
      • Costs $130-$500
    • Tesseract
      • 6 languages
      • Accuracy was good in 1995
      • No UI yet
      • Page layout analysis coming soon
      • Runs on Linux, Mac, Windows and more
      • Open source, free
  • 17. Tesseract Future
    • Page layout analysis
    • More languages
    • Improve accuracy
    • Add a UI
    • Support for connected scripts (like Arabic)
  • 18. Links
    • For more information see:
      • http://code.google.com/p/tesseract-ocr/
      • http://en.wikipedia.org/wiki/Optical_character_recognition
      • http://tesseract-ocr.repairfaq.org/downloads/tesseract_overview.pdf
    • Speakers
      • http://nakov.com/blog
      • http://veskokolev.blogspot.com
  • 19.
    • Questions?
    Tesseract OCR