Lessons from Indic OCR Development


A talk on the Tesseract OCR system for Malayalam, given at the National Conference on Free Software.

Published in: Technology


  1. Lessons from Indic OCR Development
     • National Conference on Free Software
     • Nishad T R
     • NIT, Calicut
     • http://www.himili.com/ocr/
  2. Overview
     • History and evolution of OCR
     • When, where, why and how of OCR
     • Selection of an OCR engine and other gear
     • Putting it all together, and why
     • Tesseract architectural style
     • Challenges in Indic OCR
     • Lessons learned and applied
     • Where is it now?
  3. OCR in General
     • Engine
     • Training data
     • Input tools
     • Output formatting tools
     (Slides dated 15-Nov-08)
  4. Three competitors
     • Ocrad: the GNU OCR program, written by Antonio Diaz Diaz and licensed under the GPL.
     • GOCR: an OCR program written by Joerg Schulenburg and others, licensed under the GPL.
     • Tesseract: made open source in 2006 under the sponsorship of Google.
  5. And how they performed
  6. Again, how they performed
  7. And the winner is…
     • Tesseract gives extremely good output at a reasonable speed and is the clear overall winner of the test. The one caveat is that the input must first be converted to a bitonal (pure black-and-white) image.
     • Ocrad gives reasonable output at extremely high speed, which can be useful where speed matters more than accuracy.
     • GOCR gives poor output at slow speed.
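The bitonal caveat on this slide can be illustrated with a minimal sketch. It assumes the page is already available as rows of 8-bit grayscale values and applies a fixed global threshold; the function name `to_bitonal` and the default threshold of 128 are illustrative choices, not something taken from the talk. A real pipeline would more likely use an imaging library and an adaptive threshold.

```python
def to_bitonal(gray_rows, threshold=128):
    """Map 8-bit grayscale pixels to pure black (0) or pure white (255).

    gray_rows is a list of rows, each a list of ints in 0..255.
    The fixed global threshold is an assumption made for this sketch.
    """
    if not 0 <= threshold <= 255:
        raise ValueError("threshold must be in 0..255")
    return [
        [255 if pixel >= threshold else 0 for pixel in row]
        for row in gray_rows
    ]


if __name__ == "__main__":
    # A tiny 2x3 "page" with made-up grayscale values.
    page = [
        [12, 130, 250],
        [200, 90, 128],
    ]
    for row in to_bitonal(page):
        print(row)
```

Pixels at or above the threshold become white and everything else becomes black, so the recognizer only ever sees two levels.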
  8. Development Process Evolution
     • Fostering contributions
       • developer focus and avoiding starvation
       • code, code review, documentation, support
     • Recognizing ego
       • trust and good intentions
       • beware of maniacal focus
     • Limits of volunteerism
       • eight knives and an apple (dining developer problem)
       • eight knives and a pumpkin
       • eight pumpkins and no knives
  9. How Debayan tamed Matra
     • http://debayanin.googlepages.com/hackingtesseract
  10. And how they performed
      To train for another language, you have to create 8 data files in the tessdata subdirectory, each prefixed with the language code (shown as xxx below). Language codes follow the ISO 639-3 standard.
      • tessdata/xxx.freq-dawg
      • tessdata/xxx.word-dawg
      • tessdata/xxx.user-words
      • tessdata/xxx.inttemp
      • tessdata/xxx.normproto
      • tessdata/xxx.pffmtable
      • tessdata/xxx.unicharset
      • tessdata/xxx.DangAmbigs
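The eight-file requirement on this slide lends itself to a quick completeness check before running the engine. The sketch below is a hypothetical helper, not part of Tesseract: the function name `missing_training_files`, the constant `TRAINING_SUFFIXES`, and the `tessdata` path in the usage example are all assumptions. The suffixes are the ones listed on the slide.

```python
from pathlib import Path

# The eight per-language file suffixes listed on the slide.
TRAINING_SUFFIXES = (
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
)


def missing_training_files(tessdata_dir, lang):
    """Return the expected file names that are absent from tessdata_dir.

    lang is the language-code prefix, for example "mal" for Malayalam.
    """
    base = Path(tessdata_dir)
    expected = [f"{lang}.{suffix}" for suffix in TRAINING_SUFFIXES]
    return [name for name in expected if not (base / name).is_file()]


if __name__ == "__main__":
    # "tessdata" is a placeholder path; "mal" is the ISO 639-3 code for Malayalam.
    missing = missing_training_files("tessdata", "mal")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All eight training files are present.")
```

An empty result means all eight files exist; it says nothing about whether their contents are valid.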
  11. The BOX File concept
      Command:
      tesseract fontfile.tif fontfile batch.nochop makebox
      Sample box file (each line gives a character followed by the left, bottom, right and top pixel coordinates of its bounding box):
      അ 8 682 53 703
      ആ 62 676 112 703
      ഇ 121 676 155 705
      ഈ 165 677 220 705
      ഉ 232 677 256 704
      ഊ 265 677 313 705
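Each line of the sample box file holds a character followed by four integers. In Tesseract's box format these are the left, bottom, right and top pixel coordinates of the character's bounding box, measured from the bottom-left corner of the image. A minimal parsing sketch follows; the `Box` class and the function names are illustrative and not part of any Tesseract API, and only the five-field layout shown on the slide is handled (later Tesseract versions append a page number).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """One box-file entry: a character and its bounding box in pixels."""
    char: str
    left: int
    bottom: int
    right: int
    top: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.top - self.bottom


def parse_box_line(line):
    """Parse a single 'char left bottom right top' line into a Box."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    char, *coords = parts
    left, bottom, right, top = (int(value) for value in coords)
    return Box(char, left, bottom, right, top)


def parse_box_text(text):
    """Parse box-file text, skipping blank lines."""
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # The first two lines of the sample on the slide.
    sample = "അ 8 682 53 703\nആ 62 676 112 703\n"
    for box in parse_box_text(sample):
        print(box.char, box.width, box.height)
```

Having width and height at hand makes it easy to spot boxes that are implausibly small or large before correcting them by hand in a box editor.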
  12. In Kindergarten
  13. His Teacher
      • JTesseract is a GUI for Tesseract that eases the training process. It is released under the Apache 2.0 license.
      • JTesseract currently works only on the Windows platform.
      • Developed by Ruwan Janapriya Egoda Gamage (http://www.janapriya.net)
      • Features:
        • Visual box file editing
        • Project-based training process
  14. His Classmates
      • nopapaper
  15. LibTIFF
      This software provides support for the Tag Image File Format (TIFF), a widely used format for storing image data. The latest version of the TIFF specification is available online in several different formats, as are a number of Technical Notes (TTNs).
  16. Windows GUI
  17. Questions?
      Places to see:
      • Front door: http://code.google.com/p/tesseract-ocr
      • JTesseract: http://code.google.com/p/jtesseract/
      • FreeOCR: http://www.freeocr.net
      • http://www.himili.com/ocr