SlideShare a Scribd company logo
1 of 17
National Conference on Free Software Nishad T R NIT, Calicut http://www.himili.com/ocr/ Lessons from Indic OCR Development
2 Overview History and Evolution of OCR When, Where, Why and How of OCR Selection of an OCR Engine and other gears Putting it all together, and why Tesseractarchitectural style Challenges in Indic OCR Lessons learned and applied Where is it NOW?
OCR in General Engine Training Data Input Tools Output formatting tools 15-Nov-08 3
15-Nov-08 4 Three competents Ocrad  Ocrad is the GNU OCR program. It was written by Antonio Diaz Diaz and is licensed under GPL. GOCR GOCR is an OCR program written by Joerg Schulenburg and others. It is licensed under GPL. Tesseract Under the sponsorship of Google, Tesseract was made open source in 2006.
And how they performed
Again how they performed
And the winner is …. Tesseract gives extremely good output at a reasonable speed. It is the clear overall winner of the test. The only caveat is that one absolutely must convert the input to bitonal. Ocrad gives reasonable output at extremely high speed. It can be useful in applications where speed is more important than accuracy. GOCR gives poor output at a slow speed.
15-Nov-08 8 Development Process Evolution Fostering Contributions developer focus and avoiding starvation code, code review, documentation, support Recognizing Ego trust and good intentions beware of maniacal focus Limits of volunteerism eight knives and an apple (dining developer problem) eight knives and a pumpkin eight pumpkins and no knives
How Debayan tamed Matra http://debayanin.googlepages.com/hackingtesseract
And how they performed To train for another language, you have to create 8 data files in the tessdata subdirectory. Language codes follow the ISO 639-3 standard tessdata/xxx.freq-dawg tessdata/xxx.word-dawg tessdata/xxx.user-words tessdata/xxx.inttemp tessdata/xxx.normproto tessdata/xxx.pffmtable tessdata/xxx.unicharset tessdata/xxx.DangAmbigs
The BOX File concept Command tesseract fontfile.tif fontfilebatch.nochopmakebox Sample Box അ 8 682 53 703 ആ 62 676 112 703 ഇ 121 676 155 705 ഈ 165 677 220 705 ഉ 232 677 256 704 ഊ 265 677 313 705 15-Nov-08 11
In Kindergarten  15-Nov-08 12
His Teacher JTesseract is the Tesseract GUI responsible for easing the training process. JTesseract is released under Apache 2.0 license.  JTesseractcurrently works only on Windows platform.  Developed by RuwanJanapriyaEgodaGamagehttp://www.janapriya.net Features Visual box file editing  Project based training process  13
His Classmates nopapaper 15-Nov-08 14
LibTIFF This software provides support for the Tag Image File Format (TIFF), a widely used format for storing image data. The latest version of the TIFF specification is available on-line in several different formats, as are a number of Technical Notes (TTN's).  15-Nov-08 15
Windows GUI 15-Nov-08 16
15-Nov-08 17 Questions? Places to see: Front Door	http://code.google.com/p/tesseract-ocr jtesseracthttp://code.google.com/p/jtesseract/ FreeOCRhttp://www.freeocr.net http://www.himili.com/ocr

More Related Content

Viewers also liked

Signal &telicommunication doc/sanjeet-1308143
Signal &telicommunication doc/sanjeet-1308143Signal &telicommunication doc/sanjeet-1308143
Signal &telicommunication doc/sanjeet-1308143sanjeet kumar
 
Optical Character Recognition (OCR)
Optical Character Recognition (OCR)Optical Character Recognition (OCR)
Optical Character Recognition (OCR)Vidyut Singhania
 
optical character recognition system
optical character recognition systemoptical character recognition system
optical character recognition systemVijay Apurva
 
Project report of OCR Recognition
Project report of OCR RecognitionProject report of OCR Recognition
Project report of OCR RecognitionBharat Kalia
 
Optical Character Recognition( OCR )
Optical Character Recognition( OCR )Optical Character Recognition( OCR )
Optical Character Recognition( OCR )Karan Panjwani
 
Revision Guide A2 Media OCR
Revision Guide A2 Media OCRRevision Guide A2 Media OCR
Revision Guide A2 Media OCRreigatemedia
 

Viewers also liked (9)

Signal &telicommunication doc/sanjeet-1308143
Signal &telicommunication doc/sanjeet-1308143Signal &telicommunication doc/sanjeet-1308143
Signal &telicommunication doc/sanjeet-1308143
 
OCR
OCROCR
OCR
 
Optical Character Recognition (OCR)
Optical Character Recognition (OCR)Optical Character Recognition (OCR)
Optical Character Recognition (OCR)
 
Basics of-optical-character-recognition
Basics of-optical-character-recognitionBasics of-optical-character-recognition
Basics of-optical-character-recognition
 
optical character recognition system
optical character recognition systemoptical character recognition system
optical character recognition system
 
Project report of OCR Recognition
Project report of OCR RecognitionProject report of OCR Recognition
Project report of OCR Recognition
 
Optical Character Recognition( OCR )
Optical Character Recognition( OCR )Optical Character Recognition( OCR )
Optical Character Recognition( OCR )
 
Text Detection and Recognition
Text Detection and RecognitionText Detection and Recognition
Text Detection and Recognition
 
Revision Guide A2 Media OCR
Revision Guide A2 Media OCRRevision Guide A2 Media OCR
Revision Guide A2 Media OCR
 

Similar to Lessons from Indic OCR Development

Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...Andy Petrella
 
The trials and tribulations of providing engineering infrastructure
 The trials and tribulations of providing engineering infrastructure  The trials and tribulations of providing engineering infrastructure
The trials and tribulations of providing engineering infrastructure TechExeter
 
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...NETWAYS
 
The computer science behind a modern disributed data store
The computer science behind a modern disributed data storeThe computer science behind a modern disributed data store
The computer science behind a modern disributed data storeJ On The Beach
 
Rocking the microservice world with Helidon-LAOUCTour2023.pdf
Rocking the microservice world with Helidon-LAOUCTour2023.pdfRocking the microservice world with Helidon-LAOUCTour2023.pdf
Rocking the microservice world with Helidon-LAOUCTour2023.pdfAlberto Salazar
 
Cyber Security Workshop Presentation.pptx
Cyber Security Workshop Presentation.pptxCyber Security Workshop Presentation.pptx
Cyber Security Workshop Presentation.pptxYashSomalkar
 
Intro elasticsearch taswarbhatti
Intro elasticsearch taswarbhattiIntro elasticsearch taswarbhatti
Intro elasticsearch taswarbhattiTaswar Bhatti
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Ravi Okade
 
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Codemotion
 
Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)Anand Sampat
 
Machine learning in cybersecutiry
Machine learning in cybersecutiryMachine learning in cybersecutiry
Machine learning in cybersecutiryVishwas N
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015dhiguero
 
ICIC 2013 New Product Introductions CEPT
ICIC 2013 New Product Introductions CEPTICIC 2013 New Product Introductions CEPT
ICIC 2013 New Product Introductions CEPTDr. Haxel Consult
 
The Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseThe Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseArangoDB Database
 
Deploying New DNSSEC Algorithms (IEPG@IETF93 - July 2015)
Deploying New DNSSEC Algorithms (IEPG@IETF93 - July 2015)Deploying New DNSSEC Algorithms (IEPG@IETF93 - July 2015)
Deploying New DNSSEC Algorithms (IEPG@IETF93 - July 2015)Dan York
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark Hubert Fan Chiang
 

Similar to Lessons from Indic OCR Development (20)

Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
The trials and tribulations of providing engineering infrastructure
 The trials and tribulations of providing engineering infrastructure  The trials and tribulations of providing engineering infrastructure
The trials and tribulations of providing engineering infrastructure
 
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
 
The computer science behind a modern disributed data store
The computer science behind a modern disributed data storeThe computer science behind a modern disributed data store
The computer science behind a modern disributed data store
 
Rocking the microservice world with Helidon-LAOUCTour2023.pdf
Rocking the microservice world with Helidon-LAOUCTour2023.pdfRocking the microservice world with Helidon-LAOUCTour2023.pdf
Rocking the microservice world with Helidon-LAOUCTour2023.pdf
 
10 Ways To Improve Your Code
10 Ways To Improve Your Code10 Ways To Improve Your Code
10 Ways To Improve Your Code
 
Cyber Security Workshop Presentation.pptx
Cyber Security Workshop Presentation.pptxCyber Security Workshop Presentation.pptx
Cyber Security Workshop Presentation.pptx
 
Intro elasticsearch taswarbhatti
Intro elasticsearch taswarbhattiIntro elasticsearch taswarbhatti
Intro elasticsearch taswarbhatti
 
2. introduction
2. introduction2. introduction
2. introduction
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)
 
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
 
Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)
 
Machine learning in cybersecutiry
Machine learning in cybersecutiryMachine learning in cybersecutiry
Machine learning in cybersecutiry
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
 
ICIC 2013 New Product Introductions CEPT
ICIC 2013 New Product Introductions CEPTICIC 2013 New Product Introductions CEPT
ICIC 2013 New Product Introductions CEPT
 
The Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseThe Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed Database
 
Deploying New DNSSEC Algorithms (IEPG@IETF93 - July 2015)
Deploying New DNSSEC Algorithms (IEPG@IETF93 - July 2015)Deploying New DNSSEC Algorithms (IEPG@IETF93 - July 2015)
Deploying New DNSSEC Algorithms (IEPG@IETF93 - July 2015)
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Handout: 'Open Source Tools & Resources'
Handout: 'Open Source Tools & Resources'Handout: 'Open Source Tools & Resources'
Handout: 'Open Source Tools & Resources'
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 

Recently uploaded (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 

Lessons from Indic OCR Development

  • 1. National Conference on Free Software Nishad T R NIT, Calicut http://www.himili.com/ocr/ Lessons from Indic OCR Development
  • 2. 2 Overview History and Evolution of OCR When, Where, Why and How of OCR Selection of an OCR Engine and other gears Putting it all together, and why Tesseractarchitectural style Challenges in Indic OCR Lessons learned and applied Where is it NOW?
  • 3. OCR in General Engine Training Data Input Tools Output formatting tools 15-Nov-08 3
  • 4. 15-Nov-08 4 Three competents Ocrad  Ocrad is the GNU OCR program. It was written by Antonio Diaz Diaz and is licensed under GPL. GOCR GOCR is an OCR program written by Joerg Schulenburg and others. It is licensed under GPL. Tesseract Under the sponsorship of Google, Tesseract was made open source in 2006.
  • 5. And how they performed
  • 6. Again how they performed
  • 7. And the winner is …. Tesseract gives extremely good output at a reasonable speed. It is the clear overall winner of the test. The only caveat is that one absolutely must convert the input to bitonal. Ocrad gives reasonable output at extremely high speed. It can be useful in applications where speed is more important than accuracy. GOCR gives poor output at a slow speed.
  • 8. 15-Nov-08 8 Development Process Evolution Fostering Contributions developer focus and avoiding starvation code, code review, documentation, support Recognizing Ego trust and good intentions beware of maniacal focus Limits of volunteerism eight knives and an apple (dining developer problem) eight knives and a pumpkin eight pumpkins and no knives
  • 9. How Debayan tamed Matra http://debayanin.googlepages.com/hackingtesseract
  • 10. And how they performed To train for another language, you have to create 8 data files in the tessdata subdirectory. Language codes follow the ISO 639-3 standard tessdata/xxx.freq-dawg tessdata/xxx.word-dawg tessdata/xxx.user-words tessdata/xxx.inttemp tessdata/xxx.normproto tessdata/xxx.pffmtable tessdata/xxx.unicharset tessdata/xxx.DangAmbigs
  • 11. The BOX File concept Command tesseract fontfile.tif fontfilebatch.nochopmakebox Sample Box അ 8 682 53 703 ആ 62 676 112 703 ഇ 121 676 155 705 ഈ 165 677 220 705 ഉ 232 677 256 704 ഊ 265 677 313 705 15-Nov-08 11
  • 12. In Kindergarten 15-Nov-08 12
  • 13. His Teacher JTesseract is the Tesseract GUI responsible for easing the training process. JTesseract is released under Apache 2.0 license. JTesseractcurrently works only on Windows platform. Developed by RuwanJanapriyaEgodaGamagehttp://www.janapriya.net Features Visual box file editing Project based training process 13
  • 14. His Classmates nopapaper 15-Nov-08 14
  • 15. LibTIFF This software provides support for the Tag Image File Format (TIFF), a widely used format for storing image data. The latest version of the TIFF specification is available on-line in several different formats, as are a number of Technical Notes (TTN's). 15-Nov-08 15
  • 17. 15-Nov-08 17 Questions? Places to see: Front Door http://code.google.com/p/tesseract-ocr jtesseracthttp://code.google.com/p/jtesseract/ FreeOCRhttp://www.freeocr.net http://www.himili.com/ocr