SlideShare a Scribd company logo
Tesseract OCR Engine
Contents 
• Introduction & history of OCR 
• Tesseract architecture & methods 
• Announcing Tesseract 2.00 
• Training Tesseract 
• Future enhancements
A Brief History of OCR 
• What is Optical Character Recognition? 
• A Brief History of OCR 
• OCR predates electronic computers!
A Brief History of OCR 
• 1929 – Digit recognition machine 
• 1953 – Alphanumeric recognition machine 
• 1965 – US Mail sorting 
• 1965 – British banking system 
• 1976 – Kurzweil reading machine 
• 1985 – Hardware-assisted PC software 
• 1988 – Software-only PC software 
• 1994-2000 – Industry consolidation
Tesseract Background 
• Developed on HP-UX at HP between 1985 
• and 1994 to run in a desktop scanner. 
• Came neck and neck with Caere and XIS 
• in the 1995 UNLV test. 
• (See http://www.isri.unlv.edu/downloads/AT-1995.pdf ) 
• Never used in an HP product. 
• Open sourced in 2005. Now on: 
• http://code.google.com/p/tesseract-ocr 
• Highly portable.
Tesseract OCR Architecture 
• Baselines are rarely perfectly straight 
• Text Line Finding – skew independent – 
• published at ICDAR’95 Montreal. 
• (http://scholar.google.com/scholar?q=skew+detection+smith) 
• Baselines are approximated by quadratic splines 
• to account for skew and curl. 
• Meanline, ascender and descender lines are a 
• constant displacement from baseline. 
• Critical value is the x-height.
Spaces between words are tricky too 
• Italics, digits, punctuation all create special-case font-dependent 
spacing. 
• • Fully justified text in narrow columns can have vastly varying 
spacing on different lines. 
• Tesseract: Recognize Word 
• Outline Approximation 
• Polygonal approximation is a double-edged sword. 
• Noise and some pertinent information are both lost.
Tesseract: Features and Matching 
• Static classifier uses outline fragments as features. Broken characters 
are easily recognizable by a small->large matching process in classifier. 
(This is slow.) 
• Adaptive classifier uses the same technique! 
• (Apart from normalization method)
Announcing tesseract-2.00 
• Fully Unicode (UTF-8) capable 
• Already trained for 6 Latin-based 
• languages (Eng, Fra, Ita, Deu, Spa, Nld) 
• Code and documented process to train at 
• http://code.google.com/p/tesseract-ocr 
• UNLV regression test framework 
• Other minor fixes
Commercial OCR v Tesseract 
• Page layout analysis. 
• More languages. 
• Improve accuracy. 
• Add a UI.

More Related Content

What's hot

Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
Devashish Shanker
 
Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse ...
Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse ...Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse ...
Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse ...
Databricks
 
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...
HostedbyConfluent
 
NLP PPT.pptx
NLP PPT.pptxNLP PPT.pptx
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
Pranav Gupta
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
Rubén Izquierdo Beviá
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
Alexey Grigorev
 
Google Vertex AI
Google Vertex AIGoogle Vertex AI
Google Vertex AI
VikasBisoi
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
Hemantha Kulathilake
 
Machine Learning & Amazon SageMaker
Machine Learning & Amazon SageMakerMachine Learning & Amazon SageMaker
Machine Learning & Amazon SageMakerAmazon Web Services
 
텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016
텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016
텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016
Taehoon Kim
 
Introduction To Applied Machine Learning
Introduction To Applied Machine LearningIntroduction To Applied Machine Learning
Introduction To Applied Machine Learning
ananth
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measures
ankit_ppt
 
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
Daniel Zivkovic
 
LICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APP
LICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APPLICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APP
LICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APP
Aditya Mishra
 
Pose Graph based SLAM
Pose Graph based SLAMPose Graph based SLAM
Pose Graph based SLAM
EdwardIm1
 
Made to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using ElasticsearchMade to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using Elasticsearch
Daniel Schneiter
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
PyData
 
stackconf 2021 | Weaviate Vector Search Engine – Introduction
stackconf 2021 | Weaviate Vector Search Engine – Introductionstackconf 2021 | Weaviate Vector Search Engine – Introduction
stackconf 2021 | Weaviate Vector Search Engine – Introduction
NETWAYS
 

What's hot (20)

Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse ...
Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse ...Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse ...
Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse ...
 
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...
 
NLP PPT.pptx
NLP PPT.pptxNLP PPT.pptx
NLP PPT.pptx
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
 
Google Vertex AI
Google Vertex AIGoogle Vertex AI
Google Vertex AI
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
 
Machine Learning & Amazon SageMaker
Machine Learning & Amazon SageMakerMachine Learning & Amazon SageMaker
Machine Learning & Amazon SageMaker
 
텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016
텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016
텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016
 
Introduction To Applied Machine Learning
Introduction To Applied Machine LearningIntroduction To Applied Machine Learning
Introduction To Applied Machine Learning
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measures
 
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
 
LICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APP
LICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APPLICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APP
LICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APP
 
ocr
ocrocr
ocr
 
Pose Graph based SLAM
Pose Graph based SLAMPose Graph based SLAM
Pose Graph based SLAM
 
Made to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using ElasticsearchMade to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using Elasticsearch
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
stackconf 2021 | Weaviate Vector Search Engine – Introduction
stackconf 2021 | Weaviate Vector Search Engine – Introductionstackconf 2021 | Weaviate Vector Search Engine – Introduction
stackconf 2021 | Weaviate Vector Search Engine – Introduction
 

Viewers also liked

OCR using Tesseract
OCR using TesseractOCR using Tesseract
OCR using Tesseract
Shobhit Chittora
 
Tamil OCR using Tesseract OCR Engine
Tamil OCR using Tesseract OCR EngineTamil OCR using Tesseract OCR Engine
Tamil OCR using Tesseract OCR Engine
balamurugan.k Kalibalamurugan
 
Tesseract OCR Engine - OpenFest 2009
Tesseract OCR Engine - OpenFest 2009Tesseract OCR Engine - OpenFest 2009
Tesseract OCR Engine - OpenFest 2009
Svetlin Nakov
 
Optical Character Recognition( OCR )
Optical Character Recognition( OCR )Optical Character Recognition( OCR )
Optical Character Recognition( OCR )
Karan Panjwani
 
Text Detection and Recognition
Text Detection and RecognitionText Detection and Recognition
Text Detection and Recognition
Badruz Nasrin Basri
 
As Ict (Ocr) G061 3.1.6 Application Software used for the Presentation & Comm...
As Ict (Ocr) G061 3.1.6 Application Software used for the Presentation & Comm...As Ict (Ocr) G061 3.1.6 Application Software used for the Presentation & Comm...
As Ict (Ocr) G061 3.1.6 Application Software used for the Presentation & Comm...Christos Demetriou
 
基于Python构建可扩展的自动化运维平台
基于Python构建可扩展的自动化运维平台基于Python构建可扩展的自动化运维平台
基于Python构建可扩展的自动化运维平台
liuts
 
Introduction to python for Beginners
Introduction to python for Beginners Introduction to python for Beginners
Introduction to python for Beginners
Sujith Kumar
 
Introduction to Python
Introduction to Python Introduction to Python
Introduction to Python
amiable_indian
 
Scalable OCR with NiFi and Tesseract
Scalable OCR with NiFi and TesseractScalable OCR with NiFi and Tesseract
Scalable OCR with NiFi and Tesseract
DataWorks Summit/Hadoop Summit
 
Optical Character Recognition (OCR)
Optical Character Recognition (OCR)Optical Character Recognition (OCR)
Optical Character Recognition (OCR)
Vidyut Singhania
 
Python 101: Python for Absolute Beginners (PyTexas 2014)
Python 101: Python for Absolute Beginners (PyTexas 2014)Python 101: Python for Absolute Beginners (PyTexas 2014)
Python 101: Python for Absolute Beginners (PyTexas 2014)
Paige Bailey
 
Basics of-optical-character-recognition
Basics of-optical-character-recognitionBasics of-optical-character-recognition
Basics of-optical-character-recognition
document scanning services
 
optical character recognition system
optical character recognition systemoptical character recognition system
optical character recognition systemVijay Apurva
 
Python Presentation
Python PresentationPython Presentation
Python Presentation
Narendra Sisodiya
 
Déposer une thèse dans TEL ou HAL
Déposer une thèse dans TEL ou HALDéposer une thèse dans TEL ou HAL
Déposer une thèse dans TEL ou HAL
OAccsd
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
Nowell Strite
 

Viewers also liked (20)

OCR using Tesseract
OCR using TesseractOCR using Tesseract
OCR using Tesseract
 
Tamil OCR using Tesseract OCR Engine
Tamil OCR using Tesseract OCR EngineTamil OCR using Tesseract OCR Engine
Tamil OCR using Tesseract OCR Engine
 
Tasract OCR
Tasract OCRTasract OCR
Tasract OCR
 
Tesseract OCR Engine - OpenFest 2009
Tesseract OCR Engine - OpenFest 2009Tesseract OCR Engine - OpenFest 2009
Tesseract OCR Engine - OpenFest 2009
 
Optical Character Recognition( OCR )
Optical Character Recognition( OCR )Optical Character Recognition( OCR )
Optical Character Recognition( OCR )
 
Text Detection and Recognition
Text Detection and RecognitionText Detection and Recognition
Text Detection and Recognition
 
As Ict (Ocr) G061 3.1.6 Application Software used for the Presentation & Comm...
As Ict (Ocr) G061 3.1.6 Application Software used for the Presentation & Comm...As Ict (Ocr) G061 3.1.6 Application Software used for the Presentation & Comm...
As Ict (Ocr) G061 3.1.6 Application Software used for the Presentation & Comm...
 
基于Python构建可扩展的自动化运维平台
基于Python构建可扩展的自动化运维平台基于Python构建可扩展的自动化运维平台
基于Python构建可扩展的自动化运维平台
 
Introduction to python for Beginners
Introduction to python for Beginners Introduction to python for Beginners
Introduction to python for Beginners
 
Introduction to Python
Introduction to Python Introduction to Python
Introduction to Python
 
Scalable OCR with NiFi and Tesseract
Scalable OCR with NiFi and TesseractScalable OCR with NiFi and Tesseract
Scalable OCR with NiFi and Tesseract
 
Optical Character Recognition (OCR)
Optical Character Recognition (OCR)Optical Character Recognition (OCR)
Optical Character Recognition (OCR)
 
Python 101: Python for Absolute Beginners (PyTexas 2014)
Python 101: Python for Absolute Beginners (PyTexas 2014)Python 101: Python for Absolute Beginners (PyTexas 2014)
Python 101: Python for Absolute Beginners (PyTexas 2014)
 
Basics of-optical-character-recognition
Basics of-optical-character-recognitionBasics of-optical-character-recognition
Basics of-optical-character-recognition
 
optical character recognition system
optical character recognition systemoptical character recognition system
optical character recognition system
 
Raspberry pi
Raspberry pi Raspberry pi
Raspberry pi
 
Python Presentation
Python PresentationPython Presentation
Python Presentation
 
Déposer une thèse dans TEL ou HAL
Déposer une thèse dans TEL ou HALDéposer une thèse dans TEL ou HAL
Déposer une thèse dans TEL ou HAL
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 
Slideshare ppt
Slideshare pptSlideshare ppt
Slideshare ppt
 

Similar to Tesseract OCR Engine

Teaching Machines to Listen: An Introduction to Automatic Speech Recognition
Teaching Machines to Listen: An Introduction to Automatic Speech RecognitionTeaching Machines to Listen: An Introduction to Automatic Speech Recognition
Teaching Machines to Listen: An Introduction to Automatic Speech Recognition
Zachary S. Brown
 
Utilizing the Pre-trained Model Effectively for Speech Translation
Utilizing the Pre-trained Model Effectively for Speech TranslationUtilizing the Pre-trained Model Effectively for Speech Translation
Utilizing the Pre-trained Model Effectively for Speech Translation
Chen Xu
 
Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL Databases
Emanuel Calvo
 
ReFRESCO-General-Jan2015
ReFRESCO-General-Jan2015ReFRESCO-General-Jan2015
ReFRESCO-General-Jan2015
Guilherme Vaz
 
From legacy, to batch, to near real-time
From legacy, to batch, to near real-timeFrom legacy, to batch, to near real-time
From legacy, to batch, to near real-time
Dani Solà Lagares
 
CBDW2014 - Down the RabbitMQ hole with ColdFusion
CBDW2014 - Down the RabbitMQ hole with ColdFusionCBDW2014 - Down the RabbitMQ hole with ColdFusion
CBDW2014 - Down the RabbitMQ hole with ColdFusion
Ortus Solutions, Corp
 
From legacy, to batch, to near real-time
From legacy, to batch, to near real-timeFrom legacy, to batch, to near real-time
From legacy, to batch, to near real-timeMarc Sturlese
 
Building a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From ScratchBuilding a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From Scratch
Natasha Latysheva
 
Deep Learning Summit (DLS01-4)
Deep Learning Summit (DLS01-4)Deep Learning Summit (DLS01-4)
Deep Learning Summit (DLS01-4)
Amazon Web Services
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
Robert Viseur
 
Getting Deep on Orchestration - Nickoloff - DockerCon16
Getting Deep on Orchestration - Nickoloff - DockerCon16Getting Deep on Orchestration - Nickoloff - DockerCon16
Getting Deep on Orchestration - Nickoloff - DockerCon16
allingeek
 
OCR by Abdullah Ahmed Abu Rtima
OCR by Abdullah Ahmed Abu Rtima OCR by Abdullah Ahmed Abu Rtima
OCR by Abdullah Ahmed Abu Rtima
as af
 
Modern software architectures - PHP UK Conference 2015
Modern software architectures - PHP UK Conference 2015Modern software architectures - PHP UK Conference 2015
Modern software architectures - PHP UK Conference 2015
Ricard Clau
 
2018 12-kube con-ballerinacon
2018 12-kube con-ballerinacon2018 12-kube con-ballerinacon
2018 12-kube con-ballerinacon
Sanjiva Weerawarana
 
Whole Platform LWC11 Submission
Whole Platform LWC11 SubmissionWhole Platform LWC11 Submission
Whole Platform LWC11 Submission
Riccardo Solmi
 
VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012
Eonblast
 
Hunting for anglerfish in datalakes
Hunting for anglerfish in datalakesHunting for anglerfish in datalakes
Hunting for anglerfish in datalakes
Dominic Egger
 

Similar to Tesseract OCR Engine (20)

Teaching Machines to Listen: An Introduction to Automatic Speech Recognition
Teaching Machines to Listen: An Introduction to Automatic Speech RecognitionTeaching Machines to Listen: An Introduction to Automatic Speech Recognition
Teaching Machines to Listen: An Introduction to Automatic Speech Recognition
 
Utilizing the Pre-trained Model Effectively for Speech Translation
Utilizing the Pre-trained Model Effectively for Speech TranslationUtilizing the Pre-trained Model Effectively for Speech Translation
Utilizing the Pre-trained Model Effectively for Speech Translation
 
Enterprise messaging
Enterprise messagingEnterprise messaging
Enterprise messaging
 
Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL Databases
 
ReFRESCO-General-Jan2015
ReFRESCO-General-Jan2015ReFRESCO-General-Jan2015
ReFRESCO-General-Jan2015
 
From legacy, to batch, to near real-time
From legacy, to batch, to near real-timeFrom legacy, to batch, to near real-time
From legacy, to batch, to near real-time
 
CBDW2014 - Down the RabbitMQ hole with ColdFusion
CBDW2014 - Down the RabbitMQ hole with ColdFusionCBDW2014 - Down the RabbitMQ hole with ColdFusion
CBDW2014 - Down the RabbitMQ hole with ColdFusion
 
From legacy, to batch, to near real-time
From legacy, to batch, to near real-timeFrom legacy, to batch, to near real-time
From legacy, to batch, to near real-time
 
Verification with LoLA: 1 Basics
Verification with LoLA: 1 BasicsVerification with LoLA: 1 Basics
Verification with LoLA: 1 Basics
 
Building a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From ScratchBuilding a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From Scratch
 
Deep Learning Summit (DLS01-4)
Deep Learning Summit (DLS01-4)Deep Learning Summit (DLS01-4)
Deep Learning Summit (DLS01-4)
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
 
Getting Deep on Orchestration - Nickoloff - DockerCon16
Getting Deep on Orchestration - Nickoloff - DockerCon16Getting Deep on Orchestration - Nickoloff - DockerCon16
Getting Deep on Orchestration - Nickoloff - DockerCon16
 
OCR by Abdullah Ahmed Abu Rtima
OCR by Abdullah Ahmed Abu Rtima OCR by Abdullah Ahmed Abu Rtima
OCR by Abdullah Ahmed Abu Rtima
 
Modern software architectures - PHP UK Conference 2015
Modern software architectures - PHP UK Conference 2015Modern software architectures - PHP UK Conference 2015
Modern software architectures - PHP UK Conference 2015
 
2018 12-kube con-ballerinacon
2018 12-kube con-ballerinacon2018 12-kube con-ballerinacon
2018 12-kube con-ballerinacon
 
Whole Platform LWC11 Submission
Whole Platform LWC11 SubmissionWhole Platform LWC11 Submission
Whole Platform LWC11 Submission
 
VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012
 
Hunting for anglerfish in datalakes
Hunting for anglerfish in datalakesHunting for anglerfish in datalakes
Hunting for anglerfish in datalakes
 
Raptor codes
Raptor codesRaptor codes
Raptor codes
 

More from Raghu nath

Mongo db
Mongo dbMongo db
Mongo db
Raghu nath
 
Ftp (file transfer protocol)
Ftp (file transfer protocol)Ftp (file transfer protocol)
Ftp (file transfer protocol)
Raghu nath
 
Javascript part1
Javascript part1Javascript part1
Javascript part1Raghu nath
 
Regular expressions
Regular expressionsRegular expressions
Regular expressionsRaghu nath
 
Selection sort
Selection sortSelection sort
Selection sortRaghu nath
 
Binary search
Binary search Binary search
Binary search Raghu nath
 
JSON(JavaScript Object Notation)
JSON(JavaScript Object Notation)JSON(JavaScript Object Notation)
JSON(JavaScript Object Notation)Raghu nath
 
Stemming algorithms
Stemming algorithmsStemming algorithms
Stemming algorithmsRaghu nath
 
Step by step guide to install dhcp role
Step by step guide to install dhcp roleStep by step guide to install dhcp role
Step by step guide to install dhcp roleRaghu nath
 
Network essentials chapter 4
Network essentials  chapter 4Network essentials  chapter 4
Network essentials chapter 4Raghu nath
 
Network essentials chapter 3
Network essentials  chapter 3Network essentials  chapter 3
Network essentials chapter 3Raghu nath
 
Network essentials chapter 2
Network essentials  chapter 2Network essentials  chapter 2
Network essentials chapter 2Raghu nath
 
Network essentials - chapter 1
Network essentials - chapter 1Network essentials - chapter 1
Network essentials - chapter 1Raghu nath
 
Python chapter 2
Python chapter 2Python chapter 2
Python chapter 2Raghu nath
 
python chapter 1
python chapter 1python chapter 1
python chapter 1Raghu nath
 
Linux Shell Scripting
Linux Shell ScriptingLinux Shell Scripting
Linux Shell ScriptingRaghu nath
 

More from Raghu nath (20)

Mongo db
Mongo dbMongo db
Mongo db
 
Ftp (file transfer protocol)
Ftp (file transfer protocol)Ftp (file transfer protocol)
Ftp (file transfer protocol)
 
MS WORD 2013
MS WORD 2013MS WORD 2013
MS WORD 2013
 
Msword
MswordMsword
Msword
 
Ms word
Ms wordMs word
Ms word
 
Javascript part1
Javascript part1Javascript part1
Javascript part1
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Selection sort
Selection sortSelection sort
Selection sort
 
Binary search
Binary search Binary search
Binary search
 
JSON(JavaScript Object Notation)
JSON(JavaScript Object Notation)JSON(JavaScript Object Notation)
JSON(JavaScript Object Notation)
 
Stemming algorithms
Stemming algorithmsStemming algorithms
Stemming algorithms
 
Step by step guide to install dhcp role
Step by step guide to install dhcp roleStep by step guide to install dhcp role
Step by step guide to install dhcp role
 
Network essentials chapter 4
Network essentials  chapter 4Network essentials  chapter 4
Network essentials chapter 4
 
Network essentials chapter 3
Network essentials  chapter 3Network essentials  chapter 3
Network essentials chapter 3
 
Network essentials chapter 2
Network essentials  chapter 2Network essentials  chapter 2
Network essentials chapter 2
 
Network essentials - chapter 1
Network essentials - chapter 1Network essentials - chapter 1
Network essentials - chapter 1
 
Python chapter 2
Python chapter 2Python chapter 2
Python chapter 2
 
python chapter 1
python chapter 1python chapter 1
python chapter 1
 
Linux Shell Scripting
Linux Shell ScriptingLinux Shell Scripting
Linux Shell Scripting
 
Perl
PerlPerl
Perl
 

Recently uploaded

PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
PedroFerreira53928
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
EduSkills OECD
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
Excellence Foundation for South Sudan
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
Atul Kumar Singh
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
Fundacja Rozwoju Społeczeństwa Przedsiębiorczego
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
BhavyaRajput3
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
MIRIAMSALINAS13
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
Celine George
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
Basic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumersBasic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumers
PedroFerreira53928
 

Recently uploaded (20)

PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
Basic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumersBasic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumers
 

Tesseract OCR Engine

  • 2. Contents • Introduction & history of OCR • Tesseract architecture & methods • Announcing Tesseract 2.00 • Training Tesseract • Future enhancements
  • 3. A Brief History of OCR • What is Optical Character Recognition? • A Brief History of OCR • OCR predates electronic computers!
  • 4.
  • 5. A Brief History of OCR • 1929 – Digit recognition machine • 1953 – Alphanumeric recognition machine • 1965 – US Mail sorting • 1965 – British banking system • 1976 – Kurzweil reading machine • 1985 – Hardware-assisted PC software • 1988 – Software-only PC software • 1994-2000 – Industry consolidation
  • 6. Tesseract Background • Developed on HP-UX at HP between 1985 • and 1994 to run in a desktop scanner. • Came neck and neck with Caere and XIS • in the 1995 UNLV test. • (See http://www.isri.unlv.edu/downloads/AT-1995.pdf ) • Never used in an HP product. • Open sourced in 2005. Now on: • http://code.google.com/p/tesseract-ocr • Highly portable.
  • 7. Tesseract OCR Architecture • Baselines are rarely perfectly straight • Text Line Finding – skew independent – • published at ICDAR’95 Montreal. • (http://scholar.google.com/scholar?q=skew+detection+smith) • Baselines are approximated by quadratic splines • to account for skew and curl. • Meanline, ascender and descender lines are a • constant displacement from baseline. • Critical value is the x-height.
  • 8. Spaces between words are tricky too • Italics, digits, punctuation all create special-case font-dependent spacing. • • Fully justified text in narrow columns can have vastly varying spacing on different lines. • Tesseract: Recognize Word • Outline Approximation • Polygonal approximation is a double-edged sword. • Noise and some pertinent information are both lost.
  • 9. Tesseract: Features and Matching • Static classifier uses outline fragments as features. Broken characters are easily recognizable by a small->large matching process in classifier. (This is slow.) • Adaptive classifier uses the same technique! • (Apart from normalization method)
  • 10. Announcing tesseract-2.00 • Fully Unicode (UTF-8) capable • Already trained for 6 Latin-based • languages (Eng, Fra, Ita, Deu, Spa, Nld) • Code and documented process to train at • http://code.google.com/p/tesseract-ocr • UNLV regression test framework • Other minor fixes
  • 11. Commercial OCR v Tesseract • Page layout analysis. • More languages. • Improve accuracy. • Add a UI.