Tesseract Tutorial: DAS 2014 Tours France
6. Character Segmentation, Language Models and Beam Search
The heart of Tesseract
Ray Smith, Google Inc.
Approaches to Segmentation
● Segment first using only geometry.
● Maximally chop, then combine with a beam search. (Over-segmentation.)
● Sliding window to "avoid" segmentation altogether.
● Tesseract: Chop only as needed, then combine as needed.
Over-Segmentation
● The aim is to maximize the recall of chops, at the cost of reduced precision.
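To make the recall/precision trade-off concrete, here is a minimal sketch that scores a set of proposed chop positions against the true character boundaries. The function name and the use of x-coordinates as chop identifiers are assumptions for illustration, not part of Tesseract.

```python
def chop_precision_recall(proposed, true_cuts):
    """Score proposed chop positions against the true character boundaries.

    Both arguments are iterables of x-coordinates (a hypothetical
    representation chosen for this sketch).
    """
    proposed, true_cuts = set(proposed), set(true_cuts)
    hits = len(proposed & true_cuts)
    precision = hits / len(proposed) if proposed else 1.0
    recall = hits / len(true_cuts) if true_cuts else 1.0
    return precision, recall


# Over-segmentation: every true cut is found (recall 1.0), but half of the
# proposed chops are spurious (precision 0.5).
print(chop_precision_recall([10, 20, 30, 40], [20, 40]))
```

The spurious chops are what the later combining step has to undo.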
Segmentation Graph
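● Segmentation possibilities and classifier results form a directed graph.
[Figure: segmentation graph. Each edge is labelled with the classifier's candidate characters for one blob or group of merged blobs, e.g. "l } I 1", "i l 1", "h b", "u n", "g 9 q", "m", "%".]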
Searching the Segmentation Graph
[Figure: the same segmentation graph, with the path that spells "higher" highlighted.]
Searching the Segmentation Graph
[Figure: two copies of the segmentation graph, one with the path "higher" highlighted and one with the path "}uglier" highlighted.]
Integration of Language Models (General Methods)
● Implement Language Model as Finite State Machine.
● Search Language Model and Segmentation Graph in parallel.
● Combine “probabilities” in some sensible way.
● Hidden Markov Model methods are a good example.
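The general method above can be sketched as follows: the dictionary is a trie (a finite state machine), and a beam search walks the trie and the segmentation graph in parallel, adding classifier costs and a language-model penalty. This is an illustrative sketch, not Tesseract's code; the graph encoding, the additive cost combination, and the `non_dict_penalty` value are all assumptions.

```python
from collections import defaultdict

END = ""  # trie key marking end-of-word (never collides with a character)


def build_trie(words):
    """Dictionary as a finite state machine: each trie node is a state."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def beam_search(edges, n_nodes, trie, beam_width=3, non_dict_penalty=5.0):
    """Search the graph and the language model in parallel.

    edges[i] is a list of (j, [(char, cost), ...]): the blobs between graph
    nodes i and j, classified as one character. Lower cost is better.
    Returns (text, cost) for the cheapest path from node 0 to node n_nodes.
    """
    # A hypothesis is (cost, text, trie_state); state is None once the text
    # has left the dictionary.
    beams = defaultdict(list)
    beams[0] = [(0.0, "", trie)]
    for i in range(n_nodes):
        beams[i] = sorted(beams[i], key=lambda h: h[0])[:beam_width]  # prune
        for cost, text, state in beams[i]:
            for j, choices in edges.get(i, []):
                for ch, c in choices:
                    nxt = state.get(ch) if state is not None else None
                    penalty = 0.0 if nxt is not None else non_dict_penalty
                    beams[j].append((cost + c + penalty, text + ch, nxt))

    def final_cost(h):
        cost, _, state = h
        incomplete = state is not None and END not in state  # only a prefix
        return cost + (non_dict_penalty if incomplete else 0.0)

    if not beams[n_nodes]:
        return None
    best = min(beams[n_nodes], key=final_cost)
    return best[1], final_cost(best)


# Four blobs; the last two can be read as "r" + "n" or merged into "m".
edges = {
    0: [(1, [("c", 1.0)])],
    1: [(2, [("o", 1.0)])],
    2: [(3, [("r", 1.5)]), (4, [("m", 2.0)])],
    3: [(4, [("n", 1.5)])],
}
print(beam_search(edges, 4, build_trie(["corn"])))
print(beam_search(edges, 4, build_trie(["com"])))
```

With the same classifier costs, the dictionary alone decides between the "rn" and "m" readings, which is the point of searching both structures together.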
Segmentation Free = Extreme Over-Segmentation
● Slide over the word/textline with a classifier/HMM.
● Beam search + shape model probs + language model probs solves
the segmentation internally.
● Really just mega-over-segmentation.
Tesseract Segmentation Approach based on observations:
● Initial segmentation is often correct or close.
● Classifier generally doesn’t like incorrectly segmented text.
● Over-segmentation often leads to poor results, e.g. m -> iii.
Tesseract Segmentation Approach
Classify Initial Segmentation
Search Word: OK? Yes => Done
while any Bad Blob has any Chops available
    Chop and classify pieces of Worst Choppable Blob
    Search Word: OK? Yes => Done
while any fixable “Pain point”
    Associate adjacent blobs and classify
    Search Word: OK? Yes => Done
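The control flow above can be sketched as a small driver function. Every callable passed in is a hypothetical hook standing in for Tesseract's classifier, chopper and associator; the toy hooks below operate on strings purely so the sketch can run.

```python
def segment_word(blobs, is_ok, chop_worst_blob, next_pain_point, associate):
    """Chop only as needed, then combine as needed.

    is_ok(blobs)            -> True when the word search accepts the result
    chop_worst_blob(blobs)  -> new blob list, or None if no chops remain
    next_pain_point(blobs)  -> a pain point, or None if none is fixable
    associate(blobs, point) -> new blob list with adjacent blobs joined
    """
    if is_ok(blobs):
        return blobs, "initial"
    while True:  # phase 1: chop
        chopped = chop_worst_blob(blobs)
        if chopped is None:
            break
        blobs = chopped
        if is_ok(blobs):
            return blobs, "chopped"
    while True:  # phase 2: associate
        point = next_pain_point(blobs)
        if point is None:
            break
        blobs = associate(blobs, point)
        if is_ok(blobs):
            return blobs, "associated"
    return blobs, "best effort"


# Toy hooks (hypothetical): blobs are strings, a chop splits the first
# multi-character blob in two, a pain point is the index of a pair to join.
def toy_chop(blobs):
    for i, b in enumerate(blobs):
        if len(b) > 1:
            mid = len(b) // 2
            return blobs[:i] + [b[:mid], b[mid:]] + blobs[i + 1:]
    return None


def toy_pain_point(blobs):
    return 0 if len(blobs) > 1 else None


def toy_associate(blobs, i):
    return blobs[:i] + [blobs[i] + blobs[i + 1]] + blobs[i + 2:]


print(segment_word(["ab"], lambda b: b == ["a", "b"],
                   toy_chop, toy_pain_point, toy_associate))
print(segment_word(["r", "n"], lambda b: b == ["rn"],
                   toy_chop, toy_pain_point, toy_associate))
```

Each phase exits as soon as the word search accepts the result, so a word whose initial segmentation is already good costs no extra classification.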
Types of Pain Point
● Initial: Join each adjacent pair
● Ambiguity: e.g. m/rn
● Path: Neighbors of blobs in the current best path
Ratings MATRIX = Segmentation Graph
[Figure: the segmentation graph alongside the equivalent ratings matrix. Each matrix cell lists the classifier choices for one blob or run of merged blobs, e.g. "hb", "il1", "un", "nu", "g9q", "m", "%".]
Each entry holds a BLOB_CHOICE_LIST providing classifier choices with rating and certainty.
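A minimal sketch of the idea that a ratings matrix encodes the segmentation graph: a cell keyed by (first blob, last blob) holds the classifier choices for that run of blobs, and a path through the graph is a tiling of the word by cells. The class and field names are illustrative, not Tesseract's C++ types, and treating a lower rating as better is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class BlobChoice:
    unichar: str
    rating: float      # assumed here: lower is better
    certainty: float   # carried along but unused in this sketch


class RatingsMatrix:
    def __init__(self, n_blobs):
        self.n = n_blobs
        self.cells = {}  # (first_blob, last_blob) -> choices, best first

    def put(self, first, last, choices):
        self.cells[(first, last)] = sorted(choices, key=lambda c: c.rating)

    def best_path(self):
        """Cheapest tiling of blobs 0..n-1 using the top choice per cell."""
        best = {0: (0.0, "")}  # best[i] covers blobs 0..i-1
        for end in range(1, self.n + 1):
            for start in range(end):
                cell = self.cells.get((start, end - 1))
                if cell and start in best:
                    cost = best[start][0] + cell[0].rating
                    if end not in best or cost < best[end][0]:
                        best[end] = (cost, best[start][1] + cell[0].unichar)
        return best.get(self.n)


m = RatingsMatrix(2)
m.put(0, 0, [BlobChoice("r", 3.0, -1.0)])
m.put(1, 1, [BlobChoice("n", 3.0, -1.0)])
m.put(0, 1, [BlobChoice("m", 4.0, -0.5)])  # the two blobs joined
print(m.best_path())
```

Cells that have not been classified yet are simply absent, which matches the "classify only as needed" strategy: the search can run on whatever has been filled in so far.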
Live Demo of Segmentation Search
Chopper demo
api/tesseract unlv/mag.3B/2/8022_028.3B.tif test1 inter segdemo chopdebug3
Col 2, line 6, word 1
Associator demo
api/tesseract unlv/doe3.3B/4/2214_007.3B.tif test1 inter segdemo
Col 2, line 8, word 2
Evaluation of a WORD_CHOICE (no params-model)
Word Rating = word_factor × ∑ blob_choice->rating()
word_factor = segmentation
Condition            Base word_factor   Add-ons
Frequent dawg word   1.0                Inconsistent case +0.1
Other dawg word      1.1                Inconsistent case +0.1
Non-dawg word        1.25               +0.01 for each char over 3
                                        Inconsistent case +0.1
                                        Inconsistent punc +0.2
                                        Inconsistent chartype +0.3
                                        Inconsistent script +0.5
                                        Inconsistent char spacing +0.01
                                        All except script: +0.01 for each additional occurrence
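The arithmetic in this table can be sketched as below. This is one reading of the slide, not Tesseract's code: it assumes the longer list of add-ons applies only to non-dawg words, that word length equals the number of blob ratings, and that the "+0.01 for each additional occurrence" rule applies to every inconsistency type except script. The dictionary keys are made-up labels.

```python
BASE_FACTOR = {"frequent_dawg": 1.0, "other_dawg": 1.1, "non_dawg": 1.25}
PENALTY = {"case": 0.1, "punc": 0.2, "chartype": 0.3,
           "script": 0.5, "spacing": 0.01}


def word_factor(kind, length, inconsistencies):
    """kind: key of BASE_FACTOR; inconsistencies: {type: occurrence count}."""
    factor = BASE_FACTOR[kind]
    if kind != "non_dawg":
        # The table lists only inconsistent case for dictionary words.
        if inconsistencies.get("case", 0) > 0:
            factor += PENALTY["case"]
        return factor
    factor += 0.01 * max(0, length - 3)       # each char over 3
    for name, count in inconsistencies.items():
        if count <= 0:
            continue
        factor += PENALTY[name]               # first occurrence
        if name != "script":
            factor += 0.01 * (count - 1)      # each additional occurrence
    return factor


def word_rating(kind, blob_ratings, inconsistencies=None):
    """Word Rating = word_factor * sum of the blob choice ratings."""
    factor = word_factor(kind, len(blob_ratings), inconsistencies or {})
    return factor * sum(blob_ratings)


print(word_rating("frequent_dawg", [10.0, 20.0]))
print(word_rating("non_dawg", [10.0] * 5, {"case": 2}))
```

Because the factor multiplies the summed ratings, a non-dictionary word has to be recognized with noticeably better blob ratings than a dictionary word to win the comparison.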
Evaluation of a WORD_CHOICE (with params-model)
Word Rating = word_factor × ∑ outline length
word_factor = weighted sum of word features:
● mean blob rating
● num inconsistent spaces
● num inconsistent char type
● num x-height inconsistencies
● num case inconsistencies
● word length (in type categories)
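As a rough sketch of the formula above: with a params-model, the word factor is a weighted sum of word features, multiplied by the total outline length. The feature names and weight values below are placeholders invented for illustration; the slide does not give the trained weights, and the categorical word-length feature is left out.

```python
def params_model_word_rating(features, weights, outline_lengths):
    """Word Rating = (weighted sum of word features) * sum of outline lengths.

    features and weights are dicts keyed by feature name (hypothetical
    names); every feature must have a weight.
    """
    missing = set(features) - set(weights)
    if missing:
        raise KeyError(f"no weight for features: {sorted(missing)}")
    word_factor = sum(weights[name] * value for name, value in features.items())
    return word_factor * sum(outline_lengths)


# Placeholder weights, not trained values.
weights = {"mean_blob_rating": 1.0, "num_case_inconsistencies": 0.5}
features = {"mean_blob_rating": 2.0, "num_case_inconsistencies": 1}
print(params_model_word_rating(features, weights, [10.0, 30.0]))
```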
Tesseract Word Recognizer
Example of Chopping (unlv/mag.3B/2/8022_028.3B.tif, col 2, line 6, word 1)
Word       Distance   Worst blob
Momm       212.2      7.7
Mommn      186.3      8.3
Momtfln    178.0      9.2
Momtain    124.9      5.3
Mounm      184.0      7.7
Mountain    80.6      3.1   ACCEPT!
Example of Combining (unlv/doe3.3B/4/2214_007.3B.tif, col 2, line 8, word 2)
Word       Distance
lilllit-    77.58
limit       57.1
Emit        89.7
Unfit       95.4
Hulk       122.7
Bulk       136.8

Word       Distance
HUM        120.6
MUM        127.0
BUM        134.7
1mm        112.8
Milk       147.2
huh        137.1
fink       140.7
Emu        129.7
BMW        140.3
Thanks for Listening!
Questions?
