3 Training Tesseract
An Introduction to the Training Process
Ray Smith, Google Inc.
Tesseract Tutorial: DAS 2014, Tours, France
The Big Picture
[Figure: pipeline diagram. Labels: Web Crawl Repository; Language ID (Map-Reduce); Dirty Language Corpora; Text Filtration; Cleaned Language Corpora; Language Model Generation; Realistic Text Rendering; OCR Engine Training; OCR Shape Files; Language Model Files; Manually generated Files. Per-language data is labeled "Eng".]
The Open Source Parts
[Figure: the open-source portion of the pipeline. Stages: Language Model Generation; Realistic Text Rendering; OCR Engine Training. Inputs: Bigram list; Word list; Realistic Training text; Punctuation patterns; Manually generated Files. Tools: text2image; unicharset_extractor; set_unicharset_properties; mftraining; cntraining; shapetraining; wordlist2dawg; combine_tessdata. Intermediate files: OCR Shape Files; Language Model Files. Output: <lang>.traineddata. Per-language data is labeled "Eng".]
Training Fundamentals
● Character samples must be segregated by font
=> Trained on synthetic (rendered, distorted) data
● Few samples required (4-10 of each combination is good. 1 is OK)
● Not many fonts required. (32 used for Latin)
● Not many fonts allowed. (MAX_NUM_CONFIGS=64: Long story.)
● Number of different “characters” now limited only by memory.
What Data Needs to be Created by Training? (1)
Name             Type    Status     Creator               Description
config           Text    Optional   Manual                Lang-specific engine settings if needed
unicharset       Text    Mandatory  unicharset_extractor  The set of recognizable units
unicharambigs    Text    Optional   Manual*               Intrinsic ambiguities for the language
inttemp          Binary  Mandatory  mftraining            Classifier shape data
pffmtable        Text    Mandatory  mftraining            Extra classifier data (num expected features)
normproto        Text    Mandatory  cntraining            Classifier baseline position info
cube-unicharset  Text    Optional   unicharset_extractor  Cube’s set of recognizable units
shapetable       Binary  Optional   shapetraining         Indirection between classifier and unicharset
params-model     Text    Optional   Google tool           Alternative method for combining LM & classifier
Inside tesstrain.sh
Input                                                     Program                    Output
Realistic Training Text                                   text2image                 *.tif, *.box
*.box                                                     unicharset_extractor       unicharset
unicharset, <script>.unicharset                           set_unicharset_properties  unicharset
Word List                                                 wordlist2dawg              word-dawg
Frequent Word List                                        wordlist2dawg              freq-dawg
*.tif, *.box                                              tesseract                  *.tr
*.tr                                                      cntraining                 normproto
*.tr                                                      mftraining                 inttemp, pffmtable
unicharset, dawgs, normproto, inttemp, pffmtable, config  combine_tessdata           traineddata
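
The table above can be read as a dependency graph: each program can run once its inputs exist. The Python sketch below transcribes the rows and derives one valid execution order. It is only an illustration and does not invoke any Tesseract tool. The names STAGES, SOURCES, and schedule are made up for this example, "dawgs" is expanded to the two dawgs the table lists, and the output of unicharset_extractor is renamed "initial unicharset" (an assumption) so that set_unicharset_properties does not appear to consume its own output.

```python
# Rows of the table as (program, inputs, outputs); deliberately listed out of order.
STAGES = [
    ("combine_tessdata",
     {"unicharset", "word-dawg", "freq-dawg", "normproto",
      "inttemp", "pffmtable", "config"},
     {"traineddata"}),
    ("mftraining", {"*.tr"}, {"inttemp", "pffmtable"}),
    ("cntraining", {"*.tr"}, {"normproto"}),
    ("tesseract", {"*.tif", "*.box"}, {"*.tr"}),
    ("wordlist2dawg", {"frequent word list"}, {"freq-dawg"}),
    ("wordlist2dawg", {"word list"}, {"word-dawg"}),
    # "initial unicharset" is a renaming assumed for this sketch.
    ("set_unicharset_properties",
     {"initial unicharset", "<script>.unicharset"},
     {"unicharset"}),
    ("unicharset_extractor", {"*.box"}, {"initial unicharset"}),
    ("text2image", {"training text"}, {"*.tif", "*.box"}),
]

# Files that exist before any program runs.
SOURCES = {"training text", "word list", "frequent word list",
           "<script>.unicharset", "config"}


def schedule(stages, sources):
    """Return program names in an order where every input already exists."""
    available = set(sources)
    pending = list(stages)
    order = []
    while pending:
        # A stage is ready when all of its inputs are available.
        ready = [s for s in pending if s[1] <= available]
        if not ready:
            missing = ", ".join(s[0] for s in pending)
            raise ValueError("unsatisfied inputs for: " + missing)
        for stage in ready:
            order.append(stage[0])
            available |= stage[2]
            pending.remove(stage)
    return order


order = schedule(STAGES, SOURCES)
for step, program in enumerate(order, start=1):
    print(step, program)
```

Any order that respects the inputs is acceptable; the sketch simply runs every ready stage in each pass, so combine_tessdata always comes last.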
Overview of Tesseract Training Process
[Figure: flow diagram of the training process. Labels: Training Text (real [frequent] words covering whole character set); Fonts; Rendering + Image Degradation; Training Images + Box Files; Tesseract box.train; Extracted Features (.tr files), all from correctly segmented characters; Unicharset Extractor; Initial Unicharset; Universal Unicharset Properties; Script Unicharsets; Set Unicharset Properties; cn Training; mf Training; inttemp, pffmtable, normproto; Text Corpus; Training Data; Text Filtration; Filtered Corpus; Text To Wordlist; Words Sorted by Frequency; Wordlist 2 DAWG; punc, word, number, freq, and bigram dawgs; Ambigs Training; Unichar Ambigs; Combine Tessdata; Traineddata. Part of the diagram is marked "NOT open source".]
Language-Specific Data
● Training text: defines the character set.
● Wordlists: define the language model (including bigrams when present).
● Pango layout: defines the grapheme clusters (recognition units).
Eg: 0xca6 + 0xccd + 0xca6 + 0xcc7 -> one cluster [glyph image]
● ICU: determines what is right-to-left.
● Script.unicharset: Stores typical font metrics for unicode chars.
● Config files: in training/langdata/<lang>/<lang>.config.
● Vertical rendering: Determined manually.
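
To make the grapheme-cluster bullet concrete, the Python sketch below names the four code points from the example and groups them into one recognition unit. The grouping rule used here (attach combining marks, and attach whatever follows a virama) is a simplifying assumption for illustration only; it is not Pango's layout algorithm, and the helper name clusters is made up.

```python
import unicodedata

VIRAMA_CLASS = 9  # canonical combining class carried by virama signs


def clusters(text):
    """Split text into approximate grapheme clusters (simplified rule)."""
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = (bool(out)
                        and unicodedata.combining(out[-1][-1]) == VIRAMA_CLASS)
        if out and (is_mark or after_virama):
            out[-1] += ch    # extend the current cluster
        else:
            out.append(ch)   # start a new cluster
    return out


# The code points from the example: 0xca6 + 0xccd + 0xca6 + 0xcc7.
sample = "\u0ca6\u0ccd\u0ca6\u0cc7"
for ch in sample:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(len(clusters(sample)), "cluster(s)")
```

Under this rule the four code points form a single cluster, which is why the character set seen by training can be much larger than the set of individual Unicode code points.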
MFtraining
Input: font.tr files (Stored Features with utf-8 labels)
Output: inttemp (knn classifier data)
pffmtable (helper information for class pruner)
Operation:
1. Independently cluster features in each font/char class combination.
2. Combine similar cluster means across fonts (single char class).
3. Define each font/char class as a combination of cluster means (a font config).
4. Build class pruner and main knn classifier.
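
Steps 1 to 3 above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the mftraining implementation: features are single numbers instead of real feature vectors, the clustering is a simple greedy threshold rule, a merged proto keeps the first mean it saw instead of averaging, and the names greedy_cluster, train, and radius are invented for the example. Step 4 (building the class pruner and classifier) is not covered.

```python
from collections import defaultdict


def greedy_cluster(points, radius):
    """Toy clustering: join the first cluster whose mean is within radius."""
    groups = []
    for p in points:
        for members in groups:
            if abs(p - sum(members) / len(members)) <= radius:
                members.append(p)
                break
        else:
            groups.append([p])
    return [sum(m) / len(m) for m in groups]


def train(samples, radius=0.5):
    """samples maps (font, char) to feature values; returns (protos, configs)."""
    # Step 1: cluster independently for each font/char combination.
    per_font = {key: greedy_cluster(vals, radius)
                for key, vals in samples.items()}

    protos = defaultdict(list)  # char -> proto means shared across fonts
    configs = {}                # (font, char) -> indices into protos[char]
    for (font, char), means in sorted(per_font.items()):
        used = []
        for m in means:
            # Step 2: reuse an existing proto of this char class if it is close.
            for i, p in enumerate(protos[char]):
                if abs(m - p) <= radius:
                    used.append(i)
                    break
            else:
                protos[char].append(m)
                used.append(len(protos[char]) - 1)
        # Step 3: the font config is the set of shared protos this font uses.
        configs[(font, char)] = sorted(set(used))
    return dict(protos), configs


samples = {
    ("Arial", "A"): [1.0, 1.1, 5.0],
    ("Times", "A"): [1.2, 9.0],
}
protos, configs = train(samples)
print(len(protos["A"]), "shared protos for 'A'")
print(configs)
```

In this toy run both fonts share one proto for 'A' and each keeps one of its own, so the two font configs overlap but are not identical.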
Clustering Result
[Figures: Protos of Arial ‘A’; Protos of Times Italic ‘A’]
CNtraining
Input: font.tr files (Stored Features with utf-8 labels)
Output: normproto (GMM means of the CN feature)
Operation:
1. Independently cluster the CN feature of all fonts for each char class.
2. Write cluster means (a Gaussian Mixture Model) to normproto file.
Shapetraining
Input: font.tr files (Stored Features with utf-8 labels)
Output: shapetable (Mapping from an index to a collection of unichar-ids, fonts)
Operation:
1. Cluster across all fonts and all char classes.
2. Merge ambiguous classes into shapes.
3. Works OK for Indic, but not so good for others.
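
The shapetable output described above, a mapping from an index to a collection of (unichar, font) pairs, can be pictured with the toy Python sketch below. It assumes each font/char class is summarised by a single number and merges classes whose numbers lie within a threshold; the real shapetraining tool clusters real feature data, and the names build_shape_table, protos, and threshold are hypothetical.

```python
def build_shape_table(protos, threshold):
    """Group (unichar, font) classes with similar prototypes into shapes.

    protos maps (unichar, font) to one number (a stand-in for real features).
    Returns a list in which position i holds the members of shape i.
    """
    shapes = []  # each entry: {"rep": value, "members": set of (unichar, font)}
    for key, value in sorted(protos.items()):
        for shape in shapes:
            # Ambiguous classes (close prototypes) share one shape.
            if abs(shape["rep"] - value) <= threshold:
                shape["members"].add(key)
                break
        else:
            shapes.append({"rep": value, "members": {key}})
    return [s["members"] for s in shapes]


# Made-up numbers: 'l', 'I', and '1' look alike, 'A' does not.
protos = {
    ("l", "Arial"): 1.0,
    ("I", "Arial"): 1.05,
    ("1", "Times"): 1.1,
    ("A", "Arial"): 7.0,
}
shape_table = build_shape_table(protos, threshold=0.2)
for index, members in enumerate(shape_table):
    print(index, sorted(members))
```

The classifier can then report a shape index, and the shape table supplies the list of unichars and fonts that the index may stand for.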
Overview of Tesseract Training Process with Shapes
[Figure: flow diagram of the training process with shape clustering. Labels: Training Text (real [frequent] words covering whole character set); Fonts; Rendering + Image Degradation; Training Images + Box Files; Tesseract box.train; Shape Clustering; Unicharset Extractor; Initial Unicharset; Universal Unicharset Properties; Script Unicharsets; Set Unicharset Properties; Chopped Fragments; Natural Fragments; Correctly Segmented; Naturally touching; Extracted Features (.tr files); Validation Set (Correctly Segmented); Junk; Master Trainer File; Shape Unicharset; Shape Table; cn Training; mf Training; inttemp, pffmtable, normproto; Text Corpus; Training Data; Text filtration; Filtered Corpus; Text To Wordlist; Words Sorted by Frequency; Wordlist 2 DAWG; punc, word, number, freq, and bigram dawgs; Combine Tessdata; Traineddata. Part of the diagram is marked "NOT open source".]
DAWGs (Directed Acyclic Word Graph)
From “The world’s fastest Scrabble program”, A. W. Appel and G. J. Jacobson, CACM 31(5), May 1988, pp. 572-585:
“The lexicon represented as a raw word list takes about 780 Kbytes, while our dawg can be represented in 175 Kbytes. The relatively small size of this data structure allows us to keep it entirely in core, even on a fairly modest computer.”
[Figures: Trie; Dawg]
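
The size difference between a trie and a DAWG comes from sharing common suffixes as well as common prefixes. The Python sketch below builds a trie for a tiny word list and counts how many nodes remain once identical subtrees are merged. It is a minimal illustration with made-up helper names (build_trie, trie_size, dawg_size), not the wordlist2dawg implementation or its file format.

```python
def build_trie(words):
    """Build a trie; each node is {"end": bool, "next": {char: node}}."""
    root = {"end": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"end": False, "next": {}})
        node["end"] = True
    return root


def trie_size(node):
    """Count the nodes of the trie, including the root."""
    return 1 + sum(trie_size(child) for child in node["next"].values())


def dawg_size(root):
    """Count nodes after merging identical subtrees (shared suffixes)."""
    ids = {}  # subtree signature -> id of the merged node

    def canon(node):
        # Two nodes are equivalent if they agree on end-of-word status
        # and lead to equivalent children on the same characters.
        signature = (
            node["end"],
            tuple(sorted((ch, canon(child))
                         for ch, child in node["next"].items())),
        )
        return ids.setdefault(signature, len(ids))

    canon(root)
    return len(ids)


words = ["tap", "taps", "top", "tops"]
trie = build_trie(words)
print("trie nodes:", trie_size(trie))
print("dawg nodes:", dawg_size(trie))
```

For this word list the trie has 8 nodes and the merged graph has 5, because "ap(s)" and "op(s)" end in the same suffix structure.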
What Data Needs to be Created by Training? (2)
Name                Type    Status      Creator        Description
punc-dawg           Binary  Optional    wordlist2dawg  Patterns of punctuation around words
word-dawg           Binary  Optional    wordlist2dawg  Main word-list/dictionary language model
number-dawg         Binary  Optional    wordlist2dawg  Acceptable number patterns (with units?)
freq-dawg           Binary  Optional    wordlist2dawg  Shorter dictionary of frequent words
fixed-length-dawgs  Binary  Deprecated  wordlist2dawg  Was used for CJK
cube-word-dawg      Binary  Optional    wordlist2dawg  Main word-list/dictionary language model for cube
bigram-dawg         Binary  Optional    wordlist2dawg  Word bigram language model
unambig-dawg        Binary  Optional    wordlist2dawg  List of unambiguous words (not used?)
Thanks for Listening!
Questions?
