Anshul Gupta | CSE@TAMU 2
What are historical documents?
– Correspondence
– Diaries
– Newspapers
– Government Documents
– Books
Anshul Gupta | CSE@TAMU 3
Digitizing historical documents
• Why?
– Historical records are in analog
form
– Due to their fragility, most of them
are not accessible
– Not searchable
• How to make them accessible?
– Digital text transcription
• Ways of digitization
– Hand transcribe each book
• Resource intensive
– OCR: optical character recognition
• High error rate in text transcription
• Mass digitization projects
Anshul Gupta | CSE@TAMU 4
Early modern OCR project (eMOP)
• Goal
– Improve OCR accuracy for
early modern texts
• 300k documents, 45M
pages
– Open source OCR tools
• Challenges
– Early modern printing
• Irregular fonts
• Decorative page elements
– Document image problems
– Problems get more severe
• Images are binarized
[Figure: example pages with pictures and decorative page elements]
Anshul Gupta | CSE@TAMU 5
Goals
[Diagram: hOCR* and document images → Automatic Quality Assessment (Goal 1) → quality score (good documents / bad documents) and denoised hOCR; denoised hOCR + document images → Active Font Identification (Goal 2) → font metadata (Black font, Roman, Mixed)]
*hOCR output from Tesseract OCR.
Anshul Gupta | CSE@TAMU 6
Why do we want to assess OCR quality?
• Improve runtime
– Focus on documents with good OCR quality
– Send bad-quality documents to a separate diagnostics pipeline
• How to measure OCR quality?
– A number of methods exist
• eMOP uses the Juxta score
• It measures the similarity between the OCR output and the ground-truth text
– But such scores need ground truth
• Not available for all documents
• We need an automated way to assess OCR quality
Anshul Gupta | CSE@TAMU 7
Our approach
• Post-process OCR output
– Page segmentation result such as bounding box (BB) coordinates
– OCR word confidence
• Build ML models to remove noise
– Binary classification: classify each BB either as text or noise
• $\text{Quality}_{\text{OCR}} \propto \dfrac{1}{\%\ \text{noise BBs}}$
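The proportionality above can be made concrete with a small sketch. It assumes labels of 0 for noise and 1 for text; the function names and the `eps` guard are hypothetical, and the constant of proportionality is not given in the slides, so only the ordering of the scores is meaningful.

```python
def noise_fraction(labels):
    """Fraction of bounding boxes labeled as noise (0 = noise, 1 = text)."""
    if not labels:
        raise ValueError("need at least one bounding box")
    return sum(1 for lab in labels if lab == 0) / len(labels)


def quality_proxy(labels, eps=1e-6):
    """Inverse of the noise percentage.

    eps is a hypothetical guard against division by zero for pages
    without any noise BBs; it is not part of the original slides.
    """
    return 1.0 / (100.0 * noise_fraction(labels) + eps)


page_labels = [1, 1, 0, 1, 0, 1, 1, 1]  # 2 noise BBs out of 8
print(noise_fraction(page_labels))      # 0.25
```

A page with fewer noise BBs gets a larger proxy value, which is all the ranking of documents needs.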
Anshul Gupta | CSE@TAMU 8
Language-agnostic approach
Anshul Gupta | CSE@TAMU 9
Quality assessment algorithm
[Diagram: OCR output → Pre-filtering → Column segmentation → Local iterative relabeling → BB labels → Quality score]
Anshul Gupta | CSE@TAMU 10
• Prefiltering
– Provides initial labels to be refined in later stages
– Rule-based classifier
• Uses BB properties and OCR word confidence
• Conjunction of rules
– Problems
• Many text BBs are classified as noise
• Need a way to recover misclassified text BBs
[Flowchart: three checks on each BB: area of BB > 1st percentile? OCR word confidence in (0, 0.95)? height/width < 2? A BB that fails any check is labeled non-text; a BB that passes all three is labeled text. An inset marks the height and width of a BB.]
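The pre-filter can be sketched as a conjunction of the three checks in the flowchart. This is a minimal illustration, not the project's code: the box representation (dicts with `w`, `h`, `conf`), the nearest-rank percentile, and the reading that a BB is text only when all three checks pass are assumptions.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]


def prefilter(boxes):
    """Label each box 1 (text) or 0 (noise) with a conjunction of rules.

    boxes: list of dicts with hypothetical keys 'w', 'h' (pixels, w > 0)
    and 'conf' (OCR word confidence in [0, 1]).
    """
    areas = [b["w"] * b["h"] for b in boxes]
    area_floor = percentile(areas, 1)  # 1st percentile of BB areas
    labels = []
    for box, area in zip(boxes, areas):
        big_enough = area > area_floor
        plausible_conf = 0.0 < box["conf"] < 0.95
        word_shaped = box["h"] / box["w"] < 2
        labels.append(1 if (big_enough and plausible_conf and word_shaped) else 0)
    return labels


boxes = [
    {"w": 120, "h": 30, "conf": 0.82},  # ordinary word
    {"w": 3, "h": 2, "conf": 0.40},     # speck: smallest area
    {"w": 10, "h": 60, "conf": 0.70},   # tall and thin
    {"w": 90, "h": 28, "conf": 0.99},   # confidence outside (0, 0.95)
    {"w": 80, "h": 25, "conf": 0.65},   # ordinary word
]
print(prefilter(boxes))  # [1, 0, 0, 0, 1]
```

Because the rules are combined with AND, the filter is strict, which matches the stated problem that many text BBs end up labeled as noise and need a later recovery step.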
Anshul Gupta | CSE@TAMU 11
• Column extraction
– Extract the individual columns, then process each column separately
[Figure: page annotated with the leftmost text BB, the rightmost text BB, and the trough between columns]
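One plausible reading of the column-extraction figure is sketched below: build a horizontal coverage profile of the text BBs between the leftmost and rightmost text BB, and split at the trough. The slides do not give the exact procedure, so the function names and the simplification that the trough has zero coverage are assumptions.

```python
def find_trough(text_boxes):
    """Return the x position of the widest empty gap, or None.

    text_boxes: list of (x_left, x_right) pixel spans of BBs currently
    labeled as text. Simplification: the trough is a run of columns
    covered by no text BB at all.
    """
    left = min(x0 for x0, _ in text_boxes)
    right = max(x1 for _, x1 in text_boxes)
    coverage = [0] * (right - left)
    for x0, x1 in text_boxes:
        for x in range(x0, x1):
            coverage[x - left] += 1

    best_start, best_len, run_start = None, 0, None
    for i, count in enumerate(coverage + [1]):  # sentinel closes a trailing run
        if count == 0 and run_start is None:
            run_start = i
        elif count != 0 and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    if best_start is None:
        return None
    return left + best_start + best_len // 2


def split_columns(text_boxes):
    """Split boxes into left and right columns at the trough, if any."""
    trough = find_trough(text_boxes)
    if trough is None:
        return [list(text_boxes)]
    left_col = [b for b in text_boxes if (b[0] + b[1]) / 2 < trough]
    right_col = [b for b in text_boxes if (b[0] + b[1]) / 2 >= trough]
    return [left_col, right_col]


two_columns = [(10, 100), (20, 110), (150, 240), (160, 250)]
print(find_trough(two_columns))  # 130
```

On real pages a few stray BBs may sit inside the gutter, so a minimum of the coverage profile rather than an exact zero run would be the more tolerant choice.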
Anshul Gupta | CSE@TAMU 12
• Local iterative relabeling
– Refines the initial labels
– Based on the properties of each BB and its neighborhood
– Applies an MLP classifier iteratively to refine BB labels (text/noise)

Features used during local iterative relabeling

Feature       Description
$S$           Score from nearest neighbors; see eq. (1)
$C_{OCR}$     OCR word confidence*
$H/W$         Height-to-width ratio of BB*
$A$           Area of BB*
$H_{norm}$    Normalized height: $H_{norm} = (H - H_{med}) / H_{IQR}$
$H_{dist}$    Horizontal distance from the middle of the page
$V_{dist}$    Vertical distance from the top margin
*available from the pre-filtering stage

$S = \dfrac{\sum_{k=1}^{P} w_k L_k}{\sum_{k=1}^{P} w_k}$    (1)

[Flowchart: the BBs of a column, with the initial labels from the pre-filter stage and their geometric features, go into a multi-layer perceptron model that outputs new labels (text or noise). If the new labels differ from the old labels (No), the step repeats with the new labels; if they are equal (Yes), the labels are final. The figure also marks a distance $D_{max}$.]
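The relabeling loop and the neighbor score of eq. (1) can be sketched as follows. Several details are assumptions made for illustration: the weights $w_k$ are taken as inverse-distance weights, $D_{max}$ is treated as the maximum neighbor distance, and a simple threshold on $S$ stands in for the MLP classifier, which in the slides also sees the other features.

```python
import math


def neighbor_score(i, centers, labels, p=3, d_max=50.0):
    """Weighted mean label of up to p nearest neighbors of box i.

    Implements S = sum(w_k * L_k) / sum(w_k) with the assumed weight
    w_k = 1 / (1 + distance); only neighbors within d_max are used.
    """
    candidates = []
    for j, (cx, cy) in enumerate(centers):
        if j == i:
            continue
        dist = math.hypot(cx - centers[i][0], cy - centers[i][1])
        if dist <= d_max:
            candidates.append((dist, labels[j]))
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:p]
    if not nearest:
        return 0.0
    weights = [1.0 / (1.0 + dist) for dist, _ in nearest]
    return sum(w * lab for w, (_, lab) in zip(weights, nearest)) / sum(weights)


def iterative_relabel(centers, labels, classify, max_iter=10):
    """Re-run the classifier until the labels stop changing."""
    labels = list(labels)
    for iteration in range(1, max_iter + 1):
        scores = [neighbor_score(i, centers, labels) for i in range(len(centers))]
        new_labels = [classify(i, s) for i, s in enumerate(scores)]
        if new_labels == labels:
            return labels, iteration
        labels = new_labels
    return labels, max_iter


# Five words on one text line; the middle one was wrongly pre-filtered as noise.
centers = [(0, 0), (20, 0), (40, 0), (60, 0), (80, 0)]
initial = [1, 1, 0, 1, 1]
stand_in_classifier = lambda i, s: 1 if s >= 0.5 else 0  # placeholder for the MLP
print(iterative_relabel(centers, initial, stand_in_classifier))  # ([1, 1, 1, 1, 1], 2)
```

The example shows the intended effect: a BB surrounded by text BBs is pulled back to the text class, and the loop stops once an iteration leaves every label unchanged.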
Anshul Gupta | CSE@TAMU 13
Final output
[Figure: final output for an example page, showing BB confidence and the resulting text/noise labels]
Anshul Gupta | CSE@TAMU 14
Results
• Label refinement: local iterative relabeling
[Figure: bar chart of precision, recall, and F1 score (axis from 0.85 to 1), pre-filtering vs. after iterative relabeling]
• Dataset
– Binarized page images
– Images are selected to represent the variety in the eMOP corpora
• Multi-page; single column; ink bleed-through; multiple skew; warping; printed margins
• Label creation
– Each BB returned by OCR is manually labelled as 0: noise or 1: text
– 72,366 BBs are labelled
[Figure: proportion (0% to 100%) against number of iterations (1 to 5)]
Anshul Gupta | CSE@TAMU 15
• Quality assessment result
– % noise BBs = $BB_{noise}$
– Juxta score
• $s_{JW}$: similarity between the OCR output and the ground truth
• eMOP uses juxta-cl* to generate $s_{JW}$
– 6,775 test documents with ground-truth text
– Compare % noise BBs with the Juxta score
[Figure: scatter plot of $BB_{noise}$ against $s_{JW}$, both axes from 0 to 1; $\rho = -0.7038$]
*implementations from juxtacommons.org
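A correlation such as the reported ρ can be computed as in the sketch below. The slides do not say whether ρ is a Pearson or a rank correlation; Pearson is assumed here, and the data points are made-up toy values, not the eMOP measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy values only: noise fraction per document and its Juxta score.
bb_noise = [0.05, 0.10, 0.30, 0.55, 0.80]
juxta = [0.92, 0.85, 0.70, 0.40, 0.35]
print(pearson(bb_noise, juxta) < 0)  # True: more noise, lower similarity
```

A strongly negative value means the noise fraction can rank documents by OCR quality without needing ground-truth text.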
Anshul Gupta | CSE@TAMU 16
• Filtering quality
– Does removing the predicted noise BBs improve the Juxta score?
[Figure: change in $s_{JW}$ (axis from 0 to 0.2) for $BB_{noise}$ bins [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1)]
[Figure: % of documents by change in similarity: > 0: 85.4; < 0: 10.6; = 0: 4]
Anshul Gupta | CSE@TAMU 17
Recap
[Diagram: hOCR and document images → Automatic Quality Assessment (Goal 1) → quality score (Yes: good documents; No: bad documents) and denoised hOCR; denoised hOCR + document images → Active Font Identification (Goal 2) → font metadata (Black font, Roman, Mixed)]
Anshul Gupta | CSE@TAMU 18
Why do we need font identification?
• Improve OCR quality
– eMOP collections have documents in multiple fonts
– An OCR system works best when the font is known
– There is no font database for the eMOP collections
• How?
– Manual tagging
• A human labels/tags each document
– Automatic tagging
• Machine learning models that can recognize fonts
– But font identification for eMOP
• Needs labeled data (a training set)
• Requires getting labeled data from millions of page images
– Need an efficient way to train supervised ML models
Anshul Gupta | CSE@TAMU 19
Our approach
• Active learning
– Allow the ML algorithm to
• Acquire its own training data
• Select the most informative examples for labelling
– Build ML models using as few labeled examples as possible
Anshul Gupta | CSE@TAMU 20
Active learning
• A learning paradigm
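– Train a classifier using labelled data
– Sample the most informative instances: active sampling
– Ask a human for their labels
[Diagram: starting from a small labeled set L, the ML algorithm is trained on L, selects the most informative instances {X, ?} from the unlabeled pool U, queries their labels, and adds the labeled pairs {X, Label} to L]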
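The loop in the diagram can be sketched in a few lines. Everything concrete here is a stand-in chosen for brevity: one-dimensional features, a nearest-centroid classifier instead of the font classifier, a margin-based uncertainty measure, and an `oracle` function playing the human annotator.

```python
def train_centroids(labeled):
    """Per-class mean of x; labeled is a list of (x, label) with labels 0/1.

    Assumes the initial labeled set L contains at least one example per class.
    """
    centroids = {}
    for lab in (0, 1):
        xs = [x for x, l in labeled if l == lab]
        centroids[lab] = sum(xs) / len(xs)
    return centroids


def uncertainty(x, centroids):
    """Higher when x is nearly equidistant from both class centroids."""
    return -abs(abs(x - centroids[0]) - abs(x - centroids[1]))


def active_learn(labeled, pool, oracle, rounds=3, batch=1):
    """Train, query the most uncertain pool items, add them to L, repeat."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        centroids = train_centroids(labeled)
        pool.sort(key=lambda x: uncertainty(x, centroids), reverse=True)
        queried, pool = pool[:batch], pool[batch:]
        labeled.extend((x, oracle(x)) for x in queried)  # human supplies labels
    return train_centroids(labeled), labeled


oracle = lambda x: 1 if x >= 5.0 else 0      # stand-in for the human tagger
small_l = [(0.0, 0), (10.0, 1)]              # small labeled set L
unlabeled_u = [1.0, 4.8, 9.0, 5.5, 2.0]      # unlabeled pool U
model, grown_l = active_learn(small_l, unlabeled_u, oracle, rounds=2)
print(grown_l[2:])  # [(4.8, 0), (5.5, 1)]
```

The two queried points are the ones closest to the current decision boundary, which is the sense in which active sampling picks the most informative instances.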
Anshul Gupta | CSE@TAMU 21
• Active learning for font identification
[Diagram: Training: OCR converts TIFF images to hOCR; features are extracted from the hOCR and TIFF; a font classifier is trained. Active sampling: select samples, then tag samples.]
Font classes and characteristics
• Blackletter
– Examples
– Characteristics
• Roman
– Examples
– Characteristics
[Figure: example words in each font, annotated with thick strokes, thin strokes, angled strokes, horizontal serifs, and similar vertical stroke width]
Anshul Gupta | CSE@TAMU 23
Feature extraction
[Diagram: Denoise hOCR → Preprocess → Extract features from word images: mean and IQR character widths, slant line density, Zernike moments]
Anshul Gupta | CSE@TAMU 24
Preprocess word image
• Normalize the height of
word images
– Resize each word image to
have same height
• Remove salt and pepper
noise
• Correct skew
– Calculate a time-frequency distribution for different skew angles
– Use the skew angle at which the distribution shows a peak
[Figures: example word images, panels (a) and (b)]
Anshul Gupta | CSE@TAMU 26
Mean and IQR character width
• Roman fonts have a smaller vertical stroke width than Blackletter fonts
– Captured by the mean character width
• Blackletter fonts have drastic differences in stroke width
– Captured by the IQR of character width
• How to capture these characteristics?
[Figure: word image with three positions marked: Mid, Mid + 20, and Mid − 20]
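One way to read the Mid / Mid ± 20 figure is that widths are measured along three horizontal scan lines of the height-normalized word image. The sketch below follows that reading, which is an assumption: it collects the lengths of ink runs on those rows and reports their mean and interquartile range. The binary image format (rows of 0/1, 1 = ink) and the function names are hypothetical.

```python
from statistics import mean, quantiles


def ink_runs(row):
    """Lengths of consecutive runs of ink pixels (value 1) in one row."""
    runs, current = [], 0
    for pixel in row:
        if pixel:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def width_stats(image, offset=20):
    """Mean and IQR of ink-run lengths on rows mid - offset, mid, mid + offset."""
    mid = len(image) // 2
    rows = {min(max(mid + d, 0), len(image) - 1) for d in (-offset, 0, offset)}
    runs = [r for y in sorted(rows) for r in ink_runs(image[y])]
    if len(runs) < 2:
        raise ValueError("not enough ink runs to compute quartiles")
    q1, _, q3 = quantiles(runs, n=4, method="inclusive")
    return mean(runs), q3 - q1


# Toy 3-row images; offset=1 because the images are tiny.
uniform_strokes = [[0, 1, 1, 0, 1, 1, 0, 1, 1, 0]] * 3   # every run is 2 px
mixed_strokes = [[1, 0, 1, 1, 1, 1, 1, 0, 1, 0]] * 3     # runs of 1 px and 5 px
print(width_stats(uniform_strokes, offset=1))  # mean 2, IQR 0.0
```

Uniform strokes give an IQR of zero, while alternating thin and thick strokes give a large IQR, which is the contrast between Roman and Blackletter that the two features are meant to capture.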
Anshul Gupta | CSE@TAMU 28
Slant line density
• Blackletter fonts are characterized by angled lines and serifs
– Capture the number of angled straight lines in a word image
– Density of angled lines per character
• How?
– Hough transform
– Count the lines at angles within 45° ± 5° or −45° ± 5°
– Divide by the number of characters (from hOCR)
[Diagram: word image → edge detection → edge image → Hough transform]
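The counting step after the Hough transform can be sketched as below. The sketch assumes the line angles (in degrees) have already been extracted from the edge image by some Hough implementation; the function name and parameters are hypothetical, and angles are folded into [−90°, 90°) so that 135° counts as −45°.

```python
def slant_line_density(line_angles_deg, n_chars, target=45.0, tol=5.0):
    """Number of lines near +45 or -45 degrees, divided by character count.

    line_angles_deg: angles of detected straight lines (assumed to come
    from a Hough transform of the edge image); n_chars: number of
    characters in the word according to hOCR.
    """
    if n_chars <= 0:
        raise ValueError("n_chars must be positive")

    def is_slanted(angle):
        folded = (angle + 90.0) % 180.0 - 90.0  # map into [-90, 90)
        return abs(abs(folded) - target) <= tol

    return sum(1 for a in line_angles_deg if is_slanted(a)) / n_chars


angles = [44, -47, 0, 90, 135, 52]
print(slant_line_density(angles, n_chars=4))  # 0.75
```

Dividing by the character count keeps long words from scoring higher than short ones just because they contain more strokes.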
Anshul Gupta | CSE@TAMU 30
Zernike Moments
• Zernike moments (ZMs) are shape descriptors
– Capture the visual appearance of the text (words)
• 6 ZMs along with their transformations
– 15 features in total, similar to those used for tumor classification by Tahmasbi et al. (2011)
Anshul Gupta | CSE@TAMU 31
Bag-of-word features
[Diagram: codebook construction: images → word images → feature extraction → K-means → codebook. Encoding: document image → feature extraction → vector quantization with the codebook → quantized image → normalized histogram (BoF vector)]
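The encoding half of the diagram, vector quantization followed by a normalized histogram, can be sketched as follows. The codebook is assumed to have been learned beforehand (for example with K-means, as in the diagram); the toy two-dimensional features and the function names are illustrative only.

```python
def nearest_codeword(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, codebook[k])),
    )


def bof_vector(word_features, codebook):
    """Normalized histogram of codeword assignments for one document."""
    counts = [0] * len(codebook)
    for vec in word_features:
        counts[nearest_codeword(vec, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]


codebook = [(0, 0), (10, 10), (0, 10)]                 # assumed K-means centers
document_words = [(1, 1), (9, 9), (0, 1), (1, 9)]      # one feature vector per word
print(bof_vector(document_words, codebook))  # [0.5, 0.25, 0.25]
```

Normalizing by the number of words makes documents of different lengths comparable, so the histogram describes the mix of word shapes rather than the page count.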
Anshul Gupta | CSE@TAMU 32
Recap
[Diagram: OCR (TIFF → hOCR) → feature extraction (hOCR + TIFF) → train font classifier]
Anshul Gupta | CSE@TAMU 33
Classifier
• Label propagation
– Graph-based semi-supervised classifier
– Uses labeled and unlabeled data to form a graph structure
– Labeled data act as sources that transmit labels to unlabeled data according to the similarity $w_{ij}$
[Figure: graph with two labeled examples and one unlabeled example, edges weighted by $w_{ij}$]
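A minimal sketch of label propagation is given below. The slides do not state which variant or similarity function is used, so this assumes a Gaussian similarity for $w_{ij}$ and the simple iterative scheme in which labeled nodes stay clamped while each unlabeled node takes the similarity-weighted average of its neighbors' class scores.

```python
import math


def propagate(points, labels, n_classes=2, sigma=1.0, n_iter=50):
    """Spread labels over a fully connected similarity graph.

    points: list of feature tuples; labels: class index for labeled
    points and None for unlabeled ones. w_ij = exp(-d^2 / (2 sigma^2))
    is an assumed similarity, not taken from the slides.
    """
    n = len(points)
    w = [
        [
            0.0 if i == j else math.exp(
                -sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                / (2 * sigma ** 2)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Class scores: one-hot for labeled nodes, zeros for unlabeled nodes.
    scores = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
              for i in range(n)]
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(scores[i])  # labeled nodes stay clamped
                continue
            total = sum(w[i])
            updated.append([
                sum(w[i][j] * scores[j][c] for j in range(n)) / total if total else 0.0
                for c in range(n_classes)
            ])
        scores = updated
    return [max(range(n_classes), key=lambda c: scores[i][c]) for i in range(n)]


points = [(0,), (1,), (2,), (8,), (9,), (10,)]
labels = [0, None, None, None, None, 1]   # only the two end points are labeled
print(propagate(points, labels))  # [0, 0, 0, 1, 1, 1]
```

With only two labeled points, the four unlabeled points still receive the label of the cluster they sit in, which is why a graph-based classifier suits a setting with few labels and many unlabeled documents.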
Anshul Gupta | CSE@TAMU 34
Recap
[Diagram: OCR (TIFF → hOCR) → feature extraction (hOCR + TIFF) → train font classifier → active sampling: select samples]
Anshul Gupta | CSE@TAMU 35
Active sampling
[Figure: feature space with the classifier decision boundary between Class 1 and Class 2, illustrating uncertainty-based sampling (HS), dissimilarity-based sampling (DS), and diversity (D’)]
Anshul Gupta | CSE@TAMU 36
Recap
[Diagram: OCR (TIFF → hOCR) → feature extraction (hOCR + TIFF) → train font classifier → select samples → tag samples]
Anshul Gupta | CSE@TAMU 37
Results
• Dataset
– 3272 documents from ECCO and EEBO collections
– eMOP experts labeled documents
• 1005 Black documents
• 1768 Roman documents
• 498 Mixed documents – text printed in both fonts
Anshul Gupta | CSE@TAMU 38
Experiment 1
• Quality of extracted features: word level
[Diagram: 500 Roman and 500 Blackletter images → feature extraction → classifier → Blackletter word / Roman word. Feature sets compared: ALL; Zernike moments (ZMs); mean and IQR CW plus SLD]
Anshul Gupta | CSE@TAMU 39
Result 1
[Figure: cross-validation F1 score by feature set: ALL 0.8433; only ZMs 0.805; CW (mean and IQR) and SLD 0.6717]
Anshul Gupta | CSE@TAMU 40
Experiment 2
• Quality of bag-of-word features
[Diagram: codebook construction: images → word images → feature extraction → K-means → codebook. Encoding: document image → feature extraction → vector quantization with the codebook → quantized image → normalized histogram (BoF vector) → PCA → 3-dimensional feature vector]
Anshul Gupta | CSE@TAMU 41
Result 2
Anshul Gupta | CSE@TAMU 42
Experiment 3
• Performance of active learning model
[Diagram: start with 3 labeled examples in L; train the ML model on L; store the validation accuracy (T); select the most informative instances {X, ?} from U; query labels for 20 instances; add {X, Label} to L; repeat 20 times]
Anshul Gupta | CSE@TAMU 43
Result 3
Anshul Gupta | CSE@TAMU 44
[Figure: area under learning curves (%) by sampling strategy: Uncertainty 30.4; Random 26.6; Combination 29.3]
Anshul Gupta | CSE@TAMU 45
Future work
• Automatic assessment of OCR quality
– Linguistic features can be explored
– Denoised hOCR can be used to detect unknown types of noise
• Bleed-through, irregular fonts, speckle noise, etc.
• Are there other types of page problems in the eMOP collections?
• Active learning based font identification
[Diagram: the active learning pipeline (OCR: TIFF → hOCR; feature extraction; select samples; tag samples; update font classifier), shown with the labels Font, Bleedthrough, Musical scripts, and Picture]
Anshul Gupta | CSE@TAMU 46
Conclusion
• Summary
– Automatic assessment of OCR quality
• Non-text OCR outputs suffice to
– Identify text and noise in a document image
– Estimate the document’s overall quality
– Improve OCR transcription performance when image processing
based preprocessing is prohibitive
– Active learning based font identification
• Word image features capture the font characteristics
• Bag-of-word features show good class separability
• A robust font classifier is trained using just 443 labeled instances
Thank you

Assessment of OCR quality and font identification in historical documents

  • 1.
  • 2. Anshul Gupta | CSE@TAMU 2 What are historical documents? – Correspondence – Diaries – Newspapers – Government Documents – Books
  • 3. Anshul Gupta | CSE@TAMU 3 Digitizing historical documents • Why? – Historical records are in analog form – Due to their fragility, most of them are not accessible – Not searchable • How to make them accessible? – Digital text transcription • Ways of digitization – Hand transcribe each book • Resource intensive – OCR: optical character recognition • high-error in text transcription • Mass digitization projects
  • 4. Anshul Gupta | CSE@TAMU 4 Early modern OCR project (eMOP) • Goal – Improve OCR accuracy for early modern texts • 300k documents, 45M pages – Open source OCR tools • Challenges – Early modern printing • Irregular fonts • Decorative page elements – Document image problems – Problems get severe • Images are binarized Pictures Decorative page elements
  • 5. Anshul Gupta | CSE@TAMU 5 Goals Font Metadata Automatic Quality Assessment*hOCR Quality Score Active Font Identification Black font Roman Mixed Good Documents Bad Documents Goal1 Goal2 Denoised hOCR+ Document Images *hOCR output from Tesseract OCR.
  • 6. Anshul Gupta | CSE@TAMU 6 Why we want to assess OCR quality? • Improve runtime – Focus on documents with good OCR quality – Send bad quality documents to a separate diagnostics pipeline • How to measure OCR quality? – Number of methods exists • EMOP use Juxta score • Measures similarity between OCR output and ground truth text – But, such scores need ground truth • Not available for all documents • Automated way to assess OCR quality
  • 7. Anshul Gupta | CSE@TAMU 7 Our approach • Post-process OCR output – Page segmentation result such as bounding box (BB) coordinates – OCR word confidence • Build ML models to remove noise – Binary classification: classify each BB either as text or noise • 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 𝐎𝐂𝐑 ∝ 𝟏 % 𝐧𝐨𝐢𝐬𝐞 𝐁𝐁𝐬
  • 8. Anshul Gupta | CSE@TAMU 8 Language-agnostic approach
  • 9. Anshul Gupta | CSE@TAMU 9 Quality assessment algorithm Pre-filteringOCR Output Column segmentation BB labels Local iterative relabeling Quality score
  • 10. Anshul Gupta | CSE@TAMU 10 • Prefiltering – Provides initial labels to be refined in later stages – Rule based classifier • BB properties and OCR word confidence • Conjunction of rules – Problems • Many text BBs classified as noise • Need a way to recover mis- classified text BBs Area of BB > 1st percentile ? OCR word confidence in (0,0.95)? Yes Non-text No Height/Width < 2? YesNo Non-text NoYes Non-textText Height Width
  • 11. Anshul Gupta | CSE@TAMU 11 • Column extraction – Extract individual column and then process each column Leftmost text BB Rightmost text BB Trough
  • 12. Local iterative relabeling – Refines initial labels – Based on BB properties and its neighborhood – Applies an MLP classifier iteratively to refine BB labels (text/noise) • Features used during local iterative relabeling (feature: description) – $S$: score from nearest neighbors; see eq. (1) – $C_{OCR}$: OCR word confidence* – $H/W$: height-to-width ratio of BB* – $A$: area of BB* – $H_{norm}$: normalized height, $H_{norm} = \dfrac{H - H_{med}}{H_{IQR}}$ – $H_{dist}$: horizontal distance from the middle of the page – $V_{dist}$: vertical distance from the top margin *available from the pre-filtering stage • Eq. (1): $S = \dfrac{\sum_{k=1}^{P} w_k L_k}{\sum_{k=1}^{P} w_k}$ • Flowchart: BBs for a column + initial labels from pre-filter stage → multi-layer perceptron model → new labels; if old labels ≠ new labels, repeat with the new labels; otherwise output the labels (text or noise) (Figure labels: geometric features; $D_{max}$)
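The relabeling loop can be sketched as below. The neighbor score follows the weighted average of neighbor labels in eq. (1); the inverse-distance weights, the choice of `p`, and the threshold rule standing in for the trained MLP are all assumptions made for illustration. In the deck, the classifier also receives the other features listed in the table.

```python
import math


def neighbor_score(i, centers, labels, p=2):
    """Weighted average of the labels of the p nearest BBs (eq. (1)).

    Inverse-distance weights are an assumption; the slides do not define w_k.
    """
    nearest = sorted(
        (math.dist(centers[i], centers[j]), j)
        for j in range(len(centers))
        if j != i
    )[:p]
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * labels[j] for w, (_, j) in zip(weights, nearest)) / total


def relabel(centers, labels, classify, max_iter=10):
    """Recompute scores and labels until the labels stop changing."""
    labels = list(labels)
    for iteration in range(1, max_iter + 1):
        scores = [neighbor_score(i, centers, labels) for i in range(len(centers))]
        new_labels = [classify(s) for s in scores]
        if new_labels == labels:
            return labels, iteration
        labels = new_labels
    return labels, max_iter


def stand_in_classifier(score):
    """Hypothetical stand-in for the trained MLP: text if neighbors are mostly text."""
    return 1 if score > 0.4 else 0


# Five BBs on one text line; the middle one was wrongly pre-filtered as noise.
centers = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
initial = [1, 1, 0, 1, 1]
final_labels, iterations = relabel(centers, initial, stand_in_classifier)
```

The example shows the intended effect: a BB surrounded by text is recovered as text, and the loop stops once an iteration leaves the labels unchanged.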
  • 13. Final output (figure: predicted text and noise BBs; legend: confidence from 0 to 1, Text, Noise)
  • 14. Results • Label refinement: local iterative relabeling (bar chart: precision, recall, and F1 score for pre-filtering vs. after iterative relabeling; y-axis from 0.85 to 1) • Dataset – Binarized page images – Images are selected to represent the variety in the eMOP corpora • Multi-page; single column; ink bleed-through; multiple skew; warping; printed margins • Label creation – Each BB returned by OCR is manually labelled as 0: noise or 1: text – 72,366 BBs are labelled (Plot: proportion of documents converged vs. number of iterations, 1 to 5)
  • 15. Quality assessment result – % noise BBs = $BB_{noise}$ – Juxta score • $S_{JW}$: similarity between OCR output and ground truth • eMOP uses juxta-cl* to generate $S_{JW}$ – 6,775 test documents with ground truth text – Compare % noise BBs and Juxta score (scatter plot: $S_{JW}$ vs. $BB_{noise}$, both axes from 0 to 1; $\rho = -0.7038$) *implementations from juxtacommons.org
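A small sketch of how the two quantities on this slide could be compared: the fraction of BBs labelled noise per document, and its correlation with a similarity score. The slide does not say which correlation coefficient ρ denotes; Pearson's is assumed here. The labels and similarity values below are toy numbers for illustration, not results from the deck.

```python
import math


def noise_fraction(labels):
    """Fraction of BBs labelled 0 (noise) in one document."""
    return sum(1 for label in labels if label == 0) / len(labels)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy data: BB labels for three documents and made-up similarity scores.
document_labels = [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 0, 0]]
similarity = [0.9, 0.3, 0.6]

bb_noise = [noise_fraction(labels) for labels in document_labels]
rho = pearson(bb_noise, similarity)
```

A strongly negative value means that documents with more predicted noise tend to have lower similarity to the ground truth, which is what lets the noise fraction serve as a quality proxy when no ground truth exists.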
  • 16. Filtering quality – Does removing predicted noise BBs help in improving the Juxta score? (Plot: change in $S_{JW}$, from 0 to 0.2, by $BB_{noise}$ bin: [0,0.2), [0.2,0.4), [0.4,0.6), [0.6,0.8), [0.8,1)) (Bar chart: % documents by change in similarity: >0: 85.4; <0: 10.6; =0: 4) *implementations from juxtacommons.org
  • 17. Recap (pipeline diagram) – Input: hOCR* + document images – Goal 1: Automatic Quality Assessment (Yes / No) • Quality score • Denoised hOCR • Good documents / bad documents – Goal 2: Active Font Identification • Font metadata: Black font, Roman, Mixed
  • 18. Why do we need font identification? • Improve OCR quality – eMOP collections have documents in multiple fonts – OCR systems work best when knowledge of the font is available – There is no font database for eMOP collections • How? – Manual tagging • A human can label/tag each document – Automatic tagging • Machine learning models that can recognize fonts – But for font identification for eMOP • Need labeled data (a training set) • Getting labeled data from millions of page images – Need an efficient way to train supervised ML models
  • 19. Our approach • Active learning – Allow the ML algorithm to • Acquire its own training data • Select the most informative examples for labelling – Build ML models using as little labeled data as possible
  • 20. Active learning • A learning paradigm – Train a classifier using labelled data – Sample the most informative instances: active sampling – Ask a human for labels (Diagram labels: small L; ML algo; U; query selects most informative instances {X, ?}; {X, Label}; add to L)
  • 21. Active learning for font identification (diagram labels: TIFF; OCR; hOCR; Feature extraction; Train font classifier; Select samples; Tag samples; stages: Training, Active sampling)
  • 22. Font classes and characteristics • Blackletter – Examples – Characteristics • Roman – Examples – Characteristics (Figure labels: thick stroke; thin stroke; thick stroke; angled strokes; horizontal serifs; similar vertical stroke width)
  • 23. Feature extraction (diagram labels: Denoise hOCR; Preprocess; Extract features from word images: mean and IQR character widths, slant line density, Zernike moments)
  • 24. Preprocess word image • Normalize the height of word images – Resize each word image to have the same height • Remove salt-and-pepper noise • Correct skew – Calculate a time frequency distribution for different skew angles – Choose the skew angle at which the distribution shows a peak (Figure panels: (a), (b))
  • 25. Feature extraction (diagram labels: Denoise hOCR; Preprocess; Extract features from word images: mean and IQR character widths, slant line density, Zernike moments)
  • 26. Mean and IQR character width • Roman fonts have smaller vertical stroke width than Blackletter – Mean character width • Blackletter fonts have drastic differences in stroke widths – IQR character width • How to capture these characteristics? (Figure labels: Mid; Mid + 20; Mid − 20)
  • 27. Feature extraction (diagram labels: Denoise hOCR; Preprocess; Extract features from word images: mean and IQR character widths, slant line density, Zernike moments)
  • 28. Slant line density • Blackletter fonts are characterized by angled lines and serifs – Capture the amount of angled straight lines in a word image – Density of angled lines per character • How? – Hough transform – Number of lines with slope within 45° ± 5° or −45° ± 5° – Divide by number of characters (from hOCR) (Diagram: word image → edge detection → edge image → Hough transform)
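The slant line density described above can be sketched with a small Hough accumulator restricted to the two slope bands. This is an illustrative stand-in, not the deck's implementation: edge detection is assumed to have happened already (the input is a list of edge pixel coordinates), and the vote threshold, the ρ tolerance, and the greedy peak suppression are assumptions.

```python
import math


def slant_line_density(edge_pixels, n_chars, tol_deg=5, min_votes=8, rho_tol=2):
    """Count Hough lines with slope near +45 or -45 degrees, per character.

    edge_pixels: (x, y) coordinates of edge pixels in a word image.
    n_chars: number of characters in the word (from hOCR in the deck).
    """
    slopes = list(range(45 - tol_deg, 45 + tol_deg + 1))
    slopes += list(range(-45 - tol_deg, -45 + tol_deg + 1))

    # Vote in (slope, rho) space; the line normal is the slope angle + 90 degrees.
    votes = {}
    for slope in slopes:
        theta = math.radians(slope + 90)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for x, y in edge_pixels:
            key = (slope, round(x * cos_t + y * sin_t))
            votes[key] = votes.get(key, 0) + 1

    # Greedy peak picking: strongest cells first, skipping near-duplicates
    # of a line that has already been counted.
    peaks = []
    for (slope, rho), count in sorted(votes.items(), key=lambda kv: -kv[1]):
        if count < min_votes:
            break
        if all(abs(slope - ps) > 2 * tol_deg or abs(rho - pr) > rho_tol
               for ps, pr in peaks):
            peaks.append((slope, rho))
    return len(peaks) / n_chars


# Two synthetic 10-pixel strokes: one with slope +45 degrees, one with -45.
stroke_a = [(i, i) for i in range(10)]
stroke_b = [(i, 20 - i) for i in range(10)]
density = slant_line_density(stroke_a + stroke_b, n_chars=4)
```

Dividing by the character count keeps the feature comparable between short and long words, which matters because the line count alone grows with word length.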
  • 29. Feature extraction (diagram labels: Denoise hOCR; Preprocess; Extract features from word images: mean and IQR character widths, slant line density, Zernike moments)
  • 30. Zernike Moments • Zernike Moments (ZMs) are shape descriptors – To capture the visual appearance of the text (words) • 6 ZMs along with their transformations – Total of 15 features, similar to the ones used in a tumor classification problem by Tahmasbi et al. (2011).
  • 31. Bag-of-word features (diagram: Images → Word images → feature extraction → K-means → Codebook; Document image → feature extraction → Vector quantization → Quantized image → Normalized histogram (BoF vector))
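The vector quantization and histogram steps of the bag-of-features pipeline can be sketched as follows. The codebook is assumed to be given (in the deck it comes from K-means over word-image features); the two-dimensional vectors and the function names are illustrative only.

```python
import math


def nearest_codeword(vector, codebook):
    """Index of the codebook entry closest to `vector` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: math.dist(vector, codebook[k]))


def bof_vector(word_features, codebook):
    """Normalized histogram of codeword assignments for one document."""
    histogram = [0] * len(codebook)
    for vector in word_features:
        histogram[nearest_codeword(vector, codebook)] += 1
    total = sum(histogram)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in histogram]


# Toy codebook with two codewords and four word-image feature vectors.
codebook = [(0.0, 0.0), (10.0, 10.0)]
word_features = [(1.0, 0.0), (0.0, 2.0), (9.0, 9.0), (11.0, 10.0)]
document_vector = bof_vector(word_features, codebook)
```

Normalizing the histogram makes documents with different numbers of words comparable, since each document is then described by proportions rather than counts.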
  • 32. Recap (diagram labels: TIFF; OCR; hOCR; Feature extraction; Train font classifier)
  • 33. Classifier • Label propagation – Graph-based semi-supervised classifier – Uses labeled and unlabeled data to form a graph structure – Labeled data act like sources that transmit labels to unlabeled data according to similarity $w_{ij}$ (Figure: an unlabeled example connected to two labeled examples by edges with weight $w_{ij}$)
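A generic label propagation sketch, to make the "labeled data act like sources" idea concrete. It is not claimed to be the variant used in the deck: the Gaussian similarity for $w_{ij}$, the value of `sigma`, and the fixed iteration count are assumptions, and the points are toy one-dimensional positions rather than real font features.

```python
import math


def label_propagation(points, labels, sigma=1.0, n_iter=50):
    """Propagate class labels over a fully connected similarity graph.

    labels: a class name for labeled points, None for unlabeled ones.
    """
    n = len(points)
    classes = sorted({label for label in labels if label is not None})

    # Edge weights w_ij from a Gaussian kernel on Euclidean distance.
    w = [[0.0 if i == j
          else math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]

    # One score per class for every point; labeled points start as one-hot.
    f = [[1.0 if labels[i] == c else 0.0 for c in classes] for i in range(n)]

    for _ in range(n_iter):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(f[i])  # clamp labeled points (the sources)
                continue
            row_sum = sum(w[i])
            updated.append([
                sum(w[i][j] * f[j][k] for j in range(n)) / row_sum
                for k in range(len(classes))
            ])
        f = updated

    return [classes[max(range(len(classes)), key=lambda k: f[i][k])]
            for i in range(n)]


# Two clusters with one labeled example each.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 0.0), (5.5, 0.0), (6.0, 0.0)]
labels = ["Blackletter", None, None, "Roman", None, None]
predicted = label_propagation(points, labels)
```

Only two labels are supplied, yet every point receives a class, which is why a semi-supervised classifier pairs naturally with an active learning setup where labels are scarce.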
  • 34. Recap: active sampling (diagram labels: TIFF; OCR; hOCR; Feature extraction; Select samples; Train font classifier)
  • 35. Active sampling • Uncertainty based sampling (HS) • Dissimilarity based sampling (DS) • Diversity (D’) (Figure: feature space with the classifier decision boundary, Class 1, and Class 2)
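Uncertainty based sampling can be sketched by ranking unlabeled instances by the entropy of the classifier's predicted class probabilities. That the slide's "HS" refers to an entropy score is an assumption, and the probabilities below are made-up values; the dissimilarity and diversity criteria are not covered here.

```python
import math


def entropy(probabilities):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_most_uncertain(predictions, n):
    """Indices of the n instances whose predictions have the highest entropy."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:n]


# Predicted (Blackletter, Roman) probabilities for four unlabeled documents.
predictions = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4], [1.0, 0.0]]
query = select_most_uncertain(predictions, n=2)
```

The selected instances are the ones closest to the decision boundary, so asking a human to label them is expected to change the classifier more than labelling confidently classified ones.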
  • 36. Recap (diagram labels: TIFF; OCR; hOCR; Feature extraction; Select samples; Tag samples; Train font classifier)
  • 37. Results • Dataset – 3272 documents from ECCO and EEBO collections – eMOP experts labeled documents • 1005 Black documents • 1768 Roman documents • 498 Mixed documents – text printed in both fonts
  • 38. Experiment 1 • Quality of extracted features: word level (Diagram: 500 Roman & 500 Blackletter images → Feature extraction → Classifier → Blackletter word / Roman word) (Feature sets: ALL; Zernike Moments (ZMs); Mean and IQR CW, SLD)
  • 39. Result 1 (bar chart: cross-validation F1 score by feature set) – ALL: 0.8433 – Only ZMs: 0.805 – CW (mean & IQR) and SLD: 0.6717
  • 40. Experiment 2 • Quality of Bag-of-word features (diagram: Images → Word images → feature extraction → K-means → Codebook; Document image → feature extraction → Vector quantization → Quantized image → Normalized histogram (BoF vector) → PCA → 3-dimensional feature vector)
  • 41. Result 2
  • 42. Experiment 3 • Performance of active learning model (Diagram labels: 3 labeled examples; L; U; T; Train ML model; Select most informative instances {X, ?}; Query labels for 20 instances; {X, Label}; Add to L; Store validation accuracy; Repeat 20 times)
  • 43. Result 3
  • 44. Area under learning curves (%) (bar chart; y-axis from 24 to 31) – Uncertainty: 30.4 – Random: 26.6 – Combination: 29.3
  • 45. Future work • Automatic assessment of OCR quality – Linguistic features can be explored – Denoised hOCR can be used to detect unknown noises • Bleedthrough, irregular fonts, speckle noise, etc. • Do any other types of page problems exist in eMOP collections? • Active learning based font identification (Diagram labels: TIFF; OCR; hOCR; Feature extraction; Select samples; Tag samples; Update font classifier; targets: Font, Bleedthrough, Musical scripts, Picture)
  • 46. Conclusion • Summary – Automatic assessment of OCR quality • Non-text OCR outputs suffice to – Identify text and noise in a document image – Estimate the document’s overall quality – Improve OCR transcription performance when image processing based preprocessing is prohibitive – Active learning based font identification • Word image features capture the font characteristics • Bag-of-word features show good class separability • A robust font classifier is trained using just 443 labeled instances

Editor's Notes

  1. Good afternoon, everyone! I am Anshul Gupta, and today I am going to present my work on automatic assessment of OCR quality in historical documents.
  2. So, what are historical documents? Anything that can give us information about a certain event in the past or about a particular period. Some examples of historical documents are correspondence, diaries, newspapers, government documents, and books. In this presentation we will focus on old printed books, along with algorithms to improve their digitization quality.
  3. Since these documents are in printed form, they degrade with time. Due to their fragility, most of these documents are not accessible. So, basically, digitization helps to preserve these documents and also makes them searchable. How can we make them searchable? We can get digital text for these documents and then plug this text into a search engine. Suddenly we can make billions of historical documents searchable. But the challenge here is: how can we get this digital text transcription? One naïve way is to hand transcribe each document; given billions of documents, that is definitely not a feasible option. We need to use an automated method, that is, optical character recognition. These systems are great but generate high-error text output. Hence, we need to customize these systems for historical documents. Some of the successful mass digitization projects are from the Library of Congress, Google Books, ProQuest, Gale, and the Early Modern OCR Project. The eMOP project is still in progress, and the work that I am presenting today is a part of eMOP.
  4. So, let's see what eMOP is all about. Explain the font identification… introduce. The two goals of eMOP are: first, to improve OCR of early texts, that is, texts printed between the 14th century and the 18th century. The second goal is to produce open source tools such as font databases, crowdsourcing correction tools, and post-processing tools. The database contains 45 million page images, and these images have a variety of problems. The first set of problems arises from early modern printing. During that time the printing process was not formalized. They used very odd fonts, as shown in the zoomed picture. This is called a blackletter font, and these fonts vary from image to image. As shown in the highlighted region, these are the decorative elements, and when we OCR this page, the OCR system sometimes recognizes these elements as valid text. The other set of issues is related to the degradation of these documents. So when the images of these documents are generated, we get issues such as faded fonts, black patches due to torn pages, and multiple skews on the same page image. All these issues get more severe because all the images that we have are low-quality binarized images. Hence, when we OCR these documents we get lots of junk text. But the fact is that not all the documents are of such bad quality. So, the challenge here is: can we separate the good quality documents from the bad ones? Talk about binarized images, so you need not talk about it later.
  5. Talk about two main goals First part is… Animate it Increase the fonts
  6. So, our approach to measuring OCR quality is to post-process OCR outputs such as OCR bounding box coordinates and OCR word confidence. We basically pose this problem of measuring quality as a binary classification problem, where we want to classify each bounding box either as text or noise. Once we have the labels (noise or text) of all bounding boxes, we can get OCR quality from the percentage of predicted noise BBs.
  7. Also, our approach does not depend on the text written on the page image. Here, when we passed this document image through the OCR pipeline, we got these green bounding boxes as output. When we pass this OCR output to our algorithm, we are just passing these rectangles. Hence, this makes our algorithm language agnostic.
  8. So, here is the block diagram showing the steps in the assessment algorithm. In step 1, the algorithm generates an initial set of labels. Then it divides the page image into its constituent columns. Finally, it locally refines the bounding box labels.
  9. As I mentioned, prefiltering generates an initial set of labels. It uses a rule-based classifier, as represented by this tree. Since we designed our algorithm to be conservative in predicting text, it loses many text BBs at this point. Hence, we need to recover these text BBs.
  10. In order to extract contextual information, the page image is divided into its constituent columns. For this, we first generate a bounding box density profile along the x axis. Then the troughs in this profile represent the column separators.
  11. In this step we process each column separately. The idea behind this step is that in a book, a word is usually surrounded by more words. Hence, we embedded this local information into our algorithm by constructing a local feature. With this local feature and other geometric features, we trained a multi-layer perceptron. We then used this trained MLP model to iteratively refine the BB labels. So, the process goes like this: we get the bounding boxes for a column, and for each bounding box we calculate the local feature as a weighted average of the labels of neighbouring boxes. We then pass this local feature and the other features to the MLP model, which then outputs new labels. If the new labels are not equal to the old labels, we use the new labels to recalculate the local feature, and the process is repeated until the labels stop changing.
  12. So the final output looks like this. Here the predicted noise is in red and the predicted text is in green. We can see that the algorithm has done a good job in predicting non-text as noise; for example, here the picture is classified as noise. Also, it has found noise even when it is buried among text bounding boxes.
  13. Now let's see how well the proposed algorithm works. To evaluate the algorithm, we selected a set of images that represents the variety in the eMOP database. Then we hand labelled around 72,000 bounding boxes. So now let's look at how well local iterative relabeling works. In this plot, the blue bars are the prefiltering result and the red bars are the result after local iterative relabeling. Here we can see that both precision and recall have improved after local refinement. This means that local refinement makes the algorithm more precise in predicting what is text and also recovers text lost in the prefiltering step. Another important aspect of the local refinement is its convergence rate. For that, we plotted the proportion of documents for which local refinement converged within a certain number of iterations. We can see that for almost all documents the local refinement step converged in 4 iterations.
  14. So the classification problem that I presented here is basically a filtering problem. Here we are trying to filter out noise. Hence, it makes sense to see how good the filtering quality is. For this, we selected around 6,700 documents to generate our test set. For all these documents we had ground truth text transcriptions. Also, whenever a ground truth text is available, eMOP compares the OCR text output for that document image with its ground truth, and a similarity score $S_{JW}$ is generated by eMOP. So, for each of these test documents, we calculated this similarity score before and after application of our algorithm. Then we calculated the change in similarity. We plotted that along the y axis versus the noisiness present in a document image. We can see that the filtering has a large effect on very noisy documents and a small effect on good quality ones. Also, for 85% of these 6,700 documents, our algorithm gave a positive change.
  15. Correct the lp diagram
  16. Feature works Features
  17. Get results for just zms, cw+sl+iqr; combine
  18. So, to summarize: this work shows that OCR output such as BB coordinates and word confidence can be used to identify text and noise and to measure a document's overall quality, and wherever preprocessing-based filtering is prohibitive, this algorithm can be used in the post-processing stage to remove noise. Currently, I am working on building a diagnostic pipeline using active learning. Also, adding linguistic features such as character n-grams can give us cues about certain kinds of noise.
  19. Thank you all for your attention. And now I am open to questions.