This document presents methods for digitizing historical documents and assessing optical character recognition (OCR) quality. It proposes using machine learning to classify bounding boxes from OCR output as either text or noise, in order to estimate OCR quality without ground truth text. It also describes using active learning and features extracted from word images to identify fonts in documents from only a small number of labeled examples. The methods assess OCR quality by detecting non-text regions, and identify document fonts from word-image (bag-of-word) features while labeling fewer than 450 samples.
Predicting Optimal Parallelism for Data Analytics (Databricks)
A key benefit of serverless computing is that resources can be allocated on demand, but the quantity of resources to request, and allocate, for a job can profoundly impact its running time and cost. For a job that has not yet run, how can we provide users with an estimate of how the job’s performance changes with provisioned resources, so that users can make an informed choice upfront about cost-performance tradeoffs?
This talk will describe several related research efforts at Microsoft to address this question. We focus on optimizing the amount of computational resources that control a data analytics query’s achieved intra-parallelism. These use machine learning models on query characteristics to predict the run time or Performance Characteristic Curve (PCC) as a function of the maximum parallelism that the query will be allowed to exploit.
The AutoToken project uses models to predict the peak number of tokens (resource units) that is determined by the maximum parallelism that the recurring SCOPE job can ever exploit while running in Cosmos, an Exascale Big Data analytics platform at Microsoft. AutoToken_vNext, or TASQ, predicts the PCC as a function of the number of allocated tokens (limited parallelism). The AutoExecutor project uses models to predict the PCC for Apache Spark SQL queries as a function of the number of executors. The AutoDOP project uses models to predict the run time for SQL Server analytics queries, running on a single machine, as a function of their maximum allowed Degree Of Parallelism (DOP).
We will present our approaches and prediction results for these scenarios, discuss some common challenges that we handled, and outline some open research questions in this space.
Presentation of the paper PoCoTo - An Open Source System For Efficient Interactive Postcorrection of OCRed Historical Text by Thorsten Vobl, Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter and Klaus Schulz in DATeCH 2014. #digidays
Deep learning is having a profound impact on AI applications. With the future of neural network-inspired computing in mind, re:Invent is hosting the first ever Deep Learning Summit. Designed for developers to learn about the latest in deep learning research and emerging trends, attendees will hear from industry thought leaders—members of the academic and venture capital communities—who will share their perspectives in 30-minute Lightning Talks.
The Summit will be held on Thursday, November 30th at the Venetian from 1-5pm.
The Deep Learning Revolution - Terrence Sejnowski, The Salk Institute for Biological Studies
Eye, Robot: Computer Vision and Autonomous Robotics - Aaron Ames & Pietro Perona, California Institute of Technology
Exploiting the Power of Language - Alexander Smola, Amazon Web Services
Reducing Supervision: Making More with Less - Martial Herbert, Carnegie Mellon University
Learning Where to Look in Video - Kristen Grauman, University of Texas
Look, Listen, Learn: The Intersection of Vision and Sound - Antonio Torralba, MIT
Investing in the Deep Learning Future - Matt Ocko, Data Collective Venture Capital
Semantic Retrieval and Automatic Annotation: Linear Transformations, Correlat... (Jonathon Hare)
Multimedia Content Access: Algorithms and Systems IV (SPIE Electronic Imaging 2010). January 2010.
http://eprints.soton.ac.uk/268496/
This paper proposes a new technique for auto-annotation and semantic retrieval based upon the idea of linearly mapping an image feature space to a keyword space. The new technique is compared to several related techniques, and a number of salient points about each of the techniques are discussed and contrasted. The paper also discusses how these techniques might actually scale to a real-world retrieval problem, and demonstrates this through a case study of a semantic retrieval technique being used on a real-world data-set (with a mix of annotated and unannotated images) from a picture library.
LACS System Analysis on Retrieval Models for the MediaEval 2014 Search a... (multimediaeval)
We describe the LACS submission to the Search sub-task of the Search and Hyperlinking Task at MediaEval 2014. Our experiments investigate how different retrieval models interact with word stemming and stopword removal. On the development data, we segment the subtitle and Automatic Speech Recognition (ASR) transcripts into fixed-length time units and examine the effect of different retrieval models. We find that stemming provides consistent improvement; stopword removal is more sensitive to the retrieval model on the subtitles. These manipulations do not contribute to stable improvement on the ASR transcripts. Our experiments on the test data focus on the subtitles. The gap in performance between different retrieval models is much smaller than on the development data. We achieved 0.477 MAP on the test data.
http://ceur-ws.org/Vol-1263/mediaeval2014_submission_27.pdf
Measuring Search Engine Quality using Spark and Python (Sujit Pal)
Presented at PyData Amsterdam 2016. Describes the Rewinder tool, to compare search engine configuration performance between Microsoft FAST and Apache Solr for the ScienceDirect search backend migration.
Design and implementation of optical character recognition using template mat... (eSAT Journals)
Abstract
Optical character recognition (OCR) is an efficient way of converting a scanned image into machine-encoded text that can then be edited. A variety of methods have been implemented in the field of character recognition. This paper proposes optical character recognition using template matching, with templates covering a variety of fonts and sizes. In the proposed system, image pre-processing, feature extraction and classification algorithms are implemented so as to build a good character recognition technique for different scripts. Results of this approach are also discussed in the paper. The system is implemented in Matlab.
Keywords: OCR, Feature Extraction, Classification
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co... (Databricks)
Boosted by Apache Spark’s data processing engine, machine learning as a service (MLaaS) is now faster and more powerful. However, Spark MLlib is still developing and is limited in its data preprocessing algorithms. In this session, learn how Suning R&D’s MLaaS platform abstracted, standardized and implemented a very rich machine learning pipeline on top of Spark, from data pre-processing, supervised and unsupervised modeling, and performance evaluation, to model deployment. Their featured Spark extensions are: (1) a rich function set for data pre-processing, such as missing data treatment, many types of sampling, outlier detection, advanced binning, etc.; (2) time series analysis/modeling algorithms; (3) a domain-specific library for finance, such as a cost-sensitive decision tree for fraud detection; (4) a user-friendly drag-and-play codeless modeling canvas.
Introduction to R for Learning Analytics Researchers (Vitomir Kovanovic)
The slides from my 2hr tutorial organised at 2018 Learning Analytics Summer Institute (LASI) at Teachers College, Columbia University on June 11, 2018.
Roger Labahn (University of Rostock, DE): Handwritten Text Recognition. Key concepts
co:op-READ-Convention Marburg
Technology meets Scholarship, or how Handwritten Text Recognition will Revolutionize Access to Archival Collections.
With a special focus on biographical data in archives
Hessian State Archives Marburg Friedrichsplatz 15, D - 35037 Marburg
19-21 January 2016
The talk explores the following topics:
- What is search relevance and why is it important?
- Relevance scoring in Elasticsearch
- Manipulating relevance with Query DSL structure
- Pros and cons in using Machine Learning for improving search relevance
- Using Learning to Rank (aka Machine Learning for better relevance) in Elasticsearch
Apache Solr is a powerful search and analytics engine with features such as full-text search, faceting, joins and sorting, capable of handling large amounts of data across a large number of servers. However, with all that power and scalability comes complexity. Solr 6 supports a Parallel SQL feature which provides a simplified, well-known interface to your data in Solr, performs key operations such as sorts and shuffling inside Solr for massive speedups, provides best-practices-based query optimization and, by leveraging the scalability of SolrCloud and a clever implementation, allows you to throw massive amounts of computation power behind analytical queries.
In this talk, we will explore the why, what and how of Parallel SQL and its building block Streaming Expressions in Solr 6 with a hint of the exciting new developments around this feature.
The Power of Auto ML and How Does it Work (Ivo Andreev)
Automated ML is an approach to minimize the need for data science effort by enabling domain experts to build ML models without deep knowledge of algorithms, mathematics or programming. The mechanism works by allowing end users to simply provide data, and the system automatically does the rest by determining the approach to perform a particular ML task. At first this may sound discouraging to those aspiring to the “sexiest job of the 21st century”, the data scientists. However, Auto ML should be considered a democratization of ML rather than automatic data science.
In this session we will talk about how Auto ML works, how it is implemented by Microsoft, and how it could improve the productivity of even professional data scientists.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... (Sérgio Sacani)
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
Seminar of U.V. Spectroscopy (SAMIR PANDA)
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V... (Wasswaderrick3)
In this book, we use conservation-of-energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow/velocity, and from this we derive the Poiseuille flow equation, the transition flow equation and the turbulent flow equation. In situations where there are no viscous effects, the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross-sectional areas connected together. We also extend our energy-conservation techniques to a sphere falling in a viscous medium under the effect of gravity. We demonstrate Stokes' equation of terminal velocity and the turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium, and at the general equation of terminal velocity.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN (Sérgio Sacani)
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
What are greenhouse gases and how many gases are there that affect the Earth? (moosaasad1975)
What are greenhouse gases, how do they affect the Earth and its environment, what is the future of the environment and the Earth, and how are weather and climate affected?
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini... (Scintica Instrumentation)
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows ultra-fast, high-resolution imaging of cellular processes over time and space, studied in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments, or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enables researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allows for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancements of novel therapeutic strategies.
Nutraceutical market, scope and growth: Herbal drug technology (Lokesh Patil)
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people want natural and preventative health solutions more and more, this industry is increasing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and provide significant chances for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Assessment of OCR quality and font identification in historical documents
2. What are historical documents?
– Correspondence
– Diaries
– Newspapers
– Government Documents
– Books
3. Digitizing historical documents
• Why?
– Historical records are in analog form
– Due to their fragility, most of them are not accessible
– Not searchable
• How to make them accessible?
– Digital text transcription
• Ways of digitization
– Hand transcribing each book: resource intensive
– OCR (optical character recognition): high error rate in text transcription
• Mass digitization projects
4. Early modern OCR project (eMOP)
• Goal
– Improve OCR accuracy for early modern texts
• 300k documents, 45M pages
– Open source OCR tools
• Challenges
– Early modern printing
• Irregular fonts
• Decorative page elements
– Document image problems
– Problems get severe
• Images are binarized
[Figure: example page highlighting pictures and decorative page elements]
5. Goals
[Diagram: Goal 1, automatic quality assessment: hOCR* goes in, a quality score comes out, and documents are split into good and bad; the good documents, with denoised hOCR and document images, feed Goal 2, active font identification, which produces font metadata (black font, roman, mixed).]
*hOCR output from Tesseract OCR.
6. Why do we want to assess OCR quality?
• Improve runtime
– Focus on documents with good OCR quality
– Send bad quality documents to a separate diagnostics pipeline
• How to measure OCR quality?
– A number of methods exist
• eMOP uses the Juxta score
• Measures similarity between the OCR output and the ground truth text
– But such scores need ground truth
• Not available for all documents
• We need an automated way to assess OCR quality
7. Our approach
• Post-process the OCR output
– Page segmentation results such as bounding box (BB) coordinates
– OCR word confidence
• Build ML models to remove noise
– Binary classification: classify each BB either as text or noise
• Quality_OCR ∝ 1 / (% noise BBs)
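A minimal sketch of this page-level quality measure, assuming per-BB text/noise predictions are already available; the function names and the epsilon guard are illustrative, not eMOP's code:

```python
# Turn per-bounding-box text/noise labels into a page-level quality estimate,
# following the slide's relation Quality_OCR ∝ 1 / (% noise BBs).

def noise_fraction(bb_labels):
    """bb_labels: iterable of 0 (noise) or 1 (text) predictions for one page."""
    labels = list(bb_labels)
    if not labels:
        return 1.0  # no boxes at all: treat the page as fully noisy
    return labels.count(0) / len(labels)

def quality_score(bb_labels, eps=1e-6):
    """Higher is better; inversely proportional to the noise fraction."""
    return 1.0 / (noise_fraction(bb_labels) + eps)

if __name__ == "__main__":
    page = [1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical predictions for 8 BBs
    print(noise_fraction(page), quality_score(page))
```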
10. Prefiltering
– Provides initial labels to be refined in later stages
– Rule-based classifier
• Uses BB properties and OCR word confidence
• Conjunction of rules
– Problems
• Many text BBs classified as noise
• Need a way to recover mis-classified text BBs
[Decision tree over BB height, width, area and OCR word confidence: Area of BB > 1st percentile? OCR word confidence in (0, 0.95)? Height/Width < 2? Boxes failing the tests are labelled non-text, the rest text.]
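A sketch of a conservative rule-based prefilter in this spirit. The thresholds mirror the ones named on the slide (1st-percentile area, confidence in (0, 0.95), height/width below 2), but the exact branch order of the original tree is hard to recover from the extracted text, so the conjunction below is an assumption:

```python
import numpy as np

def prefilter_labels(boxes):
    """boxes: list of dicts with 'w', 'h', 'conf' (OCR word confidence in [0, 1]).
    Returns initial labels: 1 = text, 0 = non-text."""
    areas = np.array([b["w"] * b["h"] for b in boxes], dtype=float)
    area_floor = np.percentile(areas, 1)        # tiny specks are treated as non-text
    labels = []
    for b, area in zip(boxes, areas):
        is_text = (
            area > area_floor
            and 0.0 < b["conf"] < 0.95          # confidence band named on the slide
            and b["h"] / max(b["w"], 1) < 2     # very tall, thin boxes are suspect
        )
        labels.append(1 if is_text else 0)
    return labels
```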
11. Column extraction
– Extract individual columns and then process each column
[Figure: bounding-box density profile between the leftmost and rightmost text BBs; troughs in the profile mark the column separators.]
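A sketch of column extraction as described in the talk: accumulate a bounding-box density profile along the x-axis and split at its troughs. The smoothing window and trough threshold below are illustrative choices, not values from the talk:

```python
import numpy as np

def split_columns(boxes, page_width, smooth=25):
    """boxes: list of (x0, x1) horizontal extents of text BBs.
    Returns x positions at which to split the page into columns."""
    profile = np.zeros(page_width, dtype=float)
    for x0, x1 in boxes:
        profile[int(x0):int(x1)] += 1.0           # BB coverage per pixel column
    kernel = np.ones(smooth) / smooth
    profile = np.convolve(profile, kernel, mode="same")
    occupied = np.nonzero(profile > 0)[0]
    if occupied.size == 0:
        return []
    left, right = occupied[0], occupied[-1]       # leftmost / rightmost text extent
    threshold = 0.05 * profile.max()
    in_trough, cuts = False, []
    for x in range(left, right):
        if profile[x] < threshold and not in_trough:
            in_trough, start = True, x
        elif profile[x] >= threshold and in_trough:
            in_trough = False
            cuts.append((start + x) // 2)         # cut in the middle of the trough
    return cuts
```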
12. Local iterative relabeling
– Refines the initial labels
– Based on BB properties and its neighborhood
– Applies an MLP classifier iteratively to refine the BB labels (text/noise)
Features used during local iterative relabeling:
– S: score from nearest neighbors; see eq. (1)
– C_OCR: OCR word confidence*
– H/W: height-to-width ratio of the BB*
– A: area of the BB*
– H_norm: normalized height, H_norm = (H − H_med) / H_IQR
– H_dist: horizontal distance from the middle of the page
– V_dist: vertical distance from the top margin
*available from the pre-filtering stage
Neighborhood score over the P nearest BBs within distance D_max (eq. 1):
S = (Σ_{k=1..P} w_k L_k) / (Σ_{k=1..P} w_k)
[Diagram: the BBs of a column and their initial labels from the pre-filtering stage are fed, together with the geometric features, to a multi-layer perceptron; if the new labels differ from the old labels, the process repeats until the labels stop changing. Labels: text or noise.]
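A sketch of the relabeling loop, assuming an already-trained MLP (for example scikit-learn's MLPClassifier) and a matrix of the geometric features listed above; the inverse-distance weighting and function names are assumptions for illustration:

```python
import numpy as np

def neighbour_score(i, centers, labels, P=8):
    """Weighted average of neighbouring labels: S = sum(w_k * L_k) / sum(w_k)."""
    d = np.linalg.norm(centers - centers[i], axis=1)
    d[i] = np.inf
    P = min(P, len(labels) - 1)
    if P <= 0:
        return float(labels[i])
    nn = np.argsort(d)[:P]
    w = 1.0 / (d[nn] + 1e-6)            # closer boxes weigh more (illustrative weighting)
    return float(np.sum(w * labels[nn]) / np.sum(w))

def iterative_relabel(mlp, geom_feats, centers, init_labels, max_iter=10):
    """geom_feats: (n, k) geometric features per BB; init_labels: 0/1 from prefiltering."""
    labels = np.asarray(init_labels, dtype=float)
    for _ in range(max_iter):
        S = np.array([neighbour_score(i, centers, labels) for i in range(len(labels))])
        X = np.hstack([S[:, None], geom_feats])   # [S, C_OCR, H/W, A, H_norm, H_dist, V_dist]
        new_labels = mlp.predict(X).astype(float)
        if np.array_equal(new_labels, labels):    # stop once labels no longer change
            break
        labels = new_labels
    return labels
```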
13. Final output
[Figure: an example page with the algorithm's predictions overlaid; bounding boxes are marked as text or noise, with numbered callouts and OCR confidence values shown in the legend.]
14. Results
• Label refinement: local iterative relabeling.
[Bar chart: precision, recall and F1 score (y-axis 0.85 to 1) for pre-filtering vs. after iterative relabeling]
• Dataset
– Binarized page images
– Images are selected to represent variety in the eMOP corpora
• Multi-page; single column; ink bleed-through; multiple skew; warping; printed margins
• Label creation
– Each BB returned by OCR is manually labelled as 0 (noise) or 1 (text)
– 72,366 BBs are labelled
[Plot: proportion of documents (0 to 100%) vs. number of iterations (1 to 5) until relabeling converges]
15. Quality assessment result
– % noise BBs = BB_noise
– Juxta score
• S_JW: similarity between the OCR output and the ground truth text
• eMOP uses juxta-cl* to generate S_JW
– 6,775 test documents with ground truth text
– Compare % noise BBs (BB_noise) with the Juxta score S_JW
[Scatter plot: BB_noise vs. S_JW, both on a 0 to 1 scale; correlation ρ = −0.7038]
*implementations from juxtacommons.org
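A sketch of this evaluation step: correlate the predicted noise fraction per document with its ground-truth-based Juxta similarity score. The slide does not say whether ρ is Pearson or Spearman, so both are computed here:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlate_with_juxta(noise_fracs, juxta_scores):
    """noise_fracs: predicted % noise BBs per document; juxta_scores: S_JW per document."""
    noise_fracs = np.asarray(noise_fracs, dtype=float)
    juxta_scores = np.asarray(juxta_scores, dtype=float)
    return {
        "pearson": pearsonr(noise_fracs, juxta_scores)[0],
        "spearman": spearmanr(noise_fracs, juxta_scores)[0],
    }
```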
17. Recap
[Diagram, as on slide 5: Goal 1, automatic quality assessment: hOCR* in, quality score out, documents split into good (yes) and bad (no); the denoised hOCR and document images feed Goal 2, active font identification, which produces font metadata (black font, roman, mixed).]
18. Why do we need font identification?
• Improve OCR quality
– The eMOP collections have documents in multiple fonts
– OCR systems work best when knowledge of the font is available
– We don't have a font database for the eMOP collections
• How?
– Manual tagging
• A human can label/tag each document
– Automatic tagging
• Machine learning models that can recognize fonts
– But font identification for eMOP
• Needs labeled data (a training set)
• Getting labeled data from millions of page images is expensive
– We need an efficient way to train supervised ML models
19. Our approach
• Active learning
– Allow the ML algorithm to acquire its own training data
– Select the most informative examples for labelling
– Build ML models using as little labeled data as possible
20. Active learning
• A learning paradigm
– Train a classifier using labelled data
– Sample the most informative instances: active sampling
– Ask for labels from a human
[Diagram: a small labelled set L trains the ML algorithm; the algorithm queries the most informative instances {X, ?} from the unlabelled pool U, receives {X, label} from the human, and adds them to L.]
21. Active learning for font identification
[Pipeline: TIFF page images are OCRed to hOCR; features are extracted from the TIFF and hOCR; a font classifier is trained (training); the most informative samples are selected and tagged (active sampling).]
22. Font classes and characteristics
• Blackletter: examples and characteristics — alternating thick and thin strokes, angled strokes
• Roman: examples and characteristics — horizontal serifs, similar vertical stroke width
23. Feature extraction
[Pipeline: denoised hOCR → preprocess word images → extract features from word images: mean and IQR character widths, slant line density, Zernike moments.]
24. Preprocess word image
• Normalize the height of word images
– Resize each word image to have the same height
• Remove salt-and-pepper noise
• Correct skew
– Calculate a time frequency distribution for different skew angles
– Pick the skew angle at which the distribution shows a peak
[Figures (a) and (b): example word images before and after preprocessing]
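A sketch of these preprocessing steps on a grayscale word image: height normalization, a median filter for salt-and-pepper noise, and deskewing by searching candidate angles. The projection-profile variance used here as the peak criterion is my stand-in for the slide's "distribution shows a peak" description, and the angle range and step are illustrative:

```python
import cv2
import numpy as np

def preprocess_word(img, target_h=64):
    """img: grayscale (uint8) word image."""
    # 1. normalise height, keeping the aspect ratio
    scale = target_h / img.shape[0]
    img = cv2.resize(img, (max(1, int(img.shape[1] * scale)), target_h))
    # 2. remove salt-and-pepper noise
    img = cv2.medianBlur(img, 3)
    # 3. deskew: try candidate angles, keep the one with the sharpest row profile
    best_angle, best_score = 0.0, -1.0
    h, w = img.shape[:2]
    for angle in np.arange(-15, 15.5, 0.5):
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        rot = cv2.warpAffine(img, M, (w, h), borderValue=255)
        profile = (255 - rot).sum(axis=1)          # ink per row
        score = profile.var()
        if score > best_score:
            best_angle, best_score = angle, score
    M = cv2.getRotationMatrix2D((w / 2, h / 2), best_angle, 1.0)
    return cv2.warpAffine(img, M, (w, h), borderValue=255)
```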
25. Feature extraction (recap of the pipeline): denoised hOCR → preprocess word images → extract features: mean and IQR character widths, slant line density, Zernike moments.
26. Mean and IQR character width
• Roman fonts have smaller vertical stroke width than Blackletter
– Mean character width
• Blackletter fonts have drastic differences in the stroke widths
– IQR character width
• How to capture these characteristics?
[Figure: a horizontal band around the middle of the word image (Mid − 20 to Mid + 20) used for measurement]
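A sketch of these stroke-width features: measure horizontal ink run lengths in a band around the vertical middle of the word image (the figure's Mid ± 20 band), then take the mean and inter-quartile range. The binarisation threshold and band size are illustrative assumptions:

```python
import numpy as np

def stroke_width_features(gray, band=20, ink_threshold=128):
    """gray: grayscale word image (2D uint8 array). Returns (mean width, IQR width)."""
    mid = gray.shape[0] // 2
    rows = gray[max(0, mid - band):mid + band]
    runs = []
    for row in rows:
        run = 0
        for px in row:
            if px < ink_threshold:                 # dark pixel: part of a stroke
                run += 1
            elif run:
                runs.append(run)
                run = 0
        if run:
            runs.append(run)
    if not runs:
        return 0.0, 0.0
    runs = np.array(runs)
    q75, q25 = np.percentile(runs, [75, 25])
    return float(runs.mean()), float(q75 - q25)
```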
27. Feature extraction (recap of the pipeline): denoised hOCR → preprocess word images → extract features: mean and IQR character widths, slant line density, Zernike moments.
28. Slant line density
• Blackletter fonts are characterized by angled lines and serifs
– Capture the amount of angled straight lines in a word image
– Density of angled lines per character
• How?
– Hough transform
– Number of lines with slope between 45° ± 5° and −45° ± 5°
– Divide by the number of characters (from the hOCR)
[Pipeline: word image → edge detection → edge image → Hough transform]
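A sketch of the slant-line-density feature using OpenCV: edge-detect the word image, run a probabilistic Hough transform, count lines whose slope is within 5° of ±45°, and divide by the character count from the hOCR. The Canny and Hough parameters here are illustrative:

```python
import cv2
import numpy as np

def slant_line_density(gray, n_chars):
    """gray: grayscale word image; n_chars: number of characters reported in the hOCR."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=15,
                            minLineLength=8, maxLineGap=2)
    if lines is None or n_chars == 0:
        return 0.0
    slanted = 0
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1)))
        if 40 <= angle <= 50 or 130 <= angle <= 140:   # within 5 degrees of +/-45
            slanted += 1
    return slanted / n_chars
```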
29. Feature extraction (recap of the pipeline): denoised hOCR → preprocess word images → extract features: mean and IQR character widths, slant line density, Zernike moments.
30. Zernike Moments
• Zernike Moments (ZMs) are shape descriptors
– Used to capture the visual appearance of the text (words)
• 6 ZMs along with their transformations
– A total of 15 features, similar to the ones used for a tumor classification problem by Tahmasbi et al. (2011).
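A sketch of extracting Zernike-moment descriptors from a binarised word image using the mahotas library. The radius, degree and binarisation threshold are illustrative; the talk's specific 6 moments and their transformations (15 features) are not reproduced here:

```python
import mahotas
import numpy as np

def zernike_features(gray, ink_threshold=128, degree=8):
    """gray: grayscale word image; returns an array of Zernike moments."""
    binary = (gray < ink_threshold)                # ink pixels as foreground
    radius = max(binary.shape) // 2                # radius covering the word image
    return mahotas.features.zernike_moments(binary, radius, degree=degree)
```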
33. Classifier
• Label propagation
– Graph-based semi-supervised classifier
– Uses labeled and unlabeled data to form a graph structure
– Labeled data act like sources that transmit labels to unlabeled data according to similarity w_ij
[Diagram: a graph with two labeled examples and several unlabeled examples, connected by edges weighted w_ij]
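A sketch of such a graph-based semi-supervised classifier using scikit-learn's LabelPropagation, where unlabeled points are marked with -1; the kernel choice, gamma value and toy data are illustrative:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

def fit_label_propagation(X, y_partial):
    """X: (n, d) feature matrix; y_partial: labels with -1 for unlabelled rows."""
    model = LabelPropagation(kernel="rbf", gamma=20)
    model.fit(X, y_partial)
    return model

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
    y = -np.ones(40, dtype=int)
    y[0], y[20] = 0, 1                     # two labelled examples, as in the slide's diagram
    model = fit_label_propagation(X, y)
    print(model.transduction_)             # propagated labels for all points
```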
34. Recap
[Pipeline: TIFF → OCR → hOCR; feature extraction; train font classifier; select samples (active sampling).]
35. Active sampling
[Diagram of the feature space with the classifier decision boundary between class 1 and class 2, illustrating uncertainty-based sampling (HS), dissimilarity-based sampling (DS), and diversity (D′).]
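A sketch of one active-sampling round combining the three criteria named on the slide: uncertainty near the decision boundary (HS), dissimilarity from the already-labelled pool (DS), and diversity within the selected batch (D′). The scoring weights and greedy batch construction are assumptions, not the talk's exact formulation:

```python
import numpy as np
from scipy.spatial.distance import cdist

def select_batch(probs, X_unlabelled, X_labelled, batch_size=20, alpha=0.5):
    """probs: (n, n_classes) class probabilities for the unlabelled pool."""
    uncertainty = 1.0 - probs.max(axis=1)                        # HS: low-confidence points
    dissimilarity = cdist(X_unlabelled, X_labelled).min(axis=1)  # DS: far from labelled data
    dissimilarity = dissimilarity / (dissimilarity.max() + 1e-9)
    score = alpha * uncertainty + (1 - alpha) * dissimilarity
    chosen = []
    for _ in range(min(batch_size, len(score))):
        i = int(np.argmax(score))
        chosen.append(i)
        score[i] = -np.inf
        # D': down-weight points close to what is already in the batch
        d_to_batch = cdist(X_unlabelled, X_unlabelled[[i]]).ravel()
        score = score - 0.1 * np.exp(-d_to_batch)
    return chosen
```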
36. Recap
[Pipeline: TIFF → OCR → hOCR; feature extraction; select samples; tag samples; train font classifier.]
37. Results
• Dataset
– 3272 documents from ECCO and EEBO collections
– eMOP experts labeled documents
• 1005 Black documents
• 1768 Roman documents
• 498 Mixed documents – text printed in both fonts
38. Experiment 1
• Quality of extracted features: word level
[Setup: 500 Roman and 500 blackletter word images → feature extraction → classifier → predicted label (blackletter word vs. roman word). Feature sets compared: ALL; Zernike Moments (ZMs) only; mean and IQR CW plus SLD.]
39. Result 1
Cross-validation F1 score by feature set:
– ALL: 0.8433
– Only ZMs: 0.805
– CW (mean & IQR) and SLD: 0.6717
42. Experiment 3
• Performance of the active learning model
[Protocol: start with 3 labeled examples in L; train the ML model; select the most informative instances {X, ?} from the unlabeled pool U; query labels for 20 instances; add {X, label} to L; repeat 20 times, storing validation accuracy on the test set T each round.]
45. Future work
• Automatic assessment of OCR quality
– Linguistic features can be explored
– The denoised hOCR can be used to detect unknown kinds of noise
• Bleed-through, irregular fonts, speckle noise, etc.
• Are there any other types of page problems in the eMOP collections?
• Active learning based font identification
[Diagram: the active-learning pipeline (TIFF → OCR → hOCR; feature extraction; select samples; tag samples; update the font classifier), instantiated per problem type: font, bleed-through, musical scripts, pictures.]
46. Conclusion
• Summary
– Automatic assessment of OCR quality
• Non-text OCR outputs suffice to
– identify text and noise in a document image
– estimate the document's overall quality
– improve OCR transcription performance when image-processing-based preprocessing is prohibitive
– Active learning based font identification
• Word image features capture the font characteristics
• Bag-of-word features show good class separability
• A robust font classifier is trained using just 443 labeled instances
Good afternoon everyone! I am Anshul Gupta, and today I am going to present my work on automatic assessment of OCR quality in historical documents.
So... what are historical documents? Anything that can give us information about a certain event in the past or about a particular period.
Some examples of historical documents are correspondence, diaries, newspapers, government documents, and books.
In this presentation we will focus on old printed books, along with algorithms to improve their digitization quality.
Since these documents are in printed form, they degrade with time. Due to their fragility, most of these documents are not accessible.
So, basically, digitization helps to preserve these documents and also makes them searchable.
How can we make them searchable?
We can get digital text for these documents and then plug this text into a search engine. Suddenly we can make billions of historical documents searchable. But the challenge here is: how can we get this digital text transcription?
One naive way is to hand-transcribe each document; given billions of documents, that is definitely not a feasible option.
We need to use an automated method, that is, optical character recognition. These systems are great but generate high-error text output. Hence, we need to customize these systems for historical documents.
Some of the successful mass digitization projects are from the Library of Congress, Google Books, ProQuest, Gale, and the Early Modern OCR Project.
The eMOP project is still in progress, and the work that I am presenting today is a part of eMOP.
So, let's see what eMOP is all about.
Explain the font identification… introduce
The two goals of eMOP are:
To improve OCR of early texts, that is, texts printed between the 14th century and the 18th century.
The second goal is to produce open source tools such as font databases, crowdsourcing correction tools, and post-processing tools.
The database of images contains 45 million page images, and these images have a variety of problems.
So, the first set of problems arises due to early modern printing. At that time the printing process was not formalized. They used very odd fonts, as shown in the zoomed picture. This is called blackletter font, and these fonts vary from image to image.
As shown in the highlighted region, these are decorative elements, and when we OCR this page, the OCR system sometimes recognizes these elements as valid text.
Other issues are related to the degradation of these documents.
So when the images of these documents are generated, we get issues such as faded fonts, black patches due to torn pages, and multiple skews on the same page image.
All these issues get more severe because all the images that we have are low-quality binarized images.
Hence, when we OCR these documents we get lots of junk text.
But the fact is that not all the documents are of such bad quality. So, the challenge here is: can we separate the good quality documents from the bad ones?
Talk about binarized images, so you need not talk about it later.
Talk about two main goals
First part is…
Animate it
Increase the fonts
So, our approach to measuring OCR quality is to post-process OCR outputs such as the OCR bounding box coordinates and the OCR word confidence.
We basically pose this problem of measuring quality as a binary classification problem where we want to classify each bounding box either as text or noise.
Once we have labels (noise or text) for all bounding boxes, we can get the OCR quality from the percentage of predicted noise BBs.
Also, our approach does not depend on the text written on the page image.
Here, when we passed this document image through the OCR pipeline, we got these green bounding boxes as output.
When we pass this OCR output to our algorithm, we are just passing these rectangles. Hence, this makes our algorithm language-agnostic.
So, here is the block diagram showing the steps of the assessment algorithm.
In step 1, the algorithm generates an initial set of labels. Then it divides the page image into its constituent columns. It then locally refines the bounding box labels.
As I mentioned, prefiltering generates the initial set of labels. It uses a rule-based classifier, as represented by this tree.
Since we designed our algorithm to be conservative in predicting text, it loses many text BBs at this point. Hence, we need to recover these text BBs.
In order to extract contextual information, the page image is divided into its constituent columns. For this, we first generate a bounding box density profile along the x-axis. The troughs in this profile represent the column separators.
In this step we process each column separately.
So, the idea behind this step is that in a book, a word is usually surrounded by more words. Hence, we embedded this local information into our algorithm by constructing a local feature. With this local feature and other geometric features, we trained a multi-layer perceptron. We then used this trained MLP model to iteratively refine the BB labels.
The process goes like this: we get the bounding boxes for a column, and for each bounding box we calculate the local feature as a weighted average of the labels of the neighbouring boxes. We then pass this local feature and the other features to the MLP model, which outputs new labels. If the new labels are not equal to the old labels, we use the new labels to recalculate the local feature, and the process is repeated until the labels stop changing.
So the final output looks like this. Here the predicted noise is in red and the predicted text is in green. We can see that the algorithm has done a good job of predicting non-text as noise; for example, here a picture is classified as noise. Also, it has found noise even when it is buried among text bounding boxes.
Now let's see how well the proposed algorithm works.
To evaluate the algorithm, we selected a set of images that represents the variety in the eMOP database.
Then we hand labelled around 72,000 bounding boxes.
So now let's look at how well local iterative relabeling works.
In this plot, the blue bars are the prefiltering results and the red bars are the results after local iterative relabeling. We can see that both precision and recall have improved after local refinement. This means that local refinement makes the algorithm more precise in predicting what is text and also recovers text lost in the prefiltering step.
Another important aspect of the local refinement is its convergence rate. For that, we plotted the proportion of documents for which local refinement converged within a certain number of iterations.
We can see that for almost all documents the local refinement step converged within 4 iterations.
So the classification problem that I presented here is basically a filtering problem: we are trying to filter out noise. Hence, it makes sense to see how good the filtering quality is.
For this, we selected around 6700 documents to generate our test set. For all these documents we had ground truth text transcription.
Also, whenever ground truth text is available, eMOP compares the OCR text output for that document image with its ground truth.
A similarity score S_JW is generated by eMOP.
So, for each of these test documents, we calculated this similarity score before and after applying our algorithm. Then we calculated the change in similarity. We plotted that along the y-axis versus the noisiness present on a document image.
We can see that the filtering has a large effect on very noisy documents and a small effect on good quality ones.
Also, for 85% of these 6,700 documents, our algorithm gave a positive change.
Correct the lp diagram
Feature works
Features
Get results for just zms, cw+sl+iqr; combine
So, to summarize
This work shows that OCR outputs such as BB coordinates and word confidence can be used to identify text and noise, can also be used to measure a document's overall quality, and, wherever preprocessing-based filtering is prohibitive, this algorithm can be used in the post-processing stage to remove noise.
Currently, I am working on building a diagnostic pipeline using active learning.
Also, adding linguistic features such as character n-grams can give us cues about certain kinds of noise.
Thank you all for your attention.
And now I am open to questions.