1. End-to-End Text Recognition with
Convolutional Neural Networks
Tao Wang*, David J. Wu*, Adam Coates, Andrew Y. Ng
Computer Science Department
Stanford University
* Denotes equal contribution
2. Tao Wang 2
Scene Text Recognition Overview
• Text “in the wild” is hard to recognize
• Wide range of variations in backgrounds,
textures, fonts, and lighting conditions
Street View Text Dataset (K. Wang et al., 2011)
ICDAR 2003 Dataset (S. Lucas et al., 2003)
4.
Works | Classification and detection | High-level inference
Neumann and Matas, 2012 | MSER + SVM with RBF Kernel | Exhaustive Graph Search
Mishra et al., 2012 | HOG + SVM with RBF Kernel | CRF + N-gram model
K. Wang et al., 2011 | HOG + Random Ferns | Pictorial Structure
Weinman et al., 2008 | Appearance + Geometry | Semi-Markov CRF
5.
Approach | Classification and detection | High-level inference
Most other approaches | Hand-designed features + off-the-shelf classifier | Graph based inference models
Our approach | Learnt features + 2-layer CNN | Simple off-the-shelf heuristics
6.
Various Benchmarks
• Detection/Classification
- ICDAR 62-way cropped character classification (SOTA)
• End-to-end system after high-level inference
- ICDAR and SVT cropped word recognition (SOTA on ICDAR)
- ICDAR and SVT end-to-end text recognition with a lexicon (SOTA)
7. Unsupervised Feature Learning
Contrast Normalization + ZCA whitening
K-Means
Coates et al., 2011
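The pipeline named on this slide (contrast normalization, ZCA whitening, then K-means) could be sketched roughly as follows. This is only an illustration under stated assumptions: the patch size (8x8), the regularization constants `eps`, the number of filters, and the use of plain Lloyd-style K-means are choices made for the example, not values taken from the slides, and random arrays stand in for real image patches.

```python
import numpy as np

def contrast_normalize(patches, eps=10.0):
    """Subtract each patch's mean and divide by its (regularized) std."""
    mean = patches.mean(axis=1, keepdims=True)
    var = patches.var(axis=1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)

def zca_whiten(patches, eps=0.1):
    """Decorrelate the pixel dimensions while staying close to pixel space."""
    mean = patches.mean(axis=0)
    centered = patches - mean
    cov = centered.T @ centered / centered.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # ZCA transform: V * diag(1 / sqrt(lambda + eps)) * V^T
    transform = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ transform, mean, transform

def kmeans_filters(patches, num_filters, iters=10, seed=0):
    """Plain Lloyd-style K-means; the centroids act as a filter bank."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(patches), num_filters, replace=False)
    centroids = patches[start].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every patch to every centroid.
        dists = ((patches ** 2).sum(axis=1)[:, None]
                 - 2.0 * patches @ centroids.T
                 + (centroids ** 2).sum(axis=1)[None, :])
        labels = dists.argmin(axis=1)
        for k in range(num_filters):
            members = patches[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids

# Hypothetical stand-in data: 500 random 8x8 patches flattened to 64 values.
rng = np.random.default_rng(0)
patches = rng.random((500, 64))
normalized = contrast_normalize(patches)
whitened, mean, transform = zca_whiten(normalized)
filters = kmeans_filters(whitened, num_filters=16)
```

Each row of `filters` can be reshaped to 8x8 and used as a first-layer convolution filter; new patches would need the same `mean` and `transform` applied before filtering.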
8.
1st layer (96): Convolution, Spatial Pooling
2nd layer (256): Convolution, Spatial Pooling
L2-SVM Classifier: √ Text / × Non-Text
Backpropagation
Large representation but not enough data. Overfitting?
~10K parameters for detection
~50K parameters for classification
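A forward pass through two convolution + spatial pooling stages followed by a linear text/non-text score might look like the sketch below. Everything beyond the two-stage layout is an assumption made for illustration: the 32x32 input, the 8x8 and 2x2 filter sizes, average pooling, the rectifying nonlinearity, and the random weights (which stand in for learnt filters and a trained L2-SVM).

```python
import numpy as np

def convolve_valid(inputs, filters):
    """Correlate a (C, H, W) input with (K, C, fh, fw) filters, then rectify."""
    num_filters, _, fh, fw = filters.shape
    _, height, width = inputs.shape
    out = np.zeros((num_filters, height - fh + 1, width - fw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = inputs[:, i:i + fh, j:j + fw]
            # Contract channel, row and column axes for all K filters at once.
            out[:, i, j] = np.tensordot(filters, window, axes=3)
    return np.maximum(out, 0.0)  # assumed nonlinearity

def average_pool(maps, size):
    """Non-overlapping average pooling; rows/columns that do not fit are dropped."""
    k, h, w = maps.shape
    h2, w2 = h // size, w // size
    trimmed = maps[:, :h2 * size, :w2 * size]
    return trimmed.reshape(k, h2, size, w2, size).mean(axis=(2, 4))

def forward(image, filters1, filters2):
    """Two convolution + pooling stages, flattened into one feature vector."""
    layer1 = average_pool(convolve_valid(image[None, :, :], filters1), 5)
    layer2 = average_pool(convolve_valid(layer1, filters2), 2)
    return layer2.ravel()

rng = np.random.default_rng(0)
filters1 = rng.standard_normal((96, 1, 8, 8)) * 0.1     # hypothetical 1st-layer filters
filters2 = rng.standard_normal((256, 96, 2, 2)) * 0.1   # hypothetical 2nd-layer filters
svm_weights = rng.standard_normal(256 * 2 * 2) * 0.01   # stand-in for a trained linear SVM

image = rng.random((32, 32))        # stand-in for a grayscale image window
features = forward(image, filters1, filters2)
score = float(features @ svm_weights)
is_text = score > 0                 # positive score read as "text"
```

In a trained system the filters and classifier weights would be fine-tuned jointly by backpropagating the classifier's loss, which this forward-only sketch does not show.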
9.
Synthetic Data
• Java.Font + Natural backgrounds
• Color Statistics
• Synthetic “hard negatives”
[Figure panels: Real Data; Synthetic Data; Unrealistic Synthetic Data]
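One way to picture the "font + natural background" idea is to blend a rendered glyph mask over a background patch. The sketch below assumes a grayscale setting; the mask is a crude hand-made bar standing in for a glyph from a font renderer, the background is random noise standing in for a natural image crop, and `composite_glyph` is a hypothetical helper, not something named in the slides.

```python
import numpy as np

def composite_glyph(mask, background, text_intensity):
    """Blend glyph ink over a background; mask is 1.0 where ink is present."""
    return mask * text_intensity + (1.0 - mask) * background

rng = np.random.default_rng(0)
mask = np.zeros((32, 32))
mask[8:24, 14:18] = 1.0              # crude vertical bar in place of a rendered character
background = rng.random((32, 32))    # stand-in for a natural-image patch
sample = composite_glyph(mask, background, text_intensity=0.9)
```

In a fuller pipeline, `text_intensity` and the background would be drawn so that their statistics resemble real scene text, which is presumably what "Color Statistics" on the slide refers to.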
22.
[Equation: “confidence margin”, built from max( ) and max({ }) terms in n, m, c]
PEOSTEL, PEOST, POST, POS → Hunspell → Suggested Words (LEXICON): POSE, POST, PEOPLE, PISTOL, …
Our F-score: 0.38
Neumann and Matas, 2010: 0.40
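The lexicon-building step on this slide (raw strings passed to a spelling checker, suggestions collected into a lexicon) can be illustrated with a small sketch. Note the swap: this uses Python's standard `difflib.get_close_matches` as a stand-in for Hunspell, so the suggestions will not match Hunspell's. The dictionary, the `cutoff`, and the function name `suggest_lexicon` are hypothetical choices for the example.

```python
import difflib

def suggest_lexicon(raw_strings, dictionary, max_suggestions=3, cutoff=0.6):
    """Collect close dictionary matches for every raw string into one lexicon."""
    lexicon = set()
    for raw in raw_strings:
        matches = difflib.get_close_matches(
            raw, dictionary, n=max_suggestions, cutoff=cutoff)
        lexicon.update(matches)
    return lexicon

# Words from the slide plus one unrelated distractor (hypothetical dictionary).
dictionary = ["POSE", "POST", "PEOPLE", "PISTOL", "BANANA"]
raw_strings = ["PEOSTEL", "PEOST", "POST", "POS"]
lexicon = suggest_lexicon(raw_strings, dictionary)
```

The resulting `lexicon` would then play the role of the fixed word list used in the lexicon-based recognition setting.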
23. Conclusion
• Learnt features + 2-layer CNN for character detection and classification
• Simple heuristics to build an end-to-end scene text recognition system
• State-of-the-art performance on
- ICDAR cropped character classification
- ICDAR cropped word recognition
- Lexicon-based end-to-end recognition on ICDAR and SVT
• Extensible to a more general lexicon with an off-the-shelf spelling checker