Scene Text Recognition in Images – A
Deep Learning Era Survey
Scene texts contains more semantic information which has increased
attention in recent years. Basically, text identification has brought
enormous applications in computer vision as well as natural language
processing fields. Techniques developed in the recent times are
addressing traditional problems and identifying new applications. I am
interested to survey on the approaches and state-of-art methods that are
developed from past couple of years in the Text Recognition system.
This article gives you an understanding on following
1. Work on a High-Level
2. Previous techniques
3. Recent advances
4. Popular Datasets available
5. Future areas
Briefing of the work:
Techniques used to detect and recognize the text in scene images and
videos are categorized into 3
Text Detection and Localization: The technique helps in finding out
whether a text is present in the given scene image or video, if present it
identifies the location of the text.
Challenges: Different orientation, different languages, colors and sizes,
complex background occlusion, blur, noise non-uniform illumination
Text Recognition: Text recognition aims at converting the localized text
in images into character coding.
Challenges:
1. Scene Complexity: Images or videos generally suffer from noise,
distortion, non-uniform illumination, partial occlusion, as well as
confusion of the text and background. Complex background brings
some obstacles to text detection or recognition in real world.
2. Text Diversity and more stringent practical requirements: Scene
text vary in color, size, orientation, font, language, and text partial
deletion, etc.
End-to-End Text Recognition: This combines both text detection and
localization with recognition.
ClassicalPreviousMethodsand theiradapted
Methodologies
 Photo OCR system for text extraction
1. Text detection
2. Segmentation (Niblack Binarization (a morphological approach),
Binary sliding window classifier)
3. Beam search
4. character classifier (A full connected network with Relu Units)
5. Language Modeling (A standard N-gram model)
 Real-Time lexicon Free scene Text localization and recognition
1. Character Detection: (Extremal Regions), Incrementally Computable
Descriptors (Area, Bounding Box, Perimeter), Sequential classifier
(firstly-a real Ada-boost classifier with decision trees, secondly —
SVM+ RBF kernel)
2. Text Line Formation
3. Character Recognition
4. Sequence Selection
 RRN with Attention Modeling for OCR in the wild
1. Character Sequence Model (encodes image features using Recursive
Convolution layers decodes text using RNN’s)
2. RNN with attention function.
RecentAdvances
Text Detection and localization: This mainly focuses on processing
the images identifying the text and its positions. This is broadly
categorized in to below three forms.
connected component (CC)-based methods: Find the smaller
components and combine it to one larger and filter out non-text
components using classifier. Finally extract the text from images and
combine into one text region. An arguable limitation for the above
methodology is Handling rotation scale changes and complex
backgrounds.
Overcoming techniques:
1. Maximally stable extremal Regions (MSER): It provides robustness
for geometric and illumination conditions. Whereas it only adapting
to horizontal texts is its disadvantage.
2. Stroke Width Transformer (SWT): It seems very efficient and has
advantages of detecting text in any fonts, languages. Insensitive to
directions and multi-scales.
Texture-based methods: Idea is finding the text in images which has
distinct textual properties which can be separated from background.
Below techniques are mostly used in this method.
1. Gabor filters
2. Wavelet Transformation
3. Fast Fourier transformation
4. Discrete cosine Transform (DCT) Domain
5. Laplacian Wavelet, Wavelet decomposition
6. symmetry-based text line detector under the observation of the
symmetry and self- similarity properties of character groups.
Deep Learning-based methods: CNN’s has entirely changed, widely
explored and answered the unresolved questions. Main advantage hit by
the CNN’s is with less computationalcosts able to extract the features
from the images directly. Advanced properties of CNN have helped a lot
in scene detection in natural images. CNN’s implementations can be
broadly fall into 3 different groups within this deep learning-based
method they are below…
1. Region Proposal Based Methods: Simple CNN’s with MSER, R-
CNN’s and their advances— Its approach is to instance segmentation.
A Masked R-CNN uses a Bounding box to generate the object
segmentation by a shaded mask also called as semantic segmentation
technique. A Faster R-CNN can process a classification and detection
of objects in images. R-CNN’s uses a bounding box detection thus
creating the boxes around the objects in the images. By using
Regional proposal network an attention mechanism can happen in
faster R-CNN in 2 stages. These bounding boxes and determining
regions of interest using RPN protocol for each ROI we define the
class label called ROI Pooling. A Pixel-by-Pixel fully convolutional
networks can also be categorized in this method.
2. Segmentation-Based Methods: These mainly focus on producing
more precise multi size text detection but ineffective in detecting
individual words. Like a Cascaded two-convolutional network +
TextSegNet + YOLO(WordDetNet). Also, a super pixel segmentation
with hierarchical clustering for new character candidate extraction
method also comes in segmentation technique.
3. Hybrid Methods Using Multitask Learning: It accounts both CC
based methods and texture-based methods. Below techniques can
account for.
a. Character candidate detection (cascade boosting technique), min-cost
flow network (False character candidate removal, text line extraction,
text line verification).
b. Connectionist Text Proposal Network which can be extended to
multilingual and multi-scale text detection (vertical anchor mechanism,
in-network RNN)
c. Text-attentional convolutional neural network (Text-CNN)-contrast-
enhanced MSER.
Text Recognition: Identifying and understanding the text in the
candidate region. This is mainly classified into 3 ways as below.
Character Based Methods:
1. “Strokelets” -whose essence is a set of multi-scale mid-level
primitives and can be automatically learned from bounding box labels.
It’s very good in describing the characters.
2. Character Recognition by extracting low level features and integrated
automatically via region-based feature pooling technique.
Word Based Methods-Recognizing text at word level:
1. Dense SIFT in a bag-of-key points framework, character could be
recognized robustly
2. word segmentation with recognition in the probabilistic framework,
Lexical decision and sparse beam search tools were used to improve
the recognition accuracy.
Sequence-Based Methods: text is represented in character sequences.
1. An irregular text recognition which is called RARE (Robust text
recognizer with Automatic Rectification). This model combined
spatial transformer network (STN) and a sequence recognition
network (SRN)
2. Lexicon-free photo-OCR system called recursive recurrent neural
networks with attention modeling (R2AM)
3. Convolutional recurrent neural network (CRNN)
4. Auxiliary dense character detection model and an attention model.
End-to-End Text recognition System: CNN’s drastically changed
the procedures and techniques of attempting for combined model which
is one stop place of doing Text detection and text recognition. Text boxes
are been developed as part of advances in proposed end-to-end models.
A sliding window and connected component methods are best proposed
techniques, where they proposed an unconstrained end-to-end real-time
text localization and recognition method.
Benchmark Datasets for scene text identification in Images are below:
https://airtable.com/tbl3hLG7GneipFrDD/viwGd2IomvBf5D5U
d?blocks=show
Future areas:
We have witnessed numerous approaches under different categories and
methods with the same rapid development, especially CNN+RNN
framework is quite popular. Now, let’s discuss some future exciting stuff.
1. Complex scene and large-scale dataset: you can refer COCO dataset
and try to build multi challenging cases
2. Multilingual detection and recognition: Identifying multiple
languages like scripts identification within broad backgrounds and
text sizes project is one of my aspiring areas
3. Real-time detection and recognition: Identifying the texts from
images on real time is also a greatest focusing areas.

Scene Text detection in Images-A Deep Learning Survey

  • 1.
    Scene Text Recognitionin Images – A Deep Learning Era Survey
  • 2.
    Scene texts containsmore semantic information which has increased attention in recent years. Basically, text identification has brought enormous applications in computer vision as well as natural language processing fields. Techniques developed in the recent times are addressing traditional problems and identifying new applications. I am interested to survey on the approaches and state-of-art methods that are developed from past couple of years in the Text Recognition system. This article gives you an understanding on following 1. Work on a High-Level 2. Previous techniques 3. Recent advances 4. Popular Datasets available 5. Future areas Briefing of the work: Techniques used to detect and recognize the text in scene images and videos are categorized into 3 Text Detection and Localization: The technique helps in finding out whether a text is present in the given scene image or video, if present it identifies the location of the text.
  • 3.
    Challenges: Different orientation,different languages, colors and sizes, complex background occlusion, blur, noise non-uniform illumination Text Recognition: Text recognition aims at converting the localized text in images into character coding. Challenges: 1. Scene Complexity: Images or videos generally suffer from noise, distortion, non-uniform illumination, partial occlusion, as well as confusion of the text and background. Complex background brings some obstacles to text detection or recognition in real world. 2. Text Diversity and more stringent practical requirements: Scene text vary in color, size, orientation, font, language, and text partial deletion, etc. End-to-End Text Recognition: This combines both text detection and localization with recognition. ClassicalPreviousMethodsand theiradapted Methodologies  Photo OCR system for text extraction 1. Text detection 2. Segmentation (Niblack Binarization (a morphological approach), Binary sliding window classifier) 3. Beam search
  • 4.
    4. character classifier(A full connected network with Relu Units) 5. Language Modeling (A standard N-gram model)  Real-Time lexicon Free scene Text localization and recognition 1. Character Detection: (Extremal Regions), Incrementally Computable Descriptors (Area, Bounding Box, Perimeter), Sequential classifier (firstly-a real Ada-boost classifier with decision trees, secondly — SVM+ RBF kernel) 2. Text Line Formation 3. Character Recognition 4. Sequence Selection  RRN with Attention Modeling for OCR in the wild 1. Character Sequence Model (encodes image features using Recursive Convolution layers decodes text using RNN’s) 2. RNN with attention function. RecentAdvances Text Detection and localization: This mainly focuses on processing the images identifying the text and its positions. This is broadly categorized in to below three forms.
  • 5.
    connected component (CC)-basedmethods: Find the smaller components and combine it to one larger and filter out non-text components using classifier. Finally extract the text from images and combine into one text region. An arguable limitation for the above methodology is Handling rotation scale changes and complex backgrounds. Overcoming techniques: 1. Maximally stable extremal Regions (MSER): It provides robustness for geometric and illumination conditions. Whereas it only adapting to horizontal texts is its disadvantage. 2. Stroke Width Transformer (SWT): It seems very efficient and has advantages of detecting text in any fonts, languages. Insensitive to directions and multi-scales. Texture-based methods: Idea is finding the text in images which has distinct textual properties which can be separated from background. Below techniques are mostly used in this method. 1. Gabor filters 2. Wavelet Transformation 3. Fast Fourier transformation 4. Discrete cosine Transform (DCT) Domain 5. Laplacian Wavelet, Wavelet decomposition
  • 6.
    6. symmetry-based textline detector under the observation of the symmetry and self- similarity properties of character groups. Deep Learning-based methods: CNN’s has entirely changed, widely explored and answered the unresolved questions. Main advantage hit by the CNN’s is with less computationalcosts able to extract the features from the images directly. Advanced properties of CNN have helped a lot in scene detection in natural images. CNN’s implementations can be broadly fall into 3 different groups within this deep learning-based method they are below… 1. Region Proposal Based Methods: Simple CNN’s with MSER, R- CNN’s and their advances— Its approach is to instance segmentation. A Masked R-CNN uses a Bounding box to generate the object segmentation by a shaded mask also called as semantic segmentation technique. A Faster R-CNN can process a classification and detection of objects in images. R-CNN’s uses a bounding box detection thus creating the boxes around the objects in the images. By using Regional proposal network an attention mechanism can happen in faster R-CNN in 2 stages. These bounding boxes and determining regions of interest using RPN protocol for each ROI we define the class label called ROI Pooling. A Pixel-by-Pixel fully convolutional networks can also be categorized in this method. 2. Segmentation-Based Methods: These mainly focus on producing more precise multi size text detection but ineffective in detecting individual words. Like a Cascaded two-convolutional network + TextSegNet + YOLO(WordDetNet). Also, a super pixel segmentation with hierarchical clustering for new character candidate extraction method also comes in segmentation technique.
  • 7.
    3. Hybrid MethodsUsing Multitask Learning: It accounts both CC based methods and texture-based methods. Below techniques can account for. a. Character candidate detection (cascade boosting technique), min-cost flow network (False character candidate removal, text line extraction, text line verification). b. Connectionist Text Proposal Network which can be extended to multilingual and multi-scale text detection (vertical anchor mechanism, in-network RNN) c. Text-attentional convolutional neural network (Text-CNN)-contrast- enhanced MSER. Text Recognition: Identifying and understanding the text in the candidate region. This is mainly classified into 3 ways as below. Character Based Methods: 1. “Strokelets” -whose essence is a set of multi-scale mid-level primitives and can be automatically learned from bounding box labels. It’s very good in describing the characters. 2. Character Recognition by extracting low level features and integrated automatically via region-based feature pooling technique.
  • 8.
    Word Based Methods-Recognizingtext at word level: 1. Dense SIFT in a bag-of-key points framework, character could be recognized robustly 2. word segmentation with recognition in the probabilistic framework, Lexical decision and sparse beam search tools were used to improve the recognition accuracy. Sequence-Based Methods: text is represented in character sequences. 1. An irregular text recognition which is called RARE (Robust text recognizer with Automatic Rectification). This model combined spatial transformer network (STN) and a sequence recognition network (SRN) 2. Lexicon-free photo-OCR system called recursive recurrent neural networks with attention modeling (R2AM) 3. Convolutional recurrent neural network (CRNN) 4. Auxiliary dense character detection model and an attention model. End-to-End Text recognition System: CNN’s drastically changed the procedures and techniques of attempting for combined model which is one stop place of doing Text detection and text recognition. Text boxes are been developed as part of advances in proposed end-to-end models. A sliding window and connected component methods are best proposed
  • 9.
    techniques, where theyproposed an unconstrained end-to-end real-time text localization and recognition method. Benchmark Datasets for scene text identification in Images are below: https://airtable.com/tbl3hLG7GneipFrDD/viwGd2IomvBf5D5U d?blocks=show Future areas: We have witnessed numerous approaches under different categories and methods with the same rapid development, especially CNN+RNN framework is quite popular. Now, let’s discuss some future exciting stuff. 1. Complex scene and large-scale dataset: you can refer COCO dataset and try to build multi challenging cases 2. Multilingual detection and recognition: Identifying multiple languages like scripts identification within broad backgrounds and text sizes project is one of my aspiring areas 3. Real-time detection and recognition: Identifying the texts from images on real time is also a greatest focusing areas.