Scene Text detection in Images-A Deep Learning Survey

Scene Text Recognition in Images – A
Deep Learning Era Survey

Scene texts contains more semantic information which has increased
attention in recent years. Basically, text identification has brought
enormous applications in computer vision as well as natural language
processing fields. Techniques developed in the recent times are
addressing traditional problems and identifying new applications. I am
interested to survey on the approaches and state-of-art methods that are
developed from past couple of years in the Text Recognition system.
This article gives you an understanding on following
1. Work on a High-Level
2. Previous techniques
3. Recent advances
4. Popular Datasets available
5. Future areas
Briefing of the work:
Techniques used to detect and recognize the text in scene images and
videos are categorized into 3
Text Detection and Localization: The technique helps in finding out
whether a text is present in the given scene image or video, if present it
identifies the location of the text.

Challenges: Different orientation, different languages, colors and sizes,
complex background occlusion, blur, noise non-uniform illumination
Text Recognition: Text recognition aims at converting the localized text
in images into character coding.
Challenges:
1. Scene Complexity: Images or videos generally suffer from noise,
distortion, non-uniform illumination, partial occlusion, as well as
confusion of the text and background. Complex background brings
some obstacles to text detection or recognition in real world.
2. Text Diversity and more stringent practical requirements: Scene
text vary in color, size, orientation, font, language, and text partial
deletion, etc.
End-to-End Text Recognition: This combines both text detection and
localization with recognition.
ClassicalPreviousMethodsand theiradapted
Methodologies
 Photo OCR system for text extraction
1. Text detection
2. Segmentation (Niblack Binarization (a morphological approach),
Binary sliding window classifier)
3. Beam search

4. character classifier (A full connected network with Relu Units)
5. Language Modeling (A standard N-gram model)
 Real-Time lexicon Free scene Text localization and recognition
1. Character Detection: (Extremal Regions), Incrementally Computable
Descriptors (Area, Bounding Box, Perimeter), Sequential classifier
(firstly-a real Ada-boost classifier with decision trees, secondly —
SVM+ RBF kernel)
2. Text Line Formation
3. Character Recognition
4. Sequence Selection
 RRN with Attention Modeling for OCR in the wild
1. Character Sequence Model (encodes image features using Recursive
Convolution layers decodes text using RNN’s)
2. RNN with attention function.
RecentAdvances
Text Detection and localization: This mainly focuses on processing
the images identifying the text and its positions. This is broadly
categorized in to below three forms.

connected component (CC)-based methods: Find the smaller
components and combine it to one larger and filter out non-text
components using classifier. Finally extract the text from images and
combine into one text region. An arguable limitation for the above
methodology is Handling rotation scale changes and complex
backgrounds.
Overcoming techniques:
1. Maximally stable extremal Regions (MSER): It provides robustness
for geometric and illumination conditions. Whereas it only adapting
to horizontal texts is its disadvantage.
2. Stroke Width Transformer (SWT): It seems very efficient and has
advantages of detecting text in any fonts, languages. Insensitive to
directions and multi-scales.
Texture-based methods: Idea is finding the text in images which has
distinct textual properties which can be separated from background.
Below techniques are mostly used in this method.
1. Gabor filters
2. Wavelet Transformation
3. Fast Fourier transformation
4. Discrete cosine Transform (DCT) Domain
5. Laplacian Wavelet, Wavelet decomposition

6. symmetry-based text line detector under the observation of the
symmetry and self- similarity properties of character groups.
Deep Learning-based methods: CNN’s has entirely changed, widely
explored and answered the unresolved questions. Main advantage hit by
the CNN’s is with less computationalcosts able to extract the features
from the images directly. Advanced properties of CNN have helped a lot
in scene detection in natural images. CNN’s implementations can be
broadly fall into 3 different groups within this deep learning-based
method they are below…
1. Region Proposal Based Methods: Simple CNN’s with MSER, R-
CNN’s and their advances— Its approach is to instance segmentation.
A Masked R-CNN uses a Bounding box to generate the object
segmentation by a shaded mask also called as semantic segmentation
technique. A Faster R-CNN can process a classification and detection
of objects in images. R-CNN’s uses a bounding box detection thus
creating the boxes around the objects in the images. By using
Regional proposal network an attention mechanism can happen in
faster R-CNN in 2 stages. These bounding boxes and determining
regions of interest using RPN protocol for each ROI we define the
class label called ROI Pooling. A Pixel-by-Pixel fully convolutional
networks can also be categorized in this method.
2. Segmentation-Based Methods: These mainly focus on producing
more precise multi size text detection but ineffective in detecting
individual words. Like a Cascaded two-convolutional network +
TextSegNet + YOLO(WordDetNet). Also, a super pixel segmentation
with hierarchical clustering for new character candidate extraction
method also comes in segmentation technique.

3. Hybrid Methods Using Multitask Learning: It accounts both CC
based methods and texture-based methods. Below techniques can
account for.
a. Character candidate detection (cascade boosting technique), min-cost
flow network (False character candidate removal, text line extraction,
text line verification).
b. Connectionist Text Proposal Network which can be extended to
multilingual and multi-scale text detection (vertical anchor mechanism,
in-network RNN)
c. Text-attentional convolutional neural network (Text-CNN)-contrast-
enhanced MSER.
Text Recognition: Identifying and understanding the text in the
candidate region. This is mainly classified into 3 ways as below.
Character Based Methods:
1. “Strokelets” -whose essence is a set of multi-scale mid-level
primitives and can be automatically learned from bounding box labels.
It’s very good in describing the characters.
2. Character Recognition by extracting low level features and integrated
automatically via region-based feature pooling technique.

Word Based Methods-Recognizing text at word level:
1. Dense SIFT in a bag-of-key points framework, character could be
recognized robustly
2. word segmentation with recognition in the probabilistic framework,
Lexical decision and sparse beam search tools were used to improve
the recognition accuracy.
Sequence-Based Methods: text is represented in character sequences.
1. An irregular text recognition which is called RARE (Robust text
recognizer with Automatic Rectification). This model combined
spatial transformer network (STN) and a sequence recognition
network (SRN)
2. Lexicon-free photo-OCR system called recursive recurrent neural
networks with attention modeling (R2AM)
3. Convolutional recurrent neural network (CRNN)
4. Auxiliary dense character detection model and an attention model.
End-to-End Text recognition System: CNN’s drastically changed
the procedures and techniques of attempting for combined model which
is one stop place of doing Text detection and text recognition. Text boxes
are been developed as part of advances in proposed end-to-end models.
A sliding window and connected component methods are best proposed

techniques, where they proposed an unconstrained end-to-end real-time
text localization and recognition method.
Benchmark Datasets for scene text identification in Images are below:
https://airtable.com/tbl3hLG7GneipFrDD/viwGd2IomvBf5D5U
d?blocks=show
Future areas:
We have witnessed numerous approaches under different categories and
methods with the same rapid development, especially CNN+RNN
framework is quite popular. Now, let’s discuss some future exciting stuff.
1. Complex scene and large-scale dataset: you can refer COCO dataset
and try to build multi challenging cases
2. Multilingual detection and recognition: Identifying multiple
languages like scripts identification within broad backgrounds and
text sizes project is one of my aspiring areas
3. Real-time detection and recognition: Identifying the texts from
images on real time is also a greatest focusing areas.

Scene Text detection in Images-A Deep Learning Survey

More Related Content

What's hot

Similar to Scene Text detection in Images-A Deep Learning Survey

Recently uploaded

Scene Text detection in Images-A Deep Learning Survey