SCENE TEXT DETECTION AND
RECOGNITION
Presented by: J. Hemanth Kumar
B. Kishore Kumar
ABSTRACT
Text characters in natural scenes provide valuable information about a place, sometimes including legal or otherwise important details, so detecting and recognizing such text is highly useful. However, recognition is not easy because of the diverse backgrounds, fonts, and sizes of scene text. In this paper, a method is proposed to extract text information from the surroundings. First, a character descriptor is designed from existing standard keypoint detectors and feature descriptors. Then, character structure is modeled for each character class by designing stroke configuration maps.
INTRODUCTION
In natural scenes, text generally appears on sign boards and other nearby objects. Extracting such text is difficult because of noisy backgrounds and the diversity of fonts and text sizes, although many applications have proven effective at it. The extraction method is divided into two processes:
1. Text detection
2. Text recognition
TEXT DETECTION
• It is the process of localizing the regions of the scene that contain text.
• It removes most of the non-text regions, which act as noise during the extraction of the required text.
TEXT RECOGNITION
• The process of converting pixel-based text (image text) into a readable code.
• Its main purpose is to distinguish between different text types and compose them properly.
The main focus of this paper is on text recognition. It involves 62 character identity categories:
• 10 digits (0-9)
• 26 upper-case letters (A-Z)
• 26 lower-case letters (a-z)
Text regions are generally distinguished with the aid of color uniformity and the alignment of text in a line. Two schemes are designed to achieve text recognition.
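The 62-class label set above can be enumerated directly; a small illustrative sketch in Python:

```python
import string

# The 62 character classes: 10 digits, 26 upper-case and 26 lower-case letters.
CLASSES = list(string.digits + string.ascii_uppercase + string.ascii_lowercase)

assert len(CLASSES) == 62
print(CLASSES[:3], CLASSES[10:13], CLASSES[36:39])
```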
SCHEME-1:
• Training a character recognizer to predict the category of a character in an image patch.
SCHEME-2:
• Training a binary character classifier for each character class to predict whether that category is present in an image patch.
TEXT UNDERSTANDING: acquiring text information from a natural scene to understand the surrounding environment and objects.
TEXT RETRIEVAL: verifying whether a given piece of text exists in the natural scene.
• This is used in mobile applications. A binary classifier is generated by assigning a stroke configuration to each character with the aid of its boundary and skeleton.
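The two schemes can be contrasted with a toy sketch. A nearest-centroid model with random centroids stands in for the actual trained classifiers (the centroids, dimensions, and threshold here are illustrative assumptions, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: one centroid per character class in feature space.
# Real classifiers would be trained on character descriptors; random
# centroids merely illustrate the two prediction schemes.
num_classes, dim = 62, 16
centroids = rng.normal(size=(num_classes, dim))

def recognize(x):
    """Scheme 1: multi-class recognition -> index of the nearest class."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

def contains_class(x, c, threshold=4.0):
    """Scheme 2: binary classifier for class c -> is this class present?"""
    return bool(np.linalg.norm(centroids[c] - x) < threshold)

patch = centroids[7] + 0.1 * rng.normal(size=dim)  # noisy sample of class 7
print(recognize(patch))          # expected: 7
print(contains_class(patch, 7))  # expected: True
```

Text understanding uses the first function over all 62 classes; text retrieval only needs the second, for the one class being searched.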
• With the character recognizer, text understanding can provide useful surrounding text information for mobile applications.
• With the binary classifier of each character class, text retrieval can help search for expected objects in the environment.
Fig. presents a flowchart of the scene text extraction method. This method differs from other existing methods in that it adds stroke configuration to model text character structure.
LAYOUT-BASED SCENE TEXT DETECTION
• Text detection is performed with the help of color decomposition and the horizontal alignment of text.
A. LAYOUT ANALYSIS OF COLOR DECOMPOSITION
In the scene image, similarly colored pixels are grouped into the same layer to separate text from background. A boundary clustering algorithm decomposes the scene image into different color layers; the boundary of a character acts as the border between the text and the surrounding surfaces.
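As a simplified stand-in for the boundary clustering algorithm, plain k-means over pixel colors shows the idea of decomposition into color layers (the function name, iteration count, and toy image are assumptions for illustration):

```python
import numpy as np

def color_layers(img, k=3, iters=10, seed=0):
    """Decompose an RGB image into k color layers with plain k-means:
    pixels of similar color land in the same layer, so uniformly
    colored text separates from the background."""
    h, w, _ = img.shape
    pixels = img.reshape(-1, 3).astype(float)
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to its nearest color center.
        labels = np.argmin(
            np.linalg.norm(pixels[:, None] - centers[None], axis=2), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = pixels[labels == c].mean(axis=0)
    return labels.reshape(h, w)

# Tiny synthetic scene: a dark "text" strip on a light background.
img = np.full((8, 8, 3), 220, dtype=np.uint8)
img[3:5, 1:7] = (30, 30, 30)
layers = color_layers(img, k=2)
# The strip pixels land in one layer, the background in the other.
print(layers[3, 2] != layers[0, 0])
```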
This image shows the color decomposition of a scene image by the boundary clustering algorithm. The top row presents the original scene image and the edge image obtained from a Canny edge detector. The other rows present color layers obtained from bigram color uniformity. The text information on the signage board is extracted from the complex background into a single color layer.
B. LAYOUT ANALYSIS OF HORIZONTAL ALIGNMENT
• For each layer obtained, the boundaries are analyzed geometrically to estimate whether a particular character is present. In most cases, the text on sign boards or other regions is of similar size and horizontally aligned, so an adjacent character grouping algorithm can identify and group the characters together.
• A colored bounding box is assigned to each detected text string. Similar adjacent bounding boxes are searched for and, if found, grouped into a single box. For non-horizontally oriented strings, characters are searched for only within a reasonable range, generally ±π/6 (about 30°) relative to the horizontal. To work well in practice, some details and parameters of the text detection are slightly adjusted.
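A minimal sketch of the adjacency test behind such grouping, assuming (x, y, w, h) bounding boxes; the size-ratio threshold is an illustrative assumption, while the ±π/6 angular tolerance follows the text above:

```python
import math

def are_adjacent(a, b, size_ratio=2.0, max_angle=math.pi / 6):
    """Decide whether two character bounding boxes (x, y, w, h) can be
    grouped: similar size, and the line joining their centers stays
    within +/- pi/6 of horizontal."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Similar character height, within a tolerance ratio.
    if max(ah, bh) > size_ratio * min(ah, bh):
        return False
    # Angle between box centers relative to the horizontal.
    dx = (bx + bw / 2) - (ax + aw / 2)
    dy = (by + bh / 2) - (ay + ah / 2)
    if dx == 0:
        return False
    return abs(math.atan2(dy, dx)) <= max_angle

print(are_adjacent((0, 10, 8, 12), (10, 11, 8, 12)))  # roughly level: True
print(are_adjacent((0, 10, 8, 12), (10, 40, 8, 12)))  # steep offset: False
```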
These images describe the adjacent character grouping process. The red box denotes the bounding box of a boundary in a color layer. The green regions in the two bottom-left figures represent adjacent groups of consecutive neighboring bounding boxes of similar size and horizontal alignment. The blue regions in the bottom-right figure represent the text string fragments obtained by merging the overlapping adjacent groups.
STRUCTURE-BASED SCENE TEXT RECOGNITION
After text regions are detected, text information is obtained by character recognition.
We use 62 character classes in total. We have two recognition schemes for character
recognition. Text understanding is a multi-class classification problem where 62 classes of
characters are classified. Text retrieval is a binary classification problem where it is
required to estimate if a patch contains a character class or not.
Here, text recognition is done using:
A. Character Descriptor
B. Character Stroke Configuration
A. CHARACTER DESCRIPTOR
• It uses four keypoint detectors: the Harris detector (HD), the MSER detector (MD), the Dense detector (DD), and the Random detector (RD).
• The Harris detector identifies keypoints at corners and junctions.
• The MSER detector identifies keypoints on stroke components.
• The Dense detector extracts keypoints uniformly.
• The Random detector extracts a preset number of keypoints at random.
At each extracted keypoint, the HOG feature is computed as a feature vector x in the feature space.
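The idea of a HOG feature at a keypoint can be sketched with a minimal gradient-orientation histogram (a simplification, not the full block-normalized HOG of Dalal and Triggs; the patch size and bin count are assumptions):

```python
import numpy as np

def hog_at_keypoint(img, kp, patch=8, bins=8):
    """Minimal HOG-style descriptor at one keypoint: gradient
    orientations in a patch around (row, col), histogrammed with
    magnitude weighting and L2-normalized."""
    y, x = kp
    half = patch // 2
    win = img[y - half:y + half, x - half:x + half].astype(float)
    gy, gx = np.gradient(win)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# A vertical step edge: gradients are horizontal, so the energy
# concentrates in the 0-orientation bin.
img = np.zeros((16, 16))
img[:, 8:] = 1.0
desc = hog_at_keypoint(img, (8, 8))
print(desc.shape)            # (8,)
print(int(np.argmax(desc)))  # 0
```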
• Feature descriptors other than HOG can also be used, but HOG was found to give better results in comparison.
• For quantization, the Bag-of-Words (BOW) model and the Gaussian Mixture Model (GMM) are used to aggregate the extracted features. BOW is applied to keypoints from all four detectors, while GMM is applied only to those from DD and RD.
• Each character patch is mapped by both models into a characteristic histogram as its feature representation. By concatenating all the feature representations, a character descriptor with good discriminative and recognition power is obtained.
[Flowchart: character sample → {Harris + HOG, MSER + HOG, Dense + HOG, Random + HOG} feature descriptors → BOW / GMM → histogram of visual word frequency / histogram of binary comparison → character descriptor]
• Flowchart of the proposed character descriptor, which combines four keypoint detectors. HOG features are extracted at the keypoints; BOW and GMM are then employed to obtain the visual word histogram and the binary comparison histogram, respectively.
B. CHARACTER STROKE CONFIGURATION
• Character structure consists of multiple oriented strokes, which serve as the basic elements of a text character.
• From the pixel-level perspective, a stroke of printed text is defined as a region bounded by two parallel boundary segments; their orientation is the stroke orientation, and the distance between them is the stroke width.
• To locate strokes accurately, our algorithm redefines a stroke as the skeleton points within character sections of consistent width and orientation.
• A character can be represented as a set of connected strokes with a specific configuration: the number, locations, lengths, and orientations of the strokes. This structure map of strokes is defined as the stroke configuration.
• Within a character class, although character instances appear in different fonts, styles, and sizes, the stroke configuration is always consistent.
• The stroke configuration is estimated from synthesized characters generated by computer software rather than from characters cropped from scene images, because synthesized characters provide the accurate boundaries and skeletons related to character structure.
• The proposed Synthetic Font Training Dataset is used here to obtain the stroke configuration. It contains about 67,400 character patches of synthetic English letters and digits in various fonts and styles, of which 20,000 patches are selected to generate character patches. It covers all 62 character classes.
CONCLUSION
• A method for scene text recognition from detected text regions in mobile applications is proposed. It detects text regions in scene images and then recognizes the text information they contain.
• The proposed character descriptor effectively extracts representative and discriminative text features for both recognition schemes.
• To model text character structure for the text retrieval scheme, a novel feature representation, the stroke configuration map, has been designed based on boundary and skeleton.
Thank you💐☺
