This document discusses a system for extracting text from images. It begins with an introduction describing the need for such a system. It then covers related work on text detection techniques. The proposed method involves converting images to grayscale, binarization, connected component analysis, horizontal/vertical projections, reconstruction, and recognition using OCR. Applications discussed include wearable devices, video coding, image indexing, and license plate recognition. While the system is robust, OCR recognition of noisy extracted text remains a challenge.
2. INTRODUCTION
Today, most information is available either
on paper or in the form of photographs or
videos.
A large amount of information is stored in images.
Current technology is restricted to extracting
text against clean backgrounds.
Thus, there is a need for a system that can extract
text from general backgrounds.
3. Text extraction and recognition in images has
potential applications in many fields such as
image indexing, robotics, and intelligent
transport systems.
For example, capturing license plate information
through a video camera and extracting the
license number at traffic signals.
However, variations of text due to differences in
size, style, orientation, and alignment, as well as
low image contrast and complex backgrounds,
make the problem of automatic text extraction
extremely challenging.
4. Content-based image indexing refers to the
process of attaching labels to images based on
their content.
Image content can be divided into two main
categories:
1) Perceptual content and
2) Semantic content.
Perceptual content includes attributes such as
color, intensity, shape, texture, and their
temporal changes.
5. A number of studies on the use of relatively low-
level perceptual content for image indexing have
already been reported.
Semantic content refers to objects, events, and
their relations.
Studies on semantic image content in the form
of text, face, vehicle, and human action have
also attracted some recent interest.
6. Among them, text within an image is of particular
interest as:
(i) it is very useful for describing the contents of
an image;
(ii) it can be easily extracted compared to other
semantic contents; and
(iii) it enables applications such as
keyword-based image search and
text-based image indexing.
8. TEXT IN IMAGES
A variety of approaches to Text Information
Extraction (TIE) from images have been
proposed for specific applications, including page
segmentation, address block location, license
plate location, and content-based image/video
indexing.
In spite of such extensive studies, it is still not
easy to design a general-purpose TIE system.
9. This is because there are so many possible
sources of variation when extracting text from a
shaded or textured background, from low-
contrast or complex images, or from images
having variations in font size, style, color,
orientation, and alignment.
These variations make the problem of automatic
TIE extremely difficult.
Figures 1-4 show some examples of text in
images.
12. Text in video images can be further classified into
caption text (Fig. 3), or scene text (Fig. 4).
Fig. 3. Images with caption text: text is artificially
overlaid on the image; (a) shows captions overlaid
directly on the background.
13. Fig. 4. Scene text images: Images with
variations in skew, perspective, blur, illumination,
and alignment.
14. TERMS
Before we attempt to explain the various
techniques used in TIE, it is important to define
the commonly used terms and summarize the
characteristics of text that can be used for TIE
algorithms.
Text in images can exhibit many variations with
respect to the following properties.
15. 1. Geometry
Size: Although the text size can vary a lot,
assumptions can be made depending on the
application domain.
Alignment: The characters in caption text
appear in clusters and usually lie horizontally,
although they can sometimes appear as
non-planar text as a result of special effects.
This does not apply to scene text, which can have
various perspective distortions. Scene text can be
aligned in any direction and can have geometric
distortions (Fig. 4).
16. Inter-character distance: characters in a text
line have a uniform distance between them.
2. Colour
The characters in a text line tend to have the
same or similar colours.
This property makes it possible to use a
connected component-based approach for text
detection.
Most of the research reported to date has
concentrated on finding 'text strings of a single
colour (monochrome)'.
17. However, video images and other complex
colour documents can contain ‘text strings with
more than two colours (polychrome)’ for effective
visualization, i.e., different colours within one
word.
3. Motion
The same characters usually exist in
consecutive frames in a video with or without
movement.
This property is used in text tracking and
enhancement.
Caption text usually moves in a uniform way:
horizontally or vertically.
18. Scene text can have arbitrary motion due to
camera or object movement.
4. Edge
Most caption and scene text are designed to be
easily read, thereby resulting in strong edges at
the boundaries of text and background.
5. Compression
Many digital images are recorded, transferred,
and processed in a compressed format.
Thus, a faster TIE system can be achieved if
one can extract text without decompression.
19. The role of text detection is to find the image
regions containing only text that can be directly
highlighted to the user or fed into an optical
character reader module for recognition.
In this seminar, a new system is discussed
which extracts text from images.
The system takes colored images as input.
It detects text on the basis of certain text
features.
The image is then cleaned up so that the text
stands out.
20. RELATED WORK
Various methods have been proposed in the
past for detection and localization of text in
images and videos.
These approaches take into consideration
different properties related to text in an image,
such as color, intensity, connected components,
and edges.
21. These properties are used to distinguish text
regions from their background and/or other
regions within the image.
In the algorithm based on color clustering, the
input image is first pre-processed to remove any
noise if present.
Then the image is grouped into different color
layers and a gray component.
This approach utilizes the fact that usually the
color data in text characters is different from the
color data in the background.
22. The potential text regions are localized using
connected component based heuristics from
these layers.
The experiments conducted show that the
algorithm is robust in locating mostly Chinese
and English characters in images.
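To make the colour-clustering idea concrete, the sketch below groups the pixels of an input image into a small number of colour layers using k-means. This is only an illustration of the general approach; the cited work does not necessarily use k-means or scikit-learn, and the number of layers is an assumed placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans

def color_layers(rgb_image, n_layers=4):
    """Group the pixels of an RGB image into n_layers colour clusters.

    Returns an (H, W) label image assigning each pixel to a colour layer.
    The number of layers is an illustrative choice, not a value taken
    from the cited work.
    """
    h, w, _ = rgb_image.shape
    pixels = rgb_image.reshape(-1, 3).astype(np.float64)
    labels = KMeans(n_clusters=n_layers, n_init=10).fit_predict(pixels)
    return labels.reshape(h, w)
```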
The text detection algorithm is also based on
color continuity.
In addition, it uses multi-resolution wavelet
transforms and combines low-level as well as
high-level image features for text region extraction.
23. Texture based segmentation is used to
distinguish text from its background.
Further, a bottom-up 'chip generation' process
is carried out which uses the spatial cohesion
property of text characters.
The chips are collections of pixels in the image
consisting of potential text strokes and edges.
The results show that the algorithm is robust in
most cases, except for very small text characters
that are not properly detected.
Also in the case of low contrast in the image,
misclassifications occur in the texture
segmentation.
24. METHOD
The various steps involved in the TIE process are:
(1) Converting the colored image to grayscale
(2) Binarization
(3) Connected components
(4) Horizontal and vertical projections
(5) Reconstruction
(6) Recognition using OCR
25. Converting colored image to grayscale
A digital color image is an image that
includes color information for each pixel.
There are various color models which are used
to represent a color image.
These include the RGB color model, in which red,
green, and blue light are added together in various
ways to reproduce a broad array of colors.
The other models are the CMY color model, which
uses cyan, magenta, and yellow light, and the HSI
model, which uses hue, saturation, and intensity
variations.
Grayscale images have a range of shades of gray
without apparent color.
These are used because less information needs to be
stored for each pixel.
In an 8-bit image, 256 shades are possible.
The darkest possible shade, black, is
represented as 00000000 and the lightest possible
shade, white, is represented as 11111111.
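A minimal sketch of this step in Python, assuming the image is a NumPy array; the luminosity weights below are a common convention, not values specified in the seminar:

```python
import numpy as np

def to_grayscale(rgb_image):
    """Convert an RGB image (H x W x 3, uint8) to an 8-bit grayscale image.

    The 0.299/0.587/0.114 luminosity weights are one common convention;
    the seminar does not prescribe a particular weighting.
    """
    rgb = rgb_image.astype(np.float64)
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return gray.astype(np.uint8)
```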
27. Binarization
A binary image is a digital image that can have only
two possible values for each pixel.
Each pixel is stored as a single bit, 0 or 1.
The name black and white is often used for this
concept.
To form a binary image we select a threshold
intensity value.
All the pixels having intensity greater than the
threshold value are changed to 0 (black), and the
pixels with intensity less than the threshold
value are changed to 1 (white).
Thus the image is converted to a binary image.
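A minimal sketch of this thresholding step, following the convention above (pixels brighter than the threshold become 0, darker pixels become 1); the threshold value is a placeholder, not one given in the seminar:

```python
import numpy as np

def binarize(gray, threshold=128):
    """Binarize a grayscale image: pixels brighter than the threshold map
    to 0 and darker pixels map to 1, as described in the text. The default
    threshold of 128 is only a placeholder; a suitable value depends on
    the image.
    """
    return (gray < threshold).astype(np.uint8)
```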
Connected components
For two pixels to be connected, they must be
neighbors and their gray levels must satisfy a
certain criterion of similarity.
29. For example, in a binary image with values 0
and 1, two pixels may be neighbors but they are
said to be connected only if they have same
values.
A pixel p with coordinates (x, y) has four
horizontal and vertical neighbors known as 4-
neighbors of p, given as: (x, y+1), (x, y-1), (x+1,
y), (x-1, y) and four diagonal neighbors given as:
(x+1, y-1), (x-1, y-1), (x+1, y+1), (x-1, y+1).
Together these are known as 8-neighbors of p.
30. If S represents subset of pixels in an image, two
pixels p and q are said to be connected if there
exists a path between them consisting entirely of
pixels in S.
For any pixel p in S, the set of pixels that are
connected to it in S is called a connected
component of S.
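The sketch below labels 8-connected components of foreground (value 1) pixels in the binary image with a simple breadth-first search; it illustrates the definition above and is not code from the seminar.

```python
from collections import deque
import numpy as np

def connected_components(binary):
    """Label 8-connected components of 1-valued pixels in a binary image.

    Returns (labels, count): labels has the same shape as the input, with
    0 for background and positive integers identifying components.
    """
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=np.int32)
    count = 0
    # The 8-neighbors of a pixel: 4 horizontal/vertical + 4 diagonal offsets.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for y in range(h):
        for x in range(w):
            if binary[y, x] == 1 and labels[y, x] == 0:
                count += 1
                labels[y, x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for dy, dx in offsets:
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny, nx] == 1 and labels[ny, nx] == 0):
                            labels[ny, nx] = count
                            queue.append((ny, nx))
    return labels, count
```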
31. Horizontal and Vertical Projections
The method is performed on binary images.
It scans each line from the left and records a
change whenever the pixel value changes from
zero to one and back to zero.
Counting the changes in this method does not
depend on the number of pixels.
Robustness under noisy conditions is an advantage
of this method.
After finding the connected components, we
check the transitions in the values of pixels
horizontally.
32. Transitions can be either zero (black) to one (white)
or one (white) to zero (black).
The text regions will have a larger number of
transitions from black to white or vice versa,
whereas the background regions will have a smaller
number of transitions.
If the number of changes counted for a row lies
between two thresholds (a low and a high threshold),
the row is considered a potential text area, and the
top and bottom of this row are recorded.
Next, we search vertically to find the exact
location of the text and to discard rows that do
not actually contain text.
To find the exact location of the text, we use
some heuristics.
These heuristics include the height and length of the
text, the ratio of height to length, and a sufficient
number of pixels in the area.
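A minimal sketch of the horizontal projection step: the number of value changes is counted along each row, and rows whose counts fall between a low and a high threshold are kept as candidate text rows. Both threshold values below are illustrative placeholders, not values from the seminar.

```python
import numpy as np

def candidate_text_rows(binary, low_thresh=5, high_thresh=100):
    """Return indices of rows whose number of 0->1 / 1->0 transitions lies
    between low_thresh and high_thresh (both placeholders; suitable values
    depend on image and text size).
    """
    # Count value changes between horizontally adjacent pixels in each row.
    transitions = np.abs(np.diff(binary.astype(np.int8), axis=1)).sum(axis=1)
    mask = (transitions >= low_thresh) & (transitions <= high_thresh)
    return np.nonzero(mask)[0]
```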
34. Reconstruction
After the extraction of text regions from images,
the text regions become somewhat distorted and
difficult to read, so we recover these
components using the original image.
The distorted and original images are compared
with each other, and the pixels which were erased
or disfigured are recovered.
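One way this recovery step could be sketched, assuming the detected text regions are available as bounding boxes (the box-based interface is an assumption for illustration): pixels inside each region are copied back from the original grayscale image.

```python
import numpy as np

def reconstruct_regions(original_gray, regions):
    """Copy detected text regions back from the original grayscale image.

    `regions` is assumed to be a list of (top, bottom, left, right) boxes
    produced by the earlier detection steps; this interface is illustrative.
    """
    output = np.full_like(original_gray, 255)  # start from a white canvas
    for top, bottom, left, right in regions:
        output[top:bottom, left:right] = original_gray[top:bottom, left:right]
    return output
```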
36. Recognition using OCR
The OCR software converts each character into
ASCII codes.
The extracted text is stored as ASCII codes in
the computer memory.
OCR systems have been available for a number
of years and the current commercial systems
can produce an extremely high recognition rate
for machine-printed documents on a simple
background.
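As a hedged example of the recognition step, the sketch below passes the cleaned-up image to the open-source Tesseract engine through the pytesseract wrapper; the seminar does not name a specific OCR package, so this choice is an assumption.

```python
from PIL import Image
import pytesseract

def recognize_text(image_path):
    """Run OCR on a cleaned-up text image and return the recognized string."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image)
```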
37. Applications
There are numerous applications of a text
information extraction system, including
document analysis, vehicle license plate
extraction, technical paper analysis, and object-
oriented data compression.
In the following, we briefly describe some of
these applications
38. .
Wearable or portable computers: with the
rapid development of computer hardware
technology, wearable computers are now a
reality.
A TIE system involving a hand-held device and
camera was presented as an application of a
wearable vision system.
Translation camera can detect text in a scene
image and translate Japanese text into English
after performing character recognition.
39. Content-based video coding or document
coding: The MPEG-4 standard supports object-
based encoding.
When text regions are segmented from other
regions in an image, this can provide higher
compression rates and better image quality.
As a result, such coders can achieve a higher-quality
rendering of documents containing text, pictures,
and graphics.
Text-based image indexing: This involves
automatic text-based video structuring methods
using caption data.
40. License/container plate recognition: There
has already been a lot of work done on vehicle
license plate and container plate recognition.
Although container and vehicle license plates
share many characteristics with scene text,
many assumptions have been made regarding
the image acquisition process (camera and
vehicle position and direction, illumination,
character types, and color) and geometric
attributes of the text.
41. Texts in WWW images: The extraction of text
from WWW images can provide relevant
information on the Internet.
Video content analysis: Extracted text regions
or the output of character recognition can be
useful in genre recognition.
The size, position, frequency, text alignment,
and OCR-ed results can all be used for this.
Industrial automation: Part identification can
be accomplished by using the text information
on each part.
42. CONCLUSION
Current technologies do not work well for
documents with text printed against shaded or
textured backgrounds or those with a non-
structured layout.
In contrast, our technique works well for normal
documents as well as for documents in the
situations described above.
The system is stable and robust. All the system
parameters remain the same throughout all the
experiments.
43. However, it is not easy to use commercial OCR
software for recognizing text extracted from
images or video frames.
New OCR systems need to be developed to
handle the large amounts of noise and distortion
in TIE applications.