1. REAL-TIME SCENE TEXT
LOCALIZATION AND RECOGNITION
project done by:
208H1A0454 PICHIKA MANOHAR
208H1A0416 CHEEDEPUDI G S PRAVEEN BABU
208H1A0422 DEVARAPALLI CHANDRASEKHAR
218H5A0411 KATARI SRINIVASRAO
3. INTRODUCTION
Scene text recognition (STR) has recently become an increasingly active research field in computer vision, as
manifested by the success of the biennial ICDAR "Robust Reading" competitions [1] and the workshop on
Camera-Based Document Analysis and Recognition (CBDAR).
With the extensive demand for information identification, STR technology has large-scale applications in
automated logistics distribution, geographical positioning, license plate recognition, and driverless vehicles.
Text detection [2] is a common task in image analysis, while text recognition [3] is a more advanced task: it must
not only localize the text spatially, which belongs to object detection, but also recognize the text, i.e., perform
text spotting [4].
Compared with traditional well-formatted document text detection and recognition, natural scene text detection and
recognition is a challenging visual detection task due to multiple languages, varying text sizes, font tilt, blurring,
background interference, handwriting, varying viewing angles, and so on, as shown.
4. Previous Work
Numerous methods that focus solely on text localization in real-world images have been published [6, 2, 7, 17].
The method of Epshtein et al. [5] converts an input image to grayscale and uses the Canny detector [1] to find
edges.
Pairs of parallel edges are then used to calculate a stroke width for each pixel, and pixels with similar stroke width
are grouped together into characters. The method is sensitive to noise and blurry images because it depends on
successful edge detection, and it provides only a single segmentation for each character, which is not necessarily
the best one for an OCR module. A similar edge-based approach with a different connected-component algorithm
is presented in [24].
A good overview of the methods and their performance can also be found in the ICDAR Robust Reading competition
results [10, 9, 20]. Only a few methods that perform both text localization and recognition have been published,
among them the method of Wang et al.
Figure 2. Text localization and recognition overview.
(a) Source 2 MPx image.
(b) Intensity channel extracted.
(c) ERs selected by the first stage of the sequential classifier.
(d) ERs selected by the second stage of the classifier.
(e) Text lines found by region grouping.
(f) Only ERs in text lines selected and text recognized by an OCR module.
(g) Number of ERs at the end of each stage and its duration.
5. IMAGE PROCESSING
The term digital image processing refers to the processing of a two-dimensional picture by a digital
computer. In a broader context, it implies digital processing of any two-dimensional data.
A digital image is an array of real or complex numbers represented by a finite number of bits. An
image given in the form of a transparency, slide, photograph, or X-ray is first digitized and stored as
a matrix of binary digits in computer memory.
This digitized image can then be processed and/or displayed on a high-resolution television monitor.
For display, the image is stored in a rapid-access buffer memory, which refreshes the monitor at a rate
of 25 frames per second to produce a visually continuous display.
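The storage and refresh figures above follow directly from the image dimensions and bit depth. As a minimal sketch (the 800 × 600 resolution and 8-bit grayscale depth are illustrative choices, not values from the slide):

```python
# Storage needed for a digitized grayscale image (illustrative figures).
width, height = 800, 600          # pixel dimensions (assumed for illustration)
bits_per_pixel = 8                # 8-bit grayscale: 256 levels per pixel

image_bits = width * height * bits_per_pixel
image_bytes = image_bits // 8
print(image_bytes)                # 480000 bytes, roughly 469 KB

# Bandwidth needed to refresh the display buffer at 25 frames per second.
refresh_bytes_per_second = image_bytes * 25
print(refresh_bytes_per_second)   # 12000000 bytes per second
```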
6. RGB IMAGE
An RGB color image is an M×N×3 array of color pixels, where each color pixel is a triplet
corresponding to the red, green and blue components of the image at a specific spatial location.
An RGB image may be viewed as a "stack" of three grayscale images that, when fed into the red,
green and blue inputs of a color monitor, produce a color image on the screen.
By convention, the three images forming an RGB color image are referred to as the red, green and blue
component images.
The data class of the component images determines their range of values. If an RGB image is of class
double, the range of values is [0, 1].
A normal grayscale image has an 8-bit color depth, i.e., 256 gray levels.
A true color image has a 24-bit color depth: 8 + 8 + 8 bits give 256 × 256 × 256 colors, roughly 16 million
colors.
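The "stack of three grayscale images" view can be sketched with a tiny hand-made image (the 2×2 pixel values are made up for illustration):

```python
# A tiny 2x2 RGB image as an M x N x 3 nested structure (values in [0, 255]).
rgb = [
    [(255, 0, 0), (0, 255, 0)],        # red pixel, green pixel
    [(0, 0, 255), (255, 255, 255)],    # blue pixel, white pixel
]

# The three component images: the red, green and blue planes.
red   = [[px[0] for px in row] for row in rgb]
green = [[px[1] for px in row] for row in rgb]
blue  = [[px[2] for px in row] for row in rgb]
print(red)        # [[255, 0], [0, 255]]

# 24-bit true colour: 8 + 8 + 8 bits -> 256 * 256 * 256 distinct colours.
print(256 ** 3)   # 16777216, roughly 16 million
```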
7. BINARY IMAGE
The most elementary type of image representation is the binary image, which uses only two levels.
The two levels are referred to as black and white and are denoted '0' and '1'. This kind of image representation
uses 1 bit per pixel, since a single binary digit suffices to represent each pixel.
These images are often used to depict low-level information about a picture, such as its outline or shape, especially
in applications like optical character recognition (OCR), where only the outline of a character is required to
recognize the letter it represents.
Binary images are typically generated from grayscale images through a technique called thresholding.
Two-level thresholding simply applies a decision threshold above which a pixel switches to the numerical value '1'
and below which it switches to '0'.
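The thresholding rule above can be sketched directly (the 3×3 grayscale values and the threshold of 128 are illustrative choices):

```python
# Two-level thresholding of a grayscale image: pixels above the threshold
# become 1 (white), pixels at or below it become 0 (black).
def threshold(gray, t):
    return [[1 if px > t else 0 for px in row] for row in gray]

gray = [
    [ 12, 200, 180],
    [ 40,  90, 250],
    [  5, 130,  60],
]
binary = threshold(gray, 128)
print(binary)   # [[0, 1, 1], [0, 0, 1], [0, 1, 0]]
```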
8. GRAYSCALE IMAGE
A grayscale (GS) image is a neutral, single-channel picture: it carries brightness information only and
contains no color data. The brightness, however, is represented at different levels.
A typical 8-bit image holds a range of 0–255 brightness levels known as gray levels,
where 0 refers to black and 255 refers to white. The 8-bit depiction is natural given that a
computer typically handles data in 8-bit units. Fig. 5.7 (a) and (b) below show two examples of
such grayscale images.
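A color image is commonly reduced to such a single brightness channel with the standard ITU-R BT.601 luminance weights; this conversion is not described on the slide, so the sketch below is an assumed but conventional recipe:

```python
# RGB -> grayscale using the standard ITU-R BT.601 luminance weights.
# The result is a single 8-bit brightness value per pixel (0 = black, 255 = white).
def to_gray(r, g, b):
    return round(0.299 * r + 0.587 * g + 0.114 * b)

print(to_gray(255, 255, 255))  # 255 (white)
print(to_gray(0, 0, 0))        # 0 (black)
print(to_gray(255, 0, 0))      # 76  (pure red appears fairly dark in grayscale)
```

The weights reflect the eye's greater sensitivity to green than to red or blue, which is why equal RGB values map to the same gray level but pure colors do not.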
9. COLOUR IMAGE
A color image (CI) is modeled as three bands of monochromatic light information, where each band
corresponds to a different color.
The following figure illustrates an individual pixel's red, green and blue values as a color vector
(R, G, B), drawn as an arrow.
10. A color pixel vector consists of the red, green and blue pixel values (R, G, B) at one given
row/column pixel coordinate (r, c).
A multispectral image is one that captures image data at particular frequencies across the
electromagnetic spectrum. Multispectral images often contain information outside the normal human
perceptual range. This may include infrared, ultraviolet, X-ray, acoustic or radar data.
Sources of these kinds of images include satellite systems, underwater sonar systems
and medical diagnostic imaging systems.
11. MATLAB SOFTWARE
MATLAB® is a high-performance language for technical computing. It integrates computation,
visualization, and programming in an easy-to-use environment where problems and solutions are
expressed in familiar mathematical notation.
Typical uses include:
Math and computation
Algorithm development
Data acquisition
Modelling, simulation, and prototyping
Data analysis, exploration, and visualization
Scientific and engineering graphics
Application development, including graphical user interface building
12. MATLAB is an interactive system whose basic data element is an array that does not require
dimensioning. This allows you to solve many technical computing problems, especially those with
matrix and vector formulations, in a fraction of the time it would take to write a program in a scalar
non-interactive language such as C or FORTRAN.
The name MATLAB stands for matrix laboratory. MATLAB was originally written to provide easy
access to matrix software developed by the LINPACK and EISPACK projects. Today, MATLAB
engines incorporate the LAPACK and BLAS libraries, embedding the state of the art in software for
matrix computation.
MATLAB has evolved over a period of years with input from many users. In university environments,
it is the standard instructional tool for introductory and advanced courses in mathematics, engineering,
and science. In industry, MATLAB is the tool of choice for high-productivity research, development, and
analysis.
13. MATLAB DESKTOP
The MATLAB Desktop is the main MATLAB application window. The desktop contains sub-windows: the
command window, the workspace browser, the current directory window, the command history
window, and one or more figure windows, which are shown only when the user displays a graphic.
14. EXISTING METHOD
The method is able to cope with noisy data, but its generality is limited, as a lexicon of words (which
contains at most 500 words in their experiments) has to be supplied for each individual image.
Methods presented in [14, 15] detect characters as Maximally Stable Extremal Regions (MSERs) [11]
and perform text recognition using the segmentation obtained by the MSER detector. An MSER is a
particular case of an Extremal Region whose size remains virtually unchanged over a range of thresholds.
The methods perform well but have problems on blurry images or characters with low contrast. According
to the description provided by the ICDAR 2011 Robust Reading competition organizers [20], the winning
method is also based on MSER detection.
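The phrase "size remains virtually unchanged over a range of thresholds" can be made concrete with a deliberately simplified one-dimensional sketch (real MSER detection works on 2-D connected components; the pixel row and threshold step here are invented for illustration):

```python
# Illustrative 1-D sketch of the MSER idea: the extremal region is the connected
# run of pixels with intensity <= t containing a seed pixel; it is "maximally
# stable" where its size changes least as the threshold t varies.
def region_size(row, seed, t):
    if row[seed] > t:
        return 0
    left = seed
    while left > 0 and row[left - 1] <= t:
        left -= 1
    right = seed
    while right < len(row) - 1 and row[right + 1] <= t:
        right += 1
    return right - left + 1

row = [200, 40, 35, 30, 38, 45, 210, 220]   # a dark "stroke" on bright paper
sizes = {t: region_size(row, seed=3, t=t) for t in range(30, 220, 10)}
# The region grows quickly at low thresholds, then its size stays fixed at 5
# for every threshold from 50 through 190: that long flat stretch is where
# the region is maximally stable, i.e., the character stroke.
print(sizes)
```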
15. PROPOSED METHOD
The proposed methodology is described in four subsections. In Section II-A, we calculate the product of
Laplacian and Sobel operations on the input image to enhance the text details; this is called the Laplacian–
Sobel product (LSP) process. A Bayesian classifier is used for classifying true text pixels based on three
probable matrices, as described in Section II-B. The three probable matrices are obtained on the basis of the LSP:
high-contrast pixels in the LSP classified as text pixels (HLSP), K-means with k = 2 on the maximum
gradient difference of HLSP (K-MGD-HLSP), and K-means on the LSP (between the maximum and minimum values
of a sliding window over HLSP).
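The enhancement step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 3×3 masks are the textbook Laplacian and Sobel kernels, the Laplacian magnitude is multiplied by the Sobel gradient magnitude, and border pixels are simply skipped.

```python
# Illustrative sketch of a Laplacian-Sobel product (LSP): combine a Laplacian
# response with the Sobel gradient magnitude so that text-like high-contrast
# transitions are reinforced. Kernels are the standard textbook masks.
from math import hypot

LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_X   = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y   = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def conv_at(img, y, x, mask):
    # 3x3 correlation centred on (y, x).
    return sum(img[y + i - 1][x + j - 1] * mask[i][j]
               for i in range(3) for j in range(3))

def lsp(img):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = abs(conv_at(img, y, x, LAPLACIAN))
            gx = conv_at(img, y, x, SOBEL_X)
            gy = conv_at(img, y, x, SOBEL_Y)
            out[y][x] = lap * hypot(gx, gy)   # Laplacian-Sobel product
    return out

# A vertical dark-to-bright edge: the LSP response peaks along the transition.
img = [[0, 0, 255, 255]] * 4
print(lsp(img)[1])   # [0.0, 260100.0, 260100.0, 0.0]
```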
16. Posterior probability estimation and text candidates: (a) TPM, (b) NTPM, (c) Bayesian result, (d) text
candidates.
Boundary growing method: (a) BGM for components, (b) BGM for the first line, (c) BGM for the second line,
(d) BGM for the third line, (e) BGM for the third line and false positives, (f) BGM for false positives,
(g) false positives shown, (h) false positive elimination.
17. EXPERIMENT RESULT AND ANALYSIS
An end-to-end real-time text localization and recognition method is presented in the paper. In the first stage of the classification, the
probability of each ER being a character is estimated using novel features calculated with O(1) complexity, and only ERs with locally
maximal probability are selected for the second stage, where the classification is improved using more computationally expensive
features. It is demonstrated that with the novel gradient magnitude projection included, the ERs cover 94.8% of characters. The average run
time of the method on an 800 × 600 image is 0.3 s on a standard PC. However, a direct comparison is not possible, as the method of Wang et
al. uses a different task formulation and a different evaluation protocol.
19. CONCLUSION
In this paper, we proposed a new video scene text detection method that makes use of a new
enhancement technique based on Laplacian and Sobel operations on input images to enhance low-contrast
text pixels. A Bayesian classifier was used to classify true text pixels from the enhanced text matrix
without a priori knowledge of the input image.
Three probable text matrices and three probable non-text matrices were derived based on clustering
and the result of the enhancement method. To traverse multi-oriented text, we proposed a boundary
growing method based on the nearest-neighbor concept.
Experiments and a comparative study showed that the proposed method outperforms existing
methods on the measures used, especially on complex non-horizontal data. However, a few
problems remain in handling false positives.
We plan to extend this method to the detection of curve-shaped text lines with good recall, precision and
F-measure at low computational cost. Notwithstanding the current limitations, which we will address
in future research, the contribution of this paper lies in our continued effort to detect
multi-oriented text lines in videos, which has hitherto not been well explored by others.
20. REFERENCES
[1] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 8:679–698, 1986.
[2] X. Chen and A. L. Yuille. Detecting and reading text in natural scenes. CVPR, 2:366–373, 2004.
[3] H. Cheng, X. Jiang, Y. Sun, and J. Wang. Colour image segmentation: advances and prospects. Pattern Recognition,
34(12):2259–2281, 2001.
[4] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press,
March 2000.
[5] B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural scenes with stroke width transform. In CVPR 2010,
pages 2963–2970.
[6] J.-J. Lee, P.-H. Lee, S.-W. Lee, A. Yuille, and C. Koch. AdaBoost for text detection in natural scene. In ICDAR
2011, pages 429–434, 2011.
[7] R. Lienhart and A. Wernicke. Localizing and segmenting text in images and videos. IEEE Transactions on Circuits
and Systems for Video Technology, 12(4):256–268, 2002.
[8] H. Liu and X. Ding. Handwritten character recognition using gradient feature and quadratic classifier with multiple
discrimination schemes. In ICDAR 2005, pages 19–23, Vol. 1.