The document summarizes recent progress in techniques for extracting text objects from video documents. It discusses new and improved approaches proposed between 2003 and 2008 that have contributed to advances in the field. These include approaches that model text as hierarchical structures, use Gaussian mixture modeling of neighboring characters, exploit vertical edge features, and detect character strokes. The document also notes how former approaches have been enhanced by integrating more text characteristics and overcoming limitations.
Recent Progress in Video Text Extraction Techniques
1. Extraction of Text Objects in
Video Documents: Recent
Progress
Jing Zhang and Rangachar Kasturi
University of South Florida
Department of Computer Science and Engineering
2. Acknowledgements
The work presented here is that of numerous
researchers from around the world. We thank
them for their contributions towards the
advances in video document processing.
In particular we would like to thank the
authors of papers whose work is cited in this
presentation and in our paper.
4. Introduction
Since the 1990s, with the rapid growth of available multimedia documents and the increasing demand for information indexing and retrieval, much effort has been devoted to text extraction in images and videos.
5. Introduction
• Text Extraction in Video
– Text consists of words that are well-defined models of concepts for human communication.
– Text objects embedded in video contain much semantic
information related to the multimedia content.
– Text extraction techniques play an important role in content-
based multimedia information indexing and retrieval.
6. Introduction
Extracting text from video presents unique challenges compared with scanned documents:
Cons:
– Low contrast
– Low resolution
– Color bleeding
– Unconstrained backgrounds
– Unknown text color, size, position, orientation, and layout
Pros:
– Temporal redundancy (text in video usually persists for at least several seconds, to give human viewers the necessary time to read it)
7. Introduction
• Caption Text which is artificially superimposed on the video at the
time of editing.
• Scene Text which naturally occurs in the field of view of the camera
during video capture.
• The extraction of scene text is a much tougher task due to varying
lighting, complex movement and transformation.
[Example frames illustrating scene text and caption text.]
8. Introduction
Five stages of text extraction in video:
1) Text Detection: finding regions in a video frame that contain text;
2) Text Localization: grouping text regions into text instances and generating a
set of tight bounding boxes around all text instances;
3) Text Tracking: following a text event as it moves or changes over time and
determining the temporal and spatial locations and extents of text events;
4) Text Binarization: binarizing the text bounded by text regions and marking
text as one binary level and background as the other;
5) Text Recognition: performing OCR on the binarized text image.
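To make the data flow between these five stages concrete, here is a minimal sketch of how the pipeline might be organized in code. Every function body is a hypothetical stub, not an implementation from any cited paper; only the staging and the event structure are illustrated.

```python
# Minimal pipeline skeleton; all stage functions are hypothetical stubs.
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)

@dataclass
class TextEvent:
    event_id: int
    track: List[Tuple[int, Box]] = field(default_factory=list)  # (frame, box)
    transcript: str = ""

def detect_text(frame) -> List[Box]:                 # 1) text detection (stub)
    return []

def localize_text(regions: List[Box]) -> List[Box]:  # 2) tight boxes (stub)
    return regions

def track_text(events, t, boxes):                    # 3) temporal association (stub)
    for box in boxes:
        events.append(TextEvent(event_id=len(events), track=[(t, box)]))
    return events

def binarize_and_recognize(frames, event) -> str:    # 4) binarization + 5) OCR (stub)
    return ""

def extract_text_objects(frames) -> List[TextEvent]:
    events: List[TextEvent] = []
    for t, frame in enumerate(frames):
        boxes = localize_text(detect_text(frame))
        events = track_text(events, t, boxes)
    for event in events:
        event.transcript = binarize_and_recognize(frames, event)
    return events
```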
9. Introduction
[Flowchart: Video Clips → Text Detection → Text Localization / Text Tracking → Text Binarization → Text Recognition → Text Objects]
10. Introduction
The goal of text detection, localization, and tracking is to generate accurate bounding boxes for all text objects in video frames and to assign a unique identity to each text event, i.e., the same text object appearing in a sequence of consecutive frames.
11. Introduction
This presentation concentrates on approaches proposed for text extraction in videos over the most recent five years, summarizing and discussing recent progress in this research area.
12. Introduction
Region Based Approach utilizes the differing region properties of text and background to extract text objects.
– Bottom-up: separating the image into small regions and then grouping
character regions into text regions.
– Color features, edge features, and connected component methods
Texture Based Approach uses distinct texture properties of text to
extract text objects from background.
– Top-down: extracting texture features of the image and then locating
text regions.
– Spatial variance, Fourier transform, Wavelet transform, and machine
learning methods.
14. Recent Progress
Text extraction in video documents, as an important research
branch of content-based information retrieval and indexing,
continues to be a topic of much interest to researchers.
A large number of newly proposed approaches in the literature have contributed to impressive progress in text extraction techniques.
15. Recent Progress
Prior to 2003:
• Only a few text extraction approaches considered the temporal nature of video.
• Very little work was done on scene text.
• Objective performance evaluation metrics were scarce.
Now:
• Temporal redundancy of video is utilized by almost all recent text extraction approaches.
• Scene text extraction is being extensively studied.
• A comprehensive performance evaluation framework has been developed.
16. Recent Progress
The progress of text extraction in videos can
be categorized into three types:
• New and improved text extraction approaches
• Text extraction techniques adopted from other
research fields
• Text extraction approaches proposed for specific text types and specific genres of video documents
17. Recent Progress
• New and improved text extraction approaches:
The new and improved approaches play an important role in the recent progress of text extraction techniques for videos. These approaches introduce not only new algorithms but also new understanding of the problem.
18. Recent Progress
-New and improved text extraction approaches
H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005
A text string is modeled by its center line and the skeletons of its characters, using ridges detected at different hierarchical scales.
First line: images with rectangles showing the text regions. Second line: zoom on the text regions. Third line: ridges detected at two scales (red at the coarse level, blue at the fine level) in the text region, representing the local structure of text lines whatever the type of text.
19. H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005
• Abstract. We propose a novel approach for finding text in images by using ridges at several scales. A
text string is modelled by a ridge at a coarse scale representing its center line and numerous short
ridges at a smaller scale representing the skeletons of characters. Skeleton ridges have to satisfy
geometrical and spatial constraints such as the perpendicularity or non-parallelism to the central ridge.
In this way, we obtain a hierarchical description of text strings, which can provide direct input to an
OCR or a text analysis system. The proposed method does not depend on a particular alphabet, it
works with a wide variety in size of characters and does not depend on orientation of text string. The
experimental results show a good detection.
X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and Learning of Neighbor Characters for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008.
Abstract: This paper proposes an approach based on the statistical modeling and learning of neighboring
characters to extract multilingual texts in images. The case of three neighboring characters is represented as
the Gaussian mixture model and discriminated from other cases by the corresponding ‘pseudo-probability’
defined under Bayes framework. Based on this modeling, text extraction is completed through labeling each
connected component in the binary image as character or non-character according to its neighbors, where a
mathematical morphology based method is introduced to detect and connect the separated parts of each
character, and a Voronoi partition based method is advised to establish the neighborhoods of connected
components. We further present a discriminative training algorithm based on the maximum–minimum similarity
(MMS) criterion to estimate the parameters in the proposed text extraction approach. Experimental results in
Chinese and English text extraction demonstrate the effectiveness of our approach trained with the MMS
algorithm, which achieved the precision rate of 93.56% and the recall rate of 98.55% for the test data set. In
the experiments, we also show that the MMS provides significant improvement of overall performance,
compared with influential training criterions of the maximum likelihood (ML) and the maximum classification
error (MCE).
20. Recent Progress
-New and improved text extraction
approaches
X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and learning of Neighbor Characters
for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008.
The GMM-based algorithm models the features of three neighboring characters with a Gaussian mixture model to extract text objects.
An example of neighborhood computation: (a) a binary image, where black dots denote the centroids of connected components (CCs); (b) the Delaunay triangulation of the centroids, where each triangle corresponds to a neighbor set; the neighborhoods of characters, however, cannot be completely captured by the Delaunay triangulation alone; (c) the solution: every three nodes joined consecutively on the convex hull of the centroid set are also taken as a neighbor set.
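As an illustration of the neighbor-set construction, the following sketch computes the Delaunay triangulation of connected-component centroids with SciPy; each triangle gives one candidate set of three neighboring characters. The toy centroids are made-up values, and the paper's additional convex-hull neighbor sets are only noted in a comment.

```python
import numpy as np
from scipy.spatial import Delaunay

# Toy centroids of five connected components (made-up coordinates).
centroids = np.array([[10, 50], [30, 60], [50, 48], [70, 58], [90, 52]], float)

tri = Delaunay(centroids)
# Each simplex is a triangle over three CC centroids: one neighbor set.
neighbor_sets = [tuple(sorted(s)) for s in tri.simplices]
print(neighbor_sets)
# The paper additionally takes consecutive triples along the convex hull
# as neighbor sets, to recover neighborhoods the triangulation misses.
```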
21. Recent Progress
-New and improved text extraction
approaches
P. Dubey, Edge Based Text Detection for Multi-purpose Application, Proceedings of International
Conference Signal Processing, IEEE, Vol. 4, 2006
Only the vertical edge features are utilized to find text regions based on the
observation that vertical edges can enhance the characteristic of text and eliminate
most irrelevant information.
(a) Original image; (b) detected groups of vertical lines; (c) extracted text region; (d) final result.
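A rough sketch of this vertical-edge idea using OpenCV is shown below: vertical Sobel edges, a horizontal morphological closing to merge the dense strokes of a text line, and simple aspect-ratio filtering. The threshold values and the input file name are illustrative assumptions, not parameters from the paper.

```python
import cv2
import numpy as np

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # assumed input frame
assert img is not None, "frame.png not found"

# Vertical edges: the horizontal derivative responds to vertical strokes.
sobel = np.abs(cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3))
edges = np.uint8(255 * sobel / max(sobel.max(), 1))
_, binary = cv2.threshold(edges, 80, 255, cv2.THRESH_BINARY)

# Close horizontally so the dense vertical strokes of a text line merge.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
merged = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
candidates = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w > 2 * h and w > 20:          # text lines tend to be wide and short
        candidates.append((x, y, w, h))
```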
22. Recent Progress
-New and improved text extraction
approaches
K. Subramanian, P. Natarajan, M. Decerbo, and D. Castanon, Character-Stroke Detection for Text-Localization and Extraction, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 33-37, 2007
Character-stroke is used to extract text objects by utilizing three line scans (a
set of pixels along the horizontal line of an intensity image) to detect image
intensity changes.
(a) Original image; (b) intensity plots along the blue lines l, l-2, and l+2, where ∆ is the stroke width; (c) thresholded image (Ig ≤ 0.35); (d) the thresholded image after morphological operations and connected component analysis.
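The line-scan intuition can be sketched as follows: along one image row, find high-contrast runs and accept the location if the run widths are roughly constant, mirroring the near-constant stroke width of text. The contrast threshold and the width tolerance are illustrative assumptions.

```python
import numpy as np

def stroke_runs(row: np.ndarray, contrast: float = 40.0):
    """Return (start, width) of high-contrast runs along one scanline."""
    background = np.median(row)
    mask = np.abs(row.astype(float) - background) > contrast
    runs, start = [], None
    for x, on in enumerate(mask):
        if on and start is None:
            start = x
        elif not on and start is not None:
            runs.append((start, x - start))
            start = None
    return runs

def looks_like_text(runs, tol: float = 0.5) -> bool:
    """Several runs of similar width suggest character strokes."""
    widths = np.array([w for _, w in runs], float)
    return len(widths) >= 3 and widths.std() <= tol * widths.mean()
```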
23. P. Dubey, Edge Based Text Detection for Multi-purpose Application, Proceedings of International
Conference Signal Processing, IEEE, Vol. 4, 2006
• Abstract: Text detection plays a crucial role in various applications. In this paper we present an edge based text detection technique for complex images in multi-purpose applications. The technique applies vertical Sobel edge detection and a newly proposed morphological technique used to connect the edges to form the candidate regions. The technique has the special advantage of providing a distinguishable texture on the text area over the others. The connected components are then extracted using a proposed segmentation algorithm. Later all the candidate regions are verified to specify the text region. The proposed techniques have been tested with different types of images acquired from different input sources and environments. The experimental results show a highly successful rate.
K. Subramanian, P. Natarajan, M. Decerbo, and D. Castanon, Character-Stroke Detection for Text-Localization and Extraction, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 33-37, 2007
Abstract: In this paper, we present a new approach for analysis of images for text-localization and extraction. Our approach puts very few constraints on the font, size and color of text and is capable of handling both scene text and artificial text well. In this paper, we exploit two well-known features of text: approximately constant stroke width and local contrast, and develop a fast, simple, and effective algorithm to detect character strokes. We also show how these can be used for accurate extraction and motivate some advantages of using this approach for text localization over other colorspace segmentation based approaches. We analyze the performance of our stroke detection algorithm on images collected for the robust-reading competitions at ICDAR 2003.
24. Recent Progress
-New and improved text extraction
approaches
D. Crandall, S. Antani, R. Kasturi, Extraction of special effects caption text events from digital video,
International Journal on Document Analysis and Recognition, Vol. 5, pp. 138-157, 2003
8×8 block-wise DCT is applied to each video frame. For each block, the 19 coefficients that best correspond to the properties of text are determined empirically. The sum of the absolute values of these coefficients is computed and regarded as a measure of the “text energy” of that block. The motion vectors of MPEG-compressed videos are used for text object tracking.
(a) Original image; (b) text energy; (c) tracking result.
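A sketch of the block-DCT text-energy computation is given below. The paper selects its 19 coefficients empirically; the mid-frequency mask used here is only an assumed stand-in for that selection.

```python
import numpy as np
from scipy.fft import dctn

def text_energy_map(gray: np.ndarray) -> np.ndarray:
    """Per-8x8-block text energy: sum of |selected DCT coefficients|."""
    h, w = gray.shape
    energy = np.zeros((h // 8, w // 8))
    # Illustrative coefficient mask (mid-frequency AC coefficients); the
    # paper's empirically chosen 19-coefficient set is not reproduced here.
    mask = np.zeros((8, 8), bool)
    mask[1:4, 1:6] = True
    for by in range(h // 8):
        for bx in range(w // 8):
            block = gray[8 * by:8 * by + 8, 8 * bx:8 * bx + 8].astype(float)
            coeffs = dctn(block, norm="ortho")
            energy[by, bx] = np.abs(coeffs[mask]).sum()
    return energy
```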
25. D. Crandall, S. Antani, R. Kasturi, Extraction of special effects caption text events
from digital video, International Journal on Document Analysis and Recognition, Vol.
5, pp. 138-157, 2003
• Abstract. The popularity of digital video is increasing rapidly. To help users navigate
libraries of video, algorithms that automatically index video based on content are
needed. One approach is to extract text appearing in video, which often reflects a
scene’s semantic content. This is a difficult problem due to the unconstrained nature
of general-purpose video. Text can have arbitrary color, size, and orientation.
Backgrounds may be complex and changing. Most work so far has made restrictive
assumptions about the nature of text occurring in video. Such work is therefore not
directly applicable to unconstrained, general-purpose video. In addition, most work so
far has focused only on detecting the spatial extent of text in individual video frames.
However, text occurring in video usually persists for several seconds. This constitutes
a text event that should be entered only once in the video index. Therefore it is also
necessary to determine the temporal extent of text events. This is a non-trivial
problem because text may move, rotate, grow, shrink, or otherwise change over time.
Such text effects are common in television programs and commercials but so far
have received little attention in the literature. This paper discusses detecting,
binarizing, and tracking caption text in general-purpose MPEG-1 video. Solutions are
proposed for each of these problems and compared with existing work found in the
literature.
26. Recent Progress
-New and improved text extraction
approaches
In addition, many former text extraction approaches have been
enhanced and extended recently.
By extracting and integrating more comprehensive
characteristics of text objects, these new approaches can provide
more robust performance than previous approaches.
Besides new approaches, many improved approaches are
presented to overcome the limitations of former approaches.
27. Recent Progress
-New and improved text extraction
approaches
S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005
A color-related detector, a wavelet-based texture detector, an edge-based contour detector, and the temporal invariance principle are adopted to detect candidate caption regions. A fusion strategy then merges the results of the individual detectors.
C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
Euclidean distance based and cosine similarity based clustering methods are applied complementarily in RGB color space to partition the original image into three clusters: textual foreground, textual background, and noise.
Overview of the proposed algorithm combining color and spatial information.
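The complementary-clustering idea can be sketched with scikit-learn: Euclidean k-means on raw RGB pixels, plus a cosine-style variant obtained by clustering L2-normalized pixels so that distance depends on color direction rather than intensity. This illustrates the principle only, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_pixels(image: np.ndarray):
    """image: HxWx3 array; returns two 3-cluster label maps."""
    pixels = image.reshape(-1, 3).astype(float)

    # Euclidean k-means on raw RGB values.
    euclid = KMeans(n_clusters=3, n_init=10).fit_predict(pixels)

    # Cosine-similarity flavor: cluster unit-normalized pixels, so only
    # the color direction (hue), not the intensity, drives the grouping.
    unit = pixels / np.maximum(np.linalg.norm(pixels, axis=1, keepdims=True), 1e-6)
    cosine = KMeans(n_clusters=3, n_init=10).fit_predict(unit)

    return euclid.reshape(image.shape[:2]), cosine.reshape(image.shape[:2])
```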
28. S Lefevre, N Vincent, Caption localization in video sequences by fusion of multiple detectors,
Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp.
106-110, 2005
• Abstract: In this article, we focus on the problem of caption detection in video sequences.
Contrary to most of existing approaches based on a single detector followed by an ad hoc
and costly post-processing, we have decided to consider several detectors and to merge
their results in order to combine advantages of each one. First we made a study of
captions in video sequences to determine how they are represented in images and to
identify their main features (color constancy and background contrast, edge density and
regularity, temporal persistence). Based on these features, we then select or define the
appropriate detectors and we compare several fusion strategies which can be involved.
The logical process we have followed and the satisfying results we have obtained let us
validate our contribution.
C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
Abstract: Natural scene images have brought new challenges in recent years, and one of them is text understanding in images or videos. Text extraction, which consists of segmenting the textual foreground from the background, often succeeds using color information. Faced with the large diversity of text information in daily life and artistic ways of display, we are convinced that this information alone is no longer enough, and we present a color segmentation algorithm using spatial information. Moreover, a new method is proposed in this paper to handle uneven lighting, blur, and complex backgrounds, which are degradations inherent to natural scene images. To merge text pixels together, complementary clustering distances are used to support simultaneously clear, well-contrasted images and complex, degraded images. Tests on a public database show the efficiency of the whole proposed method.
29. Recent Progress
-New and improved text extraction
approaches
M.R. Lyu, J Song, M. Cai, A Comprehensive method for multilingual video text detection,
localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology,
Vol. 15, pp. 243-255, 2005.
The sequential multi-resolution paradigm removes the redundancy of the parallel multi-resolution paradigm: text edges detected at one resolution level are erased before the next level is processed, so no text edge can appear multiple times at different resolution levels.
Sequential multiresolution paradigm
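A sketch of the sequential paradigm follows: at each resolution level, detected text regions are erased from the edge map before the map is downsampled and passed to the next level. `find_text_boxes` is a hypothetical stand-in for the paper's per-level detection step, and the use of Canny edges here is an assumption.

```python
import cv2
import numpy as np

def find_text_boxes(edge_map):
    """Hypothetical per-level detector: returns (x, y, w, h) boxes."""
    return []

def sequential_multires_detect(gray, levels=3):
    detections = []
    edges = cv2.Canny(gray, 100, 200)
    for level in range(levels):
        for (x, y, w, h) in find_text_boxes(edges):
            # Record at original-image scale, then erase the claimed edges
            # so they cannot be detected again at a coarser level.
            scale = 2 ** level
            detections.append((x * scale, y * scale, w * scale, h * scale))
            edges[y:y + h, x:x + w] = 0
        # Downsample the *modified* edge map for the next level.
        edges = cv2.pyrDown(edges)
    return detections
```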
30. M.R. Lyu, J Song, M. Cai, A Comprehensive method for multilingual video text detection,
localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology,
Vol. 15, pp. 243-255, 2005.
• Abstract—Text in video is a very compact and accurate clue for video indexing and
summarization. Most video text detection and extraction methods hold assumptions
on text color, background contrast, and font style. Moreover, few methods can handle
multilingual text well since different languages may have quite different appearances.
This paper performs a detailed analysis of multilingual text characteristics, including
English and Chinese. Based on the analysis, we propose a comprehensive, efficient
video text detection, localization, and extraction method, which emphasizes the
multilingual capability over the whole processing. The proposed method is also robust
to various background complexities and text appearances. The text detection is
carried out by edge detection, local thresholding, and hysteresis edge recovery. The
coarse-to-fine localization scheme is then performed to identify text regions
accurately. The text extraction consists of adaptive thresholding, dam point labeling,
and inward filling. Experimental results on a large number of video images and
comparisons with other methods are reported in detail.
31. Recent Progress
-New and improved text extraction
approaches
J. Gllavata, E. Qeli and B. Freisleben, Detecting Text in Videos Using Fuzzy Clustering
Ensembles, Proceedings of the Eighth IEEE International Symposium on Multimedia, pp.
283-290, 2006.
Fuzzy C-means (FCM) based individual-frame clustering is replaced by fuzzy clustering ensemble (FCE) based multi-frame clustering to utilize temporal redundancy.
Fuzzy cluster ensemble for text detection in videos
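For reference, here is a minimal fuzzy C-means implementation, the base clusterer that the ensemble combines across frames; the consensus (FCE) stage itself is not reproduced here.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, iters=100, seed=0):
    """Minimal FCM. X: (n_samples, n_features); returns centers, memberships."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)            # memberships sum to 1 per sample
    for _ in range(iters):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-9)                  # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))            # standard FCM membership update
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u
```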
32. J. Gllavata, E. Qeli and B. Freisleben, Detecting Text in Videos Using Fuzzy Clustering
Ensembles, Proceedings of the Eighth IEEE International Symposium on Multimedia, pp.
283-290, 2006.
• Abstract: Detection and localization of text in videos is an important task
towards enabling automatic content-based retrieval of digital video
databases. However, since text is often displayed against a complex
background, its detection is a challenging problem. In this paper, a novel
approach based on fuzzy cluster ensemble techniques to solve this problem
is presented. The advantage of this approach is that the fuzzy clustering
ensemble allows the incremental inclusion of temporal information regarding
the appearance of static text in videos. Comparative experimental results for
a test set of 10.92 minutes of video sequences have shown the very good
performance of the proposed approach with an overall recall of 92.04% and
a precision of 96.71%.
33. Recent Progress
2. Text extraction techniques adopted from other research
fields:
Another encouraging progress is that more and more techniques that have been
successfully applied in other research fields have been adapted for text extraction.
Because these approaches were not initially designed for the text extraction task,
many unique characteristics of their original research fields are embedded in them
intrinsically.
Therefore, by using these approaches from other fields, we can view the text
extraction problem from the viewpoints of other related research fields and benefit
from them. It is a promising way to find good solutions for text extraction task.
34. Recent Progress
-Text extraction techniques adopted from
other research fields
K.I. Kim, K. Jung and J.H. Kim, Texture-based approach for text detection in image using support vector machine and continuously adaptive mean shift algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 12, pp. 1631-1638, 2003.
The continuously adaptive mean shift algorithm (CAMSHIFT) was initially used to
detect and track faces in a video stream.
Example of text detection using CAMSHIFT. (a) input image (540×400), (b) initial window configuration for
CAMSHIFT iteration (5×5-sized windows located at regular intervals of (25, 25)), (c) texture classified region marked
as white and gray level (white: text region, gray: non-text region), and (d) final detection result
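A small sketch of running OpenCV's CAMSHIFT on a text-probability map appears below. The probability image here is a toy blob; in the paper it would come from the per-pixel SVM texture classifier.

```python
import cv2
import numpy as np

# Toy "text likelihood" map standing in for the SVM classifier output.
prob = np.zeros((400, 540), np.uint8)
prob[180:220, 100:300] = 255

track_window = (90, 170, 40, 40)   # initial (x, y, w, h) search window
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
rot_rect, track_window = cv2.CamShift(prob, track_window, criteria)
print(track_window)                # window adapted to the high-probability blob
```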
35. Recent Progress
-Text extraction techniques adopted from
other research fields
H.B. Aradhye and G.K. Myers, Exploiting Videotext “Events” for Improved Videotext Detection,
Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE pp.
894-898, 2007.
The multiscale statistical process
control (MSSPC) was originally
proposed for detecting changes in
univariate and multivariate signals.
Substeps involved in the use of MSSPC for videotext event detection
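The temporal idea behind MSSPC can be sketched with PyWavelets: decompose a per-frame text-density signal with Haar wavelets and flag times where detail coefficients exceed a multiple of their band's deviation. The 3.5x factor follows the method description quoted later in these notes; the rest (global rather than local deviation, the index mapping) is a simplification.

```python
import numpy as np
import pywt

def detect_text_events(density, level=5, k=3.5):
    """density: per-frame text density; returns approximate event frame indices."""
    coeffs = pywt.wavedec(np.asarray(density, float), "haar", level=level)
    events = set()
    for detail in coeffs[1:]:                        # detail bands, coarse to fine
        sigma = detail.std() + 1e-9
        hits = np.nonzero(np.abs(detail) > k * sigma)[0]
        step = len(density) / max(len(detail), 1)    # coefficients are downsampled
        events.update(int(h * step) for h in hits)   # map back to frame indices
    return sorted(events)
```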
36. K.I. Kim, K. Jung and J.H. Kim, Texture-based approach for text detection in image using support vector machine and continuously adaptive mean shift algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 12, pp. 1631-1638, 2003.
• Abstract—The current paper presents a novel texture-based method for detecting texts
in images. A support vector machine (SVM) is used to analyze the textural properties of
texts. No external texture feature extraction module is used; rather, the intensities of the
raw pixels that make up the textural pattern are fed directly to the SVM, which works well
even in high-dimensional spaces. Next, text regions are identified by applying a
continuously adaptive mean shift algorithm (CAMSHIFT) to the results of the texture
analysis. The combination of CAMSHIFT and SVMs produces both robust and efficient
text detection, as time-consuming texture analyses for less relevant pixels are restricted,
leaving only a small part of the input image to be texture-analyzed.
H.B. Aradhye and G.K. Myers, Exploiting Videotext “Events” for Improved Videotext Detection, Proceedings of
Ninth International Conference on Document Analysis and Recognition, IEEE pp. 894-898, 2007.
Abstract: Text in video, whether overlay or in-scene, contains a wealth of information vital to automated
content analysis systems. However, low resolution of the imagery, coupled with richness of the
background and compression artifacts limit the detection accuracy that can be achieved in practice using
existing text detection algorithms. This paper presents a novel, noncausal temporal aggregation method
that acts as a second pass over the output of an existing text detector over the entire video clip. A
multiresolution change detection algorithm is used along the time axis to detect the appearance and
disappearance of multiple, concurrent lines of text, followed by recursive time-averaged projections on the Y and X axes. This algorithm detects and rectifies instances of missed text and enhances the spatial boundaries of detected text lines using consensus estimates. Experimental results, which demonstrate significant performance gain on publicly collected and annotated data, are presented.
37. Recent Progress
-Text extraction techniques adopted from
other research fields
D. Liu and T. Chen, Object Detection in Video with Graphical Models, Proceedings of IEEE
International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp 14-19, 2006.
Discriminative Random Fields (DRFs) were initially applied to detect man-made structures in 2D images.
(a) 2D DRF, with state si and one of its neighbors sj. (b) 3D DRF, with multiple 2D DRFs stacked over time. (c) 2D DRF-HMM type (A), with intra-frame dependencies modelled by undirected DRFs, and inter-frame dependencies modelled by HMMs. States are shared between the two models.
38. Recent Progress
-Text extraction techniques adopted from
other research fields
W. M. Pan, T. D. Bui, and C. Y. Suen, Text Segmentation from Complex Background Using
Sparse Representations, Proceedings of Ninth International Conference on Document Analysis
and Recognition, IEEE, pp. 412-416, 2007.
Sparse representation was initially used for research on the receptive fields of
simple cells.
(a) Camera-captured image; (b) foreground text generated by image decomposition via sparse representations; (c) binarized result of (b) using Otsu's method.
39. D. Liu and T. Chen, Object Detection in Video with Graphical Models, Proceedings of IEEE
International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp 14-19, 2006.
• Abstract: In this paper, we propose a general object detection framework which combines the
Hidden Markov Model with the Discriminative Random Fields. Recent object detection
algorithms have achieved impressive results by using graphical models, such as Markov
Random Field. These models, however, have only been applied to two dimensional images. In
many scenarios, video is the directly available source rather than images, hence an important
information for detecting objects has been omitted — the temporal information. To demonstrate
the importance of temporal information, we apply graphical models to the task of text detection
in video and compare the results with and without temporal information. We also show the
superiority of the proposed models over simple heuristics such as median filter over time.
W. M. Pan, T. D. Bui, and C. Y. Suen, Text Segmentation from Complex Background Using Sparse
Representations, Proceedings of Ninth International Conference on Document Analysis and Recognition,
IEEE, pp. 412-416, 2007.
Abstract: A novel text segmentation method from complex background is presented in this paper. The idea
is inspired by the recent development in searching for the sparse signal representation among a family of
over-complete atoms, which is called a dictionary. We assume that the image under investigation is
composed of two components: the foreground text and the complex background. We further assume that the
latter can be modeled as a piece-wise smooth function. Then we choose two dictionaries, where the first one
gives sparse representation to one component and nonsparse representation to another while the second
one does the opposite. By looking for the sparse representations in each dictionary, we can decompose the
image into the two composing components. After that, text segmentation can be easily achieved by applying
simple thresholding to the text component. Preliminary experiments show some promising results.
40. Recent Progress
3. Text extraction approaches proposed for specific text
types and specific genre of video documents:
Besides general text extraction approaches, an increasing number of
approaches have been proposed for specific text types.
Based on domain knowledge, these specific approaches can take advantage of the unique properties of a specific text type or video genre and often achieve better performance than general approaches.
41. Recent Progress
-Text extraction approaches proposed for
specific text types and specific genre of video
documents
W. Wu, X. Chen and J. Yang, Detection of text on road signs from video, IEEE
Transactions on Intelligent Transportation Systems, Vol. 6, pp. 378-390, 2005.
This approach is composed of
two stages:
1. localizing road signs;
2. detecting text.
Architecture of the proposed framework
42. W. Wu, X. Chen and J. Yang, Detection of text on road signs from video, IEEE
Transactions on Intelligent Transportation Systems, Vol. 6, pp. 378-390, 2005.
• Abstract—A fast and robust framework for incrementally detecting text on
road signs from video is presented in this paper. This new framework makes
two main contributions. 1) The framework applies a divide-and-conquer
strategy to decompose the original task into two subtasks, that is, the
localization of road signs and the detection of text on the signs. The
algorithms for the two subtasks are naturally incorporated into a unified
framework through a feature-based tracking algorithm. 2) The framework
provides a novel way to detect text from video by integrating two-
dimensional (2-D) image features in each video frame (e.g., color, edges,
texture) with the three-dimensional (3-D) geometric structure information of
objects extracted from video sequence (such as the vertical plane property
of road signs). The feasibility of the proposed framework has been
evaluated using 22 video sequences captured from a moving vehicle. This
new framework gives an overall text detection rate of 88.9% and a false hit
rate of 9.2%. It can easily be applied to other tasks of text detection from
video and potentially be embedded in a driver assistance system.
43. Recent Progress
-Text extraction approaches proposed for
specific text types and specific genres of video documents
C. Choudary and T. Liu, Summarization of Visual Content in Instructional Videos, IEEE Transactions on Multimedia, Vol. 9, pp. 1443-1455, 2007.
A content fluctuation curve based on the number of chalk pixels is used to measure the content in
each frame of instructional videos. The frames with enough chalk pixels are extracted as key
frames. Hausdorff-distance and connected-component decomposition are adopted to reduce the
redundancy of key frames by matching the content and mosaicking the frames.
Comparison of our summary frames with the key frames obtained using different key frame selection methods in a test video: (a) our summarization algorithm; (b) fixed sampling; (c) dynamic clustering; (d) tolerance band. Our summary frames are rich in content and more appealing.
44. C. Choudary, and T. Liu, Summarization of Visual Content in Instruction videos,
IEEE Transactions on Multimedia, Vol. 9, pp. 1443-1455, 2007.
• Abstract—In instructional videos of chalk board presentations, the visual content
refers to the text and figures written on the boards. Existing methods on video
summarization are not effective for this video domain because they are mainly based
on low-level image features such as color and edges. In this work, we present a novel
approach to summarizing the visual content in instructional videos using middle-level
features. We first develop a robust algorithm to extract content text and figures from
instructional videos by statistical modelling and clustering. This algorithm addresses
the image noise, nonuniformity of the board regions, camera movements, occlusions,
and other challenges in the instructional videos that are recorded in real classrooms.
Using the extracted text and figures as the middle level features, we retrieve a set of
key frames that contain most of the visual content. We further reduce content
redundancy and build a mosaicked summary image by matching extracted content
based on K-th Hausdorff distance and connected component decomposition.
Performance evaluation on four full-length instructional videos shows that our algorithm is highly effective in summarizing instructional video content.
45. Recent Progress
-Text extraction approaches proposed for
specific text types and specific genre of video
documents
Additional References:
• C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
• D.Q. Zhang and S.F. Chang, Learning to Detect Scene Text Using a Higher-order MRF with Belief Propagation, IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2004.
• L. Tang and J.R. Kender, A unified text extraction method for instructional videos, Proceedings of IEEE International Conference on Image Processing, Vol. 3, pp. 11-14, 2005.
• M.R. Lyu, J. Song, M. Cai, A Comprehensive method for multilingual video text detection, localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005.
• S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005.
• C.C. Lee, Y.C. Chiang, C.Y. Shih, H.M. Huang, Caption localization and detection for news videos using frequency analysis and wavelet features, Proceedings of IEEE International Conference on Tools with Artificial Intelligence, Vol. 2, pp. 539-542, 2007.
• …
47. Performance Evaluation
R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang, Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol, to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
(http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.57)
Evaluation Metrics:
Video Analysis and Content Extraction (VACE)
48. Text: Task Definition
Detection Task: Spatially locate the blocks of text in each video
frame in a video sequence
• Text blocks (objects) contain all words in a particular line of text where
the font and size are the same
Tracking Task: Spatially/temporally locate and track the text objects
in a video sequence
Recognition Task: Transcribe the words in each frame, including
their spatial location (detection implied)
49. Task Definition
Highlights
• Annotate oriented bounding rectangle around text
objects (The reference annotation was done by VideoMining Inc., State College, PA)
• Detection and Tracking task
– Line level annotation with IDs maintained
– Rules based on similarity of font, proximity and readability levels
• Recognition task
– Word Level (IDs maintained)
• Documents
– Annotation guidelines
– Evaluation protocol
• Tools
– ViPER (Annotation)
– USF-DATE (Scoring)
50. Data Resources
VIDEO DATA       NUMBER OF CLIPS   TOTAL MINS
MICRO-CORPUS     5                 10
TRAINING         50                175
TESTING          50                175
• Micro-corpus: a small amount of data that was created after
extensive discussions with the research community to act as a
seed for initial annotation experiments and to provide new
participants with a concrete sampling of the datasets and the
tasks.
51. Data Resources
These discussions were coordinated as a series of weekly
teleconferences with VACE contractors and other eminent
members of the CV community.
The discussions made the research community a partner in
the evaluations and helped us in:
1. selecting the video recordings to be used in the evaluations,
2. creating the specifications for the ground truth annotations and scoring tools,
3. defining the evaluation infrastructure for the program.
52. Data Resources
TASK                      DOMAIN
Text Detect & Track       Broadcast News (ABC & CNN*)
Face Detect & Track       Broadcast News (ABC & CNN*)
Vehicle Detect & Track    Surveillance (i-LIDS**)
MPEG–2 standard, progressive scanned at 720 × 480 resolution.
GOP (Group of Pictures) of 12 for the broadcast news corpus where
the frame-rate was 29.97 fps (frames per second) and GOP of 10
for the surveillance dataset where the frame-rate was 25 fps.
* Distributed by the Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu
** i-LIDS [Multiple Camera Tracking/Parked Vehicle Detection/Abandoned Baggage Detection] scenario datasets were developed by the UK Home
Office and CPNI. (http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imagingtechnology/video-based-detection-systems/i-lids/)
53. Reference Annotations
Text Ground Truth: Every new text area was marked with a box when it
appeared in the video. The box was moved and scaled to fit the text as it
moved in successive frames. This process was done at the text line level
until the text disappeared from the frame.
Three readability levels:
READABILITY = 0 (white): completely unreadable text
READABILITY = 1 (gray): partially readable text
READABILITY = 2 (black): clearly readable text
54. Reference Annotations
• Text regions were tagged based on a comprehensive set of rules:
• All text within a selected block must contain the same readability level and
type.
• Blocks of text must contain the same size and font.
• The bounding box should be tight to the extent that there is no space
between the box and the text.
• Text boxes may not overlap other text boxes unless the characters
themselves are superimposed atop one another.
56. Detection Metric
• The Frame Detection Accuracy (FDA) measure calculates the spatial
overlap between the ground truth and system output objects as a ratio of
the spatial intersection between the two objects and the spatial union of
them. The sum of all of the overlaps was normalized over the average of
the number of ground truth and detected objects.

Frame Detection Accuracy (FDA):

$FDA(t) = \frac{\text{Overlap Ratio}}{\left(N_G^{(t)} + N_D^{(t)}\right)/2}$, where $\text{Overlap Ratio} = \sum_{i=1}^{N_{mapped}^{(t)}} \frac{\left|G_i^{(t)} \cap D_i^{(t)}\right|}{\left|G_i^{(t)} \cup D_i^{(t)}\right|}$

$G_i$ denotes the ith ground truth object at the sequence level and $G_i^{(t)}$ denotes the ith ground truth object in frame t.
$D_i$ denotes the ith detected object at the sequence level and $D_i^{(t)}$ denotes the ith detected object in frame t.
$N_G^{(t)}$ and $N_D^{(t)}$ denote the number of ground truth objects and the number of detected objects in frame t, respectively.
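A direct implementation of the per-frame FDA from axis-aligned boxes might look like this; the ground-truth/detection mapping is assumed to be given (the protocol computes it by optimal assignment, which is omitted here).

```python
def overlap_ratio(g, d):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(g[0], d[0]), max(g[1], d[1])
    ix2, iy2 = min(g[2], d[2]), min(g[3], d[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(g) + area(d) - inter
    return inter / union if union else 0.0

def fda(gt_boxes, det_boxes, mapped):
    """mapped: list of (gt_index, det_index) pairs for this frame."""
    ratio = sum(overlap_ratio(gt_boxes[i], det_boxes[j]) for i, j in mapped)
    denom = (len(gt_boxes) + len(det_boxes)) / 2.0
    return ratio / denom if denom else 0.0

# SFDA (next slide) is then the mean of fda(...) over all frames that
# contain at least one ground truth or detected object.
```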
57. Detection Metric
• The Sequence Frame Detection Accuracy (SFDA), is essentially the
average of the FDA measure over all of the relevant frames in the
sequence.
Sequence Frame Detection Accuracy (SFDA):

$SFDA = \frac{\sum_{t=1}^{N_{frames}} FDA(t)}{\sum_{t=1}^{N_{frames}} \exists\left(N_G^{(t)} \text{ OR } N_D^{(t)}\right)}$

Range: 0 to 1 (higher is better). $N_{frames}$ is the number of frames in the sequence; the denominator counts only the frames containing at least one ground truth or detected object.
58. Tracking Metric
• The Average Tracking Accuracy (ATA) is a spatio-temporal measure
which penalizes fragmentations in both the temporal and spatial dimensions
while accounting for the number of objects detected and tracked, missed
objects, and false positives.
Sequence Track Detection Accuracy (STDA):

$STDA = \sum_{i=1}^{N_{mapped}} \frac{\sum_{t=1}^{N_{frames}} \left|G_i^{(t)} \cap D_i^{(t)}\right| / \left|G_i^{(t)} \cup D_i^{(t)}\right|}{N_{\left(G_i \cup D_i \neq \varnothing\right)}}$

Average Tracking Accuracy (ATA):

$ATA = \frac{STDA}{\left(N_G + N_D\right)/2}$

Range: 0 to 1 (higher is better).
$N_G$ and $N_D$ denote the number of unique ground truth objects and the number of unique detected objects in the given sequence, respectively; uniqueness is defined by object IDs. $N_{(G_i \cup D_i \neq \varnothing)}$ is the number of frames in which either $G_i$ or $D_i$ is present.
60. Annotation Quality
Evaluation relies on manual labeling. The degree of consistency becomes increasingly important as systems approach human levels of performance, yet a high degree of consistency is difficult to achieve with somewhat subjective attributes like readability, and humans fatigue easily when performing such tedious tasks.
To measure consistency, 10% of the entire corpus was doubly annotated by multiple annotators and checked for quality using the evaluation measures.
61. Annotation Quality
For the doubly annotated corpus:
Average Sequence Frame Detection Accuracy (SFDA), text detection: 95%
Average of the Average Tracking Accuracy (ATA), text tracking: 85%
The scores for the current state-of-the-art automatic algorithms are significantly lower than these numbers (22% relative for text detection, and 61% relative for text tracking).
62. Annotation Quality
Flowchart of Annotation Quality Control Procedure. Steps denoted by dark shaded boxes were carried
out by the annotators. Steps denoted by light shaded boxes were carried out by the evaluators.
63. Text Detection and Tracking –
VACE
[Bar chart: Mean SFDA and ATA scores for English text detection and tracking on broadcast news, for four systems (A, B, C, D); vertical axis from 0 to 1.]
64. Text Recognition
Evaluation
• Datasets: Broadcast News
• Training/Dry Run Development Set
– 5 Clips
• 14.5 minutes
• 1181 words
• Evaluation Set
– 25 Clips
• 62.5 minutes
• 4178 word objects
• 68,738 word frame instances
65. Text Recognition
Evaluation
Evaluate only the most easily readable text (to establish
a baseline at a high level of inter-annotator agreement)
• Type = graphic (no scene text)
• Readability = 2
• Logo = false
• Occlusion = false
• Ambiguous = false
— Exclude scrolling (ticker), dynamic text
(scoreboard)
• Case insensitive and punctuation ignored
67. Recognition Evaluation
Metrics
• Spatially map system output detected words to reference
words, then compare the strings for mapped words
– An unmapped word in system output incurs an Insertion (I) error
– An unmapped word in reference incurs a Deletion (D) error
– A mapped word with a character mismatch incurs a Substitution
(S) error
WER = (I + D + S) / (total # words in REF)

REF: The raven caws at midnight
SYS: raven calls at at midnight
(one deletion: “The”; one substitution: “caws” → “calls”; one insertion: the extra “at”)

WER = (1 + 1 + 1) / 5 = 3/5 (60%)
• Errors are accumulated over entire test set
• Also generate: Character Error Rate
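A word-level Levenshtein alignment suffices to compute WER; the sketch below reproduces the slide's REF/SYS example, with case folding per the evaluation conditions (punctuation stripping is omitted).

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (insertions + deletions + substitutions) / #ref words."""
    r, h = ref.lower().split(), hyp.lower().split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

print(wer("The raven caws at midnight", "raven calls at at midnight"))  # 0.6
```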
68. Individual Clip Word Error
Rate
[Plot: clip-wise WER, i.e., WER per clip normalized by the number of words in each clip, for the 25 evaluation clips; vertical axis from 0 to 1.]
69. Scores (Word Error Rate)
[Bar chart: word error rates under different normalizations (per word, per frame); vertical axis from 0 to 1.]
Overall scores: WER = 0.4233, CER = 0.2823.
71. Discussion
• The recent progress provides many promising solutions and research directions for the text extraction problem.
• Due to the large variations of text objects in videos, no
single approach can achieve satisfactory performance in
all applications.
• To further improve the performance of text extraction
techniques, much work in the area remains.
72. Discussion
Detection and Localization
– How to efficiently combine several complementary extraction algorithms for better performance, and how to extract better features by analyzing the shape of characters and the relationships between text and its background, still need more investigation.
73. Discussion
Tracking
– Although text tracking is an indispensable step for
text extraction in videos, not many text tracking
approaches have been reported in recent years.
– More effort is needed to focus on tracking, not only
for static and scrolling text, but also for dynamic text
objects (growing, shrinking, and rotating text).
74. Discussion
Datasets:
– Because most algorithms are still tested on their own datasets, a large, freely available, annotated video dataset is urgently needed so that algorithms can be compared and evaluated on a common basis.
Notes (excerpts from the cited papers):
Fig. 2 shows two computation examples of the neighborhoods. In each figure, image (a) shows a binary image, where black dots denote centroids of CCs; image (b) shows the Delaunay triangulation of the centroids, where each triangle corresponds to a neighbor set. However, the neighborhoods of characters cannot be completely reflected in the Delaunay triangulation. We consider two cases and illustrate their solutions in Fig. 2c. According to the Delaunay triangulation shown in Fig. 2b, we can get three neighbor sets, in which the neighborhood of the three neighboring characters in the middle of the text string, i.e., BCD, is ignored. In order to tackle this kind of problem, we take all three nodes which are joined one by one in the convex hull of the centroid set as neighbor sets. The solution is illustrated by dotted lines in Fig. 2c.
A smart way to combine color or graylevel variation with spatial information is to use Gabor-based filters. For this purpose, and in the context of natural scene images, we have chosen to use Log-Gabor filters, as explained in Figure 1, to get our final text cluster, which will then be fed into an OCR algorithm.
Once the text edges are detected at a resolution level, they are erased immediately from the current edge map, and the modified edge map is then used as input to the next level, so that no text edge can appear several times at different resolution levels.
A fuzzy clustering ensemble (FCE), which can merge the results of several clustering algorithms in order to improve the quality and robustness of the individual clusters, is adopted to fuse multi-frame information. For a set of consecutive frames, features are extracted from each frame by applying a wavelet transform and clustered by FCM. FCE is then employed to output an integrated frame with three clusters, “text”, “background”, and “complex background”, based on the individual clustering result of each frame. The “text” cluster is labeled by finding the smallest distance of each cluster from the ideal text features.
The text density time series is mirrored on both ends to ensure that its length is dyadic (i.e., a power of 2). It is then decomposed using Haar wavelets onto 5 scales (H1 through H5 in the figure) and the residual (L5), with dyadic downsampling at each scale. At each scale, potential change points are detected when a detection threshold (shown as a red envelope in the figure, set at 3.5 times the standard deviation in a local neighborhood, downsampled for each scale) is exceeded. Scales that exceeded the detection threshold at any given time were selected for reconstruction using the inverse wavelet transform. Videotext events were detected as points in time where the reconstructed signal exceeds an adjusted threshold.
We extend the 2D DRF to a 3D DRF as follows. We extend the neighboring structure $N_i$ of each state $s_i$ from 2D to 3D, as in Figure 1(b). We call neighbors in the same frame intra-frame neighbors, $N_i^{intra}$, and neighbors across neighboring frames inter-frame neighbors, $N_i^{inter}$. Anisotropy between inter- and intra-frame interactions is a natural requirement, since dependencies along the temporal direction should differ from those in the spatial domain; hence we define $I^{intra}(s_i, s_j, \mathbf{o}) = \beta^{intra} s_i s_j$ and $I^{inter}(s_i, s_j, \mathbf{o}) = \beta^{inter} s_i s_j$. The 3D DRF in essence collects more context than the 2D DRF. It therefore has a larger chance of correctly estimating the hidden states.
1) Discriminative point detection and clustering: detect discriminative feature points in every video frame using the algorithm proposed in [28] and partition them into clusters.
2) Road sign localization: select candidate road sign regions corresponding to clusters of feature points using a vertical plane criterion.
3) Text detection: detect text on candidate road sign areas and track them.
4) Text extraction and recognition: extract text in the candidate sign plane for recognition, given a satisfactory size.
We refer to the final disjoint key frames (including the mosaicked frames) as summary frames. We get the bounding boxes for the binary content in the summary frames and stitch them together, making a summary image of the instructional video content. We compare the performance of our summarization algorithm with three well-known key frame selection techniques, namely the fixed-rate video sampling, the tolerance band [9], and the unsupervised clustering [10] methods. The figures clearly show that our method outperforms the conventional key frame selection methods in summarizing the visual content in instructional videos. Our method performs better than the other methods in the following three aspects. First, the conventional methods are based on image dissimilarity measures, so occlusions, lighting changes, and camera movements negatively affect the resulting key frames.