The document summarizes recent progress in techniques for extracting text objects from video documents. It discusses new and improved approaches proposed between 2003 and 2008 that have contributed to advances in the field. These include approaches that model text as hierarchical structures, use Gaussian mixture modeling of neighboring characters, exploit vertical edge features, and detect character strokes. The document also notes how former approaches have been enhanced by integrating more text characteristics and overcoming limitations.
Recent Progress in Video Text Extraction Techniques
1. Extraction of Text Objects in
Video Documents: Recent
Progress
Jing Zhang and Rangachar Kasturi
University of South Florida
Department of Computer Science and Engineering
2. Acknowledgements
The work presented here is that of numerous
researchers from around the world. We thank
them for their contributions towards the
advances in video document processing.
In particular we would like to thank the
authors of papers whose work is cited in this
presentation and in our paper.
4. Introduction
Since the 1990s, with the rapid growth of available multimedia documents and the increasing demand for information indexing and retrieval, much effort has been devoted to text extraction in images and videos.
5. Introduction
• Text Extraction in Video
– Text consists of words that are well-defined models of concepts for human communication.
– Text objects embedded in video contain much semantic
information related to the multimedia content.
– Text extraction techniques play an important role in content-
based multimedia information indexing and retrieval.
6. Introduction
Extracting text from video presents unique challenges compared with scanned documents:
Cons:
– Low contrast
– Low resolution
– Color bleeding
– Unconstrained backgrounds
– Unknown text color, size, position, orientation, and layout
Pros:
– Temporal redundancy (text in video usually persists for at least several seconds, to give human viewers the necessary time to read it)
7. Introduction
• Caption Text which is artificially superimposed on the video at the
time of editing.
• Scene Text which naturally occurs in the field of view of the camera
during video capture.
• The extraction of scene text is a much tougher task due to varying
lighting, complex movement and transformation.
[Example frames illustrating scene text and caption text.]
8. Introduction
Five stages of text extraction in video:
1) Text Detection: finding regions in a video frame that contain text;
2) Text Localization: grouping text regions into text instances and generating a
set of tight bounding boxes around all text instances;
3) Text Tracking: following a text event as it moves or changes over time and
determining the temporal and spatial locations and extents of text events;
4) Text Binarization: binarizing the text bounded by text regions and marking
text as one binary level and background as the other;
5) Text Recognition: performing OCR on the binarized text image.
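To make the data flow between these five stages concrete, here is a minimal sketch of how the pipeline might be organized in code. Every function body is a hypothetical stub, not an implementation from any cited paper; only the staging and the event structure are illustrated.

```python
# Minimal pipeline skeleton; all stage functions are hypothetical stubs.
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)

@dataclass
class TextEvent:
    event_id: int
    track: List[Tuple[int, Box]] = field(default_factory=list)  # (frame, box)
    transcript: str = ""

def detect_text(frame) -> List[Box]:                 # 1) text detection (stub)
    return []

def localize_text(regions: List[Box]) -> List[Box]:  # 2) tight boxes (stub)
    return regions

def track_text(events, t, boxes):                    # 3) temporal association (stub)
    for box in boxes:
        events.append(TextEvent(event_id=len(events), track=[(t, box)]))
    return events

def binarize_and_recognize(frames, event) -> str:    # 4) binarization + 5) OCR (stub)
    return ""

def extract_text_objects(frames) -> List[TextEvent]:
    events: List[TextEvent] = []
    for t, frame in enumerate(frames):
        boxes = localize_text(detect_text(frame))
        events = track_text(events, t, boxes)
    for event in events:
        event.transcript = binarize_and_recognize(frames, event)
    return events
```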
9. Introduction
[Flowchart: Video Clips → Text Detection → Text Localization / Text Tracking → Text Binarization → Text Recognition → Text Objects]
10. Introduction
The goal of text detection, localization, and tracking is to generate accurate bounding boxes for all text objects in video frames and to assign a unique identity to each text event, i.e., the same text object appearing in a sequence of consecutive frames.
11. Introduction
This presentation concentrates on approaches proposed for text extraction in videos over the most recent five years, summarizing and discussing recent progress in this research area.
12. Introduction
Region Based Approach utilizes the differing region properties of text and background to extract text objects.
– Bottom-up: separating the image into small regions and then grouping
character regions into text regions.
– Color features, edge features, and connected component methods
Texture Based Approach uses distinct texture properties of text to
extract text objects from background.
– Top-down: extracting texture features of the image and then locating
text regions.
– Spatial variance, Fourier transform, Wavelet transform, and machine
learning methods.
14. Recent Progress
Text extraction in video documents, as an important research
branch of content-based information retrieval and indexing,
continues to be a topic of much interest to researchers.
A large number of newly proposed approaches in the literature have contributed to impressive progress in text extraction techniques.
15. Recent Progress
Prior to 2003:
• Only a few text extraction approaches considered the temporal nature of video.
• Very little work was done on scene text.
• Objective performance evaluation metrics were scarce.
Now:
• Temporal redundancy of video is utilized by almost all recent text extraction approaches.
• Scene text extraction is being extensively studied.
• A comprehensive performance evaluation framework has been developed.
16. Recent Progress
The progress of text extraction in videos can
be categorized into three types:
• New and improved text extraction approaches
• Text extraction techniques adopted from other
research fields
• Text extraction approaches proposed for specific text types and specific genres of video documents
17. Recent Progress
• New and improved text extraction approaches:
The new and improved approaches play an important role in the recent progress of text extraction techniques for videos. These approaches introduce not only new algorithms but also new understanding of the problem.
18. Recent Progress
-New and improved text extraction approaches
H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005
A text string is modeled by its center line and the skeletons of its characters, using ridges detected at different hierarchical scales.
First line: images with rectangles showing the text regions. Second line: zoom on the text regions. Third line: ridges detected at two scales (red at the coarse level, blue at the fine level) in the text region, representing the local structure of text lines whatever the type of text.
19. H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005
• Abstract. We propose a novel approach for finding text in images by using ridges at several scales. A
text string is modelled by a ridge at a coarse scale representing its center line and numerous short
ridges at a smaller scale representing the skeletons of characters. Skeleton ridges have to satisfy
geometrical and spatial constraints such as the perpendicularity or non-parallelism to the central ridge.
In this way, we obtain a hierarchical description of text strings, which can provide direct input to an
OCR or a text analysis system. The proposed method does not depend on a particular alphabet, it
works with a wide variety in size of characters and does not depend on orientation of text string. The
experimental results show a good detection.
X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and Learning of Neighbor Characters for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008.
Abstract: This paper proposes an approach based on the statistical modeling and learning of neighboring
characters to extract multilingual texts in images. The case of three neighboring characters is represented as
the Gaussian mixture model and discriminated from other cases by the corresponding ‘pseudo-probability’
defined under Bayes framework. Based on this modeling, text extraction is completed through labeling each
connected component in the binary image as character or non-character according to its neighbors, where a
mathematical morphology based method is introduced to detect and connect the separated parts of each
character, and a Voronoi partition based method is advised to establish the neighborhoods of connected
components. We further present a discriminative training algorithm based on the maximum–minimum similarity
(MMS) criterion to estimate the parameters in the proposed text extraction approach. Experimental results in
Chinese and English text extraction demonstrate the effectiveness of our approach trained with the MMS
algorithm, which achieved the precision rate of 93.56% and the recall rate of 98.55% for the test data set. In
the experiments, we also show that the MMS provides significant improvement of overall performance,
compared with influential training criterions of the maximum likelihood (ML) and the maximum classification
error (MCE).
20. Recent Progress
-New and improved text extraction
approaches
X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and learning of Neighbor Characters
for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008.
The GMM-based algorithm models the features of three neighboring characters with a Gaussian mixture model to extract text objects.
An example of neighborhood computation: (a) a binary image, where black dots denote the centroids of connected components (CCs); (b) the Delaunay triangulation of the centroids, where each triangle corresponds to a neighbor set; the neighborhoods of characters, however, cannot be completely captured by the Delaunay triangulation alone; (c) the solution: every three nodes joined consecutively on the convex hull of the centroid set are also taken as a neighbor set.
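As an illustration of the neighbor-set construction, the following sketch computes the Delaunay triangulation of connected-component centroids with SciPy; each triangle gives one candidate set of three neighboring characters. The toy centroids are made-up values, and the paper's additional convex-hull neighbor sets are only noted in a comment.

```python
import numpy as np
from scipy.spatial import Delaunay

# Toy centroids of five connected components (made-up coordinates).
centroids = np.array([[10, 50], [30, 60], [50, 48], [70, 58], [90, 52]], float)

tri = Delaunay(centroids)
# Each simplex is a triangle over three CC centroids: one neighbor set.
neighbor_sets = [tuple(sorted(s)) for s in tri.simplices]
print(neighbor_sets)
# The paper additionally takes consecutive triples along the convex hull
# as neighbor sets, to recover neighborhoods the triangulation misses.
```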
21. Recent Progress
-New and improved text extraction
approaches
P. Dubey, Edge Based Text Detection for Multi-purpose Application, Proceedings of International
Conference Signal Processing, IEEE, Vol. 4, 2006
Only the vertical edge features are utilized to find text regions based on the
observation that vertical edges can enhance the characteristic of text and eliminate
most irrelevant information.
(a) Original image; (b) detected groups of vertical lines; (c) extracted text region; (d) final result.
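A rough sketch of this vertical-edge idea using OpenCV is shown below: vertical Sobel edges, a horizontal morphological closing to merge the dense strokes of a text line, and simple aspect-ratio filtering. The threshold values and the input file name are illustrative assumptions, not parameters from the paper.

```python
import cv2
import numpy as np

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # assumed input frame
assert img is not None, "frame.png not found"

# Vertical edges: the horizontal derivative responds to vertical strokes.
sobel = np.abs(cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3))
edges = np.uint8(255 * sobel / max(sobel.max(), 1))
_, binary = cv2.threshold(edges, 80, 255, cv2.THRESH_BINARY)

# Close horizontally so the dense vertical strokes of a text line merge.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
merged = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
candidates = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w > 2 * h and w > 20:          # text lines tend to be wide and short
        candidates.append((x, y, w, h))
```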
22. Recent Progress
-New and improved text extraction
approaches
K. Subramanian, P. Natarajan, M. Decerbo, and D. Castanon, Character-Stroke Detection for Text-Localization and Extraction, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 33-37, 2007
Character-stroke is used to extract text objects by utilizing three line scans (a
set of pixels along the horizontal line of an intensity image) to detect image
intensity changes.
(a) Original image; (b) intensity plots along the blue lines l, l-2, and l+2, where ∆ is the stroke width; (c) thresholded image (Ig ≤ 0.35); (d) the thresholded image after morphological operations and connected component analysis.
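The line-scan intuition can be sketched as follows: along one image row, find high-contrast runs and accept the location if the run widths are roughly constant, mirroring the near-constant stroke width of text. The contrast threshold and the width tolerance are illustrative assumptions.

```python
import numpy as np

def stroke_runs(row: np.ndarray, contrast: float = 40.0):
    """Return (start, width) of high-contrast runs along one scanline."""
    background = np.median(row)
    mask = np.abs(row.astype(float) - background) > contrast
    runs, start = [], None
    for x, on in enumerate(mask):
        if on and start is None:
            start = x
        elif not on and start is not None:
            runs.append((start, x - start))
            start = None
    return runs

def looks_like_text(runs, tol: float = 0.5) -> bool:
    """Several runs of similar width suggest character strokes."""
    widths = np.array([w for _, w in runs], float)
    return len(widths) >= 3 and widths.std() <= tol * widths.mean()
```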
23. P. Dubey, Edge Based Text Detection for Multi-purpose Application, Proceedings of International
Conference Signal Processing, IEEE, Vol. 4, 2006
• Abstract: Text detection plays a crucial role in various applications. In this paper we present an edge based text detection technique for complex images in multi-purpose applications. The technique applies vertical Sobel edge detection and a newly proposed morphological technique used to connect the edges to form the candidate regions. The technique has the special advantage of providing a distinguishable texture on the text area over the others. The connected components are then extracted using a proposed segmentation algorithm. Later all the candidate regions are verified to specify the text region. The proposed techniques have been tested with different types of images acquired from different input sources and environments. The experimental results show a highly successful rate.
K. Subramanian, P. Natarajan, M. Decerbo, and D. Castanon, Character-Stroke Detection for Text-Localization and Extraction, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 33-37, 2007
Abstract: In this paper, we present a new approach for analysis of images for text-localization and extraction. Our approach puts very few constraints on the font, size and color of text and is capable of handling both scene text and artificial text well. In this paper, we exploit two well-known features of text: approximately constant stroke width and local contrast, and develop a fast, simple, and effective algorithm to detect character strokes. We also show how these can be used for accurate extraction and motivate some advantages of using this approach for text localization over other colorspace segmentation based approaches. We analyze the performance of our stroke detection algorithm on images collected for the robust-reading competitions at ICDAR 2003.
24. Recent Progress
-New and improved text extraction
approaches
D. Crandall, S. Antani, R. Kasturi, Extraction of special effects caption text events from digital video,
International Journal on Document Analysis and Recognition, Vol. 5, pp. 138-157, 2003
8×8 block-wise DCT is applied to each video frame. For each block, the 19 coefficients that best correspond to the properties of text are determined empirically. The sum of the absolute values of these coefficients is computed and regarded as a measure of the “text energy” of that block. The motion vectors of MPEG-compressed videos are used for text object tracking.
(a) Original image; (b) text energy; (c) tracking result.
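A sketch of the block-DCT text-energy computation is given below. The paper selects its 19 coefficients empirically; the mid-frequency mask used here is only an assumed stand-in for that selection.

```python
import numpy as np
from scipy.fft import dctn

def text_energy_map(gray: np.ndarray) -> np.ndarray:
    """Per-8x8-block text energy: sum of |selected DCT coefficients|."""
    h, w = gray.shape
    energy = np.zeros((h // 8, w // 8))
    # Illustrative coefficient mask (mid-frequency AC coefficients); the
    # paper's empirically chosen 19-coefficient set is not reproduced here.
    mask = np.zeros((8, 8), bool)
    mask[1:4, 1:6] = True
    for by in range(h // 8):
        for bx in range(w // 8):
            block = gray[8 * by:8 * by + 8, 8 * bx:8 * bx + 8].astype(float)
            coeffs = dctn(block, norm="ortho")
            energy[by, bx] = np.abs(coeffs[mask]).sum()
    return energy
```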
25. D. Crandall, S. Antani, R. Kasturi, Extraction of special effects caption text events
from digital video, International Journal on Document Analysis and Recognition, Vol.
5, pp. 138-157, 2003
• Abstract. The popularity of digital video is increasing rapidly. To help users navigate
libraries of video, algorithms that automatically index video based on content are
needed. One approach is to extract text appearing in video, which often reflects a
scene’s semantic content. This is a difficult problem due to the unconstrained nature
of general-purpose video. Text can have arbitrary color, size, and orientation.
Backgrounds may be complex and changing. Most work so far has made restrictive
assumptions about the nature of text occurring in video. Such work is therefore not
directly applicable to unconstrained, general-purpose video. In addition, most work so
far has focused only on detecting the spatial extent of text in individual video frames.
However, text occurring in video usually persists for several seconds. This constitutes
a text event that should be entered only once in the video index. Therefore it is also
necessary to determine the temporal extent of text events. This is a non-trivial
problem because text may move, rotate, grow, shrink, or otherwise change over time.
Such text effects are common in television programs and commercials but so far
have received little attention in the literature. This paper discusses detecting,
binarizing, and tracking caption text in general-purpose MPEG-1 video. Solutions are
proposed for each of these problems and compared with existing work found in the
literature.
26. Recent Progress
-New and improved text extraction
approaches
In addition, many former text extraction approaches have been
enhanced and extended recently.
By extracting and integrating more comprehensive
characteristics of text objects, these new approaches can provide
more robust performance than previous approaches.
Besides new approaches, many improved approaches are
presented to overcome the limitations of former approaches.
27. Recent Progress
-New and improved text extraction
approaches
S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005
A color-related detector, a wavelet-based texture detector, an edge-based contour detector, and the temporal invariance principle are adopted to detect candidate caption regions. A fusion strategy then merges the results of the individual detectors.
C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
Euclidean distance based and cosine similarity based clustering methods are applied complementarily in RGB color space to partition the original image into three clusters: textual foreground, textual background, and noise.
Overview of the proposed algorithm combining color and spatial information.
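The complementary-clustering idea can be sketched with scikit-learn: Euclidean k-means on raw RGB pixels, plus a cosine-style variant obtained by clustering L2-normalized pixels so that distance depends on color direction rather than intensity. This illustrates the principle only, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_pixels(image: np.ndarray):
    """image: HxWx3 array; returns two 3-cluster label maps."""
    pixels = image.reshape(-1, 3).astype(float)

    # Euclidean k-means on raw RGB values.
    euclid = KMeans(n_clusters=3, n_init=10).fit_predict(pixels)

    # Cosine-similarity flavor: cluster unit-normalized pixels, so only
    # the color direction (hue), not the intensity, drives the grouping.
    unit = pixels / np.maximum(np.linalg.norm(pixels, axis=1, keepdims=True), 1e-6)
    cosine = KMeans(n_clusters=3, n_init=10).fit_predict(unit)

    return euclid.reshape(image.shape[:2]), cosine.reshape(image.shape[:2])
```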
28. S Lefevre, N Vincent, Caption localization in video sequences by fusion of multiple detectors,
Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp.
106-110, 2005
• Abstract: In this article, we focus on the problem of caption detection in video sequences.
Contrary to most of existing approaches based on a single detector followed by an ad hoc
and costly post-processing, we have decided to consider several detectors and to merge
their results in order to combine advantages of each one. First we made a study of
captions in video sequences to determine how they are represented in images and to
identify their main features (color constancy and background contrast, edge density and
regularity, temporal persistence). Based on these features, we then select or define the
appropriate detectors and we compare several fusion strategies which can be involved.
The logical process we have followed and the satisfying results we have obtained let us
validate our contribution.
C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
Abstract: Natural scene images have brought new challenges in recent years, and one of them is text understanding in images or videos. Text extraction, which consists of segmenting the textual foreground from the background, often succeeds using color information. Faced with the large diversity of text information in daily life and artistic ways of display, we are convinced that this information alone is no longer enough, and we present a color segmentation algorithm using spatial information. Moreover, a new method is proposed in this paper to handle uneven lighting, blur, and complex backgrounds, which are degradations inherent to natural scene images. To merge text pixels together, complementary clustering distances are used to support simultaneously clear, well-contrasted images and complex, degraded images. Tests on a public database show the efficiency of the whole proposed method.
29. Recent Progress
-New and improved text extraction
approaches
M.R. Lyu, J Song, M. Cai, A Comprehensive method for multilingual video text detection,
localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology,
Vol. 15, pp. 243-255, 2005.
The sequential multi-resolution paradigm removes the redundancy of the parallel multi-resolution paradigm: text edges detected at one resolution level are erased before the next level is processed, so no text edge can appear multiple times at different resolution levels.
Sequential multiresolution paradigm
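A sketch of the sequential paradigm follows: at each resolution level, detected text regions are erased from the edge map before the map is downsampled and passed to the next level. `find_text_boxes` is a hypothetical stand-in for the paper's per-level detection step, and the use of Canny edges here is an assumption.

```python
import cv2
import numpy as np

def find_text_boxes(edge_map):
    """Hypothetical per-level detector: returns (x, y, w, h) boxes."""
    return []

def sequential_multires_detect(gray, levels=3):
    detections = []
    edges = cv2.Canny(gray, 100, 200)
    for level in range(levels):
        for (x, y, w, h) in find_text_boxes(edges):
            # Record at original-image scale, then erase the claimed edges
            # so they cannot be detected again at a coarser level.
            scale = 2 ** level
            detections.append((x * scale, y * scale, w * scale, h * scale))
            edges[y:y + h, x:x + w] = 0
        # Downsample the *modified* edge map for the next level.
        edges = cv2.pyrDown(edges)
    return detections
```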
30. M.R. Lyu, J Song, M. Cai, A Comprehensive method for multilingual video text detection,
localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology,
Vol. 15, pp. 243-255, 2005.
• Abstract—Text in video is a very compact and accurate clue for video indexing and
summarization. Most video text detection and extraction methods hold assumptions
on text color, background contrast, and font style. Moreover, few methods can handle
multilingual text well since different languages may have quite different appearances.
This paper performs a detailed analysis of multilingual text characteristics, including
English and Chinese. Based on the analysis, we propose a comprehensive, efficient
video text detection, localization, and extraction method, which emphasizes the
multilingual capability over the whole processing. The proposed method is also robust
to various background complexities and text appearances. The text detection is
carried out by edge detection, local thresholding, and hysteresis edge recovery. The
coarse-to-fine localization scheme is then performed to identify text regions
accurately. The text extraction consists of adaptive thresholding, dam point labeling,
and inward filling. Experimental results on a large number of video images and
comparisons with other methods are reported in detail.
31. Recent Progress
-New and improved text extraction
approaches
J. Gllavata, E. Qeli and B. Freisleben, Detecting Text in Videos Using Fuzzy Clustering
Ensembles, Proceedings of the Eighth IEEE International Symposium on Multimedia, pp.
283-290, 2006.
Fuzzy C-means (FCM) based individual-frame clustering is replaced by fuzzy clustering ensemble (FCE) based multi-frame clustering to utilize temporal redundancy.
Fuzzy cluster ensemble for text detection in videos
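For reference, here is a minimal fuzzy C-means implementation, the base clusterer that the ensemble combines across frames; the consensus (FCE) stage itself is not reproduced here.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, iters=100, seed=0):
    """Minimal FCM. X: (n_samples, n_features); returns centers, memberships."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)            # memberships sum to 1 per sample
    for _ in range(iters):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-9)                  # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))            # standard FCM membership update
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u
```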
32. J. Gllavata, E. Qeli and B. Freisleben, Detecting Text in Videos Using Fuzzy Clustering
Ensembles, Proceedings of the Eighth IEEE International Symposium on Multimedia, pp.
283-290, 2006.
• Abstract: Detection and localization of text in videos is an important task
towards enabling automatic content-based retrieval of digital video
databases. However, since text is often displayed against a complex
background, its detection is a challenging problem. In this paper, a novel
approach based on fuzzy cluster ensemble techniques to solve this problem
is presented. The advantage of this approach is that the fuzzy clustering
ensemble allows the incremental inclusion of temporal information regarding
the appearance of static text in videos. Comparative experimental results for
a test set of 10.92 minutes of video sequences have shown the very good
performance of the proposed approach with an overall recall of 92.04% and
a precision of 96.71%.
33. Recent Progress
2. Text extraction techniques adopted from other research
fields:
Another encouraging progress is that more and more techniques that have been
successfully applied in other research fields have been adapted for text extraction.
Because these approaches were not initially designed for the text extraction task,
many unique characteristics of their original research fields are embedded in them
intrinsically.
Therefore, by using these approaches from other fields, we can view the text
extraction problem from the viewpoints of other related research fields and benefit
from them. It is a promising way to find good solutions for text extraction task.
34. Recent Progress
-Text extraction techniques adopted from
other research fields
K.I. Kim, K. Jung and J.H. Kim, Texture-based approach for text detection in image using support vector machine and continuously adaptive mean shift algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 12, pp. 1631-1638, 2003.
The continuously adaptive mean shift algorithm (CAMSHIFT) was initially used to
detect and track faces in a video stream.
Example of text detection using CAMSHIFT. (a) input image (540×400), (b) initial window configuration for
CAMSHIFT iteration (5×5-sized windows located at regular intervals of (25, 25)), (c) texture classified region marked
as white and gray level (white: text region, gray: non-text region), and (d) final detection result
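A small sketch of running OpenCV's CAMSHIFT on a text-probability map appears below. The probability image here is a toy blob; in the paper it would come from the per-pixel SVM texture classifier.

```python
import cv2
import numpy as np

# Toy "text likelihood" map standing in for the SVM classifier output.
prob = np.zeros((400, 540), np.uint8)
prob[180:220, 100:300] = 255

track_window = (90, 170, 40, 40)   # initial (x, y, w, h) search window
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
rot_rect, track_window = cv2.CamShift(prob, track_window, criteria)
print(track_window)                # window adapted to the high-probability blob
```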
35. Recent Progress
-Text extraction techniques adopted from
other research fields
H.B. Aradhye and G.K. Myers, Exploiting Videotext “Events” for Improved Videotext Detection,
Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE pp.
894-898, 2007.
The multiscale statistical process
control (MSSPC) was originally
proposed for detecting changes in
univariate and multivariate signals.
Substeps involved in the use of MSSPC for videotext event detection
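The temporal idea behind MSSPC can be sketched with PyWavelets: decompose a per-frame text-density signal with Haar wavelets and flag times where detail coefficients exceed a multiple of their band's deviation. The 3.5x factor follows the method description quoted later in these notes; the rest (global rather than local deviation, the index mapping) is a simplification.

```python
import numpy as np
import pywt

def detect_text_events(density, level=5, k=3.5):
    """density: per-frame text density; returns approximate event frame indices."""
    coeffs = pywt.wavedec(np.asarray(density, float), "haar", level=level)
    events = set()
    for detail in coeffs[1:]:                        # detail bands, coarse to fine
        sigma = detail.std() + 1e-9
        hits = np.nonzero(np.abs(detail) > k * sigma)[0]
        step = len(density) / max(len(detail), 1)    # coefficients are downsampled
        events.update(int(h * step) for h in hits)   # map back to frame indices
    return sorted(events)
```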
36. K.I. Kim, K. Jung and J.H. Kim, Texture-based approach for text detection in image using support vector machine and continuously adaptive mean shift algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 12, pp. 1631-1638, 2003.
• Abstract—The current paper presents a novel texture-based method for detecting texts
in images. A support vector machine (SVM) is used to analyze the textural properties of
texts. No external texture feature extraction module is used; rather, the intensities of the
raw pixels that make up the textural pattern are fed directly to the SVM, which works well
even in high-dimensional spaces. Next, text regions are identified by applying a
continuously adaptive mean shift algorithm (CAMSHIFT) to the results of the texture
analysis. The combination of CAMSHIFT and SVMs produces both robust and efficient
text detection, as time-consuming texture analyses for less relevant pixels are restricted,
leaving only a small part of the input image to be texture-analyzed.
H.B. Aradhye and G.K. Myers, Exploiting Videotext “Events” for Improved Videotext Detection, Proceedings of
Ninth International Conference on Document Analysis and Recognition, IEEE pp. 894-898, 2007.
Abstract: Text in video, whether overlay or in-scene, contains a wealth of information vital to automated
content analysis systems. However, low resolution of the imagery, coupled with richness of the
background and compression artifacts limit the detection accuracy that can be achieved in practice using
existing text detection algorithms. This paper presents a novel, noncausal temporal aggregation method
that acts as a second pass over the output of an existing text detector over the entire video clip. A
multiresolution change detection algorithm is used along the time axis to detect the appearance and
disappearance of multiple, concurrent lines of text, followed by recursive time-averaged projections on the Y and X axes. This algorithm detects and rectifies instances of missed text and enhances the spatial boundaries of detected text lines using consensus estimates. Experimental results, which demonstrate significant performance gain on publicly collected and annotated data, are presented.
37. Recent Progress
-Text extraction techniques adopted from
other research fields
D. Liu and T. Chen, Object Detection in Video with Graphical Models, Proceedings of IEEE
International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp 14-19, 2006.
Discriminative Random Fields (DRFs) were initially applied to detect man-made structures in 2D images.
(a) 2D DRF, with state si and one of its neighbors sj. (b) 3D DRF, with multiple 2D DRFs stacked over time. (c) 2D DRF-HMM type (A), with intra-frame dependencies modelled by undirected DRFs, and inter-frame dependencies modelled by HMMs. States are shared between the two models.
38. Recent Progress
-Text extraction techniques adopted from
other research fields
W. M. Pan, T. D. Bui, and C. Y. Suen, Text Segmentation from Complex Background Using
Sparse Representations, Proceedings of Ninth International Conference on Document Analysis
and Recognition, IEEE, pp. 412-416, 2007.
Sparse representation was initially used for research on the receptive fields of
simple cells.
(a) Camera-captured image; (b) foreground text generated by image decomposition via sparse representations; (c) binarized result of (b) using Otsu's method.
39. D. Liu and T. Chen, Object Detection in Video with Graphical Models, Proceedings of IEEE
International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp 14-19, 2006.
• Abstract: In this paper, we propose a general object detection framework which combines the
Hidden Markov Model with the Discriminative Random Fields. Recent object detection
algorithms have achieved impressive results by using graphical models, such as Markov
Random Field. These models, however, have only been applied to two dimensional images. In
many scenarios, video is the directly available source rather than images, hence an important
information for detecting objects has been omitted — the temporal information. To demonstrate
the importance of temporal information, we apply graphical models to the task of text detection
in video and compare the results with and without temporal information. We also show the
superiority of the proposed models over simple heuristics such as median filter over time.
W. M. Pan, T. D. Bui, and C. Y. Suen, Text Segmentation from Complex Background Using Sparse
Representations, Proceedings of Ninth International Conference on Document Analysis and Recognition,
IEEE, pp. 412-416, 2007.
Abstract: A novel text segmentation method from complex background is presented in this paper. The idea
is inspired by the recent development in searching for the sparse signal representation among a family of
over-complete atoms, which is called a dictionary. We assume that the image under investigation is
composed of two components: the foreground text and the complex background. We further assume that the
latter can be modeled as a piece-wise smooth function. Then we choose two dictionaries, where the first one
gives sparse representation to one component and nonsparse representation to another while the second
one does the opposite. By looking for the sparse representations in each dictionary, we can decompose the
image into the two composing components. After that, text segmentation can be easily achieved by applying
simple thresholding to the text component. Preliminary experiments show some promising results.
40. Recent Progress
3. Text extraction approaches proposed for specific text
types and specific genre of video documents:
Besides general text extraction approaches, an increasing number of
approaches have been proposed for specific text types.
Based on domain knowledge, these specific approaches can take advantage of the unique properties of a specific text type or video genre and often achieve better performance than general approaches.
41. Recent Progress
-Text extraction approaches proposed for
specific text types and specific genre of video
documents
W. Wu, X. Chen and J. Yang, Detection of text on road signs from video, IEEE
Transactions on Intelligent Transportation Systems, Vol. 6, pp. 378-390, 2005.
This approach is composed of
two stages:
1. localizing road signs;
2. detecting text.
Architecture of the proposed framework
42. W. Wu, X. Chen and J. Yang, Detection of text on road signs from video, IEEE
Transactions on Intelligent Transportation Systems, Vol. 6, pp. 378-390, 2005.
• Abstract—A fast and robust framework for incrementally detecting text on
road signs from video is presented in this paper. This new framework makes
two main contributions. 1) The framework applies a divide-and-conquer
strategy to decompose the original task into two subtasks, that is, the
localization of road signs and the detection of text on the signs. The
algorithms for the two subtasks are naturally incorporated into a unified
framework through a feature-based tracking algorithm. 2) The framework
provides a novel way to detect text from video by integrating two-
dimensional (2-D) image features in each video frame (e.g., color, edges,
texture) with the three-dimensional (3-D) geometric structure information of
objects extracted from video sequence (such as the vertical plane property
of road signs). The feasibility of the proposed framework has been
evaluated using 22 video sequences captured from a moving vehicle. This
new framework gives an overall text detection rate of 88.9% and a false hit
rate of 9.2%. It can easily be applied to other tasks of text detection from
video and potentially be embedded in a driver assistance system.
43. Recent Progress
-Text extraction approaches proposed for
specific text types and specific genres of video documents
C. Choudary and T. Liu, Summarization of Visual Content in Instructional Videos, IEEE Transactions on Multimedia, Vol. 9, pp. 1443-1455, 2007.
A content fluctuation curve based on the number of chalk pixels is used to measure the content in
each frame of instructional videos. The frames with enough chalk pixels are extracted as key
frames. Hausdorff-distance and connected-component decomposition are adopted to reduce the
redundancy of key frames by matching the content and mosaicking the frames.
Comparison of our summary frames with the key frames obtained using different key frame selection methods in a test video: (a) our summarization algorithm; (b) fixed sampling; (c) dynamic clustering; (d) tolerance band. Our summary frames are rich in content and more appealing.
44. C. Choudary, and T. Liu, Summarization of Visual Content in Instruction videos,
IEEE Transactions on Multimedia, Vol. 9, pp. 1443-1455, 2007.
• Abstract—In instructional videos of chalk board presentations, the visual content
refers to the text and figures written on the boards. Existing methods on video
summarization are not effective for this video domain because they are mainly based
on low-level image features such as color and edges. In this work, we present a novel
approach to summarizing the visual content in instructional videos using middle-level
features. We first develop a robust algorithm to extract content text and figures from
instructional videos by statistical modelling and clustering. This algorithm addresses
the image noise, nonuniformity of the board regions, camera movements, occlusions,
and other challenges in the instructional videos that are recorded in real classrooms.
Using the extracted text and figures as the middle level features, we retrieve a set of
key frames that contain most of the visual content. We further reduce content
redundancy and build a mosaicked summary image by matching extracted content
based on K-th Hausdorff distance and connected component decomposition.
Performance evaluation on four full-length instructional videos shows that our algorithm is highly effective in summarizing instructional video content.
45. Recent Progress
-Text extraction approaches proposed for
specific text types and specific genre of video
documents
Additional References:
• C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
• D.Q. Zhang and S.F. Chang, Learning to Detect Scene Text Using a Higher-order MRF with Belief Propagation, IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2004.
• L. Tang and J.R. Kender, A unified text extraction method for instructional videos, Proceedings of IEEE International Conference on Image Processing, Vol. 3, pp. 11-14, 2005.
• M.R. Lyu, J. Song, M. Cai, A Comprehensive method for multilingual video text detection, localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005.
• S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005.
• C.C. Lee, Y.C. Chiang, C.Y. Shih, H.M. Huang, Caption localization and detection for news videos using frequency analysis and wavelet features, Proceedings of IEEE International Conference on Tools with Artificial Intelligence, Vol. 2, pp. 539-542, 2007.
• …
47. Performance Evaluation
R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang, Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol, to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
(http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.57)
Evaluation Metrics:
Video Analysis and Content Extraction (VACE)
48. Text: Task Definition
Detection Task: Spatially locate the blocks of text in each video
frame in a video sequence
• Text blocks (objects) contain all words in a particular line of text where
the font and size are the same
Tracking Task: Spatially/temporally locate and track the text objects
in a video sequence
Recognition Task: Transcribe the words in each frame, including
their spatial location (detection implied)
49. Task Definition
Highlights
• Annotate oriented bounding rectangle around text
objects (The reference annotation was done by VideoMining Inc., State College, PA)
• Detection and Tracking task
– Line level annotation with IDs maintained
– Rules based on similarity of font, proximity and readability levels
• Recognition task
– Word Level (IDs maintained)
• Documents
– Annotation guidelines
– Evaluation protocol
• Tools
– ViPER (Annotation)
– USF-DATE (Scoring)
50. Data Resources
VIDEO DATA       NUMBER OF CLIPS   TOTAL MINS
MICRO-CORPUS     5                 10
TRAINING         50                175
TESTING          50                175
• Micro-corpus: a small amount of data that was created after
extensive discussions with the research community to act as a
seed for initial annotation experiments and to provide new
participants with a concrete sampling of the datasets and the
tasks.
51. Data Resources
These discussions were coordinated as a series of weekly
teleconferences with VACE contractors and other eminent
members of the CV community.
The discussions made the research community a partner in
the evaluations and helped us in:
1. selecting the video recordings to be used in the evaluations,
2. creating the specifications for the ground truth annotations and scoring tools,
3. defining the evaluation infrastructure for the program.
52. Data Resources
TASK                      DOMAIN
Text Detect & Track       Broadcast News (ABC & CNN*)
Face Detect & Track       Broadcast News (ABC & CNN*)
Vehicle Detect & Track    Surveillance (i-LIDS**)
MPEG–2 standard, progressive scanned at 720 × 480 resolution.
GOP (Group of Pictures) of 12 for the broadcast news corpus where
the frame-rate was 29.97 fps (frames per second) and GOP of 10
for the surveillance dataset where the frame-rate was 25 fps.
* Distributed by the Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu
** i-LIDS [Multiple Camera Tracking/Parked Vehicle Detection/Abandoned Baggage Detection] scenario datasets were developed by the UK Home
Office and CPNI. (http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imagingtechnology/video-based-detection-systems/i-lids/)
53. Reference Annotations
Text Ground Truth: Every new text area was marked with a box when it
appeared in the video. The box was moved and scaled to fit the text as it
moved in successive frames. This process was done at the text line level
until the text disappeared from the frame.
Three readability levels:
READABILITY = 0 (white): completely unreadable text
READABILITY = 1 (gray): partially readable text
READABILITY = 2 (black): clearly readable text
54. Reference Annotations
• Text regions were tagged based on a comprehensive set of rules:
• All text within a selected block must contain the same readability level and
type.
• Blocks of text must contain the same size and font.
• The bounding box should be tight to the extent that there is no space
between the box and the text.
• Text boxes may not overlap other text boxes unless the characters
themselves are superimposed atop one another.
56. Detection Metric
• The Frame Detection Accuracy (FDA) measure calculates the spatial
overlap between the ground truth and system output objects as a ratio of
the spatial intersection between the two objects and the spatial union of
them. The sum of all of the overlaps was normalized over the average of
the number of ground truth and detected objects.

Frame Detection Accuracy (FDA):

$FDA(t) = \frac{\text{Overlap Ratio}}{\left(N_G^{(t)} + N_D^{(t)}\right)/2}$, where $\text{Overlap Ratio} = \sum_{i=1}^{N_{mapped}^{(t)}} \frac{\left|G_i^{(t)} \cap D_i^{(t)}\right|}{\left|G_i^{(t)} \cup D_i^{(t)}\right|}$

$G_i$ denotes the ith ground truth object at the sequence level and $G_i^{(t)}$ denotes the ith ground truth object in frame t.
$D_i$ denotes the ith detected object at the sequence level and $D_i^{(t)}$ denotes the ith detected object in frame t.
$N_G^{(t)}$ and $N_D^{(t)}$ denote the number of ground truth objects and the number of detected objects in frame t, respectively.
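A direct implementation of the per-frame FDA from axis-aligned boxes might look like this; the ground-truth/detection mapping is assumed to be given (the protocol computes it by optimal assignment, which is omitted here).

```python
def overlap_ratio(g, d):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(g[0], d[0]), max(g[1], d[1])
    ix2, iy2 = min(g[2], d[2]), min(g[3], d[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(g) + area(d) - inter
    return inter / union if union else 0.0

def fda(gt_boxes, det_boxes, mapped):
    """mapped: list of (gt_index, det_index) pairs for this frame."""
    ratio = sum(overlap_ratio(gt_boxes[i], det_boxes[j]) for i, j in mapped)
    denom = (len(gt_boxes) + len(det_boxes)) / 2.0
    return ratio / denom if denom else 0.0

# SFDA (next slide) is then the mean of fda(...) over all frames that
# contain at least one ground truth or detected object.
```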
57. Detection Metric
• The Sequence Frame Detection Accuracy (SFDA), is essentially the
average of the FDA measure over all of the relevant frames in the
sequence.
Sequence Frame Detection Accuracy (SFDA):

$SFDA = \frac{\sum_{t=1}^{N_{frames}} FDA(t)}{\sum_{t=1}^{N_{frames}} \exists\left(N_G^{(t)} \text{ OR } N_D^{(t)}\right)}$

Range: 0 to 1 (higher is better). $N_{frames}$ is the number of frames in the sequence; the denominator counts only the frames containing at least one ground truth or detected object.
58. Tracking Metric
• The Average Tracking Accuracy (ATA) is a spatio-temporal measure
which penalizes fragmentations in both the temporal and spatial dimensions
while accounting for the number of objects detected and tracked, missed
objects, and false positives.
Sequence Track Detection Accuracy (STDA):

$STDA = \sum_{i=1}^{N_{mapped}} \frac{\sum_{t=1}^{N_{frames}} \left|G_i^{(t)} \cap D_i^{(t)}\right| / \left|G_i^{(t)} \cup D_i^{(t)}\right|}{N_{\left(G_i \cup D_i \neq \varnothing\right)}}$

Average Tracking Accuracy (ATA):

$ATA = \frac{STDA}{\left(N_G + N_D\right)/2}$

Range: 0 to 1 (higher is better).
$N_G$ and $N_D$ denote the number of unique ground truth objects and the number of unique detected objects in the given sequence, respectively; uniqueness is defined by object IDs. $N_{(G_i \cup D_i \neq \varnothing)}$ is the number of frames in which either $G_i$ or $D_i$ is present.
60. Annotation Quality
Evaluation relies on manual labeling. The degree of consistency becomes increasingly important as systems approach human levels of performance, yet a high degree of consistency is difficult to achieve with somewhat subjective attributes like readability, and humans fatigue easily when performing such tedious tasks.
To measure consistency, 10% of the entire corpus was doubly annotated by multiple annotators and checked for quality using the evaluation measures.
61. Annotation Quality
For the doubly annotated corpus:
Average Sequence Frame Detection Accuracy (SFDA), text detection: 95%
Average of the Average Tracking Accuracy (ATA), text tracking: 85%
The scores for the current state-of-the-art automatic algorithms are significantly lower than these numbers (22% relative for text detection, and 61% relative for text tracking).
62. Annotation Quality
Flowchart of Annotation Quality Control Procedure. Steps denoted by dark shaded boxes were carried
out by the annotators. Steps denoted by light shaded boxes were carried out by the evaluators.
63. Text Detection and Tracking –
VACE
[Bar chart: Mean SFDA and ATA scores for English text detection and tracking on broadcast news, for four systems (A, B, C, D); vertical axis from 0 to 1.]
64. Text Recognition
Evaluation
• Datasets: Broadcast News
• Training/Dry Run Development Set
– 5 Clips
• 14.5 minutes
• 1181 words
• Evaluation Set
– 25 Clips
• 62.5 minutes
• 4178 word objects
• 68,738 word frame instances
65. Text Recognition
Evaluation
Evaluate only the most easily readable text (to establish
a baseline at a high level of inter-annotator agreement)
• Type = graphic (no scene text)
• Readability = 2
• Logo = false
• Occlusion = false
• Ambiguous = false
— Exclude scrolling (ticker), dynamic text
(scoreboard)
• Case insensitive and punctuation ignored
67. Recognition Evaluation
Metrics
• Spatially map system output detected words to reference
words, then compare the strings for mapped words
– An unmapped word in system output incurs an Insertion (I) error
– An unmapped word in reference incurs a Deletion (D) error
– A mapped word with a character mismatch incurs a Substitution
(S) error
WER = (I + D + S) / (total # words in REF)

REF: The raven caws at midnight
SYS: raven calls at at midnight
(one deletion: “The”; one substitution: “caws” → “calls”; one insertion: the extra “at”)

WER = (1 + 1 + 1) / 5 = 3/5 (60%)
• Errors are accumulated over entire test set
• Also generate: Character Error Rate
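A word-level Levenshtein alignment suffices to compute WER; the sketch below reproduces the slide's REF/SYS example, with case folding per the evaluation conditions (punctuation stripping is omitted).

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (insertions + deletions + substitutions) / #ref words."""
    r, h = ref.lower().split(), hyp.lower().split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

print(wer("The raven caws at midnight", "raven calls at at midnight"))  # 0.6
```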
68. Individual Clip Word Error
Rate
[Plot: clip-wise WER, i.e., WER per clip normalized by the number of words in each clip, for the 25 evaluation clips; vertical axis from 0 to 1.]
69. Scores (Word Error Rate)
[Bar chart: word error rates under different normalizations (per word, per frame); vertical axis from 0 to 1.]
Overall scores: WER = 0.4233, CER = 0.2823.
71. Discussion
• The recent progress provides many promising solutions and research directions for the text extraction problem.
• Due to the large variations of text objects in videos, no
single approach can achieve satisfactory performance in
all applications.
• To further improve the performance of text extraction
techniques, much work in the area remains.
72. Discussion
Detection and Localization
– How to efficiently combine several complementary extraction algorithms for better performance, and how to extract better features by analyzing the shape of characters and the relationships between text and its background, still need more investigation.
73. Discussion
Tracking
– Although text tracking is an indispensable step for
text extraction in videos, not many text tracking
approaches have been reported in recent years.
– More effort is needed to focus on tracking, not only
for static and scrolling text, but also for dynamic text
objects (growing, shrinking, and rotating text).
74. Discussion
Datasets:
– Because most algorithms are still tested on their own datasets, a large, freely available, annotated video dataset is urgently needed so that algorithms can be compared and evaluated on a common basis.
Notes (excerpts from the cited papers):
Fig. 2 shows two computation examples of the neighborhoods. In each figure, image (a) shows a binary image, where black dots denote centroids of CCs; image (b) shows the Delaunay triangulation of the centroids, where each triangle corresponds to a neighbor set. However, the neighborhoods of characters cannot be completely reflected in the Delaunay triangulation. We consider two cases and illustrate their solutions in Fig. 2c. According to the Delaunay triangulation shown in Fig. 2b, we can get three neighbor sets, in which the neighborhood of the three neighboring characters in the middle of the text string, i.e., BCD, is ignored. In order to tackle this kind of problem, we take all three nodes which are joined one by one in the convex hull of the centroid set as neighbor sets. The solution is illustrated by dotted lines in Fig. 2c.
A smart way to combine color or graylevel variation with spatial information is to use Gabor-based filters. For this purpose, and in the context of natural scene images, we have chosen to use Log-Gabor filters, as explained in Figure 1, to get our final text cluster, which will then be fed into an OCR algorithm.
Once the text edges are detected at a resolution level, they are erased immediately from the current edge map, and the modified edge map is then used as input to the next level, so that no text edge can appear several times at different resolution levels.
A fuzzy clustering ensemble (FCE), which can merge the results of several clustering algorithms in order to improve the quality and robustness of the individual clusters, is adopted to fuse multi-frame information. For a set of consecutive frames, features are extracted from each frame by applying a wavelet transform and clustered by FCM. FCE is then employed to output an integrated frame with three clusters, “text”, “background”, and “complex background”, based on the individual clustering result of each frame. The “text” cluster is labeled by finding the smallest distance of each cluster from the ideal text features.
The text density time series is mirrored on both ends to ensure that its length is dyadic (i.e., a power of 2). It is then decomposed using Haar wavelets onto 5 scales (H1 through H5 in the figure) and the residual (L5), with dyadic downsampling at each scale. At each scale, potential change points are detected when a detection threshold (shown as a red envelope in the figure, set at 3.5 times the standard deviation in a local neighborhood, downsampled for each scale) is exceeded. Scales that exceeded the detection threshold at any given time were selected for reconstruction using the inverse wavelet transform. Videotext events were detected as points in time where the reconstructed signal exceeds an adjusted threshold.
We extend the 2D DRF to a 3D DRF as follows. We extend the neighboring structure $N_i$ of each state $s_i$ from 2D to 3D, as in Figure 1(b). We call neighbors in the same frame intra-frame neighbors, $N_i^{intra}$, and neighbors across neighboring frames inter-frame neighbors, $N_i^{inter}$. Anisotropy between inter- and intra-frame interactions is a natural requirement, since dependencies along the temporal direction should differ from those in the spatial domain; hence we define $I^{intra}(s_i, s_j, \mathbf{o}) = \beta^{intra} s_i s_j$ and $I^{inter}(s_i, s_j, \mathbf{o}) = \beta^{inter} s_i s_j$. The 3D DRF in essence collects more context than the 2D DRF. It therefore has a larger chance of correctly estimating the hidden states.
1) Discriminative point detection and clustering: detect discriminative feature points in every video frame using the algorithm proposed in [28] and partition them into clusters.
2) Road sign localization: select candidate road sign regions corresponding to clusters of feature points using a vertical plane criterion.
3) Text detection: detect text on candidate road sign areas and track them.
4) Text extraction and recognition: extract text in the candidate sign plane for recognition, given a satisfactory size.
We refer to the final disjoint key frames (including the mosaicked frames) as summary frames. We get the bounding boxes for the binary content in the summary frames and stitch them together, making a summary image of the instructional video content. We compare the performance of our summarization algorithm with three well-known key frame selection techniques, namely the fixed-rate video sampling, the tolerance band [9], and the unsupervised clustering [10] methods. The figures clearly show that our method outperforms the conventional key frame selection methods in summarizing the visual content in instructional videos. Our method performs better than the other methods in the following three aspects. First, the conventional methods are based on image dissimilarity measures, so occlusions, lighting changes, and camera movements negatively affect the resulting key frames.