Figs. 2 show two computation examples of the neighborhoods. In each figure, image (a) shows a binary image, where black dots denote the centroids of connected components (CCs), and image (b) shows the Delaunay triangulation of the centroids, where each triangle corresponds to a neighbor set. However, the neighborhoods of characters cannot be completely reflected in the Delaunay triangulation. We consider two cases and illustrate their solutions in Figs. 2c. According to the Delaunay triangulation shown in Fig. 2b, we get three neighbor sets, in which the neighborhood of the three neighboring characters in the middle of the text string, i.e. BCD, is ignored. To tackle this kind of problem, we also take as neighbor sets every three nodes that are joined consecutively on the convex hull of the centroid set. The solution is illustrated by dotted lines in Fig. 2c.
A smart way to combine color or gray-level variation with spatial information is to use a Gabor-based filter. For this purpose, and in the context of natural scene images, we have chosen to use Log-Gabor filters, as explained in Figure 1, to get our final text cluster, which is then fed into an OCR algorithm.
Once the text edges are detected at a given resolution level, they are erased immediately from the current edge map, and the modified edge map is then used as input to the next level, so that no text edge can appear several times at different resolution levels.
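The erase-then-continue loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the edge map is modeled as a plain set, and `detect_text_edges` is a hypothetical callback standing in for whatever per-level detector is used.

```python
def sequential_multiresolution(edge_map, levels, detect_text_edges):
    """Run a detector level by level, erasing detected edges between levels.

    edge_map          -- iterable of hashable edge descriptors (assumed representation)
    levels            -- resolution levels, in processing order
    detect_text_edges -- hypothetical callback: (remaining_edges, level) -> set of edges
    """
    remaining = set(edge_map)
    detected = {}
    for level in levels:
        # Only edges still present in the map can be reported at this level.
        found = set(detect_text_edges(remaining, level)) & remaining
        detected[level] = found
        # Erase them so that later levels never see them again.
        remaining -= found
    return detected, remaining


if __name__ == "__main__":
    # Toy data: (name, stroke length); the toy detector accepts short-enough edges.
    edges = {("a", 3), ("b", 6), ("c", 12), ("d", 40)}
    toy = lambda edge_set, level: {e for e in edge_set if e[1] <= level}
    print(sequential_multiresolution(edges, [4, 8, 16], toy))
```

Because each level only sees what earlier levels left behind, the per-level result sets are disjoint by construction, which is the redundancy-removal property the text describes.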
A fuzzy clustering ensemble (FCE), which merges the results of several clustering algorithms in order to improve the quality and robustness of the individual clusters, is adopted to fuse multi-frame information. For a set of consecutive frames, features are extracted from each frame by applying a wavelet transform and are clustered by fuzzy C-means (FCM). FCE is then employed to output an integrated frame with three clusters, “text”, “background” and “complex background”, based on the individual clustering result of each frame. The “text” cluster is identified as the cluster with the smallest distance to the ideal text features.
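The final labeling step can be illustrated with a short sketch. It assumes Euclidean distance between cluster centers and an ideal text feature vector; the source only says "smallest distance", so the metric, the function name and the example values are all assumptions.

```python
import math


def label_text_cluster(cluster_centers, ideal_text_features):
    """Return the index of the cluster center closest to the ideal text features.

    Euclidean distance is an assumption; the source does not name the metric.
    """
    distances = [math.dist(center, ideal_text_features) for center in cluster_centers]
    return min(range(len(cluster_centers)), key=distances.__getitem__)


if __name__ == "__main__":
    # Three hypothetical 2-D cluster centers and a hypothetical ideal text vector.
    centers = [(0.1, 0.2), (0.9, 0.8), (0.5, 0.5)]
    print(label_text_cluster(centers, (1.0, 1.0)))
```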
The text density time series is mirrored on both ends to ensure that its length is dyadic (i.e., a power of 2). It is then decomposed using Haar wavelets into 5 scales (H1 through H5 in Figure) and the residual (L5), with dyadic downsampling at each scale. At each scale, potential change points are detected when a detection threshold is exceeded (shown as a red envelope in Figure; it is set at 3.5 times the standard deviation in a local neighborhood, downsampled for each scale). Scales that exceed the detection threshold at any given time are selected for reconstruction using the inverse wavelet transform. Videotext events are detected as points in time where the reconstructed signal exceeds an adjusted threshold.
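The first steps of this procedure (mirroring, Haar decomposition, per-scale thresholding) can be sketched in plain Python. This is a simplified illustration under stated assumptions: the threshold uses one standard deviation per scale rather than a local neighborhood, the reconstruction and adjusted-threshold steps are omitted, and all function names are hypothetical.

```python
import math
import statistics


def mirror_to_dyadic(series):
    """Mirror-pad both ends so that the length becomes a power of two."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two samples")
    target = 1 << (n - 1).bit_length()
    pad = target - n
    left, right = pad // 2, pad - pad // 2
    head = series[left:0:-1]           # reflection of the first samples
    tail = series[-2:-right - 2:-1]    # reflection of the last samples
    return list(head) + list(series) + list(tail)


def haar_step(signal):
    """One Haar analysis step with dyadic downsampling."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def detect_change_points(series, levels=5, k=3.5):
    """Return {scale: indices of detail coefficients exceeding k * std}.

    Assumption: a single (global) standard deviation per scale; the source
    uses a local neighborhood instead.
    """
    approx = mirror_to_dyadic(series)
    flagged = {}
    for level in range(1, levels + 1):
        if len(approx) < 2:
            break
        approx, detail = haar_step(approx)
        limit = k * statistics.pstdev(detail)
        flagged[level] = [i for i, d in enumerate(detail) if abs(d) > limit]
    return flagged


if __name__ == "__main__":
    density = [0.0] * 21 + [1.0] * 11   # text appears at frame 21
    print(detect_change_points(density)[1])
```

With a global estimate, coarse scales that contain only a handful of coefficients are unreliable, which is one reason a local-neighborhood estimate, as in the source, is the more sensible choice in practice.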
We extend the 2D DRF to a 3D DRF as follows. We extend the neighboring structure $N_i$ of each state $s_i$ from 2D to 3D, as in Figure 1 (b). We call neighbors in the same frame intra-frame neighbors, $N_i^{\mathrm{intra}}$, and neighbors across neighboring frames inter-frame neighbors, $N_i^{\mathrm{inter}}$. Anisotropy between inter- and intra-frame interactions is a natural requirement, since dependencies along the temporal direction should differ from those in the spatial domain; hence we define $I^{\mathrm{intra}}(s_i, s_j, o) = \beta^{\mathrm{intra}} s_i s_j$ and $I^{\mathrm{inter}}(s_i, s_j, o) = \beta^{\mathrm{inter}} s_i s_j$. The 3D DRF in essence collects more context than the 2D DRF. It therefore has a better chance of correctly estimating the hidden states.
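To make the two neighbor types concrete, here is a small sketch that enumerates them on a frame stack and sums the pairwise interaction terms for one site. It assumes a 4-connected spatial neighborhood, temporal neighbors at the same pixel position in the previous and next frame, and states in {-1, +1}; none of these details are stated in the text above, and the function names are hypothetical.

```python
def neighbors_3d(t, y, x, shape):
    """Return (intra_frame, inter_frame) neighbor coordinates of site (t, y, x).

    shape is (frames, height, width). A 4-connected spatial neighborhood and
    same-position temporal neighbors are assumptions of this sketch.
    """
    frames, height, width = shape
    intra = [(t, y + dy, x + dx)
             for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
             if 0 <= y + dy < height and 0 <= x + dx < width]
    inter = [(t + dt, y, x) for dt in (-1, 1) if 0 <= t + dt < frames]
    return intra, inter


def interaction_sum(states, site, beta_intra, beta_inter):
    """Sum beta * s_i * s_j over intra- and inter-frame neighbors of one site.

    states is a nested list indexed as states[t][y][x] with values in {-1, +1}.
    """
    shape = (len(states), len(states[0]), len(states[0][0]))
    t, y, x = site
    s_i = states[t][y][x]
    intra, inter = neighbors_3d(t, y, x, shape)
    total = sum(beta_intra * s_i * states[a][b][c] for a, b, c in intra)
    total += sum(beta_inter * s_i * states[a][b][c] for a, b, c in inter)
    return total


if __name__ == "__main__":
    stack = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]   # 2 frames of 2x2 "text" states
    print(interaction_sum(stack, (0, 0, 0), beta_intra=0.5, beta_inter=0.25))
```

Keeping separate weights for the two neighbor lists is what lets temporal smoothing be tuned independently of spatial smoothing.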
1) Discriminative points detection and clustering—detect discriminative feature points in every video frame using the algorithm proposed in  and partition them into clusters. 2) Road sign localization—select candidate road sign regions corresponding to clusters of feature points using a vertical plane criterion. 3) Text detection—detect text on candidate road sign areas and track them. 4) Text extraction and recognition—extract text in candidate sign plane for recognition given a satisfactory size.
We refer to the final disjoint key frames (including the mosaicked frames) as summary frames. We get the bounding boxes for the binary content in the summary frames and stitch them together, making a summary image of the instructional video content. We compare the performance of our summarization algorithm with three well-known key frame selection techniques, namely fixed-rate video sampling, the tolerance band method, and unsupervised clustering. The figures clearly show that our method outperforms the conventional key frame selection methods in summarizing the visual content of instructional videos. Our method performs better than the other methods in the following three aspects. First, the conventional methods are based on image dissimilarity measures, so occlusions, lighting changes, and camera movements negatively affect the resulting key frames.
Note that N_G and N_D are the numbers of unique ground-truth objects and unique system-output objects in the video sequence, where uniqueness is determined by object ID.
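As a small illustration of how these two counts could be obtained, the sketch below counts distinct IDs across all frames of one sequence. The per-frame list-of-IDs representation and the function name are assumptions made for the example.

```python
def count_unique_objects(frames):
    """Count distinct object IDs over a sequence.

    frames is an iterable of per-frame ID collections (assumed layout).
    An object that persists over many frames is counted once.
    """
    return len({object_id for frame in frames for object_id in frame})


if __name__ == "__main__":
    ground_truth = [[1, 2], [2, 3], [3]]       # hypothetical IDs per frame
    system_output = [[10], [10, 11], []]
    n_g = count_unique_objects(ground_truth)
    n_d = count_unique_objects(system_output)
    print(n_g, n_d)
```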
Each clip is treated independently.
Extraction of Text Objects in Video Documents: Recent Progress Jing Zhang and Rangachar Kasturi University of South Florida Department of Computer Science and Engineering
Acknowledgements
The work presented here is that of numerous researchers from around the world. We thank them for their contributions towards the advances in video document processing. In particular, we would like to thank the authors of papers whose work is cited in this presentation and in our paper.
Introduction Since the 1990s, with the rapid growth of available multimedia documents and increasing demand for information indexing and retrieval, much effort has been devoted to text extraction in images and videos.
Introduction
• Text Extraction in Video
– Text consists of words that are well-defined models of concepts for human communication.
– Text objects embedded in video contain much semantic information related to the multimedia content.
– Text extraction techniques play an important role in content-based multimedia information indexing and retrieval.
Introduction
Extracting text in video presents unique challenges compared with scanned documents:
Cons:
• Low contrast
• Low resolution
• Color bleeding
• Unconstrained backgrounds
• Unknown text color, size, position, orientation, and layout
Pros:
• Temporal redundancy (text in video usually persists for at least several seconds, to give human viewers the necessary time to read it)
Introduction
• Caption Text, which is artificially superimposed on the video at the time of editing.
• Scene Text, which naturally occurs in the field of view of the camera during video capture.
• The extraction of scene text is a much tougher task due to varying lighting, complex movement and transformation.
(Figure panels: Scene Text; Caption Text)
Introduction
Five stages of text extraction in video:
1) Text Detection: finding regions in a video frame that contain text;
2) Text Localization: grouping text regions into text instances and generating a set of tight bounding boxes around all text instances;
3) Text Tracking: following a text event as it moves or changes over time and determining the temporal and spatial locations and extents of text events;
4) Text Binarization: binarizing the text bounded by text regions and marking text as one binary level and background as the other;
5) Text Recognition: performing OCR on the binarized text image.
Introduction Video Clips → Text Detection → Text Localization → Text Tracking → Text Binarization → Text Recognition → Text Objects
Introduction The goal of text detection, text localization and text tracking is to generate accurate bounding boxes for all text objects in video frames and to assign a unique identity to each text event, which is composed of the same text object appearing in a sequence of consecutive frames.
Introduction This presentation mainly concentrates on the approaches proposed for text extraction in videos in the most recent five years, to summarize and discuss the recent progress in this research area.
Introduction
Region Based Approach: utilizes the different region properties between text and background to extract text objects.
– Bottom-up: separating the image into small regions and then grouping character regions into text regions.
– Color features, edge features, and connected component methods.
Texture Based Approach: uses distinct texture properties of text to extract text objects from background.
– Top-down: extracting texture features of the image and then locating text regions.
– Spatial variance, Fourier transform, wavelet transform, and machine learning methods.
Recent Progress Text extraction in video documents, as an important research branch of content-based information retrieval and indexing, continues to be a topic of much interest to researchers. A large number of newly proposed approaches in the literature have contributed to impressive progress in text extraction techniques.
Recent Progress
Prior to 2003:
• Only a few text extraction approaches considered the temporal nature of video.
• Very little work was done on scene text.
• Objective performance evaluation metrics were scarce.
Now:
• Temporal redundancy of video is utilized by almost all recent text extraction approaches.
• Scene text extraction is being extensively studied.
• A comprehensive performance evaluation framework has been developed.
Recent Progress
The progress of text extraction in videos can be categorized into three types:
• New and improved text extraction approaches
• Text extraction techniques adopted from other research fields
• Text extraction approaches proposed for specific text types and specific genres of video documents
Recent Progress
• New and improved text extraction approaches: The new and improved approaches play an important role in the recent progress of text extraction techniques for videos. These new approaches introduce not only new algorithms but also a new understanding of the problem.
Recent Progress - New and improved text extraction approaches
H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005
A text string is modeled as its center line and the skeletons of characters by ridges at different hierarchical scales.
Figure: First row: images with rectangles showing the text regions. Second row: zoom on the text regions. Third row: ridges detected at two scales (red at the high level, blue at the small level) in the text region, which represent local structures of text lines whatever the type of text.
H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005
• Abstract: We propose a novel approach for finding text in images by using ridges at several scales. A text string is modelled by a ridge at a coarse scale representing its center line and numerous short ridges at a smaller scale representing the skeletons of characters. Skeleton ridges have to satisfy geometrical and spatial constraints such as the perpendicularity or non-parallelism to the central ridge. In this way, we obtain a hierarchical description of text strings, which can provide direct input to an OCR or a text analysis system. The proposed method does not depend on a particular alphabet, it works with a wide variety in size of characters and does not depend on orientation of text string. The experimental results show a good detection.
X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and learning of Neighbor Characters for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008.
• Abstract: This paper proposes an approach based on the statistical modeling and learning of neighboring characters to extract multilingual texts in images. The case of three neighboring characters is represented as the Gaussian mixture model and discriminated from other cases by the corresponding ‘pseudo-probability’ defined under Bayes framework. Based on this modeling, text extraction is completed through labeling each connected component in the binary image as character or non-character according to its neighbors, where a mathematical morphology based method is introduced to detect and connect the separated parts of each character, and a Voronoi partition based method is advised to establish the neighborhoods of connected components.
We further present a discriminative training algorithm based on the maximum–minimum similarity (MMS) criterion to estimate the parameters in the proposed text extraction approach. Experimental results in Chinese and English text extraction demonstrate the effectiveness of our approach trained with the MMS algorithm, which achieved the precision rate of 93.56% and the recall rate of 98.55% for the test data set. In the experiments, we also show that the MMS provides significant improvement of overall performance, compared with influential training criterions of the maximum likelihood (ML) and the maximum classification error (MCE).
Recent Progress - New and improved text extraction approaches
X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and learning of Neighbor Characters for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008.
The GMM based algorithm treats the text features of three neighboring characters as three mixed Gaussian models to extract text objects.
Figure: An example of neighborhood computation. (a) A binary image, where black dots denote the centroids of CCs. (b) The Delaunay triangulation of the centroids, where each triangle corresponds to a neighbor set; however, the neighborhoods of characters cannot be completely reflected in the Delaunay triangulation. (c) The solution: every three nodes that are joined consecutively on the convex hull of the centroid set are also taken as a neighbor set.
Recent Progress - New and improved text extraction approaches
P. Dubey, Edge Based Text Detection for Multi-purpose Application, Proceedings of International Conference Signal Processing, IEEE, Vol. 4, 2006
Only the vertical edge features are utilized to find text regions, based on the observation that vertical edges can enhance the characteristics of text and eliminate most irrelevant information.
Figure: (a) original image, (b) detected group of vertical lines, (c) extracted text region, (d) result.
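As a rough illustration of the vertical-edge feature, the sketch below applies the standard horizontal-gradient Sobel kernel, which responds to vertical edges, to a small grayscale image stored as nested lists. It covers only this first filtering step; the morphological grouping and verification stages of the cited method are not reproduced, and the function name is hypothetical.

```python
# Standard Sobel kernel for the horizontal gradient (responds to vertical edges).
SOBEL_X = ((-1, 0, 1),
           (-2, 0, 2),
           (-1, 0, 1))


def vertical_edges(image):
    """Return the absolute Sobel response for interior pixels.

    image is a list of equal-length rows of gray levels; the result has
    (height - 2) rows and (width - 2) columns because borders are skipped.
    """
    height, width = len(image), len(image[0])
    response = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            g = sum(SOBEL_X[j][i] * image[y + j - 1][x + i - 1]
                    for j in range(3) for i in range(3))
            row.append(abs(g))
        response.append(row)
    return response


if __name__ == "__main__":
    step = [[0, 0, 0, 10, 10, 10] for _ in range(4)]   # one vertical edge
    for row in vertical_edges(step):
        print(row)
```

A purely horizontal edge produces no response with this kernel, which is how the filter suppresses content that lacks the dense vertical strokes typical of text.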
Recent Progress - New and improved text extraction approaches
K. Subramanian, P. Natarajan, M. Decerbo, and D. Castanon, Character-Stroke Detection for Text-Localization and Extraction, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 33-37, 2007
Character strokes are used to extract text objects: three line scans (a line scan is the set of pixels along a horizontal line of an intensity image) are used to detect image intensity changes.
Figure: (a) original image, (b) intensity plots along the blue lines l, l-2, and l+2, where ∆ is the stroke width, (c) threshold Ig ≤ 0.35, (d) the thresholded image after morphological operations and connected component analysis.
P. Dubey, Edge Based Text Detection for Multi-purpose Application, Proceedings of International Conference Signal Processing, IEEE, Vol. 4, 2006
• Abstract: Text detection plays a crucial role in various applications. In this paper we present an edge based text detection technique in the complex images for multi purpose application. The technique applied vertical Sobel edge detection and a newly proposed morphological technique that used to connect the edges to form the candidate regions. The technique has special advantage, by providing a distinguishable texture on the text area over the others. The connected components are then extracted using a purposed segmentation algorithm. Later all the candidate regions are verified to specify the text region. The propose techniques has been tested with different types of image acquired from different input sources and environment. The experimental result shows highly successful rate.
K. Subramanian, P. Natarajan, M. Decerbo, and D. Castanon, Character-Stroke Detection for Text-Localization and Extraction, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 33-37, 2007
• Abstract: In this paper, we present a new approach for analysis of images for text-localization and extraction. Our approach puts very few constraints on the font, size and color of text and is capable of handling both scene text and artificial text well. In this paper, we exploit two well-known features of text: approximately constant stroke width and local contrast, and develop a fast, simple, and effective algorithm to detect character strokes. We also show how these can be used for accurate extraction and motivate some advantages of using this approach for text localization over other colorspace segmentation based approaches. We analyze the performance of our stroke detection algorithm on images collected for the robust-reading competitions at ICDAR 2003.
Recent Progress - New and improved text extraction approaches
D. Crandall, S. Antani, R. Kasturi, Extraction of special effects caption text events from digital video, International Journal on Document Analysis and Recognition, Vol. 5, pp. 138-157, 2003
An 8×8 block-wise DCT is applied to each video frame. For each block, 19 optimal coefficients that best correspond to the properties of text are determined empirically. The sum of the absolute values of these coefficients is computed and regarded as a measure of the “text energy” of that block. The motion vectors of MPEG-compressed videos are used for tracking text objects.
Figure: (a) original image, (b) text energy, (c) tracking result.
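The “text energy” measure can be sketched as follows. The slide does not list the 19 empirically chosen coefficients, so `TEXT_COEFFS` below is a hypothetical placeholder set of low-to-mid frequency AC coefficients used purely for illustration; the DCT itself is the standard orthonormal 2-D DCT-II.

```python
import math

N = 8  # block size

# Hypothetical placeholder: the 19 coefficients used in the cited work are
# not given here, so a small set of AC coefficients stands in for them.
TEXT_COEFFS = [(u, v) for u in range(N) for v in range(N) if 0 < u + v <= 3]


def dct2_block(block):
    """Orthonormal 2-D DCT-II of an 8x8 block given as nested lists."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = alpha(u) * alpha(v) * total
    return coeffs


def text_energy(block, selected=TEXT_COEFFS):
    """Sum of absolute values of the selected DCT coefficients of one block."""
    coeffs = dct2_block(block)
    return sum(abs(coeffs[u][v]) for u, v in selected)


if __name__ == "__main__":
    flat = [[128] * N for _ in range(N)]
    stripes = [[255 * (x % 2) for x in range(N)] for _ in range(N)]
    print(text_energy(flat), text_energy(stripes))
```

A flat block yields essentially zero energy because only its DC coefficient is non-zero, while a block with stroke-like intensity alternation spreads energy into the AC coefficients.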
D. Crandall, S. Antani, R. Kasturi, Extraction of special effects caption text events from digital video, International Journal on Document Analysis and Recognition, Vol. 5, pp. 138-157, 2003
• Abstract: The popularity of digital video is increasing rapidly. To help users navigate libraries of video, algorithms that automatically index video based on content are needed. One approach is to extract text appearing in video, which often reflects a scene’s semantic content. This is a difficult problem due to the unconstrained nature of general-purpose video. Text can have arbitrary color, size, and orientation. Backgrounds may be complex and changing. Most work so far has made restrictive assumptions about the nature of text occurring in video. Such work is therefore not directly applicable to unconstrained, general-purpose video. In addition, most work so far has focused only on detecting the spatial extent of text in individual video frames. However, text occurring in video usually persists for several seconds. This constitutes a text event that should be entered only once in the video index. Therefore it is also necessary to determine the temporal extent of text events. This is a non-trivial problem because text may move, rotate, grow, shrink, or otherwise change over time. Such text effects are common in television programs and commercials but so far have received little attention in the literature. This paper discusses detecting, binarizing, and tracking caption text in general-purpose MPEG-1 video. Solutions are proposed for each of these problems and compared with existing work found in the literature.
Recent Progress - New and improved text extraction approaches
Besides new approaches, many earlier text extraction approaches have recently been enhanced and extended to overcome their limitations. By extracting and integrating more comprehensive characteristics of text objects, these improved approaches can provide more robust performance than their predecessors.
Recent Progress - New and improved text extraction approaches
S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005
A color-related detector, a wavelet-based texture detector, an edge-based contour detector and the temporal invariance principle are adopted to detect candidate caption regions. A parallel fusion strategy is then used to merge their results.
C. Mancas-Thillou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
Euclidean distance based and cosine similarity based clustering methods are applied complementarily in the RGB color space to partition the original image into three clusters: textual foreground, textual background, and noise.
Figure: Overview of the proposed algorithm combining color and spatial information.
S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005
• Abstract: In this article, we focus on the problem of caption detection in video sequences. Contrary to most of existing approaches based on a single detector followed by an ad hoc and costly post-processing, we have decided to consider several detectors and to merge their results in order to combine advantages of each one. First we made a study of captions in video sequences to determine how they are represented in images and to identify their main features (color constancy and background contrast, edge density and regularity, temporal persistence). Based on these features, we then select or define the appropriate detectors and we compare several fusion strategies which can be involved. The logical process we have followed and the satisfying results we have obtained let us validate our contribution.
C. Mancas-Thillou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
• Abstract: Natural scene images brought new challenges for a few years and one of them is text understanding over images or videos. Text extraction which consists to segment textual foreground from the background succeeds using color information. Faced to the large diversity of text information in daily life and artistic ways of display, we are convinced that this only information is no more enough and we present a color segmentation algorithm using spatial information. Moreover, a new method is proposed in this paper to handle uneven lighting, blur and complex backgrounds which are inherent degradations to natural scene images.
To merge text pixels together, complementary clustering distances are used to support simultaneously clear and well-contrasted images with complex and degraded images. Tests on a public database show finally efficiency of the whole proposed method.
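One way to see why the two clustering distances are complementary is to compare them on a pair of RGB pixels that share a hue but differ in brightness. The sketch below is only an illustration of the two measures, not the clustering algorithm of the cited paper; the pixel values are made up.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two RGB triples."""
    return math.dist(p, q)


def cosine_similarity(p, q):
    """Cosine of the angle between two RGB vectors (0.0 if either is black)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.hypot(*p) * math.hypot(*q)
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    shaded = (100, 50, 25)     # hypothetical text pixel in shadow
    lit = (200, 100, 50)       # same color under stronger light
    print(euclidean_distance(shaded, lit))   # large: brightness differs
    print(cosine_similarity(shaded, lit))    # close to 1: direction is the same
```

The Euclidean distance separates the two pixels because their magnitudes differ, whereas the cosine similarity treats them as nearly identical, which is the kind of behavior that helps under uneven lighting.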
Recent Progress - New and improved text extraction approaches
M.R. Lyu, J. Song, M. Cai, A Comprehensive method for multilingual video text detection, localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005.
The sequential multi-resolution paradigm removes the redundancy of the parallel multi-resolution paradigm: no text edge can appear several times at different resolution levels.
Figure: Sequential multiresolution paradigm.
M.R. Lyu, J. Song, M. Cai, A Comprehensive method for multilingual video text detection, localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005.
• Abstract: Text in video is a very compact and accurate clue for video indexing and summarization. Most video text detection and extraction methods hold assumptions on text color, background contrast, and font style. Moreover, few methods can handle multilingual text well since different languages may have quite different appearances. This paper performs a detailed analysis of multilingual text characteristics, including English and Chinese. Based on the analysis, we propose a comprehensive, efficient video text detection, localization, and extraction method, which emphasizes the multilingual capability over the whole processing. The proposed method is also robust to various background complexities and text appearances. The text detection is carried out by edge detection, local thresholding, and hysteresis edge recovery. The coarse-to-fine localization scheme is then performed to identify text regions accurately. The text extraction consists of adaptive thresholding, dam point labeling, and inward filling. Experimental results on a large number of video images and comparisons with other methods are reported in detail.
Recent Progress - New and improved text extraction approaches
J. Gllavata, E. Qeli and B. Freisleben, Detecting Text in Videos Using Fuzzy Clustering Ensembles, Proceedings of the Eighth IEEE International Symposium on Multimedia, pp. 283-290, 2006.
Fuzzy C-means based individual frame clustering is replaced by fuzzy clustering ensemble (FCE) based multi-frame clustering to utilize temporal redundancy.
Figure: Fuzzy cluster ensemble for text detection in videos.
J. Gllavata, E. Qeli and B. Freisleben, Detecting Text in Videos Using Fuzzy Clustering Ensembles, Proceedings of the Eighth IEEE International Symposium on Multimedia, pp. 283-290, 2006.
• Abstract: Detection and localization of text in videos is an important task towards enabling automatic content-based retrieval of digital video databases. However, since text is often displayed against a complex background, its detection is a challenging problem. In this paper, a novel approach based on fuzzy cluster ensemble techniques to solve this problem is presented. The advantage of this approach is that the fuzzy clustering ensemble allows the incremental inclusion of temporal information regarding the appearance of static text in videos. Comparative experimental results for a test set of 10.92 minutes of video sequences have shown the very good performance of the proposed approach with an overall recall of 92.04% and a precision of 96.71%.
Recent Progress
2. Text extraction techniques adopted from other research fields: Another encouraging development is that more and more techniques that have been successfully applied in other research fields have been adapted for text extraction. Because these approaches were not initially designed for the text extraction task, many unique characteristics of their original research fields are intrinsically embedded in them. By using these approaches, we can therefore view the text extraction problem from the viewpoints of other related research fields and benefit from them. This is a promising way to find good solutions for the text extraction task.
Recent Progress - Text extraction techniques adopted from other research fields
K.I. Kim, K. Jung and J.H. Kim, Texture-based approach for text detection in image using support vector machine and continuously adaptive mean shift algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 12, pp. 1631-1638, 2003.
The continuously adaptive mean shift algorithm (CAMSHIFT) was initially used to detect and track faces in a video stream.
Figure: Example of text detection using CAMSHIFT. (a) Input image (540×400), (b) initial window configuration for the CAMSHIFT iteration (5×5-sized windows located at regular intervals of (25, 25)), (c) texture-classified regions marked in white and gray (white: text region, gray: non-text region), and (d) final detection result.
Recent Progress - Text extraction techniques adopted from other research fields
H.B. Aradhye and G.K. Myers, Exploiting Videotext “Events” for Improved Videotext Detection, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 894-898, 2007.
Multiscale statistical process control (MSSPC) was originally proposed for detecting changes in univariate and multivariate signals.
Figure: Substeps involved in the use of MSSPC for videotext event detection.
K.I. Kim, K. Jung and J.H. Kim, Texture-based approach for text detection in image using support vector machine and continuously adaptive mean shift algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 12, pp. 1631-1638, 2003.
• Abstract: The current paper presents a novel texture-based method for detecting texts in images. A support vector machine (SVM) is used to analyze the textural properties of texts. No external texture feature extraction module is used; rather, the intensities of the raw pixels that make up the textural pattern are fed directly to the SVM, which works well even in high-dimensional spaces. Next, text regions are identified by applying a continuously adaptive mean shift algorithm (CAMSHIFT) to the results of the texture analysis. The combination of CAMSHIFT and SVMs produces both robust and efficient text detection, as time-consuming texture analyses for less relevant pixels are restricted, leaving only a small part of the input image to be texture-analyzed.
H.B. Aradhye and G.K. Myers, Exploiting Videotext “Events” for Improved Videotext Detection, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 894-898, 2007.
• Abstract: Text in video, whether overlay or in-scene, contains a wealth of information vital to automated content analysis systems. However, low resolution of the imagery, coupled with richness of the background and compression artifacts limit the detection accuracy that can be achieved in practice using existing text detection algorithms. This paper presents a novel, noncausal temporal aggregation method that acts as a second pass over the output of an existing text detector over the entire video clip. A multiresolution change detection algorithm is used along the time axis to detect the appearance and disappearance of multiple, concurrent lines of text followed by recursive time-averaged projections on Y and X axes.
This algorithm detects and rectifies instances of missed text and enhances spatial boundaries of detected text lines using consensus estimates. Experimental results, which demonstrate significant performance gain on publicly collected and annotated data, are presented.
Recent Progress - Text extraction techniques adopted from other research fields
D. Liu and T. Chen, Object Detection in Video with Graphical Models, Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp. 14-19, 2006.
Discriminative Random Fields (DRF) were initially applied to detect man-made buildings in 2D images.
Figure: (a) 2D DRF, with state si and one of its neighbors sj. (b) 3D DRF, with multiple 2D DRFs stacked over time. (c) 2D DRF-HMM type (A), with intra-frame dependencies modelled by undirected DRFs and inter-frame dependencies modelled by HMMs. States are shared between the two models.
Recent Progress - Text extraction techniques adopted from other research fields
W. M. Pan, T. D. Bui, and C. Y. Suen, Text Segmentation from Complex Background Using Sparse Representations, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 412-416, 2007.
Sparse representation was initially used for research on the receptive fields of simple cells.
Figure: (a) camera-captured image; (b) foreground text generated by image decomposition via sparse representations; (c) binarized result of (b) using Otsu’s method.
D. Liu and T. Chen, Object Detection in Video with Graphical Models, Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp. 14-19, 2006.
• Abstract: In this paper, we propose a general object detection framework which combines the Hidden Markov Model with the Discriminative Random Fields. Recent object detection algorithms have achieved impressive results by using graphical models, such as Markov Random Field. These models, however, have only been applied to two dimensional images. In many scenarios, video is the directly available source rather than images, hence an important information for detecting objects has been omitted: the temporal information. To demonstrate the importance of temporal information, we apply graphical models to the task of text detection in video and compare the result of with and without temporal information. We also show the superiority of the proposed models over simple heuristics such as median filter over time.
W. M. Pan, T. D. Bui, and C. Y. Suen, Text Segmentation from Complex Background Using Sparse Representations, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 412-416, 2007.
• Abstract: A novel text segmentation method from complex background is presented in this paper. The idea is inspired by the recent development in searching for the sparse signal representation among a family of over-complete atoms, which is called a dictionary. We assume that the image under investigation is composed of two components: the foreground text and the complex background. We further assume that the latter can be modeled as a piece-wise smooth function. Then we choose two dictionaries, where the first one gives sparse representation to one component and nonsparse representation to another while the second one does the opposite. By looking for the sparse representations in each dictionary, we can decompose the image into the two composing components.
After that, text segmentation can be easily achieved by applying simple thresholding to the text component. Preliminary experiments show some promising results.
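The final thresholding step can be illustrated with Otsu's method, which the figure caption for this work names as the binarization used. The following is only a minimal sketch in plain Python: the function names, the 256-level assumption, and the toy `component` array are hypothetical, and the sparse decomposition that would produce the text component is not shown.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    `pixels` is a flat iterable of integer gray levels in [0, levels).
    Pixels with value > t are treated as foreground.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of gray levels at or below the candidate threshold
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance up to a constant factor 1/total^2
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map a 2-D list of gray levels to 0/1 using `pixel > threshold`."""
    return [[1 if p > threshold else 0 for p in row] for row in image]


# Toy "text component": bright strokes on a dark residue (made-up values).
component = [
    [10, 12, 200, 11],
    [9, 210, 205, 10],
    [11, 10, 198, 12],
]
t = otsu_threshold([p for row in component for p in row])
mask = binarize(component, t)
```

In this toy case the two gray-level populations are well separated, so any threshold between them gives the same mask; real text components would generally be less clean.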
Recent Progress
3. Text extraction approaches proposed for specific text types and specific genres of video documents: Besides general text extraction approaches, an increasing number of approaches have been proposed for specific text types. Based on domain knowledge, these approaches can take advantage of the unique properties of a specific text type or video genre and often achieve better performance than general approaches.
Recent Progress: Text extraction approaches proposed for specific text types and specific genres of video documents
W. Wu, X. Chen and J. Yang, Detection of text on road signs from video, IEEE Transactions on Intelligent Transportation Systems, Vol. 6, pp. 378-390, 2005.
This approach is composed of two stages:
1. localizing road signs;
2. detecting text.
Figure: Architecture of the proposed framework.
W. Wu, X. Chen and J. Yang, Detection of text on road signs from video, IEEE Transactions on Intelligent Transportation Systems, Vol. 6, pp. 378-390, 2005.
• Abstract: A fast and robust framework for incrementally detecting text on road signs from video is presented in this paper. This new framework makes two main contributions. 1) The framework applies a divide-and-conquer strategy to decompose the original task into two subtasks, that is, the localization of road signs and the detection of text on the signs. The algorithms for the two subtasks are naturally incorporated into a unified framework through a feature-based tracking algorithm. 2) The framework provides a novel way to detect text from video by integrating two-dimensional (2-D) image features in each video frame (e.g., color, edges, texture) with the three-dimensional (3-D) geometric structure information of objects extracted from video sequence (such as the vertical plane property of road signs). The feasibility of the proposed framework has been evaluated using 22 video sequences captured from a moving vehicle. This new framework gives an overall text detection rate of 88.9% and a false hit rate of 9.2%. It can easily be applied to other tasks of text detection from video and potentially be embedded in a driver assistance system.
Recent Progress: Text extraction approaches proposed for specific text types and specific genres of video documents
C. Choudary and T. Liu, Summarization of Visual Content in Instruction videos, IEEE Transactions on Multimedia, Vol. 9, pp. 1443-1455, 2007.
A content fluctuation curve based on the number of chalk pixels is used to measure the content in each frame of instructional videos. The frames with enough chalk pixels are extracted as key frames. Hausdorff distance and connected-component decomposition are adopted to reduce the redundancy of key frames by matching the content and mosaicking the frames.
Figure: Comparison of our summary frames with the key frames obtained using different key frame selection methods in a test video. (a) our summarization algorithm; (b) fixed sampling; (c) dynamic clustering; (d) tolerance band. Our summary frames are rich in content and more appealing.
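The key-frame idea described here, counting chalk pixels per frame and keeping frames with enough of them, can be sketched as follows. This is an illustration only: it assumes each frame has already been reduced to a binary chalk mask (1 = chalk pixel), and the names `content_curve`, `select_key_frames`, and the threshold `min_chalk_pixels` are hypothetical. The slide does not state how "enough" is decided, so a fixed threshold stands in for that rule.

```python
def content_curve(chalk_masks):
    """Content fluctuation curve: number of chalk pixels in each frame."""
    return [sum(sum(row) for row in mask) for mask in chalk_masks]


def select_key_frames(curve, min_chalk_pixels):
    """Indices of frames whose chalk-pixel count reaches the threshold."""
    return [t for t, n in enumerate(curve) if n >= min_chalk_pixels]


# Three toy 2x3 chalk masks: an empty board, a partly filled board, a full board.
masks = [
    [[0, 0, 0], [0, 0, 0]],
    [[1, 0, 0], [0, 1, 0]],
    [[1, 1, 1], [1, 1, 0]],
]
curve = content_curve(masks)
key_frames = select_key_frames(curve, min_chalk_pixels=4)
```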
C. Choudary and T. Liu, Summarization of Visual Content in Instruction videos, IEEE Transactions on Multimedia, Vol. 9, pp. 1443-1455, 2007.
• Abstract: In instructional videos of chalk board presentations, the visual content refers to the text and figures written on the boards. Existing methods on video summarization are not effective for this video domain because they are mainly based on low-level image features such as color and edges. In this work, we present a novel approach to summarizing the visual content in instructional videos using middle-level features. We first develop a robust algorithm to extract content text and figures from instructional videos by statistical modelling and clustering. This algorithm addresses the image noise, nonuniformity of the board regions, camera movements, occlusions, and other challenges in the instructional videos that are recorded in real classrooms. Using the extracted text and figures as the middle level features, we retrieve a set of key frames that contain most of the visual content. We further reduce content redundancy and build a mosaicked summary image by matching extracted content based on K-th Hausdorff distance and connected component decomposition. Performance evaluation on four full-length instructional videos shows that our algorithm is highly effective in summarizing instructional video content.
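To make the "K-th Hausdorff distance" mentioned in the abstract concrete, here is a small sketch of one common definition, the partial (rank-based) Hausdorff distance, in which the maximum over nearest-neighbor distances is replaced by the K-th ranked value. Whether the paper uses exactly this form, and how it chooses K, is not stated in the slides, so both should be read as assumptions. Point sets are lists of (x, y) tuples, and the function names are hypothetical.

```python
import math


def directed_kth_hausdorff(A, B, k):
    """K-th smallest nearest-neighbor distance from points of A to set B.

    k is 1-indexed; k == len(A) gives the classical directed Hausdorff distance.
    """
    nearest = sorted(min(math.dist(a, b) for b in B) for a in A)
    return nearest[k - 1]


def kth_hausdorff(A, B, k):
    """Symmetric version: the larger of the two directed K-th distances."""
    return max(directed_kth_hausdorff(A, B, k), directed_kth_hausdorff(B, A, k))


# Two small point sets that agree except for one outlier in A.
A = [(0, 0), (1, 0), (10, 0)]
B = [(0, 0), (1, 0), (2, 0)]
classical = kth_hausdorff(A, B, 3)  # dominated by the outlier
robust = kth_hausdorff(A, B, 2)     # outlier ignored by the ranking
```

Using a rank below the set size makes the match tolerant of a few stray pixels, which is the usual motivation for the K-th variant when comparing noisy extracted content.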
Recent Progress: Text extraction approaches proposed for specific text types and specific genres of video documents
Additional References:
• C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
• D. Q. Zhang and S. F. Chang, Learning to Detect Scene Text Using a Higher-order MRF with Belief Propagation, IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2004.
• L. Tang and J. R. Kender, A unified text extraction method for instructional videos, Proceedings of IEEE International Conference on Image Processing, Vol. 3, pp. 11-14, 2005.
• M. R. Lyu, J. Song, M. Cai, A Comprehensive method for multilingual video text detection, localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005.
• S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, IEEE Proceedings of Eighth International Conference on Document Analysis and Recognition, pp. 106-110, 2005.
• C. C. Lee, Y. C. Chiang, C. Y. Shih, H. M. Huang, Caption localization and detection for news videos using frequency analysis and wavelet features, Proceedings of IEEE International Conference on Tools with Artificial Intelligence, Vol. 2, pp. 539-542, 2007.
• …
Performance Evaluation
R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang, Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol, to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008. (http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.57)
Evaluation Metrics: Video Analysis and Content Extraction (VACE)
Text: Task Definition
• Detection Task: Spatially locate the blocks of text in each video frame in a video sequence
 – Text blocks (objects) contain all words in a particular line of text where the font and size are the same
• Tracking Task: Spatially/temporally locate and track the text objects in a video sequence
• Recognition Task: Transcribe the words in each frame, including their spatial location (detection implied)
Task Definition Highlights
• Annotate oriented bounding rectangle around text objects (the reference annotation was done by VideoMining Inc., State College, PA)
• Detection and Tracking task
 – Line level annotation with IDs maintained
 – Rules based on similarity of font, proximity and readability levels
• Recognition task
 – Word level (IDs maintained)
• Documents
 – Annotation guidelines
 – Evaluation protocol
• Tools
 – ViPER (Annotation)
 – USF-DATE (Scoring)
Data Resources

VIDEO DATA      NUMBER OF CLIPS   TOTAL MINS
MICRO-CORPUS    5                 10
TRAINING        50                175
TESTING         50                175

• Micro-corpus: a small amount of data that was created after extensive discussions with the research community to act as a seed for initial annotation experiments and to provide new participants with a concrete sampling of the datasets and the tasks.
Data Resources
These discussions were coordinated as a series of weekly teleconferences with VACE contractors and other eminent members of the CV community. The discussions made the research community a partner in the evaluations and helped us in:
1. selecting the video recordings to be used in the evaluations,
2. creating the specifications for the ground truth annotations and scoring tools,
3. defining the evaluation infrastructure for the program.
Data Resources

TASK                     DOMAIN
Text Detect & Track      Broadcast News (ABC & CNN*)
Face Detect & Track      Broadcast News (ABC & CNN*)
Vehicle Detect & Track   Surveillance (i-LIDS**)

• MPEG-2 standard, progressive scanned at 720 × 480 resolution.
• GOP (Group of Pictures) of 12 for the broadcast news corpus, where the frame rate was 29.97 fps (frames per second), and GOP of 10 for the surveillance dataset, where the frame rate was 25 fps.
* Distributed by the Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu
** i-LIDS [Multiple Camera Tracking/Parked Vehicle Detection/Abandoned Baggage Detection] scenario datasets were developed by the UK Home Office and CPNI. (http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imagingtechnology/video-based-detection-systems/i-lids/)
Reference Annotations
Text Ground Truth: Every new text area was marked with a box when it appeared in the video. The box was moved and scaled to fit the text as it moved in successive frames. This process was done at the text line level until the text disappeared from the frame.
Three readability levels:
• READABILITY = 0 (white): Completely unreadable text
• READABILITY = 1 (gray): Partially readable text
• READABILITY = 2 (black): Clearly readable text
Reference Annotations
• Text regions were tagged based on a comprehensive set of rules:
 – All text within a selected block must contain the same readability level and type.
 – Blocks of text must contain the same size and font.
 – The bounding box should be tight to the extent that there is no space between the box and the text.
 – Text boxes may not overlap other text boxes unless the characters themselves are superimposed atop one another.
Detection Metric
• The Frame Detection Accuracy (FDA) measure calculates the spatial overlap between the ground truth and system output objects as the ratio of the spatial intersection of the two objects to their spatial union. The sum of all overlaps is normalized by the average of the number of ground truth objects and detected objects.

Frame Detection Accuracy (FDA):
\[ \mathrm{FDA}(t) = \frac{\text{Overlap Ratio}}{\left( N_G^{(t)} + N_D^{(t)} \right) / 2}, \qquad \text{where } \text{Overlap Ratio} = \sum_{i=1}^{N_{mapped}^{(t)}} \frac{\left| G_i^{(t)} \cap D_i^{(t)} \right|}{\left| G_i^{(t)} \cup D_i^{(t)} \right|} \]

$G_i$ denotes the ith ground truth object at the sequence level and $G_i^{(t)}$ denotes the ith ground truth object in frame t. $D_i$ denotes the ith detected object at the sequence level and $D_i^{(t)}$ denotes the ith detected object in frame t. $N_G^{(t)}$ and $N_D^{(t)}$ denote the number of ground truth objects and the number of detected objects in frame t, respectively. $N_{mapped}^{(t)}$ is the number of mapped ground truth and detected object pairs in frame t.
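As a rough illustration of the FDA definition, the sketch below computes it for one frame. Several simplifications are assumed and are not part of the metric's specification: boxes are axis-aligned tuples (x1, y1, x2, y2) rather than the oriented rectangles used in the reference annotation, and the mapping between ground truth and detected objects is taken as given instead of being computed. All function names are hypothetical.

```python
def box_area(b):
    """Area of an axis-aligned box (x1, y1, x2, y2); empty boxes give 0."""
    x1, y1, x2, y2 = b
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def overlap(g, d):
    """|G ∩ D| / |G ∪ D| for two axis-aligned boxes."""
    inter = box_area((max(g[0], d[0]), max(g[1], d[1]),
                      min(g[2], d[2]), min(g[3], d[3])))
    union = box_area(g) + box_area(d) - inter
    return inter / union if union > 0 else 0.0


def fda(mapped_pairs, n_gt, n_det):
    """Frame Detection Accuracy for one frame.

    mapped_pairs: list of (ground_truth_box, detected_box) pairs.
    n_gt, n_det: total ground truth and detected objects in the frame.
    Returns None when the frame contains no objects at all.
    """
    if n_gt + n_det == 0:
        return None
    overlap_ratio = sum(overlap(g, d) for g, d in mapped_pairs)
    return overlap_ratio / ((n_gt + n_det) / 2)


# One exact match, one partial match, plus one unmatched detection.
pairs = [((0, 0, 2, 2), (0, 0, 2, 2)),   # overlap 1
         ((0, 0, 2, 2), (1, 0, 3, 2))]   # overlap 2/6
score = fda(pairs, n_gt=2, n_det=3)
```

The unmatched detection contributes nothing to the numerator but raises the denominator, which is how false positives and misses lower the score.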
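Detection Metric
• The Sequence Frame Detection Accuracy (SFDA) is essentially the average of the FDA measure over all of the relevant frames in the sequence.

Sequence Frame Detection Accuracy (SFDA):
\[ \mathrm{SFDA} = \frac{\sum_{t=1}^{N_{frames}} \mathrm{FDA}(t)}{\sum_{t=1}^{N_{frames}} \exists \left( N_G^{(t)} \ \mathrm{OR} \ N_D^{(t)} \right)} \]

Range: 0 to 1 (higher is better). $N_{frames}$ is the number of frames in the sequence. The denominator counts the frames in which at least one ground truth or detected object exists.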
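The averaging in SFDA can be sketched in a few lines. The input format here is an assumption made for illustration: one tuple per frame holding an already computed FDA value together with the ground truth and detection counts for that frame. The function name `sfda` is hypothetical.

```python
def sfda(frames):
    """Sequence Frame Detection Accuracy.

    frames: list of (fda_value, n_gt, n_det) tuples, one per frame.
    Only frames with at least one ground truth or detected object count
    toward the average. Returns None if no frame is relevant.
    """
    total = 0.0
    relevant = 0
    for fda_value, n_gt, n_det in frames:
        if n_gt > 0 or n_det > 0:
            relevant += 1
            total += fda_value
    return total / relevant if relevant else None


sequence = [
    (1.0, 1, 1),  # perfect detection
    (0.5, 2, 2),  # partial overlap
    (0.0, 0, 0),  # empty frame, ignored
    (0.0, 1, 0),  # missed object, counts as a relevant frame with FDA 0
]
result = sfda(sequence)
```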
Tracking Metric
• The Average Tracking Accuracy (ATA) is a spatio-temporal measure which penalizes fragmentations in both the temporal and spatial dimensions while accounting for the number of objects detected and tracked, missed objects, and false positives.

Sequence Track Detection Accuracy (STDA):
\[ \mathrm{STDA} = \sum_{i=1}^{N_{mapped}} \frac{\sum_{t=1}^{N_{frames}} \dfrac{\left| G_i^{(t)} \cap D_i^{(t)} \right|}{\left| G_i^{(t)} \cup D_i^{(t)} \right|}}{N_{(G_i \cup D_i \neq \emptyset)}} \]

Average Tracking Accuracy (ATA):
\[ \mathrm{ATA} = \frac{\mathrm{STDA}}{\left( N_G + N_D \right) / 2} \]

Range: 0 to 1 (higher is better). $N_G$ and $N_D$ denote the number of unique ground truth objects and the number of unique detected objects in the given sequence, respectively. Uniqueness is defined by object IDs.
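The STDA and ATA definitions can be illustrated with the following sketch. It rests on assumptions that are not part of the metric itself: tracks are dictionaries mapping frame index to an axis-aligned box (x1, y1, x2, y2), the track-level mapping between ground truth and detected objects is given, and the normalizer for each pair is read as the number of frames in which either track is present. All names are hypothetical.

```python
def iou(g, d):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(g[2], d[2]) - max(g[0], d[0]))
    ih = max(0.0, min(g[3], d[3]) - max(g[1], d[1]))
    inter = iw * ih
    union = ((g[2] - g[0]) * (g[3] - g[1])
             + (d[2] - d[0]) * (d[3] - d[1]) - inter)
    return inter / union if union > 0 else 0.0


def stda(mapped_tracks):
    """Sum over mapped track pairs of the time-averaged overlap."""
    total = 0.0
    for gt_track, det_track in mapped_tracks:
        frames = set(gt_track) | set(det_track)  # frames where either exists
        if not frames:
            continue
        # Frames where only one of the two tracks exists contribute 0 overlap.
        overlap_sum = sum(iou(gt_track[t], det_track[t])
                          for t in frames
                          if t in gt_track and t in det_track)
        total += overlap_sum / len(frames)
    return total


def ata(mapped_tracks, n_gt, n_det):
    """Average Tracking Accuracy; None when the sequence has no objects."""
    if n_gt + n_det == 0:
        return None
    return stda(mapped_tracks) / ((n_gt + n_det) / 2)


box = (0, 0, 4, 2)
gt_track = {0: box, 1: box, 2: box, 3: box}  # text visible for 4 frames
det_track = {1: box, 2: box}                 # tracked for only 2 of them
score_one = ata([(gt_track, det_track)], n_gt=1, n_det=1)
score_fp = ata([(gt_track, det_track)], n_gt=1, n_det=2)  # extra false track
```

The example shows both penalties named in the definition: a temporally fragmented track halves the score, and an additional unmatched detected track lowers it further through the denominator.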
Annotation Quality
• Evaluation relies on manual labeling.
• The degree of consistency becomes increasingly important as systems approach human levels of performance.
• A high degree of consistency would be difficult to achieve with somewhat subjective attributes like readability.
• Humans fatigued easily when performing such tedious tasks.
• 10% of the entire corpus was doubly annotated by multiple annotators and checked for quality using the evaluation measures.
Annotation Quality
For the doubly annotated corpus:

TASK             MEASURE                                               SCORE
Text detection   Average Sequence Frame Detection Accuracy (SFDA)      95%
Text tracking    Average Average Tracking Accuracy (ATA)               85%

The scores for the current state-of-the-art automatic algorithms are significantly lower than these numbers (22% relative for text detection, and 61% relative for text tracking).
Annotation Quality
Figure: Flowchart of Annotation Quality Control Procedure. Steps denoted by dark shaded boxes were carried out by the annotators. Steps denoted by light shaded boxes were carried out by the evaluators.
Text Detection and Tracking: VACE
[Figure: Mean SFDA/ATA scores for English text detection and tracking on broadcast news (BNews). Bars for systems A, B, C, and D; x-axis: SFDA, ATA; y-axis: 0 to 1.]
Text Recognition Evaluation
• Datasets: Broadcast News
• Training/Dry Run Development Set
 – 5 clips
 – 14.5 minutes
 – 1181 words
• Evaluation Set
 – 25 clips
 – 62.5 minutes
 – 4178 word objects
 – 68,738 word frame instances
Text Recognition Evaluation
Evaluate only the most easily readable text (to establish a baseline at a high level of inter-annotator agreement):
• Type = graphic (no scene text)
• Readability = 2
• Logo = false
• Occlusion = false
• Ambiguous = false
• Exclude scrolling text (ticker) and dynamic text (scoreboard)
• Case insensitive and punctuation ignored
Recognition Evaluation Metrics
• Spatially map system output detected words to reference words, then compare the strings for mapped words
 – An unmapped word in system output incurs an Insertion (I) error
 – An unmapped word in reference incurs a Deletion (D) error
 – A mapped word with a character mismatch incurs a Substitution (S) error

\[ \mathrm{WER} = \frac{I + D + S}{\text{Total \# Words in Ref}} \]

Example:
REF: The raven caws at midnight
SysOutput: raven calls at at midnight
Errors: D ("The" missing), S ("caws" vs. "calls"), I (extra "at")
WER = (1 + 1 + 1)/5 = 3/5 (60%)
• Errors are accumulated over the entire test set
• Also generate: Character Error Rate
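The word error rate computation on the raven example can be reproduced with a short sketch. It assumes the spatial mapping has already been done, so the inputs are simply the mapped (reference, system) word pairs plus the unmapped words on each side. The function names are hypothetical, and the normalization follows the "case insensitive and punctuation ignored" rule stated for this evaluation.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(word):
    """Lowercase a word and strip ASCII punctuation before comparison."""
    return word.lower().translate(_PUNCT_TABLE)


def word_error_rate(mapped, unmapped_ref, unmapped_sys):
    """WER = (I + D + S) / number of reference words.

    mapped: list of (reference_word, system_word) pairs that were spatially mapped.
    unmapped_ref: reference words with no system match (deletions).
    unmapped_sys: system words with no reference match (insertions).
    """
    substitutions = sum(1 for ref, hyp in mapped
                        if normalize(ref) != normalize(hyp))
    deletions = len(unmapped_ref)
    insertions = len(unmapped_sys)
    n_ref = len(mapped) + len(unmapped_ref)
    return (insertions + deletions + substitutions) / n_ref


# REF: The raven caws at midnight / SysOutput: raven calls at at midnight
mapped = [("raven", "raven"), ("caws", "calls"),
          ("at", "at"), ("midnight", "midnight")]
wer = word_error_rate(mapped, unmapped_ref=["The"], unmapped_sys=["at"])
```

Because insertions are counted in the numerator but not the denominator, this rate can exceed 1 when a system outputs many spurious words.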
Individual Clip Word Error Rate
[Figure: Clip-wise WER. x-axis: clips (0 to 25); y-axis: WER per clip, normalized by the number of words in each clip (0 to 1).]
Scores (Word Error Rate)
[Figure: Word Error Rates with different normalizations. x-axis labels: WER/Word, WER/Frames; y-axis: values (0 to 1). Values shown: WER 0.4233, CER 0.2823.]
Discussion
• Recent progress provides many promising solutions and research directions for the text extraction problem.
• Due to the large variations of text objects in videos, no single approach can achieve satisfactory performance in all applications.
• Much work remains to further improve the performance of text extraction techniques.
Discussion
Detection and Localization
 – Two questions still need more investigation: how to efficiently combine several complementary extraction algorithms to produce better performance, and how to extract better features by analyzing the shape of characters and the relationships between text and its background.
Discussion
Tracking
 – Although text tracking is an indispensable step for text extraction in videos, not many text tracking approaches have been reported in recent years.
 – More effort is needed on tracking, not only for static and scrolling text, but also for dynamic text objects (growing, shrinking, and rotating text).
Discussion
Datasets
 – Most algorithms are still tested on their own datasets. To compare and evaluate all algorithms, a large, freely available annotated video dataset is urgently needed.