Extraction of Text Objects in Video Documents: Recent Progress

       Jing Zhang and Rangachar Kasturi
            University of South Florida
  Department of Computer Science and Engineering
Acknowledgements

The work presented here is that of numerous
researchers from around the world. We thank
   them for their contributions towards the
  advances in video document processing.

   In particular we would like to thank the
authors of papers whose work is cited in this
       presentation and in our paper.
Outline
•   Introduction
•   Recent Progress
•   Performance Evaluation
•   Discussion
Introduction
Since the 1990s, with the rapid growth of available multimedia documents and the increasing demand for information indexing and retrieval, much effort has been devoted to text extraction in images and videos.
Introduction
• Text Extraction in Video
  – Text consists of words, which are well-defined models of concepts for human communication.
  – Text objects embedded in video contain much semantic
    information related to the multimedia content.
  – Text extraction techniques play an important role in content-
    based multimedia information indexing and retrieval.
Introduction
 Extracting text in video presents unique challenges compared with text in scanned documents:

   Cons:
   – Low contrast
   – Low resolution
   – Color bleeding
   – Unconstrained backgrounds
   – Unknown text color, size, position, orientation, and layout

   Pros:
   – Temporal redundancy (text in video usually persists for at least several seconds, to give human viewers the necessary time to read it)
Introduction
• Caption Text is artificially superimposed on the video at the time of editing.
• Scene Text naturally occurs in the field of view of the camera during video capture.
• The extraction of scene text is a much tougher task due to varying lighting, complex movement, and transformation.

Figure: example frames showing scene text and caption text.
Introduction
 Five stages of text extraction in video:
1) Text Detection: finding regions in a video frame that contain text;
2) Text Localization: grouping text regions into text instances and generating a
   set of tight bounding boxes around all text instances;
3) Text Tracking: following a text event as it moves or changes over time and
   determining the temporal and spatial locations and extents of text events;
4) Text Binarization: binarizing the text bounded by text regions and marking
   text as one binary level and background as the other;
5) Text Recognition: performing OCR on the binarized text image.
Introduction
Processing flow: Video Clips → Text Detection → Text Localization ↔ Text Tracking → Text Binarization → Text Recognition → Text Objects
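As a minimal sketch, this pipeline can be expressed as Python stubs; every stage function below is a hypothetical placeholder, not code from any cited paper:

```python
import cv2

# Hypothetical stage placeholders; a real system would substitute
# actual detection, localization, tracking, binarization, and OCR.
def detect_text(frame):          return []     # 1) candidate text regions
def localize_text(frame, regs):  return regs   # 2) tight bounding boxes
def track_text(boxes, state):    return state  # 3) temporal identities
def binarize_text(frame, box):   return frame  # 4) text vs. background
def recognize_text(binary):      return ""     # 5) OCR

def extract_text_objects(video_path):
    cap = cv2.VideoCapture(video_path)
    state, results = {}, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        boxes = localize_text(frame, detect_text(frame))
        state = track_text(boxes, state)
        results += [recognize_text(binarize_text(frame, b)) for b in boxes]
    cap.release()
    return results
```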
Introduction
 The goal of text detection, text localization, and text tracking is to generate accurate bounding boxes for all text objects in video frames and to provide a unique identity to each text event, i.e., the same text object appearing in a sequence of consecutive frames.
Introduction
 This presentation concentrates on the approaches proposed for text extraction in videos during the most recent five years, summarizing and discussing the recent progress in this research area.
Introduction
 Region Based Approach utilizes the differing region properties of text and background to extract text objects.
   – Bottom-up: separating the image into small regions and then grouping
     character regions into text regions.
   – Color features, edge features, and connected component methods

 Texture Based Approach uses distinct texture properties of text to
  extract text objects from background.
   – Top-down: extracting texture features of the image and then locating
     text regions.
   – Spatial variance, Fourier transform, Wavelet transform, and machine
     learning methods.
Outline
•   Introduction
•   Recent Progress
•   Performance Evaluation
•   Discussion
Recent Progress
 Text extraction in video documents, as an important research
  branch of content-based information retrieval and indexing,
  continues to be a topic of much interest to researchers.

 A large number of newly proposed approaches in the
  literature have contributed to an impressive progress of text
  extraction techniques.
Recent Progress

 Prior to 2003:
 • Only a few text extraction approaches considered the temporal nature of video.
 • Very little work was done on scene text.
 • Objective performance evaluation metrics were scarce.

 Now:
 • Temporal redundancy of video is utilized by almost all recent text extraction approaches.
 • Scene text extraction is being extensively studied.
 • A comprehensive performance evaluation framework has been developed.
Recent Progress

  The progress of text extraction in videos can be categorized into three types:
 •   New and improved text extraction approaches
 •   Text extraction techniques adopted from other research fields
 •   Text extraction approaches proposed for specific text types and specific genres of video documents
Recent Progress
• New and improved text extraction approaches:
  The new and improved approaches play an important role in the recent progress of text extraction techniques for videos. They introduce not only new algorithms but also new understanding of the problem.
Recent Progress
 -New and improved text extraction approaches

 H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005




A text string is modeled as its center line
and the skeletons of characters by ridges
at different hierarchical scales.




First line: images with a rectangle showing the text region. Second line: zoom on the text regions. Third line: ridges detected at two scales (red at the coarse scale, blue at the fine scale) in the text region; these represent the local structures of text lines whatever the type of text.
H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features,
    The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005

•     Abstract. We propose a novel approach for finding text in images by using ridges at several scales. A
      text string is modelled by a ridge at a coarse scale representing its center line and numerous short
      ridges at a smaller scale representing the skeletons of characters. Skeleton ridges have to satisfy
      geometrical and spatial constraints such as the perpendicularity or non-parallelism to the central ridge.
      In this way, we obtain a hierarchical description of text strings, which can provide direct input to an
      OCR or a text analysis system. The proposed method does not depend on a particular alphabet, it
      works with a wide variety in size of characters and does not depend on orientation of text string. The
      experimental results show a good detection.
    X. Liu, H. Fu and Y. Jia.: Gaussian Mixture Modeling and learning of Neighbor Characters for Multilingual
    Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008.
    Abstract: This paper proposes an approach based on the statistical modeling and learning of neighboring
    characters to extract multilingual texts in images. The case of three neighboring characters is represented as
    the Gaussian mixture model and discriminated from other cases by the corresponding ‘pseudo-probability’
    defined under Bayes framework. Based on this modeling, text extraction is completed through labeling each
    connected component in the binary image as character or non-character according to its neighbors, where a
    mathematical morphology based method is introduced to detect and connect the separated parts of each
    character, and a Voronoi partition based method is advised to establish the neighborhoods of connected
    components. We further present a discriminative training algorithm based on the maximum–minimum similarity
    (MMS) criterion to estimate the parameters in the proposed text extraction approach. Experimental results in
    Chinese and English text extraction demonstrate the effectiveness of our approach trained with the MMS
    algorithm, which achieved the precision rate of 93.56% and the recall rate of 98.55% for the test data set. In
    the experiments, we also show that the MMS provides significant improvement of overall performance,
    compared with the influential training criteria of maximum likelihood (ML) and minimum classification error (MCE).
Recent Progress
-New and improved text extraction
approaches
      X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and learning of Neighbor Characters
      for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008.


       The GMM-based algorithm models the text features of three neighboring characters as a Gaussian mixture in order to extract text objects.




(a) (b) (c)

An example of neighborhood computation. (a) A binary image, where black dots denote the centroids of connected components (CCs); (b) the Delaunay triangulation of the centroids, where each triangle corresponds to a neighbor set. Since the neighborhoods of characters cannot be completely reflected in the Delaunay triangulation, (c) shows the solution: every three nodes joined one by one in the convex hull of the centroid set are also taken as a neighbor set.
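A rough sketch of the neighbor-set construction and scoring described above, assuming CC centroids and per-component feature vectors are already available; the concatenated-feature layout and the likelihood-vote threshold are illustrative stand-ins for the paper's Bayes-framework pseudo-probability:

```python
import numpy as np
from scipy.spatial import Delaunay
from sklearn.mixture import GaussianMixture

def neighbor_triples(centroids):
    """Each Delaunay triangle over the CC centroids is one candidate
    set of three neighboring characters (figure (b) above)."""
    return Delaunay(np.asarray(centroids)).simplices  # shape (n_triangles, 3)

def label_components(centroids, features, gmm: GaussianMixture, thresh=0.5):
    """Vote each connected component as character / non-character from the
    mixture likelihoods of the triples it participates in. `features[i]`
    is a feature vector for CC i; `gmm` is already fitted on concatenated
    triple features (a simplification of the paper's pseudo-probability)."""
    votes = np.zeros(len(centroids))
    counts = np.zeros(len(centroids))
    for triple in neighbor_triples(centroids):
        x = np.concatenate([features[i] for i in triple])[None, :]
        p = np.exp(gmm.score_samples(x))[0]  # likelihood of this triple
        for i in triple:
            votes[i] += p
            counts[i] += 1
    return (votes / np.maximum(counts, 1)) > thresh
```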
Recent Progress
-New and improved text extraction
approaches
 P. Dubey, Edge Based Text Detection for Multi-purpose Application, Proceedings of International
 Conference Signal Processing, IEEE, Vol. 4, 2006

  Only the vertical edge features are utilized to find text regions based on the
  observation that vertical edges can enhance the characteristic of text and eliminate
  most irrelevant information.




    (a)                                   (b)                                (c)                            (d)

          (a) Original image, (b) detected group of vertical lines, (c) extracted text region, (d) result
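A minimal OpenCV sketch of this idea (vertical Sobel edges, morphological linking, connected-component filtering); the kernel size and the filtering rules are illustrative choices, not the paper's exact parameters:

```python
import cv2

def candidate_text_boxes(gray):
    """Candidate text regions from vertical edges (edge-based sketch)."""
    # Vertical edges (x-derivative) respond strongly to character strokes.
    sobel_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    edges = cv2.convertScaleAbs(sobel_x)
    _, binary = cv2.threshold(edges, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Morphological closing links nearby vertical edges into text blocks.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    linked = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    n, _, stats, _ = cv2.connectedComponentsWithStats(linked)
    boxes = []
    for i in range(1, n):  # label 0 is the background
        x, y, w, h, area = stats[i]
        if w > h and area > 100:  # crude text-like aspect/size filter
            boxes.append((x, y, w, h))
    return boxes
```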
Recent Progress
-New and improved text extraction
approaches
 K. Subramanian, P. Natajajan, M. Decerbo, and D. Castanon, Character-Stroke Detection for Text-
 Localization and Extraction, Proceedings of Ninth International Conference on Document Analysis
 and Recognition, IEEE, pp. 33-37, 2007

    Character strokes are used to extract text objects: three line scans (each a set of pixels along a horizontal line of the intensity image) detect image intensity changes.




      (a) Original image, (b) Intensity plots along the blue line l, l-2, and l+2, ∆ is the stroke width, (c) threshold Ig ≤
      0.35, (d) The thresholded image after morphological operations and connected component analysis.
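A simplified sketch of one such line scan: pixel runs bounded by opposite-sign intensity transitions within a maximum stroke width are kept as candidate strokes (both thresholds below are illustrative):

```python
import numpy as np

def stroke_runs(row, grad_thresh=40, max_width=20):
    """Candidate stroke runs along one horizontal scan line: runs bounded
    by opposite-sign intensity transitions within max_width pixels."""
    diff = np.diff(row.astype(np.int32))
    rising = np.where(diff > grad_thresh)[0]
    falling = np.where(diff < -grad_thresh)[0]
    runs = []
    for r in rising:                     # bright stroke on dark background
        f = falling[(falling > r) & (falling <= r + max_width)]
        if f.size:
            runs.append((int(r), int(f[0])))
    for f in falling:                    # dark stroke on bright background
        r2 = rising[(rising > f) & (rising <= f + max_width)]
        if r2.size:
            runs.append((int(f), int(r2[0])))
    return runs
```

Applying this to the three scan lines l-2, l, and l+2 and keeping runs of consistent width (the ∆ in the figure) gives a stroke map that is then thresholded and cleaned with morphological operations and connected component analysis.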
P. Dubey, Edge Based Text Detection for Multi-purpose Application, Proceedings of International
   Conference Signal Processing, IEEE, Vol. 4, 2006
 •    Abstract: Text detection plays a crucial role in various applications. In this paper we present an
      edge-based text detection technique for complex images for multi-purpose applications. The
      technique applies vertical Sobel edge detection and a newly proposed morphological technique
      used to connect the edges to form the candidate regions. The technique has a special
      advantage in providing a distinguishable texture on the text area over the others. The connected
      components are then extracted using a proposed segmentation algorithm. Later all the candidate
      regions are verified to identify the text region. The proposed technique has been tested with
      different types of images acquired from different input sources and environments. The experimental
      results show a highly successful rate.
 K. Subramanian, P. Natajajan, M. Decerbo, and D. Castanon, Character-Stroke Detection for Text-
 Localization and Extraction, Proceedings of Ninth International Conference on Document Analysis
 and Recognition, IEEE, pp. 33-37, 2007
Abstract: In this paper, we present a new approach for analysis of images for text-
localization and extraction. Our approach puts very few constraints on the font, size and
color of text and is capable of handling both scene text and artificial text well. In this paper,
we exploit two well-known features of text: approximately constant stroke width and local
contrast, and develop a fast, simple, and effective algorithm to detect character strokes. We
also show how these can be used for accurate extraction and motivate some advantages
of using this approach for text localization over other colorspace segmentation based
approaches. We analyze the performance of our stroke detection algorithm on images
collected for the robust-reading competitions at ICDAR 2003.
Recent Progress
-New and improved text extraction
approaches
  D. Crandall, S. Antani, R. Kasturi, Extraction of special effects caption text events from digital video,
 International Journal on Document Analysis and Recognition, Vol. 5, pp. 138-157, 2003

  8×8 block-wise DCT is applied on each video frame. For each block, 19 optimal coefficients that
  best correspond to the properties of text are determined empirically. The sum of the absolute
  values of these coefficients is computed and regarded as a measure of the “text energy” of that
  block. The motion vectors of MPEG-compressed videos are used for tracking text objects.




(a) Original image, (b) text energy, (c) tracking result.
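A sketch of the block-wise "text energy" computation; since the paper's 19 optimal coefficients were determined empirically, the coefficient mask below is only a plausible stand-in:

```python
import numpy as np
from scipy.fftpack import dct

# Stand-in mask: 19 low/mid-frequency AC coefficients of the 8x8 DCT.
# The paper selects its 19 coefficients empirically; this choice is illustrative.
MASK = np.zeros((8, 8), dtype=bool)
MASK[[0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,4,4,5,6],
     [1,2,3,4,0,1,2,3,0,1,2,3,0,1,2,0,1,0,0]] = True

def text_energy(gray):
    """Per-block text energy: sum of |selected DCT coefficients| per 8x8 block."""
    h, w = gray.shape
    h8, w8 = h // 8 * 8, w // 8 * 8
    energy = np.zeros((h8 // 8, w8 // 8))
    for by in range(0, h8, 8):
        for bx in range(0, w8, 8):
            block = gray[by:by+8, bx:bx+8].astype(np.float64)
            coeffs = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
            energy[by // 8, bx // 8] = np.abs(coeffs[MASK]).sum()
    return energy
```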
D. Crandall, S. Antani, R. Kasturi, Extraction of special effects caption text events
 from digital video, International Journal on Document Analysis and Recognition, Vol.
 5, pp. 138-157, 2003
• Abstract. The popularity of digital video is increasing rapidly. To help users navigate
    libraries of video, algorithms that automatically index video based on content are
    needed. One approach is to extract text appearing in video, which often reflects a
    scene’s semantic content. This is a difficult problem due to the unconstrained nature
    of general-purpose video. Text can have arbitrary color, size, and orientation.
    Backgrounds may be complex and changing. Most work so far has made restrictive
    assumptions about the nature of text occurring in video. Such work is therefore not
    directly applicable to unconstrained, general-purpose video. In addition, most work so
    far has focused only on detecting the spatial extent of text in individual video frames.
    However, text occurring in video usually persists for several seconds. This constitutes
    a text event that should be entered only once in the video index. Therefore it is also
    necessary to determine the temporal extent of text events. This is a non-trivial
    problem because text may move, rotate, grow, shrink, or otherwise change over time.
    Such text effects are common in television programs and commercials but so far
    have received little attention in the literature. This paper discusses detecting,
    binarizing, and tracking caption text in general-purpose MPEG-1 video. Solutions are
    proposed for each of these problems and compared with existing work found in the
    literature.
Recent Progress
-New and improved text extraction
approaches
   In addition to entirely new approaches, many former text extraction approaches have been
     enhanced and extended recently to overcome their limitations.
     By extracting and integrating more comprehensive characteristics of text objects, these
     improved approaches provide more robust performance than their predecessors.
Recent Progress
-New and improved text extraction approaches

S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005

A color-related detector, a wavelet-based texture detector, an edge-based contour detector, and a temporal invariance principle are adopted to detect candidate caption regions; a parallel fusion strategy then merges the detector outputs.

C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.

Euclidean-distance-based and cosine-similarity-based clustering methods are applied complementarily in the RGB color space to partition the original image into three clusters: textual foreground, textual background, and noise.

Overview of the proposed algorithm combining color and spatial information.
S Lefevre, N Vincent, Caption localization in video sequences by fusion of multiple detectors,
    Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp.
    106-110, 2005
•   Abstract: In this article, we focus on the problem of caption detection in video sequences.
    Contrary to most of existing approaches based on a single detector followed by an ad hoc
    and costly post-processing, we have decided to consider several detectors and to merge
    their results in order to combine advantages of each one. First we made a study of
    captions in video sequences to determine how they are represented in images and to
    identify their main features (color constancy and background contrast, edge density and
    regularity, temporal persistence). Based on these features, we then select or define the
    appropriate detectors and we compare several fusion strategies which can be involved.
    The logical process we have followed and the satisfying results we have obtained let us
    validate our contribution.
      C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text
      Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
    Abstract: Natural scene images have brought new challenges in recent years, and one of them is text
    understanding in images or videos. Text extraction, which consists of segmenting the textual foreground from
    the background, usually succeeds using color information. Faced with the large diversity of text information in
    daily life and artistic ways of display, we are convinced that this information alone is no longer enough, and we
    present a color segmentation algorithm using spatial information. Moreover, a new method is proposed in this
    paper to handle uneven lighting, blur, and complex backgrounds, which are degradations inherent to natural
    scene images. To merge text pixels together, complementary clustering distances are used to support
    simultaneously clear and well-contrasted images and complex and degraded images. Tests on a public
    database finally show the efficiency of the whole proposed method.
Recent Progress
-New and improved text extraction
approaches
   M.R. Lyu, J Song, M. Cai, A Comprehensive method for multilingual video text detection,
   localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology,
   Vol. 15, pp. 243-255, 2005.

 The sequential multi-resolution paradigm removes the redundancy of the parallel multi-resolution paradigm: no text edge appears multiple times at different resolution levels.




                                                        Sequential multiresolution paradigm
M.R. Lyu, J Song, M. Cai, A Comprehensive method for multilingual video text detection,
    localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology,
    Vol. 15, pp. 243-255, 2005.

•   Abstract—Text in video is a very compact and accurate clue for video indexing and
    summarization. Most video text detection and extraction methods hold assumptions
    on text color, background contrast, and font style. Moreover, few methods can handle
    multilingual text well since different languages may have quite different appearances.
    This paper performs a detailed analysis of multilingual text characteristics, including
    English and Chinese. Based on the analysis, we propose a comprehensive, efficient
    video text detection, localization, and extraction method, which emphasizes the
    multilingual capability over the whole processing. The proposed method is also robust
    to various background complexities and text appearances. The text detection is
    carried out by edge detection, local thresholding, and hysteresis edge recovery. The
    coarse-to-fine localization scheme is then performed to identify text regions
    accurately. The text extraction consists of adaptive thresholding, dam point labeling,
    and inward filling. Experimental results on a large number of video images and
    comparisons with other methods are reported in detail.
Recent Progress
-New and improved text extraction
approaches
       J. Gllavata, E. Qeli and B. Freisleben, Detecting Text in Videos Using Fuzzy Clustering
       Ensembles, Proceedings of the Eighth IEEE International Symposium on Multimedia, pp.
       283-290, 2006.




  Fuzzy C-means based individual
  frame clustering is replaced by the
  fuzzy clustering ensemble (FCE)
  based multi-frame clustering to utilize
  temporal redundancy.




Fuzzy cluster ensemble for text detection in videos
J. Gllavata, E. Qeli and B. Freisleben, Detecting Text in Videos Using Fuzzy Clustering
    Ensembles, Proceedings of the Eighth IEEE International Symposium on Multimedia, pp.
    283-290, 2006.

•   Abstract: Detection and localization of text in videos is an important task
    towards enabling automatic content-based retrieval of digital video
    databases. However, since text is often displayed against a complex
    background, its detection is a challenging problem. In this paper, a novel
    approach based on fuzzy cluster ensemble techniques to solve this problem
    is presented. The advantage of this approach is that the fuzzy clustering
    ensemble allows the incremental inclusion of temporal information regarding
    the appearance of static text in videos. Comparative experimental results for
    a test set of 10.92 minutes of video sequences have shown the very good
    performance of the proposed approach with an overall recall of 92.04% and
    a precision of 96.71%.
Recent Progress
2. Text extraction techniques adopted from other research
  fields:
  Another encouraging development is that more and more techniques that have been
  successfully applied in other research fields have been adapted for text extraction.

  Because these approaches were not initially designed for the text extraction task,
  many unique characteristics of their original research fields are intrinsically embedded in them.

  Therefore, by using these approaches, we can view the text extraction problem from the
  viewpoints of other related research fields and benefit from them. This is a promising
  way to find good solutions for the text extraction task.
Recent Progress
-Text extraction techniques adopted from
other research fields
   K.I. Kim, K. Jung and J.H. Kim, Texture-based approach for text detection in image using support
   vector machine and continuously adaptive mean shift algorithm, IEEE Transactions on Pattern
   Analysis and Machine Intelligence, Vol. 25, No. 12, pp. 1631-1638, 2003.

    The continuously adaptive mean shift algorithm (CAMSHIFT) was initially used to
    detect and track faces in a video stream.




  Example of text detection using CAMSHIFT. (a) input image (540×400), (b) initial window configuration for
  CAMSHIFT iteration (5×5-sized windows located at regular intervals of (25, 25)), (c) texture classified region marked
  as white and gray level (white: text region, gray: non-text region), and (d) final detection result
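OpenCV ships the CAMSHIFT algorithm, so the detection loop can be sketched directly; the 8-bit text-probability map that seeds it (produced by the SVM texture classifier in the paper) is assumed to be given:

```python
import cv2

def camshift_text_regions(prob_map, step=25, win=5, max_iter=10):
    """Run CamShift from a grid of 5x5 seed windows placed at regular
    intervals over an 8-bit text-probability map (layout follows the
    figure above); each converged window is a candidate text region."""
    h, w = prob_map.shape
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, max_iter, 1)
    regions = []
    for y in range(0, h - win, step):
        for x in range(0, w - win, step):
            _, window = cv2.CamShift(prob_map, (x, y, win, win), criteria)
            if window[2] * window[3] > 0:  # window converged on some mass
                regions.append(window)     # (x, y, w, h)
    return regions
```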
Recent Progress
-Text extraction techniques adopted from
other research fields
 H.B. Aradhye and G.K. Myers, Exploiting Videotext “Events” for Improved Videotext Detection,
 Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE pp.
 894-898, 2007.

 The multiscale statistical process
 control (MSSPC) was originally
 proposed for detecting changes in
 univariate and multivariate signals.




                                            Substeps involved in the use of MSSPC for videotext event detection
K.I. Kim, K. Jung and J.H. Kim, Texture-based approach for text detection in image using support vector
    machine and continuously adaptive mean shift algorithm, IEEE Transactions on Pattern Analysis and Machine
    Intelligence, Vol. 25, No. 12, pp. 1631-1638, 2003.
•     Abstract—The current paper presents a novel texture-based method for detecting texts
      in images. A support vector machine (SVM) is used to analyze the textural properties of
      texts. No external texture feature extraction module is used; rather, the intensities of the
      raw pixels that make up the textural pattern are fed directly to the SVM, which works well
      even in high-dimensional spaces. Next, text regions are identified by applying a
      continuously adaptive mean shift algorithm (CAMSHIFT) to the results of the texture
      analysis. The combination of CAMSHIFT and SVMs produces both robust and efficient
      text detection, as time-consuming texture analyses for less relevant pixels are restricted,
      leaving only a small part of the input image to be texture-analyzed.

H.B. Aradhye and G.K. Myers, Exploiting Videotext “Events” for Improved Videotext Detection, Proceedings of
Ninth International Conference on Document Analysis and Recognition, IEEE pp. 894-898, 2007.
  Abstract: Text in video, whether overlay or in-scene, contains a wealth of information vital to automated
  content analysis systems. However, low resolution of the imagery, coupled with richness of the
  background and compression artifacts limit the detection accuracy that can be achieved in practice using
  existing text detection algorithms. This paper presents a novel, noncausal temporal aggregation method
  that acts as a second pass over the output of an existing text detector over the entire video clip. A
  multiresolution change detection algorithm is used along the time axis to detect the appearance and
  disappearance of multiple, concurrent lines of text, followed by recursive time-averaged projections on
  the Y and X axes. This algorithm detects and rectifies instances of missed text and enhances the
  spatial boundaries of detected text lines using consensus estimates.
  Experimental results, which demonstrate significant performance gain on
  publicly collected and annotated data, are presented.
Recent Progress
-Text extraction techniques adopted from
other research fields
   D. Liu and T. Chen, Object Detection in Video with Graphical Models, Proceedings of IEEE
   International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp 14-19, 2006.




    Discriminative Random Fields
    (DRF) was initially applied to
    detect man-made building in 2D
    images.




                 (a) 2D DRF, with state si and one of its neighbors sj . (b) 3D DRF, with multiple 2D DRFs stacked over
                 time. (c) 2D DRF-HMM type(A), with intra-frame dependencies modelled by undirected DRFs, and inter-
                 frame dependencies modelled by HMMs. States are shared between the two models.
Recent Progress
-Text extraction techniques adopted from
other research fields
 W. M. Pan, T. D. Bui, and C. Y. Suen, Text Segmentation from Complex Background Using
 Sparse Representations, Proceedings of Ninth International Conference on Document Analysis
 and Recognition, IEEE, pp. 412-416, 2007.

 Sparse representation was initially used for research on the receptive fields of
 simple cells.




(a) (b) (c)

  (a) Camera-captured image; (b) foreground text generated by image decomposition via sparse representations; (c)
  binarized result of (b) using Otsu's method.
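Step (c), Otsu's global threshold on the recovered text layer, is a one-liner in OpenCV (the input filename below is hypothetical):

```python
import cv2

# `text_layer.png` is a hypothetical file: the foreground component
# recovered by the sparse decomposition, as in (b) above.
text_layer = cv2.imread("text_layer.png", cv2.IMREAD_GRAYSCALE)

# Otsu's method picks the global threshold minimizing intra-class variance.
_, binary = cv2.threshold(text_layer, 0, 255,
                          cv2.THRESH_BINARY | cv2.THRESH_OTSU)
```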
D. Liu and T. Chen, Object Detection in Video with Graphical Models, Proceedings of IEEE
       International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp 14-19, 2006.
•   Abstract: In this paper, we propose a general object detection framework which combines the
    Hidden Markov Model with the Discriminative Random Fields. Recent object detection
    algorithms have achieved impressive results by using graphical models, such as Markov
    Random Field. These models, however, have only been applied to two dimensional images. In
    many scenarios, video is the directly available source rather than images, hence an important
    source of information for detecting objects has been omitted — the temporal information. To demonstrate
    the importance of temporal information, we apply graphical models to the task of text detection
    in video and compare the results with and without temporal information. We also show the
    superiority of the proposed models over simple heuristics such as median filter over time.

    W. M. Pan, T. D. Bui, and C. Y. Suen, Text Segmentation from Complex Background Using Sparse
    Representations, Proceedings of Ninth International Conference on Document Analysis and Recognition,
    IEEE, pp. 412-416, 2007.
    Abstract: A novel text segmentation method from complex background is presented in this paper. The idea
    is inspired by the recent development in searching for the sparse signal representation among a family of
    over-complete atoms, which is called a dictionary. We assume that the image under investigation is
    composed of two components: the foreground text and the complex background. We further assume that the
    latter can be modeled as a piece-wise smooth function. Then we choose two dictionaries, where the first one
    gives sparse representation to one component and nonsparse representation to another while the second
    one does the opposite. By looking for the sparse representations in each dictionary, we can decompose the
    image into the two composing components. After that, text segmentation can be easily achieved by applying
    simple thresholding to the text component. Preliminary experiments show some promising results.
Recent Progress
3. Text extraction approaches proposed for specific text
  types and specific genre of video documents:
  Besides general text extraction approaches, an increasing number of
  approaches have been proposed for specific text types.

  Based on domain knowledge, these specific approaches can take
  advantage of the unique properties of a specific text type or video genre and
  often achieve better performance than general approaches.
Recent Progress
-Text extraction approaches proposed for
specific text types and specific genre of video
documents
  W. Wu, X. Chen and J. Yang, Detection of text on road signs from video, IEEE
  Transactions on Intelligent Transportation Systems, Vol. 6, pp. 378-390, 2005.




This approach is composed of
two stages:
1. localizing road signs;
2. detecting text.




   Architecture of the proposed framework
W. Wu, X. Chen and J. Yang, Detection of text on road signs from video, IEEE
Transactions on Intelligent Transportation Systems, Vol. 6, pp. 378-390, 2005.

•   Abstract—A fast and robust framework for incrementally detecting text on
    road signs from video is presented in this paper. This new framework makes
    two main contributions. 1) The framework applies a divide-and-conquer
    strategy to decompose the original task into two subtasks, that is, the
    localization of road signs and the detection of text on the signs. The
    algorithms for the two subtasks are naturally incorporated into a unified
    framework through a feature-based tracking algorithm. 2) The framework
    provides a novel way to detect text from video by integrating two-
    dimensional (2-D) image features in each video frame (e.g., color, edges,
    texture) with the three-dimensional (3-D) geometric structure information of
    objects extracted from video sequence (such as the vertical plane property
    of road signs). The feasibility of the proposed framework has been
    evaluated using 22 video sequences captured from a moving vehicle. This
    new framework gives an overall text detection rate of 88.9% and a false hit
    rate of 9.2%. It can easily be applied to other tasks of text detection from
    video and potentially be embedded in a driver assistance system.
Recent Progress
-Text extraction approaches proposed for specific text types and specific genre of video documents

  C. Choudary and T. Liu, Summarization of Visual Content in Instructional Videos, IEEE
  Transactions on Multimedia, Vol. 9, pp. 1443-1455, 2007.

  A content fluctuation curve based on the number of chalk pixels is used to measure the content in
  each frame of instructional videos. The frames with enough chalk pixels are extracted as key
  frames. Hausdorff distance and connected-component decomposition are adopted to reduce the
  redundancy of key frames by matching the content and mosaicking the frames.




(a) (b) (c) (d)

  Comparison of our summary frames with the key frames obtained using different key frame selection methods in a test video: (a) our summarization algorithm; (b) fixed sampling; (c) dynamic clustering; (d) tolerance band. Our summary frames are rich in content and more appealing.
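A toy sketch of the chalk-pixel content curve and key-frame selection; the simple brightness test stands in for the paper's statistical chalk-pixel model, and both thresholds are arbitrary:

```python
import numpy as np

def content_curve(frames, chalk_thresh=200):
    """Content fluctuation curve: count of chalk-like (bright) pixels per
    grayscale frame. The brightness test is an illustrative stand-in for
    the paper's statistical chalk-pixel model."""
    return np.array([(f > chalk_thresh).sum() for f in frames])

def key_frames(frames, min_content=5000):
    """Select frames whose chalk-pixel count exceeds a content threshold."""
    curve = content_curve(frames)
    return [i for i, c in enumerate(curve) if c >= min_content]
```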
C. Choudary and T. Liu, Summarization of Visual Content in Instructional Videos,
 IEEE Transactions on Multimedia, Vol. 9, pp. 1443-1455, 2007.

•   Abstract—In instructional videos of chalk board presentations, the visual content
    refers to the text and figures written on the boards. Existing methods on video
    summarization are not effective for this video domain because they are mainly based
    on low-level image features such as color and edges. In this work, we present a novel
    approach to summarizing the visual content in instructional videos using middle-level
    features. We first develop a robust algorithm to extract content text and figures from
    instructional videos by statistical modelling and clustering. This algorithm addresses
    the image noise, nonuniformity of the board regions, camera movements, occlusions,
    and other challenges in the instructional videos that are recorded in real classrooms.
    Using the extracted text and figures as the middle level features, we retrieve a set of
    key frames that contain most of the visual content. We further reduce content
    redundancy and build a mosaicked summary image by matching extracted content
    based on K-th Hausdorff distance and connected component decomposition.
    Performance evaluation on four full-length instructional videos shows that our
    algorithm is highly effective in summarizing instructional video content
Recent Progress
-Text extraction approaches proposed for
specific text types and specific genre of video
documents
Additional References:
•   C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
•   D. Q. Zhang and S. F. Chang, Learning to Detect Scene Text Using a Higher-order MRF with Belief Propagation, IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2004.
•   L. Tang and J.R. Kender, A unified text extraction method for instructional videos, Proceedings of IEEE International Conference on Image Processing, Vol. 3, pp. 11-14, 2005.
•   M.R. Lyu, J. Song, M. Cai, A comprehensive method for multilingual video text detection, localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005.
•   S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005.
•   C.C. Lee, Y.C. Chiang, C.Y. Shih, H.M. Huang, Caption localization and detection for news videos using frequency analysis and wavelet features, Proceedings of IEEE International Conference on Tools with Artificial Intelligence, Vol. 2, pp. 539-542, 2007.
•   …
Outline
•   Introduction
•   Recent Progress
•   Performance Evaluation
•   Discussion
Performance Evaluation
R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M.
Boonstra, V. Korzhova, and J. Zhang, Framework for Performance Evaluation of
Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008 (to appear).
(http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.57)



Evaluation Metrics:
Video Analysis and Content Extraction (VACE)
Text: Task Definition
 Detection Task: Spatially locate the blocks of text in each video
  frame in a video sequence
    • Text blocks (objects) contain all words in a particular line of text where
      the font and size are the same

 Tracking Task: Spatially/temporally locate and track the text objects
  in a video sequence
 Recognition Task: Transcribe the words in each frame, including
  their spatial location (detection implied)
Task Definition
Highlights
• Annotate oriented bounding rectangle around text
  objects (The reference annotation was done by VideoMining Inc., State College, PA)
• Detection and Tracking task
     – Line level annotation with IDs maintained
     – Rules based on similarity of font, proximity and readability levels
• Recognition task
     – Word Level (IDs maintained)
• Documents
     – Annotation guidelines       - Evaluation protocol
• Tools
     – ViPER (Annotation)        - USF-DATE (Scoring)
Data Resources
      VIDEO

                  DATA           NUMBER OF CLIPS   TOTAL MINUTES
                  MICRO-CORPUS   5                 10
                  TRAINING       50                175
                  TESTING        50                175



 •   Micro-corpus: a small amount of data that was created after
     extensive discussions with the research community to act as a
     seed for initial annotation experiments and to provide new
     participants with a concrete sampling of the datasets and the
     tasks.
Data Resources
    These discussions were coordinated as a series of weekly
     teleconferences with VACE contractors and other eminent
     members of the CV community.
    The discussions made the research community a partner in
     the evaluations and helped us in:
1.   selecting the video recordings to be used in the evaluations,
2.   creating the specifications for the ground truth annotations and
     scoring tools,
3.   defining the evaluation infrastructure for the program.
Data Resources
                         TASK                     DOMAIN
                         Text Detect & Track      Broadcast News ABC & CNN*
                         Face Detect & Track      Broadcast News ABC & CNN*
                         Vehicle Detect & Track   Surveillance i-LIDS**

        MPEG-2 standard, progressive scanned at 720 × 480 resolution.
         GOP (Group of Pictures) of 12 for the broadcast news corpus, where
         the frame rate was 29.97 fps (frames per second), and GOP of 10
         for the surveillance dataset, where the frame rate was 25 fps.


* Distributed by the Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu
** i-LIDS [Multiple Camera Tracking/Parked Vehicle Detection/Abandoned Baggage Detection] scenario datasets were developed by the UK Home
Office and CPNI. (http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imagingtechnology/video-based-detection-systems/i-lids/)
Reference Annotations
 Text Ground Truth: Every new text area was marked with a box when it
  appeared in the video. The box was moved and scaled to fit the text as it
  moved in successive frames. This process was done at the text line level
  until the text disappeared from the frame.


 Three readability levels:

READABILITY = 0 (white): completely unreadable text
READABILITY = 1 (gray): partially readable text
READABILITY = 2 (black): clearly readable text
Reference Annotations
• Text regions were tagged based on a comprehensive set of rules:
•   All text within a selected block must have the same readability level and
    type.

•   Blocks of text must have the same size and font.

•   The bounding box should be tight to the extent that there is no space
    between the box and the text.

•   Text boxes may not overlap other text boxes unless the characters
    themselves are superimposed atop one another.
Sample Annotation Clip (line-level)
Detection Metric
•     The Frame Detection Accuracy (FDA) measure calculates the spatial
      overlap between the ground truth and system output objects as a ratio of
      the spatial intersection between the two objects and the spatial union of
      them. The sum of all of the overlaps was normalized over the average of
the number of ground truth and detected objects.

    Frame Detection Accuracy (FDA):

$$\mathrm{FDA}(t) = \frac{\text{Overlap Ratio}}{\left(N_G^{(t)} + N_D^{(t)}\right)/2}, \qquad \text{Overlap Ratio} = \sum_{i=1}^{N_{\mathrm{mapped}}^{(t)}} \frac{\left|G_i^{(t)} \cap D_i^{(t)}\right|}{\left|G_i^{(t)} \cup D_i^{(t)}\right|}$$

 Gi denotes the ith ground truth object at the sequence level and Gi(t) denotes the ith ground truth object in frame t.
 Di denotes the ith detected object at the sequence level and Di(t) denotes the ith detected object in frame t.
 N(t)G and N(t)D denote the number of ground truth objects and the number of detected objects in frame t, respectively.
Detection Metric
•    The Sequence Frame Detection Accuracy (SFDA), is essentially the
     average of the FDA measure over all of the relevant frames in the
     sequence.

    Sequence Frame Detection Accuracy (SFDA):

$$\mathrm{SFDA} = \frac{\sum_{t=1}^{N_{\mathrm{frames}}} \mathrm{FDA}(t)}{\sum_{t=1}^{N_{\mathrm{frames}}} \exists\left(N_G^{(t)}\ \mathrm{OR}\ N_D^{(t)}\right)}$$

    Range: 0 to 1 (higher is better).

    Nframes is the number of frames in the sequence; the denominator counts only the frames in which a ground truth or detected object exists.
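A direct Python transcription of these two formulas, assuming the per-frame mapping between ground truth and detected objects has already been computed (the VACE protocol obtains it by optimal assignment; it is taken as given here):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def fda(mapped_pairs, n_gt, n_det):
    """FDA(t): summed IoU over mapped (ground truth, detection) pairs,
    normalized by the average object count in the frame."""
    if n_gt + n_det == 0:
        return None  # frame has no objects; excluded from SFDA
    overlap_ratio = sum(iou(g, d) for g, d in mapped_pairs)
    return overlap_ratio / ((n_gt + n_det) / 2.0)

def sfda(per_frame_fda):
    """SFDA: mean FDA over frames that contain any GT or detected object."""
    relevant = [f for f in per_frame_fda if f is not None]
    return sum(relevant) / len(relevant) if relevant else 0.0
```

In the worked example two slides ahead, two mapped pairs with overlap ratios 0.4505 and 1.0 among five ground truth and five detected objects give FDA = (0.4505 + 1) / 5 = 0.2901.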
Tracking Metric
 •    The Average Tracking Accuracy (ATA) is a spatio-temporal measure
      which penalizes fragmentations in both the temporal and spatial dimensions
      while accounting for the number of objects detected and tracked, missed
      objects, and false positives.
       Sequence Track Detection Accuracy (STDA):

$$\mathrm{STDA} = \sum_{i=1}^{N_{\mathrm{mapped}}} \frac{\sum_{t=1}^{N^i_{\mathrm{frames}}} \left|G_i^{(t)} \cap D_i^{(t)}\right| / \left|G_i^{(t)} \cup D_i^{(t)}\right|}{N_{(G_i \cup D_i \neq \varnothing)}}$$

       Average Tracking Accuracy (ATA):

$$\mathrm{ATA} = \frac{\mathrm{STDA}}{\left(N_G + N_D\right)/2}$$

       Range: 0 to 1 (higher is better).

NG and ND denote the number of unique ground truth objects and the number of unique detected objects in the given sequence, respectively; uniqueness is defined by object IDs. N(Gi ∪ Di ≠ ∅) is the number of frames in which either the ith ground truth track or its mapped detected track exists.
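The tracking metrics transcribe the same way, reusing iou() from the sketch above and assuming the mapped (ground truth, detection) track pairs are given as per-frame box dictionaries keyed by frame index:

```python
def stda(mapped_tracks):
    """STDA: for each mapped (gt, det) track pair, the summed per-frame
    IoU divided by the number of frames in which either track exists."""
    total = 0.0
    for gt, det in mapped_tracks:  # each track: {frame_index: (x, y, w, h)}
        either = set(gt) | set(det)  # frames where Gi or Di exists
        both = set(gt) & set(det)    # frames where the overlap is defined
        total += sum(iou(gt[t], det[t]) for t in both) / len(either)
    return total

def ata(mapped_tracks, n_gt_tracks, n_det_tracks):
    """ATA: STDA normalized by the average number of unique ground truth
    and detected track IDs in the sequence."""
    return stda(mapped_tracks) / ((n_gt_tracks + n_det_tracks) / 2.0)
```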
Example Detection Scoring

Green: detected box; Red: ground truth box; Yellow: overlap in mapped boxes.

• One mapped pair with a spatial alignment error (overlap ratio = 0.4505)
• One correctly detected object with perfect overlap (overlap ratio = 1.0)
• 3 false alarm objects and 3 missed objects

$$\mathrm{FDA} = \frac{[0.4505] + [1]}{(5 + 5)/2} = 0.2901$$
Annotation Quality

 Evaluation relies on manual labeling:

 • The degree of consistency becomes increasingly important as systems approach human levels of performance.
 • A high degree of consistency would be difficult to achieve with somewhat subjective attributes like readability.
 • Humans fatigue easily when performing such tedious tasks.

 Therefore, 10% of the entire corpus was doubly annotated by multiple annotators and checked for quality using the evaluation measures.
Annotation Quality
  For the doubly annotated corpus:

         Measure                                            Task             Score
         Average Sequence Frame Detection Accuracy (SFDA)   Text detection   95%
         Average of Average Tracking Accuracy (ATA)         Text tracking    85%

  The scores for the current state-of-the-art automatic algorithms
  are significantly lower than these numbers (22% relative for text
  detection, and 61% relative for text tracking).
Annotation Quality




  Flowchart of Annotation Quality Control Procedure. Steps denoted by dark shaded boxes were carried
  out by the annotators. Steps denoted by light shaded boxes were carried out by the evaluators.
Text Detection and Tracking – VACE

Bar chart: mean SFDA and ATA scores for English text detection and tracking (Broadcast News) for four systems A, B, C, and D; scores range from 0 to 1.
Text Recognition Evaluation
• Datasets: Broadcast News
• Training/Dry Run Development Set
   – 5 Clips
      • 14.5 minutes
      • 1181 words

• Evaluation Set
   – 25 Clips
      • 62.5 minutes
      • 4178 word objects
      • 68,738 word frame instances
Text Recognition Evaluation
 Evaluate only the most easily readable text (to establish
  a baseline at a high level of inter-annotator agreement)
      • Type = graphic (no scene text)
      • Readability = 2
      • Logo = false
      • Occlusion = false
      • Ambiguous = false
         — Exclude scrolling (ticker), dynamic text
           (scoreboard)
      • Case insensitive and punctuation ignored
Sample Annotation Clip (Word-level)
Recognition Evaluation Metrics
• Spatially map system output detected words to reference
  words, then compare the strings for mapped words
   – An unmapped word in system output incurs an Insertion (I) error
   – An unmapped word in reference incurs a Deletion (D) error
   – A mapped word with a character mismatch incurs a Substitution
     (S) error

      REF:         The  raven  caws   at        midnight
      SYS OUTPUT:       raven  calls  at   at   midnight
      Errors:       D           S          I

      WER = (I + D + S) / (Total # Words in Ref) = (1 + 1 + 1) / 5 = 3/5 (60%)

• Errors are accumulated over entire test set
• Also generate: Character Error Rate
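Word-level error counts come from a standard Levenshtein alignment between the reference and the mapped system output; a minimal implementation reproducing the example above:

```python
def wer_counts(ref, hyp):
    """Word-level Levenshtein alignment; returns (subs, dels, ins)."""
    ref, hyp = ref.split(), hyp.split()
    # dp[i][j] = minimal edit cost aligning ref[:i] with hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i-1][j-1] + (ref[i-1] != hyp[j-1])
            dp[i][j] = min(sub, dp[i-1][j] + 1, dp[i][j-1] + 1)
    # Backtrack to count substitutions, deletions, and insertions.
    i, j, s, d, ins = len(ref), len(hyp), 0, 0, 0
    while i or j:
        if i and j and dp[i][j] == dp[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            s += ref[i-1] != hyp[j-1]
            i, j = i - 1, j - 1
        elif i and dp[i][j] == dp[i-1][j] + 1:
            d, i = d + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return s, d, ins

s, d, ins = wer_counts("The raven caws at midnight",
                       "raven calls at at midnight")
print((s + d + ins) / 5)  # 0.6, matching the 60% example above
```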
Individual Clip Word Error Rate

Plot: WER per clip, normalized by the number of words in each clip, across the 25 evaluation clips (clip-wise WER).
Scores (Word Error Rate)

Chart: Word Error Rates with different normalizations (WER/Word and WER/Frames).

          Overall:   WER = 0.4233    CER = 0.2823
Outline
•   Introduction
•   Recent Progress
•   Performance Evaluation
•   Discussion
Discussion
• The recent progress provides many promising
  solutions and research directions for the text extraction
  problem.
• Due to the large variations of text objects in videos, no
  single approach can achieve satisfactory performance in
  all applications.
• To further improve the performance of text extraction
  techniques, much work in the area remains.
Discussion

 Detection and Localization
  – How to efficiently combine several complementary
    extraction algorithms to produce better performance,
    and how to extract better features by analyzing the
    shape of characters and the relationships between
    text and its background, still need more investigation.
Discussion
 Tracking
  – Although text tracking is an indispensable step for
    text extraction in videos, not many text tracking
    approaches have been reported in recent years.

  – More effort is needed to focus on tracking, not only
    for static and scrolling text, but also for dynamic text
    objects (growing, shrinking, and rotating text).
Discussion
 Datasets:
  – Because most algorithms are still tested on their own
    datasets, a large, freely available, annotated video
    dataset is urgently needed in order to compare and
    evaluate the algorithms.
THANK YOU!
See you at ICPR 2008 in December

JPM1417 Characterness: An Indicator of Text in the Wildchennaijp
 
Detection and Localization of Text Information in Video Frames
Detection and Localization of Text Information in Video FramesDetection and Localization of Text Information in Video Frames
Detection and Localization of Text Information in Video FramesIOSR Journals
 
Customized mask region based convolutional neural networks for un-uniformed ...
Customized mask region based convolutional neural networks  for un-uniformed ...Customized mask region based convolutional neural networks  for un-uniformed ...
Customized mask region based convolutional neural networks for un-uniformed ...IJECEIAES
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Enhanced characterness for text detection in the wild
Enhanced characterness for text detection in the wildEnhanced characterness for text detection in the wild
Enhanced characterness for text detection in the wildPrerana Mukherjee
 
Textual Document Categorization using Bigram Maximum Likelihood and KNN
Textual Document Categorization using Bigram Maximum Likelihood and KNNTextual Document Categorization using Bigram Maximum Likelihood and KNN
Textual Document Categorization using Bigram Maximum Likelihood and KNNRounak Dhaneriya
 

Similar to Recent Progress in Video Text Extraction Techniques (20)

COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S...
COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S...COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S...
COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S...
 
A Survey On Thresholding Operators of Text Extraction In Videos
A Survey On Thresholding Operators of Text Extraction In VideosA Survey On Thresholding Operators of Text Extraction In Videos
A Survey On Thresholding Operators of Text Extraction In Videos
 
Volume 2-issue-6-2119-2124
Volume 2-issue-6-2119-2124Volume 2-issue-6-2119-2124
Volume 2-issue-6-2119-2124
 
LOCALIZATION OF OVERLAID TEXT BASED ON NOISE INCONSISTENCIES
LOCALIZATION OF OVERLAID TEXT BASED ON NOISE INCONSISTENCIESLOCALIZATION OF OVERLAID TEXT BASED ON NOISE INCONSISTENCIES
LOCALIZATION OF OVERLAID TEXT BASED ON NOISE INCONSISTENCIES
 
LOCALIZATION OF OVERLAID TEXT BASED ON NOISE INCONSISTENCIES
LOCALIZATION OF OVERLAID TEXT BASED ON NOISE INCONSISTENCIESLOCALIZATION OF OVERLAID TEXT BASED ON NOISE INCONSISTENCIES
LOCALIZATION OF OVERLAID TEXT BASED ON NOISE INCONSISTENCIES
 
Text Extraction of Colour Images using Mathematical Morphology & HAAR Transform
Text Extraction of Colour Images using Mathematical Morphology & HAAR TransformText Extraction of Colour Images using Mathematical Morphology & HAAR Transform
Text Extraction of Colour Images using Mathematical Morphology & HAAR Transform
 
CRNN model for text detection and classification from natural scenes
CRNN model for text detection and classification from natural scenesCRNN model for text detection and classification from natural scenes
CRNN model for text detection and classification from natural scenes
 
Comparative analysis of c99 and topictiling text
Comparative analysis of c99 and topictiling textComparative analysis of c99 and topictiling text
Comparative analysis of c99 and topictiling text
 
Comparative analysis of c99 and topictiling text segmentation algorithms
Comparative analysis of c99 and topictiling text segmentation algorithmsComparative analysis of c99 and topictiling text segmentation algorithms
Comparative analysis of c99 and topictiling text segmentation algorithms
 
industrial engg
industrial enggindustrial engg
industrial engg
 
40120140501009
4012014050100940120140501009
40120140501009
 
JPM1417 Characterness: An Indicator of Text in the Wild
JPM1417   Characterness: An Indicator of Text in the WildJPM1417   Characterness: An Indicator of Text in the Wild
JPM1417 Characterness: An Indicator of Text in the Wild
 
C04741319
C04741319C04741319
C04741319
 
E1803012329
E1803012329E1803012329
E1803012329
 
Detection and Localization of Text Information in Video Frames
Detection and Localization of Text Information in Video FramesDetection and Localization of Text Information in Video Frames
Detection and Localization of Text Information in Video Frames
 
Customized mask region based convolutional neural networks for un-uniformed ...
Customized mask region based convolutional neural networks  for un-uniformed ...Customized mask region based convolutional neural networks  for un-uniformed ...
Customized mask region based convolutional neural networks for un-uniformed ...
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Enhanced characterness for text detection in the wild
Enhanced characterness for text detection in the wildEnhanced characterness for text detection in the wild
Enhanced characterness for text detection in the wild
 
F045053236
F045053236F045053236
F045053236
 
Textual Document Categorization using Bigram Maximum Likelihood and KNN
Textual Document Categorization using Bigram Maximum Likelihood and KNNTextual Document Categorization using Bigram Maximum Likelihood and KNN
Textual Document Categorization using Bigram Maximum Likelihood and KNN
 

Recently uploaded

Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 

Recently uploaded (20)

Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 

Recent Progress in Video Text Extraction Techniques

  • 12. Introduction  Region Based Approach utilizes the different region properties between text and background to extract text objects. – Bottom-up: separating the image into small regions and then grouping character regions into text regions. – Color features, edge features, and connected component methods  Texture Based Approach uses distinct texture properties of text to extract text objects from background. – Top-down: extracting texture features of the image and then locating text regions. – Spatial variance, Fourier transform, Wavelet transform, and machine learning methods.
  • 13. Outline • Introduction • Recent Progress • Performance Evaluation • Discussion
  • 14. Recent Progress  Text extraction in video documents, as an important research branch of content-based information retrieval and indexing, continues to be a topic of much interest to researchers.  A large number of newly proposed approaches in the literature have contributed to impressive progress in text extraction techniques.
  • 15. Recent Progress
Prior to 2003:
• Only a few text extraction approaches considered the temporal nature of video.
• Very little work was done on scene text.
• Objective performance evaluation metrics were scarce.
Now:
• Temporal redundancy of video is utilized by almost all recent text extraction approaches.
• Scene text extraction is being extensively studied.
• A comprehensive performance evaluation framework has been developed.
  • 16. Recent Progress  The progress of text extraction in videos can be categorized into three types: • New and improved text extraction approaches • Text extraction techniques adopted from other research fields • Text extraction approaches proposed for specific text types and specific genres of video documents
  • 17. Recent Progress • New and improved text extraction approaches: The new and improved approaches play an important role in the recent progress of text extraction techniques for videos. These approaches introduce not only new algorithms but also a new understanding of the problem.
  • 18. Recent Progress - New and improved text extraction approaches
H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005.
A text string is modeled by its center line and the skeletons of its characters, detected as ridges at different hierarchical scales. First line: images with rectangles showing the text regions. Second line: zoom on the text regions. Third line: ridges detected at two scales (red at the coarse scale, blue at the fine scale) in the text region, representing the local structure of text lines whatever the type of text.
  • 19. H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005.
• Abstract: We propose a novel approach for finding text in images by using ridges at several scales. A text string is modelled by a ridge at a coarse scale representing its center line and numerous short ridges at a smaller scale representing the skeletons of characters. Skeleton ridges have to satisfy geometrical and spatial constraints such as perpendicularity or non-parallelism to the central ridge. In this way, we obtain a hierarchical description of text strings, which can provide direct input to an OCR or a text analysis system. The proposed method does not depend on a particular alphabet, works with a wide variety of character sizes, and does not depend on the orientation of the text string. The experimental results show good detection.
X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and Learning of Neighbor Characters for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008.
Abstract: This paper proposes an approach based on the statistical modeling and learning of neighboring characters to extract multilingual texts in images. The case of three neighboring characters is represented as a Gaussian mixture model and discriminated from other cases by the corresponding 'pseudo-probability' defined under the Bayes framework. Based on this modeling, text extraction is completed through labeling each connected component in the binary image as character or non-character according to its neighbors, where a mathematical morphology based method is introduced to detect and connect the separated parts of each character, and a Voronoi partition based method is advised to establish the neighborhoods of connected components. We further present a discriminative training algorithm based on the maximum-minimum similarity (MMS) criterion to estimate the parameters in the proposed text extraction approach. Experimental results in Chinese and English text extraction demonstrate the effectiveness of our approach trained with the MMS algorithm, which achieved a precision rate of 93.56% and a recall rate of 98.55% for the test data set. In the experiments, we also show that MMS provides significant improvement of overall performance, compared with the influential training criteria of maximum likelihood (ML) and maximum classification error (MCE).
  • 20. Recent Progress - New and improved text extraction approaches
X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and Learning of Neighbor Characters for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008.
The GMM based algorithm models the text features of three neighboring characters as a mixture of three Gaussians to extract text objects.
An example of neighborhood computation. In each figure, image (a) shows a binary image, where black dots denote centroids of connected components (CCs); image (b) shows the Delaunay triangulation of the centroids, where each triangle corresponds to a neighbor set. However, the neighborhoods of characters cannot be completely reflected in the Delaunay triangulation. (c) The solution: take every three nodes that are joined one by one in the convex hull of the centroid set as neighbor sets.
  • 21. Recent Progress - New and improved text extraction approaches
P. Dubey, Edge Based Text Detection for Multi-purpose Application, Proceedings of International Conference on Signal Processing, IEEE, Vol. 4, 2006.
Only vertical edge features are utilized to find text regions, based on the observation that vertical edges enhance the characteristics of text and eliminate most irrelevant information.
(a) Original image, (b) detected group of vertical lines, (c) extracted text region, (d) result.
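To make the vertical-edge pipeline concrete, here is a minimal Python/OpenCV sketch of the general idea (vertical edges, morphological merging, geometric filtering). The threshold and kernel values are illustrative assumptions, not Dubey's parameters.

```python
# Sketch of vertical-edge-based text detection (illustrative parameters).
import cv2
import numpy as np

def detect_text_regions(gray):
    """Return bounding boxes of candidate text regions in a grayscale frame."""
    # Vertical edges respond strongly to character strokes.
    sobel_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    edges = (np.abs(sobel_x) > 100).astype(np.uint8) * 255  # assumed threshold

    # Morphological closing with a wide kernel merges nearby vertical
    # edges into horizontal candidate blocks (text lines).
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    blocks = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)

    boxes = []
    contours, _ = cv2.findContours(blocks, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        # Simple geometric filter: text lines are usually wider than tall.
        if w > 2 * h and h > 8:
            boxes.append((x, y, w, h))
    return boxes
```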
  • 22. Recent Progress - New and improved text extraction approaches
K. Subramanian, P. Natarajan, M. Decerbo, and D. Castanon, Character-Stroke Detection for Text-Localization and Extraction, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 33-37, 2007.
Character strokes are used to extract text objects: three line scans (sets of pixels along horizontal lines of an intensity image) detect image intensity changes.
(a) Original image, (b) intensity plots along the blue line l and the lines l-2 and l+2, where ∆ is the stroke width, (c) threshold Ig ≤ 0.35, (d) the thresholded image after morphological operations and connected component analysis.
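The line-scan idea can be illustrated in a few lines of Python: along one image row, character strokes appear as short runs bounded by sharp intensity transitions, with approximately constant run width. The gradient and width thresholds below are assumptions for illustration, not the values used by Subramanian et al.

```python
# Sketch of stroke detection along a single scanline (illustrative thresholds).
import numpy as np

def stroke_widths_along_row(gray_row, grad_thresh=40):
    """Return widths (in pixels) of candidate stroke runs along one scanline."""
    grad = np.diff(gray_row.astype(np.int32))
    # Positions where intensity changes sharply (candidate stroke boundaries).
    transitions = np.flatnonzero(np.abs(grad) > grad_thresh)
    # Widths between consecutive boundaries; strokes yield many small,
    # nearly equal widths, while background yields irregular ones.
    return np.diff(transitions)

def looks_like_text(widths, max_stroke=10, min_runs=4):
    small = widths[widths <= max_stroke]
    return len(small) >= min_runs and np.std(small) < 2.0
```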
  • 23. P. Dubey, Edge Based Text Detection for Multi-purpose Application, Proceedings of International Conference on Signal Processing, IEEE, Vol. 4, 2006.
• Abstract: Text detection plays a crucial role in various applications. In this paper we present an edge based text detection technique for complex images in multi-purpose applications. The technique applies vertical Sobel edge detection and a newly proposed morphological technique used to connect the edges to form candidate regions. The technique has the special advantage of providing a distinguishable texture on the text area over the others. The connected components are then extracted using a proposed segmentation algorithm. Later all the candidate regions are verified to specify the text region. The proposed technique has been tested with different types of images acquired from different input sources and environments. The experimental results show a highly successful rate.
K. Subramanian, P. Natarajan, M. Decerbo, and D. Castanon, Character-Stroke Detection for Text-Localization and Extraction, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 33-37, 2007.
Abstract: In this paper, we present a new approach for analysis of images for text localization and extraction. Our approach puts very few constraints on the font, size and color of text and is capable of handling both scene text and artificial text well. In this paper, we exploit two well-known features of text: approximately constant stroke width and local contrast, and develop a fast, simple, and effective algorithm to detect character strokes. We also show how these can be used for accurate extraction and motivate some advantages of using this approach for text localization over other colorspace segmentation based approaches. We analyze the performance of our stroke detection algorithm on images collected for the robust-reading competitions at ICDAR 2003.
  • 24. Recent Progress - New and improved text extraction approaches
D. Crandall, S. Antani, R. Kasturi, Extraction of special effects caption text events from digital video, International Journal on Document Analysis and Recognition, Vol. 5, pp. 138-157, 2003.
An 8×8 block-wise DCT is applied to each video frame. For each block, 19 optimal coefficients that best correspond to the properties of text are determined empirically. The sum of the absolute values of these coefficients is computed and regarded as a measure of the "text energy" of that block. The motion vectors of MPEG-compressed videos are used for text object tracking.
(a) Original image (b) Text energy (c) Tracking result
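A sketch of the block-wise "text energy" computation follows. The 19 coefficients in the paper were chosen empirically, so the coefficient mask below is a mid-frequency placeholder, not the authors' set.

```python
# Sketch of block-wise DCT "text energy" (coefficient set is a placeholder).
import numpy as np
from scipy.fft import dctn

# Placeholder: 19 mid-frequency coefficient positions (NOT the paper's set).
TEXT_COEFFS = [(u, v) for u in range(8) for v in range(8) if 2 <= u + v <= 6][:19]

def text_energy_map(gray):
    """Return a per-block energy map; high values suggest text blocks."""
    h, w = (gray.shape[0] // 8) * 8, (gray.shape[1] // 8) * 8
    energy = np.zeros((h // 8, w // 8))
    for by in range(0, h, 8):
        for bx in range(0, w, 8):
            block = dctn(gray[by:by+8, bx:bx+8].astype(float), norm='ortho')
            energy[by // 8, bx // 8] = sum(abs(block[u, v])
                                           for u, v in TEXT_COEFFS)
    return energy
```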
  • 25. D. Crandall, S. Antani, R. Kasturi, Extraction of special effects caption text events from digital video, International Journal on Document Analysis and Recognition, Vol. 5, pp. 138-157, 2003 • Abstract. The popularity of digital video is increasing rapidly. To help users navigate libraries of video, algorithms that automatically index video based on content are needed. One approach is to extract text appearing in video, which often reflects a scene’s semantic content. This is a difficult problem due to the unconstrained nature of general-purpose video. Text can have arbitrary color, size, and orientation. Backgrounds may be complex and changing. Most work so far has made restrictive assumptions about the nature of text occurring in video. Such work is therefore not directly applicable to unconstrained, general-purpose video. In addition, most work so far has focused only on detecting the spatial extent of text in individual video frames. However, text occurring in video usually persists for several seconds. This constitutes a text event that should be entered only once in the video index. Therefore it is also necessary to determine the temporal extent of text events. This is a non-trivial problem because text may move, rotate, grow, shrink, or otherwise change over time. Such text effects are common in television programs and commercials but so far have received little attention in the literature. This paper discusses detecting, binarizing, and tracking caption text in general-purpose MPEG-1 video. Solutions are proposed for each of these problems and compared with existing work found in the literature.
  • 26. Recent Progress - New and improved text extraction approaches
In addition, many earlier text extraction approaches have recently been enhanced and extended. By extracting and integrating more comprehensive characteristics of text objects, these enhanced approaches provide more robust performance than their predecessors. Besides entirely new approaches, many such improved approaches have been presented to overcome the limitations of earlier ones.
  • 27. Recent Progress - New and improved text extraction approaches
S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005.
A color-related detector, a wavelet-based texture detector, an edge-based contour detector, and the temporal invariance principle are adopted to detect candidate caption regions; a parallel fusion strategy then merges the detector outputs.
C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
Euclidean distance based and cosine similarity based clustering methods are applied complementarily in the RGB color space to partition the original image into three clusters: textual foreground, textual background, and noise.
Overview of the proposed algorithm combining color and spatial information.
  • 28. S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005.
• Abstract: In this article, we focus on the problem of caption detection in video sequences. Contrary to most existing approaches based on a single detector followed by an ad hoc and costly post-processing, we have decided to consider several detectors and to merge their results in order to combine the advantages of each one. First we made a study of captions in video sequences to determine how they are represented in images and to identify their main features (color constancy and background contrast, edge density and regularity, temporal persistence). Based on these features, we then select or define the appropriate detectors and we compare several fusion strategies which can be involved. The logical process we have followed and the satisfying results we have obtained let us validate our contribution.
C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
Abstract: Natural scene images have brought new challenges in recent years, one of which is text understanding in images and videos. Text extraction, which consists of segmenting the textual foreground from the background, usually relies on color information. Faced with the large diversity of text information in daily life and artistic ways of display, we are convinced that this information alone is no longer enough, and we present a color segmentation algorithm using spatial information. Moreover, a new method is proposed in this paper to handle uneven lighting, blur and complex backgrounds, which are degradations inherent to natural scene images. To merge text pixels together, complementary clustering distances are used to support simultaneously clear and well-contrasted images and complex, degraded images. Tests on a public database finally show the efficiency of the whole proposed method.
  • 29. Recent Progress - New and improved text extraction approaches
M.R. Lyu, J. Song, M. Cai, A Comprehensive Method for Multilingual Video Text Detection, Localization, and Extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005.
The sequential multi-resolution paradigm removes the redundancy of the parallel multi-resolution paradigm: no text edge appears multiple times at different resolution levels.
Sequential multiresolution paradigm
  • 30. M.R. Lyu, J Song, M. Cai, A Comprehensive method for multilingual video text detection, localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005. • Abstract—Text in video is a very compact and accurate clue for video indexing and summarization. Most video text detection and extraction methods hold assumptions on text color, background contrast, and font style. Moreover, few methods can handle multilingual text well since different languages may have quite different appearances. This paper performs a detailed analysis of multilingual text characteristics, including English and Chinese. Based on the analysis, we propose a comprehensive, efficient video text detection, localization, and extraction method, which emphasizes the multilingual capability over the whole processing. The proposed method is also robust to various background complexities and text appearances. The text detection is carried out by edge detection, local thresholding, and hysteresis edge recovery. The coarse-to-fine localization scheme is then performed to identify text regions accurately. The text extraction consists of adaptive thresholding, dam point labeling, and inward filling. Experimental results on a large number of video images and comparisons with other methods are reported in detail.
  • 31. Recent Progress -New and improved text extraction approaches J. Gllavata, E. Qeli and B. Freisleben, Detecting Text in Videos Using Fuzzy Clustering Ensembles, Proceedings of the Eighth IEEE International Symposium on Multimedia, pp. 283-290, 2006. Fuzzy C-means based individual frame clustering is replaced by the fuzzy clustering ensemble (FCE) based multi-frame clustering to utilize temporal redundancy. Fuzzy cluster ensemble for text detection in videos
  • 32. J. Gllavata, E. Qeli and B. Freisleben, Detecting Text in Videos Using Fuzzy Clustering Ensembles, Proceedings of the Eighth IEEE International Symposium on Multimedia, pp. 283-290, 2006. • Abstract: Detection and localization of text in videos is an important task towards enabling automatic content-based retrieval of digital video databases. However, since text is often displayed against a complex background, its detection is a challenging problem. In this paper, a novel approach based on fuzzy cluster ensemble techniques to solve this problem is presented. The advantage of this approach is that the fuzzy clustering ensemble allows the incremental inclusion of temporal information regarding the appearance of static text in videos. Comparative experimental results for a test set of 10.92 minutes of video sequences have shown the very good performance of the proposed approach with an overall recall of 92.04% and a precision of 96.71%.
  • 33. Recent Progress 2. Text extraction techniques adopted from other research fields: Another encouraging development is that more and more techniques that have been successfully applied in other research fields have been adapted for text extraction. Because these approaches were not initially designed for the text extraction task, many unique characteristics of their original research fields are intrinsically embedded in them. By using approaches from other fields, we can therefore view the text extraction problem from the viewpoints of other related research fields and benefit from them. This is a promising way to find good solutions for the text extraction task.
  • 34. Recent Progress - Text extraction techniques adopted from other research fields
K.I. Kim, K. Jung and J.H. Kim, Texture-based approach for text detection in images using support vector machine and continuously adaptive mean shift algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 12, pp. 1631-1638, 2003.
The continuously adaptive mean shift algorithm (CAMSHIFT) was initially used to detect and track faces in a video stream.
Example of text detection using CAMSHIFT: (a) input image (540×400), (b) initial window configuration for the CAMSHIFT iteration (5×5-sized windows located at regular intervals of (25, 25)), (c) texture-classified region marked in white and gray (white: text region, gray: non-text region), and (d) final detection result.
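As a rough illustration of how CAMSHIFT transfers to text detection, the sketch below runs OpenCV's CamShift on a per-pixel text-probability map used in place of the usual color back-projection. The SVM texture classifier that would produce `text_prob` is omitted, and the initial window is an arbitrary choice.

```python
# Sketch: CAMSHIFT over a text-probability map (classifier omitted).
import cv2
import numpy as np

def camshift_text_window(text_prob, init_window=(0, 0, 5, 5)):
    """text_prob: float map in [0, 1], e.g., SVM per-pixel text scores."""
    prob8 = (text_prob * 255).astype(np.uint8)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    # CamShift adapts the window's position and size to the probability mass.
    rotated_box, window = cv2.CamShift(prob8, init_window, criteria)
    return rotated_box, window
```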
  • 35. Recent Progress - Text extraction techniques adopted from other research fields
H.B. Aradhye and G.K. Myers, Exploiting Videotext "Events" for Improved Videotext Detection, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 894-898, 2007.
Multiscale statistical process control (MSSPC) was originally proposed for detecting changes in univariate and multivariate signals.
Substeps involved in the use of MSSPC for videotext event detection
  • 36. K.I. Kim, K. Jung and J.H. Kim, Texture-based approach for text detection in images using support vector machine and continuously adaptive mean shift algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 12, pp. 1631-1638, 2003.
• Abstract: The current paper presents a novel texture-based method for detecting texts in images. A support vector machine (SVM) is used to analyze the textural properties of texts. No external texture feature extraction module is used; rather, the intensities of the raw pixels that make up the textural pattern are fed directly to the SVM, which works well even in high-dimensional spaces. Next, text regions are identified by applying a continuously adaptive mean shift algorithm (CAMSHIFT) to the results of the texture analysis. The combination of CAMSHIFT and SVMs produces both robust and efficient text detection, as time-consuming texture analyses for less relevant pixels are restricted, leaving only a small part of the input image to be texture-analyzed.
H.B. Aradhye and G.K. Myers, Exploiting Videotext "Events" for Improved Videotext Detection, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 894-898, 2007.
Abstract: Text in video, whether overlay or in-scene, contains a wealth of information vital to automated content analysis systems. However, the low resolution of the imagery, coupled with the richness of the background and compression artifacts, limits the detection accuracy that can be achieved in practice using existing text detection algorithms. This paper presents a novel, noncausal temporal aggregation method that acts as a second pass over the output of an existing text detector over the entire video clip. A multiresolution change detection algorithm is used along the time axis to detect the appearance and disappearance of multiple, concurrent lines of text, followed by recursive time-averaged projections on the Y and X axes. This algorithm detects and rectifies instances of missed text and enhances the spatial boundaries of detected text lines using consensus estimates. Experimental results, which demonstrate significant performance gains on publicly collected and annotated data, are presented.
  • 37. Recent Progress - Text extraction techniques adopted from other research fields
D. Liu and T. Chen, Object Detection in Video with Graphical Models, Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp. 14-19, 2006.
Discriminative Random Fields (DRF) were initially applied to detect man-made buildings in 2D images.
(a) 2D DRF, with state si and one of its neighbors sj. (b) 3D DRF, with multiple 2D DRFs stacked over time. (c) 2D DRF-HMM type (A), with intra-frame dependencies modelled by undirected DRFs, and inter-frame dependencies modelled by HMMs. States are shared between the two models.
  • 38. Recent Progress - Text extraction techniques adopted from other research fields
W.M. Pan, T.D. Bui, and C.Y. Suen, Text Segmentation from Complex Background Using Sparse Representations, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 412-416, 2007.
Sparse representation was initially used in research on the receptive fields of simple cells.
(a) Camera-captured image; (b) foreground text generated by image decomposition via sparse representations; (c) binarized result of (b) using Otsu's method.
  • 39. D. Liu and T. Chen, Object Detection in Video with Graphical Models, Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp. 14-19, 2006.
• Abstract: In this paper, we propose a general object detection framework which combines the Hidden Markov Model with Discriminative Random Fields. Recent object detection algorithms have achieved impressive results by using graphical models, such as the Markov Random Field. These models, however, have only been applied to two-dimensional images. In many scenarios, video is the directly available source rather than images, hence an important source of information for detecting objects has been omitted: the temporal information. To demonstrate the importance of temporal information, we apply graphical models to the task of text detection in video and compare the results with and without temporal information. We also show the superiority of the proposed models over simple heuristics such as a median filter over time.
W.M. Pan, T.D. Bui, and C.Y. Suen, Text Segmentation from Complex Background Using Sparse Representations, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 412-416, 2007.
Abstract: A novel text segmentation method from complex background is presented in this paper. The idea is inspired by the recent development in searching for the sparse signal representation among a family of over-complete atoms, which is called a dictionary. We assume that the image under investigation is composed of two components: the foreground text and the complex background. We further assume that the latter can be modeled as a piece-wise smooth function. Then we choose two dictionaries, where the first one gives a sparse representation to one component and a nonsparse representation to the other, while the second one does the opposite. By looking for the sparse representations in each dictionary, we can decompose the image into the two composing components. After that, text segmentation can be easily achieved by applying simple thresholding to the text component. Preliminary experiments show some promising results.
  • 40. Recent Progress 3. Text extraction approaches proposed for specific text types and specific genres of video documents: Besides general text extraction approaches, an increasing number of approaches have been proposed for specific text types. Based on domain knowledge, these specific approaches can take advantage of the unique properties of a specific text type or video genre and often achieve better performance than general approaches.
  • 41. Recent Progress -Text extraction approaches proposed for specific text types and specific genre of video documents W. Wu, X. Chen and J. Yang, Detection of text on road signs from video, IEEE Transactions on Intelligent Transportation Systems, Vol. 6, pp. 378-390, 2005. This approach is composed of two stages: 1. localizing road signs; 2. detecting text. Architecture of the proposed framework
  • 42. W. Wu, X. Chen and J. Yang, Detection of text on road signs from video, IEEE Transactions on Intelligent Transportation Systems, Vol. 6, pp. 378-390, 2005. • Abstract—A fast and robust framework for incrementally detecting text on road signs from video is presented in this paper. This new framework makes two main contributions. 1) The framework applies a divide-and-conquer strategy to decompose the original task into two subtasks, that is, the localization of road signs and the detection of text on the signs. The algorithms for the two subtasks are naturally incorporated into a unified framework through a feature-based tracking algorithm. 2) The framework provides a novel way to detect text from video by integrating two- dimensional (2-D) image features in each video frame (e.g., color, edges, texture) with the three-dimensional (3-D) geometric structure information of objects extracted from video sequence (such as the vertical plane property of road signs). The feasibility of the proposed framework has been evaluated using 22 video sequences captured from a moving vehicle. This new framework gives an overall text detection rate of 88.9% and a false hit rate of 9.2%. It can easily be applied to other tasks of text detection from video and potentially be embedded in a driver assistance system.
  • 43. Recent Progress - Text extraction approaches proposed for specific text types and specific genres of video documents
C. Choudary and T. Liu, Summarization of Visual Content in Instructional Videos, IEEE Transactions on Multimedia, Vol. 9, pp. 1443-1455, 2007.
A content fluctuation curve based on the number of chalk pixels is used to measure the content in each frame of instructional videos. The frames with enough chalk pixels are extracted as key frames. Hausdorff-distance matching and connected-component decomposition are adopted to reduce the redundancy of key frames by matching the content and mosaicking the frames.
Comparison of our summary frames with the key frames obtained using different key frame selection methods in a test video: (a) our summarization algorithm; (b) fixed sampling; (c) dynamic clustering; (d) tolerance band. Our summary frames are rich in content and more appealing.
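A toy version of the content fluctuation curve can be written directly from this description. Counting bright pixels against a fixed threshold stands in for the paper's statistical chalk-pixel model, so treat both the threshold and the logic as assumptions.

```python
# Sketch: per-frame "chalk pixel" content curve (fixed threshold assumed).
import cv2
import numpy as np

def content_curve(video_path, chalk_thresh=200):
    """Return a list with one chalk-pixel count per frame."""
    cap = cv2.VideoCapture(video_path)
    curve = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Bright pixels on a dark board approximate written content.
        curve.append(int(np.count_nonzero(gray > chalk_thresh)))
    cap.release()
    return curve  # peaks indicate frames rich in written content
```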
  • 44. C. Choudary and T. Liu, Summarization of Visual Content in Instructional Videos, IEEE Transactions on Multimedia, Vol. 9, pp. 1443-1455, 2007.
• Abstract: In instructional videos of chalkboard presentations, the visual content refers to the text and figures written on the boards. Existing methods for video summarization are not effective for this video domain because they are mainly based on low-level image features such as color and edges. In this work, we present a novel approach to summarizing the visual content in instructional videos using middle-level features. We first develop a robust algorithm to extract content text and figures from instructional videos by statistical modelling and clustering. This algorithm addresses the image noise, nonuniformity of the board regions, camera movements, occlusions, and other challenges in instructional videos that are recorded in real classrooms. Using the extracted text and figures as the middle-level features, we retrieve a set of key frames that contain most of the visual content. We further reduce content redundancy and build a mosaicked summary image by matching extracted content based on the K-th Hausdorff distance and connected component decomposition. Performance evaluation on four full-length instructional videos shows that our algorithm is highly effective in summarizing instructional video content.
  • 45. Recent Progress - Text extraction approaches proposed for specific text types and specific genres of video documents
Additional References:
• C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006.
• D.Q. Zhang and S.F. Chang, Learning to Detect Scene Text Using a Higher-order MRF with Belief Propagation, IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2004.
• L. Tang and J.R. Kender, A unified text extraction method for instructional videos, Proceedings of IEEE International Conference on Image Processing, Vol. 3, pp. 11-14, 2005.
• M.R. Lyu, J. Song, M. Cai, A Comprehensive Method for Multilingual Video Text Detection, Localization, and Extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005.
• S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005.
• C.C. Lee, Y.C. Chiang, C.Y. Shih, H.M. Huang, Caption localization and detection for news videos using frequency analysis and wavelet features, Proceedings of IEEE International Conference on Tools with Artificial Intelligence, Vol. 2, pp. 539-542, 2007.
• …
  • 46. Outline • Introduction • Recent Progress • Performance Evaluation • Discussion
  • 47. Performance Evaluation
R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang, Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol, to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008. (http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.57)
Evaluation Metrics: Video Analysis and Content Extraction (VACE)
  • 48. Text: Task Definition  Detection Task: Spatially locate the blocks of text in each video frame in a video sequence • Text blocks (objects) contain all words in a particular line of text where the font and size are the same  Tracking Task: Spatially/temporally locate and track the text objects in a video sequence  Recognition Task: Transcribe the words in each frame, including their spatial location (detection implied)
  • 49. Task Definition Highlights • Annotate an oriented bounding rectangle around text objects (the reference annotation was done by VideoMining Inc., State College, PA) • Detection and Tracking tasks – Line-level annotation with IDs maintained – Rules based on similarity of font, proximity, and readability levels • Recognition task – Word level (IDs maintained) • Documents – Annotation guidelines, Evaluation protocol • Tools – ViPER (Annotation), USF-DATE (Scoring)
  • 50. Data Resources
VIDEO DATA      NUMBER OF CLIPS   TOTAL MINS
MICRO-CORPUS    5                 10
TRAINING        50                175
TESTING         50                175
• Micro-corpus: a small amount of data that was created after extensive discussions with the research community to act as a seed for initial annotation experiments and to provide new participants with a concrete sampling of the datasets and the tasks.
  • 51. Data Resources  These discussions were coordinated as a series of weekly teleconferences with VACE contractors and other eminent members of the CV community.  The discussions made the research community a partner in the evaluations and helped us in:
1. selecting the video recordings to be used in the evaluations,
2. creating the specifications for the ground truth annotations and scoring tools,
3. defining the evaluation infrastructure for the program.
  • 52. Data Resources
TASK                      DOMAIN
Text Detect & Track       Broadcast News (ABC & CNN*)
Face Detect & Track       Broadcast News (ABC & CNN*)
Vehicle Detect & Track    Surveillance (i-LIDS**)
 MPEG-2 standard, progressive scanned at 720 × 480 resolution. GOP (Group of Pictures) of 12 for the broadcast news corpus, where the frame rate was 29.97 fps (frames per second), and GOP of 10 for the surveillance dataset, where the frame rate was 25 fps.
* Distributed by the Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu
** i-LIDS [Multiple Camera Tracking/Parked Vehicle Detection/Abandoned Baggage Detection] scenario datasets were developed by the UK Home Office and CPNI. (http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imagingtechnology/video-based-detection-systems/i-lids/)
  • 53. Reference Annotations  Text Ground Truth: Every new text area was marked with a box when it appeared in the video. The box was moved and scaled to fit the text as it moved in successive frames. This process was done at the text line level until the text disappeared from the frame. Three readability levels:
READABILITY = 0 (white): completely unreadable text
READABILITY = 1 (gray): partially readable text
READABILITY = 2 (black): clearly readable text
  • 54. Reference Annotations • Text regions were tagged based on a comprehensive set of rules: • All text within a selected block must have the same readability level and type. • Blocks of text must have the same size and font. • The bounding box should be tight, to the extent that there is no space between the box and the text. • Text boxes may not overlap other text boxes unless the characters themselves are superimposed atop one another.
  • 55. Sample Annotation Clip (line-level)
  • 56. Detection Metric • The Frame Detection Accuracy (FDA) measure calculates the spatial overlap between the ground truth and system output objects as a ratio of the spatial intersection between the two objects and the spatial union of them. The sum of all of the overlaps is normalized over the average of the number of ground truth and detected objects:
FDA(t) = Overlap_Ratio / [(N_G(t) + N_D(t)) / 2]
where Overlap_Ratio = Σ_{i=1..N_mapped(t)} ( |G_i(t) ∩ D_i(t)| / |G_i(t) ∪ D_i(t)| )
G_i denotes the ith ground truth object at the sequence level and G_i(t) denotes the ith ground truth object in frame t. D_i denotes the ith detected object at the sequence level and D_i(t) denotes the ith detected object in frame t. N_G(t) and N_D(t) denote the number of ground truth objects and the number of detected objects in frame t, respectively.
• 57. Detection Metric
  – The Sequence Frame Detection Accuracy (SFDA) is essentially the average of the FDA measure over all relevant frames in the sequence:

  \[ \mathrm{SFDA} = \frac{\sum_{t=1}^{N_{frames}} \mathrm{FDA}(t)}{\sum_{t=1}^{N_{frames}} \exists\left( N_G^{(t)}\ \mathrm{OR}\ N_D^{(t)} \right)} \]

  Range: 0 to 1 (higher is better). N_frames is the number of frames in the sequence; the denominator counts only the frames that contain at least one ground-truth or detected object.
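For concreteness, a minimal Python sketch of FDA and SFDA. It assumes axis-aligned boxes given as (x1, y1, x2, y2) tuples and a precomputed per-frame mapping between ground-truth and detected boxes; the official USF-DATE tool computes an optimal mapping, which is taken as given here.

```python
# Minimal FDA/SFDA sketch for axis-aligned boxes (x1, y1, x2, y2).
# The GT-to-detection mapping per frame is assumed precomputed.

def overlap_ratio(g, d):
    """Intersection-over-union of two boxes; 0 if disjoint."""
    ix = max(0, min(g[2], d[2]) - max(g[0], d[0]))
    iy = max(0, min(g[3], d[3]) - max(g[1], d[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(g) + area(d) - inter
    return inter / union if union > 0 else 0.0

def fda(gt_boxes, det_boxes, mapping):
    """FDA for one frame. mapping: list of (gt_index, det_index) pairs."""
    if not gt_boxes and not det_boxes:
        return None                      # irrelevant frame: no objects at all
    ratio_sum = sum(overlap_ratio(gt_boxes[i], det_boxes[j]) for i, j in mapping)
    return ratio_sum / ((len(gt_boxes) + len(det_boxes)) / 2.0)

def sfda(frames):
    """frames: iterable of (gt_boxes, det_boxes, mapping) tuples, one per frame."""
    scores = [fda(g, d, m) for g, d, m in frames]
    relevant = [s for s in scores if s is not None]   # frames with GT or detections
    return sum(relevant) / len(relevant) if relevant else 0.0
```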
• 58. Tracking Metric
  – The Average Tracking Accuracy (ATA) is a spatio-temporal measure that penalizes fragmentation in both the temporal and spatial dimensions while accounting for the number of objects detected and tracked, missed objects, and false positives:

  \[ \mathrm{STDA} = \sum_{i=1}^{N_{mapped}} \frac{\sum_{t=1}^{N_{frames}} \frac{\left| G_i^{(t)} \cap D_i^{(t)} \right|}{\left| G_i^{(t)} \cup D_i^{(t)} \right|}}{N_{\left( G_i \cup D_i \neq \varnothing \right)}}, \qquad \mathrm{ATA} = \frac{\mathrm{STDA}}{\left( N_G + N_D \right)/2} \]

  Range: 0 to 1 (higher is better). N_G and N_D denote the number of unique ground-truth objects and the number of unique detected objects in the given sequence, respectively; uniqueness is defined by object IDs. N_{(G_i ∪ D_i ≠ ∅)} is the number of frames in which the i-th ground-truth track or its mapped detected track exists.
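Continuing the sketch above (and reusing its overlap_ratio function), a hedged implementation of STDA and ATA over whole tracks. Tracks are assumed to be dicts mapping frame index to box, and the track-level mapping between unique object IDs is taken as given.

```python
def stda_term(gt_track, det_track):
    """One mapped track pair; tracks are dicts {frame_index: box}.
    Sums IoU over shared frames, normalized by frames where either exists."""
    frames_either = set(gt_track) | set(det_track)
    frames_both = set(gt_track) & set(det_track)
    iou_sum = sum(overlap_ratio(gt_track[t], det_track[t]) for t in frames_both)
    return iou_sum / len(frames_either) if frames_either else 0.0

def ata(gt_tracks, det_tracks, track_mapping):
    """gt_tracks / det_tracks: {object_id: track}.
    track_mapping: list of (gt_id, det_id) pairs for mapped tracks."""
    stda = sum(stda_term(gt_tracks[g], det_tracks[d]) for g, d in track_mapping)
    denom = (len(gt_tracks) + len(det_tracks)) / 2.0   # unique GT and Det objects
    return stda / denom if denom > 0 else 0.0
```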
• 59. Example Detection Scoring
  – Green: detected box; Red: ground-truth box; Yellow: overlap in mapped boxes.
  – One detected box has a spatial alignment error (overlap ratio = 0.4505); one object is correctly detected with perfect overlap (ratio = 1.0); there are 3 false-alarm objects and 3 missed objects, so N_G = 5 and N_D = 5:

  \[ \mathrm{FDA} = \frac{0.4505 + 1.0}{(5 + 5)/2} = 0.2901 \]
• 60. Annotation Quality
  – Evaluation relies on manual labeling, and the degree of consistency becomes increasingly important as systems approach human levels of performance.
  – A high degree of consistency is difficult to achieve with somewhat subjective attributes like readability, and humans fatigue easily when performing such tedious tasks.
  – Therefore, 10% of the entire corpus was doubly annotated by multiple annotators and checked for quality using the evaluation measures.
• 61. Annotation Quality
  – For the doubly annotated corpus:

  Text detection   Average Sequence Frame Detection Accuracy (SFDA)   95%
  Text tracking    Average Tracking Accuracy (ATA)                    85%

  – The scores of the current state-of-the-art automatic algorithms are significantly lower than these numbers (by 22% relative for text detection and 61% relative for text tracking).
  • 62. Annotation Quality Flowchart of Annotation Quality Control Procedure. Steps denoted by dark shaded boxes were carried out by the annotators. Steps denoted by light shaded boxes were carried out by the evaluators.
• 63. Text Detection and Tracking – VACE
  [Bar chart: mean SFDA and ATA scores for English text detection and tracking on broadcast news for four systems (A–D), plotted on a 0–1 scale.]
• 64. Text Recognition Evaluation
  – Dataset: Broadcast News
  – Training/Dry-Run Development Set – 5 clips: 14.5 minutes, 1181 words
  – Evaluation Set – 25 clips: 62.5 minutes, 4178 word objects, 68,738 word frame instances
• 65. Text Recognition Evaluation
  – Evaluate only the most easily readable text (to establish a baseline at a high level of inter-annotator agreement):
    • Type = graphic (no scene text)
    • Readability = 2
    • Logo = false
    • Occlusion = false
    • Ambiguous = false
  – Exclude scrolling text (tickers) and dynamic text (scoreboards)
  – Case-insensitive; punctuation ignored
• 66. Sample Annotation Clip (word-level)
• 67. Recognition Evaluation Metrics
  – Spatially map system-output (detected) words to reference words, then compare the strings of mapped words:
    • An unmapped word in the system output incurs an Insertion (I) error
    • An unmapped word in the reference incurs a Deletion (D) error
    • A mapped word with a character mismatch incurs a Substitution (S) error

  \[ \mathrm{WER} = \frac{I + D + S}{\text{total number of words in the reference}} \]

  Example:
  REF: "The raven caws at midnight"
  SYS: "raven calls at at midnight"
  ("The" is deleted, "caws" to "calls" is a substitution, and the extra "at" is an insertion)
  WER = (1 + 1 + 1) / 5 = 3/5 (60%)

  – Errors are accumulated over the entire test set; a Character Error Rate (CER) is also generated.
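For concreteness, a minimal word-level WER sketch using the standard Levenshtein alignment; the evaluation's spatial word-mapping step is not reproduced here.

```python
# Word-level WER via dynamic programming (Levenshtein alignment).
def wer(ref, hyp):
    r, h = ref.lower().split(), hyp.lower().split()  # case-insensitive, per the evaluation
    # dp[i][j] = minimum edit cost aligning first i ref words with first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                                  # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                                  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,    # match / substitution
                           dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1)          # insertion
    return dp[len(r)][len(h)] / len(r)

print(wer("The raven caws at midnight", "raven calls at at midnight"))  # 0.6
```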
• 68. Individual Clip Word Error Rate
  [Plot: WER per clip (normalized by the number of words in each clip) for the 25 evaluation clips, on a 0–1 scale.]
• 69. Scores (Word Error Rate)
  [Bar chart: word error rates under different normalizations (per-word and per-frame). Legible values: WER = 0.4233, CER = 0.2823.]
  • 70. Outline • Introduction • Recent Progress • Performance Evaluation • Discussion
• 71. Discussion
  – Recent progress provides many promising solutions and research directions for the text extraction problem.
  – Due to the large variation of text objects in videos, no single approach can achieve satisfactory performance in all applications.
  – Much work remains to further improve the performance of text extraction techniques.
• 72. Discussion
  – Detection and Localization: how to efficiently combine several complementary extraction algorithms for better performance, and how to extract better features by analyzing the shapes of characters and the relationships between text and its background, both need further investigation.
  • 73. Discussion  Tracking – Although text tracking is an indispensable step for text extraction in videos, not many text tracking approaches have been reported in recent years. – More effort is needed to focus on tracking, not only for static and scrolling text, but also for dynamic text objects (growing, shrinking, and rotating text).
• 74. Discussion
  – Datasets: most algorithms are still tested on their authors' own datasets; to compare and evaluate algorithms fairly, a large, freely available annotated video dataset is urgently needed.
  • 75. THANK YOU! See you at ICPR 2008 in December

Editor's Notes

1. Fig. 2 shows two examples of neighborhood computation. In each figure, image (a) shows a binary image in which black dots denote centroids of connected components (CCs); image (b) shows the Delaunay triangulation of the centroids, where each triangle corresponds to a neighbor set. However, the neighborhoods of characters are not completely captured by the Delaunay triangulation. We consider two such cases and illustrate their solutions in Fig. 2c. From the Delaunay triangulation in Fig. 2b we obtain three neighbor sets, but the neighborhood of the three adjacent characters in the middle of the text string, i.e. BCD, is missed. To handle this problem, we additionally take every three consecutive nodes along the convex hull of the centroid set as a neighbor set; the solution is illustrated by dotted lines in Fig. 2c.
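As an illustration of this neighbor-set construction, a hedged Python sketch using SciPy; `centroids` is assumed to be an (N, 2) array of CC centroid coordinates (N >= 3, not all collinear).

```python
import numpy as np
from scipy.spatial import Delaunay, ConvexHull

def neighbor_sets(centroids):
    pts = np.asarray(centroids, dtype=float)
    # Each Delaunay triangle of centroids gives one neighbor set.
    sets = {frozenset(tri) for tri in Delaunay(pts).simplices}
    # Add every run of three consecutive vertices along the convex hull,
    # recovering neighborhoods (like B-C-D mid-string) that the
    # triangulation alone misses.
    hull = ConvexHull(pts).vertices          # counterclockwise vertex order
    n = len(hull)
    for k in range(n):
        sets.add(frozenset((hull[k], hull[(k + 1) % n], hull[(k + 2) % n])))
    return sets                              # sets of centroid indices
```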
2. A smart way to combine color or gray-level variation with spatial information is to use Gabor-based filters. For this purpose, and in the context of natural scene images, we chose log-Gabor filters, as explained in Figure 1, to obtain the final text cluster, which is then fed into an OCR algorithm.
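For reference, a minimal sketch of the radial component of a log-Gabor filter in the frequency domain; the center frequency `f0` and bandwidth ratio `sigma_ratio` are illustrative values, not those of the cited work.

```python
import numpy as np

def log_gabor(rows, cols, f0=0.1, sigma_ratio=0.55):
    """Radial log-Gabor transfer function on an FFT-frequency grid."""
    u = np.fft.fftfreq(cols)
    v = np.fft.fftfreq(rows)
    radius = np.sqrt(u[None, :] ** 2 + v[:, None] ** 2)
    radius[0, 0] = 1.0                       # avoid log(0) at the DC term
    g = np.exp(-(np.log(radius / f0) ** 2) / (2 * np.log(sigma_ratio) ** 2))
    g[0, 0] = 0.0                            # log-Gabor has no DC response
    return g

def filter_image(img):
    """Apply the filter in the Fourier domain; return the magnitude response."""
    g = log_gabor(*img.shape)
    return np.abs(np.fft.ifft2(np.fft.fft2(img) * g))
```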
3. Once the text edges are detected at one resolution level, they are immediately erased from the current edge map, and the modified edge map is then used as input to the next level, so that no text edge appears more than once across resolution levels.
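A rough sketch of this erase-and-descend idea, assuming OpenCV is available; `detect_text_edges` is a hypothetical stand-in for the paper's per-level text-edge detector, and the fine-to-coarse direction is an assumption.

```python
import cv2
import numpy as np

def multiresolution_text_edges(gray, levels=3):
    """Collect per-level text-edge masks, erasing found edges between levels."""
    results = []
    edges = cv2.Canny(gray, 100, 200)                # initial edge map
    for _ in range(levels):
        text_mask = detect_text_edges(edges)         # hypothetical per-level detector
        results.append(text_mask)
        edges[text_mask > 0] = 0                     # erase found text edges...
        edges = cv2.pyrDown(edges)                   # ...before the next level
        edges = (edges > 0).astype(np.uint8) * 255   # re-binarize after downsampling
    return results
```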
4. A fuzzy clustering ensemble (FCE), which merges the results of several clustering algorithms to improve the quality and robustness of the individual clusterings, is adopted to fuse multi-frame information. For a set of consecutive frames, features are extracted from each frame by applying a wavelet transform and clustered by fuzzy c-means (FCM). FCE is then employed to output an integrated frame with three clusters ("text", "background", and "complex background"), based on the individual clustering result of each frame. The "text" cluster is identified as the cluster with the smallest distance from the ideal text features.
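For reference, a minimal self-contained fuzzy c-means (FCM) implementation; the ensemble-fusion (FCE) and wavelet feature-extraction steps of the paper are not reproduced.

```python
import numpy as np

def fcm(X, c=3, m=2.0, iters=100, eps=1e-5, seed=0):
    """X: (n_samples, n_features). Returns (centers, membership U of shape (c, n))."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                # memberships sum to 1 per sample
    for _ in range(iters):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # distances from each center to each sample, shape (c, n)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2 / (m - 1)))            # standard FCM membership update
        U_new /= U_new.sum(axis=0)
        if np.abs(U_new - U).max() < eps:
            return centers, U_new
        U = U_new
    return centers, U
```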
5. The text-density time series is mirrored on both ends to ensure that its length is dyadic (i.e., a power of 2). It is then decomposed using Haar wavelets onto 5 scales (H1 through H5 in the figure) plus the residual (L5), with dyadic downsampling at each scale. At each scale, potential change points are detected where a detection threshold is exceeded (shown as a red envelope in the figure, set at 3.5 times the standard deviation in a local neighborhood, downsampled for each scale). Scales that exceeded the detection threshold at any given time were selected for reconstruction using the inverse wavelet transform. Videotext events were detected at points in time where the reconstructed signal exceeds an adjusted threshold.
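A hedged sketch of the decomposition-and-threshold idea using PyWavelets; the 5 scales and the 3.5-sigma factor follow the note, but a global standard deviation and end-padding stand in for the paper's local-neighborhood estimate and two-sided mirroring.

```python
import numpy as np
import pywt

def videotext_change_points(density, threshold_sigma=3.5):
    density = np.asarray(density, dtype=float)
    n = 1 << int(np.ceil(np.log2(len(density))))       # next power of two
    x = np.pad(density, (0, n - len(density)), mode="reflect")  # mirror to dyadic length
    coeffs = pywt.wavedec(x, "haar", level=5)          # [L5, H5, H4, H3, H2, H1]
    kept = [coeffs[0]]                                 # keep the residual as-is
    for detail in coeffs[1:]:
        thr = threshold_sigma * np.std(detail)         # global std stands in for the
        kept.append(np.where(np.abs(detail) > thr,     # paper's local-neighborhood std
                             detail, 0.0))
    recon = pywt.waverec(kept, "haar")[: len(density)]
    # report times where the reconstructed signal exceeds an adjusted threshold
    return np.flatnonzero(np.abs(recon) > threshold_sigma * np.std(recon))
```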
6. We extend the 2D DRF to a 3D DRF as follows. The neighboring structure N_i of each state s_i is extended from 2D to 3D, as in Figure 1(b). We call neighbors in the same frame intra-frame neighbors, N_i^intra, and neighbors in adjacent frames inter-frame neighbors, N_i^inter. Anisotropy between the inter- and intra-frame terms is a natural requirement, since dependencies along the temporal direction should differ from those in the spatial domain; hence we define I_intra(s_i, s_j, o) = β_intra s_i s_j and I_inter(s_i, s_j, o) = β_inter s_i s_j. The 3D DRF in essence collects more context than the 2D DRF, and therefore has a better chance of correctly estimating the hidden states.
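To make the anisotropic neighborhood concrete, a small sketch enumerating intra- and inter-frame neighbors for a site (t, y, x) in a T×H×W frame volume; the beta values are illustrative placeholders, not the paper's learned parameters.

```python
def neighbors_3d(t, y, x, T, H, W):
    """4-connected intra-frame neighbors plus same-position inter-frame neighbors."""
    intra = [(t, y + dy, x + dx)
             for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
             if 0 <= y + dy < H and 0 <= x + dx < W]
    inter = [(t + dt, y, x) for dt in (-1, 1) if 0 <= t + dt < T]
    return intra, inter

def interaction(s_i, s_j, same_frame, beta_intra=1.0, beta_inter=0.5):
    # separate beta weights implement the anisotropy between spatial and
    # temporal dependencies; values here are placeholders
    return (beta_intra if same_frame else beta_inter) * s_i * s_j
```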
7. 1) Discriminative point detection and clustering: detect discriminative feature points in every video frame using the algorithm proposed in [28] and partition them into clusters. 2) Road-sign localization: select candidate road-sign regions corresponding to clusters of feature points using a vertical-plane criterion. 3) Text detection: detect text in the candidate road-sign areas and track it. 4) Text extraction and recognition: extract the text in the candidate sign plane for recognition once it reaches a satisfactory size.
8. We refer to the final disjoint key frames (including the mosaicked frames) as summary frames. We obtain the bounding boxes of the binary content in the summary frames and stitch them together, producing a summary image of the instructional video content. We compare the performance of our summarization algorithm with three well-known key-frame selection techniques, namely fixed-rate video sampling, the tolerance band [9], and unsupervised clustering [10]. The figures clearly show that our method outperforms these conventional key-frame selection methods in summarizing the visual content of instructional videos. Our method performs better in three aspects. First, the conventional methods are based on image dissimilarity measures, so occlusions, lighting-condition changes, and camera movements negatively affect the resulting key frames.
9. Should mention that N_G and N_D are the numbers of unique ground-truth and system-output objects in the video sequence, uniqueness being denoted by their IDs.
10. Each clip is treated independently.