  1. Exploring Inter-Frame Correlation Analysis and Wavelet-Domain Modeling for Real-Time Caption Detection in Streaming Video. Jia Li 1, Yonghong Tian 2, Wen Gao 1,2. 1 Key Laboratory of Intelligent Information Processing, ICT, CAS; 2 Institute of Digital Media, School of EE & CS, PKU
  2. Outline <ul><li>Background </li></ul><ul><li>Problem statement </li></ul><ul><li>System architecture </li></ul><ul><li>Experiments </li></ul><ul><li>Conclusion </li></ul>
  3. Background: Caption Detection. Three kinds of text appear in video: caption text, scene text, and scrolling text.
  4. Background: Frequently used methods <ul><li>Sobel Edges </li></ul><ul><ul><li>Chunmei Liu, et al. “Text detection in images based on unsupervised classification of edge-based features”. ICDAR, 2005. </li></ul></ul><ul><ul><li>Lyu, M.R., et al. “A comprehensive method for multilingual video text detection, localization, and extraction”. CSVT, 2005. </li></ul></ul><ul><li>Wavelet Domain </li></ul><ul><ul><li>Huiping Li, et al. “Automatic text detection and tracking in digital video”. IEEE Transactions on Image Processing, 2000. </li></ul></ul><ul><ul><li>Qixiang Ye, et al. “Fast and robust text detection in images and video frames”. Image and Vision Computing, 2005. </li></ul></ul><ul><li>Directly On Image </li></ul><ul><ul><li>Kwang In Kim, et al. “Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm”. PAMI, 2003. </li></ul></ul>
  5. The Problems <ul><li>Why streaming video is different: </li></ul><ul><ul><li>Faster: frames arrive in real time, so detection must keep up. </li></ul></ul><ul><ul><li>More detailed: few clues are available, so texts must be organized by their types. </li></ul></ul><ul><ul><li>More accurate: simple features are needed to remove text-like textures. </li></ul></ul> Compared with text detection in images and offline video, text detection in streaming video demands higher detection speed, the ability to discern text types, and the removal of text-like textures.
  6. Our Solution. [Diagram] Frame sequence → inter-frame correlations (fast and robust quantification) → static edges (caption texts, stable background) vs. moving edges (scene text, moving textures, scrolling text).
  7. System Architecture <ul><li>Data: the previous and next frames assist text detection in the current frame. </li></ul><ul><li>Temporal analysis: remove unstable edges. </li></ul><ul><li>Spatial analysis: remove weak edges. </li></ul>
  8. Temporal Analysis <ul><li>The goal: remove unstable edges. </li></ul><ul><li>In the LH and HL subbands of the wavelet domain: </li></ul><ul><ul><li>Edge stability in subband WS (WS ∈ {LH, HL}) is evaluated with Inter-Subband Correlation Coefficients (ISCC). </li></ul></ul><ul><ul><li>ISCC is computed from the local covariance and the local variance of the subband coefficients. </li></ul></ul>
  9. Temporal Analysis (continued) <ul><li>Based on ISCC, the Inter-Frame Correlation Coefficients (IFCC) and the Temporal Stability (TS) of edges are then defined (the defining equations appear as figures on the slide). </li></ul>
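The deck's equation images for ISCC and IFCC are not preserved, so the exact definitions are unavailable here. The sketch below is one plausible reading under stated assumptions: ISCC as a local normalized correlation (local covariance over the geometric mean of local variances) between the same wavelet subband in two adjacent frames, and IFCC as the average of the LH and HL ISCCs. All function names (`iscc`, `ifcc`, `_local_stats`) and the window size are illustrative, not from the paper.

```python
import numpy as np

def _local_stats(a, b, win=8):
    """Block-wise local covariance and variances over win x win windows."""
    h = a.shape[0] - a.shape[0] % win
    w = a.shape[1] - a.shape[1] % win
    A = a[:h, :w].reshape(h // win, win, w // win, win)
    B = b[:h, :w].reshape(h // win, win, w // win, win)
    ma = A.mean(axis=(1, 3), keepdims=True)
    mb = B.mean(axis=(1, 3), keepdims=True)
    cov = ((A - ma) * (B - mb)).mean(axis=(1, 3))
    va = ((A - ma) ** 2).mean(axis=(1, 3))
    vb = ((B - mb) ** 2).mean(axis=(1, 3))
    return cov, va, vb

def iscc(sub_prev, sub_cur, win=8, eps=1e-8):
    """Local normalized correlation between the same wavelet subband
    (LH or HL) in two adjacent frames; near 1 where edges are stable."""
    cov, va, vb = _local_stats(sub_prev, sub_cur, win)
    return cov / (np.sqrt(va * vb) + eps)

def ifcc(lh_prev, lh_cur, hl_prev, hl_cur, win=8):
    """One plausible combination of the two subband ISCCs into an
    inter-frame correlation coefficient per block."""
    return 0.5 * (iscc(lh_prev, lh_cur, win) + iscc(hl_prev, hl_cur, win))
```

Temporal stability would then mark a block's edges as stable when its IFCC stays high across consecutive frame pairs (the thresholding rule is likewise not preserved in the deck).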
  10. Temporal Analysis (continued). [Diagram] The temporal stability of edges is derived from the inter-subband correlation and the inter-frame correlation.
  11. ISCC: Robust and Sensitive <ul><li>ISCC is robust to background changes. </li></ul><ul><li>ISCC is sensitive to slight motions. </li></ul> (Figure panels (a)–(d) illustrate both properties.)
  12. Spatial Analysis <ul><li>The goal: remove static backgrounds. </li></ul><ul><ul><li>Texts are collections of strong edges. </li></ul></ul><ul><ul><li>Adaptive global thresholds on the LH/HL wavelet subbands remove the backgrounds. </li></ul></ul><ul><li>The wavelet coefficients in LH/HL are modeled with two Generalized Gaussian Distributions (GGD), parameterized by mean, variance, and shape parameter. </li></ul>
  13. Shape Parameter and Threshold Selection <ul><li>Usually, the coefficients are zero-mean. </li></ul><ul><li>The adaptive global threshold is selected from the variance and the shape parameter r: wavelet subbands with more edges (larger r) receive a larger threshold, and subbands with fewer edges (smaller r) a smaller one. </li></ul> Sharifi, K., and Leon-Garcia, A. “Estimation of shape parameter for generalized Gaussian distributions in subband decompositions of video”. CSVT, 5(1), 1995, pp. 52–56.
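The cited Sharifi and Leon-Garcia estimator matches the ratio E|x| / sqrt(E[x^2]) of the zero-mean coefficients against its closed form for a GGD, which is an increasing function of the shape parameter. A minimal sketch, solving that match by bisection; the `adaptive_threshold` rule at the end is a hypothetical stand-in, since the deck does not give the exact mapping from variance and shape to threshold:

```python
import math
import numpy as np

def _m(beta):
    """Ratio E|x| / sqrt(E[x^2]) for a zero-mean GGD with shape beta
    (increasing in beta; 1/sqrt(2) for Laplacian, sqrt(2/pi) for Gaussian)."""
    return math.gamma(2.0 / beta) / math.sqrt(
        math.gamma(1.0 / beta) * math.gamma(3.0 / beta))

def estimate_shape(coeffs, lo=0.2, hi=6.0, iters=60):
    """Moment-matching estimate of the GGD shape parameter r:
    solve _m(r) = E|x| / sqrt(E[x^2]) by bisection."""
    x = np.asarray(coeffs, dtype=float).ravel()
    ratio = np.mean(np.abs(x)) / math.sqrt(np.mean(x ** 2))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if _m(mid) < ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def adaptive_threshold(coeffs, k=2.0):
    """Hypothetical threshold rule (not the paper's): scale the subband
    standard deviation by a factor growing with r, so subbands with
    more edges receive a larger threshold."""
    return k * float(np.std(coeffs)) * estimate_shape(coeffs) / 2.0
```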
  14. Spatial Analysis (continued). [Diagram] Pipeline: remove moving regions → remove weak edges → morphological operations → inter-frame tracking → SVM-based classification.
  15. Experiments. Algorithms for comparison:
      - Algorithm 1: Lyu, M.R., et al. “A comprehensive method for multilingual video text detection, localization, and extraction”. CSVT, 2005. (1. Sobel edges; 2. local thresholding; 3. iterative projection.)
      - Algorithm 2: Qixiang Ye, et al. “Fast and robust text detection in images and video frames”. IVC, 2005. (1. Wavelet domain; 2. select 15% of edges; 3. sophisticated SVM.)
  16. The Data Set
      - Test environment: Pentium IV 3.2 GHz CPU, 512 MB RAM.
      - Data set: 16 video clips, 6 h 49 min in total, sampled at 2 frames/second: 49,177 frames containing 89,639 captions.
      - Four test sets (I–IV) at different resolutions: the original video size, down-sampled CIF, and down-sampled QCIF; one set is the data used in Algorithm 2's evaluation.
  17. Experiment 1: Robustness and Sensitivity of IFCC <ul><li>Notes </li></ul><ul><ul><li>The IFCC histogram between caption regions with the same text in adjacent frames is used to evaluate robustness. </li></ul></ul><ul><ul><li>The IFCC histogram between caption regions with different texts in adjacent frames is used to evaluate sensitivity. </li></ul></ul><ul><ul><li>Four resolutions: 720×576, 400×328, 352×288, 176×144. </li></ul></ul><ul><ul><li>(a) IFCC histogram for robustness; (b) IFCC histogram for sensitivity. </li></ul></ul>
  18. Experiment 2: Detection Speed. Table 1. Detection speed and speed nonstationarity:

          Algorithm          Speed (frames/s)   Nonstationarity (%)
          Our Algorithm      9.09               5.13
          Algorithm 1        4.46               11.54
          Algorithm 2        1.18               12.69

      Notes: 7 frame sequences with no scene/scrolling text were selected from test set II. Nonstationarity is defined by a formula given on the slide; a smaller nonstationarity means the algorithm spends similar time on simple and complex frames.
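The slide's nonstationarity formula is an image that did not survive extraction. One plausible reading consistent with the stated interpretation (smaller value = similar cost on simple and complex frames) is the coefficient of variation of per-frame processing time, sketched here as an assumption rather than the paper's definition:

```python
import statistics

def nonstationarity(frame_times):
    """Hypothetical metric: coefficient of variation (std / mean) of
    per-frame processing time, in percent. Zero means every frame took
    exactly the same time; larger values mean more uneven cost."""
    mean = statistics.fmean(frame_times)
    return 100.0 * statistics.pstdev(frame_times) / mean
```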
  19. Why we are faster: 1. We work only on the LH and HL subbands, so there are fewer pixels to process. 2. ISCC is calculated with 2-D separable filters. 3. The adaptive global threshold removes the background faster than local thresholding. 4. Simple but robust features are used in the SVM classification. (Followed by our demo.)
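Point 2 relies on a standard trick: a 2-D box filter factors into two 1-D passes, cutting the per-pixel cost from O(win²) to O(2·win), which makes the local sums behind ISCC's covariances and variances cheap. A minimal sketch (the function name and window size are illustrative):

```python
import numpy as np

def box_sum_separable(img, win=3):
    """Local win x win sum computed as two 1-D convolutions: first
    along each row, then along each column of the row result. This is
    equivalent to convolving with a win x win matrix of ones, at a
    fraction of the cost."""
    k = np.ones(win)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)
```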
  20. Experiment 3: Scene/Scrolling Text Removal: Examples of Success <ul><ul><li>Notes: </li></ul></ul><ul><ul><li>9 videos with scene/scrolling texts are used from all 4 test sets. </li></ul></ul><ul><ul><li>Captions are well distinguished from scene/scrolling texts: </li></ul></ul><ul><ul><ul><li>(a) Small scrolling text at the bottom with a simple background. </li></ul></ul></ul><ul><ul><ul><li>(b) Small scrolling text at the bottom with a transparent background. </li></ul></ul></ul><ul><ul><ul><li>(c) Large scrolling text above the caption with a transparent background. </li></ul></ul></ul><ul><ul><ul><li>(d) Scene text. </li></ul></ul></ul>
  21. Experiment 3: Scene/Scrolling Text Removal: Examples of Failures <ul><ul><li>Explanations: </li></ul></ul><ul><ul><li>It is hard to distinguish captions from static scene text lines. </li></ul></ul><ul><ul><li>Better features are required for distinguishing text-like textures. </li></ul></ul> (Figure panels (a)–(d) show failure cases.)
  22. Experiment 4: Recall and False Alarm Rate. Table 2. Performance comparison in experiments 1 and 2 (detected area S_D, ground-truth area S_G, intersection area S_O):

          Algorithm       Exp. 1 Recall (%)   Exp. 1 False Alarm (%)   Exp. 2 Recall (%)   Exp. 2 False Alarm (%)   Temporal Coverage (%)
          Our Algorithm   90.66               28.98                    90.66               28.98                    100
          Algorithm 1     82.11               38.17                    82.11               38.17                    100
          Algorithm 2     88.68               37.49                    55.99               36.93                    65.02
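The slide names the three areas S_D, S_G, and S_O but the formula images are lost. The standard area-based definitions built from those quantities, offered here as an assumption about what the deck computes:

```python
def recall(s_d, s_g, s_o):
    """Area-based recall: fraction of the ground-truth caption area S_G
    covered by the detection (intersection area S_O)."""
    return s_o / s_g

def false_alarm_rate(s_d, s_g, s_o):
    """Area-based false alarm rate: fraction of the detected area S_D
    that lies outside the ground truth."""
    return (s_d - s_o) / s_d
```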
  23. Why we perform better: 1. Moving text-like textures are removed via IFCC, giving higher precision. 2. The adaptive global threshold makes the text edges “suppress” other edges. 3. Selecting a fixed percentage of pixels, as in Algorithm 2, leads to additional false alarms in frames with no text. 4. Only a small number of parameters need adjusting.
  24. Conclusion and Future Work: 1. Inter-frame correlation is useful for high-speed caption detection. 2. Better features are needed for distinguishing captions from textures. 3. A distributed architecture could test various parameters at several terminals and merge the results. 4. Parameter settings could be learned incrementally (automatically or semi-automatically).
  25. <ul><li>Thanks </li></ul><ul><li>Q&A </li></ul>