Video identification using spatio temporal salient points

Transcript

  • 1. 2009 Fifth International Conference on Information Assurance and Security

Video Identification Using Spatio-temporal Salient Points

Yue-nan Li (1), Zhe-ming Lu (1, 2)
(1) Shenzhen Graduate School, Harbin Institute of Technology, Shen Zhen, P.R. China
(2) School of Aeronautics and Astronautics, Zhejiang University, Hang Zhou, P.R. China
e-mail: yuenanlee@yahoo.com.cn, zheminglu@zju.edu.cn

978-0-7695-3744-3/09 $25.00 © 2009 IEEE. DOI 10.1109/IAS.2009.291. Authorized licensed use limited to: Bharati Vidyapeeth College of Engineering. Downloaded on April 05, 2010 at 02:37:46 EDT from IEEE Xplore. Restrictions apply.

Abstract: Automatic identification of video is a key technique for various applications such as content-based retrieval, copyright protection and broadcast monitoring. In this paper, we present a novel video identification scheme based on spatio-temporal salient points. To achieve higher robustness to content-preserving manipulations, especially geometric distortions, salient points are detected from both spatial and temporal aspects. The salient points are found to be invariant under a series of distortions. In addition, we have compared our approach with a state-of-the-art one on diverse sequences, and superior detection accuracy has been observed for the proposed work.

Keywords: video identification; copy detection; spatio-temporal domain; salient points; Harris detector

I. INTRODUCTION

The prevalence of network infrastructure and media-capture devices has greatly increased the volume of multimedia information. Because it is content-rich, intuitive and easy to capture, video has become the most popular form of multimedia information. Video is involved in most multimedia applications, such as distance learning, network broadcasting, video-on-demand, and wireless TV. Consequently, the massive volume of video creates a demand for identification techniques. Video identification aims at finding the sequences derived from the same source. Hence, one typical application of video identification is content-based retrieval. For example, a short video clip may be released on the Internet as a content abstract. With the aid of video identification, a person attracted by its content can find the whole sequence using the short clip as the query reference. As a complementary technique to digital watermarking, video identification also plays an important role in copyright protection. It is reported that YouTube has already launched a video identification tool to help content owners detect the unlicensed sharing of their works.

Video identification has drawn much attention in the past decade. Most of the early algorithms use global features such as intensity or color histograms [1]. Regarding the dynamic nature of video, motion directions [2] and trajectories [3] are also exploited to facilitate video identification. Moreover, some researchers propose to use combined features for video identification. For example, Hoad et al. use shot length, color shift and centroid motion to construct a combined signature in [4]. In recent years, robust hash algorithms [5, 6] have also been applied to video identification. Different from the aforementioned algorithms, robust hashing has to consider the security aspect of feature extraction in order to resist content forgery attacks. The performance of state-of-the-art video identification algorithms is evaluated and compared in [2] and [7]. As reported in most of the literature, geometric distortion (e.g., rotation, scaling and translation) is still one of the most challenging problems in video identification. In order to enhance the robustness of video identification against geometric distortions, a novel spatio-temporal salient point based identification algorithm is proposed. In the proposed work, the most stable points in the spatial and temporal domains are detected for identification. Experimental results demonstrate that the proposed algorithm shows considerable robustness against geometric distortions, as well as excellent discriminative ability.

The rest of this paper is organized as follows. The proposed spatio-temporal salient point based identification algorithm is described in Section 2. Experimental results and comparisons are provided in Section 3. Finally, this paper is concluded in Section 4.

II. SPATIO-TEMPORAL SALIENT POINT BASED VIDEO IDENTIFICATION

Findings in computer vision have demonstrated the high stability of salient points under various distortions. Under this consideration, we propose a salient point based video identification scheme, and we believe that the temporal variation of video can provide a useful clue for salient point detection. The main components of the proposed algorithm are described in the following subsections.

A. Pre-processing of Input Sequence

Different sequences usually have different format attributes such as spatial size, frame rate and shot length. To make different sequences comparable, video identification should be format-independent. In the proposed algorithm, the input sequence is first normalized into a standard form with the same size and shot length.

The frames within a shot often show great similarities as they are continuously captured by the same camera. The temporal span of a salient point is no more than one shot. Hence, the salient points in video should be detected independently within each shot. As a result, we first segment the input sequence into shots via shot boundary detection [8]. Pixel tubes are formed by collecting each pixel in the first frame of a shot along the temporal direction. One
  • 2. dimensional Gaussian filter is applied to each pixel tube to remove temporal noise. Subsequently, we sub-sample these temporally smoothed frames in each shot to three frames. Similarly, 2D spatial Gaussian smoothing and sub-sampling are then performed on the three representative frames in each shot. In this manner, the input sequence is converted into the standard form with a dimension of 128×128 pixels and 3 frames in each shot. The advantages of pre-processing are twofold. Firstly, spatial and temporal noise points can be eliminated, which enhances the stability of the salient point detector. Secondly, the computational burden of salient point detection is reduced, as spatial and temporal redundancies are reduced in sub-sampling. The normalization process is illustrated in Fig. 1.

Figure 1. Normalization of the input sequence (diagram labels: Shot 1, Shot 2, input frames, pixel tube, representative frames, normalized frames)

B. Spatio-temporal Salient Point Detection

Most salient point detection schemes are designed for still images. Although these algorithms can be extended to video by individually detecting salient points in each frame, only limited stability can be achieved in this way. The reason is that they only focus on the spatial variation of a given point without considering its temporal properties. In order to cope with the temporal dynamic nature of video, we propose to find salient points by incorporating both spatial and temporal detections. Spatial salient points are first detected in each frame individually to select candidate points for temporal evaluation. Subsequently, the temporal stability of each candidate point is examined. The Harris point detection algorithm [9] is adopted in the proposed work to select spatial candidate points. In this section, we provide a mathematical interpretation of the Harris detector. Consider a pixel $I(x, y)$. Its illumination variance $E(x, y)$ is calculated by weighting the squared differences with $I(x+\Delta x, y+\Delta y)$ in a window $W$ centered on $(x, y)$:

$$E(x, y) = \sum_{\Delta x, \Delta y \in W} G(\Delta x, \Delta y)\,\big[I(x+\Delta x, y+\Delta y) - I(x, y)\big]^2. \qquad (1)$$

By approximating $I(x+\Delta x, y+\Delta y)$ with its Taylor expansion, we can obtain (2). For clarity of expression, the Gaussian weighting factor $G(\Delta x, \Delta y)$ is omitted in the following equations.

$$E(x, y) = \sum_{\Delta x, \Delta y \in W} \left[\frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + O(\Delta x^2, \Delta y^2)\right]^2 \approx \sum_{\Delta x, \Delta y \in W} \left[\left(\frac{\partial I}{\partial x}\right)^2 \Delta x^2 + 2\,\frac{\partial I}{\partial x}\frac{\partial I}{\partial y}\,\Delta x\,\Delta y + \left(\frac{\partial I}{\partial y}\right)^2 \Delta y^2\right] \qquad (2)$$

Let $s$ denote the shift vector $(\Delta x, \Delta y)$; then $E(x, y)$ can be written in the form of matrix and vector multiplications

$$E(x, y) = s M s^T, \qquad (3)$$

where

$$M = \sum_{W} \begin{bmatrix} \left(\dfrac{\partial I}{\partial x}\right)^2 & \dfrac{\partial I}{\partial x}\dfrac{\partial I}{\partial y} \\[2mm] \dfrac{\partial I}{\partial x}\dfrac{\partial I}{\partial y} & \left(\dfrac{\partial I}{\partial y}\right)^2 \end{bmatrix}. \qquad (4)$$

The real and symmetric matrix $M$ is unitarily equivalent to a diagonal matrix $\Lambda = \mathrm{diag}(a_1, a_2)$ as shown in (5), where $a_1$ and $a_2$ are its two eigenvalues.

$$M = U^{*} \Lambda U \qquad (5)$$

By substituting (5) into (3), a more concise representation of $E(x, y)$ can be obtained:

$$E(x, y) = Q \Lambda Q^T,$$

where $Q = sU^{*} = (\langle s, u_1\rangle, \langle s, u_2\rangle)$. Here, $\langle\cdot,\cdot\rangle$ denotes the inner product, and $u_1$ and $u_2$ are the two column vectors of $U^{*}$. Consequently,

$$E(x, y) = a_1 \langle s, u_1\rangle^2 + a_2 \langle s, u_2\rangle^2.$$

The property of $I(x, y)$ can now be derived from $a_1$ and $a_2$. Two small eigenvalues result in a tiny illumination variation $E$, which indicates a flat region. If one is high and the other is low, an edge can be declared. For example, if $a_1$ is high and $a_2$ is low, then $E \approx a_1 \langle s, u_1\rangle^2$. Thus, there is no illumination change when the shift vector $s$ is orthogonal to $u_1$ (i.e., $\langle s, u_1\rangle = 0$), while for other directions the illumination variation $E$ is high. Hence the edge is perpendicular to $u_1$. The case of two high eigenvalues corresponds to a corner point: the illumination variation is significant for all shift directions. In other words, $I(x, y)$ shows remarkable contrast with its neighbors, which is the property of a corner. Based on the above discussion, the following equation is used to measure the salient response of $I(x, y)$, where $\varepsilon$ is a small constant that prevents the denominator from being zero.

$$S_a(x, y) = \frac{\det(M)}{\mathrm{trace}(M) + \varepsilon}$$

In the proposed algorithm, the points are sorted according to salient response, and the first $M$ points are selected as candidate points in each frame (here $M$ denotes the number of candidates, not the matrix in (4)). The trajectories of candidate points are then traced to examine their temporal stableness,
  • 3. according to which their salient responses are updated. For each candidate, the 8×8 block centered on it is searched for in the previous and following frames using motion estimation, as illustrated in Fig. 2. If candidate points can still be observed in the forward or backward (or both) matched blocks, the current candidate point is stable during temporal variations. Accordingly, its response is increased by multiplying it with an update factor $m^t$,

$$S_a(x, y) \leftarrow S_a(x, y) \cdot m^{t}, \qquad (m > 1,\ t = 0, 1, 2)$$

where $t$ is the number of times that the matched points of $(x, y)$ in the two neighboring frames are detected as candidates. Finally, all the candidate points are sorted again in each frame according to the updated salient responses, and we choose the $N$ points with the largest responses to compose the signature of this frame.

Figure 2. Illustration of tracing the trajectory of candidate point (diagram labels: axes x, y and time; frames f-1, f, f+1; candidate point in frame f; matched points in frames f-1 and f+1)

C. Signature comparison and identification

The similarity of two sequences is measured by the distance between their signatures (i.e., the coordinates of salient points). However, the synchronization between salient points can easily be broken by geometric distortions, as some points may be replaced by newly emerging ones. As a result, if the distance between salient points is measured by correspondence-based metrics like binary correlation, the signature distance between the original sequence and its distorted version can be quite high. The Hausdorff distance was developed as an alternative metric to measure the similarity between two unorganized point sets, and it is superior to other metrics in robustness to content-preserving manipulations. In this paper, the modified Hausdorff distance (MHD) [10], which is insensitive to outliers, is adopted for signature comparison. The distance between two salient point sets $S_1$ and $S_2$ is calculated as follows,

$$D(S_1, S_2) = \max\big[h_{MHD}(S_1, S_2),\ h_{MHD}(S_2, S_1)\big],$$

where

$$h_{MHD}(S_1, S_2) = \frac{1}{N} \sum_{p_1 \in S_1} \min_{p_2 \in S_2} \lVert p_1 - p_2 \rVert,$$

$$h_{MHD}(S_2, S_1) = \frac{1}{N} \sum_{p_2 \in S_2} \min_{p_1 \in S_1} \lVert p_1 - p_2 \rVert.$$

Let $V_1$ and $V_2$ denote two sequences under comparison. The identification process can be described as a hypothesis test with
H0: $V_1$ and $V_2$ are perceptually identical;
H1: $V_1$ and $V_2$ are perceptually distinct.
The decision can be made based on the relationship between the two conditional probabilities, $P(D|H_0)$ and $P(D|H_1)$. Hypothesis H0 is accepted if $P(D|H_0) > P(D|H_1)$; otherwise H1 is accepted.

III. EXPERIMENTAL RESULTS

In this section, the performance of the proposed algorithm is investigated. We have collected 100 distinct video sequences to form the test set. The test set consists of documentary, movie, cartoon, sports match and standard test sequences. For robustness evaluation, a series of content-preserving distortions are applied to the test sequences, including MPEG2 compression (5 Mbps), 4% pepper & salt noise addition, 3×3 median filtering, blur, contrast enhancement, 2° rotation, 8-pixel horizontal translation, 1/2 scaling, and 50% frame dropping. The parameters of the proposed algorithm are set as follows: the variances of the temporal and spatial Gaussian filters in pre-processing are $\sigma_s^2 = \sigma_t^2 = 5$, the number of candidate points in each frame is $M = 40$, the number of salient points is $N = 20$, and the update factor for the salient response is $m = 1.5$. Signatures of the original and distorted sequences are extracted and compared. The histograms of signature distances between content-identical and content-distinct sequences (also referred to as intra and inter distances) are calculated and shown in Fig. 3, where the curves represent the fitted distributions. Both the intra and inter distance distributions are modeled as log-normal distributions,

$$P(D \mid H_0) \sim \text{Log-N}(0.71,\ 0.83^2)$$

$$P(D \mid H_1) \sim \text{Log-N}(2.76,\ 0.21^2).$$

The overlapping region of the two distributions, which corresponds to the sum of the false reject and false acceptance rates, is small. Thus, it can be concluded that the salient point based signature makes a satisfactory tradeoff between robustness and distinctness. The identification accuracy of the proposed algorithm is evaluated by precision and recall rates. The motion direction (MD) based identification algorithm [2] is also implemented for performance comparison. In the MD algorithm, the motion direction histogram contains 5 bins, and the similarity between histograms is measured by Euclidean distance. The precision-recall graphs of the proposed and MD algorithms are plotted in Fig. 4, from which we can see that the proposed algorithm achieves better identification accuracy than the MD algorithm. The salient points detected in the hall monitor sequence as well as in its rotated, scaled and translated versions are shown in Fig. 5. It can be seen from these normalized frames that most of the detected spatio-temporal salient points remain invariant under geometric distortions.
  • 4. IV. CONCLUSION

A novel video identification algorithm is presented in this paper. Considering the temporal dynamics of video, temporal and spatial stability evaluations are combined to detect salient points. The detected points are observed to be quite stable under content-preserving distortions. Moreover, the modified Hausdorff distance based signature comparison further improves the robustness to geometric distortions. Despite the high robustness, the proposed algorithm can also effectively distinguish content-distinct sequences. Thus, satisfactory identification accuracy can be achieved by the proposed algorithm.

Figure 3. Distributions of intra and inter signature distances measured in the proposed algorithm

Figure 4. P-R graphs of the proposed and MD algorithms

Figure 5. Salient points detected in normalized hall-monitor sequences. (a) Original, (b) Rotated, (c) Translated, (d) Scaled

REFERENCES

[1] Y. P. Tan, S. R. Kulkarni, and P. J. Ramadge, "A Framework for Measuring Video Similarity and Its Application to Video Query by Example", Proc. Image Process., 1999, pp. 106-110.
[2] A. Hampapur and R. Bolle, "Comparison of Sequence Matching Techniques for Video Copy Detection", Int. Conf. on Storage and Retrieval for Media Databases, 2002, pp. 194-201.
[3] Z. Li, A. K. Katsaggelos, and B. Gandhi, "Fast Video Shot Retrieval Based on Trace Geometry Matching", IEE Proc. Vis. Image Signal Process., 2005, vol. 152, no. 3, pp. 367-372.
[4] T. C. Hoad and J. Zobel, "Detection of Video Sequences Using Compact Signatures", ACM Trans. on Inf. Syst., 2006, vol. 24, no. 1, pp. 1-50.
[5] B. Coskun, B. Sankur, and N. Memon, "Spatio-temporal Transform-based Video Hashing", IEEE Trans. on Multimedia, 2006, vol. 8, no. 6, pp. 1190-1208.
[6] C. D. Roover, C. D. Vleeschouwer, F. Lefèbvre, and B. Macq, "Robust Video Hashing Based on Radial Projections of Key Frames", IEEE Trans. Signal Process., 2005, vol. 53, no. 10, pp. 4020-4037.
[7] J. Law-To, L. Chen, A. Joly et al., "Video Copy Detection: A Comparative Study", Proc. of the 6th ACM Int. Conf. on Image and Video Retrieval, 2007, pp. 371-378.
[8] J. Bescos, G. Cisneros, J. M. Martinez, J. M. Menendez, and J. Cabrera, "A Unified Model for Techniques on Video Shot Transition Detection", IEEE Trans. Multimedia, 2005, vol. 7, no. 2, pp. 293-307.
[9] C. Harris and M. Stephens, "A Combined Corner and Edge Detector", Proc. of 4th Alvey Vision Conf., 1988, pp. 153-158.
[10] M. P. Dubuisson and A. K. Jain, "A Modified Hausdorff Distance for Object Matching", Pattern Recognition, 1994, vol. 1, no. 9, pp. 566-568.
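As a supplementary illustration of Section II.B, the salient response det(M) / (trace(M) + ε) and the temporal update by the factor m^t can be sketched as below. This is a rough sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the image is a plain list of rows of floats, gradients are taken by central differences, and a uniform window replaces the Gaussian weighting (which the derivation in the text also omits). The pixel must lie at least `half + 1` pixels away from the image border.

```python
def salient_response(img, x, y, half=1, eps=1e-6):
    """Return det(M) / (trace(M) + eps) at pixel (x, y).

    M = [[a, b], [b, c]] accumulates products of image gradients over a
    (2*half+1) x (2*half+1) window; uniform weighting is an assumption.
    """
    a = b = c = 0.0
    for v in range(y - half, y + half + 1):
        for u in range(x - half, x + half + 1):
            ix = (img[v][u + 1] - img[v][u - 1]) / 2.0  # dI/dx
            iy = (img[v + 1][u] - img[v - 1][u]) / 2.0  # dI/dy
            a += ix * ix
            b += ix * iy
            c += iy * iy
    return (a * c - b * b) / (a + c + eps)

def updated_response(sa, t, m=1.5):
    """Temporal update Sa <- Sa * m**t, with t in {0, 1, 2} and m > 1."""
    return sa * m ** t

def top_points(responses, k):
    """Coordinates of the k largest responses; responses maps (x, y) -> value."""
    return sorted(responses, key=responses.get, reverse=True)[:k]

# Synthetic 9x9 test patterns: a flat patch, a vertical edge, and a corner.
flat = [[0.0] * 9 for _ in range(9)]
edge = [[1.0 if u >= 4 else 0.0 for u in range(9)] for v in range(9)]
corner = [[1.0 if (u >= 4 and v >= 4) else 0.0 for u in range(9)] for v in range(9)]
```

On these patterns the response vanishes for the flat patch and for the edge (one eigenvalue of M is zero, so det(M) is zero) and is positive only at the corner, which matches the eigenvalue discussion in the text.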
