dimensional Gaussian filter is applied to each pixel tube to remove temporal noise. The temporally smoothed frames in each shot are then sub-sampled to three frames. Similarly, 2D spatial Gaussian smoothing and sub-sampling are performed on the three representative frames of each shot. In this manner, the input sequence is converted into a standard form of 128×128 pixels and 3 frames per shot. The advantages of pre-processing are twofold. Firstly, spatial and temporal noise is suppressed, which improves the stability of the salient point detector. Secondly, the computational burden of salient point detection is reduced, because sub-sampling removes spatial and temporal redundancy. The normalization process is illustrated in Fig. 1.

Figure 1. Normalization of the input sequence (labels in the figure: input frames of Shot 1 and Shot 2, pixel tube, representative frames, normalized frames)

B. Spatio-temporal Salient Point Detection

Most salient point detection schemes are designed for still images. Although these algorithms can be extended to video by detecting salient points in each frame individually, only limited stability can be achieved in this way, because they consider the spatial variation around a point and ignore its temporal properties. To cope with the temporally dynamic nature of video, we propose to find salient points by combining spatial and temporal detection. Spatial salient points are first detected in each frame individually to select candidate points for temporal evaluation. The temporal stability of each candidate point is then examined.

The Harris point detection algorithm [9] is adopted in the proposed work to select spatial candidate points. In this section, we provide a mathematical interpretation of the Harris detector. Consider a pixel $I(x, y)$. Its illumination variation $E(x, y)$ is calculated by weighting the squared differences between $I(x, y)$ and $I(x+\Delta x, y+\Delta y)$ in a window $W$ centered on $(x, y)$:

\[ E(x, y) = \sum_{\Delta x, \Delta y \in W} G(\Delta x, \Delta y) \left[ I(x+\Delta x, y+\Delta y) - I(x, y) \right]^2 . \quad (1) \]

By approximating $I(x+\Delta x, y+\Delta y)$ with its Taylor expansion, we obtain (2). For clarity of expression, the Gaussian weighting factor $G(\Delta x, \Delta y)$ is omitted in the following equations.

\[ E(x, y) = \sum_{\Delta x, \Delta y \in W} \left[ \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + O(\Delta x^2, \Delta y^2) \right]^2 \approx \sum_{\Delta x, \Delta y \in W} \left[ \left(\frac{\partial I}{\partial x}\right)^2 \Delta x^2 + 2\,\frac{\partial I}{\partial x}\frac{\partial I}{\partial y}\,\Delta x \Delta y + \left(\frac{\partial I}{\partial y}\right)^2 \Delta y^2 \right] . \quad (2) \]

Let $s$ denote the shift vector $(\Delta x, \Delta y)$. Then $E(x, y)$ can be written as a product of matrices and vectors,

\[ E(x, y) = s M s^T , \quad (3) \]

where

\[ M = \sum_{\Delta x, \Delta y \in W} \begin{bmatrix} \left(\frac{\partial I}{\partial x}\right)^2 & \frac{\partial I}{\partial x}\frac{\partial I}{\partial y} \\ \frac{\partial I}{\partial x}\frac{\partial I}{\partial y} & \left(\frac{\partial I}{\partial y}\right)^2 \end{bmatrix} . \quad (4) \]

Because $M$ is real and symmetric, it is unitarily equivalent to a diagonal matrix $\Lambda = \mathrm{diag}(a_1, a_2)$ as shown in (5), where $a_1$ and $a_2$ are its two eigenvalues.

\[ M = U^{*} \Lambda U \quad (5) \]

Substituting (5) into (3) gives $E(x, y) = s U^{*} \Lambda U s^T$, and since $U s^T = (s U^{*})^T$ for a real orthogonal $U$, a more concise representation of $E(x, y)$ is obtained:

\[ E(x, y) = Q \Lambda Q^T , \]

where $Q = s U^{*} = (\langle s, u_1 \rangle, \langle s, u_2 \rangle)$. Here, $\langle \cdot, \cdot \rangle$ denotes the inner product, and $u_1$ and $u_2$ are the two column vectors of $U^{*}$. Consequently,

\[ E(x, y) = a_1 \langle s, u_1 \rangle^2 + a_2 \langle s, u_2 \rangle^2 . \]

The local property of $I(x, y)$ can now be derived from $a_1$ and $a_2$. Two small eigenvalues result in a tiny illumination variation $E$, which indicates a flat region. If one eigenvalue is high and the other is low, an edge can be declared. For example, if $a_1$ is high and $a_2$ is low, then $E \approx a_1 \langle s, u_1 \rangle^2$. Thus, there is no illumination change when the shift vector $s$ is orthogonal to $u_1$ (i.e., $\langle s, u_1 \rangle = 0$), while for other directions the illumination variation $E$ is high. The edge is therefore perpendicular to $u_1$. The case of two high eigenvalues corresponds to a corner point: the illumination variation is significant for all shift directions. In other words, $I(x, y)$ shows strong contrast with its neighbors in every direction, which is the defining property of a corner.

Based on the above discussion, the following equation is used to measure the salient response of $I(x, y)$, where $\varepsilon$ is a small constant that prevents the denominator from being zero:

\[ S_a(x, y) = \frac{\det(M)}{\mathrm{trace}(M) + \varepsilon} . \]

In the proposed algorithm, the points are sorted according to salient response, and the first M points are selected as candidate points in each frame (here M denotes a count, not the matrix above). The trajectories of candidate points are then traced to examine their temporal stability.

Authorized licensed use limited to: Bharati Vidyapeeth College of Engineering. Downloaded on April 05,2010 at 02:37:46 EDT from IEEE Xplore. Restrictions apply.
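As a rough illustration of the spatial stage, the sketch below computes the salient response det(M) / (trace(M) + ε) at every pixel of a small grayscale image and picks the strongest points as candidates. Several details are assumptions made only for this example and are not taken from the paper: derivatives are estimated with central differences, the window W is an unweighted 3×3 neighbourhood (the Gaussian weight G is left out, as in the derivation), and the names `gradients`, `salient_response` and `top_candidates` as well as the default `eps` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]  # img[y][x], grayscale


def gradients(img: Image) -> Tuple[Image, Image]:
    """Central-difference estimates of dI/dx and dI/dy (borders clamped)."""
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            xl, xr = max(x - 1, 0), min(x + 1, w - 1)
            yu, yd = max(y - 1, 0), min(y + 1, h - 1)
            ix[y][x] = (img[y][xr] - img[y][xl]) / 2.0
            iy[y][x] = (img[yd][x] - img[yu][x]) / 2.0
    return ix, iy


def salient_response(img: Image, radius: int = 1, eps: float = 1e-6) -> Image:
    """Sa = det(M) / (trace(M) + eps), with M summed over a square window.

    Assumption: uniform weights inside the window instead of a Gaussian.
    """
    h, w = len(img), len(img[0])
    ix, iy = gradients(img)
    sa = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            a = b = c = 0.0  # entries of M = [[a, b], [b, c]]
            for v in range(max(y - radius, 0), min(y + radius, h - 1) + 1):
                for u in range(max(x - radius, 0), min(x + radius, w - 1) + 1):
                    a += ix[v][u] * ix[v][u]
                    b += ix[v][u] * iy[v][u]
                    c += iy[v][u] * iy[v][u]
            sa[y][x] = (a * c - b * b) / (a + c + eps)
    return sa


def top_candidates(sa: Image, count: int) -> List[Tuple[int, int]]:
    """Return the (x, y) coordinates of the `count` largest responses."""
    scored = [(sa[y][x], x, y) for y in range(len(sa)) for x in range(len(sa[0]))]
    scored.sort(reverse=True)
    return [(x, y) for _, x, y in scored[:count]]
```

On an image whose lower-right quadrant is bright, the response is zero in flat areas and along the straight edges (one eigenvalue of M vanishes, so det(M) = 0) and peaks at the corner of the quadrant, which matches the eigenvalue discussion. Using the ratio of determinant to trace avoids an explicit eigen-decomposition at each pixel.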
The salient responses of the candidate points are then updated according to their temporal stability. For each candidate, the 8×8 block centered on it is searched for in the previous and following frames using motion estimation, as illustrated in Fig. 2. If a candidate point is still observed in the forward or backward matched block (or both), the current candidate point is stable under temporal variation. Accordingly, its response is increased by multiplying it by an update factor $m^t$:

\[ S_a(x, y) \leftarrow S_a(x, y) \cdot m^{t}, \quad (m > 1,\ t = 0, 1, 2) \]

where $t$ is the number of neighboring frames (previous and following) in which the matched point of $(x, y)$ is itself detected as a candidate. Finally, all the candidate points in each frame are sorted again according to their updated salient responses, and the N points with the largest responses are chosen to compose the signature of this frame.

Figure 2. Illustration of tracing the trajectory of candidate point (axes: x, y, time; candidate point in frame f, matched points in frames f-1 and f+1)

C. Signature comparison and identification

The similarity of two sequences is measured by the distance between their signatures (i.e., the coordinates of their salient points). However, the correspondence between salient points is easily broken by geometric distortions, because some points may be replaced by newly emerging ones. As a result, if the distance between salient points is measured by a correspondence-based metric such as binary correlation, the signature distance between an original sequence and its distorted version can be quite high. The Hausdorff distance was developed as an alternative metric for measuring the similarity between two unorganized point sets, and it is more robust to content-preserving manipulations than other metrics. In this paper, the modified Hausdorff distance (MHD) [10], which is insensitive to outliers, is adopted for signature comparison. The distance between two salient point sets $S_1$ and $S_2$ is calculated as follows:

\[ D(S_1, S_2) = \max\left[ h_{MHD}(S_1, S_2),\ h_{MHD}(S_2, S_1) \right] , \]

where

\[ h_{MHD}(S_1, S_2) = \frac{1}{N} \sum_{p_1 \in S_1} \min_{p_2 \in S_2} \lVert p_1 - p_2 \rVert , \]

\[ h_{MHD}(S_2, S_1) = \frac{1}{N} \sum_{p_2 \in S_2} \min_{p_1 \in S_1} \lVert p_1 - p_2 \rVert . \]

Let $V_1$ and $V_2$ denote the two sequences under comparison. The identification process can be described as a hypothesis test with

H0: V1 and V2 are perceptually identical;
H1: V1 and V2 are perceptually distinct.

The decision is based on comparing the two conditional probabilities (likelihoods) of the observed signature distance $D$, $P(D|H_0)$ and $P(D|H_1)$. Hypothesis H0 is accepted if $P(D|H_0) > P(D|H_1)$; otherwise H1 is accepted.

III. EXPERIMENTAL RESULTS

In this section, the performance of the proposed algorithm is investigated. We have collected 100 distinct video sequences to form the test set, which consists of documentary, movie, cartoon, sports match and standard test sequences. For robustness evaluation, a series of content-preserving distortions is applied to the test sequences: MPEG2 compression (5 Mbps), 4% salt-and-pepper noise addition, 3×3 median filtering, blur, contrast enhancement, 2° rotation, 8-pixel horizontal translation, 1/2 scaling, and 50% frame dropping. The parameters of the proposed algorithm are set as follows: the variances of the temporal and spatial Gaussian filters in pre-processing are $\sigma_s^2 = \sigma_t^2 = 5$, the number of candidate points in each frame is M=40, the number of salient points is N=20, and the update factor for the salient response is m=1.5.

Signatures of the original and distorted sequences are extracted and compared. The histograms of signature distances between content-identical and content-distinct sequences (also referred to as intra and inter distances) are calculated and shown in Fig. 3, where the curves represent the fitted distributions. Both the intra and the inter distance distributions are modeled as log-normal distributions:

\[ P(D \mid H_0) \sim \text{Log-}N(0.71,\ 0.83^2) \]
\[ P(D \mid H_1) \sim \text{Log-}N(2.76,\ 0.21^2) . \]

The overlap between the two distributions, which corresponds to the sum of the false rejection and false acceptance rates, is small. Thus, it can be concluded that the salient point based signature achieves a satisfactory trade-off between robustness and distinctness.

The identification accuracy of the proposed algorithm is evaluated by precision and recall rates. The motion direction (MD) based identification algorithm is also implemented for performance comparison. In the MD algorithm, the motion direction histogram contains 5 bins, and the similarity between histograms is measured by Euclidean distance. The precision-recall graphs of the proposed and MD algorithms are plotted in Fig. 4, which shows that the proposed algorithm achieves better identification accuracy than the MD algorithm. The salient points detected in the hall monitor sequence and in its rotated, scaled and translated versions are shown in Fig. 5. These normalized frames show that most of the detected spatio-temporal salient points remain invariant under geometric distortions.
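The signature comparison and the hypothesis test described in Section C can be sketched together in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: a signature is taken to be a list of (x, y) salient point coordinates for a single frame (how per-frame distances are combined into a sequence-level distance is not specified in the text), the two numbers in Log-N(·, ·) are read as the mean and variance of ln D, and the function names `directed_mhd`, `signature_distance`, `lognormal_logpdf` and `accept_h0` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def directed_mhd(s1: Sequence[Point], s2: Sequence[Point]) -> float:
    """Mean, over points of s1, of the distance to the nearest point of s2."""
    return sum(min(math.dist(p1, p2) for p2 in s2) for p1 in s1) / len(s1)


def signature_distance(s1: Sequence[Point], s2: Sequence[Point]) -> float:
    """Modified Hausdorff distance: the larger of the two directed distances."""
    return max(directed_mhd(s1, s2), directed_mhd(s2, s1))


def lognormal_logpdf(d: float, mu: float, sigma: float) -> float:
    """Log-density of a log-normal variable whose logarithm is N(mu, sigma^2)."""
    z = (math.log(d) - mu) / sigma
    return -math.log(d * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def accept_h0(d: float,
              intra: Tuple[float, float] = (0.71, 0.83),
              inter: Tuple[float, float] = (2.76, 0.21)) -> bool:
    """True if P(D|H0) > P(D|H1), using the fitted parameters reported above.

    Assumption: d == 0 (identical point sets) is treated as H0, since the
    intra-distance likelihood dominates as d approaches zero.
    """
    if d <= 0.0:
        return True
    return lognormal_logpdf(d, *intra) > lognormal_logpdf(d, *inter)
```

For example, `signature_distance([(0, 0), (4, 0)], [(0, 3), (4, 0)])` evaluates to 1.5, and `accept_h0(1.5)` accepts H0, whereas a distance of 16 falls near the mode of the inter-distance distribution and leads to H1. Comparing log-densities rather than densities avoids numerical underflow for distances far from both modes.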
IV. CONCLUSION

A novel video identification algorithm is presented in this paper. To account for the temporal dynamics of video, temporal and spatial stability evaluations are combined to detect salient points. The detected points are observed to be quite stable under content-preserving distortions. Moreover, signature comparison based on the modified Hausdorff distance further improves the robustness to geometric distortions. In addition to being highly robust, the proposed algorithm can effectively distinguish content-distinct sequences. Thus, satisfactory identification accuracy can be achieved by the proposed algorithm.

Figure 3. Distributions of intra and inter signature distances measured in the proposed algorithm

Figure 4. P-R graphs of the proposed and MD algorithms

Figure 5. Salient points detected in normalized hall-monitor sequences. (a) Original, (b) Rotated, (c) Translated, (d) Scaled

REFERENCES

[1] Y.P. Tan, S.R. Kulkarni, and P.J. Ramadge, “A Framework for Measuring Video Similarity and Its Application to Video Query by Example”, Proc. Image Process., 1999, pp.106-110.
[2] A. Hampapur and R. Bolle, “Comparison of Sequence Matching Techniques for Video Copy Detection”, Int. Conf. on Storage and Retrieval for Media Databases, 2002, pp.194-201.
[3] Z. Li, A.K. Katsaggelos, and B. Gandhi, “Fast Video Shot Retrieval Based on Trace Geometry Matching”, IEE Proc. Vis. Image Signal Process., 2005, vol.152, no.3, pp.367-372.
[4] T. C. Hoad and J. Zobel, “Detection of Video Sequences Using Compact Signatures”, ACM Trans. on Inf. Syst., 2006, vol.24, no.1, pp.1-50.
[5] B. Coskun, B. Sankur, and N. Memon, “Spatio-temporal Transform-based Video Hashing”, IEEE Trans. on Multimedia, 2006, vol.8, no.6, pp.1190-1208.
[6] C. D. Roover, C. D. Vleeschouwer, F. Lefèbvre, and B. Macq, “Robust Video Hashing Based on Radial Projections of Key Frames”, IEEE Trans. Signal Process., 2005, vol.53, no.10, pp.4020-4037.
[7] L. T. Julien, C. Li, A. Joly et al., “Video Copy Detection: A Comparative Study”, Proc. of the 6th ACM Int. Conf. on Image and Video Retrieval, 2007, pp.371-378.
[8] J. Bescos, G. Cisneros, J. M. Martinez, J. M. Menendez, and J. Cabrera, “A Unified Model for Techniques on Video Shot Transition Detection”, IEEE Trans. Multimedia, 2005, vol.7, no.2, pp.293-307.
[9] C. Harris and M. Stephens, “A Combined Corner and Edge Detector”, Proc. of 4th Alvey Vision Conf., 1988, pp.153-158.
[10] M. P. Dubuisson and A.K. Jain, “A Modified Hausdorff Distance for Object Matching”, Pattern Recognition, 1994, vol.1, no.9, pp.566-568.