Video indexing using shot boundary detection approach and search tracks



Published in Technology , Business
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 3, May-June (2013), pp. 432-440. © IAEME. Journal Impact Factor (2013): 6.1302 (calculated by GISI).

VIDEO INDEXING USING SHOT BOUNDARY DETECTION APPROACH AND SEARCH TRACKS IN VIDEO

Reshma R. Gulwani (1), Sudhirkumar D. Sawarkar (2)
(1) Computer Engineering, Ramrao Adik Institute of Technology / Mumbai University, Mumbai, India
(2) Computer Engineering, Datta Meghe College of Engineering / Mumbai University, Mumbai, India

ABSTRACT

Video indexing and retrieval is an important step towards searching in videos. A shot boundary detection approach is proposed to perform video indexing. To reduce the computational cost, frames that are clearly not shot boundaries are first removed from the original video. Key points are then found by dividing each frame into n*n blocks and applying an average function to each n*n block. A supervised learning classifier, the support vector machine (SVM), is used for key point matching in order to capture different kinds of transitions, both abrupt (cut) and gradual (fade, wipe, dissolve). Frames that show transitions are represented as thumbnails. Audio characteristics such as signal energy are used to detect sound (tracks) in videos. The applications chosen for these approaches are CCTV and film videos.

Keywords: Keypoint Extraction, Key Frame Extraction, Shot Boundary Detection, Support Vector Machine (SVM), video retrieval.

1. INTRODUCTION

Videos are an important form of multimedia information. Advances in digital and network technology have produced a flood of information, and the amount of video information in particular has led to unprecedented volumes of data. When fast-forwarding through videotape, a user searches for an image or sequence similar to the one in their imagination. In some complex cases queries are not that simple, but a system that can locate and present keys relevant to the video content, instead of depending on the user's imagination, will make extensive videos easier to handle.

The essential issues involve assisting users by extracting physical features from video data and designing an effective application interface. We can extract physical features to partition video data into useful footage segments and store the segment attribute information, or annotations, as indexes. These indexes should describe the essential information pertaining to the video segments and should be content based. Indexes can be visualized through the interface so that users can perform various functions. We extract physical features such as inter-frame differences, motion vectors, and color distributions from image data, obtaining useful indexes related to editing information.

The foundation step of content-based video retrieval is shot boundary detection. A shot is a consecutive sequence of frames captured by a camera action that takes place between start and stop operations, which mark the shot boundaries [15]. There is strong content correlation between the frames in a shot. Therefore shots are considered to be the fundamental units for organizing the contents of video sequences.

Shot boundaries can be broadly classified into two types: abrupt transitions and gradual transitions. An abrupt transition is an instantaneous transition from one shot to the subsequent shot. A gradual transition occurs over multiple frames and is generated by applying more elaborate editing effects involving several frames, so that frame $f_i$ belongs to one shot, frame $f_{i+N}$ belongs to the second, and the $N-1$ frames in between represent a gradual transformation of $f_i$ into $f_{i+N}$ [5]. Gradual transitions can be further classified into fade out/in (FOI) transitions, dissolve transitions, wipe transitions, and other transitions, according to the characteristics of the different editing effects [1][3].
Many approaches have been proposed in the last few years, such as pixel-by-pixel comparison, histogram-based approaches, and edge change ratio. In the pixel comparison method, the pixels of two consecutive frames are compared directly. If the number of differing pixels is large enough, the two frames are declared to belong to different shots. The pixel-based method is easy and fast, but because it captures every detail of the frame it is extremely sensitive to local motion, camera motion, and minor changes in illumination [1][3]. To handle these drawbacks, several improved methods have been proposed, for example luminance/color histogram-based methods and edge-based methods.

Histogram-based methods use the statistics of color/luminance. Xue L. et al. [12] proposed a shot boundary detection measure whose features are obtained from the color histogram of the hue and saturation image of the video frame. The advantage of histogram-based shot change detection is that it is quite discriminant, easy to compute, and mostly insensitive to translational, rotational, and zooming camera motions. Its weakness is that it does not incorporate the spatial distribution of the various colors, so it fails when two frames have similar histograms but different structures [1]. A better tradeoff between pixel and global color histogram methods can be achieved by block-matching methods [6][13], in which each frame is divided into several non-overlapping blocks and the luminance/color histogram features of each block are extracted.

Edge information is an obvious choice for characterizing an image [1][12][14]. The advantage of this feature is that it is sufficiently invariant to illumination changes and several types of motion, and it is related to the human visual perception of a scene. Its main disadvantages are computational cost and noise sensitivity [1].
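The trade-off between the pixel-based and histogram-based measures discussed above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the gray-level toy frames, the pixel tolerance, and the choice of 8 histogram bins are all assumptions made for illustration, and both frames are assumed to have the same size.

```python
from collections import Counter

def pixel_difference(frame_a, frame_b, pixel_tol=10):
    """Fraction of pixel positions whose gray values differ by more than pixel_tol."""
    total = changed = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += 1
            if abs(a - b) > pixel_tol:
                changed += 1
    return changed / total

def histogram_difference(frame_a, frame_b, bins=8, levels=256):
    """Normalized L1 distance between the gray-level histograms of two equal-size frames."""
    width = levels // bins
    hist_a = Counter(v // width for row in frame_a for v in row)
    hist_b = Counter(v // width for row in frame_b for v in row)
    total = sum(hist_a.values())
    # A Counter returns 0 for bins that never occur, so empty bins need no special case.
    return sum(abs(hist_a[k] - hist_b[k]) for k in range(bins)) / (2 * total)

# Toy 4x4 frames (hypothetical data): a bright 2x2 object moves from the
# top-left corner to the bottom-right corner within the same shot.
frame_1 = [[200, 200, 0, 0], [200, 200, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
frame_2 = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 200, 200], [0, 0, 200, 200]]
# A frame with entirely different content: uniformly mid-gray.
frame_3 = [[120] * 4 for _ in range(4)]

motion_pixel = pixel_difference(frame_1, frame_2)     # half of the pixels changed
motion_hist = histogram_difference(frame_1, frame_2)  # histograms are identical
cut_hist = histogram_difference(frame_1, frame_3)     # histograms do not overlap
```

In this toy case the pixel measure reacts strongly to motion inside one shot, while the histogram measure ignores it. The same pair also shows the histogram weakness noted above: two frames with different structure but identical histograms are indistinguishable to it.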
Our proposed method first uses block color histogram differences between two frames to find the key frames of the original video in order to reduce the detection time. Key frames are the frames that represent the salient content and information of the video, such as the boundaries of a shot. A new frame sequence, NSEQ, is then constructed from the key frames. Next, features are extracted from each frame of NSEQ to detect shot boundaries. Frames are divided into n*n blocks, and a keypoint is obtained from each block by applying an average function. These keypoints are then matched by a Support Vector Machine (SVM). Furthermore, our system uses different algorithms for different kinds of shot transitions. Frames that show transitions are represented as thumbnails. Audio characteristics such as signal energy are used to detect the sound (tracks) in the video. Experiments are carried out on CCTV videos and film videos.

2. KEYPOINT EXTRACTION

Each frame is divided into n*n blocks. The key points are then found by computing the average of each n*n block in each frame.

3. SUPPORT VECTOR MACHINE

Having obtained keypoints from two images, the next important issue is to find the matched keypoints between the two images. Traditionally, keypoint matching is computed from the Euclidean distance between feature vectors. However, this has several difficulties in achieving successful results, so we propose a machine learning method for keypoint matching in this paper. The support vector machine (SVM) is the machine learning method preferred here.

The SVM is a machine learning method that analyzes data and recognizes patterns, and it is used for classification and regression analysis. It is a useful technique for data classification, based on the concept of structural risk minimization using the Vapnik-Chervonenkis (VC) dimension [8]. A classification task usually involves training and testing data that consist of data instances. Each instance in the training set contains one "target value" (class label) and several "attributes" (features). The goal of the SVM is to produce a model that predicts the target values of the data instances in the testing set, for which only the attributes are given.
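As a concrete illustration of the keypoint extraction in Section 2, the block averaging can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' Matlab code: it reads "n*n block" as a block of n x n pixels, treats the frame as a 2-D list of gray values, and drops partial blocks at the right and bottom edges. The function name is hypothetical.

```python
def block_average_keypoints(frame, n):
    """Split a 2-D gray frame into non-overlapping n x n pixel blocks and
    return the mean value of each block, one list per row of blocks."""
    height, width = len(frame), len(frame[0])
    keypoints = []
    for top in range(0, height - n + 1, n):
        row_of_means = []
        for left in range(0, width - n + 1, n):
            # Collect the n*n pixel values of the block starting at (top, left).
            block = [frame[y][x]
                     for y in range(top, top + n)
                     for x in range(left, left + n)]
            row_of_means.append(sum(block) / (n * n))
        keypoints.append(row_of_means)
    return keypoints

# Hypothetical 4x4 frame split into four 2x2 blocks.
frame = [
    [10, 20, 100, 100],
    [30, 40, 100, 100],
    [0, 0, 50, 60],
    [0, 0, 70, 80],
]
keypoints = block_average_keypoints(frame, 2)
```

Each entry of `keypoints` is the single value that stands for one block, so a frame of H x W pixels is reduced to roughly (H/n) x (W/n) values before any matching takes place.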
The keypoints extracted from the frames are compared using the SVM. To train an SVM model for keypoint matching, we annotated a training set consisting of positive and negative examples:

$$F = \{(X_1, Y_1), \ldots, (X_l, Y_l)\}$$

where $X_i$ is the input feature vector and $Y_i \in \{1, -1\}$ is the output label. We assume that the class labeled 1 corresponds to correct keypoint matches and the class labeled -1 to incorrect keypoint matches. The number of keypoint matches is regarded as the similarity score of two images, denoted NKM (Number of Keypoint Matching).

4. SHOT BOUNDARY DETECTION

It is inefficient and extremely time consuming to apply the boundary detection process to all the frames [4]. Our method therefore removes the frames that are clearly not shot boundaries from the original video and examines only those frames that are likely to contain shot boundaries. Different algorithms are used to detect different kinds of shot transitions.
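The pre-filtering idea can be sketched as follows, reading the three steps of Section 4.1 literally. This is a minimal Python sketch under several assumptions that the paper does not state: each block is summarized by a single hue value, the block test compares the absolute hue difference between adjacent frames, hue wrap-around is ignored, and the two threshold values are arbitrary. The function name and the data are hypothetical.

```python
def candidate_sequence(block_hues, block_threshold, frame_threshold):
    """block_hues[t] is a flat list holding one hue value per block of frame t.
    Returns one entry per pair of adjacent frames:
    -1 marks a shot-boundary candidate, 1 marks a pair that can be skipped."""
    nseq = []
    for prev, curr in zip(block_hues, block_hues[1:]):
        # Step 1: binary difference for each block position.
        block_diffs = [1 if abs(a - b) > block_threshold else 0
                       for a, b in zip(prev, curr)]
        # Step 2: frame difference is the number of changed blocks.
        frame_diff = sum(block_diffs)
        # Step 3: threshold the frame difference to label the pair.
        nseq.append(-1 if frame_diff > frame_threshold else 1)
    return nseq

# Four hypothetical frames with four blocks each; the content changes
# abruptly between the second and the third frame.
hues = [
    [10, 10, 10, 10],
    [11, 10, 12, 10],
    [90, 95, 12, 80],
    [91, 95, 13, 80],
]
nseq = candidate_sequence(hues, block_threshold=20, frame_threshold=2)
```

Only the pairs labeled -1 would then be passed to the more expensive keypoint matching, which is where the saving in detection time comes from.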
The details of each detection process are explained in the following sections.

4.1 KEYFRAME EXTRACTION

There is great redundancy among the frames in the same shot. Therefore certain frames that best reflect the shot contents are selected as key frames [9][10][11] to represent the shot succinctly. In our paper, each frame is first decomposed into n x n blocks, and key frame extraction then consists of three steps:

Step 1: Calculate the block color histogram difference. If the difference in hue value of the same block in two adjacent frames is greater than a threshold, the block color histogram difference is set to 1; otherwise it is set to 0.

Step 2: Calculate the frame color histogram difference of two adjacent frames. It is computed by adding the block color histogram differences (calculated in Step 1) of all the blocks in the two adjacent frames.

Step 3: If the frame color histogram difference is above a threshold, the frame is judged to be a shot transition candidate, and a new sequence, called "NSEQ", is created. A value of -1 is assigned in the new sequence if the frame shows a shot boundary; otherwise a value of 1 is assigned.

4.2. CUT TRANSITION

A cut transition is an instantaneous transition from one shot to the subsequent shot, which involves just two consecutive frames of different shots. A cut transition can be detected from the similarity between adjacent frames, which is found using the SVM approach described above. To detect a cut transition:

$$\mathrm{Cut}(f_{i-1}, f_i) = \begin{cases} 1 & \text{if } \mathrm{NKM}(f_{i-1}, f_i) < \mathrm{Threshold}_{\mathrm{Cut}} \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

If $\mathrm{NKM}(f_{i-1}, f_i)$ is lower than $\mathrm{Threshold}_{\mathrm{Cut}}$, a cut transition is detected.

$$\text{If } \mathrm{Cut}(f_{i-1}, f_i) = 1, \text{ then } \mathrm{NSEQ}(f_{i-1}, f_i) = 1 \tag{2}$$

4.3. FADE TRANSITION DETECTION

A fade in a video sequence is a shot transition in which the first shot gradually disappears (fade out) before the second shot appears (fade in) [1]. During a fade out/in, the two shots are spatially and temporally well separated by some monochrome frames [5]. During a fade out, the images gradually disappear into a monochrome, often black, image. During a fade in, the images gradually appear from a monochrome, often black, image.

During a fade out, the image visually becomes cloudy [1] until a monochrome frame appears, and during a fade in the image becomes clear. The clearer the image, the more keypoints the frame has. Consequently, the number of frame keypoints decreases as the image becomes cloudy. For a monochrome frame, the average value of all pixels in the frame is below the monochrome threshold. As the image becomes clear again, the number of frame keypoints increases.
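The monochrome test and the brightness trend around a monochrome frame can be sketched as follows, in the spirit of the conditions the paper formalizes in Eqs. (3) to (5). This is a minimal Python sketch, not the authors' implementation: the helper names, the 1x1 toy frames, and the choice to extend the fade span while the frame average changes strictly monotonically are assumptions.

```python
def frame_average(frame):
    """Mean of all pixel values in a 2-D gray frame."""
    pixels = [v for row in frame for v in row]
    return sum(pixels) / len(pixels)

def is_monochrome(average, mono_threshold):
    """A frame counts as monochrome when its average is below the threshold."""
    return average < mono_threshold

def fade_span(averages, mono_index):
    """Walk backwards from a monochrome frame while the averages strictly
    decrease towards it (fade out), and forwards while they strictly increase
    away from it (fade in). Returns (first_frame, last_frame) of the fade."""
    start = mono_index
    while start > 0 and averages[start - 1] > averages[start]:
        start -= 1
    end = mono_index
    while end < len(averages) - 1 and averages[end] < averages[end + 1]:
        end += 1
    return start, end

# Hypothetical sequence of 1x1 frames: steady shot, fade to black, fade in.
values = [120, 121, 90, 60, 30, 5, 35, 70, 110, 110]
frames = [[[v]] for v in values]
averages = [frame_average(f) for f in frames]

mono_threshold = 10  # assumed value
mono_frames = [i for i, a in enumerate(averages) if is_monochrome(a, mono_threshold)]
spans = [fade_span(averages, i) for i in mono_frames]
```

Frames whose average never drops below the threshold are skipped entirely, which matches the description that processing stops when the current frame is not monochrome.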
The details of the detection of a fade out/in transition are explained below. First, whether the current frame is monochrome is determined as shown in Eq. (3):

$$\mathrm{Mono}(f_i) = \begin{cases} 1 & \text{if } F_{\mathrm{Avg}}(f_i) < \mathrm{MonoThreshold} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where $F_{\mathrm{Avg}}(f_i)$ is the average of all pixels in the current frame. If the current frame is not a monochrome frame, processing stops. Otherwise, it is determined whether the current frame is the starting point of a fade out or the ending point of a fade in. A fade in/out section is detected from consecutive monotonic increases/decreases in the average pixel value of the frames. The following formulas are used for the determination: Eq. (4) for monotonic increases and Eq. (5) for monotonic decreases.

$$\mathrm{INC}F_{\mathrm{Avg}}(f_{i-1}, f_i) = \begin{cases} 1 & \text{if } F_{\mathrm{Avg}}(f_{i-1}) < F_{\mathrm{Avg}}(f_i) \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

$$\mathrm{DEC}F_{\mathrm{Avg}}(f_{i-1}, f_i) = \begin{cases} 1 & \text{if } F_{\mathrm{Avg}}(f_{i-1}) > F_{\mathrm{Avg}}(f_i) \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

4.4. WIPE TRANSITION DETECTION

A transition from one shot to another in which the images of the new shot are revealed by a moving boundary is called a wipe. In general the boundary can be of any geometric shape, but most of the time it is a line or a set of lines. In a wipe, one scene or picture gradually enters across the view while another gradually leaves. During a wipe, the appearing and disappearing shots coexist in different spatial regions of the intermediate video frames, and the region occupied by the former grows until it entirely replaces the latter [2]. An important property of the change during any kind of wipe is that one portion of the frame matches the starting frame and the remaining portion matches the ending frame. First, the starting point and the ending point of the wipe need to be determined.
For a series of frames where $\mathrm{NSEQ}(f_{i-1}, f_i) = -1$, the starting frame of the series is regarded as $F_{wb}(f_i)$ and the ending frame of the series is regarded as $F_{we}(f_i)$.
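The run-based definition above can be sketched as follows, using the conditions the paper states as Eqs. (6) and (7): a wipe begins at the first of two consecutive candidate entries that follow a non-candidate entry, and ends at a candidate entry that is followed by a non-candidate entry. This is a minimal Python sketch; the function name, the flat-list representation of NSEQ, and the pairing of each start with the nearest following end are assumptions.

```python
def wipe_spans(nseq):
    """nseq holds 1 (non-candidate) or -1 (candidate) per position.
    Returns a list of (start_index, end_index) pairs for candidate runs."""
    starts, ends = [], []
    for i in range(2, len(nseq)):
        # Start condition: positions i-1 and i are candidates, i-2 is not.
        if nseq[i] == -1 and nseq[i - 1] == -1 and nseq[i - 2] == 1:
            starts.append(i - 1)
    for i in range(1, len(nseq)):
        # End condition: position i-1 is a candidate, position i is not.
        if nseq[i] == 1 and nseq[i - 1] == -1:
            ends.append(i - 1)
    # Pair each start with the first end at or after it (an assumption).
    spans = []
    for s in starts:
        later = [e for e in ends if e >= s]
        if later:
            spans.append((s, later[0]))
    return spans

# Hypothetical NSEQ: a three-entry candidate run and an isolated candidate.
nseq = [1, 1, -1, -1, -1, 1, 1, -1, 1]
spans = wipe_spans(nseq)
```

The isolated candidate at index 7 produces an end but no start, because the start condition requires two consecutive candidates, so it is not reported as a wipe.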
To detect the beginning frame of a wipe:

$$F_{wb}(f_{i-1}) = 1 \quad \text{if } \mathrm{NSEQ}(f_i) = -1,\ \mathrm{NSEQ}(f_{i-1}) = -1,\ \mathrm{NSEQ}(f_{i-2}) = 1 \tag{6}$$

To detect the ending frame of a wipe:

$$F_{we}(f_{i-1}) = 1 \quad \text{if } \mathrm{NSEQ}(f_i) = 1,\ \mathrm{NSEQ}(f_{i-1}) = -1 \tag{7}$$

4.5. DISSOLVE TRANSITION DETECTION

A dissolve in a video sequence is a shot transition in which the first shot gradually disappears while the second shot gradually appears [1]. In the proposed method for dissolve transitions, we are interested in the similarity between frames that are a specific distance apart from each other. The similarity between two frames is calculated as the difference between the gray values of the two frames, which is taken as the distance between the frames. Maximum and minimum thresholds are set for the dissolve transition. If the distance between frames is higher than the maximum threshold, a dissolve transition is detected; otherwise there is no dissolve transition.

$$\mathrm{Dist} = \sum_{i=1}^{n} \left( \mathrm{rgb2gray}(f_{i+1}) - \mathrm{rgb2gray}(f_i) \right) \tag{8}$$

$$\mathrm{Dissolve}(f_i) = \begin{cases} 1 & \text{if } \mathrm{Dist} > \mathrm{dissMaxTh} \\ -1 & \text{if } \mathrm{Dist} < \mathrm{dissMinTh} \end{cases} \tag{9}$$

5. DETECTION OF SOUND (TRACKS) IN VIDEO

To detect tracks in a video, the audio is first extracted from the video file. Matlab does not directly support extracting audio from video files. To extract the audio, video files such as .avi or .wmv are first converted into .wav files using a third-party utility such as dBpoweramp Music Converter. The .wav file can then be read in Matlab to obtain the energy of the signal. We expect this energy to be high during a song, so that a song in the video can be detected based on configured thresholds.

6. EXPERIMENTAL RESULTS

In this section, we carry out experiments on CCTV videos and film videos. All experiments are conducted in Matlab. First, some parameters of the experiment have to be decided.
For the SVM, we use the LIBSVM software, developed by Chang and Lin at National Taiwan University, to perform SVM classification [7]. We have chosen the RBF (Radial Basis Function) kernel for creating the model. The RBF kernel has two parameters: C and gamma. It is not known beforehand which C and gamma are best for a given problem. The goal is to identify a good (C, gamma) pair so that the classifier can accurately predict unknown data. We use cross-validation to obtain C and gamma in this paper.
The following transitions are detected in this experiment:

Figure 1. Cut detection
Figure 2. Fade out/in detection
Figure 3. Dissolve detection
Figure 4. Wipe detection

For audio, we first convert the .wmv or .avi video file into a .wav file to extract the audio, using the dBpoweramp software. To track a song, the average of the frames that fall within a continuous 50-second window is computed. If that average is greater than a threshold, a song is detected. We carry out the experiments on two short videos: the first video contains only one song, as shown in Fig. 5, and the second video contains three songs, as shown in Fig. 6.
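The windowed-average rule described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' Matlab code: signal energy is taken as the sum of squared samples per non-overlapping frame, the window length is given in frames instead of seconds, consecutive above-threshold windows are merged into one segment, and all names and values are hypothetical.

```python
def short_time_energy(samples, frame_len):
    """Energy (sum of squared samples) of each non-overlapping audio frame."""
    return [sum(s * s for s in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def detect_songs(energies, window, threshold):
    """Slide a window over the frame energies; wherever the window mean
    exceeds the threshold, the frames it covers are marked as song.
    Overlapping marked windows are merged into (first_frame, last_frame)."""
    segments = []
    current = None
    for start in range(len(energies) - window + 1):
        mean_energy = sum(energies[start:start + window]) / window
        if mean_energy > threshold:
            end = start + window - 1
            if current is None:
                current = [start, end]
            else:
                current[1] = end
        elif current is not None:
            segments.append(tuple(current))
            current = None
    if current is not None:
        segments.append(tuple(current))
    return segments

# Hypothetical frame energies with two loud passages.
energies = [1, 1, 9, 9, 9, 9, 1, 1, 1, 8, 8, 8, 1]
songs = detect_songs(energies, window=3, threshold=5)
```

Because a window is marked as a whole, the reported segments extend slightly past the loud frames at both ends; a longer window, such as the 50 seconds used in the paper, smooths out short loud events at the cost of coarser segment boundaries.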
Figure 5. Graph for detecting a single song in a video (x-axis: time; y-axis: nearbyFrameAvg)
Figure 6. Graph for detecting three songs in a video (x-axis: time; y-axis: nearbyFrameAvg)

7. CONCLUSION

A method is proposed that avoids calculating the frame features for every frame when detecting shot boundaries. It skips the processing of frames that are clearly not shot boundaries and calculates the features only for the parts of the video that are likely to contain shot boundaries. We use an SVM approach for keypoint matching. Different algorithms are used to capture the different characteristics of the different kinds of shot transitions, and sound (tracks) is also detected.
REFERENCES

Journal Papers

[1] C. Cotsaces, N. Nikolaidis, and I. Pitas, “Video Shot Detection and Condensed Representation”, Journal of IEEE Signal Processing Magazine, March, pp. 28-37, 2006.
[2] H. H. Yu and W. Wolf, “A hierarchical multiresolution video shot Transition Detection scheme”, Journal of Computer Vision and Image Understanding, vol. 75, no. 1/2, pp. 196-213, 1999.
[3] J. H. Yuan, H. Y. Wang, and B. Zhang, “A formal study of shot boundary detection”, Journal of Transactions on Circuits and Systems for Video Technology, 17(2), pp. 168-186, February 2007.
[4] Y. Kawai, H. Sumiyoshi, and N. Yagi, “Shot Boundary Detection at TRECVID 2007”, in TRECVID 2007 Workshop.
[5] A. Hanjalic, “Shot-Boundary Detection: Unraveled and Resolved?”, Journal of IEEE Transaction on Circuits and Systems for Video Technology, vol. 12, no. 2, pp. 90-105, 2002.
[6] J. Bescós, G. Cisneros, J. M. Martínez, J. M. Menéndez, and J. Cabrera, “A Unified Model for Techniques on Video-Shot Transition Detection”, Journal of IEEE Transactions on Multimedia, 7(2), pp. 293-306, April 2005.
[7] C. W. Hsu, C. C. Chang, and C. J. Lin, “A Practical Guide to Support Vector Classification”,
[8] V. Vapnik, “Statistical learning theory”, John Wiley, New York, 1998.
[9] K. W. Sze, K. M. Lam, and G. P. Qiu, “A new key frame representation for video segment retrieval”, IEEE Trans. Circuits Syst. Video Technology, vol. 15, no. 9, pp. 1148-1155, Sep. 2005.
[10] B. T. Truong and S. Venkatesh, “Video abstraction: A systematic review and classification”, ACM Trans. Multimedia Comput., Commun. Appl., vol. 3, no. 1, art. 3, pp. 1-37, Feb. 2007.
[11] D. P. Mukherjee, S. K. Das, and S. Saha, “Key frame estimation in video using randomness measure of feature point pattern”, IEEE Trans. Circuits Syst. Video Technology, vol. 7, no. 5, pp. 612-620, May 2007.

Proceedings Papers

[12] L. Xue, C. Li, H. Li, and Z. Xiong, “A general method for shot boundary detection”, in Proceedings of the 2008 International Conference on Multimedia and Ubiquitous Engineering, pp. 394-397, 2008.
[13] Z. P. Zong, K. Liu, and J. H. Peng, “Shot Boundary Detection Based on Histogram of Mismatching-Pixel Count of FMB”, in Proceedings of ICIEA 2006, pp. 24-26, 2006.
[14] H. Zhao and X. H. Li, “Shot Boundary Detection Based on Mutual Information and Canny Edge Detector”, in Proceedings of 2008 International Conference on Computer Science and Software, pp. 1124-1128, 2008.
[15] C. H. Yeo, Y. W. Zhu, Q. B. Sun, and S. F. Chang, “A Framework for sub-window shot detection”, in Proc. Int. Multimedia Modelling Conf., Jan. 2005, pp. 84-91.