Video indexing using shot boundary detection approach and search tracks

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print),
ISSN 0976-6375 (Online), Volume 4, Issue 3, May-June (2013), pp. 432-440, © IAEME
VIDEO INDEXING USING SHOT BOUNDARY DETECTION
APPROACH AND SEARCH TRACKS IN VIDEO
Reshma R. Gulwani (1), Sudhirkumar D. Sawarkar (2)
(1) Computer Engineering, Ramrao Adik Institute of Technology / Mumbai University,
Mumbai, India
(2) Computer Engineering, Datta Meghe College of Engineering / Mumbai University,
Mumbai, India
ABSTRACT
Video indexing and retrieval are important steps towards searching within videos.
A shot boundary detection approach is proposed to perform video indexing. To reduce the
computational cost, frames that are clearly not shot boundaries are first removed from the
original video. Key points are then found by dividing each frame into n*n blocks and
applying an average function to each block. A supervised learning classifier, the support
vector machine (SVM), is used for key point matching to capture different kinds of
transitions: abrupt (cut) and gradual (fade, wipe, dissolve). Frames that show transitions
are represented as thumbnails. Audio characteristics such as signal energy are used to
detect sound (tracks) in videos. The approaches are applied to CCTV and film videos.
Keywords: Keypoint Extraction, Key Frame Extraction, Shot Boundary Detection, Support
Vector Machine (SVM), Video Retrieval.
1. INTRODUCTION
Videos are an important form of multimedia information. Advances in digital and
network technology have produced a flood of information, and video in particular has led
to unprecedented volumes of data. When fast-forwarding through videotape, a user searches
for an image or sequence similar to the one they have in mind. In complex cases queries are
not that simple, but a system that can locate and present keys relevant to the video
content, instead of depending on the user's imagination, will make extensive videos easier
to handle. The essential issues involve assisting users by extracting
physical features from video data and designing an effective application interface. We can
extract physical features to partition video data into useful footage segments and store the
segment attribute information, or annotations, as indexes. These indexes should be content
based and should describe the essential information pertaining to the video segments.
Indexes can be visualized through the interface so that users can perform various functions.
We extract physical features such as inter-frame differences, motion vectors, and color
distributions from image data, obtaining useful indexes related to editing information.
The foundation step of content-based video retrieval is shot boundary detection. A
shot is a consecutive sequence of frames captured by a camera between its start and stop
operations, which mark the shot boundaries [15]. There is strong content correlation
between the frames in a shot, so shots are considered the fundamental units for organizing
the contents of video sequences. Shot boundaries can be broadly classified into two types:
abrupt transitions and gradual transitions. An abrupt transition is an instantaneous
transition from one shot to the subsequent shot. A gradual transition occurs over multiple
frames and is generated by applying more elaborate editing effects, so that frame $f_i$
belongs to one shot, frame $f_{i+N}$ belongs to the second, and the N-1 frames in between
represent a gradual transformation of $f_i$ into $f_{i+N}$ [5]. Gradual transitions can be
further classified into fade out/in (FOI), dissolve, wipe, and other transitions, according
to the characteristics of the different editing effects [1][3].
Many approaches have been proposed in the last few years, such as pixel-by-pixel
comparison, histogram-based approaches, and the edge change ratio. In the pixel comparison
method, two consecutive frames are compared pixel by pixel. If the number of differing
pixels is large enough, the two frames are declared to belong to different shots. The
pixel-based method is easy and fast, but because it captures every detail of the frame it
is highly sensitive to local motion, camera motion, and minor changes in illumination
[1][3]. To handle these drawbacks, several improved methods have been proposed, for example
luminance/color histogram-based methods and edge-based methods.
Histogram-based methods use the statistics of color/luminance. Xue et al. [12]
proposed a shot boundary detection measure whose features are obtained from the color
histogram of the hue and saturation image of the video frame. The advantage of
histogram-based shot change detection is that it is quite discriminant, easy to compute, and
mostly insensitive to translational, rotational, and zooming camera motions. Its weakness
is that it does not incorporate the spatial distribution of the colors, so it fails when
two frames have similar histograms but different structures [1]. A better tradeoff between
pixel and global color histogram methods can be achieved by block-matching methods [6][13],
in which each frame is divided into several non-overlapping blocks and the luminance/color
histogram features of each block are extracted.
Edge information is an obvious choice for characterizing an image [1][12][14]. The
advantage of this feature is that it is sufficiently invariant to illumination changes and
several types of motion, and it is related to the human visual perception of a scene. Its
main disadvantages are computational cost and noise sensitivity [1].
Our proposed method first uses block color histogram differences between two frames
to find the key frames in the original video in order to reduce the detection time. Key
frames are the frames that represent the salient content and information, such as the
boundaries of a shot. A new frame sequence, NSEQ, is then constructed based on the key
frames. Next, features are extracted from each frame of NSEQ to detect shot boundaries.
Frames are
divided into n*n blocks, and a keypoint is obtained from each block by applying an average
function. These keypoints are then matched by a support vector machine (SVM). Furthermore,
our system uses different algorithms for different kinds of shot transitions. Frames that
show transitions are represented as thumbnails. Audio characteristics such as signal energy
are used to detect the sound (tracks) in video. Experiments are carried out on CCTV
videos and film videos.
2. KEYPOINT EXTRACTION
Each frame is divided into n*n blocks. The key points are then found by computing the
average of each n*n block in each frame.
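The block-averaging step can be illustrated with a minimal Python sketch. It assumes that n is the side length of a block in pixels, that the frame is a grayscale image stored as a list of rows, and that partial blocks at the frame edge are dropped. None of these details is stated in the paper, and the function name `block_keypoints` is hypothetical.

```python
def block_keypoints(frame, n):
    """Return one keypoint (the mean pixel value) per n x n block of a frame.

    `frame` is a list of rows of grayscale pixel values. Blocks that would run
    past the frame edge are dropped; this edge policy is an assumption.
    """
    rows, cols = len(frame), len(frame[0])
    keypoints = []
    for top in range(0, rows - n + 1, n):
        row_of_keypoints = []
        for left in range(0, cols - n + 1, n):
            block = [frame[r][c]
                     for r in range(top, top + n)
                     for c in range(left, left + n)]
            row_of_keypoints.append(sum(block) / len(block))
        keypoints.append(row_of_keypoints)
    return keypoints


frame = [
    [10, 10, 50, 50],
    [10, 10, 50, 50],
    [0, 20, 100, 100],
    [20, 40, 100, 100],
]
print(block_keypoints(frame, 2))  # [[10.0, 50.0], [20.0, 100.0]]
```

A 4 x 4 frame with n = 2 therefore yields a 2 x 2 grid of keypoints, one per block.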
3. SUPPORT VECTOR MACHINE
Having obtained keypoints from two images, the next issue is to find the matched
keypoints between the two images. Traditionally, keypoint matching is computed from the
Euclidean distance between feature vectors. However, this approach has several difficulties
in achieving successful results, so we propose a machine learning method, the support
vector machine (SVM), for keypoint matching.
The SVM is a machine learning method that analyzes data and recognizes patterns, and
it is used for classification and regression analysis. It is a useful technique for data
classification, based on the concept of structural risk minimization using the
Vapnik-Chervonenkis (VC) dimension [8]. A classification task usually involves training and
testing data, each consisting of data instances. Each instance in the training set contains
one "target value" (class label) and several "attributes" (features). The goal of the SVM
is to produce a model that predicts the target values of the instances in the testing set,
given only their attributes. Keypoints extracted from frames are compared by using the SVM.
To train an SVM model for keypoint matching, we annotated a training set consisting
of positive examples and negative examples:
$$F = \{(X_1, Y_1), \ldots, (X_l, Y_l)\} \in (X \times Y)^l$$
where $X_i$ is the input feature vector and $Y_i \in \{1, -1\}$ is the output label. We
assume that class label 1 corresponds to correct keypoint matches and class label -1 to
incorrect keypoint matches. The number of keypoint matches is regarded as the
similarity score of two images, denoted by NKM (number of keypoint matches).
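The NKM score can be sketched as a count of keypoint pairs that a classifier labels +1. The paper does not specify the feature vector or give the trained model, so the sketch below makes two loud assumptions: the feature is the absolute difference of two block averages, and a simple tolerance rule (`stand_in_classifier`) stands in for the trained SVM. All names are hypothetical.

```python
def count_keypoint_matches(keypoints_a, keypoints_b, classify):
    """Return NKM: the number of keypoint pairs that `classify` labels +1."""
    matches = 0
    for ka, kb in zip(keypoints_a, keypoints_b):
        feature = [abs(ka - kb)]  # hypothetical one-dimensional feature vector
        if classify(feature) == 1:
            matches += 1
    return matches


def stand_in_classifier(feature, tolerance=8.0):
    """Placeholder for a trained SVM: +1 if the block means are close, else -1."""
    return 1 if feature[0] <= tolerance else -1


frame_a = [10.0, 50.0, 20.0, 100.0]
frame_b = [12.0, 49.0, 90.0, 101.0]
nkm = count_keypoint_matches(frame_a, frame_b, stand_in_classifier)
print(nkm)  # 3
```

Passing the classifier in as a callable keeps the counting logic independent of how the model is trained, so a real SVM decision function could replace the placeholder.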
4. SHOT BOUNDARY DETECTION
Applying the boundary detection process to every frame is inefficient and extremely
time consuming [4]. Our method therefore removes the frames that are clearly not shot
boundaries from the original video and examines only those frames that are likely to
contain shot boundaries. Different algorithms are used to detect different kinds of shot
transitions. The details of each detection process are explained in the following sections.
4.1 KEYFRAME EXTRACTION
There is great redundancy among the frames in the same shot, so the frames that best
reflect the shot contents are selected as key frames [9][10][11] to represent the shot
succinctly.
In our paper, the method for key frame extraction consists of three steps, applied
after each frame is decomposed into n x n blocks.
Step 1: Calculate the block color histogram difference.
If the difference in hue value of the same block in two adjacent frames is greater than a
threshold, the block color histogram difference is set to 1; otherwise it is set to 0.
Step 2: Calculate the frame color histogram difference of two adjacent frames.
It is computed by adding the block color histogram differences (calculated in Step 1) of
all the blocks in the two adjacent frames.
Step 3: If the frame color histogram difference is above a threshold, the frame is judged
to be a shot transition candidate. A new sequence, known as "NSEQ", is created: the value
-1 is assigned in NSEQ if the frame shows a shot boundary; otherwise the value 1 is
assigned.
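The three steps can be sketched in Python as follows. The sketch assumes that each block has already been summarised by a single hue value (for example its mean hue) and that NSEQ holds one label per pair of adjacent frames. The function names and threshold values are hypothetical, since the paper does not report them.

```python
def block_differences(hues_prev, hues_curr, block_threshold):
    """Step 1: 1 for each block whose hue value changed by more than the threshold."""
    return [1 if abs(a - b) > block_threshold else 0
            for a, b in zip(hues_prev, hues_curr)]


def build_nseq(block_hues_per_frame, block_threshold, frame_threshold):
    """Steps 2-3: label each adjacent frame pair -1 (candidate) or 1 (not)."""
    nseq = []
    for prev, curr in zip(block_hues_per_frame, block_hues_per_frame[1:]):
        # Step 2: the frame difference is the sum of the block differences.
        frame_difference = sum(block_differences(prev, curr, block_threshold))
        # Step 3: a large frame difference marks a shot transition candidate.
        nseq.append(-1 if frame_difference > frame_threshold else 1)
    return nseq


# Four frames, each summarised by four per-block hue values.
frames = [
    [0.10, 0.20, 0.30, 0.40],
    [0.11, 0.21, 0.29, 0.41],  # almost identical to the previous frame
    [0.60, 0.70, 0.80, 0.42],  # three of four blocks change sharply
    [0.61, 0.69, 0.80, 0.43],
]
print(build_nseq(frames, block_threshold=0.1, frame_threshold=2))  # [1, -1, 1]
```

Only the pair straddling the sharp change is labelled -1, so later stages would examine that pair and skip the rest.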
4.2. CUT TRANSITION
A cut transition is an instantaneous transition from one shot to the subsequent
shot and involves just two consecutive frames from different shots. A cut transition can be
detected from the similarity between adjacent frames, which is found by using the SVM
approach described above.
To detect a cut transition:
$$\mathrm{Cut}(f_{i-1}, f_i) = \begin{cases} 1 & \text{if } \mathrm{NKM}(f_{i-1}, f_i) < \mathrm{Threshold}_{\mathrm{Cut}} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
If $\mathrm{NKM}(f_{i-1}, f_i)$ is lower than $\mathrm{Threshold}_{\mathrm{Cut}}$, a cut transition is detected.
$$\text{If } \mathrm{Cut}(f_{i-1}, f_i) = 1, \text{ then } \mathrm{NSEQ}(f_{i-1}, f_i) = 1 \qquad (2)$$
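Eq. (1) amounts to thresholding the NKM score of each adjacent frame pair. The following minimal sketch assumes the NKM scores have already been computed. The threshold value 5 and the name `detect_cuts` are hypothetical.

```python
def detect_cuts(nkm_scores, cut_threshold):
    """Apply Eq. (1): Cut = 1 where NKM(f[i-1], f[i]) < cut_threshold, else 0."""
    return [1 if nkm < cut_threshold else 0 for nkm in nkm_scores]


# nkm_scores[k] is the NKM between frame k and frame k + 1.
nkm_scores = [15, 14, 2, 16, 13]
cuts = detect_cuts(nkm_scores, cut_threshold=5)
print(cuts)  # [0, 0, 1, 0, 0]

# Index of the first frame of each new shot.
cut_positions = [k + 1 for k, flag in enumerate(cuts) if flag == 1]
print(cut_positions)  # [3]
```

Here frames 2 and 3 share only two matching keypoints, so the cut lies between them.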
4.3. FADE TRANSITION DETECTION
A fade is a shot transition in which the first shot gradually disappears (fade out)
before the second shot appears (fade in) [1]. During a fade out/in, the two shots are
spatially and temporally well separated by some monochrome frames [5]. During a fade-out,
the images gradually disappear into a monochrome, often black, image. During a fade-in,
the images gradually appear from a monochrome, often black, image. Visually, the image
becomes cloudy during a fade-out [1] until a monochrome frame appears, and becomes clear
during a fade-in. The clearer the image, the greater the number of frame keypoints. The
number of frame keypoints therefore falls as the image becomes cloudy and rises as the
image becomes clear. For a monochrome frame, the average value of all pixels in the frame
is less than the monochrome threshold.

The detection of a fade-out/in transition is explained below.
First, whether the current frame is monochrome is determined as shown in Eq. (3):
$$\mathrm{Mono}(f_i) = \begin{cases} 1 & \text{if } F_{\mathrm{Avg}}(f_i) < \mathrm{MonoThreshold} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
where $F_{\mathrm{Avg}}(f_i)$ is the average of all pixels in the current frame.
If the current frame is not a monochrome frame, processing stops. Otherwise, it is
determined whether the current frame is the starting point of a fade out or the ending
point of a fade in. A fade in/out section is detected from consecutive monotonic
increases/decreases in the average frame pixel value. Eq. (4) tests for a monotonic
increase and Eq. (5) for a monotonic decrease:
$$\mathrm{INCF}_{\mathrm{Avg}}(f_{i-1}, f_i) = \begin{cases} 1 & \text{if } F_{\mathrm{Avg}}(f_{i-1}) < F_{\mathrm{Avg}}(f_i) \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
$$\mathrm{DECF}_{\mathrm{Avg}}(f_{i-1}, f_i) = \begin{cases} 1 & \text{if } F_{\mathrm{Avg}}(f_{i-1}) > F_{\mathrm{Avg}}(f_i) \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
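Eqs. (3) to (5) can be combined into a short Python sketch that finds the extent of a fade around a monochrome frame. It assumes the per-frame pixel averages are already available and reads the procedure as: walk backwards from the monochrome frame while the average is decreasing towards it, and forwards while the average is increasing. The walking strategy and the names are assumptions, not the authors' stated implementation.

```python
def is_monochrome(frame_avg, mono_threshold):
    """Eq. (3): 1 if the average pixel value is below the threshold, else 0."""
    return 1 if frame_avg < mono_threshold else 0


def fade_span(frame_avgs, mono_index):
    """Return (start, end) of the fade out/in around a monochrome frame.

    Walks backwards while Eq. (5) holds (average decreasing towards the
    monochrome frame) and forwards while Eq. (4) holds (average increasing).
    """
    start = mono_index
    while start > 0 and frame_avgs[start - 1] > frame_avgs[start]:
        start -= 1
    end = mono_index
    while end < len(frame_avgs) - 1 and frame_avgs[end] < frame_avgs[end + 1]:
        end += 1
    return start, end


frame_avgs = [120, 121, 90, 60, 30, 4, 35, 70, 110, 108]
mono_threshold = 10  # hypothetical value
for i, avg in enumerate(frame_avgs):
    if is_monochrome(avg, mono_threshold):
        print(i, fade_span(frame_avgs, i))  # 5 (1, 8)
```

In this toy sequence frame 5 is monochrome, the fade-out runs from frame 1 to frame 5, and the fade-in runs from frame 5 to frame 8.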
4.4. WIPE TRANSITION DETECTION
A transition from one shot to another in which the images of the new shot are revealed
by a moving boundary is called a wipe. In general the boundary can be of any geometric
shape, but most of the time it is a line or a set of lines. In a wipe, one scene or
picture gradually enters across the view while another gradually leaves. During a wipe, the
appearing and disappearing shots coexist in different spatial regions of the intermediate
video frames, and the region occupied by the former grows until it entirely replaces the
latter [2].
To detect all kinds of wipe transitions, we use an important property of the change
during a wipe: one portion of the frame matches the starting frame, and the remaining
portion matches the ending frame.
First, the starting point and the ending point of a wipe need to be determined. In a
series of frames where $\mathrm{NSEQ}(f_{i-1}, f_i) = -1$, the starting frame of the series
is regarded as $F_{wb}(f_i)$ and the ending frame of the series is regarded as
$F_{we}(f_i)$.

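To detect the beginning frame of a wipe:
$$F_{wb}(f_{i-1}) = 1 \quad \text{if } \mathrm{NSEQ}(f_i) = -1,\ \mathrm{NSEQ}(f_{i-1}) = -1,\ \mathrm{NSEQ}(f_{i-2}) = 1 \qquad (6)$$
To detect the ending frame of a wipe:
$$F_{we}(f_{i-1}) = 1 \quad \text{if } \mathrm{NSEQ}(f_i) = 1,\ \mathrm{NSEQ}(f_{i-1}) = -1 \qquad (7)$$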
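Eqs. (6) and (7) scan NSEQ for the first and last frame of each run of -1 labels. A minimal Python sketch, assuming NSEQ is stored as a list with one label per frame (the name `wipe_boundaries` is hypothetical):

```python
def wipe_boundaries(nseq):
    """Return (starts, ends): frame indices flagged by Eq. (6) and Eq. (7)."""
    starts, ends = [], []
    for i in range(1, len(nseq)):
        # Eq. (6): frame i-1 opens a run of -1 labels that continues at frame i.
        if i >= 2 and nseq[i] == -1 and nseq[i - 1] == -1 and nseq[i - 2] == 1:
            starts.append(i - 1)
        # Eq. (7): frame i-1 is the last -1 label before the sequence returns to 1.
        if nseq[i] == 1 and nseq[i - 1] == -1:
            ends.append(i - 1)
    return starts, ends


nseq = [1, 1, -1, -1, -1, -1, 1, 1, -1, 1]
print(wipe_boundaries(nseq))  # ([2], [5, 8])
```

The isolated -1 at index 8 satisfies Eq. (7) but not Eq. (6), because Eq. (6) requires two consecutive -1 labels. A caller would presumably keep only the ends that follow a detected start, which here gives a single wipe candidate spanning frames 2 to 5.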
4.5. DISSOLVE TRANSITION DETECTION
A dissolve is a shot transition in which the first shot gradually disappears while
the second shot gradually appears [1]. In the proposed method for dissolve transitions, we
are interested in the similarity between frames that are a specific distance apart from
each other. The similarity between two frames is calculated from the difference between
their gray values, which is taken as the distance between the frames. A maximum and a
minimum threshold are set for the dissolve transition. If the distance between frames is
higher than the maximum threshold, a dissolve transition is detected; otherwise there is
no dissolve transition.
$$\mathrm{Dist} = \sum_{i=1}^{n} \big( \mathrm{rgb2gray}(f_{i+1}) - \mathrm{rgb2gray}(f_i) \big) \qquad (8)$$
$$\mathrm{Dissolve}(f_i) = \begin{cases} 1 & \text{if } \mathrm{Dist} > \mathrm{dissMaxTh} \\ -1 & \text{if } \mathrm{Dist} < \mathrm{dissMinTh} \end{cases} \qquad (9)$$
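Eqs. (8) and (9) can be sketched in Python under a few stated assumptions. Eq. (8) is read here as a signed sum of gray-value differences over the corresponding pixels of two frames. Gray values use the luminance weights of MATLAB's rgb2gray. The label 0 for distances between the two thresholds is an assumption, since Eq. (9) leaves that case undefined. The threshold values are hypothetical.

```python
def rgb_to_gray(pixel):
    """Luminance of an (R, G, B) pixel, using the weights of MATLAB's rgb2gray."""
    r, g, b = pixel
    return 0.2989 * r + 0.5870 * g + 0.1140 * b


def frame_distance(frame_a, frame_b):
    """Eq. (8), read as a signed sum over corresponding pixels of two frames."""
    return sum(rgb_to_gray(pb) - rgb_to_gray(pa)
               for pa, pb in zip(frame_a, frame_b))


def dissolve_label(dist, diss_max_th, diss_min_th):
    """Eq. (9); the value 0 between the two thresholds is an assumption."""
    if dist > diss_max_th:
        return 1
    if dist < diss_min_th:
        return -1
    return 0


dark = [(10, 10, 10)] * 4      # frames flattened to lists of RGB pixels
bright = [(200, 200, 200)] * 4
print(dissolve_label(frame_distance(dark, bright), 100.0, 5.0))  # 1
print(dissolve_label(frame_distance(dark, dark), 100.0, 5.0))    # -1
```

Because the sum is signed, a frame pair that darkens produces a negative distance and is labelled -1. An absolute difference would treat brightening and darkening alike, but that would depart from Eq. (8) as written.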
5. DETECTION OF SOUND (TRACKS) IN VIDEO
To detect tracks in a video, the audio is first extracted from the video file. Matlab
does not support fetching audio from video files directly. To extract the audio, video
files such as .avi or .wmv are therefore first converted into .wav files by using a
third-party utility such as dBpoweramp Music Converter. The .wav file can then be read in
Matlab to obtain the energy of the signal. We expect this energy to be high during a song,
so a song can be detected in the video based on configured thresholds.
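The energy measure can be sketched as the sum of squared samples over short, non-overlapping frames of the audio signal. The paper does not define its energy computation precisely, so the frame length, the threshold, and the name `frame_energies` below are hypothetical.

```python
def frame_energies(samples, frame_length):
    """Sum of squared samples for each non-overlapping frame of the signal."""
    return [sum(s * s for s in samples[start:start + frame_length])
            for start in range(0, len(samples) - frame_length + 1, frame_length)]


# Toy signal: quiet samples followed by louder, music-like samples.
samples = [1, -1, 2, -2, 10, -12, 11, -9]
energies = frame_energies(samples, frame_length=4)
print(energies)  # [10, 446]

energy_threshold = 100  # hypothetical value; the paper does not state one
print([e > energy_threshold for e in energies])  # [False, True]
```

Only the second frame exceeds the threshold, so only that part of the signal would be flagged as a possible song.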
6. EXPERIMENTAL RESULTS
In this section, we carry out experiments on CCTV videos and film videos. All
experiments are conducted in Matlab. First, some parameters of the experiment must be
decided. For SVM classification we use the LIBSVM software [7], developed at National
Taiwan University. We chose the RBF (radial basis function) kernel for creating the model.
The RBF kernel has two parameters: C and gamma. It is not known beforehand which C and
gamma are best for a given problem. The goal is to identify a good (C, gamma) pair so that
the classifier can accurately predict unknown data. We use cross-validation to obtain C
and gamma in this paper.
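The parameter search can be sketched as a grid search scored by k-fold cross-validation. The guide cited as [7] suggests trying exponentially growing values of C and gamma. To stay self-contained, the sketch does not call LIBSVM. Instead, the caller supplies a hypothetical `accuracy(c, gamma, train, test)` function, which would train an RBF-kernel SVM and return its accuracy on the held-out fold. The toy scoring function in the usage lines exists only to make the example runnable.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index lists."""
    folds = [list(range(f * n_samples // k, (f + 1) * n_samples // k))
             for f in range(k)]
    splits = []
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        splits.append((train, folds[held_out]))
    return splits


def grid_search(n_samples, k, c_values, gamma_values, accuracy):
    """Return the (C, gamma) pair with the highest mean cross-validation accuracy."""
    splits = k_fold_indices(n_samples, k)

    def mean_accuracy(params):
        c, gamma = params
        return sum(accuracy(c, gamma, tr, te) for tr, te in splits) / len(splits)

    return max(product(c_values, gamma_values), key=mean_accuracy)


# Toy stand-in for "train an SVM and score it"; it peaks at C = 8, gamma = 0.5.
def toy_accuracy(c, gamma, train, test):
    return 1.0 - abs(c - 8) / 1000 - abs(gamma - 0.5)


c_values = [2 ** e for e in (-5, -1, 3, 7)]
gamma_values = [2 ** e for e in (-7, -4, -1, 2)]
best = grid_search(10, 5, c_values, gamma_values, toy_accuracy)
print(best)  # (8, 0.5)
```

With a real SVM, `accuracy` would be replaced by a call into the SVM library, and the returned pair would be used to train the final model on the full training set.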

The following transitions are detected in this experiment:
Figure 1. Cut detection
Figure 2. Fade out/in detection
Figure 3. Dissolve detection
Figure 4. Wipe detection
For audio, we first convert the .wmv or .avi video file into a .wav file to extract the
audio, using the dBpoweramp software. To track a song, the average of the frames that fall
within a continuous 50 seconds is computed. If that average is greater than a threshold,
a song is detected.
We carry out the experiments on two short videos: the first video contains only one
song, as shown in Fig. 5, and the second video contains three songs, as shown in Fig. 6.
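The 50-second averaging rule can be sketched as follows. The sketch assumes a sequence of per-frame values sampled at a known rate and uses non-overlapping windows. The paper does not say whether its windows overlap or which threshold it used, so the windowing, the threshold, and the name `detect_song_windows` are all hypothetical.

```python
def detect_song_windows(values, values_per_second, threshold, window_seconds=50):
    """Return (start_second, end_second) for each window whose mean exceeds threshold."""
    window = window_seconds * values_per_second
    songs = []
    for start in range(0, len(values) - window + 1, window):
        mean = sum(values[start:start + window]) / window
        if mean > threshold:
            songs.append((start // values_per_second,
                          (start + window) // values_per_second))
    return songs


# One value per second: 50 s of low values, 50 s of high values, 50 s of low values.
values = [100] * 50 + [600] * 50 + [120] * 50
print(detect_song_windows(values, values_per_second=1, threshold=400))  # [(50, 100)]
```

Only the middle window has a mean above the threshold, so a single song is reported between seconds 50 and 100.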

Figure 5. Graph for detecting single song in video
Figure 6. Graph for detecting three songs in video
7. CONCLUSION
A method is proposed that avoids calculating all the features for every frame when
detecting shot boundaries. It skips the processing of frames that are clearly not shot
boundaries and calculates the features only for the parts of the video that are likely to
contain shot boundaries. We use an SVM approach for keypoint matching. Different
algorithms are used to capture the different characteristics of different kinds of shot
transitions, and sound (track) is also detected.
[Figures 5 and 6: plots of nearbyFrameAvg (vertical axis) against time (horizontal axis); images not reproduced.]

REFERENCES
Journal Papers
[1] C. Cotsaces, N. Nikolaidis, and I. Pitas, “Video Shot Detection and Condensed
Representation”, IEEE Signal Processing Magazine, pp. 28-37, March 2006.
[2] H. H. Yu and W. Wolf, “A hierarchical multiresolution video shot transition
detection scheme”, Computer Vision and Image Understanding, vol. 75, no. 1/2,
pp. 196-213, 1999.
[3] J. H. Yuan, H. Y. Wang, and B. Zhang, “A formal study of shot boundary detection”,
IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 2,
pp. 168-186, February 2007.
[4] Y. Kawai, H. Sumiyoshi, and N. Yagi, “Shot Boundary Detection at TRECVID 2007”,
in TRECVID 2007 Workshop.
http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.html.
[5] A. Hanjalic, “Shot-Boundary Detection: Unraveled and Resolved?”, IEEE
Transactions on Circuits and Systems for Video Technology, vol. 12, no. 2, pp. 90-105,
2002.
[6] J. Bescós, G. Cisneros, J. M. Martínez, J. M. Menéndez, and J. Cabrera, “A Unified
Model for Techniques on Video-Shot Transition Detection”, IEEE Transactions on
Multimedia, vol. 7, no. 2, pp. 293-306, April 2005.
[7] C. W. Hsu, C. C. Chang, and C. J. Lin, “A Practical Guide to Support Vector
Classification”, http://www.csie.ntu.edu.tw/~cjlin.
[8] V. Vapnik, “Statistical learning theory”, John Wiley, New York, 1998.
[9] K. W. Sze, K. M. Lam, and G. P. Qiu, “A new key frame representation for video
segment retrieval”, IEEE Transactions on Circuits and Systems for Video Technology,
vol. 15, no. 9, pp. 1148-1155, Sep. 2005.
[10] B. T. Truong and S. Venkatesh, “Video abstraction: A systematic review and
classification”, ACM Trans. Multimedia Comput., Commun. Appl., vol. 3, no. 1, art. 3,
pp. 1-37, Feb. 2007.
[11] D. P. Mukherjee, S. K. Das, and S. Saha, “Key frame estimation in video using
randomness measure of feature point pattern”, IEEE Transactions on Circuits and Systems
for Video Technology, vol. 17, no. 5, pp. 612-620, May 2007.
Proceedings Papers
[12] L. Xue, C. Li, H. Li, and Z. Xiong, “A general method for shot boundary detection”,
in Proceedings of the 2008 International Conference on Multimedia and Ubiquitous
Engineering, pp. 394-397, 2008.
[13] Z. P. Zong, K. Liu, and J. H. Peng, “Shot Boundary Detection Based on Histogram of
Mismatching-Pixel Count of FMB”, in Proceedings of ICIEA 2006, pp. 24-26, 2006.
[14] H. Zhao and X. H. Li, “Shot Boundary Detection Based on Mutual Information and Canny
Edge Detector”, in Proceedings of 2008 International Conference on Computer Science
and Software, pp. 1124-1128, 2008.
[15] C. H. Yeo, Y. W. Zhu, Q. B. Sun, and S. F. Chang, “A Framework for sub-window shot
detection”, in Proc. Int. Multimedia Modelling Conf., Jan. 2005, pp. 84-91.
