INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online)
Volume 4, Issue 1, January-February (2013), pp. 420-429
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2012): 3.9580 (Calculated by GISI), www.jifactor.com
CONTENT BASED INDEXING AND RETRIEVAL FROM VEHICLE
SURVEILLANCE VIDEOS USING GAUSSIAN MIXTURE MODEL
Shruti V Kamath¹, Mayank Darbari², Dr. Rajashree Shettar³
¹ ² Student (8th sem B.E), Department of Computer Science and Engineering,
R.V College of Engineering, R.V Vidyaniketan Post, Mysore Road, Bangalore-59, Karnataka, India
³ Professor, Department of Computer Science and Engineering,
R.V College of Engineering, R.V Vidyaniketan Post, Mysore Road, Bangalore-59, Karnataka, India
ABSTRACT
Visual vehicle surveillance is widely used and produces a huge amount of video data,
which is stored for immediate or future use. A common task is to find a vehicle of interest
in these videos, for example one involved in a car crash, an illegal U-turn, or speeding.
Without an effective method for the indexing and retrieval of vehicles, this would be a
very labour-intensive process.
In the proposed approach, a surveillance video from a highway is taken, and shots
are detected using the colour histogram method. Since there are great redundancies among
the frames in the same shot, key frames are extracted to avoid redundancy. After key frames
are extracted from the input video, moving vehicles are detected and tracked using a
Gaussian Mixture Model (GMM). The proposed method extracts vehicles and classifies them
based on area and colour. Unsupervised clustering is performed on the extracted vehicles
using the K-means and SOM (Self-Organizing Map) algorithms. Clustering is done based on size
(small, medium, large) and colour (red, white, grey) of the vehicles. Using the clustering
results, a matching matrix is computed for these two clustering algorithms, with respect to the
object tracking method. The proposed system gives an accuracy of up to 97% (with respect to
size) and up to 94% (with respect to colour).
Keywords – Color Histogram, Gaussian Mixture Model, k-means, SOM, Vehicle
Surveillance, Vehicle Classification
I. INTRODUCTION
Digital surveillance systems capture a large amount of video data. To make the
processing of this data efficient and accurate, a content-based indexing and retrieval system
is used. This paper proposes such a system, which demonstrates a strong ability in digital
surveillance and has common applications in police departments, traffic control, intelligence
bureaus and defence. A surveillance video from a highway is taken. Shots are then detected
using the colour histogram method [1] [7], which computes the histogram difference between
frames. A threshold value is then set, and the frames where the difference is above the
threshold are identified as shot boundaries.
To represent a shot succinctly, the redundancies are removed from the video using
key frame extraction [5][6]. After key frames are extracted from the input video, we detect
and track moving vehicles using the Gaussian Mixture Model [8]. We have assumed a
stationary camera, fixed at an angle, for the input videos. The vehicles identified are
classified based on area and colour.
Unsupervised clustering is performed on the detected objects using the K-means and
SOM (Self-Organizing Map) [9] [10] algorithms. Vehicles are clustered into 3 classes with
respect to area, namely small, medium and large, and 3 classes with respect to colour,
namely dark grey, light grey and red. Given a threshold value, the extracted vehicles
are clustered, and the matching matrix (a table layout that allows visualization of the
performance of an algorithm) is calculated from the results of k-means and SOM, with
respect to the GMM.
The remainder of this paper is organized as follows. Section II is a literature survey
covering the existing algorithms. A brief system overview is presented in Section III. How
the features are extracted and indexed is explained in Section IV, Video Processing & Object
Tracking and Indexing. Experiments and results are discussed in Section V. Finally, the
conclusion and future work are presented in Section VI.
II. LITERATURE SURVEY
The research on shot boundary detection has a long history, and there exist specific
surveys on video shot boundary detection. A shot is a consecutive sequence of frames
captured by a camera action that takes place between start and stop operations, which mark
the shot boundaries. Generally, shot boundaries are classified as cuts, in which the transition
between successive shots is abrupt, and gradual transitions, which include dissolve, fade in,
fade out, wipe, etc., stretching over a number of frames. Methods for shot boundary detection
usually first extract visual features from each frame, then measure similarities between
frames using the extracted features, and finally detect shot boundaries between frames that
are dissimilar. Current similarity metrics for the extracted feature vectors include the 1-norm,
cosine dissimilarity, the Euclidean distance, the histogram intersection, and the chi-squared
similarity [1][7].
The key frames of a video reflect the characteristics of the video to some extent, so
traditional image retrieval techniques can be applied to key frames to achieve video retrieval.
The static key frame features useful for video indexing and retrieval are mainly classified as
color-based, texture-based, and shape-based. Color-based features include color histograms,
color moments, color correlograms, a mixture of Gaussian models [8], etc. Shape-based
features that describe object shapes in the image can be extracted from object contours or
regions. Color is usually represented by the color histogram, color correlogram, color
coherence vector and color moments, under a certain color space. The color histogram
feature has been used by many researchers for image retrieval. A color histogram is a vector
of an image, where each element represents the number of pixels falling in a bin [13].
The detection of moving objects in video sequences is an important and challenging
task in multimedia technologies. To segment moving foreground objects from the
background, a pure background image has to be estimated. This reference background image
is then subtracted from each frame, and binary masks with the moving foreground objects are
obtained by thresholding the resulting difference images [1].
The Gaussian Mixture Model for object detection has different variations. In one such
variation, each pixel in a camera scene is modeled by an adaptive parametric mixture model
of three Gaussian distributions. A common optimization scheme used to fit a Gaussian
mixture model is the Expectation Maximization (EM) algorithm, an iterative method that is
guaranteed to converge to a local maximum in the search space [8].
Machine learning-based approaches use labeled samples with low-level features to
train a classifier or a set of classifiers for videos. Vehicle detection and vehicle classification
using neural networks (NN) can be achieved by video monitoring systems [10].
III. SYSTEM OVERVIEW
[Fig. 1 pipeline: Shot Boundary Detection → Key Frame Extraction → Vehicle Tracking →
Vehicle Extraction → Clustering → Vehicle Retrieval; the Feature Extraction output is stored
in a .mat file]
Fig. 1 shows the architecture of the proposed framework.
As shown in Fig. 1, the raw surveillance video is captured and the vehicles are
extracted. During object tracking, the information of the vehicles is extracted and saved as a
.mat file; the color feature is also picked up from the vehicles and stored in the .mat file.
When the user selects a color and/or a type provided by the system, the retrieval results are
returned.
IV. VIDEO PROCESSING & OBJECT TRACKING AND INDEXING
A. SHOT BOUNDARY DETECTION
A shot is a consecutive sequence of frames captured by a camera action that takes place
between start and stop operations, which mark the shot boundaries. There are strong content
correlations between the frames in a shot. The surveillance videos we used contain abrupt
transitions, where the transition between successive shots is sudden.
In this paper, a threshold-based approach is implemented [1].
Threshold-Based Approach: The threshold-based approach detects shot boundaries by
comparing the measured pair-wise similarities between frames with a predefined threshold:
when a similarity is less than the threshold, a boundary is detected. The threshold can be
global, adaptive, or global and adaptive combined. The global threshold-based algorithms use
the same threshold, generally set empirically, over the whole video. Their major limitation is
that local content variations are not effectively incorporated into the estimation of the global
threshold, which reduces the boundary detection accuracy.
Abrupt Transition: One of the effective ways is intensity histogram. According to NTSC
standard, the intensity for a RGB frame can be calculated as,
I = 0.299R + 0.587G + 0.114B (1)
where R, G and B are Red, Green and Blue channel of the pixel.
The intensity histogram difference [12] is expressed as,
SDi = Σ (j=1 to G) |Hi(j) − Hi−1(j)| (2)
where Hi(j) is the histogram value for the ith frame at level j, and G denotes the total number
of levels for the histogram.
In a continuous video frame sequence, the histogram difference is small, whereas at an
abrupt transition the intensity histogram difference spikes. Even when there is notable
movement or an illumination change between neighboring frames, the intensity histogram
difference is relatively small compared with the peaks caused by abrupt changes.
Therefore, the difference of intensity histograms with a proper threshold is effective in
detecting abrupt transitions.
The threshold value to determine whether the intensity histogram difference indicates an
abrupt transition can be set to,
Tb = μ + ασ (3)
where μ and σ are the mean value and standard deviation of the intensity histogram
difference. The value of α typically varies from 3 to 6.
Sometimes gradual transitions can have spikes that are even higher than those in abrupt
transitions. In order to differentiate abrupt transitions from gradual transitions, the
neighboring frames of a detected spike are also tested; if there are multiple spikes nearby,
the transition is more likely to be a gradual transition, and we simply drop this detection.
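The steps above can be sketched as follows. This is an illustrative Python/NumPy sketch rather than the paper's Matlab implementation; the value α = 4 is an assumption chosen from the 3 to 6 range the text gives.

```python
import numpy as np

def detect_cuts(frames, alpha=4.0, levels=256):
    """Detect abrupt transitions in a list of RGB frames (H x W x 3, uint8).

    Implements Eqs. (1)-(3): NTSC intensity, histogram difference between
    consecutive frames, and the adaptive threshold Tb = mu + alpha * sigma.
    Returns indices i such that a cut lies between frame i and frame i+1.
    """
    diffs = []
    prev_hist = None
    for f in frames:
        # Eq. (1): I = 0.299 R + 0.587 G + 0.114 B
        intensity = 0.299 * f[..., 0] + 0.587 * f[..., 1] + 0.114 * f[..., 2]
        hist, _ = np.histogram(intensity, bins=levels, range=(0, 256))
        if prev_hist is not None:
            diffs.append(np.abs(hist - prev_hist).sum())  # Eq. (2)
        prev_hist = hist
    diffs = np.array(diffs, dtype=float)
    tb = diffs.mean() + alpha * diffs.std()               # Eq. (3)
    return [i for i, d in enumerate(diffs) if d > tb]
```

A real implementation would additionally inspect the neighbourhood of each spike, as described above, to drop gradual transitions.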
B. KEY FRAME EXTRACTION
There are great redundancies among the frames in the same shot; therefore, certain
frames that best reflect the shot contents are selected as key frames to succinctly represent the
shot. The extracted key frames should contain as much salient content of the shot as possible
and avoid as much redundancy as possible. The features used for key frame extraction
include colours (particularly the colour histogram), edges, shapes, optical flow, MPEG-7
motion descriptors. We have adopted the colour feature to extract key frames.
Sequential Comparison between Frames: In this algorithm [1], frames subsequent to a
previously extracted key frame are sequentially compared with that key frame until a frame
which is very different from it is obtained. This frame is selected as the next key frame. The
merits of sequential comparison-based algorithms include their simplicity, intuitiveness, low
computational complexity, and adaptation of the number of key frames to the length of the
shot. Their limitations include the following. The key frames represent local properties of the
shot rather than global properties. The irregular distribution and uncontrolled number of key
frames make these algorithms unsuitable for applications that need an even distribution or a
fixed number of key frames. Redundancy can occur when there are contents appearing
repeatedly in the same shot. Nevertheless, this approach proved efficient in removing
redundancies in the input video.
The Algorithm:
1. Choose the first frame as the standard frame that is used to compare with the following
frames.
2. Get the corresponding pixel values in both frames one by one, and compute their
differences.
3. Add the results of step 2 together. The sum is the difference between the two frames.
4. If the sum is larger than a set threshold, select frame (1+i) as a key frame; frame (1+i)
then becomes the standard frame. Repeat steps 1 to 4 until no more frames remain.
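The four steps above can be sketched as a short Python/NumPy function (the paper's implementation is in Matlab; the threshold is a caller-supplied parameter, as in the text):

```python
import numpy as np

def extract_key_frames(frames, threshold):
    """Sequential-comparison key frame extraction (steps 1-4 above).

    The first frame is the standard frame; each later frame is compared to it
    pixel by pixel, and when the summed absolute difference exceeds the
    threshold that frame becomes a key frame and the new standard frame.
    Returns the indices of the key frames.
    """
    if not frames:
        return []
    key_indices = [0]
    standard = frames[0].astype(float)
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float) - standard).sum()
        if diff > threshold:
            key_indices.append(i)
            standard = frames[i].astype(float)
    return key_indices
```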
C. OBJECT TRACKING USING GAUSSIAN MIXTURE MODEL
An adaptive Gaussian mixture model is employed to detect objects.
This model can also lessen the effect of small repetitive motions, for example moving
vegetation like trees and bushes, as well as small camera displacement.
Background Extraction
The Gaussian Mixture Model (GMM) is utilized to represent the background in RGB color
space due to its effectiveness. For an observed pixel, its probability is estimated by K
Gaussian distributions. To update the model, each pixel value in a new video frame is
processed to determine if it matches any of the existing Gaussian distributions at that pixel's
location. If a match is confirmed for one of the distributions, a Boolean variable is set. In
each frame, the weighting factor for each distribution is updated, where 'a' is the learning
rate that controls the speed of learning. Pixels that are dissimilar to the background Gaussian
distributions are classified as belonging to the foreground. In our experiments, the parameters
were empirically set as K = 5, a = 0.005, and the variance when initializing a new Gaussian
mode = (30/255)^2.
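A minimal single-pixel sketch of this update in Python/NumPy follows, using the parameters the text gives (K = 5, a = 0.005, initial variance (30/255)²). The 2.5σ match rule and the ρ = a/ω update used here are common variants of the per-component update, assumed for illustration; they are not necessarily the exact rules of the paper's Matlab implementation.

```python
import numpy as np

K = 5                       # Gaussians per pixel (from the text)
ALPHA = 0.005               # learning rate 'a' (from the text)
INIT_VAR = (30 / 255) ** 2  # variance of a newly initialized Gaussian (from the text)

def update_pixel(x, w, mu, var):
    """One GMM update step for a single grey pixel value x in [0, 1].

    w, mu, var (length-K arrays) are modified in place. Returns True if the
    pixel is classified as foreground, i.e. it matched no existing Gaussian.
    """
    match = np.abs(x - mu) < 2.5 * np.sqrt(var)     # 2.5-sigma match test
    if match.any():
        k = int(np.argmax(match))                   # first matching Gaussian
        rho = min(1.0, ALPHA / w[k])                # assumed per-component rate
        mu[k] += rho * (x - mu[k])
        var[k] += rho * ((x - mu[k]) ** 2 - var[k])
        w += ALPHA * (match.astype(float) - w)      # raise matched weights, decay rest
    else:
        k = int(np.argmin(w))                       # replace least probable Gaussian
        mu[k], var[k], w[k] = x, INIT_VAR, ALPHA
    w /= w.sum()                                    # renormalize the weights
    return not bool(match.any())
```

Applied independently to every pixel of a frame, this yields the binary foreground mask from which vehicle blobs are taken.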
D. FEATURE EXTRACTION
The vehicles are extracted from the video at a point where they are prominent. In the
proposed method, a range value is specified within which vehicles are clearly visible. The
row value of the centroid of the bounding box (i.e. its position) is compared with this range.
Vehicles within the bounding box are extracted and stored in a .mat file along with the area
and frame number.
1. A range is set within which the bounding boxes of the vehicles are extracted when they
come within range. The range is set such that the view of the camera is uniform.
2. Since vehicles do not move uniformly, the same vehicle may be detected more than once.
3. An array holds the previous 8-10 vehicles extracted. We observed through experimental
analysis that, for our dataset, an object may be extracted more than once within at most
the previous 10 vehicles.
4. A detected object is thus compared to the previously extracted vehicles using the
Euclidean distance norm (taking the RGB values of the extracted portion of the vehicle).
5. The area (number of pixels within the blob) and the image of the cropped portion of the
car are stored in a .mat file.
Here, the colors red, black and white are considered as the color categories. In order to find
out which of these is the dominant color in the image, the cropped image of the vehicle is
divided into 8x8 blocks and each block's centroid pixel's HSL color value is extracted. This
is similar to the extraction of color in [11]. We have approximated the HSL values for each
color category. Low values of luminance are closer to black and higher values are closer to
white. An approximate hue and saturation value range is associated with each color of
interest. Starting with the lightness value of the centroid pixel of each block, a histogram is
computed for the values using the approximate ranges. The values which satisfy a particular
range of luminance for a color are compared with the saturation range, and those satisfying
the saturation range of the same color are compared with the hue range of that particular
color. The color whose count is significant across the blocks is considered the dominant
color. Thus, using the blocks' centroid color values, we obtain the dominant color of an
image. An index value is assigned to each of the vehicles extracted and stored in a file.
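A sketch of this block-centroid colour test in Python (the paper's implementation is in Matlab). The specific lightness, saturation and hue thresholds below are illustrative assumptions, not the paper's approximated ranges:

```python
import colorsys
import numpy as np

def dominant_color(img):
    """Classify an RGB crop (H x W x 3, floats in [0,1]) as 'red', 'white' or
    'black' from the HSL values of the centre pixel of each of the 8x8 blocks.

    The thresholds are assumed for illustration: lightness first, then
    saturation, then hue, as described in the text.
    """
    h, w, _ = img.shape
    counts = {"red": 0, "white": 0, "black": 0}
    bh, bw = max(h // 8, 1), max(w // 8, 1)
    for by in range(8):
        for bx in range(8):
            cy = min(by * bh + bh // 2, h - 1)   # block centroid pixel
            cx = min(bx * bw + bw // 2, w - 1)
            r, g, b = img[cy, cx]
            hue, light, sat = colorsys.rgb_to_hls(r, g, b)  # note the HLS order
            if light < 0.2:                      # low luminance -> black
                counts["black"] += 1
            elif light > 0.8 and sat < 0.25:     # high luminance, washed out -> white
                counts["white"] += 1
            elif sat > 0.4 and (hue < 1 / 12 or hue > 11 / 12):  # hue near 0 deg
                counts["red"] += 1
    return max(counts, key=counts.get)
```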
E. CLUSTERING
Clustering is performed on the gathered samples based on the size and color of the vehicle.
K-means and Self-Organizing Maps are the two unsupervised clustering methods used.
Self-organizing maps differ from other artificial neural networks in that they use a
neighborhood function to preserve the topological properties of the input space. This makes
SOMs useful for visualizing low-dimensional views of high-dimensional data, akin to
multidimensional scaling. The model was first described as an artificial neural network by
the Finnish professor Teuvo Kohonen, and is sometimes called a Kohonen map or network.
K-means clustering is a method of cluster analysis which aims to partition 'n'
observations into 'k' clusters in which each observation belongs to the cluster with the
nearest mean, using the Euclidean distance formula.
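For the one-dimensional area feature, the k-means step can be sketched in a few lines of Python/NumPy (the paper uses Matlab routines). The quantile initialization is an assumption made here so the run is deterministic:

```python
import numpy as np

def kmeans_1d(values, k=3, iters=50):
    """Tiny 1-D k-means, sketching the size-clustering step.

    Centres are initialized at the minimum, median and maximum of the data.
    Returns labels relabelled so that 0=small, 1=medium, 2=large, plus the
    sorted cluster centres.
    """
    x = np.asarray(values, dtype=float)
    centers = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(iters):
        # assign each observation to its nearest centre (Euclidean distance)
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    order = np.argsort(centers)              # relabel clusters by increasing centre
    remap = np.empty(k, dtype=int)
    remap[order] = np.arange(k)
    return remap[labels], centers[order]
```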
Clustering is performed on the input data set based on area, dividing it into 3 classes
(small, medium, large). For SOM, 200 epochs are used.
A threshold is set and the actual classification into small, medium and large is
determined. The same approach is applied for colour as well. The matching matrix is then
calculated.
The results of the clustering are stored in .mat files, which are used for retrieval.
F. VEHICLE RETRIEVAL
The user is provided with a graphical interface. The user's selection is matched with the
values stored in the .mat files, and the output is displayed to the user. Hence this approach is
quick and simple.
V. EXPERIMENTAL ANALYSIS AND RESULTS
Matlab is used for our implementation. The area of a vehicle is obtained by
calculating the area of its blob. This approach chooses a fixed range within which the areas
of the blobs are captured while processing the video. The Kohonen map and k-means
algorithms are used to perform clustering on the input dataset, with 200 epochs.
The test dataset consists of a video of 4321 frames. After shot boundaries have
been detected and key frames have been extracted, the total number of frames is brought
down from 4321 to 3222.
A threshold is set and the actual classification into small, medium and large is
determined. The approach used for calculating the actual class of the dataset is as follows.
The median value of the area is taken as M. Small-sized vehicles fall below 1.4*M (=a),
medium-sized between 1.4*M and 3.5*M (=b), and the rest are large-sized vehicles. The
difference between the 'a' and 'b' values must be at least 2.5 for a sample size (number of
vehicles) of 50 and above.
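The ground-truth size rule above reduces to a simple threshold function (a Python sketch of the rule as stated, with a = 1.4*M and b = 3.5*M):

```python
def size_class(area, m):
    """Actual size class of a vehicle from its blob area, using the thresholds
    in the text: a = 1.4*M and b = 3.5*M around the median area M."""
    if area < 1.4 * m:
        return "small"
    if area < 3.5 * m:
        return "medium"
    return "large"
```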
Experimental results obtained are shown in Table 1. Table 2 shows the matching
matrix for SOM and k-means. The results of the clustering are stored in .mat files, which are
used for retrieval. Depending on the user's selection, the vehicles are displayed.
PERFORMANCE MEASURES
ACCURACY
Accuracy is the overall correctness of the model and is calculated as the sum of
correct classifications divided by the total number of classifications.
Accuracy = P/C (4)
where P is the total number of correct classifications, and C is the total number of
classifications.
PRECISION
Precision is a measure of the accuracy, provided that a specific class has been predicted. It is
defined by:
Precision = tp/(tp + fp) (5)
where tp and fp are the numbers of true positive and false positive predictions for the
considered class. The result is always between 0 and 1.
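Both measures follow directly from a matching matrix whose rows are actual classes and whose columns are predicted classes; a Python sketch (the paper's implementation is in Matlab):

```python
import numpy as np

def metrics(cm):
    """Accuracy (Eq. 4) and per-class precision (Eq. 5) from a matching
    (confusion) matrix: rows = actual class, columns = predicted class."""
    cm = np.asarray(cm, dtype=float)
    accuracy = np.trace(cm) / cm.sum()        # correct / total classifications
    precision = np.diag(cm) / cm.sum(axis=0)  # tp / (tp + fp) per predicted class
    return accuracy, precision
```

Applied to the SOM matching matrix for the 100-vehicle sample in Table 2, this reproduces the reported 97% accuracy and 55/58 small-class precision.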
Vehicles ->           32                 54                 75                 100
                  SOM     K-means    SOM     K-means    SOM     K-means    SOM      K-means
Accuracy          29/32   29/32      50/54   50/54      66/75   66/75      97/100   85/100
                  90.62%  90.62%     92.59%  92.59%     88%     88%        97%      85%
Precision Small   18/19   18/19      31/35   31/35      42/51   42/51      55/58    54/69
                  94.73%  94.73%     88.57%  88.57%     82.35%  82.35%     94.82%   78.26%
Precision Medium  5/5     5/5        12/12   12/12      17/17   17/17      35/35    24/24
                  100%    100%       100%    100%       100%    100%       100%     100%
Precision Large   6/8     6/8        7/7     7/7        7/7     7/7        7/7      7/7
                  75%     75%        100%    100%       100%    100%       100%     100%
Table 1: Experimental results
                               SOM                        K-means
Class after Clustering ->  Small   Medium   Large     Small   Medium   Large
Actual Class
Small                      55      0        0         54      0        0
Medium                     3       35       0         15      24       0
Large                      0       0        7         0       0        7
Table 2: Matching matrix for SOM and K-means for a sample size of 100
In regard with the colour identification of the vehicle, for a sample of 100 vehicles,
Accuracy = 90/100 = 90%
Precision Red = 6/8 = 75%
Precision White = 34/39 = 87.17%
Precision Grey = 50/53 = 94.33%
Fig. 2: Size of vehicle chosen as 'medium' and color as 'white'; the output is displayed.
VI. CONCLUSION
As shown in the results, the combination of steps involved is unique, and the vehicle
extraction technique applied is exclusive to this implementation. The discussed method
adopts content-based video retrieval for surveillance videos, and the results show that it is an
efficient, simple and quick method. In the future, we aim to detect different types of objects,
incorporating more features for indexing.
REFERENCES
[1] Weiming Hu, Nianhua Xie, Li Li, Xianglin Zeng, Stephen Maybank, "A Survey on Visual
Content-Based Video Indexing and Retrieval," IEEE Transactions on Systems, Man, and
Cybernetics—Part C: Applications and Reviews, Vol. 41, No. 6, November 2011.
[2] B. D. Lucas and T. Kanade, "An iterative image registration technique with an
application to stereo vision," in IJCAI81, pp. 674–679, 1981.
[3] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence,
vol. 17, pp. 185–203, August 1981.
[4] B. K. P. Horn and B. G. Schunck, "Determining optical flow: A retrospective," Artificial
Intelligence, vol. 59, pp. 81–87, Feb. 1993.
[5] T. Can and P. Duygulu, "Searching for repeated video sequences," in Proc. ACM
MIR'07, Augsburg, Germany, Sept. 28–29, 2007.
[6] X. Yang, P. Xue, and Q. Tian, "Automatically discovering unknown short video repeats,"
in Proc. ICASSP'07, Hawaii, USA, 2007.
[7] J. S. Boreczky and L. A. Rowe, "Comparison of video shot boundary detection
techniques," in Proc. SPIE Conf. on Vision, Communication, and Image Processing.
[8] Katharina Quast, Matthias Obermann and André Kaup, "Real-Time Moving Object
Detection in Video Sequences Using Gaussian Mixture Models," Multimedia
Communications and Signal Processing, University of Erlangen-Nuremberg.
[9] Chang Liu, George Chen, Yingdong Ma, Xiankai Chen, Jinwu Wu, “A System for
Indexing and Retrieving Vehicle Surveillance Videos,” 4th International Congress on Image
and Signal Processing, 2011.
[10] P.M. Daigavane, Preeti R. Bajaj, M.B. Daigavane, “Vehicle Detection and Neural
Network Application for Vehicle Classification,” International Conference on Computational
Intelligence and Communication Systems, 2011.
[11] M. Babu Rao et al., "Content Based Image Retrieval using dominant color, texture and
shape," International Journal of Engineering Science and Technology (IJEST), 2011.
[12] Zhongwei Guo, Gang Wang, Renan Ma, Yulin Yang, "Battlefield Video Target
Mining," 3rd International Congress on Image and Signal Processing, 2010.
[13] Manimala Singha, K. Hemachandran, "Content Based Image Retrieval using Color and
Texture," Signal & Image Processing: An International Journal (SIPIJ), Vol. 3, No. 1,
February 2012.