Action event retrieval from cricket video using audio energy feature for event


International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 4, July-August (2013), pp. 267-274, © IAEME, www.iaeme.com/ijcet.asp

ACTION EVENT RETRIEVAL FROM CRICKET VIDEO USING AUDIO ENERGY FEATURE FOR EVENT SUMMARIZATION

Vilas Naik, Prasanna Patil, Vishwanath Chikaraddi
Department of CSE, Basaveshwar Engg College, Bagalkot, India.

ABSTRACT

Content-Based Video Retrieval (CBVR) is an active research discipline focused on computational strategies that analyse multimodal content in video (visual, audio, text) to browse, index, and search for relevant videos. However, finding a desired video or event in a large video database remains a challenging and time-consuming task, so efficient video and event retrieval becomes ever more important. We present an audio-based approach for event retrieval from sports video, shown to be effective when applied to cricket videos. The approach retrieves action events based on the audio level of a batsman's played shot and the loud cheering of the audience in response during a cricket match. These audio symbols can be retrieved by measuring the audio level and pattern, which is normally higher than the regular audio level, using audio energy features. The experiments conducted and the results analyzed reveal that the mechanism can be efficiently used on cricket video to extract events such as power stroke actions and crowd cheer from the stadium.

Keywords: Adaptive Thresholding, Audio Energy, Event Retrieval, MFCC, Short Time Energy, Video Summarization, Zero Crossing Rate.

1. INTRODUCTION

Sports video distribution over various networks should contribute to quick adoption and widespread use of multimedia services worldwide, because processing operations such as browsing, indexing, summarization and retrieval make it possible to deliver sports video over narrow-band networks such as the Internet and wireless. The amount of content created daily by TV channels around the world can hardly be measured: on-site news shooting, in-studio programs, sports event broadcasts, in-house produced films, serials, documentaries, and other productions are broadcast every day to the homes of billions. This increased rate of generation and distribution of audiovisual content
created a new problem of content management. Clearly, when accessing lengthy, voluminous video programs, the ability to reach the highlights and skip the less interesting parts saves not only the viewer's time but also data-downloading/airtime costs when the viewer receives videos wirelessly from remote servers. Moreover, it would be very attractive if users could access and view the content based on their own preferences. To realize these needs, the source video has to be tagged with semantic labels. These labels must be broad enough to cover the general events in the video, e.g., the shot of a batsman, the wicket-falling sound when the ball hits the wicket, audience shouting when the batsman misses a shot, and audience cheering. This is a very challenging task and requires exploiting multimodal and multi-context approaches. Video content can be accessed using either a top-down or a bottom-up approach. The top-down approach, video browsing, is useful when one needs to get the essence of the content. The bottom-up approach, video retrieval, is useful when one knows exactly what one is looking for in the content. In video summarization, what "essence" the summary should capture depends on whether the content is scripted. Considerable progress has been made in multimodal analysis, video representation, summarization, browsing and retrieval, the five fundamental bases for accessing video content. The first three focus on meta-data generation and organization, while the last two focus on meta-data consumption. Multimodal analysis deals with the signal-processing part of the video system, including shot boundary detection, key frame extraction, key object detection, audio analysis, closed caption analysis, etc.
Video representation is concerned with the structure of the video. Again, it is useful to have different representations for scripted and unscripted content. Built on top of the video representation, video summarization, based either on ToC generation or on highlights extraction, deals with how to use the representation structure to give viewers top-down access to the content for browsing. Finally, video retrieval is concerned with retrieving specific video objects. For today's video content, techniques are urgently needed for automatically (or semi-automatically) constructing video tables of contents, highlights and indices to facilitate summarization, browsing and retrieval. The proposed work employs the energy levels of MFCC coefficients of audio samples to find the knocking sound of the bat hitting the ball, detecting a batsman's stroke action followed by cheers from the spectators in the ground. The remaining part of the work is organized into five sections. Section 2 presents the related work and the background of the algorithms used. Section 3 gives a detailed description of video characterization and audio features. Section 4 describes the proposed algorithm. Section 5 discusses the results obtained. Section 6 brings up the conclusion.

2. RELATED WORK

Research towards the automatic detection and retrieval of events in sports video data has attracted a lot of attention in recent years. Sports video analysis, event/highlight extraction and summarization are among the most important research topics. A review of the literature was conducted and its summary is presented here. [1] describes a new subband audio coder in which the quantization strategies of transform coders, employed in many modern audio coding standards, are extended by incorporating run-length and arithmetic encoders.
In [2], the authors note that a necessary capability for content-based retrieval is supporting the query-by-example paradigm, and present an algorithm for matching multimodal (audio-visual) patterns for content-based video retrieval. A generalized sound recognition system that uses reduced-dimension log-spectral features and a minimum-entropy hidden Markov model classifier [3] addresses the major challenges of generalized sound recognition. In [4], the authors focus on the use of Hidden Markov Models (HMMs) for structure analysis of videos and demonstrate how they can be efficiently applied to merge audio and visual cues. The exploitation of features from multiple modalities, namely audio, video, and text, is described in [5]. Concept representations are modeled
using Gaussian mixture models (GMM), hidden Markov models (HMM), and support vector machines (SVM). In [6], NVRS, a system convenient for users to quickly browse and retrieve news video by categories such as politics, finance, amusement, etc., is presented. In [7], the author reveals that audio cues play an important role in inferring video semantics, using SVM. The author of [8] addresses the problem of bridging the semantic gap in the sports domain and implemented an MPEG-7 compliant browsing system for semantic retrieval and summarization of sports video. A CBIR system for use in a psychological study of the relationship between human movement and Dyslexia is described in [9]; the work presents a novel use of interactive visual and audio cues for attaining this level of indexing performance. The content-based movie parsing and indexing approach of [10] analyzes both audio and visual sources and accounts for their interrelations to extract high-level semantic cues. The aural and visual cues described in [11] can be automatically extracted from video and used to index its contents. In [12], the proposed work is an audio-visual feature based framework for event detection in broadcast video of multiple different field sports. Methods of segmenting, visualizing, and indexing presentation videos by separately considering audio and visual data are investigated in [13]. In [14], Semantic Retrieval of Video, the authors review research on three types of video: video of meetings, movies and broadcast news, and sports video. In [15], the authors present different complementary approaches for assessing semantic relevance in video retrieval, such as adaptive video indexing and elemental concept indexing. The related work reveals that the audio pattern can be a prominent cue for identifying significant events in sports video; based on these audio patterns, the requested events can be detected or classified.

3. VIDEO CHARACTERIZATION AND AUDIO CUE BASED EVENT RETRIEVAL

Video characterization is the process of understanding the syntax and semantics of a video sequence; the procedure is an important part of most video processing tasks like segmenting, retrieving, summarization and indexing. In sports video the audio signal mainly comes from the speech and sounds of the commentator, the audience, whistling, and the environment. Therefore, we first extract some low-level features that have been used successfully in speech analysis and then test whether they give good results for audio signal analysis in sports video.

Zero Crossing Rate (ZCR) and Short Time Energy (STE) - In the context of discrete-time signals, a zero crossing is said to occur if successive samples have different algebraic signs. The rate at which zero crossings occur is a simple measure of the frequency content of a signal; the average zero-crossing rate gives a reasonable way to estimate the frequency of a sine wave. Zero crossing is suitable for narrowband signals, but audio signals may include both narrowband and broadband components. For audio signals, short-time energy is an essential parameter for distinguishing silence clips from non-silence clips: the short-time energy values of silence clips are remarkably lower than those of non-silence clips. The short-time average zero-crossing rate is another effective measurement to differentiate silence clips from non-silence clips, as silence clips have much smaller ZCR values than non-silence clips. ZCR is defined formally, for a frame of N samples x[m], as

ZCR = (1 / 2N) * SUM from m = 1 to N-1 of | sgn(x[m]) - sgn(x[m-1]) |, where sgn(x) = 1 for x >= 0 and -1 for x < 0.

Mel-Frequency Cepstral Coefficient (MFCC) - MFCC features are a natural choice for recognition tasks based on audio features. To extract MFCC features, input audio is divided into overlapping
frames of duration 30 ms, with 10 ms overlap between consecutive frames. Each frame is then multiplied by a Hamming window function:

w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)), 0 <= n <= N - 1,

where N is the number of samples in the window. After performing an FFT on each windowed frame, the MFCCs are calculated using the following discrete cosine transform:

C_n = SUM from i = 1 to K of log(S_i) * cos[ n * (i - 1/2) * pi / K ], n = 1, 2, ..., L,

where K is the number of sub-bands and L is the desired length of the cepstrum. The S_i, 1 <= i <= K, represent the filter bank energies after passing through the triangular band-pass filters. Figure 3.1 summarizes the MFCC extraction process.

Figure 3.1: MFCC feature extraction

4. PROPOSED ALGORITHM AND IMPLEMENTATION

The solution implementation contains three important modules: a module for extracting the audio and video data from the sports video stream, a Matlab code module to detect peaks in the audio, and the action event retrieval itself. The proposed algorithm is described in the following steps:

Step 1: Read the input video stream and separate the audio track and video manually using external tools. The separated audio track and video are stored in the Matlab module.
Step 2: The number of samples / sampling frequency of the separated audio is calculated.
Step 3: Calculate the MFCC coefficients for each second of audio. MFCC extraction includes these sub-steps: divide the signal into frames; for each frame, take the Fourier transform; map onto the Mel spectrum; take the logarithm; take the discrete cosine transform (DCT).
Step 4: The function wenergy() is used to calculate wave energy values for each second using the MFCC values obtained previously.
Step 5: Store all energy values in an array and plot them to get a graphical view.
Step 6: Using an adaptive threshold, determine the high-peak action event locations where the energy value is adaptively high.
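The low-level STE and ZCR features described in Section 3 can be sketched as follows. This is an illustrative Python/NumPy port, not the paper's Matlab code:

```python
import numpy as np

def short_time_energy(frame):
    """Short Time Energy (STE): mean squared amplitude of one frame."""
    return float(np.mean(frame.astype(float) ** 2))

def zero_crossing_rate(frame):
    """ZCR: fraction of successive sample pairs with different signs."""
    signs = np.sign(frame)
    signs[signs == 0] = 1              # treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

# Sanity check on a synthetic signal: a 100 Hz sine sampled at 8 kHz
# changes sign about 200 times per second, i.e. on ~200/8000 of the
# sample pairs, while silence has zero energy and zero crossings.
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 100 * t)
silence = np.zeros(fs)

print(zero_crossing_rate(tone))    # ~0.025
print(short_time_energy(tone))     # ~0.5
print(short_time_energy(silence))  # 0.0
```

As the text notes, the high ZCR/STE contrast between silence and activity is what makes these features cheap discriminators before any MFCC analysis.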
Around 25-50 frames before and after the audio peak are considered for action event retrieval. Steps 1-6 are repeated over fixed-size blocks of the input data for the complete sports video stream.
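Steps 2-6 above can be sketched as follows. This is a simplified Python stand-in, not the authors' Matlab implementation: plain per-second log-energy replaces the MFCC/wenergy() chain, and "energy above mean + k*std" is one plausible reading of the adaptive threshold, whose exact rule the paper does not spell out:

```python
import numpy as np

def per_second_energy(audio, fs):
    """Steps 2-4, simplified: one energy value per second of audio.
    (The paper derives energy from MFCC coefficients via wenergy();
    plain log-energy stands in for that chain here.)"""
    n_sec = len(audio) // fs
    secs = audio[:n_sec * fs].reshape(n_sec, fs).astype(float)
    return np.log10(np.sum(secs ** 2, axis=1) + 1e-12)

def adaptive_peaks(energy, k=1.5):
    """Step 6, one plausible rule: flag seconds whose energy exceeds
    mean + k * std of the track. The factor k is a tuning assumption."""
    threshold = energy.mean() + k * energy.std()
    return np.where(energy > threshold)[0]

# Synthetic 10-second track: quiet noise plus a loud burst at second 6,
# standing in for a bat-on-ball knock followed by crowd cheer.
rng = np.random.default_rng(0)
fs = 8000
audio = 0.01 * rng.standard_normal(10 * fs)
audio[6 * fs:7 * fs] += np.sin(2 * np.pi * 440 * np.arange(fs) / fs)

energy = per_second_energy(audio, fs)
print(adaptive_peaks(energy))   # the loud second(s)
```

The per-second granularity matches Step 3's "MFCC coefficients for each second of audio"; only the energy measure and threshold rule are swapped in for illustration.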
5. EXPERIMENTATION AND RESULT ANALYSIS

The proposed algorithm is implemented in Matlab 2012a with the default optimization turned on. The system used for the experimentation has an Intel i5 2.66 GHz processor with 4 GB of RAM, running the Windows 7 Ultimate operating system. The cricket videos used in the experiment were collected from the Internet. The data set comprises 10 videos, each about 30 MB in size. The proposed model accepts input video in ".avi" form, though the algorithm can also work on other video formats. The results are explained below.

Figure 5.1: Input video clip

The video sample is a cricket clip in which the player Bravo hits three stunning sixes and one boundary, and the video shows players and audience celebrating the moment. The sample has audio variations, and manual examination reveals the maximum audio amplitude during the hit and the cheering. The solution is implemented by separating the input cricket video stream into video and audio streams and then analyzing the audio stream for the peak values present. Using the adaptive threshold technique, the peak values and the corresponding video frame numbers are calculated for all audio samples using wave energy.

Figure 5.2: Wave Energy of an audio track

The frames corresponding to each peak are then extracted and stored as peak frames. This frame-detection based approach shows excellent detection accuracy and also saves processing time. The algorithm yields many action events, each containing 25 frames from the input video.
Figure 5.3: Peak Frames Detected for the given video

Meanwhile, all other action events of 25 frames where the audio peak is high are extracted. The frames retrieved indicate an action event whose wave energy peak is at a high level: loud cheering of the audience, the shot of a batsman, or possibly the fall of a wicket. With these frames, the frames one second earlier and one second later are clubbed together to form an action video clip.

Figure 5.4: Action Events Retrieved for the given video

For evaluation we selected a few video clips downloaded from different datasets; the preprocessing step of the algorithm is omitted. This section presents quantitative results on the performance of the action event detection and retrieval system. Table 5.1 gives an overview of the tests performed: the total number of frames, the number of frames retrieved, the size and length of each video, and finally the fidelity, i.e., the number of significant action events retrieved from the original video.
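Clubbing one second of frames on either side of a peak is simple index arithmetic. A sketch, assuming a 25 fps video (the paper's 25-frame events suggest this rate, but it is not stated explicitly):

```python
def action_clip_range(peak_frame, fps=25, total_frames=None):
    """Club one second of frames before and after the peak frame into
    one action clip. fps=25 is an assumption; clamp to video bounds."""
    start = max(0, peak_frame - fps)
    end = peak_frame + fps
    if total_frames is not None:
        end = min(end, total_frames - 1)
    return start, end

print(action_clip_range(150))                       # (125, 175)
print(action_clip_range(10))                        # clipped at start: (0, 35)
print(action_clip_range(3570, total_frames=3577))   # clipped at end: (3545, 3576)
```

The clamping matters for peaks near the start or end of the stream, such as the last ball of the sample clip.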
Table 5.1: Experimental results of the proposed algorithm on cricket video

Name of the File                                  | Number of Frames | Size (MB) | Length (min) | Frames Retrieved | Total Action Events
sample.avi (Global thresholding)                  | 3577             | 28.1      | 2:23         | 75               | 28
sample.avi (Adaptive thresholding with 5 frames)  | 3577             | 28.1      | 2:23         | 75               | 24
sample.avi (Adaptive thresholding with 10 frames) | 3577             | 28.1      | 2:23         | 75               | 28

This section gave a detailed description of the experiments conducted to test and evaluate the proposed methodology for action event retrieval. The experimental results are indeed encouraging, showing significant efficiency in event retrieval with 86% accuracy. It also described how the peak frames are formed into a video file in .avi format, verified by manual investigation of the input cricket video.

6. CONCLUSION

An algorithm for the retrieval of batsman stroke actions and audience cheer events in cricket video was designed and tested on a sufficient number of cricket clips. The algorithm is implemented in Matlab 2012a and executed on an Intel i5 2.66 GHz processor with 4 GB of RAM. It extracts the batsman's stroke action by detecting the knocking sound of the bat hitting the ball in broadcast cricket video, followed by detection of audience cheers, via the energy level of the audio in the MFCC domain. The event retrieval using audio cues was experimented with successfully, and the audio pattern based on wave energy features has shown that a particular event in a video can be identified by its peculiar audio pattern and parameters.
The presented algorithm first separates the audio content from the video via a software tool, then extracts the audio samples per frame, calculates the MFCC coefficients and wave energy values for the whole audio segment, and, after applying adaptive thresholding, finds the peak frames; the event video is composed from the frames around each audio peak. The experimental results are indeed encouraging and show significant efficiency in event detection, at 86% for an audio track.

7. REFERENCES

[1]. Henrique Malvar, "Enhancing the Performance of Subband Audio Coders for Speech Signals," in Proc. of IEEE International Symposium on Circuits and Systems, Monterey, CA, June 1998.
[2]. Milind R. Naphade, Roy Wang and Thomas S. Huang, "Multimodal Pattern Matching for Audio-Visual Query and Retrieval," Department of Electrical and Computer Engineering, Beckman Institute for Advanced Science and Technology, University of Illinois, Urbana-Champaign, 2001.
[3]. Michael A. Casey, "Reduced-Rank Spectra and Minimum-Entropy Priors as Consistent and Reliable Cues for Generalized Sound Recognition," MERL, Cambridge Research Laboratory, 2001.
[4]. E. Kijak, G. Gravier, P. Gros, L. Oisel and F. Bimbot, "HMM Based Structuring of Tennis Videos Using Visual and Audio Cues," Thomson multimedia R&D, Cesson-Sevigne, France, 2003.
[5]. W. H. Adams, Giridharan Iyengar, et al., "Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues," EURASIP Journal on Applied Signal Processing, 2, 1-16, 2003.
[6]. Huayong Liu, Tingting He, "A Content-Based News Video Retrieval System: NVRS," Department of Computer Science, Central China Normal University, Wuhan 430079, PR China, 2004.
[7]. Min Xu, Ling-Yu Duan, Liang-Tien Chia, "Audio Keyword Generation for Sports Video Analysis," School of Computer Engineering, Nanyang Technological University, Singapore, October 10-16, 2004.
[8]. Baoxin Li, James H. Errico, "Bridging the Semantic Gap in Sports Video Retrieval and Summarization," SHARP Laboratories of America, Camas, WA, USA, 2004.
[9]. L. Joyeux, E. Doyle, H. Denman, "Content Based Access for a Massive Database of Human Observation Video," in Proc. of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, 46-52, 2004.
[10]. Ying Li, "Content-Based Movie Analysis and Indexing Based on Audiovisual Cues," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 8, August 2004.
[11]. Michael G. Christel, Chang Huang, Neema Moraveji, and Norman Papernick, "Exploiting Multiple Modalities for Interactive Video Retrieval," Carnegie Mellon University, Pittsburgh, 2004.
[12]. David A. Sadlier and Noel E. O'Connor, "Event Detection in Field Sports Video Using Audio-Visual Features and a Support Vector Machine," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 10, October 2005.
[13]. Alexander Haubold and John R. Kender, "Augmented Segmentation and Visualization for Presentation Videos," Department of Computer Science, Columbia University, New York, 2006.
[14]. Ziyou Xiong, Xiang Zhou, Qi Tian, Yong Rui, and Thomas S. Huang, "Semantic Retrieval of Video," United Technologies Research Center, East Hartford, 2006.
[15]. Jose A. Lay, Paisarn Muneesawang, Tahir Amin and Ling Guan, "Assessing Semantic Relevance by Using Audiovisual Cues," International Journal of Information and Systems Sciences, Volume 3, Number 3, pp. 420-427, 2007.
[16]. Reeja S R and N. P Kavya, "Motion Detection for Video Denoising - The State of Art and the Challenges," International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 518-525.
[17]. Reshma R. Gulwani and Sudhirkumar D. Sawarkar, "Video Indexing using Shot Boundary Detection Approach and Search Tracks in Video," International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 432-440.
[18]. Vilas Naik and Raghavendra Havin, "Entropy Features Trained Support Vector Machine Based Logo Detection Method for Replay Detection and Extraction from Sports Videos," International Journal of Graphics and Multimedia (IJGM), Volume 4, Issue 1, 2013, pp. 20-30.
