Multi-modal analysis of Movies for Rhythm Extraction
Devon Bates and Arnav Jhala
Computational Cinematics Studio
University of California Santa Cruz
{dlbates, jhala}@soe.ucsc.edu
Abstract
This paper motivates a multi-modal approach for analysis of aesthetic elements of films through integration of visual and auditory features. Prior work in characterizing aesthetic elements of film has predominantly focused on visual features. We present a comparison of analyses from multiple modalities in a rhythm extraction task. For detection of important events based on a model of rhythm/tempo, we compare analysis of visual features and auditory features. We introduce an audio tempo function that characterizes the pacing of a video segment and compare this function with its visual pace counterpart. Preliminary results indicate that an integrated approach could reveal more semantic and aesthetic information from digital media. With combined information from the two signals, tasks such as automatic identification of important narrative events can enable deeper analysis of large-scale video corpora.
Introduction
Pace or tempo is an aesthetically important aspect of performance art. In film, pace is created through deliberate decisions by the director and editors to control the ebb and flow of action in order to manipulate the audience's perception of time and narrative engagement. Unlike the more literal definition concerning physical motion, pace in film is an intangible quantity that is difficult to measure. Some media forms, like music, have pace as an easily identifiable feature: music has a beat which, along with other aspects, makes pace readily recognizable. Modeling pace is important for several areas of film analysis. Pace has been shown to be a viable semantic and stylistic indicator that can be used to detect a film's genre, and it can also be useful for scene and story-section detection. Pace adds a level of film analysis that opens more possible ways of indexing and classifying films. Film as a medium is distinct from music in that it is a combination of audio and visual components. While we tend to focus consciously on the visual components of a film, there is significant information in the audio signal that is often overlooked in analysis. While the video components of pace are important in shaping the audience's perception of pace, the audio components are also very important, particularly in signifying the narrative importance of events.

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
The visual pace of a film is determined by the content and movement of a shot as well as its length. In other words, the tempo is higher for shots that are shorter. That effect can be seen especially in movies by Michael Bay, who has developed a style involving rapid cuts between shots; that fast-tempo style is most often used in action movies. Another aspect of a shot that contributes to pace is the amount of motion in the frame during the shot's duration. To illustrate the effect of motion on the pace or tempo of a shot, imagine the same scene of two people talking filmed in two takes: in one, a character paces nervously while speaking, and in the other both characters are stationary. If the two takes are identical in every other way, the first would be perceived to have a faster tempo because of the additional motion in the frame. For automatic tempo detection, visual pace is a useful measure. The main factors of visual pace, shot length and motion energy, can be calculated automatically, so pace can be generated without manual annotations.
The visual pace function presented by Adams et al. (?) is a useful tool for automatic detection of the visual pace of a film. The function uses the shot length s in frames and the motion m of each shot n. It also calculates the median med and standard deviation σ of the shot length, and the mean µ and standard deviation σ of the motion, over the entire movie, and uses those to weight the motion and shot length of individual shots. The visual pace function is defined as:

P(n) = α (med_s − s(n)) / σ_s + β (m(n) − µ_m) / σ_m
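As an illustrative sketch (not the authors' code), the pace function can be computed per shot from the shot lengths and motion energies; the function name and the default weights `alpha` and `beta` are assumptions:

```python
import statistics

def visual_pace(shot_lengths, motion, alpha=0.5, beta=0.5):
    """Sketch of the visual pace form: shorter-than-median shots and
    above-average motion both raise a shot's pace value."""
    med_s = statistics.median(shot_lengths)    # med_s over all shots
    sigma_s = statistics.pstdev(shot_lengths)  # sigma_s
    mu_m = statistics.mean(motion)             # mu_m
    sigma_m = statistics.pstdev(motion)        # sigma_m
    return [alpha * (med_s - s) / sigma_s + beta * (m - mu_m) / sigma_m
            for s, m in zip(shot_lengths, motion)]
```

A short shot with high motion then receives the highest pace value, matching the intuition above.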
Auditory Pace Function
We define audio pace similarly to visual pace, using changes in audio characteristics between shots to calculate pace features. The pace function we define utilizes a time-domain characteristic, loudness, and a frequency-domain feature, brightness, to characterize the audio signal. The pace is then calculated with a similar function that compares the changing characteristics. We use L(n) to represent the average loudness and B(n) the brightness of each shot n. The medians med_L and med_B are taken over all shots in the film, and the standard deviations σ_L and σ_B of those characteristics are also used. The audio pace function is structured thus:

P_audio(n) = α (B(n) − med_B) / σ_B + β (L(n) − med_L) / σ_L
Loudness here is the average energy of the signal over the duration of the shot, expressed in decibels; higher loudness corresponds to higher pace. Brightness is the centroid of the spectral frequencies in the frequency domain; the higher the brightness of the signal, the more components we can assume the audio contains, which would indicate higher pace. Many different characteristics or measurements could be used to characterize an audio signal; in this paper we chose a simple model with two features that correspond well to the visual analogue.
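A minimal sketch of the two audio features and the audio pace function, assuming mono sample lists per shot and precomputed spectral magnitudes; the function names and default weights are illustrative:

```python
import math
import statistics

def loudness_db(samples):
    """Average energy of a shot's samples: RMS expressed in decibels."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-12))  # floor avoids log10(0)

def brightness(freqs, magnitudes):
    """Spectral centroid: magnitude-weighted mean frequency."""
    return sum(f * m for f, m in zip(freqs, magnitudes)) / sum(magnitudes)

def audio_pace(B, L, alpha=0.5, beta=0.5):
    """Audio pace per shot, mirroring the visual form with medians."""
    med_B, med_L = statistics.median(B), statistics.median(L)
    sigma_B, sigma_L = statistics.pstdev(B), statistics.pstdev(L)
    return [alpha * (b - med_B) / sigma_B + beta * (l - med_L) / sigma_L
            for b, l in zip(B, L)]
```

Shots that are both louder and spectrally brighter than the film's medians then score the highest audio pace.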
Figure 1: Motion energy per shot. The average motion energy and the shot start frames are all that is needed to calculate the pace function P(n).
Figure 2: Pace function with Gaussian smoothing
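The cut-detection and motion-energy steps described in the processing paragraph below can be sketched as follows. The 4-bin histogram matches the text; the threshold value and function names are illustrative assumptions, and frames are taken to be already-reduced dc images (2-D lists of block averages):

```python
def histogram(img, bins=4, lo=0.0, hi=255.0):
    """Coarse intensity histogram of a dc image (2-D list of averages)."""
    counts = [0] * bins
    for row in img:
        for v in row:
            counts[min(int((v - lo) / (hi - lo) * bins), bins - 1)] += 1
    return counts

def chi2_diff(h1, h2):
    """Chi-squared distance between two frame histograms."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def detect_cuts(dc_frames, threshold):
    """Shot start frames: frame 0, plus every frame whose histogram
    changes sharply from the previous frame."""
    hists = [histogram(f) for f in dc_frames]
    return [0] + [i for i in range(1, len(hists))
                  if chi2_diff(hists[i - 1], hists[i]) > threshold]

def motion_energy(dc_frames, f1, f2):
    """Average frame-to-frame dc-square difference over shot [f1, f2)."""
    total = sum(abs(a - b)
                for i in range(f1 + 1, f2)
                for prev_row, cur_row in zip(dc_frames[i - 1], dc_frames[i])
                for a, b in zip(prev_row, cur_row))
    return total / max(f2 - f1, 1)
```

The chi-squared distance penalizes large per-bin differences more than the plain absolute difference, which helps separate real cuts from gradual lighting drift.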
Audio and Video Processing: We analyzed two film clips at 25 frames/second with frame sizes of 640 px by 272 px. In all, that constituted 157,500 frames for Scream and 136,500 frames for Cabin in the Woods. We imported and processed the frames in sections of 10,000 frames and compiled the results. After importing the frames, the first step was to perform cut detection to find the start frames of each shot in the film. The dc image is the average of n-by-n squares of the original image, and the difference is then calculated from frame to frame. The size of the window must be large enough to suppress noise. We used histograms of the dc image with 4 bins and computed the χ² difference between consecutive frames to measure the magnitude of change at each frame. The number of histogram bins has to be high enough to detect real changes, but not so high that noise is over-detected as change. The average motion energy of a shot is calculated simply from the differences between n-by-n squares of the dc images: if a shot is detected from frame f1 to frame f2, its average motion energy is the sum of differences between dc-image squares from f1 to f2, divided by the number of frames between f1 and f2. The results of the pace function are then analyzed to find the edges with steep slopes. Two levels of edges are detected: the smaller level is used to detect story events, and the larger indicates story sections.

Figure 3: Visual pace results

We present a breakdown of the story sections of Scream by frame range, along with a description of the events in each section, in Table 1 below. The story events detected using the video components of the film are kept as a separate list from those detected using the audio components. It is assumed that some events will be detected by both methods and some by only one. To account for that, we combined story events that occur within the length of the window used for non-maximum suppression, in this case 10 shots. We show details of our results for a section of frames where we manually annotated story events to correspond to the time-stamps extracted from analysis of the
two modes.

Frame Range      Description
1-29939          Murder of first girl and aftermath
29940-43112      Ghostface attacks Sydney
43113-44875      Billy shows up and Sydney suspects that he's Ghostface
44876-79772      Billy gets off the hook and fights with Sydney
79773-102258     All the teens go to the party, where Sydney and Billy make up
102259-116637    Ghostface kills Tatum and Billy
116638-146020    Sydney tries to get away and finds out the truth
146021-150000    Aftermath of the killings is resolved

Table 1: Story sections from Scream

These results are for the story section from shot
792 to shot 1081. This story section contains a number of typical horror elements. It is the final set-piece and showdown between Sydney and the murderer Ghostface: it starts when he reveals himself by stabbing Billy and ends with the reporter Gale Weathers killing Billy to end the rampage.
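The merging rule described earlier, combining audio- and video-detected events that fall within the 10-shot non-maximum-suppression window, might look like this sketch (the A/V tags follow Table 2; the function name is an assumption):

```python
def merge_events(audio_shots, video_shots, window=10):
    """Merge per-modality event lists; events within `window` shots of
    each other collapse into one, tagged 'A', 'V', or 'A/V'."""
    tagged = sorted([(s, "A") for s in audio_shots] +
                    [(s, "V") for s in video_shots])
    merged = []
    for shot, tag in tagged:
        if merged and shot - merged[-1][0] <= window:
            prev_shot, prev_tag = merged[-1]
            if tag not in prev_tag:  # event seen by both modalities
                merged[-1] = (prev_shot, "A/V")
        else:
            merged.append((shot, tag))
    return merged
```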
Figure 5 and the classification results in Table 2 compare the audio pace and the visual pace separately. There are some sections where the two functions overlap, but in other areas they are mirrored: from shots 1 to 200 the functions are almost mirror images, yet they align almost perfectly from shots 850 to 1000. Each method has strengths and weaknesses; neither paints a sufficiently vivid picture of the film without the other. Overall, this preliminary analysis indicates that audio pace wasn't an appropriate way to detect story sections but was useful in detecting story events. Similarly, visual pace was a successful way to detect story events; however, during periods of sustained visual action, it had some difficulty detecting a full set of events. In the horror genre there are often periods of sustained action, making visual pace insufficient to detect a full range of story events.
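The event and section detection described above, smoothing the pace function (cf. Figure 2) and keeping steep slopes that survive non-maximum suppression, can be sketched as follows; the smoothing sigma, thresholds, and function names are illustrative assumptions:

```python
import math

def gaussian_smooth(signal, sigma=2.0):
    """Smooth a 1-D pace signal with a truncated Gaussian kernel."""
    radius = int(3 * sigma)
    kernel = [math.exp(-(i * i) / (2 * sigma * sigma))
              for i in range(-radius, radius + 1)]
    norm = sum(kernel)
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), len(signal) - 1)  # clamp ends
            acc += w * signal[j]
        out.append(acc / norm)
    return out

def steep_edges(pace, threshold, window=10):
    """Shot indices whose slope exceeds `threshold` and is the local
    maximum within +/- `window` shots (non-maximum suppression).
    A small threshold yields story events; a larger one, story sections."""
    slope = [abs(pace[i + 1] - pace[i]) for i in range(len(pace) - 1)]
    return [i for i, s in enumerate(slope)
            if s > threshold
            and s == max(slope[max(i - window, 0):i + window + 1])]
```

Running `steep_edges` twice with a small and a large threshold reproduces the two-level scheme: the small threshold marks candidate story events, the large one the coarser story-section boundaries.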
Pace is an important feature for detecting story structure within a film. Our techniques were limited to detecting the structure of the film without any underlying semantic meaning: pace can detect events, but not necessarily the tone of those events. It is easy to see how the basic technique of finding pace could be expanded to not only detecting story events and sections but also characterizing them, which would be an invaluable tool for automatic video annotation.
Discussion
This is a preliminary study comparing models of aesthetic elements in film built from diverse techniques in visual and audio processing. This is worthwhile because of the richness and diversity of information that could be gained from integrating information across these modalities.

Figure 4: Audio pace results

Audio processing could be better integrated with video due to the similarity of the underlying signal-processing and modeling techniques, but much work remains in that area for any semantic and narrative-based analysis of audio. In this work we focused on low-level audio features. Much work in the audio-analysis community has been done on deeper analysis of audio with respect to semantic elements such as narrative modules, tone, and scene transitions. There are also features that could be utilized to characterize genre, character, and other classes. The results in this paper motivate more work on integrated multi-modal analysis. Future work integrating a textual medium into such analysis, in addition to video and audio, could prove promising.
References
Beeferman, D.; Berger, A.; and Lafferty, J. 1999. Statistical models for text segmentation. Machine Learning 34(1-3):177–210.
Chen, L., and Chua, T.-S. 2001. A match and tiling approach to content-based video retrieval. In ICME. Citeseer.
Cour, T.; Jordan, C.; Miltsakaki, E.; and Taskar, B. 2008. Movie/script: Alignment and parsing of video and text transcription. In Computer Vision–ECCV 2008. Springer. 158–171.
Foltz, P. W.; Kintsch, W.; and Landauer, T. K. 1998. The measurement of textual coherence with latent semantic analysis. Discourse Processes 25(2-3):285–307.
Hanjalic, A.; Lagendijk, R. L.; and Biemond, J. 1999. Automated high-level movie segmentation for advanced video-retrieval systems. Circuits and Systems for Video Technology, IEEE Transactions on 9(4):580–588.
Figure 5: Pace results for both methods

Hearst, M. A. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics 23(1):33–64.
Kozima, H., and Furugori, T. 1994. Segmenting narrative text into coherent scenes. Literary and Linguistic Computing 9(1):13–19.
Landauer, T. K.; Foltz, P. W.; and Laham, D. 1998. An introduction to latent semantic analysis. Discourse Processes 25(2-3):259–284.
Lehane, B.; O'Connor, N. E.; and Murphy, N. 2004. Dialogue scene detection in movies using low and mid-level visual features.
Monaco, J. 2000. How to Read a Film: The World of Movies, Media, and Multimedia: Language, History and Theory. Oxford University Press.
Ngo, C.-W.; Ma, Y.-F.; and Zhang, H.-J. 2005. Video summarization and scene detection by graph modeling. Circuits and Systems for Video Technology, IEEE Transactions on 15(2):296–305.
Passonneau, R. J., and Litman, D. J. 1997. Discourse segmentation by human and automated means. Computational Linguistics 23(1):103–139.
Rasheed, Z., and Shah, M. 2003. Scene detection in Hollywood movies and TV shows. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, II–343. IEEE.
Rui, Y.; Huang, T. S.; and Mehrotra, S. 1999. Constructing table-of-content for videos. Multimedia Systems 7(5):359–368.
Salton, G.; Wong, A.; and Yang, C.-S. 1975. A vector space model for automatic indexing. Communications of the ACM 18(11):613–620.
Shahraray, B. 1995. Scene change detection and content-based sampling of video sequences. In Proc. SPIE, volume 2419.
Shot    Event Description                               A or V
793     Ghostface stabs Billy                           A
822     Sydney falls off the roof                       A
870     Gale realizes her cameraman is dead             A
909     Sydney locks Randy and Stu out of the house     A
924     Billy shoots Randy                              A
962     Billy tells Sydney that he killed her mom       V
991     Billy stabs Stu to cover up their crimes        A/V
1026    The reporter holds up the bad guys with a gun   A
1071    Sydney kills Stu by dumping a TV on him         V
1089    Gale shoots Billy                               V

Table 2: Story events from a section of Scream. The results show how both the audio and visual pace components can miss valid events.

Sundaram, H., and Chang, S.-F. 2002. Computable scenes and structures in films. Multimedia, IEEE Transactions on 4(4):482–491.
Tarjan, R. 1972. Depth-first search and linear graph algorithms. SIAM Journal on Computing 1(2):146–160.

  • 1. Multi-modal analysis of Movies for Rhythm Extraction Devon Bates and Arnav Jhala Computational Cinematics Studio University of California Santa Cruz {dlbates, jhala}@soe.ucsc.edu Abstract This paper motivates a multi-modal approach for anal- ysis of aesthetic elements of films through integration of visual and auditory features. Prior work in charac- terizing aesthetic elements of film has predominantly focused on visual features. We present comparison of analysis from multiple modalities in a rhythm extrac- tion task. For detection of important events based on a model of rhythm/tempo we compare analysis of visual features and auditory features. We introduce an audio tempo function that characterizes the pacing of a video segment. We compare this function with its visual pace counterpart. Preliminary results indicate that an inte- grated approach could reveal more semantic and aes- thetic information from digital media. With combined information from the two signals, tasks such as auto- matic identification of important narrative events, can enable deeper analysis of large-scale video corpora. Introduction Pace or tempo is an aspect of performance art that is aesthet- ically important. In film, pace is created through deliberate decisions by the director and editors to control the ebb and flow of action in order to manipulate the audience perception of time and narrative engagement. Unlike the more literal definition regarding physical motion, pace in film is a more intangible quantity that is difficult to measure. Some media forms, like music, have pace as an easily identifiable feature. Music has a beat which along with other aspects make pace readily recognizable. Modeling pace is important for several areas of film analysis. Pace is shown to be a viable semantic stylistic indicator of genre that can be used to detect a films genre. The pace can also be useful for scene and story sec- tion detection. 
Pace adds a level of film analysis that adds more possible areas for indexing and classifying films. Film as a medium is unique from music in that it is a combination of audio and visual components. While we tend to focus consciously on the visual components of a film, there is significant information in the audio signal that is often overlooked in analysis. The video components of pace are important in shaping the audience's perception of pace, but the audio components are also very important, particularly in signifying the narrative importance of events.

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Visual Pace Function

The visual pace of a film is determined by the content and movement of a shot as well as the length of the shot; in other words, the tempo is higher when shots are shorter. That effect can be seen especially in movies by Michael Bay, who has developed a style built on rapid cuts between shots; that style has a fast tempo and is most often used in action movies. Another aspect of a shot that contributes to the pace of a film is the amount of motion in the frame during the duration of the shot. To illustrate the effect of motion on the pace or tempo of a shot, imagine the same scene of two people talking, shot in two takes: in one, a character paces nervously while speaking; in the other, both characters are stationary. If the two takes are identical in every other way, the first would be perceived to have a faster tempo because of the additional motion in the frame.

In terms of automatic tempo detection, visual pace is a useful measure. The main factors of visual pace are shot length and motion energy. Both can be calculated automatically, so pace can be estimated without manual annotation. The visual pace function presented by Adams et al. is a useful tool for automatically detecting the visual pace of a film.
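A minimal sketch of such a pace function, in the spirit of Adams et al.'s formulation (normalized deviations of shot length and motion energy from their film-wide statistics; the formal definition follows in the text). The function name `visual_pace` and the equal weights `alpha = beta = 0.5` are illustrative assumptions, not values from the paper:

```python
import numpy as np

def visual_pace(shot_lengths, motion, alpha=0.5, beta=0.5):
    """Pace per shot: shorter-than-median shots and above-average
    motion both raise the pace value.

    shot_lengths -- s(n), length of each shot in frames
    motion       -- m(n), average motion energy of each shot
    alpha, beta  -- illustrative weights (assumed, not from the paper)
    """
    s = np.asarray(shot_lengths, dtype=float)
    m = np.asarray(motion, dtype=float)
    # Shot-length term: deviation of each shot from the film-wide
    # median shot length, normalized by the standard deviation.
    length_term = (np.median(s) - s) / np.std(s)
    # Motion term: deviation from the film-wide mean motion energy.
    motion_term = (m - np.mean(m)) / np.std(m)
    return alpha * length_term + beta * motion_term
```

Because both terms are normalized film-wide, the resulting pace values are comparable across shots of one film but not directly across different films.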
The function uses the shot length s(n) in frames and the motion m(n) of each shot n. It also calculates the median shot length med_s, the mean motion µ_m, and the standard deviations σ_s and σ_m over the entire movie, and uses these to weight the motion and shot length of individual shots. The visual pace function is defined as:

P(n) = \alpha \frac{\mathrm{med}_s - s(n)}{\sigma_s} + \beta \frac{m(n) - \mu_m}{\sigma_m}

Auditory Pace Function

We define audio pace similarly to visual pace, using changes in audio characteristics between shots to calculate pace features. The pace function we define uses a time-domain characteristic, loudness, and a frequency-domain feature, brightness, to characterize the audio signal. The pace is then calculated with a similar pace function that compares the changing characteristics. We use L(n) to represent the average loudness and B(n) to represent the brightness of each shot n. The medians med_L and med_B are taken over all shots in the film, with corresponding standard deviations σ_L and σ_B. The audio pace function is structured as:
P_{\mathrm{audio}}(n) = \alpha \frac{B(n) - \mathrm{med}_B}{\sigma_B} + \beta \frac{L(n) - \mathrm{med}_L}{\sigma_L}

Loudness here is the average energy of the audio signal over the duration of the shot, expressed in decibels; higher loudness corresponds to higher pace. Brightness is the centroid of the spectral frequencies in the frequency domain; higher brightness suggests more components in the audio, which we take to indicate higher pace. There are many different characteristics or measurements that could be used to characterize an audio signal; in this paper, we chose a simple model with two features that correspond well to the visual analogue.

Figure 1: Motion energy per shot

The average motion energy and the start frame of each shot are all that is needed to calculate the pace function P(n).

Figure 2: Pace function with Gaussian smoothing

Audio and Video Processing

We analyzed two film clips at 25 frames/second with frame sizes of 640 px by 272 px; in all, that constituted 157,500 frames for Scream and 136,500 frames for Cabin in the Woods. We imported and processed the frames in sections of 10,000 frames and compiled the results. After importing the frames, the first step was cut detection, to find the start frame of each shot in the film. Cut detection operates on DC images: the DC image is the average of n-by-n squares of the original image, and differences are then calculated from frame to frame. The window size must be large enough to suppress noise. We computed histograms of the DC image with 4 bins and took the χ² difference between consecutive frames to measure the magnitude of change at each frame. The number of histogram bins has to be high enough to detect changes, but not so high that it over-detects changes due to noise. The average motion energy of each frame is also calculated, simply as the difference between n-by-n squares in the DC images.
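The cut-detection step just described might be sketched as follows. The block size `n = 8`, the χ² threshold of 0.5, and the function names are illustrative assumptions; the paper specifies only 4-bin histograms of the DC image:

```python
import numpy as np

def dc_image(frame, n=8):
    """Downsample a grayscale frame by averaging n-by-n blocks."""
    h, w = frame.shape
    h, w = h - h % n, w - w % n          # crop to a multiple of n
    blocks = frame[:h, :w].reshape(h // n, n, w // n, n)
    return blocks.mean(axis=(1, 3))

def detect_cuts(frames, n=8, bins=4, threshold=0.5):
    """Return start-frame indices of new shots, flagged where the
    chi-squared difference between 4-bin histograms of consecutive
    DC images exceeds the threshold (threshold value is assumed)."""
    cuts, prev = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(dc_image(frame, n), bins=bins, range=(0, 255))
        hist = hist / hist.sum()
        if prev is not None:
            # chi-squared difference over bins with nonzero mass
            mask = (hist + prev) > 0
            chi2 = np.sum((hist[mask] - prev[mask]) ** 2 / (hist[mask] + prev[mask]))
            if chi2 > threshold:
                cuts.append(i)
        prev = hist
    return cuts
```

Averaging into DC blocks before differencing is what keeps small amounts of pixel noise from registering as change; the histogram comparison then responds only to frame-wide shifts in brightness distribution.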
So if a shot is detected from frames f1 to f2, the average motion energy of that shot is the sum of the differences between DC-image squares from f1 to f2, divided by the number of frames f2 − f1. The results of the pace function are then analyzed to find the edges with steep slopes. Two levels of edges are detected: the smaller is used to detect story events, and the larger indicates story sections.

Figure 3: Visual pace results

We present a breakdown of the story sections for Scream, by frame-number range and with a description of the events in each section, in Table 1 below. The story events detected using the video components of the film form a separate list from those detected using the audio components. It is assumed that some events will be detected by both methods and some by only one. To account for this, we combined story events that occur within the length of the window used for non-maximum suppression, in this case 10 shots. We show details of our results for a section of frames where we manually annotated story events to correspond to the time-stamps extracted from analysis of the
two modes.

Frame Range      Description
1–29939          Murder of first girl and aftermath
29940–43112      Ghostface attacks Sydney
43113–44875      Billy shows up and Sydney suspects that he's Ghostface
44876–79772      Billy gets off the hook and fights with Sydney
79773–102258     All the teens go to the party, where Sydney and Billy make up
102259–116637    Ghostface kills Tatum and Billy
116638–146020    Sydney tries to get away and finds out the truth
146021–150000    Aftermath of the killings is resolved
Table 1: Story sections from Scream

These results are for the story section from shot 792 to shot 1081, which contains a number of typical horror elements. It is the final set-piece and showdown between Sydney and the murderer Ghostface: it starts when he reveals himself by stabbing Billy and ends with the reporter Gale Weathers killing Billy to end the rampage.

Figure 5 and the classification results in Table 2 show a comparison of the audio pace and visual pace separately. There are some sections where the two functions overlap, but in other areas they are mirror images: from shots 1 to 200 the functions are almost mirror images, while they align almost perfectly from shots 850 to 1000. Each method has strengths and weaknesses, and neither can paint a sufficiently vivid picture of the film without the other. Overall, this preliminary analysis indicates that audio pace was not an appropriate way to detect story sections but was useful in detecting story events. Similarly, visual pace was a successful way to detect story events; however, during periods of sustained visual action, it had some difficulty detecting a full set of events. In the horror genre, there are often periods of sustained action, making visual pace insufficient to detect a full range of story events.

Pace is an important feature for detecting story structure within a film. Our techniques were limited to detecting the structure of the film without any underlying semantic meaning.
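The two-level edge detection described above, steep slopes in the smoothed pace curve marking story events at the smaller threshold and story sections at the larger, with non-maximum suppression over a 10-shot window, might be sketched as follows. The Gaussian kernel width and the two threshold values are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def pace_events(pace, event_thresh=0.5, section_thresh=1.5, window=10):
    """Find story events and story sections as steep slopes in the
    Gaussian-smoothed pace curve, keeping only slopes that survive
    non-maximum suppression over a 10-shot window.

    Returns (events, sections) as lists of shot indices.
    """
    pace = np.asarray(pace, dtype=float)
    # 5-tap Gaussian kernel (sigma = 1) for smoothing the pace curve.
    kernel = np.exp(-0.5 * np.arange(-2, 3) ** 2)
    kernel /= kernel.sum()
    smoothed = np.convolve(pace, kernel, mode="same")
    slope = np.abs(np.diff(smoothed))
    events, sections = [], []
    for i, s in enumerate(slope):
        lo, hi = max(0, i - window // 2), i + window // 2 + 1
        if s < slope[lo:hi].max():
            continue  # suppressed: a steeper edge lies within the window
        if s > section_thresh:
            sections.append(i + 1)   # large edge -> story section boundary
        elif s > event_thresh:
            events.append(i + 1)     # small edge -> story event
    return events, sections
```

Non-maximum suppression is what keeps one abrupt pace change from being reported as several adjacent events; only the steepest slope within each 10-shot neighborhood survives.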
Pace can detect events, but not necessarily the tone of those events. It is easy to see how this basic technique of finding pace could be expanded not only to detect story events and sections but also to characterize them, which would be an invaluable tool for automatic video annotation.

Figure 4: Audio pace results

Discussion

This is a preliminary study comparing models of aesthetic elements in film drawn from diverse techniques in visual and audio processing. This is worthwhile because of the richness and diversity of information that could be gained from integrating information across these modalities. Audio processing could be better integrated with video, given the similarity of the underlying signal-processing and modeling techniques, but much work remains in that area before any semantic and narrative-based analysis of audio is possible. In this work, we focused on low-level audio features; much work in the audio-analysis community has addressed deeper analysis of audio with respect to semantic elements such as narrative modules, tone, and scene transitions. There are also features that could be used to characterize genre, character, and other classes. The results in this paper motivate more work on integrated multi-modal analysis. Future work integrating the textual medium into such analysis, in addition to video and audio, could prove promising.
Figure 5: Pace results for both methods
Shot   Event Description                               A or V
793    Ghostface stabs Billy                           A
822    Sydney falls off the roof                       A
870    Gale realizes her camera man is dead            A
909    Sydney locks Randy and Stu out of the house     A
924    Billy shoots Randy                              A
962    Billy tells Sydney that he killed her mom       V
991    Billy stabs Stu to cover up their crimes        A/V
1026   The reporter holds up the bad guys with a gun   A
1071   Sydney kills Stu by dumping a TV on him         V
1089   Gale shoots Billy                               V
Table 2: Story events from a section of Scream. These results show how both the audio and visual pace components can miss valid events.