CASSANDRA: audio-video sensor fusion for aggression detection∗

W. Zajdel∗   J.D. Krijnders†   T. Andringa†   D.M. Gavrila∗
∗ Intelligent Systems Laboratory, Faculty of Science, University of Amsterdam
† Auditory Cognition Group, Artificial Intelligence, Rijksuniversiteit Groningen

∗ This research was supported by the Dutch Science Foundation NWO under grant 634.000.432 within the ToKeN2000 program.
978-1-4244-1696-7/07/$25.00 ©2007 IEEE.

Abstract

This paper presents a smart surveillance system named CASSANDRA, aimed at detecting instances of aggressive human behavior in public environments. A distinguishing aspect of CASSANDRA is the exploitation of the complementary nature of audio and video sensing to disambiguate scene activity in real-life, noisy and dynamic environments. At the lower level, independent analysis of the audio and video streams yields intermediate descriptors of a scene such as "scream", "passing train" or "articulation energy". At the higher level, a Dynamic Bayesian Network is used as a fusion mechanism that produces an aggregate aggression indication for the current scene. Our prototype system is validated on a set of scenarios performed by professional actors at an actual train station to ensure a realistic audio and video noise setting.

1 Introduction

Surveillance technology is increasingly fielded to help safeguard public spaces such as train stations, shopping malls and street corners, in view of mounting concerns about public safety. Traditional surveillance systems require human operators who monitor a wall of CCTV screens for specific events that occur rarely. Advanced systems have the potential to automatically filter out spurious information and present the operator with only the security-relevant data. Existing systems still have limited capabilities; they typically perform video-based intrusion detection, and possibly some trajectory analysis, in fairly static environments.

In the context of human activity recognition in dynamic environments, we focus on the relatively unexplored problem of aggression detection. Earlier work involved solely the video domain and considered fairly controlled indoor environments with a static background and few (two) persons [2].

Because events associated with the build-up or enactment of aggression are difficult to detect by a single sensor modality (e.g. shouting versus hitting someone), in this work we combine audio and video sensing. At the low level, raw sensor data are processed to compute "intermediate-level" events or features that summarize activities in the scene. Examples of such descriptors implemented in the current system are "scream" (audio) or "train passing" and "articulation energy" (video). At the top level, a Dynamic Bayesian Network combines the visual and auditory events and incorporates any context-specific knowledge in order to produce an aggregate aggression indication. This is unlike previous work where audio-video fusion dealt with speaker localization for advanced user interfaces, i.e. using video information to direct a phased-array microphone configuration [8].

2 System Description

2.1 Audio Unit

Audio processing is performed in the time-frequency domain with an approach common in auditory scene analysis [11]. The transformation of the time signal to the time-frequency domain is performed by a model¹ of the human ear [3]. This model is a transmission-line model with its channels tuned according to the oscillatory properties of the basilar membrane. Leaky integration of the squared membrane displacement results in an energy spectrum, called a cochleogram (see Fig. 1).

¹ We thank Sound Intelligence for contributing this model and for cooperation on the aggression detection methods.

A signal component is defined as a coherent area of the cochleogram that is very likely to stem from a single source. To obtain signal components, the cochleogram is first filtered with a matched filter which encodes the response of the cochlea to a perfect sinusoid of the frequency applicable to that segment. Then the cochleogram is thresholded using a cut-off value based on two times the standard deviation of the energy values. Signal components are obtained as the tracks formed by McAulay-Quatieri tracking [7], applied to this pre-processed version of the cochleogram. This entails stringing the energy maxima of connected components together over successive frames.
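
The thresholding step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in the system: the cochleogram is assumed to be a list of frames, each a list of channel energies, and "two times the standard deviation" is read literally as a cut-off of 2σ over all energy values. The function name `threshold_cochleogram` and the factor argument `k` are hypothetical.

```python
from statistics import pstdev


def threshold_cochleogram(cochleogram, k=2.0):
    """Suppress time-frequency cells whose energy is below k * std.

    cochleogram: list of frames, each a list of non-negative channel energies
                 (an assumed layout, chosen for illustration only).
    Returns (masked_cochleogram, cutoff).
    """
    energies = [e for frame in cochleogram for e in frame]
    if not energies:
        raise ValueError("cochleogram must contain at least one energy value")
    # Assumption: the cut-off is k times the population standard deviation
    # of all energy values in the cochleogram.
    cutoff = k * pstdev(energies)
    masked = [[e if e >= cutoff else 0.0 for e in frame] for frame in cochleogram]
    return masked, cutoff
```

Cells that survive the threshold would then be the candidates that a peak tracker strings together into signal components.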

Codeveloping sinusoids with a frequency development equal to an integer multiple of a fundamental frequency (harmonics) are subsequently combined into harmonic complexes. Note that these harmonics can be combined safely because the probability is small that uncorrelated sound sources show this measure of correlation by chance.

Little or no literature exists on the influence of aggression on the properties of speech. However, the Component Process Model of Scherer [9] and similarities with the Lombard reflex [6] suggest a couple of important cues for aggression. The component process theory assumes that anger and panic, emotions strongly related to aggression, are seen as an ergotropic arousal. This form of arousal is accompanied by an increase in heart frequency, blood pressure, transpiration and associated hormonal activity. The predictions given by the model show many similarities with the Lombard reflex. This arousal acts on the vocal cords, which increases the pitch and enhances the higher harmonics, leading to an increase in spectral tilt. These properties, pitch (fundamental frequency $f_0$) and spectral tilt (a measure of the slope of the average energy distribution, calculated as the energy of the harmonics above 500 Hz divided by the energy of the harmonics below 500 Hz), are calculated from the harmonic complexes. The audio detector uses these two properties as input for a decision tree. An example of normal and aggressive speech can be seen in Fig. 1.

Figure 1: Typical cochleograms of aggressive and normal speech (top and bottom figure, respectively). Energy content is color-coded (increasing from blue to red). Note the higher pitch (marked by white lines) and the more pronounced higher harmonics for the aggressive speech case.

2.2 Video Unit

Analysis of the video stream aims primarily at computing visual cues characteristic of physical aggression among humans. Physical aggression is usually characterized by fast articulation of body parts (i.e. arms, legs). Therefore, a principled approach for detecting aggression involves detailed body-pose estimation, possibly in 3D, followed by ballistic analysis of movements of body parts. Unfortunately, at present pose estimation remains a significant computational challenge. Various approaches [4] operate at limited rates and handle mostly a single person in a constrained setting (limited occlusions, pre-fitted body model).

Simplified approaches rely on a coarser representation of the human body. An example is a system [2] that tracks the head of a person by analyzing the body contour and correlates aggression with the head's "jerk" (derivative of acceleration). In practice, high-order derivatives related to body contours are difficult to estimate robustly in cases where the background is not static and there is a possibility of occlusion.

2.2.1 Visual aggression features

Here we consider alternative cues based on the intuitive observation that aggressive behavior leads to highly energetic body articulation. We estimate the (pseudo-)kinetic energy of body parts using a "bag-of-points" body model. The approach relies on simple image processing operations and yields features highly correlated with aggression.

For detecting people, our video subsystem employs an adaptive background/foreground subtraction technique [12]. The assumption of a static background scene holds fairly well in the center view area, where most of the people enter the scene. After detection, people are represented as ellipses and tracked with an extended version of the mean-shift tracking algorithm [13]. The extended tracker adapts the position and shape of the ellipse (tilt, axes) and thus facilitates a close approximation of the body area even for tilted or bent poses. Additionally, the mean-shift tracker handles partial occlusions well and achieves near real-time performance.

We consider the human body as a collection of loosely connected points with identical mass. While such a model is clearly a simplification, it reflects well the non-rigid nature of a body and facilitates fast computations. Assuming $Q$ points attached to various body parts, the average kinetic energy of an articulating body is given by the average kinetic energy of the points,

$$E = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{2}\, m_i\, |v_i - v_e|^2 \qquad (1)$$

where $v_i$ and $m_i$ denote, respectively, the velocity vector and mass of the $i$th point, and $v_e$ denotes the velocity vector of the ellipse representing the person. By discounting overall body motion we capture only the articulation energy.

Due to the assumption of uniform mass distribution between points, the total body mass becomes a scale factor. By omitting scale factors, we obtain a pseudo-kinetic energy estimate in the form $\bar{E} = \frac{1}{Q} \sum_{i=1}^{Q} |v_i - v_e|^2$.
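
As a hedged illustration of the pseudo-kinetic energy estimate $\bar{E}$, the sketch below computes the mean squared speed of tracked points relative to the ellipse velocity. It assumes 2D image-plane velocities given as `(vx, vy)` tuples; the function name `pseudo_kinetic_energy` and the data layout are hypothetical and not taken from the system.

```python
def pseudo_kinetic_energy(point_velocities, ellipse_velocity):
    """Mean squared point speed relative to the person's ellipse.

    point_velocities: list of (vx, vy) tuples, one per tracked point (v_i).
    ellipse_velocity: (vx, vy) of the ellipse representing the person (v_e).
    Both layouts are assumptions made for this sketch.
    """
    if not point_velocities:
        raise ValueError("at least one tracked point is required")
    vex, vey = ellipse_velocity
    # Subtracting v_e discounts overall body motion, so only articulation remains.
    total = sum((vx - vex) ** 2 + (vy - vey) ** 2 for vx, vy in point_velocities)
    return total / len(point_velocities)
```

For a person translating rigidly, every point moves with the ellipse and the estimate is zero; limbs moving against the body motion raise it.
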
Such features are assumed to measure articulation and will be our primary visual cues for aggression detection.

Computation of energy features requires selecting image points that represent a person. Ideally, the points would cover the limbs and the head, since these parts are articulated the most. Further, to estimate velocities, the selected points must be easy to track. Accordingly, we select $Q = 100$ points within an extended bounding box of a person by finding the pixels with the most local contrast [10]. Such points are easy to track and usually align well with edges in an image (which in turn often coincide with limbs, as in Fig. 2, right). For point tracking we use the KLT algorithm [10] (freely available implementation from the OpenCV library).

Figure 2: (Left) Optical-flow features for detecting trains in motion. (Right) Representing people: ellipses (for tracking) and points (for articulation features).

2.2.2 Train detection

An additional objective of the video unit is detecting trains in motion. Trains moving in and out of a station produce auditory noise that often leads to spurious audio-aggression detections. Therefore, recognizing trains in video makes it possible to suppress such detections later.

A train usually appears as a large, rigid body and moves along a constrained trajectory. For a given view and rail section we define a mask that indicates the image regions where a train typically appears. In this region we track the frame-to-frame motion of $N = 100$ image features with a KLT [10] tracker (Fig. 2, left). The features' motion vectors are classified as train/non-train by a pre-trained nearest-neighbor classifier. A train in motion is detected when more than 50% of the features are classified positively. Due to the constrained movement of trains, our detector turns out to be quite robust to occasional occlusions of the train area by people.

2.3 Fusion Unit

The fusion unit produces an aggregate aggression indication given the features/events produced independently by the audio and video subsystems. A fundamental difficulty with fusion arises from inevitable ambiguities in human behavior, which make it difficult to separate normal from aggressive activities (even for a human observer). Additional problems follow from various noise artifacts in the sensory data.

Given the noisy and ambiguous domain, we resort to a probabilistic formulation. The fusion unit employs a probabilistic time-series model (a Dynamic Bayesian Network, DBN [5]), in which the aggression level can be estimated in a principled way by solving an appropriate inference problem.

2.3.1 Basic model

We denote the discrete-time index as $k = 1, 2, \ldots$, and set the gap between discrete-time steps (clock ticks) to 50 ms. At the $k$th step, $y^a_k \in \{0, 1\}$ denotes the output of the audio aggression detector, and $y^v_{j,k}$ denotes the pseudo-kinetic energy of the $j$th person, $j = 1, \ldots, J$. Our system can comprise several train detectors monitoring non-overlapping rail sections. The binary output of the $m$th train detector, $m = 1, \ldots, M$, will be denoted as $y^T_{m,k} \in \{0, 1\}$. (We tested a configuration with $M = 4$.)

Our aim is to detect "ambient" scene aggression, without deciding precisely which persons are aggressive. Therefore we reason on the basis of a cumulative articulation measurement $y^v_k = \sum_j y^v_{j,k}$ over all persons. Additionally, the cumulative measurement is quite robust to (near-)occlusions, when the articulation of one person could be wrongly attributed to another.

In order to reason about the aggression level, we use a discrete scale on $[0, 1]$ with five steps of 0.2: 0.0 (no activity), 0.2 (normal activity), 0.4 (attention required), 0.6 (minor disturbance), 0.8 (major disturbance), and 1.0 (critical aggression).

Importantly, the aggression level obeys specific correlations over time and should be represented as a process (rather than an instantaneous quantity). We denote the aggression level at step $k$ as $a_k$ and define a stochastic process $\{a_k\}$ with dynamics given by a first-order model:

$$p(a_{k+1} = i \mid a_k = j) = \mathrm{CPT}_a(i, j), \qquad (2)$$

where $\mathrm{CPT}_a(i, j)$ denotes a conditional probability table. In a sense, the first-order model is a simplification, as it captures only short-term dependencies.

The measured visual ($y^v_k$) and auditory ($y^a_k$) features are treated as samples from an observation distribution (model) that depends on the aggression level $a_k$. Since (later on) we will incorporate information about passing trains, we introduce a latent train-noise indicator variable $n_k \in \{0, 1\}$ and assume that the observation model

$$p(y^v_k, y^a_k \mid a_k, n_k)$$

also depends on the train-noise indicator. The model takes the form of a conditional probability table $\mathrm{CPT}_o$, where the cumulative articulation feature is discretized.

2.3.2 Train models

The fusion DBN comprises several subnetworks, the train models, which couple train detections $y^T_{m,k}$ with the latent
train-noise indicator $n_k$. Additionally, each train model encodes prior information about the duration of a train pass.

For the $m$th rail section, we introduce a latent indicator $i_{m,k} \in \{0, 1\}$ of a train passing at step $k$. We assume that the train detections $y^T_{m,k}$, the train-pass indicators $i_{m,k}$, and the train noise $n_k$ obey the probabilistic relations

$$p(y^T_{m,k} \mid i_{m,k}) = \mathrm{CPT}_t(y^T_{m,k}, i_{m,k}) \qquad (3)$$

$$p(n_k \mid i_{1:M,k}) = \mathrm{CPT}_n(n_k, i_{1:M,k}). \qquad (4)$$

For each rail, the model (3) encodes inaccuracies of the detector (mis-detections, false alarms). The model (4) represents the fact that passing trains usually induce noise, but also that sometimes noise is present without a passing train.

Since a typical pass takes 5 to 10 seconds (100 to 200 steps), the pass indicator variable exhibits strong temporal correlations. We represent such correlations with a time-series model based on a gamma distribution. A gamma pdf $\gamma(\tau_m, \alpha_m, \beta_m)$ is a convenient choice for modeling the duration $\tau_m$ of an event ($\alpha_m$, $\beta_m$ are parameters). To apply this model in a time-series formulation, we replace the total duration $\tau_m$ with a partial duration $\tau_{m,k}$ that indicates how long a train has already been passing through the scene at step $k$.

By considering a joint process $\{i_{m,k}, \tau_{m,k}\}$, temporal correlations can be enforced by the following model:

$$p(i_{m,k+1} = 1 \mid \tau_{m,k}, i_{m,k} = 0) = \eta_m$$

$$p(i_{m,k+1} = 1 \mid \tau_{m,k}, i_{m,k} = 1) = p(\tau_m > \tau_{m,k}) = \int_{\tau_{m,k}}^{+\infty} \gamma(\tau_m, \alpha_m, \beta_m)\, d\tau_m = 1 - F(\tau_{m,k}, \alpha_m, \beta_m),$$

where $F(\cdot)$ is the gamma cumulative distribution function. The parameter $\eta_m$ denotes the probability of starting a new train pass. At the $k$th step, the probability of continuing a pass is a function of the current duration of the pass. A configuration $(i_{m,k+1} = 1, \tau_{m,k}, i_{m,k} = 1)$ implies that the pass has not finished yet and that the total pass duration will be larger than $\tau_{m,k}$, hence the integration. Further, the partial duration variable obeys a deterministic regime

$$\tau_{m,k+1} = \begin{cases} 0 & \text{if } i_{m,k+1} = 0 \\ \tau_{m,k} + \epsilon & \text{otherwise,} \end{cases}$$

where $\epsilon = 50$ ms is the period between successive steps.

Figure 3: Dynamic Bayesian Network representing the probabilistic fusion model. The rectangular plate indicates $M = 4$ replications of the train sub-network.

2.3.3 Inference and Learning

In a probabilistic framework, reasoning about aggression corresponds to solving probabilistic inference problems. In an online mode, the key quantity of interest is the posterior distribution on the aggression level given the data collected up to the current step, $p(a_k \mid y^v_{1:k}, y^a_{1:k}, y^T_{1:M,1:k})$. From this distribution we calculate the expected aggression value, which will be the basic output of the fusion system.

Given the graphical structure of the model (Fig. 3), the required distribution can be efficiently computed using a recursive, forward filtering procedure [5]. We implemented an approximate variant of the filtering procedure, known as the Boyen-Koller algorithm [1]. At a given step $k$, the algorithm maintains only the marginal distributions $p(h_k \mid y^v_{1:k}, y^a_{1:k}, y^T_{1:M,1:k})$, where $h_k$ is any of the latent variables. When new detector data arrive, the current-step marginals are updated to represent the next-step marginals.

An important modeling aspect is the temporal development of processes in the scene. Unlike the binary train-pass events, the aggression level usually undergoes more subtle evolutions as the tension and anger among people build up. Since the assumed (first-order) model might not capture long-term effects well, and a stronger model would be rather complicated, we enforce temporal correlations with a simple low-pass filter. The articulation measurements (before inference) and the expected aggression level (after inference) are low-pass filtered using a 10 s running-average filter.

The parameters of our model (the probability tables $\mathrm{CPT}_a$, $\mathrm{CPT}_o$, $\mathrm{CPT}_n$, $\mathrm{CPT}_t$ and the parameters $\alpha_m$, $\beta_m$ of the gamma pdfs) are estimated by maximum-likelihood learning. The learning process relies on detector measurements from training audio-video clips and ground-truth annotations. The annotations are particularly important for learning the observation model $\mathrm{CPT}_o$. An increased probability of a false auditory aggression detection in the presence of train noise will suppress the contribution of audio data to the aggression level when the video subsystem reports a passing train.
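
To make the recursive filtering idea concrete, here is a deliberately simplified sketch: one predict-update step of exact forward filtering over the aggression level alone, followed by the expected value. It ignores the train sub-networks and the train-noise indicator and is not the Boyen-Koller variant used in the system; the function names, the list-of-lists layout for $\mathrm{CPT}_a$, and the per-level likelihood vector are assumptions made for illustration.

```python
def filter_step(belief, transition, likelihood):
    """One predict-update step of forward filtering over aggression levels.

    belief[j]        : p(a_k = j | data up to step k)
    transition[i][j] : p(a_{k+1} = i | a_k = j), i.e. CPT_a(i, j)
    likelihood[i]    : p(observation at step k+1 | a_{k+1} = i)
    Returns the normalized posterior over a_{k+1}.
    """
    n = len(belief)
    # Predict: propagate the current belief through the first-order dynamics.
    predicted = [sum(transition[i][j] * belief[j] for j in range(n)) for i in range(n)]
    # Update: weight each level by how well it explains the new observation.
    unnormalized = [likelihood[i] * predicted[i] for i in range(n)]
    z = sum(unnormalized)
    if z == 0:
        raise ValueError("observation has zero probability under the model")
    return [u / z for u in unnormalized]


def expected_level(belief, levels):
    """Posterior expected aggression value for the given discrete levels."""
    return sum(level * p for level, p in zip(levels, belief))
```

With the six-level scale of Sect. 2.3.1, `levels` would be `[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]`, and the output of `expected_level` is the quantity that the text then smooths with the running-average filter.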

3 Experiments

We evaluate the aggression detection system using a set of 13 audio-video clips (scenarios) recorded at a train station. The clips (each 100 to 150 s long) feature 2-4 professional actors who engage in a variety of activities ranging from normal (walking) through slightly excited (shouting, running, hugging) and moderately aggressive (pushing, hitting a vending machine) to critically aggressive (football supporters clashing). The recording took place on a platform of an actual train station (between two rail tracks, partially outdoors) and therefore incorporates realistic artifacts such as noise and vibrations from trains, variable illumination and wind.

Scenarios have been manually annotated with a ground-truth aggression level by two independent observers using the scale mentioned in Sect. 2.3.1. Aggression toward objects was rated approx. 25% lower than aggression toward humans, i.e. the former did not exceed a level of 0.8.

Fig. 5 details the results of the CASSANDRA system on a scenario involving gradual aggression build-up. Here, two pairs of competing supporters first start arguing, then get into a fight. The bottom panel shows some illustrative frames, with articulation features highlighted. We see from Fig. 5 that the raw (before low-pass filtering) articulation measurements are rather noisy; however, the low-pass filtering reveals a strong correlation with the ground-truth aggression level. The effect of low-pass filtering of the estimated aggression level is shown in the bottom plot of Fig. 5. A screen-shot of the CASSANDRA system in action is shown in Fig. 4.

Figure 4: A screen-shot of the CASSANDRA prototype system. In the lower part of the right window, various intermediate quantities are shown: the probabilities for trains on various tracks (currently zero), the output of the video-based articulation energy ("People", currently mid-way), and the output of the audio detector (currently at maximum). The slider covering the right side shows the overall estimated aggression level (currently "major disturbance").

Figure 5: Aggression build-up scenario. (Top graph) Audio-aggression detections. (Middle graph) Visual articulation measurements. (Bottom graph) Estimated and ground-truth aggression level. The gray lines show uncertainty intervals (2× std. deviation) around the raw (before filtering) expected level. (Images) Several frames (timestamps: 45 s, 77 s, 91 s, 95 s). Notice the correspondence with the articulation measurements.
[Plot panels: audio-aggression detections (binary value); visual aggression features (raw pseudo-energy and low-pass); posterior expected value of aggression level on the aggression scale (ground-truth, low-pass, raw). Horizontal axis: time [s].]

Figure 6: Selected frames from a scenario involving aggression toward a machine.

We considered two quantitative criteria to evaluate our system. The first is the deviation of the CASSANDRA estimated aggression level from the ground-truth annotation. Here we obtained a deviation of mean 0.17 with standard
deviation 0.1. The second performance criterion considers aggression detection as a two-class classification problem of distinguishing between "normal" and "aggressive" events (by thresholding the aggression level at 0.5). Matching ground-truth with estimated events allows us to compute the detection rate (%) and the false-alarm rate (per hour). When matching events we allowed a time deviation of 10 s. The cumulative results of leave-one-out tests on 13 scenarios (12 for training, 1 for testing) are given in Fig. 7. Comparing the test results for the three modalities (audio, video, fusion of audio+video), we notice that the auditory and visual features are indeed complementary; with fusion the overall detection rate increased without introducing additional false alarms. It is important to note that our data set is heavily biased toward occurrences of aggression, which puts the system to a difficult test. We expect CASSANDRA to produce far fewer false alarms in a typical surveillance setting, where most of the time nothing happens.

Figure 7: Cumulative detection results by sensor modality.
[Plot "Cumulative system performance": detection % versus false-positives / hour, with curves for audio, video and fusion.]

Table 1 gives an overview of the detection results on the scenarios. We notice that the system performed well on the clearly normal cases (scenarios 1-3) or aggressive cases (scenarios 9-13), while borderline scenarios were more difficult to classify. The borderline behavior (e.g. scenarios 7-8) also turns out to be difficult to classify for human observers, given the inconsistent ground-truth annotation in Tab. 1.

Table 1: Aggression detection results by scenario. The table indicates the number of events (positive = aggressive). Figure 6 shows example frames from the 5th scenario.

id   scenario content                      ground-truth   detected    detected
                                           positive       true-pos.   false-pos.
1    normal: walking, greeting             0              0           0
2    normal: walking, greeting             0              0           1
3    excited: lively argument              0              0           0
4    excited: lively argument              1              1           0
5    aggression toward a vend. machine     1              0           1
6    aggression toward a vend. machine     1              0           0
7    happy football supporters             1              1           0
8    happy football supporters             0              0           1
9    supporters harassing a passenger      1              1           0
10   supporters harassing a passenger      1              1           0
11   two people fight, third intervenes    1              1           0
12   four people fighting                  1              1           0
13   four people fighting                  1              1           1

The CASSANDRA system runs on two PCs (one with the video and fusion units, the other with the audio unit). The overall processing rate is approx. 5 Hz with a 756×560 input video stream at 20 Hz and a 44 kHz input audio stream.

4 Conclusions and Future Work

We demonstrated a prototype system that uses a Dynamic Bayesian Network to fuse auditory and visual information for detecting aggressive behavior. On the auditory side, the system relies on scream-like cues, and on the video side, the system uses motion features related to articulation. We obtained a promising aggression detection performance in a complex, real-world train station setting, operating in near real-time.

The present system is able to distinguish well between clear cases of aggression and normal behavior. In the future, we plan to focus increasingly on the "borderline" cases. For this, we expect to use more elaborate auditory cues (laughter vs scream), more detailed visual cues (indications of body contact, partial body-pose estimation), and stronger use of context information.

References

[1] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In UAI, 1998.
[2] A. Datta et al. Person-on-person violence detection in video data. ICPR, 01:10433, 2002.
[3] H. Duifhuis et al. Cochlear Mechanisms: Structure, Function and Models, pages 395–404. 1985.
[4] D.M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82–98, 1999.
[5] Z. Ghahramani. An introduction to hidden Markov models and Bayesian networks. IJPRAI, 15(1):9–42, 2001.
[6] J.-C. Junqua. The Lombard reflex and its role on human listeners and automatic speech recognizers. The Journal of the Acoustical Society of America, 93(1):510–524, 1993.
[7] R. McAulay and T. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. Acoustics, Speech, and Signal Processing, 34(4):744–754, 1986.
[8] A. Pentland. Smart rooms. Scientific American, 274(4):54–62, 1996.
[9] K. Scherer. Vocal affect expression: A review and a model for future research. Psychol. Bulletin, 99(2):143–165, 1986.
[10] J. Shi and C. Tomasi. Good features to track. In CVPR, 1994.
[11] D. Wang and G. Brown. Computational Auditory Scene Analysis. John Wiley and Sons, 2006.
[12] Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In ICPR, 2004.
[13] Z. Zivkovic and B. Krose. An EM-like algorithm for color-histogram-based object tracking. In CVPR, 2004.