An MKL-Based Fusion Framework for Real-Time Multi-View Action Recognition
Feng Gu, Francisco Florez-Revuelta, Dorothy Monekosso and 
Paolo Remagnino 
Digital Imaging Research Centre 
Kingston University, London, UK 
December 3rd, 2014 
Outline 
1 Introduction 
2 Framework Overview 
3 Experimental Conditions 
4 Results and Analysis 
5 Conclusions and Future Work 
Background and Motivations 
Real-time multi-view action recognition: 
Gains increasing interest in video surveillance, human-computer interaction, multimedia retrieval, etc.
Provides complementary fields of view (FOVs) of a monitored scene via multiple cameras
Leads to more robust decision making based on multiple heterogeneous video streams
Real-time capability enables continuous long-term monitoring
Where possible, multiple cameras should be deployed to monitor human behaviour, so that data fusion techniques can be applied.
Illustration of the Monitored Scenario 
[Figure: a monitored scene observed by four cameras, C1-C4, with complementary fields of view.]
Motion-Based Person Detector 
We use a state-of-the-art motion-based tracker [6]: 
Each pixel modelled as a mixture of Gaussians in RGB space
Background model used to find foreground pixels in each new frame
Foreground pixels grouped to form large regions associated with the person of interest
Kalman filters used to track foreground detections
Person detections generated for every frame
A minimal sketch of this detection pipeline is given below.
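As an illustration only (not the tracker of [6]), here is a minimal Python sketch of such a pipeline, assuming OpenCV's MOG2 background subtractor and a constant-velocity Kalman filter over the centroid of the largest moving region:

```python
# Illustrative sketch: mixture-of-Gaussians background subtraction, blob
# grouping, and Kalman smoothing of the person detection (not the authors' code).
import cv2
import numpy as np

bg_model = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

# Constant-velocity Kalman filter over the blob centroid state (x, y, dx, dy).
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2

def detect_person(frame):
    """Return a smoothed (x, y, w, h) detection for the largest moving region."""
    fg = bg_model.apply(frame)                                  # foreground mask
    fg = cv2.threshold(fg, 127, 255, cv2.THRESH_BINARY)[1]      # drop shadow pixels
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    kf.predict()
    cx, cy = kf.correct(np.array([[x + w / 2], [y + h / 2]], np.float32))[:2, 0]
    return int(cx - w / 2), int(cy - h / 2), w, h
```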
Feature Representation of Videos 
Use STIP and improved dense trajectories (IDT) [7] as local descriptors to extract visual features from a video
Person detections and frame spans define an XYT cuboid associated with an action performed by the monitored person
Apply bag of words (BoW) to compute the feature vector of a cuboid, where K-means clustering is used to generate the codebook
A minimal BoW sketch is given below.
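A minimal sketch of the BoW step, assuming scikit-learn's KMeans; the helper names (build_codebook, bow_vector) are hypothetical and descriptor extraction (STIP/IDT) is treated as given:

```python
# Illustrative BoW encoding: cluster local descriptors with K-means to form a
# codebook, then represent each XYT cuboid as a normalised word histogram.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, num_words=4000, seed=0):
    """descriptors: (M, D) array of local descriptors pooled from training videos."""
    return KMeans(n_clusters=num_words, random_state=seed, n_init=4).fit(descriptors)

def bow_vector(codebook, cuboid_descriptors):
    """L1-normalised histogram of visual-word assignments for one XYT cuboid."""
    words = codebook.predict(cuboid_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```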
Discriminative Models for Classification
Let $x_i^k \in \mathbb{R}^D$, where $i \in \{1, 2, \ldots, N\}$ is the index of a feature vector corresponding to an XYT cuboid and $k \in \{1, 2, \ldots, K\}$ is the index of a camera view. We learn an SVM classifier as
$f(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b$   (1)
We then compute a classification score via a sigmoid function as
$p(y = 1 \mid x) = \dfrac{1}{1 + \exp(-f(x))}$   (2)
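A minimal per-view sketch, assuming scikit-learn's SVC; it maps the SVM decision values f(x) to scores with the sigmoid of Eq. (2). This is illustrative, not the authors' implementation:

```python
# Illustrative per-view classifier: kernel SVM on BoW vectors of one camera
# view, with decision values mapped to classification scores via Eq. (2).
import numpy as np
from sklearn.svm import SVC

def train_view_classifier(X_train, y_train):
    """X_train: (N, D) BoW vectors of one view; y_train: labels in {-1, +1}."""
    return SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

def classification_score(clf, X):
    """Sigmoid of the SVM decision value, as in Eq. (2)."""
    f = clf.decision_function(X)   # f(x) = sum_i alpha_i * y_i * k(x_i, x) + b
    return 1.0 / (1.0 + np.exp(-f))
```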
Simple Fusion Strategies 
Concatenation of Features: concatenate the feature vectors of all camera views into a single feature vector, $\tilde{x}_i = [x_i^1, \ldots, x_i^K]$
Sum of Classification Scores: compute a classification score $p(y = 1 \mid x^k)$ for each camera view as in (2), then average them as $\frac{1}{K} \sum_{k=1}^{K} p(y = 1 \mid x^k)$
Product of Classification Scores: apply the product rule to the classification scores of all camera views, $\prod_{k=1}^{K} p(y = 1 \mid x^k)$
A small numeric sketch of the sum and product rules is given below.
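A small numeric sketch of the sum and product fusion rules, using hypothetical per-view scores from K = 3 cameras:

```python
# Illustrative sum and product fusion of per-view scores p(y=1|x^k).
import numpy as np

view_scores = np.array([0.82, 0.64, 0.91])   # hypothetical p(y=1|x^k), k = 1..3

sum_fusion = view_scores.mean()              # (1/K) * sum_k p(y=1|x^k)
product_fusion = view_scores.prod()          # prod_k p(y=1|x^k)

print(sum_fusion, product_fusion)            # ~0.79 and ~0.478
```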
Multiple Kernel Learning 
Combines multiple kernels corresponding to different data sources (e.g. camera views) via a convex combination such as
$\mathcal{K}(x_i, x_j) = \sum_{k=1}^{K} \beta_k k_k(x_i, x_j)$   (3)
where $\beta_k \geq 0$ and $\sum_{k=1}^{K} \beta_k = 1$. A minimal sketch of this combination is given below.

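A minimal sketch of Eq. (3) with assumed fixed weights beta (a real MKL solver would learn them jointly with the classifier); the RBF base kernels and the combined_kernel helper are illustrative choices, not the authors' implementation:

```python
# Illustrative convex combination of per-view kernel matrices as in Eq. (3).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def combined_kernel(view_features, betas, gamma=0.5):
    """view_features: list of (N, D_k) arrays, one per camera view.
    betas: non-negative kernel weights summing to one."""
    betas = np.asarray(betas, dtype=float)
    assert np.all(betas >= 0) and np.isclose(betas.sum(), 1.0)
    return sum(b * rbf_kernel(X, X, gamma=gamma)
               for b, X in zip(betas, view_features))

# The resulting Gram matrix can be fed to a precomputed-kernel SVM,
# e.g. sklearn.svm.SVC(kernel="precomputed").
```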