The document proposes a framework for recognizing actions across cameras by exploring correlation subspaces. It first learns a joint subspace using Canonical Correlation Analysis (CCA) on unlabeled multi-view data. It then trains a Support Vector Machine (SVM) in this subspace with a novel correlation regularizer that favors dimensions with higher correlation between views, improving generalization to target views. Experiments on the IXMAS dataset show the method outperforms baselines, with the regularizer successfully suppressing weights for less correlated dimensions.
Recently, WaveNet, which auto-regressively predicts the probability distribution of speech samples, has provided a new paradigm for speech synthesis tasks.
Since the usage of WaveNet for speech synthesis varies with the conditioning vectors, it is very important to design the baseline system structure effectively.
In this talk, I will first introduce various types of WaveNet vocoders, such as the conventional speech-domain approach and the recently proposed source-filter theory-based approach.
Then, I will explain linear prediction (LP)-based WaveNet speech synthesis, i.e., LP-WaveNet, which overcomes the limitations of source-filter theory-based WaveNet vocoders caused by the mismatch between the speech excitation signal and the vocal tract filter.
While presenting the experimental setups and results, I will also share some know-how for successfully training the network.
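For readers new to the linear prediction decomposition behind LP-WaveNet, here is a minimal NumPy/SciPy sketch, not from the talk: LP coefficients are estimated from a frame's autocorrelation by solving a Toeplitz system, and the excitation is recovered as the prediction residual. The frame length and LP order are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_analysis(frame, order=16):
    """Estimate LP coefficients and the excitation (residual) of one frame.

    Solves the autocorrelation normal equations R a = r so that
    s[n] ~= sum_k a[k] * s[n-k]; the residual e[n] = s[n] - prediction
    approximates the excitation signal of the source-filter model.
    """
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Symmetric Toeplitz system from lags 0..order-1; rhs is lags 1..order.
    a = solve_toeplitz((ac[:order], ac[:order]), ac[1:order + 1])
    pred = np.convolve(frame, np.concatenate(([0.0], a)))[:len(frame)]
    return a, frame - pred

# Toy usage: a synthetic harmonic frame stands in for 16 kHz speech.
t = np.arange(400) / 16000.0
frame = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 660 * t)
coeffs, excitation = lp_analysis(frame)
print(np.var(excitation) < np.var(frame))  # residual carries less energy
```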
https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that had until now been addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F... - Simone Ercoli
I presented an interesting paper during the Vision and Multimedia Reading Group: DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition (pdf).
It is a thorough evaluation of features extracted from the activations of a deep convolutional network trained on a large-scale dataset.
This is work by Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell from UC Berkeley.
An introduction to selective search for object proposals; deep dives into the R-CNN family and the state-of-the-art RetinaNet model for object detection; the mAP concept for evaluating models; and how anchor boxes let the model learn where to draw bounding boxes.
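As a companion to the mAP discussion in these slides, a minimal sketch, mine rather than the slides', of the two building blocks of detection evaluation: IoU between boxes, and average precision over a score-ranked list of detections.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def average_precision(scores, is_tp, num_gt):
    """Non-interpolated AP: mean precision at each true-positive rank."""
    order = np.argsort(scores)[::-1]               # rank by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    precision = np.cumsum(tp) / (np.arange(len(tp)) + 1)
    return float(np.sum(precision * tp) / num_gt)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))                   # ~0.143
print(average_precision([0.9, 0.8, 0.7], [1, 0, 1], 2))  # ~0.833
```

mAP is then the mean of such per-class APs, with detections usually matched to ground truth at an IoU threshold such as 0.5.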
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
We present a deep architecture for dense semantic correspondence, called pyramidal affine regression networks (PARN), that estimates locally-varying affine transformation fields across images.
To deal with the intra-class appearance and shape variations that commonly exist among different instances within the same object category, we leverage a pyramidal model where affine transformation fields are progressively estimated in a coarse-to-fine manner, so that the smoothness constraint is naturally imposed within the deep network.
PARN estimates residual affine transformations at each level and composes them to estimate final affine transformations.
Furthermore, to overcome the limitation of insufficient training data for semantic correspondence, we propose a novel weakly-supervised training scheme that generates progressive supervision by leveraging correspondence consistency across image pairs.
Our method is fully learnable in an end-to-end manner and does not require quantizing infinite continuous affine transformation fields.
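To make the coarse-to-fine composition step concrete, a hypothetical sketch (not the PARN code; the helper name and toy residuals are mine) of composing per-level residual affine transformations, each a 2x3 matrix [A | t], into a final transformation:

```python
import numpy as np

def compose_affine(base, residual):
    """Apply `base` then `residual`: x -> A_r (A_b x + t_b) + t_r."""
    A_b, t_b = base[:, :2], base[:, 2]
    A_r, t_r = residual[:, :2], residual[:, 2]
    return np.hstack([A_r @ A_b, (A_r @ t_b + t_r)[:, None]])

# Coarse-to-fine: start from the identity, refine with per-level residuals
# (random near-identity perturbations stand in for network predictions).
rng = np.random.default_rng(0)
field = np.hstack([np.eye(2), np.zeros((2, 1))])
for _ in range(3):
    residual = np.hstack([np.eye(2) + 0.01 * rng.standard_normal((2, 2)),
                          0.1 * rng.standard_normal((2, 1))])
    field = compose_affine(field, residual)
print(field)
```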
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that had until now been addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
http://imatge-upc.github.io/retrieval-2017-cam/
Image retrieval in realistic scenarios targets large dynamic datasets of unlabeled images. In these cases, training or fine-tuning a model every time new images are added to the database is neither efficient nor scalable.
Convolutional neural networks trained for image classification over large datasets have proven to be effective feature extractors when transferred to the task of image retrieval. The most successful approaches are based on encoding the activations of convolutional layers, as they convey the image's spatial information. Our proposal goes further and aims at a local-aware encoding of these features depending on the predicted image semantics, with the advantage of using only the knowledge contained inside the network.
In particular, we employ Class Activation Maps (CAMs) to obtain the most discriminative regions from a semantic perspective. Additionally, CAMs are also used to generate object proposals during an unsupervised re-ranking stage after a first fast search.
Our experiments on two publicly available datasets for instance retrieval, Oxford5k and Paris6k, demonstrate that our system is competitive and even outperforms the current state of the art when using off-the-shelf models trained on the object classes of ImageNet.
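A minimal sketch of the CAM-weighted encoding idea, assuming the standard CAM construction (a global-average-pooling network with a final linear layer); the function names and toy shapes are mine, not the authors' code:

```python
import numpy as np

def class_activation_map(feats, fc_weights, class_idx):
    """CAM for one class: weighted sum of conv maps, rectified and normalized.

    feats: (K, H, W) activations of the last conv layer;
    fc_weights: (num_classes, K) linear weights after global average pooling.
    """
    cam = np.tensordot(fc_weights[class_idx], feats, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)
    return cam / (cam.max() + 1e-8)

def cam_weighted_descriptor(feats, cam):
    """Spatially weight activations by the CAM, sum-pool, l2-normalize."""
    desc = (feats * cam[None]).sum(axis=(1, 2))               # (K,)
    return desc / (np.linalg.norm(desc) + 1e-8)

# Toy usage with random activations and classifier weights.
rng = np.random.default_rng(0)
feats = np.abs(rng.standard_normal((512, 14, 14)))
fc_w = rng.standard_normal((1000, 512))
cam = class_activation_map(feats, fc_w, class_idx=283)
descriptor = cam_weighted_descriptor(feats, cam)
```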
Distributed Parallel Process Particle Swarm Optimization on Fixed Charge Netw... - Corey Clark, Ph.D.
We are developing a parallel-process particle swarm optimization (PSO) on an HTML5-based, dynamically distributed system and assessing its performance as applied to the multicommodity fixed charge (MCFC) network flow problem. The MCFC problem is motivated by a real-world cash management problem faced by large national banks and is NP-hard. We compare the performance of serial and distributed parallel-process PSO implementations and empirically evaluate the optimality gap for multiple instances.
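For reference, a minimal serial PSO sketch; this is the generic textbook algorithm, not the HTML5 distributed implementation described here, and the sphere objective merely stands in for the MCFC objective:

```python
import numpy as np

def pso(objective, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimizer for a continuous objective."""
    rng = np.random.default_rng(0)
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()                       # per-particle best positions
    best_val = np.apply_along_axis(objective, 1, pos)
    g_idx = best_val.argmin()                   # swarm-wide (global) best
    g_pos, g_val = best_pos[g_idx].copy(), best_val[g_idx]
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (best_pos - pos) + c2 * r2 * (g_pos - pos)
        pos += vel
        vals = np.apply_along_axis(objective, 1, pos)
        improved = vals < best_val
        best_pos[improved], best_val[improved] = pos[improved], vals[improved]
        if best_val.min() < g_val:
            g_idx = best_val.argmin()
            g_pos, g_val = best_pos[g_idx].copy(), best_val[g_idx]
    return g_pos, g_val

# Toy usage on a sphere function; the MCFC objective would replace it.
best_x, best_f = pso(lambda x: float(np.sum(x ** 2)), dim=10)
print(best_f)
```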
We are currently in the process of converting JaHOVA OS into a high-performance multithreaded game and simulation engine (GEn3CIS). One feature of GEn3CIS is its ability to distribute processing across any internet-enabled device with a modern browser. Essentially, this allows users to take their phones, tablets, PCs/Macs, etc., and utilize their combined computing power to solve complex simulation, learning, and/or optimization problems.
The first part of this dissertation focuses on an analysis of the spatial context in semantic image segmentation. First, we review how spatial context has been tackled in the literature by local features and spatial aggregation techniques. From a discussion about whether the context is beneficial or not for object recognition, we extend a Figure-Border-Ground segmentation for local feature aggregation with ground truth annotations to a more realistic scenario where object proposal techniques are used instead. Whereas the Figure and Ground regions represent the object and the surround respectively, the Border is a region around the object contour, which is found to be the region with the richest contextual information for object recognition. Furthermore, we propose a new contour-based spatial aggregation technique of the local features within the object region by a division of the region into four subregions. Both contributions have been tested on a semantic segmentation benchmark with a combination of free and non-free context local features that allows the models to automatically learn whether the context is beneficial or not for each semantic category.
The second part of this dissertation addresses the semantic segmentation of a set of closely-related images from an uncalibrated multiview scenario. State-of-the-art semantic segmentation algorithms fail to correctly segment the objects from some viewpoints when the techniques are independently applied to each viewpoint image. The lack of large annotated datasets for multiview segmentation does not allow obtaining a model that is robust to viewpoint changes. In this second part, we exploit the spatial correlation that exists between the different viewpoint images to obtain a more robust semantic segmentation. First, we review the state-of-the-art co-clustering, co-segmentation and video segmentation techniques that aim to segment the set of images in a generic way, i.e., without considering semantics. Then, a new architecture that considers motion information and provides a multiresolution segmentation is proposed for the co-clustering framework and outperforms state-of-the-art techniques for generic multiview segmentation. Finally, the proposed multiview segmentation is combined with the semantic segmentation results, giving a method for automatic resolution selection and a coherent semantic multiview segmentation.
The slides cover the techniques used in the Temporal Segment Network (TSN), including the basic ideas, a recap of BN-Inception, optical flow, and tricks used in practice. Used in a group paper reading at the University of Sydney.
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD) - Saimunur Rahman
This presentation was prepared for the ViPr Reading Group at Multimedia University, Cyberjaya. Its goal was to make the lab members aware of recent advancements in action recognition.
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo... - MLconf
Graph Representation Learning with Deep Embedding Approach:
Graphs are a commonly used data structure for representing real-world relationships, e.g., molecular structures, knowledge graphs, and social and communication networks. The effective encoding of graphical information is essential to the success of such applications. In this talk I'll first describe a general deep learning framework, namely structure2vec, for end-to-end graph feature representation learning. Then I'll present direct applications of this model to graph problems at different scales, including community detection and molecule graph classification/regression. We then extend the embedding idea to a temporally evolving user-product interaction graph for recommendation. Finally I'll present our latest work on leveraging reinforcement learning techniques for graph combinatorial optimization, including the vertex cover problem for social influence maximization and the traveling salesman problem for scheduling management.
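A toy sketch of the neighborhood-aggregation idea behind such graph embeddings, written in the spirit of structure2vec but not its exact parameterization; the weights, features, and readout are illustrative:

```python
import numpy as np

def graph_embed(adj, node_feats, n_layers=3, dim=16, seed=0):
    """Generic neighborhood-aggregation embedding.

    Iterates mu_v <- relu(W1 x_v + W2 sum_{u in N(v)} mu_u), then sums node
    embeddings into a graph-level vector (a simple readout).
    """
    rng = np.random.default_rng(seed)
    w1 = 0.1 * rng.standard_normal((dim, node_feats.shape[1]))
    w2 = 0.1 * rng.standard_normal((dim, dim))
    mu = np.zeros((adj.shape[0], dim))
    for _ in range(n_layers):
        agg = adj @ mu                            # sum over neighbors
        mu = np.maximum(node_feats @ w1.T + agg @ w2.T, 0)
    return mu.sum(axis=0)                         # graph-level readout

# Toy usage: a 4-node path graph with one-hot degree features.
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
                [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
feats = np.eye(4)[adj.sum(1).astype(int)]         # degree one-hot
print(graph_embed(adj, feats))
```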
Multimodal Residual Networks for Visual QA - Jin-Hwa Kim
Deep neural networks continue to advance the state of the art in image recognition tasks with various methods. However, applications of these methods to multimodality remain limited. We present Multimodal Residual Networks (MRN) for the multimodal residual learning of visual question-answering, which extends the idea of deep residual learning. Unlike deep residual learning, MRN effectively learns the joint representation from vision and language information. The main idea is to use element-wise multiplication for the joint residual mappings, exploiting the residual learning of the attentional models in recent studies. Various alternative models introduced by multimodality are explored based on our study. We achieve state-of-the-art results on the Visual QA dataset for both Open-Ended and Multiple-Choice tasks. Moreover, we introduce a novel method to visualize the attention effect of the joint representations for each learning block using the back-propagation algorithm, even though the visual features are collapsed without spatial information.
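A simplified sketch of the element-wise-product residual fusion described above; this is my own reading of the idea, not the paper's exact parameterization or shapes:

```python
import numpy as np

def mrn_block(h, v, w_q, w_v, w_s):
    """One multimodal residual block (simplified sketch).

    The joint residual tanh(h W_q) * tanh(v W_v) is an element-wise
    product of mapped question and visual features, added to a linearly
    mapped shortcut of the question state h.
    """
    return h @ w_s + np.tanh(h @ w_q) * np.tanh(v @ w_v)

# Toy usage: stack three blocks, as in a deep residual network.
rng = np.random.default_rng(0)
q_state = rng.standard_normal((1, 256))          # question representation
v_feat = rng.standard_normal((1, 256))           # visual representation
for _ in range(3):
    w_q, w_v, w_s = (0.05 * rng.standard_normal((256, 256)) for _ in range(3))
    q_state = mrn_block(q_state, v_feat, w_q, w_v, w_s)
print(q_state.shape)
```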
Two strategies for large-scale multi-label classification on the YouTube-8M d... - Dalei Li
A project for the Kaggle YouTube-8M video understanding competition. Four algorithms that can run on a single machine are implemented: multi-label k-nearest neighbors, a multi-label radial basis function network (one-vs-rest), multi-label logistic regression, and a one-vs-rest multi-layer neural network.
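To illustrate the one-vs-rest strategy mentioned above, a minimal scikit-learn sketch on synthetic multi-label data (a stand-in for YouTube-8M features and labels, not the competition code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy multi-label data: X are video-level features, Y is a binary indicator
# matrix where each row can have several active labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 32))
Y = (X @ rng.standard_normal((32, 5)) + rng.standard_normal((500, 5)) > 1.0)

# One-vs-rest fits one binary logistic regression per label column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y.astype(int))
probs = clf.predict_proba(X[:3])   # per-label probabilities for ranking
print(probs.shape)                 # (3, 5)
```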
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Fast Object Recognition from 3D Depth Data with Extreme Learning Machine - Soma Boubou
Object recognition from RGB-D sensors has recently emerged as a renowned and challenging research topic. Current systems often require large amounts of time to train the models and to classify new data. We propose an effective and fast object recognition approach for 3D data acquired from depth sensors such as the Structure or Kinect sensors.
Our contribution in this work is a novel, fast, and effective approach for real-time object recognition from 3D depth data:
- First, we extract simple but effective frame-level features, which we call differential frames, from the raw depth data (a sketch follows below).
- Second, we build a recognition system based on an Extreme Learning Machine classifier with Local Receptive Fields (ELM-LRF).
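A hedged sketch of what such frame-level features could look like, assuming "differential frames" are temporal differences of consecutive depth frames; the abstract does not define them precisely, so this is an illustration only:

```python
import numpy as np

def differential_frames(depth_video):
    """Frame-level temporal differences of a depth video (T, H, W).

    Each feature frame is the change in depth between consecutive raw
    frames, which emphasizes moving object parts and suppresses the
    static background. (An assumed reading of 'differential frames'.)
    """
    return np.diff(depth_video.astype(np.float32), axis=0)

# Toy usage: 10 frames of 64x64 synthetic depth data.
video = np.random.default_rng(0).integers(0, 4096, (10, 64, 64))
feats = differential_frames(video)
print(feats.shape)  # (9, 64, 64)
```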
Optic Flow
Brightness Constancy Constraints
Aperture Problem
Regularization and Smoothness Constraints
Lucas-Kanade algorithm
Focus of Expansion (FOE)
Discrete Optimization for Optical Flow
Large Displacement Optical Flow: Descriptor Matching
DeepFlow: Large displ. optical flow with deep matching
EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow
Optical Flow with Piecewise Parametric Model
Flow Fields: Dense Correspondence Fields for Accurate Large Displacement Optical Flow Estimation
Full Flow: Optical Flow Estimation By Global Optimization over Regular Grids
FlowNet: Learning Optical Flow with Convol. Networks
Deep Discrete Flow
Optical Flow Estimation using a Spatial Pyramid Network
A Large Dataset to Train ConvNets for Disparity, Optical Flow, and Scene Flow Estimation
DeMoN: Depth and Motion Network for Learning Monocular Stereo
Unsupervised Learning of Depth and Ego-Motion from Video
Appendix A: A Database and Evaluation Methodology for Optical Flow
Appendix B: Learning and optimization
1. Recognizing Actions Across Cameras by Exploring the Correlation Subspace
4th International Workshop on Video Event Categorization, Tagging and Retrieval (VECTaR), in conjunction with ECCV 2012
Chun-Hao Huang, Yi-Ren Yeh, and Yu-Chiang Frank Wang
Research Center for IT Innovation, Academia Sinica, Taiwan
Oct 12th, 2012
2. Outline
• Introduction
• Our Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with A Novel Correlation Regularizer
• Experiments
• Conclusion
3. Outline
• Introduction
• Our Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with A Novel Correlation Regularizer
• Experiments
• Conclusion
4. Representing an Action
• Actions are represented as high-dimensional vectors.
• Spatio-temporal interest points [Laptev, IJCV, 2005] [Dollár et al., ICCV WS on VS-PETS, 2005]
• Bag of spatio-temporal visual words model.
• State-of-the-art classifiers (e.g., SVM) are applied to address the recognition task.
5. Cross-Camera Action Recognition
• Models learned at source views typically do not generalize well to target views.
[Figure: labeled source-view actions (check watch, punch, kick) $\mathbf{v}_1^{s},\mathbf{v}_2^{s},\mathbf{v}_3^{s}$ in $\mathcal{X}^{s}\in\mathbb{R}^{d_s}$, and target-view data $\mathbf{v}_1^{t},\mathbf{v}_2^{t},\mathbf{v}_3^{t}$ in $\mathcal{X}^{t}\in\mathbb{R}^{d_t}$. Colored: labeled data; uncolored: test data.]
6. Cross-Camera Action Recognition (cont'd)
• An unsupervised strategy:
Only unlabeled data are available at target views.
They are exploited to learn the relationship between data at the source and target views.
• This is one branch of transfer learning.
[Figure: source and target views. Colored: labeled data; gray: unlabeled data; the rest: test data.]
7. Approaches based on Transfer Learning
• To learn a common feature representation (e.g., a joint subspace)
for both source and target view data.
• Training/testing can be performed in terms of such representations.
• How to exploit unlabeled data from both views for determining this
joint subspace is the key issue.
• Previous approaches:
1. Splits-based feature transfer [Farhadi and Tabrizi, ECCV ‘08 ]
Requires frame-wise correspondence
2. Bag of bilingual words model (BoBW) [Liu et al., CVPR ‘11 ]
Considers each dimension of the derived representation to be equally important.
8. Outline
• Introduction
• Our Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with A Novel Correlation Regularizer
• Experiments
• Conclusion
9. Overview of Our Proposed Method
1. Learn a joint subspace via canonical correlation analysis (CCA).
2. Project the labeled source data onto it.
3. Learn a new SVM with constraints on domain transfer ability.
4. Prediction.
[Figure: source data in $\mathcal{X}^{s}\in\mathbb{R}^{d_s}$ and target data in $\mathcal{X}^{t}\in\mathbb{R}^{d_t}$ are both mapped into the correlation subspace $\mathcal{X}^{c}\in\mathbb{R}^{d}$, whose dimensions $\mathbf{v}_1^{s,t},\mathbf{v}_2^{s,t},\ldots$ are shared by the two views.]
10. Requirements of CCA
• CCA requires unlabeled data pairs, i.e., unlabeled actions observed by both cameras (at both views).
[Figure: source and target views. Colored: labeled data; gray: unlabeled data pairs; the rest: test data.]
11. Learning the Correlation Subspace via CCA
• CCA aims at maximizing the correlation between two sets of variables.
• Given two sets of $n$ centered unlabeled observations $\mathbf{X}^{s}=[\mathbf{x}_1^{s},\ldots,\mathbf{x}_n^{s}]\in\mathbb{R}^{d_s\times n}$ and $\mathbf{X}^{t}=[\mathbf{x}_1^{t},\ldots,\mathbf{x}_n^{t}]\in\mathbb{R}^{d_t\times n}$,
• CCA learns two projection vectors $\mathbf{u}_s$ and $\mathbf{u}_t$ maximizing the correlation coefficient $\rho$ between the projected data, i.e.,
$$\rho=\max_{\mathbf{u}_s,\mathbf{u}_t}\frac{\mathbf{u}_s^{\top}\boldsymbol{\Sigma}_{st}\mathbf{u}_t}{\sqrt{\mathbf{u}_s^{\top}\boldsymbol{\Sigma}_{ss}\mathbf{u}_s}\,\sqrt{\mathbf{u}_t^{\top}\boldsymbol{\Sigma}_{tt}\mathbf{u}_t}},$$
where $\boldsymbol{\Sigma}_{st}=\mathbf{X}^{s}\mathbf{X}^{t\top}$, $\boldsymbol{\Sigma}_{ss}=\mathbf{X}^{s}\mathbf{X}^{s\top}$, and $\boldsymbol{\Sigma}_{tt}=\mathbf{X}^{t}\mathbf{X}^{t\top}$ are the covariance matrices.
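A compact NumPy/SciPy sketch of this construction (a generic regularized CCA, not the authors' code). It solves the generalized eigenproblem given in the appendix slide; those eigenvalues satisfy $\eta=\rho^2$, so the correlation coefficients follow directly:

```python
import numpy as np
from scipy.linalg import eigh

def cca(Xs, Xt, dim, reg=1e-3):
    """CCA via a generalized eigenvalue problem on centered data.

    Xs: (d_s, n), Xt: (d_t, n). Returns Us (d_s, dim), Ut (d_t, dim)
    and the correlation coefficients rho (largest first).
    """
    Xs = Xs - Xs.mean(axis=1, keepdims=True)
    Xt = Xt - Xt.mean(axis=1, keepdims=True)
    Sss = Xs @ Xs.T + reg * np.eye(Xs.shape[0])   # regularized covariances
    Stt = Xt @ Xt.T + reg * np.eye(Xt.shape[0])
    Sst = Xs @ Xt.T
    # Generalized eigenproblem: Sst Stt^-1 Sst^T u_s = eta Sss u_s.
    lhs = Sst @ np.linalg.solve(Stt, Sst.T)
    eta, Us = eigh(lhs, Sss)
    order = np.argsort(eta)[::-1][:dim]           # largest eta <-> largest rho
    eta, Us = eta[order], Us[:, order]
    Ut = np.linalg.solve(Stt, Sst.T @ Us)         # u_t from u_s (up to scale)
    Ut /= np.linalg.norm(Ut, axis=0, keepdims=True)
    return Us, Ut, np.sqrt(np.clip(eta, 0, 1))    # eta = rho^2

# Toy usage: two 'views' that share a 5-dimensional latent signal.
rng = np.random.default_rng(0)
z = rng.standard_normal((5, 200))
Xs = rng.standard_normal((20, 5)) @ z + 0.1 * rng.standard_normal((20, 200))
Xt = rng.standard_normal((30, 5)) @ z + 0.1 * rng.standard_normal((30, 200))
Ps, Pt, rho = cca(Xs, Xt, dim=5)
print(np.round(rho, 2))   # close to 1 for the shared dimensions
```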
12. CCA Subspace as Common Feature Representation
• Stack the projection vectors into $\mathbf{P}^{s}=[\mathbf{u}_1^{s}\cdots\mathbf{u}_d^{s}]\in\mathbb{R}^{d_s\times d}$ and $\mathbf{P}^{t}=[\mathbf{u}_1^{t}\cdots\mathbf{u}_d^{t}]\in\mathbb{R}^{d_t\times d}$.
• Source-view data $\mathbf{x}^{s}\in\mathcal{X}^{s}$ and target-view data $\mathbf{x}^{t}\in\mathcal{X}^{t}$ are mapped into the correlation subspace $\mathcal{X}^{c}\in\mathbb{R}^{d}$ by $\mathbf{P}^{s\top}\mathbf{x}^{s}$ and $\mathbf{P}^{t\top}\mathbf{x}^{t}$, respectively.
• Each dimension $\mathbf{v}_i^{s,t}$ of the subspace is associated with a triplet $(\rho_i,\mathbf{u}_i^{s},\mathbf{u}_i^{t})$.
[Figure: source view $\mathcal{X}^{s}\in\mathbb{R}^{d_s}$ and target view $\mathcal{X}^{t}\in\mathbb{R}^{d_t}$ projected into the shared correlation subspace $\mathcal{X}^{c}$.]
13. Outline
• Introduction
• The Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with A Novel Correlation Regularizer
• Experiments
• Conclusion
14. Domain Transfer Ability of CCA
• Learn SVMs in the derived CCA subspace... problem solved?
- Yes and no!
• Domain transfer ability:
- In the CCA subspace, each dimension $\mathbf{v}_i^{s,t}$ is associated with a different $\rho_i$.
- How well can classifiers learned (in this subspace) from the projected source-view data generalize to the projected target-view data?
• See the example below...
15. Outline
• Introduction
• The Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with a Novel Correlation Regularizer
• Experiments
• Conclusion
16. Our Proposed SVM with Domain Transfer Ability
• Proposed SVM formulation:
$$\min_{\mathbf{w},b,\xi}\ \frac{1}{2}\|\mathbf{w}\|_2^2-\frac{1}{2}\mathbf{r}^{\top}\mathrm{Abs}(\mathbf{w})+C\sum_{i=1}^{N}\xi_i$$
$$\text{s.t. }\ y_i\big(\langle\mathbf{w},\,\mathbf{P}^{s\top}\mathbf{x}_i^{s}\rangle+b\big)\ \ge\ 1-\xi_i,\quad \xi_i\ge 0,\quad \forall\,(\mathbf{x}_i^{s},y_i)\in D_l.$$
• The introduced correlation regularizer $\mathbf{r}^{\top}\mathrm{Abs}(\mathbf{w})$ uses
$$\mathrm{Abs}(\mathbf{w})=\big[\,|w_1|,|w_2|,\ldots,|w_d|\,\big]^{\top}\quad\text{and}\quad \mathbf{r}=[\rho_1,\rho_2,\ldots,\rho_d]^{\top}.$$
• Larger/smaller $\rho_i$
→ stronger/weaker correlation between source- and target-view data
→ the SVM weight $w_i$ is more/less reliable at that dimension of the CCA space.
• Our regularizer favors SVM solutions that are dominant in reliable CCA dimensions (i.e., larger correlation coefficients $\rho_i$ imply larger $|w_i|$ values).
• Classification of (projected) target-view test data:
$$f(\mathbf{x}^{t})=\mathrm{sgn}\big(\langle\mathbf{w},\,\mathbf{P}^{t\top}\mathbf{x}^{t}\rangle+b\big).$$
17. An Approximation for the Proposed SVM
• It is not straightforward to solve the previous formulation with $\mathrm{Abs}(\mathbf{w})$.
• An approximate solution can be derived by relaxing $\mathrm{Abs}(\mathbf{w})$ to $\mathbf{w}\odot\mathbf{w}$, where $\odot$ indicates element-wise multiplication:
$$\min_{\mathbf{w},b,\xi}\ \frac{1}{2}\|\mathbf{w}\|_2^2-\frac{1}{2}\mathbf{r}^{\top}(\mathbf{w}\odot\mathbf{w})+C\sum_{i=1}^{N}\xi_i^2\quad\text{s.t. }\ y_i\big(\langle\mathbf{w},\,\mathbf{P}^{s\top}\mathbf{x}_i^{s}\rangle+b\big)\ge 1-\xi_i,\ \xi_i\ge 0,\ \forall\,(\mathbf{x}_i^{s},y_i)\in D_l.$$
• We can further simplify the approximated problem as:
$$\min_{\mathbf{w},b,\xi}\ \frac{1}{2}\sum_{i=1}^{d}(1-\rho_i)\,w_i^2+C\sum_{i=1}^{N}\xi_i^2\quad\text{s.t. }\ y_i\big(\langle\mathbf{w},\,\mathbf{P}^{s\top}\mathbf{x}_i^{s}\rangle+b\big)\ge 1-\xi_i,\ \xi_i\ge 0,\ \forall\,(\mathbf{x}_i^{s},y_i)\in D_l.$$
• We apply SSVM* to solve the above optimization problem.
*: Lee et al., Computational Optimization and Applications, 2001
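One way to see the simplified problem: under the substitution $\tilde{w}_i=\sqrt{1-\rho_i}\,w_i$ it becomes a standard SVM on features scaled by $1/\sqrt{1-\rho_i}$. The sketch below exploits this with scikit-learn's LinearSVC, whose squared hinge loss stands in for the SSVM-style smooth loss; it is a reimplementation of the idea, not the authors' solver:

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_correlation_svm(Z_src, y, rho, C=1.0):
    """Approximate correlation-regularized SVM via feature rescaling.

    The objective (1/2) sum_i (1 - rho_i) w_i^2 + C sum_j xi_j^2 turns
    into a standard SVM after scaling feature i by 1 / sqrt(1 - rho_i):
    highly correlated (reliable) dimensions become cheap to weight, so
    |w_i| grows there, as the regularizer intends.
    """
    scale = 1.0 / np.sqrt(np.maximum(1.0 - rho, 1e-6))
    clf = LinearSVC(C=C).fit(Z_src * scale, y)
    w = clf.coef_.ravel() * scale          # map back to the original w
    return w, clf.intercept_[0]

def predict(Z_tgt, w, b):
    """Classify projected target-view data with the learned hyperplane."""
    return np.sign(Z_tgt @ w + b)

# Toy usage: Z_src are CCA-projected source data, rho the correlations.
rng = np.random.default_rng(0)
Z_src = rng.standard_normal((100, 10))
y = np.sign(Z_src[:, 0] + 0.1 * rng.standard_normal(100))
rho = np.linspace(0.95, 0.1, 10)           # decreasing correlations
w, b = fit_correlation_svm(Z_src, y, rho)
print(np.round(np.abs(w), 2))              # larger |w_i| at high-rho dims
```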
18. Outline
• Introduction
• The Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with a Novel Correlation Regularizer
• Experiments
• Conclusion
19. Dataset
• IXMAS multi-view action dataset
Action videos of eleven action classes
Each action video is performed three times by twelve actors
The actions are captured simultaneously by five cameras
20. Experiment Setting
• 2/3 of the data as unlabeled data: learning correlation subspaces via CCA.
• 1/3 of the data as labeled data: training and testing.
• Leave-one-class-out protocol (LOCO): e.g., training without the Kick action.
[Figure: source and target views with example actions: check-watch, scratch-head, sit-down, kick.]
21. Experimental Results
• A: BoW from source view directly
• B: BoBW + SVM [Liu et al., CVPR '11]
• C: BoBW + our SVM
• D: CCA + SVM
• E: our proposed framework (CCA + our SVM)

Recognition rates (%):

      | camera0                       | camera1                       | camera2
      | A     B     C     D     E     | A     B     C     D     E     | A     B     C     D     E
c0    | -                             | 9.29  60.96 63.03 63.18 64.90 | 11.62 41.21 50.76 56.97 60.61
c1    | 10.71 58.08 59.70 66.72 70.25 | -                             | 7.12  33.54 38.03 57.83 59.34
c2    | 8.79  52.63 49.34 57.37 62.47 | 6.67  50.86 45.79 59.19 61.87 | -
c3    | 6.31  40.35 44.44 65.30 66.01 | 9.75  33.59 33.27 46.77 52.68 | 5.96  41.26 43.99 61.36 61.36
c4    | 5.35  38.59 40.91 54.39 55.76 | 9.44  37.53 37.00 53.59 55.00 | 9.19  34.80 38.28 57.88 60.15
avg.  | 7.79  47.41 48.60 60.95 63.62 | 8.79  45.73 44.77 55.68 58.61 | 8.47  37.70 42.77 58.51 60.37

      | camera3                       | camera4
      | A     B     C     D     E     | A     B     C     D     E
c0    | 7.78  39.65 41.36 63.64 62.17 | 7.12  24.60 37.02 43.69 48.23
c1    | 12.02 35.91 39.14 48.59 54.85 | 8.89  26.87 22.22 44.24 49.29
c2    | 6.46  41.46 42.78 60.00 61.46 | 10.35 28.03 33.43 45.05 51.82
c3    | -                             | 8.89  27.53 28.28 40.66 41.06
c4    | 9.60  27.68 34.60 48.03 48.89 | -
avg.  | 8.96  36.17 39.47 55.06 56.84 | 8.81  26.76 30.24 43.41 47.60
22. Effects of the Correlation Coefficient ρ
• Example: source camera 3, target camera 2, left-out action get-up.
• We successfully suppress the SVM weights $|w_i|$ at dimensions where a lower $\rho_i$ results.
• Recognition rates for the two models were 47.22% and 77.78%, respectively.
[Figure: averaged $|w_i|$ versus dimension index; (a) standard SVM, (b) our SVM.]
23. Outline
• Introduction
• The Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with A Novel Correlation Regularizer
• Experiments
• Conclusion
24. Conclusions
• We presented a transfer-learning-based approach to cross-camera action recognition.
• We considered the domain transfer ability of CCA and proposed a novel SVM formulation with a correlation regularizer.
• Experimental results on the IXMAS dataset confirmed the performance improvements obtained with our proposed method.
27. Representing an Action
• Spatio-temporal volumes:
- Space-time shapes [Blank et al., ICCV, 2005]
- Motion history volume [Weinland et al., CVIU, 2006]
28. Split-Based Feature Transfer (ECCV '08)
[Figure: in both the source view ($\mathcal{X}^{s}\in\mathbb{R}^{40}$) and the target view ($\mathcal{X}^{t}\in\mathbb{R}^{40}$), each frame of an action video is first described in $\mathbb{R}^{276}$ and quantized by K-means; matching according to the split-based feature yields the target instance in the source representation.]
29. How to Construct the Split-Based Feature
[Figure, source view: frame descriptors in $\mathbb{R}^{276}$ are mapped by 1000 different random projections to $\mathbb{R}^{30}$; Max Margin Clustering assigns each projection a binary split (+/-), and the best 25 random projections are picked, giving a split-based feature in $\mathbb{R}^{25}$. Target view: the same best 25 random projections are applied to unlabeled frames, and SVMs are trained using the split-based feature as labels.]
30. Bag of Bilingual Words (CVPR '11)
1. Exploit unlabeled data to model the two codebooks (visual words $\mathbf{v}_1^{s},\mathbf{v}_2^{s},\ldots,\mathbf{v}_{d_s}^{s}$ and their target-view counterparts) as a bipartite graph.
2. Perform spectral clustering.
3. Construct the codebook of bilingual words.
4. Train models and predict with this representation.
[Figure: source- and target-view codebooks linked as a bipartite graph.]
31. Learning correlation subspace via CCA
• The projection vector $\mathbf{u}_s$ can be solved by a generalized eigenvalue decomposition problem:
$$\boldsymbol{\Sigma}_{st}\boldsymbol{\Sigma}_{tt}^{-1}\boldsymbol{\Sigma}_{st}^{\top}\mathbf{u}_s=\eta\,\boldsymbol{\Sigma}_{ss}\mathbf{u}_s,\qquad\text{or, with regularization,}\qquad \boldsymbol{\Sigma}_{st}(\boldsymbol{\Sigma}_{tt}+\mathbf{I}_t)^{-1}\boldsymbol{\Sigma}_{st}^{\top}\mathbf{u}_s=\eta\,(\boldsymbol{\Sigma}_{ss}+\mathbf{I}_s)\mathbf{u}_s.$$
• The largest $\eta$ corresponds to the largest $\rho$: eigenvalues $\eta_1>\cdots>\eta_d$ correspond to correlation coefficients $\rho_1>\cdots>\rho_d$.
• Once $\mathbf{u}_s$ is obtained, $\mathbf{u}_t$ can be calculated by $\mathbf{u}_t\propto\boldsymbol{\Sigma}_{tt}^{-1}\boldsymbol{\Sigma}_{st}^{\top}\mathbf{u}_s$.
• Stacking the eigenvectors gives the projection matrices $\mathbf{P}^{s}=[\mathbf{u}_1^{s}\cdots\mathbf{u}_d^{s}]\in\mathbb{R}^{d_s\times d}$ and $\mathbf{P}^{t}=[\mathbf{u}_1^{t}\cdots\mathbf{u}_d^{t}]\in\mathbb{R}^{d_t\times d}$.
32. LOCO Protocol in a Real Application: A New Action Class
[Figure: source and target views. Colored: labeled data; gray: unlabeled data; the rest: test data from the new (left-out) action class.]
Editor's Notes
Among these approaches,….
Partly inspired by the progress of feature extraction in image classification/ tracking and detection, …etc
In our work, we adopt the… to represent an action
This is regarded as “cross-camera action recognition”
Traditional learning methods fail to predict test data in another view successfully, not only because of the different distributions, but also because the dimensions of the two views can even differ.
Since test data are not available beforehand, one has to assume there are some other data in the target view, in order to facilitate the recognition task.
One scenario is introducing unlabeled data, whose labels are not of our interest for the time being.
This scenario is often called “unsupervised cross-camera action recognition” because there are no labeled data in the target view.
And it belongs to a branch of transfer learning.
Specifically, transfer learning
Note that we only have labeled source data for training
standard SVM: aims at separating the data in the correlation subspace without considering the domain transfer ability (i.e., the correlation between projected data), and thus we still observe prominent |wi| values at non-dominant feature dimensions (i.e., the 11th dimension)
our proposed SVM: suppresses the contributions of non-dominant feature dimensions in the correlated subspace, and thus only results in large |wi| values for dominant feature dimensions.
To recognize an action, one has to decide how to represent it.
Some authors utilized a human body model. They determined body poses by tracking limbs and torso, and recognized actions accordingly.
Besides that, some researchers focused more on the action itself rather than human body.
They proposed spatio-temporal volumes, which encode not only the spatial shape of the silhouette but also its change in the temporal domain.
After a short derivation, the maximization problem reduces to a generalized eigenvalue decomposition (GED) problem.
Usually one introduces a regularization term to alleviate singularity and overfitting issues.