This document proposes methods for action recognition using context and appearance distribution features. It first extracts interest points from videos and represents each video using two types of distribution features: 1) multiple Gaussian mixture models (GMMs) that characterize the distribution of relative coordinates (context) between interest point pairs at different space-time scales, and 2) a GMM that models the distribution of local appearance features around interest points. It then introduces a new learning method called Multiple Kernel Learning with Augmented Features (AFMKL) to effectively fuse the two heterogeneous distribution features for classification. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed approach.
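As an illustrative sketch (not the paper's implementation), the raw "context" observations are the relative space-time offsets between every ordered pair of interest points; a per-video GMM would then be fitted to these vectors:

```python
def relative_coordinates(points):
    # points: list of (x, y, t) interest-point coordinates.
    # Returns the relative (dx, dy, dt) offset for every ordered pair --
    # the raw observations a context GMM would be fitted to.
    return [(x2 - x1, y2 - y1, t2 - t1)
            for i, (x1, y1, t1) in enumerate(points)
            for j, (x2, y2, t2) in enumerate(points)
            if i != j]
```

For two points there are two ordered pairs, one offset and its negation; the GMMs at different space-time scales would summarize the distribution of many such offsets per video.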
Research on Image Classification Model of Probability Fusion Spectrum-Spatial...CSCJournals
High-spatial-resolution imagery offers limited spectral information but rich imaging detail, fewer mixed pixels, and more pure pixels, which creates problems for image feature extraction and classification. To address this, we propose a unified spectrum-spatial multi-feature classifier model based on SVM and use it to perform image classification. The model fully exploits multi-feature information while avoiding the over-fitting caused by stacking high-dimensional features. It combines three groups of spectrum-spatial features: spectral features with multi-scale morphological spectral features, spectral features with multi-scale morphological physical features of the underlying surface, and spectral features with multi-scale morphological spatial-extension features. Each group is first classified with an SVM, and the per-pixel classification results are then fused probabilistically to obtain the final image classification. Experiments on WorldView-2 and ROSIS images show that the model achieves better classification results than the VS-SVM algorithm.
The document presents a method for detecting copy-move forgery in digital images using center-symmetric local binary pattern (CS-LBP). The key steps are:
1. The input image is converted to grayscale and divided into overlapping blocks.
2. CS-LBP features are extracted from each block to represent textures while being invariant to illumination and rotation. This reduces the feature dimensions compared to local binary pattern (LBP).
3. The distances between feature vectors are calculated and sorted lexicographically to group similar blocks together. Distances below a threshold indicate copied regions.
4. Post-processing with morphological operations fills holes and removes outliers to generate a mask of the forged regions.
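The per-block descriptor in step 2 can be sketched for a single 3x3 neighbourhood. CS-LBP compares only the 4 centre-symmetric pixel pairs, yielding a 4-bit code (16 bins) instead of LBP's 8-bit code (256 bins), which is the dimensionality reduction the summary mentions (minimal sketch, not the paper's exact code):

```python
def cs_lbp_code(patch, threshold=0):
    # patch: 3x3 grayscale neighbourhood as a list of lists.
    # Compare the 4 centre-symmetric pixel pairs; each comparison
    # contributes one bit, giving codes in [0, 16).
    pairs = [((0, 0), (2, 2)), ((0, 1), (2, 1)),
             ((0, 2), (2, 0)), ((1, 2), (1, 0))]
    code = 0
    for bit, ((ay, ax), (by, bx)) in enumerate(pairs):
        if patch[ay][ax] - patch[by][bx] > threshold:
            code |= 1 << bit
    return code
```

A per-block feature vector is then typically the histogram of these codes over the block's pixels.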
Multiple Person Tracking with Shadow Removal Using Adaptive Gaussian Mixture ...IJSRD
Person detection is a major real-world concern in video surveillance systems, with applications such as abnormal event detection, congestion analysis, human gait characterization, fall detection, person identification, gender classification, and elderly care. This algorithm uses the Gaussian Mixture Model (GMM), a popular background-subtraction method that supports real-time object detection, for multi-person detection. Because GMM alone is not robust, multi-person tracking with shadow removal fills this gap: in this work, a hybrid HOG-LBP approach combined with the GMM algorithm is presented for multi-person tracking with shadow removal.
The development of multimedia technology for Content-Based Image Retrieval (CBIR) systems is one of the prominent areas for retrieving images from a large database. The feature vectors of the query image are compared with the feature vectors of the database images to obtain matching images. It is widely observed that no single algorithm is effective at extracting features from all the different kinds of natural images. An intensive analysis of selected color, texture, and shape extraction techniques is therefore carried out to identify an efficient CBIR technique suited to a particular type of image. Image extraction comprises feature description and feature extraction. In this paper, we propose Color Layout Descriptor (CLD), Gray Level Co-occurrence Matrix (GLCM), and Marker-Controlled Watershed Segmentation feature extraction techniques that retrieve matching images based on the similarity of color, texture, and shape within the database. For performance analysis, the image retrieval timing of the proposed technique is measured and compared with that of each individual feature.
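The texture component can be sketched with a minimal gray-level co-occurrence matrix: it counts how often gray level i is followed by gray level j at a fixed pixel offset, and texture descriptors such as contrast are computed from the counts (a hypothetical minimal version, not the paper's code):

```python
def glcm(image, dx=1, dy=0, levels=4):
    # image: 2-D list of integer gray levels in [0, levels).
    # Count co-occurrences of (level at p, level at p + (dy, dx)).
    m = [[0] * levels for _ in range(levels)]
    h, w = len(image), len(image[0])
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                m[image[y][x]][image[ny][nx]] += 1
    return m

def contrast(m):
    # GLCM contrast: co-occurrence counts weighted by (i - j)^2.
    return sum(m[i][j] * (i - j) ** 2
               for i in range(len(m)) for j in range(len(m)))

image = [[0, 0, 1],
         [1, 2, 2]]
g = glcm(image)
```

Library implementations usually normalize the matrix and symmetrize it over both offset directions; this sketch keeps raw counts for clarity.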
adaptive metric learning for saliency detection base paperSuresh Nagalla
This document proposes an adaptive metric learning algorithm (AML) for visual saliency detection. AML learns two complementary distance metrics: 1) a generic metric (GML) that considers the global distribution of training data, and 2) a specific metric (SML) that considers the structure of individual images. GML and SML are combined to better distinguish salient objects from background. The algorithm also uses superpixel-wise Fisher vector coding of low-level features to enhance saliency detection performance. Experimental results show the proposed AML approach outperforms other state-of-the-art saliency detection methods.
This document presents a method for interactive image segmentation using constrained active contours. It begins with an overview of existing interactive segmentation techniques, including boundary-based methods like active contours/snakes and region-based methods like random walks and graph cuts. The proposed method initializes a contour using region-based segmentation then refines it using a convex active contour model that incorporates both regional information from seed pixels and boundary smoothness. This allows the contour to globally evolve to object boundaries while handling topology changes.
IJERA (International Journal of Engineering Research and Applications) is an international, online, ... peer-reviewed journal. For more details or to submit your article, please visit www.ijera.com
Performance of Efficient Closed-Form Solution to Comprehensive Frontier Exposureiosrjce
This document discusses boundary detection techniques for images. It proposes a generalized boundary detection method (Gb) that combines low-level and mid-level image representations in a single eigenvalue problem to detect boundaries. Gb achieves state-of-the-art results at low computational cost. Soft segmentation and contour grouping methods are also introduced to further improve boundary detection accuracy with minimal extra computation. The document presents outputs of Gb on sample images and concludes that Gb effectively detects boundaries in a principled manner by jointly resolving constraints from multiple image interpretation layers in closed form.
This document summarizes a method for enhancing fingerprint images using short-time Fourier transform (STFT) analysis. The key steps are:
1. Performing STFT analysis on overlapping windows of the fingerprint image to estimate the local ridge orientation, frequency, and region mask in each window.
2. Using the estimated orientation and frequency values to filter each window of the fingerprint image in the Fourier domain for enhancement.
3. Reconstructing the enhanced fingerprint image by combining the results from each analyzed window.
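The per-window frequency estimate in step 1 can be illustrated with a one-dimensional sketch (a deliberate simplification of the 2-D STFT analysis): the dominant ridge frequency of a window is the peak of its Fourier magnitude spectrum after removing the DC term.

```python
import numpy as np

def dominant_frequency_bin(window):
    # Remove the DC component, then return the index of the largest
    # magnitude in the real FFT of the window -- a 1-D analogue of
    # per-window ridge-frequency estimation.
    spectrum = np.abs(np.fft.rfft(window - np.mean(window)))
    return int(np.argmax(spectrum))

# Synthetic "ridge" profile: exactly 4 cycles across a 64-sample window.
window = np.sin(2 * np.pi * 4 * np.arange(64) / 64)
```

In the full method the same idea runs in 2-D per window, and the orientation estimate comes from the angular distribution of the same spectrum.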
A feature selection method for automatic image annotationinventionjournals
ABSTRACT: Automatic image annotation (AIA) bridges high-level semantic information and low-level features, and is an effective way to address the "semantic gap" problem. Exploiting the intrinsic character of AIA, common features are selected from labeled images using a multiple-instance learning method, and this feature selection method is applied to the automatic image annotation task in this paper. Each keyword is analyzed hierarchically at low granularity within the feature-selection framework. By mining common representative instances, the semantic similarity of images can be expressed effectively and better annotation results obtained, which testifies to the effectiveness of the proposed annotation method. A Gaussian mixture model is built over the selected features to characterize each labeled keyword. The experimental results illustrate the good performance of the proposed AIA method.
This project retrieves similar geographic images from a dataset based on extracted features. Retrieval is the process of collecting the relevant images from a dataset containing many images. First, a preprocessing step removes noise from the input image using a Gaussian filter. Second, Gray Level Co-occurrence Matrix (GLCM), Scale Invariant Feature Transform (SIFT), and moment-invariant feature algorithms are applied to extract features from the images. The relevant geographic images are then retrieved from the dataset using Euclidean distance. The dataset here consists of 40 images in total, from which the images related to the input image are retrieved by Euclidean distance. For SIFT to perform reliable recognition, it is important that the features extracted from the training image be detectable even under changes in image scale, noise, and illumination. The GLCM records how often pairs of gray-level values occur together. While the focus is on image retrieval, the approach can also be used effectively in applications such as detection and classification.
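The Euclidean-distance retrieval step can be sketched as ranking database feature vectors by distance to the query vector (hypothetical feature values, for illustration only):

```python
import math

def retrieve(query, database, k=3):
    # Rank database images by Euclidean distance between feature
    # vectors (smaller distance = more similar); return the k nearest
    # image names.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(database.items(), key=lambda item: dist(query, item[1]))
    return [name for name, _ in ranked[:k]]

# Hypothetical 2-D feature vectors for three database images.
db = {"img1": [0.1, 0.9], "img2": [0.8, 0.2], "img3": [0.15, 0.85]}
```

Real feature vectors from GLCM/SIFT/moment features are much higher-dimensional, but the ranking logic is the same.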
Adaptive Multiscale Stereo Images Matching Based on Wavelet Transform Modulus...CSCJournals
In this paper we propose a multiscale stereo correspondence matching method based on wavelet transform modulus maxima. Exploiting modulus-maxima chains lets us refine the search for correspondences. Using the wavelet transform we construct modulus and phase maps at different scales, extract the maxima, and then build chains of maxima; the points along these chains are treated as interest points in the matching process. Having all of this multiscale information available allows us to search, under geometric constraints, for the best corresponding point among the chains of the right image for each interest point in the left image. The experimental results show that the number of candidate correspondences decreases sharply as the scale increases. In several tests we obtained a unique correspondence by traversing from fine to coarse scales, and the computational cost remains very reasonable. Abdelhak EZZINE (aezzine@uae.ac.ma), ENSAT/SIC/LABTIC, Abdelmalek ESSAADI University, Tangier, 99000, Morocco.
An ensemble classification algorithm for hyperspectral imagessipij
Hyperspectral image analysis is used for many purposes in environmental monitoring, remote sensing, vegetation research, and land-cover classification. A hyperspectral image consists of many layers, each representing a specific wavelength; the layers stack on top of one another, forming a cube-like image over the entire spectrum. This work aims to classify hyperspectral images and produce an accurate thematic map. Spatial information is collected by applying morphological profiles and local binary patterns. A support vector machine, an efficient classifier for hyperspectral images, performs the classification, and a genetic algorithm is used to select the best feature subset for it. The selected features are classified to obtain the classes and produce a thematic map. Experiments on the AVIRIS Indian Pines and ROSIS Pavia University datasets show that the proposed method achieves accuracies of 93% on Indian Pines and 92% on Pavia University.
A NOVEL GRAPH REPRESENTATION FOR SKELETON-BASED ACTION RECOGNITIONsipij
Graph convolutional networks (GCNs) have proven effective for processing structured data, capturing the features of related nodes and improving model performance, and increasing attention is being paid to applying GCNs to skeleton-based action recognition. However, existing GCN-based methods face some challenges. First, the consistency of temporal and spatial features is ignored because features are extracted node by node and frame by frame. We design a generic representation of skeleton sequences for action recognition and propose a novel model called Temporal Graph Networks (TGN), which obtains spatiotemporal features simultaneously. Second, the adjacency matrix describing the relations between joints mostly depends on the physical connections between joints. We propose a multi-scale graph strategy that describes the relations between joints in the skeleton graph more appropriately, adopting a full-scale graph, a part-scale graph, and a core-scale graph to capture the local features of each joint and the contour features of the important joints. Extensive experiments on two large datasets, NTU RGB+D and Kinetics Skeleton, show that TGN with our graph strategy outperforms other state-of-the-art methods.
This document describes a narrow band region approach for 2D and 3D image segmentation using deformable curves and surfaces. Specifically, it develops a region energy term involving a fixed-width band around the evolving curve or surface. This energy achieves a balance between local gradient features and global region statistics. The region energy is formulated to allow efficient computation on explicit parametric models and implicit level set models. Two different region terms are introduced, each suited to different configurations of the target object and its surroundings. The document derives the mathematical framework for computing the region energies and their gradients to allow minimization via gradient descent. It then discusses numerical implementations and provides experiments segmenting medical and natural images.
The term "morphing" describes the combination of generalized image warping with a cross-dissolve between image elements; it is derived from "image metamorphosis". Morphing is an image processing technique typically used as an animation tool for the metamorphosis from one image to another. The idea is to specify a warp that distorts the first image into the second; its inverse distorts the second image into the first. The morph process consists of warping two images so that they have the same "shape", and then cross-dissolving the resulting images. The major problem is not the cross-dissolve but how to warp an image. In this paper I concentrate on one image-morphing methodology, "Mesh Warping".
Keywords: Morphing, Warping, Cross-dissolve, Metamorphosis.
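The cross-dissolve half of a morph can be sketched as a pixel-wise linear blend between the two (already warped) images; the preceding warp is the hard part:

```python
def cross_dissolve(img_a, img_b, t):
    # Pixel-wise linear blend of two same-sized grayscale images:
    # (1 - t) * A + t * B, with t sweeping from 0 to 1 over the morph.
    return [[(1 - t) * a + t * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]
```

In a full mesh-warp morph, each frame warps both source and destination toward an interpolated mesh and then blends them with a t matching the frame's position in the sequence.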
A MORPHOLOGICAL MULTIPHASE ACTIVE CONTOUR FOR VASCULAR SEGMENTATIONijbbjournal
This paper presents a morphological active contour ideal for vascular segmentation in biomedical images.
The unenhanced images of vessels and background are successfully segmented using a two-step
morphological active contour based upon Chan and Vese’s Active Contour without Edges. Using dilation
and erosion as an approximation of curve evolution, the contour provides an efficient, simple, and robust
alternative to solving partial differential equations used by traditional level-set Active Contour models. The
proposed method is demonstrated with segmented data set images and compared to results garnered from
multiphase Active Contour without Edges, morphological watershed, and Fuzzy C-means segmentations.
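The idea of approximating curve evolution with dilation and erosion can be sketched with a minimal binary dilation step (illustrative only, not the paper's implementation): dilating the contour's interior mask moves the contour outward by roughly one pixel, and erosion is the inward counterpart.

```python
def dilate(mask):
    # One binary dilation step with a 3x3 structuring element: a pixel
    # becomes foreground if any pixel in its (clipped) 3x3
    # neighbourhood is foreground. In a morphological active contour
    # this stands in for one outward curve-evolution step.
    h, w = len(mask), len(mask[0])
    return [[max(mask[ny][nx]
                 for ny in range(max(0, y - 1), min(h, y + 2))
                 for nx in range(max(0, x - 1), min(w, x + 2)))
             for x in range(w)] for y in range(h)]
```

A morphological Chan-Vese scheme alternates such steps with a per-pixel test against the inside/outside region averages to decide where the contour should advance or retreat.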
This document discusses approaches for video segmentation. It describes tracking particles across frames to identify motion patterns, then clustering the particles to obtain a pixel-wise segmentation over space and time. This addresses limitations of segmentation based on motion boundaries. Reality-based 3D models can help address complex spatial motions by representing objects and their relationships in 3D space. The document also reviews direct and feature-based motion estimation methods, variational and level-set segmentation frameworks, and challenges including fitting motion models to data and handling outliers.
The peer-reviewed International Journal of Engineering Inventions (IJEI) was started with a mission to encourage contributions to research in science and technology, and to encourage and motivate researchers working in challenging areas of science and technology.
Medial Axis Transformation based Skeletonzation of Image Patterns using Image...IOSR Journals
1) The document discusses extracting the medial axis transform (MAT) of an image pattern using the Euclidean distance transform. The image is first converted to binary, then the Euclidean distance transform is used to compute the distance of each non-zero pixel to the closest zero pixel.
2) The medial axis transform represents the core or skeleton of an image pattern. There are different algorithms for extracting the skeleton or medial axis, including sequential and parallel algorithms. The skeleton provides a simple representation that preserves topological and size characteristics of the original shape.
3) The document provides background on medial axis transforms and different skeletonization algorithms. It then describes preparing the binary image and applying the Euclidean distance transform to extract the MAT and skeleton.
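The Euclidean distance transform at the heart of the method can be sketched with a brute-force version (production code uses efficient two-pass algorithms; this is for illustration only). The medial axis then lies along the local maxima (ridge) of the transform.

```python
def euclidean_distance_transform(binary):
    # Brute-force EDT: for every foreground (1) pixel, the Euclidean
    # distance to the nearest background (0) pixel. O(n^2) in the
    # number of pixels -- fine for tiny demos, not for real images.
    h, w = len(binary), len(binary[0])
    zeros = [(y, x) for y in range(h) for x in range(w)
             if binary[y][x] == 0]
    return [[min(((y - zy) ** 2 + (x - zx) ** 2) ** 0.5
                 for zy, zx in zeros) if binary[y][x] else 0.0
             for x in range(w)] for y in range(h)]
```

For a 1-pixel-high strip `[0, 1, 1, 1, 0]` the transform is `[0, 1, 2, 1, 0]`, and the ridge at the middle pixel is exactly the strip's medial axis.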
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...IDES Editor
In this paper a method is proposed to discriminate real-world scenes into natural and man-made scenes of similar depth. The global roughness of a scene image varies as a function of image depth: increasing image depth increases roughness in man-made scenes, whereas natural scenes exhibit smooth behavior at greater depth. This particular arrangement of pixels in the scene structure is well explained by the local texture information at a pixel and its neighborhood. Our proposed method analyses the local texture information of a scene image using a texture unit matrix. For final classification we use both supervised and unsupervised learning, with a K-Nearest Neighbor (KNN) classifier and a Self-Organizing Map (SOM) respectively. The technique's very low computational complexity makes it useful for online classification.
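The local texture measurement can be sketched with the texture unit number of He and Wang, which the texture unit matrix accumulates over the image (minimal sketch, assuming the standard base-3 formulation rather than this paper's exact variant):

```python
def texture_unit_number(patch):
    # He & Wang texture unit: each of the 8 neighbours of the centre
    # pixel maps to 0 (smaller), 1 (equal) or 2 (larger) than the
    # centre; the texture unit number is the base-3 encoding of the
    # 8 elements, giving values in [0, 6561).
    c = patch[1][1]
    neighbours = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                  patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    n = 0
    for v in neighbours:
        n = n * 3 + (0 if v < c else 1 if v == c else 2)
    return n
```

A perfectly flat patch encodes as all 1s in base 3, i.e. (3^8 - 1) / 2 = 3280, while a bright centre on a dark background encodes as 0; the histogram of these numbers over the image forms the texture unit matrix used for classification.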
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
DOMAIN SPECIFIC CBIR FOR HIGHLY TEXTURED IMAGEScseij
It is a challenging task to build a CBIR system that works primarily on texture values, as their meaning and semantics need special care to be mapped to human language. We consider highly textured images with the properties entropy, homogeneity, contrast, cluster shade, and autocorrelation, and map them using a fuzzy min-max scale with respect to their degree (high, medium, low) and technical interpretation. The developed system performs well in terms of precision and recall, showing that the semantic gap has been reduced for CBIR on highly textured images.
Rapport Projet de fin d'etude sur le parc informatiqueHicham Ben
This is my final-year project report on the development of an IT asset management application, written as a fifth-year student at the National School of Applied Sciences of Tetouan (ENSAT), Morocco.
10 Insightful Quotes On Designing A Better Customer ExperienceYuan Wang
In an ever-changing landscape of one digital disruption after another, companies and organisations are looking for new ways to understand their target markets and engage them better. Increasingly they invest in user experience (UX) and customer experience design (CX) capabilities by working with a specialist UX agency or developing their own UX lab. Some UX practitioners are touting leaner and faster ways of developing customer-centric products and services, via methodologies such as guerilla research, rapid prototyping and Agile UX. Others seek innovation and fulfilment by spending more time in research, being more inclusive, and designing for social goods.
Experience is more than just an interface. It is a relationship, as well as a series of touch points between your brand and your customer. Here are our top 10 highlights and takeaways from the recent UX Australia conference to help you transform your customer experience design.
For full article, continue reading at https://yump.com.au/10-ways-supercharge-customer-experience-design/
How to Build a Dynamic Social Media PlanPost Planner
Stop guessing and wasting your time on networks and strategies that don’t work!
Join Rebekah Radice and Katie Lance to learn how to optimize your social networks, the best kept secrets for hot content, top time management tools, and much more!
Watch the replay here: bit.ly/socialmedia-plan
This document summarizes a method for enhancing fingerprint images using short-time Fourier transform (STFT) analysis. The key steps are:
1. Performing STFT analysis on overlapping windows of the fingerprint image to estimate the local ridge orientation, frequency, and region mask in each window.
2. Using the estimated orientation and frequency values to filter each window of the fingerprint image in the Fourier domain for enhancement.
3. Reconstructing the enhanced fingerprint image by combining the results from each analyzed window.
A feature selection method for automatic image annotationinventionjournals
ABSTRACT: Automatic image annotation (AIA) is the bridge of high-level semantic information and the low-level feature. AIA is an effective method to resolve the problem of “Semantic Gap”. According to the intrinsic character of AIA, some common features are selected from the labeled images by multiple instance learning method. The feature selection method is applied into the task of automatic image annotation in this paper. Each keyword is analyzed hierarchically in low-granularity-level under the framework of feature selection. Through the common representative instances are mined, the semantic similarity of images can be effectively expressed and the better annotation results are able to be acquired, which testifies the effectiveness of the proposed annotation method. Gaussian mixture model is built by the selected feature method to characterize the labeled keyword. The experimental results illustrate the good perfromance of AIA.
This project is to retrieve the similar geographic images from the dataset based on the features extracted.
Retrieval is the process of collecting the relevant images from the dataset which contains more number of
images. Initially the preprocessing step is performed in order to remove noise occurred in input image with
the help of Gaussian filter. As the second step, Gray Level Co-occurrence Matrix (GLCM), Scale Invariant
Feature Transform (SIFT), and Moment Invariant Feature algorithms are implemented to extract the
features from the images. After this process, the relevant geographic images are retrieved from the dataset
by using Euclidean distance. In this, the dataset consists of totally 40 images. From that the images which
are all related to the input image are retrieved by using Euclidean distance. The approach of SIFT is to
perform reliable recognition, it is important that the feature extracted from the training image be
detectable even under changes in image scale, noise and illumination. The GLCM calculates how often a
pixel with gray level value occurs. While the focus is on image retrieval, our project is effectively used in
the applications such as detection and classification.
Adaptive Multiscale Stereo Images Matching Based on Wavelet Transform Modulus...CSCJournals
In this paper we propose a multiscale stereo correspondence matching method based on wavelets transform modulus maxima. Exploitation of maxima modulus chains has given us the opportunity to refine the search for corresponding. Based on the wavelet transform we construct maps of modules and phases for different scales, then extracted the maxima and then we build chains of maxima. Points constituents maxima modulus chains will be considered as points of interest in matching processes. The availability of all its multiscale information, allows searching under geometric constraints, for each point of interest in the left image corresponding one of the best points of constituent chains of the right image. The experiment results demonstrate that the number of corresponding has a very clear decrease when the scale increases. In several tests we obtained the uniqueness of the corresponding by browsing through the fine to coarse scales and calculations remain very reasonable. Abdelhak EZZINE aezzine@uae.ac.ma 39 imm serghiniya Rue liban ENSAT/ SIC/LABTIC Abdelmalek ESSAADI University Tangier, 99000, Morocco
An ensemble classification algorithm for hyperspectral imagessipij
Hyperspectral image analysis has been used for many purposes in environmental monitoring, remote
sensing, vegetation research and also for land cover classification. A hyperspectral image consists of many
layers in which each layer represents a specific wavelength. The layers stack on top of one another making
a cube-like image for entire spectrum. This work aims to classify the hyperspectral images and to produce
a thematic map accurately. Spatial information of hyperspectral images is collected by applying
morphological profile and local binary pattern. Support vector machine is an efficient classification
algorithm for classifying the hyperspectral images. A genetic algorithm is used to obtain the best feature
subset for classification. The selected features are classified to obtain the classes and produce a
thematic map. Experiments are carried out with the AVIRIS Indian Pines and ROSIS Pavia University
datasets. The proposed method achieves an accuracy of 93% for Indian Pines and 92% for Pavia University.
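The feature-selection step can be sketched as a toy genetic algorithm over feature-subset bitmasks (an illustrative sketch; here `fitness` is a stand-in for the SVM classification accuracy a real system would compute per subset):

```python
import random

def genetic_feature_select(n_features, fitness, pop_size=20, generations=30, seed=0):
    """Toy genetic algorithm over feature-subset bitmasks.

    `fitness` scores a tuple of 0/1 flags (higher is better); the best mask
    found is returned. Selection is truncation, crossover is one-point, and
    every child receives a single-bit mutation.
    """
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(n_features)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_features)    # one-point crossover
            child = list(a[:cut] + b[cut:])
            i = rng.randrange(n_features)         # point mutation
            child[i] ^= 1
            children.append(tuple(child))
        pop = survivors + children
    return max(pop, key=fitness)
```

With a fitness that counts matches against a known target mask, the search converges on the target in a few generations.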
A NOVEL GRAPH REPRESENTATION FOR SKELETON-BASED ACTION RECOGNITIONsipij
Graph convolutional networks (GCNs) have been proven effective for processing structured data, since
they can capture the features of related nodes and improve model performance. Increasing attention is
being paid to employing GCNs in skeleton-based action recognition. But there are some challenges
with the existing methods based on GCNs. First, the consistency of temporal and spatial features is ignored
due to extracting features node by node and frame by frame. We design a generic representation of
skeleton sequences for action recognition and propose a novel model called Temporal Graph Networks
(TGN), which can obtain spatiotemporal features simultaneously. Secondly, the adjacency matrix of the
graph describing the relations between joints mostly depends on the physical connections between joints. We
propose a multi-scale graph strategy to appropriately describe the relations between joints in skeleton
graph, which adopts a full-scale graph, part-scale graph and core-scale graph to capture the local features
of each joint and the contour features of important joints. Extensive experiments are conducted on two
large datasets, NTU RGB+D and Kinetics Skeleton, and the experimental results show that TGN
with our graph strategy outperforms other state-of-the-art methods.
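The multi-scale graph idea can be illustrated with a small sketch: build a joint-level adjacency matrix from the bone list, then collapse it to a coarser part-level graph (the joint indices and part groupings below are hypothetical, not the NTU RGB+D skeleton):

```python
def adjacency(num_joints, bones):
    """Adjacency matrix of a skeleton graph from a list of (i, j) bones,
    with self-loops so every joint also attends to itself."""
    A = [[0] * num_joints for _ in range(num_joints)]
    for i in range(num_joints):
        A[i][i] = 1
    for i, j in bones:
        A[i][j] = A[j][i] = 1
    return A

def part_scale(A, parts):
    """Collapse a joint-level adjacency to a coarser part-level graph:
    two parts are connected if any of their joints are connected."""
    P = [[0] * len(parts) for _ in parts]
    for p, joints_p in enumerate(parts):
        for q, joints_q in enumerate(parts):
            if any(A[i][j] for i in joints_p for j in joints_q):
                P[p][q] = 1
    return P
```

A GCN layer would then propagate joint features through `A` at the full scale and through `P` at the part scale.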
This document describes a narrow band region approach for 2D and 3D image segmentation using deformable curves and surfaces. Specifically, it develops a region energy term involving a fixed-width band around the evolving curve or surface. This energy achieves a balance between local gradient features and global region statistics. The region energy is formulated to allow efficient computation on explicit parametric models and implicit level set models. Two different region terms are introduced, each suited to different configurations of the target object and its surroundings. The document derives the mathematical framework for computing the region energies and their gradients to allow minimization via gradient descent. It then discusses numerical implementations and provides experiments segmenting medical and natural images.
The term "morphing" is used to describe the combination of generalized image warping with a cross-dissolve
between image elements. The term is derived from "image metamorphosis". Morphing is an image
processing technique typically used as an animation tool for the metamorphosis from one image to another. The
idea is to specify a warp that distorts the first image into the second; its inverse will distort the second image into
the first. The morph process consists of warping two images so that they have the same "shape", and then cross-dissolving
the resulting images. Cross-dissolving itself is straightforward; the major problem is how to warp an image. In this paper I
have concentrated on one methodology of image morphing, named "Mesh Warping".
Keywords: Morphing, Warping, Cross-dissolve, Metamorphosis.
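The cross-dissolve component is simple to state in code (a grayscale sketch over same-sized images; the hard part, as noted above, is the warp that first aligns the two shapes):

```python
def cross_dissolve(img_a, img_b, t):
    """Blend two same-sized grayscale images: (1 - t) * A + t * B.
    t = 0 gives the first image, t = 1 the second; a full morph would
    first warp both images to a common intermediate shape."""
    return [
        [(1.0 - t) * a + t * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]
```

Sweeping `t` from 0 to 1 over warped frames produces the metamorphosis sequence.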
A MORPHOLOGICAL MULTIPHASE ACTIVE CONTOUR FOR VASCULAR SEGMENTATIONijbbjournal
This paper presents a morphological active contour ideal for vascular segmentation in biomedical images.
The unenhanced images of vessels and background are successfully segmented using a two-step
morphological active contour based upon Chan and Vese’s Active Contour without Edges. Using dilation
and erosion as an approximation of curve evolution, the contour provides an efficient, simple, and robust
alternative to solving partial differential equations used by traditional level-set Active Contour models. The
proposed method is demonstrated with segmented data set images and compared to results garnered from
multiphase Active Contour without Edges, morphological watershed, and Fuzzy C-means segmentations.
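The dilation/erosion approximation of curve evolution can be sketched on a binary mask (a 4-neighbourhood toy version, a simplified stand-in for the paper's morphological operators):

```python
def dilate(mask):
    """One dilation step of a binary mask (4-neighbourhood): the
    discrete analogue of moving the contour outward."""
    h, w = len(mask), len(mask[0])
    out = [[mask[y][x] for x in range(w)] for y in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w:
                        out[ny][nx] = 1
    return out

def erode(mask):
    """One erosion step: keep an interior pixel only if all 4 neighbours
    are set, moving the contour inward (border pixels are eroded)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if mask[y][x] and all(mask[ny][nx] for ny, nx in
                                  ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))):
                out[y][x] = 1
    return out
```

Alternating such steps, gated by the Chan-Vese region statistics, replaces the PDE-based level-set update.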
This document discusses approaches for video segmentation. It describes tracking particles across frames to identify motion patterns, then clustering the particles to obtain a pixel-wise segmentation over space and time. This addresses limitations of segmentation based on motion boundaries. Reality-based 3D models can help address complex spatial motions by representing objects and their relationships in 3D space. The document also reviews direct and feature-based motion estimation methods, variational and level-set segmentation frameworks, and challenges including fitting motion models to data and handling outliers.
Medial Axis Transformation based Skeletonization of Image Patterns using Image...IOSR Journals
1) The document discusses extracting the medial axis transform (MAT) of an image pattern using the Euclidean distance transform. The image is first converted to binary, then the Euclidean distance transform is used to compute the distance of each non-zero pixel to the closest zero pixel.
2) The medial axis transform represents the core or skeleton of an image pattern. There are different algorithms for extracting the skeleton or medial axis, including sequential and parallel algorithms. The skeleton provides a simple representation that preserves topological and size characteristics of the original shape.
3) The document provides background on medial axis transforms and different skeletonization algorithms. It then describes preparing the binary image and applying the Euclidean distance transform to extract the MAT and skeleton
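The distance-transform step can be sketched brute-force (fine for tiny binary patterns; real implementations use two-pass algorithms, and the input is assumed to contain at least one zero pixel):

```python
import math

def distance_transform(binary):
    """Brute-force Euclidean distance transform: for every non-zero pixel,
    the distance to the closest zero pixel (assumes at least one zero)."""
    h, w = len(binary), len(binary[0])
    zeros = [(y, x) for y in range(h) for x in range(w) if binary[y][x] == 0]
    dist = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if binary[y][x]:
                dist[y][x] = min(math.hypot(y - zy, x - zx) for zy, zx in zeros)
    return dist

def medial_axis(dist):
    """Pixels whose distance value is a local maximum approximate the
    medial axis; the MAT keeps these pixels together with their radii."""
    h, w = len(dist), len(dist[0])
    axis = []
    for y in range(h):
        for x in range(w):
            if dist[y][x] > 0 and all(
                dist[y][x] >= dist[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))):
                axis.append((y, x, dist[y][x]))
    return axis
```

On a thin horizontal bar, the ridge of the distance map runs along the bar's centre line, which is exactly the skeleton.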
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...IDES Editor
In this paper a method is proposed to discriminate
real-world scenes into natural and man-made scenes of similar
depth. The global roughness of a scene image varies as a function
of image depth. An increase in image depth leads to an increase in
roughness in man-made scenes; on the contrary, natural scenes
exhibit smooth behavior at higher image depth. This particular
arrangement of pixels in scene structure can be well explained
by local texture information in a pixel and its neighborhood.
Our proposed method analyses local texture information of a
scene image using texture unit matrix. For final classification
we have used both supervised and unsupervised learning using
K-Nearest Neighbor classifier (KNN) and Self Organizing
Map (SOM), respectively. This technique is suitable for online
classification due to its very low computational complexity.
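The texture unit computation referred to above can be sketched directly (the standard He and Wang ternary coding; the clockwise neighbour ordering used here is one common convention):

```python
def texture_unit(patch):
    """Texture unit number of a 3x3 patch (He and Wang): each of the 8
    neighbours is coded 0/1/2 as below/equal/above the centre pixel, and
    the codes form a base-3 number in 0..6560."""
    c = patch[1][1]
    # clockwise neighbour order starting at top-left
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    number = 0
    for k, (y, x) in enumerate(order):
        v = patch[y][x]
        code = 0 if v < c else (1 if v == c else 2)
        number += code * 3 ** k
    return number
```

The histogram of these numbers over an image (the texture unit matrix, or texture spectrum) is what the KNN and SOM classifiers operate on.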
DOMAIN SPECIFIC CBIR FOR HIGHLY TEXTURED IMAGEScseij
It is a challenging task to build a CBIR system which primarily works on texture values, as their meaning
and semantics need special care to be mapped to human-based languages. We have considered highly
textured images having properties (entropy, homogeneity, contrast, cluster shade, autocorrelation) and have
mapped them using a fuzzy min-max scale with respect to their degree (high, low, medium) and technical
interpretation. The developed system performs well in terms of precision and recall values, showing that
the semantic gap has been reduced for highly textured image based CBIR.
Final-year project report on an IT asset inventory (Hicham Ben)
This is the report of my final-year project, which concerns the development of an application for managing an IT asset inventory,
written as a fifth-year student at the National School of Applied Sciences of Tetouan (ENSAT) in Morocco.
Announcing the Final Examination of Mr. Paul Smith for the ...butest
Mr. Paul Smith will defend his dissertation titled "Multizoom Activity Recognition Using Machine Learning" on November 21, 2005 at 10:00 am in room CSB 232. His dissertation presents a system for detecting events in video using a multiview approach to detect and track heads and hands across cameras. It then demonstrates a machine learning algorithm called TemporalBoost that can recognize events using activity features extracted from multiple zoom levels. Mr. Smith received his B.Sc. from the University of Central Florida in 2000 and M.Sc. from the same institution in 2002. His dissertation committee is chaired by Drs. Mubarak Shah and Niels da Vitoria Lobo.
HUMAN ACTION RECOGNITION IN VIDEOS USING STABLE FEATURES sipij
Human action recognition is still a challenging problem and researchers are focusing to investigate this
problem using different techniques. We propose a robust approach for human action recognition. This is
achieved by extracting stable spatio-temporal features in terms of pairwise local binary pattern (P-LBP)
and scale invariant feature transform (SIFT). These features are used to train an MLP neural network
during the training stage, and the action classes are inferred from the test videos during the testing stage.
The proposed features match the motion of individuals well, and their consistency and accuracy are
higher on a challenging dataset. The experimental evaluation is conducted on a benchmark dataset commonly
used for human action recognition. In addition, we show that our approach outperforms the individual features,
i.e., using only the spatial or only the temporal feature.
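The basic LBP code underlying the P-LBP feature can be sketched for a single 3x3 patch (standard 8-neighbour LBP; the pairwise extension is the paper's own contribution):

```python
def lbp(patch):
    """8-bit local binary pattern of a 3x3 patch: each neighbour
    contributes a set bit if it is >= the centre pixel."""
    c = patch[1][1]
    # clockwise neighbour order starting at top-left
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for k, (y, x) in enumerate(order):
        if patch[y][x] >= c:
            code |= 1 << k
    return code
```

Histograms of these codes over a frame (or over pairs of frames, in the pairwise variant) form the descriptor fed to the MLP.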
Silhouette analysis based action recognition via exploiting human posesAVVENIRE TECHNOLOGIES
We propose a novel scheme for human action recognition that combines the advantages of both local and global representations.
We explore human silhouettes for human action representation by taking into account the correlation between sequential poses in an action.
Effective Object Detection and Background Subtraction by using M.O.IIJMTST Journal
This paper proposes efficient motion detection and people counting based on background subtraction using a dynamic threshold approach with mathematical morphology. These different methods are used for object detection, and their performance is compared in terms of detection accuracy. The techniques used are frame differencing and dynamic threshold based detection. After foreground object detection, parameters such as speed and velocity of motion are determined. Most previous methods depend on the assumption that the background is static over short time periods. In dynamic threshold based object detection, morphological processing and filtering are also used to remove unwanted pixels from the background. The background frame is updated by comparing the current frame intensities with a reference frame. Along with the dynamic threshold, mathematical morphology is used, which can greatly attenuate color variations generated by background motion while still highlighting moving objects. Finally, the simulation results show that the approximate median approach with mathematical morphology is more effective than prior background subtraction methods in dynamic texture scenes, and performance parameters of moving objects such as sensitivity, speed and velocity are evaluated using MOI.
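The frame-differencing and dynamic-threshold steps can be sketched as follows (a toy mean-plus-k-sigma threshold on grayscale frames; the paper's exact threshold rule may differ):

```python
def frame_difference(prev, curr):
    """Absolute per-pixel difference between two grayscale frames."""
    return [[abs(a - b) for a, b in zip(ra, rb)] for ra, rb in zip(prev, curr)]

def dynamic_threshold(diff, k=1.0):
    """Foreground mask with a threshold derived from the difference
    statistics (mean + k * std), so sensitivity adapts to the scene
    instead of being a fixed constant."""
    vals = [v for row in diff for v in row]
    mean = sum(vals) / len(vals)
    std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
    t = mean + k * std
    return [[1 if v > t else 0 for v in row] for row in diff]
```

Morphological opening (erosion then dilation) of the resulting mask would then remove the isolated noise pixels the abstract mentions.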
Complex Background Subtraction Using Kalman FilterIJERA Editor
This system performs background subtraction from a dynamic background. At any location of the scene, it extracts a sequence of regular video bricks, i.e., video volumes spanning both the spatial and temporal domains. Background modeling is thus posed as pursuing subspaces within the video bricks while adapting to scene variations. For each sequence of video bricks, it pursues the subspace by employing the autoregressive moving average model that jointly characterizes the appearance consistency and temporal coherence of the observations. During online processing, it uses the Kalman filter tracking algorithm for background/foreground classification and incrementally updates the subspaces to cope with disturbances from foreground objects and scene changes.
A Survey On Tracking Moving Objects Using Various AlgorithmsIJMTST Journal
Sparse representation has been applied to the object tracking problem. Mining the self-similarities between particles via multitask learning can improve tracking performance. However, some particles may differ from others when they are sampled from a large region, and imposing the same structure on all particles may degrade the results. To overcome this problem, we propose a tracking algorithm based on robust multitask sparse representation (RMTT) in this letter. When we learn the particle representations, we decompose the sparse coefficient matrix into two parts. Joint sparse regularization is imposed on one coefficient matrix while element-wise sparse regularization is imposed on the other. The former regularization exploits the self-similarities of particles while the latter considers the differences between them.
This document presents a method for tracking moving objects in video sequences using affine flow parameters combined with illumination insensitive template matching. The method extracts affine flow parameters from frames to model local object motion using affine transformations. It then applies template matching with illumination compensation to track objects across frames while being robust to illumination changes. The method is evaluated on various indoor and outdoor database videos and is shown to effectively track objects without false detections, handling issues like illumination variations, camera motion and dynamic backgrounds better than other methods.
This document summarizes a research paper on background subtraction techniques for motion detection in video. It describes a proposed technique that stores and compares past pixel values to the current value to determine if a pixel belongs to the background or foreground. It also discusses using a k-means algorithm and Gaussian mixture model to build a probabilistic background model and classify pixels. The paper evaluates different shadow detection approaches and finds RGB color spaces perform best for segmentation and shadow removal.
This document summarizes an approach for video segmentation based on motion coherence. It uses tracking to identify 2-D motion patterns with clustering. Pixels are clustered to obtain a segmentation in both space and time domains. The limitations of segmentation based solely on spatial motion can be addressed using 3-D models generated from range, image, or CAD data. These models can be semantically segmented and organized in different ways depending on their geometry and detail levels.
A New Algorithm for Tracking Objects in Videos of Cluttered ScenesZac Darcy
The work presented in this paper describes a novel algorithm for automatic video object tracking based on
a process of subtraction of successive frames, where the prediction of the direction of movement of the
object being tracked is carried out by analyzing the changing areas generated as result of the object’s
motion, specifically in regions of interest defined inside the object being tracked in both the current and the
next frame. Simultaneously, a minimization process is initiated which seeks to determine the location of
the tracked object in the next frame using a function that measures the degree of dissimilarity
between the region of interest defined inside the tracked object in the current frame and a moving
region in the next frame. This moving region is displaced in the direction of the object's motion predicted by
the subtraction of successive frames. Finally, the location of the moving region of interest in the
next frame that minimizes the proposed dissimilarity function corresponds to the predicted location of
the tracked object in the next frame. In addition, a testing platform is designed which is
used to create virtual scenarios that allow us to assess the performance of the proposed algorithm. These
virtual scenarios are exposed to heavily cluttered conditions where areas which surround the object being
tracked present a high variability. The results obtained with the proposed algorithm show that the tracking
process was successfully carried out in a set of virtual scenarios under different challenging conditions.
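The dissimilarity-minimization step can be sketched with a sum-of-absolute-differences search (an illustrative stand-in for the paper's dissimilarity function; a real tracker would restrict the search to the predicted motion direction):

```python
def sad(frame, template, top, left):
    """Sum of absolute differences between the template and the frame
    window whose top-left corner is (top, left): the dissimilarity score."""
    return sum(
        abs(frame[top + y][left + x] - template[y][x])
        for y in range(len(template))
        for x in range(len(template[0])))

def best_match(frame, template):
    """Location of the window minimizing the dissimilarity, i.e. the
    predicted position of the tracked region in this frame."""
    th, tw = len(template), len(template[0])
    candidates = [(y, x)
                  for y in range(len(frame) - th + 1)
                  for x in range(len(frame[0]) - tw + 1)]
    return min(candidates, key=lambda p: sad(frame, template, p[0], p[1]))
```

The returned coordinates play the role of the "predicted location of the object being tracked in the next frame".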
IRJET- Behavior Analysis from Videos using Motion based Feature ExtractionIRJET Journal
This document proposes a technique for analyzing human behavior in videos using motion-based feature extraction. It discusses how previous approaches have used spatial and temporal features to detect abnormal behaviors. The proposed approach extracts motion features from videos to represent each video with a single feature vector, rather than extracting features from each individual frame. This reduces the feature space and unnecessary information. The technique involves preprocessing videos into frames, extracting motion features, using KNN classification on the features to classify behaviors as normal or abnormal, and evaluating the method's performance on various metrics like accuracy, recall, and precision. Testing on fight and riot datasets showed the motion-based approach achieved higher accuracy, recall, precision and F-measure than a non-motion based approach.
This summarizes a document describing a modular approach for retrieving moments from lifelogs. The approach uses both textual analysis and semantic enrichment of queries as well as visual concept augmentation of images. It relies on an inverted index of visual concepts and metadata. The paper describes the main components, including data used, query interpretation methods, ranking approaches, filtering techniques, and clustering algorithms. It then outlines 12 runs using different combinations of these methods, focusing on query interpretation, subset selection, scoring, and refinement through weighting, thresholding, and visual or temporal clustering.
Proposed Multi-object Tracking Algorithm Using Sobel Edge Detection operatorQUESTJOURNAL
ABSTRACT: Tracking of moving objects, called video tracking, is used for measuring motion parameters and obtaining a visual record of the moving objects; it is an important area of application in image processing. In general there are two different approaches to object tracking: the first is recognition-based tracking, and the second is motion-based tracking. Video tracking systems offer wide possibilities in today's society and are used in various applications such as military, security, monitoring, robotics, and nowadays in day-to-day applications. However, video tracking systems still have many open problems, and various research activities on video tracking are being explored. This paper presents an algorithm for video tracking of moving targets using a contour-based detection technique that depends on the Sobel operator. The proposed system is suitable for indoor and outdoor applications. Our approach has the advantage of extending the applicability of the tracking system and, as presented here, improves the performance of the tracker, making high frame rate video tracking feasible. The goal of the tracking system is to analyze the video frames and estimate the position of a part of the input video frame (usually a moving object); our approach can detect and track more than one object and calculate the positions of the moving objects. Therefore, the aim of this paper is to construct a motion tracking system for moving objects. At the end of this paper, detailed outcomes and results are discussed using experimental results of the proposed technique.
Implementation of Object Tracking for Real Time VideoIDES Editor
Real-time tracking of object boundaries is an
important task in many vision applications. Here we propose
an approach to implement the level set method. This approach
does not need to solve any partial differential equations (PDEs),
thus reducing the computation dramatically compared with
optimized narrow band techniques proposed before. With our
approach, real-time level-set based video tracking can be
achieved.
Robust Human Tracking Method Based on Apperance and Geometrical Features in N...csandit
This paper proposes a robust tracking method which concatenates appearance and geometrical
features to re-identify humans in non-overlapping views. A uniformly-partitioning method is
proposed to extract local HSV(Hue, Saturation, Value) color features in upper and lower
portion of clothing. Then adaptive principal view selecting algorithm is presented to locate
principal view which contains maximum appearance feature dimensions captured from different
visual angles. For each appearance feature dimension in principal view, all its inner frames get
involved in training a support vector machine (SVM). In matching process, human candidate
filtering is first operated with an integrated geometrical feature which connects height estimate
with gait feature. The appearance features of the remaining human candidates are later tested
by SVMs to determine the object's existence in new cameras. Experimental results show the
feasibility and effectiveness of this proposal, demonstrating real-time performance in appearance
feature extraction and robustness to illumination and visual angle changes.
The extraction of spatial features from remotely sensed data, and the use of this information as input to further decision-making systems such as geographical information systems (GIS), has received considerable attention over the past few decades. The successful use of GIS as a decision support tool can only be achieved if it becomes possible to attach a quality label to the output of each spatial analysis operation. Thus the accuracy of spatial feature extraction has gained more attention, as geographic features can hardly be formulated in a fixed pattern due to intra-class variation and inter-class similarity. Beyond this, spatial feature extraction is further affected by positional uncertainty, attribute uncertainty, topological uncertainty, inaccuracy, imprecision, inconsistency, incompleteness, repetition, vagueness, noise, omission, misinterpretation, misclassification, abnormalities and knowledge uncertainty. To control and reduce uncertainty to an acceptable degree, a probabilistic shape model is described for extracting spatial features from multi-spectral images. The advantages of this, as opposed to conventional approaches, are greater accuracy and efficiency, and the results are in a more desirable form for most purposes.
This document proposes a convolutional neural network (CNN) to automatically classify aerial and remote sensing images. The CNN has six layers - three convolutional layers to extract visual features from the images at different levels of abstraction, two fully-connected layers to integrate the extracted features, and a final softmax classifier layer to classify the images. The CNN is evaluated on two datasets and is shown to outperform state-of-the-art baselines in terms of classification accuracy, demonstrating its ability to learn spatial features directly from images without relying on handcrafted features or descriptors.
Action Recognition using Context and Appearance Distribution Features
Xinxiao Wu Dong Xu Lixin Duan
School of Computer Engineering
Nanyang Technological University
Singapore
Jiebo Luo
Kodak Research Laboratories
Eastman Kodak Company
Rochester,USA
{xinxiaowu,DongXu,S080003}@ntu.edu.sg
Jiebo.Luo@kodak.com
Abstract
methods to fuse different types of features have become two
important issues in action recognition.
Human action recognition algorithms can be roughly categorized into model-based methods and appearance-based
approaches. Model-based methods [5, 29] usually rely on
human body tracking or pose estimation in order to model
the dynamics of individual body parts for action recognition. However, it is still a non-trivial task to accurately detect and track the body parts in unrestricted scenarios.
Appearance-based approaches mainly employ appearance features for action recognition. For example, global
space-time shape templates from image sequences are used
in [7, 30] to describe an action. However, with these methods, highly detailed silhouettes need to be extracted, which
may be very difficult in a realistic video. Recently, approaches [3, 21, 16] based on local spatio-temporal interest points have shown much success in action recognition.
Compared to the space-time shape and tracking based approaches, these methods do not require foreground segmentation or body parts tracking, so they are more robust to
camera movement and low resolution. Each interest point
is represented by its location (i.e., XYT coordinates) in the
3D space-time volume and its spatio-temporal appearance
information (e.g., the gray-level pixel values and 3DHoG).
Using only the appearance information of interest points,
many methods [3, 16, 21] model an action as a bag of independent and orderless visual words without considering the
spatio-temporal contextual information of interest points.
In Section 3, we first propose a new visual feature by
using multiple GMMs to characterize the spatio-temporal
context distributions about the relative coordinates between
pairwise interest points over multiple space-time scales.
Specifically, for each local region (i.e., sub-volume) in a
video, the relative coordinates between a pair of interest
points in XYT space is considered as the spatio-temporal
context feature. Then each action is represented by a set of
context features extracted from all pairs of interest points
over all the local regions in a video volume. Gaussian Mixture Model (GMM) is adopted to model the distribution of
We first propose a new spatio-temporal context distribution feature of interest points for human action recognition. Each action video is expressed as a set of relative
XYT coordinates between pairwise interest points in a local region. We learn a global GMM (referred to as Universal Background Model, UBM) using the relative coordinate features from all the training videos, and then represent each video as the normalized parameters of a videospecific GMM adapted from the global GMM. In order to
capture the spatio-temporal relationships at different levels, multiple GMMs are utilized to describe the context distributions of interest points over multi-scale local regions.
To describe the appearance information of an action video,
we also propose to use GMM to characterize the distribution of local appearance features from the cuboids centered
around the interest points. Accordingly, an action video
can be represented by two types of distribution features: 1)
multiple GMM distributions of spatio-temporal context; 2)
GMM distribution of local video appearance. To effectively
fuse these two types of heterogeneous and complementary
distribution features, we additionally propose a new learning algorithm, called Multiple Kernel Learning with Augmented Features (AFMKL), to learn an adapted classifier
based on multiple kernels and the pre-learned classifiers of
other action classes. Extensive experiments on the KTH, multi-view IXMAS and complex UCF sports datasets demonstrate that our method generally achieves higher recognition accuracy than other state-of-the-art methods.
1. Introduction
Recognizing human actions from videos still remains
a challenging problem due to the large variations in human appearance, posture and body size within the same
class. It also suffers from various factors such as cluttered
background, occlusion, camera movement and illumination
change. How to extract discriminative and robust features to describe actions, and how to design effective learning methods, remain two central challenges.
However, the context features from one video may not contain sufficient information
to robustly estimate the parameters of GMM. Therefore, we
first learn a global GMM (referred to as Universal Background Model, UBM) by using the context features from all
the training videos, and then describe each video as a video-specific GMM adapted from the global GMM via a Maximum A Posteriori (MAP) adaptation process. In this work,
multiple GMMs are exploited to cope with different levels
of spatio-temporal contexts of interest points from different
space-time scales. GMM is also used to characterize the
appearance distribution of local cuboids that are centered
around the interest points.
Multi-scale spatio-temporal context distribution feature
and appearance distribution feature characterize the properties of “where” and “what” of interest points, respectively.
In Section 4, we propose a new learning method called Multiple Kernel Learning with Augmented Features (AFMKL)
to effectively integrate two types of complementary features. Specifically, we propose to learn an adapted classifier based on multiple kernels constructed from different
types of features and the pre-learned classifiers from other
action classes. AFMKL is motivated by the observation
that some actions share similar motion patterns (e.g., the
actions of “walking”, “jogging” and “running” may share
some similar motions of hands and legs). It is beneficial
to learn an adapted classifier for “jogging” by leveraging
the pre-learned classifiers from “walking” and “running”.
It is interesting to observe that the dual of our new objective function is similar to that of Generalized Multiple Kernel Learning (GMKL) [23] except that the kernel matrix is
computed based on the Augmented Features, which combine the original context/appearance distribution feature with the decision values from the pre-learned SVM classifiers of all the classes.
In Section 5, we conduct comprehensive experiments
on the KTH, multi-view IXMAS and UCF sports datasets,
and the results demonstrate the effectiveness of our method.
The main contributions of this work include: 1) we propose a new multi-scale spatio-temporal context distribution
feature; 2) we propose a new learning method AFMKL to
effectively fuse the context distribution feature and the appearance distribution feature; 3) our method generally outperforms the existing methods on three benchmark datasets,
demonstrating promising results for realistic action recognition.
of interest points from the same visual word in a local
region, and concatenated all the local histograms into a
lengthy descriptor. Ryoo et al. [19] proposed a so-called
“featuretype×featuretype×relationship” histogram to capture both appearance and relationship information between
pairwise visual words. All these methods first utilized the
vector quantization process to generate a codebook and then
adopted the “bag-of-words” representation. The quantization error may be propagated to the spatio-temporal context
features and may degrade the final recognition performance.
In our work, the spatio-temporal context feature is directly extracted from the detected interest points rather than from visual words, so our method does not suffer from vector quantization error. Moreover, the global GMM represents the visual content of all the action classes, including possible variations in human appearance, motion style and environmental conditions, and the video-specific GMM adapted from the global GMM provides additional information to distinguish the videos of different classes.
Sun et al. [22] presented a hierarchical structure to model
the spatio-temporal context information of SIFT points, and
their model consists of point-level context, intra-trajectory
context and inter-trajectory context. Bregonzio et al. [1]
created the clouds of interest points accumulated over multiple temporal scales, and extracted holistic features of
the clouds as the spatio-temporal information of interest
points. Zhang et al. [31] extracted the motion words from
the motion images and utilized the relative locations between the motion words and a reference point in a local
region to establish the spatio-temporal context information.
These methods mentioned above are based on various preprocessing steps, such as feature tracking, human body detection and foreground segmentation. As will be shown
without requiring any pre-processing step, our method can
still achieve promising performance even in complex environments with changing lighting and moving cameras.
Vega et al. [24] and Nayak et al. [15] proposed to use the distribution of pairwise relationships between edge primitives to capture the shape of an action in the 2D image. In contrast to [24, 15], the relationship between interest points in this paper is defined in the 3D video volume, capturing both the motion and the shape of an action. Moreover, they described the relational distribution only over the whole image, while we use multiple GMMs to model the context distribution of interest points at multiple space-time scales.
2. Related Work
Recently, researchers have exploited the spatial and temporal context as another type of information for describing
interest points. Kovashka and Grauman [11] exploited multiple “bag-of-words” models to represent the hierarchy of
space-time configurations at different scales. Savarese et
al. [20] used a local histogram to capture co-occurrences
3. Spatio-Temporal Context Distribution Feature and Appearance Distribution Feature

3.1. Detection of interest points
We use the interest point detector proposed by Dollar et al. [3], which employs a 2D Gaussian filter along the spatial dimensions and a 1D Gabor filter along the temporal dimension. The two separable filters produce a high response at
Figure 1. Samples of detected interest points on the KTH dataset (axes: width X, height Y, time T).
Figure 2. The multi-scale spatio-temporal context extraction from interest points of one "handwaving" video in a 3D video volume (a) and in a local region (b).
points with significant spatio-temporal intensity variations.
The response function has the form:
$$R = (I(x, y, t) * g(x, y) * h_{ev}(t))^2 + (I(x, y, t) * g(x, y) * h_{od}(t))^2,$$
the mean of distances between all the interest points and
the center interest point. A large number of relative coordinates extracted from all the local regions over the entire
video collectively describe the spatio-temporal context information of interest points for an action.
To capture the spatio-temporal context of interest points
at different space-time scales, we use multi-scale local regions across multiple space-time scales to generate multiple sets of local context features (i.e., XYT relative coordinates). Each set represents the spatio-temporal context at
one space-time scale. Figure 2(a) illustrates the detected
interest points of one video from the “handwaving” action,
where the blue dots are the interest points in 3D video volume and the red bounding boxes represent certain local regions at different space-time scales. In our experiments, for
computational simplicity and efficiency, we set the spatial
size of the local region the same as that of each frame, and
we use multiple temporal scales represented by different
numbers of frames. Consequently, the local regions are generated by simply moving a spatio-temporal window frame
by frame through the video. Suppose there are T local regions at one scale and R relative coordinates in each local
region, then the total number of relative coordinates in the
entire video at this scale is N = RT .
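The extraction procedure above can be sketched as follows; `context_features` is a hypothetical helper (not the authors' code), assuming interest points are given as (x, y, t) rows and the spatial extent of each region equals the full frame:

```python
import numpy as np

def context_features(points, temporal_scales):
    """Collect relative XYT coordinates of interest points with respect to the
    center point of each local region, over sliding temporal windows.
    points: (P, 3) array of [x, y, t]; returns one feature set per scale."""
    feats = {}
    t_max = int(points[:, 2].max())
    for L in temporal_scales:
        coords = []
        for start in range(0, max(t_max - L + 2, 1)):       # move the window frame by frame
            region = points[(points[:, 2] >= start) & (points[:, 2] < start + L)]
            if len(region) < 2:
                continue
            center = region.mean(axis=0)                    # minimizer of the summed squared distances
            rel = region - center                           # s_i = [Xi-Xc, Yi-Yc, Ti-Tc]
            rel = rel / np.linalg.norm(rel, axis=1).mean()  # normalize by the mean distance
            coords.append(rel)
        feats[L] = np.vstack(coords)                        # N = R*T relative coordinates per scale
    return feats
```

Note that the unconstrained minimizer of the summed squared distances is simply the mean of the points in the region, which is what the sketch uses for the center.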
where $g(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2+y^2}{2\sigma^2}}$ is the 2D Gaussian smoothing kernel along the spatial dimensions, and $h_{ev}(t) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2}$ and $h_{od}(t) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2}$ are a pair of 1D Gabor filters along the temporal dimension. With the constraint $\omega = 4/\tau$, $\sigma$ and $\tau$ are the two parameters controlling the spatial and temporal scales, respectively. Figure 1 shows some examples of detected interest points (depicted as red dots) for several human actions. It is evident that most detected interest points lie near the body parts that contribute most to action classification.
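As an illustrative sketch of this detector (not the authors' implementation), the response above can be computed with separable filtering; the filter support of about two temporal scales is an assumption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def dollar_response(video, sigma=2.5, tau=2.0):
    """Response R = (I*g*h_ev)^2 + (I*g*h_od)^2 of Dollar et al.'s detector.
    video: (T, H, W) gray-level array; interest points are local maxima of R."""
    omega = 4.0 / tau                                   # the paper's constraint omega = 4/tau
    # 2D Gaussian smoothing applied only along the spatial (H, W) dimensions
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))
    half = 2 * int(np.ceil(tau))                        # assumed temporal filter support
    t = np.arange(-half, half + 1)
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    ev = convolve1d(smoothed, h_ev, axis=0)             # 1D Gabor along the temporal axis
    od = convolve1d(smoothed, h_od, axis=0)
    return ev**2 + od**2
```

The quadrature pair makes the response insensitive to the phase of the temporal signal, so both periodic and transient motions fire.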
3.2. Multi-scale spatio-temporal context extraction
As an important type of action representation, the spatio-temporal context information of interest points characterizes both the relative spatial layout of human body parts and
temporal evolution of human poses. In order to represent
the spatio-temporal context between interest points, we propose a new local spatio-temporal context feature using a set
of XYT relative coordinates between any pairs of interest
points in a local region. Suppose there are R interest points
in a local region, then the number of pairwise relative coordinates is R(R − 1). For efficient computation and compact
description, we define a center interest point by:
$$[X_c\ Y_c\ T_c]^T = \arg\min_{[X_c, Y_c, T_c]} \sum_{i=1}^{R} \left\| [X_i\ Y_i\ T_i]^T - [X_c\ Y_c\ T_c]^T \right\|^2,$$
3.3. Multi-scale spatio-temporal context distribution feature
For each action video, a video-specific Gaussian Mixture
Model (GMM) is employed to characterize the distribution
of the set of spatio-temporal context features at one space-time scale. Since the context features extracted from one video may not contain sufficient information to robustly estimate the parameters of the video-specific GMM, we propose a two-step approach. We first train a global GMM (also referred to as the Universal Background Model, UBM) using the spatio-temporal context features from all the training videos.
where [Xc Yc Tc ]T and [Xi Yi Ti ]T respectively represent
the coordinates of the center and the i-th interest point in a
local video region. Consequently, the spatio-temporal contextual information of interest points is characterized by R
relative coordinates between all the interest points and the
center interest point, i.e., si = [Xi − Xc Yi − Yc Ti − Tc ]T ,
i = 1, 2, ..., R. As shown in Figure 2(b), the red star marker
represents the center interest point of all interest points in
a local region. The relative coordinates are normalized by
The global GMM is represented by its parameter set $\{(m_k, \mu_k, \Sigma_k)\}_{k=1}^{K}$, where $K$ is the total number of GMM components, and $m_k$, $\mu_k$ and $\Sigma_k$ are the weight, mean vector and covariance matrix of the $k$-th Gaussian component, subject to the constraint $\sum_{k=1}^{K} m_k = 1$. As suggested in [32], each covariance matrix $\Sigma_k$ is set to be diagonal for computational efficiency. We adopt the well-known Expectation-Maximization (EM) algorithm to iteratively update the weights, means and covariance matrices: $m_k$ is initialized with uniform weights, and we partition all the training context features into $K$ clusters and use the samples in each cluster to initialize $\mu_k$ and $\Sigma_k$.
The video-specific GMM for each video is then generated from the global GMM via a Maximum A Posteriori (MAP) adaptation process. Given the set of spatio-temporal context features $\{s_1, s_2, \ldots, s_N\}$ extracted from an action video $V$, where $s_i \in \mathbb{R}^3$ denotes the $i$-th context feature vector (i.e., the XYT relative coordinates), we introduce an intermediate variable $\eta(k|s_i)$ to indicate the membership probability of $s_i$ belonging to the $k$-th GMM component (given below). Using the adapted weights $\hat{m}_k$ and means $\hat{\mu}_k$ obtained from this adaptation, the spatio-temporal context distribution feature $x$ of video $V$ is represented by
$$x = [v_1^T, v_2^T, \ldots, v_K^T]^T \in \mathbb{R}^{D}, \qquad v_k = \sqrt{\hat{m}_k}\,\Sigma_k^{-1/2}\hat{\mu}_k \in \mathbb{R}^{3},$$
where $D = 3K$ is the feature dimension of $x$.
To capture the context distributions at different spatio-temporal levels, we use multiple GMMs, with each GMM representing the context distribution of interest points at one space-time scale.
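The UBM training step might be sketched with scikit-learn as follows; `train_ubm` is a hypothetical helper, and the diagonal covariance and k-means initialization follow the description above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_context_features, K=300, seed=0):
    """Fit the global GMM (UBM) on the pooled relative-coordinate features of
    all training videos at one space-time scale."""
    X = np.vstack(all_context_features)            # (total N over videos, 3)
    ubm = GaussianMixture(n_components=K,
                          covariance_type='diag',  # diagonal Sigma_k, as in [32]
                          init_params='kmeans',    # k-means clusters initialize mu_k, Sigma_k
                          random_state=seed).fit(X)
    return ubm                                     # ubm.weights_, ubm.means_, ubm.covariances_
```

One UBM would be trained per space-time scale, since each scale yields its own pool of context features.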
3.4. Local video appearance distribution feature
After detecting the interest points, we also extract the appearance information from the cuboids around the interest
points. For simplicity, we flatten each normalized cuboid and extract its gray-level pixel values. Principal Component Analysis (PCA) is then used to reduce the dimension of the appearance feature vector by
preserving 98% energy. Similar to the spatio-temporal context distribution feature, we also employ GMM to describe
the distribution of local appearance for each action video by
using the two-step approach discussed in Section 3.3.
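This appearance feature step can be sketched as follows (an illustrative use of scikit-learn, not the authors' implementation; the 7x7x5 cuboid size is taken from the experimental settings later in the paper):

```python
import numpy as np
from sklearn.decomposition import PCA

def appearance_features(cuboids):
    """Flatten each gray-level cuboid (e.g., 7x7x5) and reduce its dimension
    with PCA, keeping 98% of the variance (the paper's 'energy' criterion)."""
    X = np.array([c.ravel() for c in cuboids])   # (num_points, 7*7*5) = (num_points, 245)
    pca = PCA(n_components=0.98).fit(X)          # a float keeps that fraction of variance
    return pca.transform(X), pca
```

The reduced vectors then play the same role as the context features: a global appearance UBM is trained on them and MAP-adapted per video.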
Discussion: In [32], the GMM is adopted to represent
the distribution of SIFT features in a video for event recognition. While SIFT is a type of static feature extracted from
2D image patches, our cuboid feature captures both static
and motion information from 3D video sub-volume which
is more effective for describing actions.
$$\eta(k|s_i) = \frac{m_k P_k(s_i|\theta_k)}{\sum_{j=1}^{K} m_j P_j(s_i|\theta_j)},$$
where $P_k(s_i|\theta_k)$ denotes the Gaussian probability density function with $\theta_k = \{\mu_k, \Sigma_k\}$; note that $\sum_{k=1}^{K}\eta(k|s_i) = 1$. Let $\zeta_k = \sum_{i=1}^{N}\eta(k|s_i)$ be the soft count of the context features $\{s_i\}_{i=1}^{N}$ belonging to the $k$-th GMM component. Then the $k$-th component of the video-specific GMM of any video $V$ can be adapted as follows:
$$\bar{\mu}_k = \frac{1}{\zeta_k}\sum_{i=1}^{N}\eta(k|s_i)\,s_i, \qquad \hat{\mu}_k = (1-\xi_k)\,\mu_k + \xi_k\,\bar{\mu}_k, \qquad \hat{m}_k = \frac{\zeta_k}{N},$$

4. Multiple Kernel Learning with Augmented Features (AFMKL) for Action Recognition
To fuse the spatio-temporal context distribution and local appearance distribution features for action recognition,
we propose a new learning method called Multiple Kernel
Learning with Augmented Features (AFMKL) to learn a
robust classifier by using multiple base kernels and a set
of pre-learned SVM classifiers from other action classes.
The introduction of the pre-learned classifiers from other
classes is based on the observation that some actions may
share common motion patterns. For example, the actions of
“walking” , “jogging” and “running” may share some typical motions of hands and legs, therefore it is beneficial to
learn an adapted classifier for “jogging” by leveraging the
pre-learned classifiers of “walking” and “running”.
Specifically, in AFMKL we use $M$ base kernel functions $k_m(x_i, x_j) = \phi_m(x_i)^T \phi_m(x_j)$, $m = 1, 2, \ldots, M$, where $x_i$ and $x_j$ can be the context or appearance distribution features extracted from videos $V_i$ and $V_j$, respectively. In this work we use linear kernels, i.e., $k_m(x_i, x_j) = x_i^T x_j$, and we have $M = H + 1$, where $H$ is the number of space-time scales in the context distribution features. The base kernel functions are linearly combined to determine an optimal kernel function $k = \sum_{m=1}^{M} d_m k_m$, where the $d_m$'s
Here $\bar{\mu}_k$ is the expected mean of the $k$-th component computed from the samples $\{s_i\}_{i=1}^{N}$, and $\hat{\mu}_k$ is the adapted mean. The weighting coefficient $\xi_k = \frac{\zeta_k}{\zeta_k + r}$ is introduced to improve the estimation accuracy of the mean vector. If a Gaussian component has a high soft count $\zeta_k$, then $\xi_k$ approaches 1 and the adapted mean is mainly determined by the statistics of the video's own samples; otherwise, the adapted mean is mainly determined by the global model. Following [32], only the means and weights of the GMM are adapted, which better copes with the instability of parameter estimation and reduces the computational cost. Therefore, the video $V$ is represented by the video-specific GMM parameter set $\{(\hat{m}_k, \hat{\mu}_k, \Sigma_k)\}_{k=1}^{K}$, where $\Sigma_k$ is the covariance matrix of the $k$-th Gaussian component from the UBM.
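As an illustrative sketch of the adaptation step, assuming the UBM is an sklearn `GaussianMixture` with diagonal covariances; the relevance factor r = 16 is a common choice in the UBM literature, not a value stated by the paper:

```python
import numpy as np

def map_adapt(ubm, S, r=16.0):
    """MAP-adapt the UBM means/weights to one video's context features S (N, 3)
    and return the supervector x = [v_1^T, ..., v_K^T]^T with
    v_k = sqrt(m_hat_k) * Sigma_k^{-1/2} * mu_hat_k."""
    eta = ubm.predict_proba(S)                     # (N, K) memberships eta(k|s_i)
    zeta = eta.sum(axis=0)                         # soft counts zeta_k
    xi = zeta / (zeta + r)                         # adaptation coefficients xi_k
    mu_bar = (eta.T @ S) / np.maximum(zeta, 1e-10)[:, None]
    mu_hat = (1 - xi)[:, None] * ubm.means_ + xi[:, None] * mu_bar
    m_hat = zeta / len(S)                          # adapted weights
    # diagonal Sigma_k, so Sigma_k^{-1/2} is an elementwise reciprocal square root
    v = np.sqrt(m_hat)[:, None] * mu_hat / np.sqrt(ubm.covariances_)
    return v.ravel()                               # x in R^{3K}
```

Components with little evidence in the video stay close to the UBM, so two videos remain comparable component-by-component.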
are the linear combination coefficients, with $\sum_{m=1}^{M} d_m = 1$ and $d_m \geq 0$. For each class, we pre-learn one SVM classifier using the appearance feature and one SVM classifier that fuses the $H$ context features from different space-time scales in an early-fusion fashion. Given $C$ classes, we thus have $L = 2C$ pre-learned classifiers $\{f_l(x)\}_{l=1}^{L}$. Given the input feature $x$ of a video, the final decision function is defined as
$$f(x) = \sum_{l=1}^{L} \beta_l f_l(x) + \sum_{m=1}^{M} d_m w_m^T \phi_m(x) + b, \qquad (1)$$
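The convex combination of base kernels used in (1) can be sketched as follows (`combined_kernel` is a hypothetical helper, assuming the linear kernels stated above):

```python
import numpy as np

def combined_kernel(features, d):
    """Optimal kernel as a convex combination of M linear base kernels.
    features: list of M arrays, each (n_videos, dim_m); d: weights summing to 1."""
    assert abs(sum(d) - 1.0) < 1e-9 and all(w >= 0 for w in d)
    # k_m(x_i, x_j) = x_i^T x_j for each feature type / space-time scale
    return sum(w * X @ X.T for w, X in zip(d, features))
```

Since each base kernel matrix is positive semi-definite and the weights are non-negative, the combined matrix is a valid kernel.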
Specifically, the kernel function is replaced by $\sum_{m=1}^{M} d_m \tilde{k}_m$ in our work. Since we use linear kernels (i.e., $\phi(x_i) = x_i$), $\tilde{k}_m(x_i, x_j)$ can be computed using the augmented features $[x_i^T, \frac{1}{\sqrt{\lambda}} f_1(x_i), \ldots, \frac{1}{\sqrt{\lambda}} f_L(x_i)]^T$ and $[x_j^T, \frac{1}{\sqrt{\lambda}} f_1(x_j), \ldots, \frac{1}{\sqrt{\lambda}} f_L(x_j)]^T$, which combine the original context/appearance feature with the decision values from the pre-learned SVM classifiers of all the classes. We therefore name our method Multiple Kernel Learning with Augmented Features (AFMKL).
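The augmentation step can be sketched as follows (`augment` is a hypothetical helper; `prelearned` is assumed to be a list of functions returning the decision values of the pre-learned classifiers):

```python
import numpy as np

def augment(X, prelearned, lam=1.0):
    """Append the decision values of the L pre-learned classifiers, scaled by
    1/sqrt(lambda), so the linear kernel on the result equals
    k~(x_i, x_j) = x_i^T x_j + (1/lambda) * sum_l f_l(x_i) f_l(x_j)."""
    F = np.column_stack([f(X) for f in prelearned])   # (n, L) decision values
    return np.hstack([X, F / np.sqrt(lam)])
```

This makes the augmented-feature kernel available to any off-the-shelf MKL or SVM solver without modifying the solver itself.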
Following [23], we iteratively update the linear combination coefficients $d$ and the dual variables $\alpha$ to solve (2). In [23], a first-order gradient descent method is used to update $d$; in contrast, we employ a second-order gradient descent method because of its faster convergence. After obtaining the optimal $d$ and $\alpha$, the decision function in (1) can be rewritten as
$$f(x) = \sum_{i=1}^{I} \alpha_i y_i \left( \sum_{m=1}^{M} d_m k_m(x_i, x) + \frac{1}{\lambda} \sum_{l=1}^{L} f_l(x_i) f_l(x) \right) + b.$$
where $\beta_l$ is the weight of the $l$-th pre-learned classifier, and $w_m$ and $b$ are the parameters of the standard SVM. Note that $\sum_{m=1}^{M} d_m w_m^T \phi_m(x) + b$ is exactly the decision function of conventional Multiple Kernel Learning (MKL).
Given the training videos $\{x_i\}_{i=1}^{I}$ with the corresponding class labels $\{y_i\}_{i=1}^{I}$, $y_i \in \{-1, +1\}$, we formulate AFMKL by adding a quadratic regularization term on the $d_m$'s:
$$\min_{d}\ \frac{1}{2}\|d\|^2 + J(d), \quad \text{s.t.}\ \sum_{m=1}^{M} d_m = 1,\ d_m \geq 0, \qquad (2)$$
where
$$J(d) = \min_{w_m, \beta, b, \xi_i}\ \frac{1}{2}\Big(\sum_{m=1}^{M} d_m \|w_m\|^2 + \lambda\|\beta\|^2\Big) + C\sum_{i=1}^{I} \xi_i,$$
$$\text{s.t.}\ y_i f(x_i) \geq 1 - \xi_i,\ \xi_i \geq 0,\ i = 1, \ldots, I,$$
$d = [d_1, d_2, \ldots, d_M]^T$ is the vector of linear combination coefficients and $\beta = [\beta_1, \beta_2, \ldots, \beta_L]^T$ is the weighting vector. Note that in $J(d)$ we penalize the complexity of the weighting vector $\beta$ to control the influence of the pre-learned classifiers. By substituting $\tilde{w}_m = d_m w_m$ and according to [17], the optimization problem in (2) is jointly convex with respect to $d$, $\tilde{w}_m$, $\beta$, $b$ and $\xi_i$, so the global optimum of (2) can be reached. As in [23], by replacing the primal form of $J(d)$ with its dual, $J(d)$ can be rewritten as:
$$J(d) = \max_{\alpha}\ \sum_{i=1}^{I} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{I} \alpha_i \alpha_j y_i y_j \sum_{m=1}^{M} d_m \tilde{k}_m(x_i, x_j),$$
$$\text{s.t.}\ \sum_{i=1}^{I} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C,\ i = 1, \ldots, I,$$
where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_I]^T$ is the vector of dual variables and $\tilde{k}_m(x_i, x_j) = \phi(x_i)^T \phi(x_j) + \frac{1}{\lambda} \sum_{l=1}^{L} f_l(x_i) f_l(x_j)$. It is interesting to observe that the dual problem of (2) is similar to that of Generalized MKL [23], except that the kernel matrix is computed based on the augmented features.
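A much-simplified sketch of the alternating optimization is given below. It is not the paper's solver: it uses a plain projected-gradient step on d where the paper uses a second-order update, and an approximate simplex projection; `afmkl_train` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def afmkl_train(kernels, y, C=1.0, steps=10, lr=0.1):
    """Alternate between solving the SVM dual for alpha (fixed d) and a
    gradient step on the kernel weights d, over precomputed base kernel
    matrices built from augmented features."""
    M, n = len(kernels), len(y)
    d = np.full(M, 1.0 / M)
    for _ in range(steps):
        K = sum(dm * Km for dm, Km in zip(d, kernels))   # k = sum_m d_m k_m
        svc = SVC(kernel='precomputed', C=C).fit(K, y)   # solve the dual for alpha
        alpha = np.zeros(n)
        alpha[svc.support_] = np.abs(svc.dual_coef_[0])
        ay = alpha * y
        # gradient of (1/2)||d||^2 + J(d) with respect to d_m
        grad = d + np.array([-0.5 * ay @ Km @ ay for Km in kernels])
        d = np.clip(d - lr * grad, 1e-8, None)
        d /= d.sum()                                     # keep d on the simplex
    return d, svc
```

Each outer step re-solves a standard SVM, which is why the method can reuse existing SVM machinery unchanged.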
Discussion: The most related method is Adaptive Multiple Kernel Learning (A-MKL) [4]. A-MKL is proposed for
cross-domain learning problems and it learns a target classifier by leveraging the pre-learned classifiers from the same
class, which are trained based on the training data from
two domains (i.e., auxiliary domain and target domain). In
contrast, our method AFMKL is specifically proposed for
action recognition and all the samples are assumed to be
from the same domain. The pre-learned classifiers used
in AFMKL are from other action classes. Our method
AFMKL is also different from SVM with Augmented Features (AFSVM) proposed in [2]. Although AFSVM also
makes use of the pre-learned classifiers from other classes,
only one kernel is considered in AFSVM. In contrast, our
proposed method AFMKL is a Multiple Kernel Learning
(MKL) technique that can learn the optimal kernel by linearly combining multiple kernels constructed from different
types of features.
5. Experiments
5.1. Human action datasets
The KTH human dataset [21] is a commonly used action dataset. It contains six human action classes: walking,
jogging, running, boxing, handwaving and handclapping.
These actions are performed by 25 subjects in four different
scenarios: outdoors (s1), outdoors with scale variation (s2),
outdoors with different clothes (s3) and indoors with lighting variation (s4). There are 599 video clips in total, with a frame size of 160 × 120 pixels. We adopt the leave-one-subject-out cross-validation setting, in which the videos of 24 subjects are used as training data and the videos of the remaining subject are used for testing.
Method                     Accuracy (%)
Dollar et al. [3]          81.2
Savarese et al. [20]       86.8
Niebles et al. [16]        81.5
Liu and Shah [14]          94.3
Bregonzio et al. [1]       93.2
Ryoo and Aggarwal [19]     91.1
Schuldt et al. [21]        71.7
Jhuang et al. [8]          91.7
Klaser et al. [10]         84.3
Zhang et al. [31]          91.3
Laptev et al. [12]         91.8
Liu et al. [13]            93.8
Gilbert et al. [6]         94.5
Kovashka and Grauman [11]  94.5
Our method                 94.5
Table 1. Recognition accuracies (%) of different methods on the KTH dataset.
The IXMAS dataset [26] consists of 12 complete action classes, each executed three times by 12 subjects and recorded by five cameras with a frame size of 390 × 291 pixels. The actions are: check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, point and pick up. The body position and orientation are freely chosen by the subjects. As for the KTH dataset, we use the same leave-one-subject-out cross-validation setting.
The UCF sports dataset [18] contains ten types of sports actions: swinging, diving, kicking, weight-lifting, horse-riding, running, skateboarding, swinging at the high bar (HSwing), golf swinging (GSwing) and walking. The dataset consists of 149 real videos with large intra-class variability. Each action class is performed in a different number of ways, and the frequencies of the actions also differ considerably. To increase the number of training samples, we extend the dataset by adding a horizontally flipped version of each video sequence, as suggested in [25]. In the leave-one-sample-out cross-validation setting, one original video sequence is used as the test data while the remaining original sequences, together with their flipped versions, are used as training data. Following [25], the flipped version of the test sequence is not included in the training set.
Method       S1    S2    S3    S4    Ave
ST Context   92.7  76.7  86.0  88.7  86.0
Appearance   90.7  89.3  87.3  90.7  89.5
GMKL [23]    96.0  86.0  90.7  94.0  91.7
AFMKL        96.7  91.3  93.3  96.7  94.5
Table 2. Recognition accuracies (%) using different features on four scenarios of the KTH dataset. The last column is the average accuracy over all scenarios.
5.2. Experimental setting
For interest points detection, the spatial and temporal
scale parameters σ and τ are empirically set by σ = 2.5
and τ = 2, respectively. The size of the cuboid is empirically fixed as 7 × 7 × 5. For the KTH and IXMAS datasets,
200 interest points are extracted from each video, because
they contain enough information to distinguish the actions
in relatively homogeneous and static background. About
1000 interest points detected from each video are used for
the more complex UCF sports dataset. The number of Gaussian components in GMM (i.e., K) is empirically set to 300,
400 and 300 for the KTH, IXMAS and UCF datasets, respectively.
For our method, we report the results using SVM with
multi-scale spatio-temporal (ST) context distribution feature only, SVM with local appearance distribution feature
only, the combinations of these two types of features via
Generalized MKL (GMKL) [23], and AFMKL. For SVM with the ST context distribution feature, we adopt the early-fusion method to integrate the multiple features from the H space-time scales. In both GMKL and AFMKL, we use M = H + 1 linear base kernels constructed respectively from the spatio-temporal context distribution features at the H space-time scales and from the appearance distribution feature.
To cope with multi-class classification task using SVM, we
adopt the default setting in LIBSVM.
Figure 3. Confusion table of our AFMKL on the KTH dataset.
Camera                1     2     3     4     5
ST Context            61.1  57.9  53.2  51.6  42.6
Appearance            72.2  72.2  70.4  65.7  67.6
GMKL [23]             76.4  74.5  73.6  71.8  60.4
AFMKL                 81.9  80.1  77.1  77.6  73.4
Liu and Shah [14]     76.7  73.3  72.1  73.1  -
Yan et al. [27]       72.0  53.0  68.1  63.0  -
Weinland et al. [26]  65.4  70.0  54.4  66.0  33.6
Junejo et al. [9]     76.4  77.6  73.6  68.8  66.1
Table 3. Classification accuracies (%) of different methods for single-view action recognition on the IXMAS dataset.
5.3. Experimental results
Table 1 compares our method with the existing methods
on the KTH dataset. Among the results [3, 20, 16, 14, 1, 19]
Cameras               1,2,3,4,5  1,2,3,4  1,2,3,5  1,2,3  1,3,5  1,3   2,4   3,5
ST Context            73.8       72.2     73.2     73.2   67.8   67.4  66.7  57.6
Appearance            81.9       77.6     83.6     77.8   82.4   74.3  72.7  79.9
GMKL [23]             81.3       80.8     81.5     81.3   77.6   76.2  77.1  72.7
AFMKL                 88.2       88.2     89.4     87.7   88.4   86.6  82.4  83.8
Liu and Shah [14]     82.8       78.0     75.9     60.0   70.2   71.0  71.0  61.6
Yan et al. [27]       -          -        -        -      -      -     -     -
Weinland et al. [26]  -          81.3     -        -      -      -     81.3  -
Table 4. Classification accuracies (%) of different methods for multi-view action recognition on the IXMAS dataset.
Figure 5. Confusion table of our AFMKL on the UCF sports
dataset.
Figure 4. Confusion table of our AFMKL in the multi-view setting
using all five views on the IXMAS dataset.
Method                     Accuracy (%)
ST Context                 71.1
Appearance                 76.5
GMKL [23]                  85.2
AFMKL                      91.3
Kovashka and Grauman [11]  87.3
Wang et al. [25]           85.6
Rodriguez et al. [18]      69.2
Yeffet and Wolf [28]       79.3
Table 5. Recognition accuracies (%) of different methods on the UCF sports dataset.
Table 3 and Table 4 respectively report the classification results of different methods for single-view and multi-view action recognition on the IXMAS dataset.
on the IXMAS dataset, the method [14] is the most related
one because the spatio-temporal interest points are also used
in [14]. The other methods exploit additional 3D human pose information obtained via camera calibration, background subtraction and 3D pose reconstruction, which are not required in our work. The confusion table of recognition results in
the multi-view setting using all five views is shown in Figure 4. For some actions, such as “sit down”, “get up”, “turn
around”, “walk” and “pick up”, our method achieves very
high recognition accuracies. Our method also achieves reasonable performance for some challenging actions (e.g., "point", "scratch head" and "wave") that involve small and ambiguous motions.
Evaluation results on the UCF sports videos are presented in Table 5. Our method achieves 91.3% recognition
accuracy which is better than 85.6% and 87.3% respectively
reported in recent work [25] and [11] using the same setting. The confusion matrix of our AFMKL is depicted in
Figure 5. While this dataset is difficult with large viewpoint and appearance variability as well as camera motion,
our result is still encouraging.
From Tables 2, 3, 4 and 5, it is important to note that
although the spatio-temporal context distribution feature is
Accuracy (%)
71.1
76.5
85.2
91.3
87.3
85.6
69.2
79.3
Table 5. Recognition accuracies (%) of different methods on the
UCF sports dataset.
using the interest points detected by Dollar's method [3] and the leave-one-subject-out cross-validation setting, our method achieves the highest recognition accuracy of 94.5%, matching the best results from the recent works [21, 8, 10, 31, 12, 13, 6, 11]. Table 2 reports the recognition accuracies on the four scenarios. Figure 3 shows the confusion table of recognition results on the KTH dataset. It is interesting to note that the leg-related actions ("jogging", "running", "walking") are more easily confused with each other; in particular, "running" is often misclassified as "jogging". A possible explanation is that "jogging" and "running" exhibit similar spatio-temporal context and appearance information.
Our approach is also tested for single-view and multi-view action recognition on the IXMAS dataset.
generally not as effective as the appearance distribution feature, the combination of the two types of complementary features via GMKL [23] significantly outperforms the single-feature methods in most cases. Moreover, AFMKL
achieves the best results in all the cases, which demonstrates
the effectiveness of using the pre-learned classifiers from
other action classes and multiple kernel learning to improve
the recognition performance.
[12] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[13] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from
videos. In CVPR, 2009.
[14] J. Liu and M. Shah. Learning human actions via information maximization. In CVPR, 2008.
[15] S. Nayak, S. Sarkar, and B. L. Loeding. Distribution-based dimensionality reduction applied to articulated motion recognition. IEEE
Trans. Pattern Anal. Mach. Intell., 31(5):795–810, 2009.
[16] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 79(3):299-318, 2008.
[17] A. Rakotomamonjy, F. R. Bach, and Y. Grandvalet. SimpleMKL.
JMLR, 9:2491–2521, 2008.
[18] M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.
[19] M. Ryoo and J. Aggarwal. Spatio-temporal relationship match: video
structure comparison for recognition of complex human activities. In
ICCV, 2009.
[20] S. Savarese, A. Delpozo, J. C. Niebles, and L. Fei-fei. Spatialtemporal correlatons for unsupervised action classification. In
WMVC, 2008.
[21] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, 2004.
[22] J. Sun, X. Wu, S. Yan, L. F. Cheong, T.-S. Chua, and J. Li. Hierarchical spatio-temporal context modeling for action recognition. In
CVPR, 2009.
[23] M. Varma and B. R. Babu. More generality in efficient multiple kernel learning. In ICML, 2009.
[24] I. R. Vega and S. Sarkar. Statistical motion model based on the
change of feature relationships: Human gait-based recognition. IEEE
Trans. Pattern Anal. Mach. Intell., 25(10):1323–1328, 2003.
[25] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[26] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from
arbitrary views using 3d exemplars. In ICCV, 2007.
[27] P. Yan, S. M. Khan, and M. Shah. Learning 4d action feature models
for arbitrary view action recognition. In CVPR, 2008.
[28] L. Yeffet and L. Wolf. Local trinary patterns for human action recognition. In ICCV, 2009.
[29] A. Yilmaz. Recognizing human actions in videos acquired by uncalibrated moving cameras. In ICCV, 2005.
[30] A. Yilmaz and M. Shah. Actions sketch: a novel action representation. In CVPR, 2005.
[31] Z. Zhang, Y. Hu, S. Chan, and L.-T. Chia. Motion context: a new
representation for human action recognition. In ECCV, 2008.
[32] X. Zhou, X. Zhuang, S. Yan, S.-F. Chang, M. Hasegawa-Johnson,
and T. S. Huang. Sift-bag kernel for video event analysis. In ACM
Multimedia, 2008.
6. Conclusions
In order to describe the “where” property of interest
points, we have proposed a new visual feature by using
multiple Gaussian Mixture Models (GMMs) to represent
the distributions of local spatio-temporal context between
interest points at different space-time scales. The local appearance distribution in each video is also modeled using
one GMM in order to capture the “what” property of interest points. To fuse both spatio-temporal context distribution
and local appearance distribution features, we additionally
propose a new learning algorithm called Multiple Kernel
Learning with Augmented Features (AFMKL) to learn an
adapted classifier by leveraging the existing SVM classifiers
of other action classes. Extensive experiments on KTH, IXMAS and UCF datasets have demonstrated that our method
generally outperforms the state-of-the-art algorithms for action recognition. In the future, we plan to investigate how
to automatically determine the optimal parameters in our
method.
Acknowledgements This work is funded by Singapore
A*STAR SERC Grant (082 101 0018).
References
[1] M. Bregonzio, S. Gong, and T. Xiang. Recognising action as clouds
of space-time interest points. In CVPR, 2009.
[2] L. Chen, D. Xu, I. W.-H. Tsang, and J. Luo. Tag-based web photo
retrieval improved by batch mode re-tagging. In CVPR, 2010.
[3] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS PETS, 2005.
[4] L. Duan, D. Xu, I. W.Tsang, and J. Luo. Visual event recognition in
videos by learning from web data. In CVPR, 2010.
[5] C. Fanti, L. Zelnik-manor, and P. Perona. Hybrid models for human
motion recognition. In ICCV, 2005.
[6] A. Gilbert, J. Illingworth, and R. Bowden. Fast realistic multi-action
recognition using mined dense spatio-temporal features. In ICCV,
2009.
[7] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions
as space-time shapes. In ICCV, 2005.
[8] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired
system for action recognition. In ICCV, 2007.
[9] I. N. Junejo, E. Dexter, I. Laptev, and P. Prez. Cross-view action
recognition from temporal self-similarities. In ECCV, 2008.
[10] A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008.
[11] A. Kovashka and K. Grauman. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition.
In CVPR, 2010.
496
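As a rough illustration of the pipeline summarized in the conclusions, the following is a minimal sketch, not the authors' implementation. It assumes scikit-learn, synthetic stand-in data, a single context scale, per-video "soft histogram" GMM posteriors as features, and a fixed kernel weight; AFMKL instead learns the kernel weights jointly with the classifier and additionally leverages the SVM classifiers of other action classes, none of which is reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-in for videos: each "video" is a set of interest points with
# (x, y, t) locations (the "where" property) and local appearance
# descriptors (the "what" property). Class-1 descriptors are shifted so
# the toy problem is separable.
def make_video(label):
    pts = rng.uniform(size=(40, 3))
    descs = rng.normal(loc=2.0 * label, size=(40, 8))
    return pts, descs, label

videos = [make_video(label) for label in [0, 1] * 10]

# Relative coordinates between interest-point pairs (context); the paper
# uses multiple space-time scales, only one is modeled here.
def rel_coords(pts):
    d = (pts[:, None, :] - pts[None, :, :]).reshape(-1, 3)
    return d[np.abs(d).sum(axis=1) > 0]  # drop self-pairs

all_descs = np.vstack([descs for _, descs, _ in videos])
all_ctx = np.vstack([rel_coords(pts) for pts, _, _ in videos])

app_gmm = GaussianMixture(n_components=4, random_state=0).fit(all_descs)
ctx_gmm = GaussianMixture(n_components=4, random_state=0).fit(all_ctx)

# Represent each video by the average GMM posterior over its points
# (a "soft histogram", a simplification for illustration).
def soft_hist(gmm, X):
    return gmm.predict_proba(X).mean(axis=0)

F_app = np.array([soft_hist(app_gmm, descs) for _, descs, _ in videos])
F_ctx = np.array([soft_hist(ctx_gmm, rel_coords(pts)) for pts, _, _ in videos])
y = np.array([label for _, _, label in videos])

# Fuse the two heterogeneous features via a convex kernel combination
# K = w * K_app + (1 - w) * K_ctx; AFMKL learns such weights, here the
# weight is fixed for simplicity.
K_app, K_ctx = F_app @ F_app.T, F_ctx @ F_ctx.T
w = 0.7
K = w * K_app + (1 - w) * K_ctx

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))
```

The key design point mirrored here is that both cues are turned into distributions (GMMs) before fusion, so heterogeneous "what" and "where" information meets the classifier only through kernels, which is what makes a multiple-kernel formulation natural.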