2188 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 4, APRIL 2012

public security. However, action recognition becomes quite difficult under unconstrained conditions such as view change, cluttered background, and occlusions. In this paper, we focus on view-invariant action recognition and develop a new method that copes better with viewpoint variations in real applications.

Several kinds of methods addressing view invariance in action recognition have been proposed. Generally, they fall into the following categories. First, 3-D reconstruction techniques could provide the most reliable view-independent representation of actions. In [ ] and [ ], the images recorded by multiple calibrated cameras are projected back to 3-D visual hulls of the human body in different poses. However, the high computational cost of this method limits its practical applications since it has to calculate the calibration configurations of the cameras and the correspondences between images from different views. Second, the epipolar geometric relations in multiple-view geometry lead to some constraints between image points in different views. For example, the fundamental ratios proposed in [ ], which are ratios of the fundamental matrix and are proved to be invariant to viewpoint, could be used to represent pose transitions in a model-based method. The problem is that manual labeling of the joints of the human body is required to find the triplets of points necessary for homography calculation. Third, the mapping between motion representation and view-invariant patterns could be automatically learned by machine learning techniques such as those in [ ] and [ ]. These methods do not need to extract view-invariant features from images. Nevertheless, they implicitly assume that the mapping satisfies underlying predefined models with empirical priors, and it is not clear which aspect of their representation of actions accounts for the observations. Last, people also try to construct invariants from images. For example, [ ] applies a spatiotemporal curvature of the 2-D trajectory of the hands to capture dramatic changes in motion. The curvature is a view-invariant feature, but a high-order derivative degrades the signal-to-noise ratio since the curve is not smooth enough. The method in [ ] assumes that there is a moment in an action when some of the joints of the body are coplanar, i.e., a canonical pose, in order to find invariant patterns in actions. However, its application is quite limited since it is hard to detect such a canonical pose in videos automatically and correctly.

A compact framework is proposed for view-invariant action recognition from the perspective of motion information in image sequences. It encapsulates motion patterns and view invariants in one model, which results in a complementary fusion of two aspects of the characteristics of human actions. In the following sections, we will discuss a series of issues relating to interest point detection in image sequences, motion feature extraction and representation, and model selection. Above all, the main contribution of this paper lies in the following three aspects.
• New features are extracted from video motion information, including the view-invariant cross-ratio feature and optical-flow-based features.
• A new feature representation is described based on optical flow, using the statistics of oriented histogram projection to keep the feature representation away from much of the noise.
• A discriminative probabilistic model, i.e., the hidden CRF (hCRF), is used to fuse the proposed statistics of motion and the projective invariability of the cross ratio in one effective and compact framework.
Practically, the proposed method has shown excellent discrimination ability in recognizing different actions while preserving high robustness to view changes in real circumstances.

II. FRAMEWORK OF OUR METHOD

In this paper, we describe our method in three stages. In the first stage, we introduce a motion detection method to detect interest points in the space and time domain, which facilitates both the extraction of optical flow in an expected local area and the extraction of the view-invariant cross ratio from the detected interest points. In the second stage, the oriented optical flow feature is represented by oriented histogram projection to capture statistical motion information. In the third stage, the optical flow and view-invariant features are fused together in a discriminative model.

The proposed framework of action recognition has shown good integration and achieved the expected performance on some challenging databases. The flowchart of the proposed method is shown in Fig. 1.

Fig. 1. Flowchart of our method.

III. MOTION DETECTION

In previous work on view-invariant action recognition, it is difficult to obtain very accurate trajectories of moving objects because of noise caused by self-occlusions [ ]. Appearance-based methods such as the scale-invariant feature transform (SIFT) [ ] are not quite suitable for motion analysis since appearance cues, such as color, gray level, and texture, are not stable among neighboring frames of dynamic scenes. Compounded by the nonrigidity of the human body, the extraction of stable continuous interest points is far from easy.

For object detection, background modeling and tracking techniques may be the first choice. However, existing methods of background modeling, such as the Gaussian mixture model, fall short of the effectiveness required for accurate human behavior analysis. For example, some traditional methods do not work well in the presence of shadows, light changes, and, particularly, view changes in real scenes. Consequently, we are inclined
HUANG et al.: MODEL OF MOTION AND CROSS RATIO FOR VIEW-INVARIANT ACTION RECOGNITION 2189

to methods of directly extracting informative features from images without background modeling and tracking, which could be achieved by taking advantage of image saliency measurement to detect key points from images.

Usually, object detection starts from corner detection by making use of the gray gradient in the 2-D image plane. In order to detect such a region of interest in images, a measurement of cornerness has to be defined. The most common method is to take the Laplacian of Gaussian (LOG) method as a response of image gray gradients; for example, the response function of Harris corner detection is defined as

$$R = \det(M) - k\,\operatorname{trace}^2(M) = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2 \tag{1}$$

where $\lambda_1$ and $\lambda_2$ are the eigenvalues of the matrix $M$ built from the derivative of the Gaussian of the local area in the image, and $k$ is a free parameter.

Lowe proposed the SIFT feature detection method, whose response function is the deviation of LOG on different scales and which achieves invariance in scale space [ ].

The idea of 2-D detection of interest points is extended to image sequences in the 3-D space and time domain. In addition to gray scale, a measurement of gray gradient change along time could be defined as another portion of the energy response in timescale. Dollar et al. proposed a Gabor filter-based algorithm of space–time interest point (STIP) detection [ ]. Unlike the original methods in [ ], Dollar's method takes into account the gradient of an image spatially and temporally by using a Laplace operator to detect the instability of image intensity. They apply a Gabor filter to measure the saliency of a specific region in the image sequences in the space and time domain. Since the Gabor filter captures texture information of images fairly well, it is widely used for iris recognition, fingerprint recognition, and other applications of texture analysis.

In their method, the response function is defined as follows:

$$R = (I * g * h_{ev})^2 + (I * g * h_{od})^2 \tag{2}$$

where the image sequences are denoted as $I$. In addition, $g(x, y; \sigma)$ represents the 2-D Gaussian kernel; $h_{ev}$ and $h_{od}$ are a pair of temporal Gabor filters, corresponding to the even and odd channels

$$h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2}, \qquad h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2}. \tag{3}$$

Each component of the response function reflects the degree of deviation of image intensity in the space and time domain. Experimentally, they tune the values of $\sigma$ and $\tau$ to customize the window size of the convolution operation spatially and temporally.

By the convolution operation within a time window, the local gradient information takes the form of energy. After thresholding, the points of interest could be located.

Another method considering the spatial–temporal volume achieves better results in feature extraction [ ]; it models the motion variations in the local neighborhood. In our method, the cross ratio is important for view invariance, and it needs action trajectories with stable key points. Intrinsically, we also make use of the spatial–temporal property. Here, we take advantage of STIP detection [ ] to extract motion information in video sequences and meanwhile to ease the extraction of view invariants directly from images. As shown in Fig. 2, the red points illustrate the detected STIPs when a person is jumping.

Fig. 2. STIP detection.

In this way, we extract many key points that are informative for recognition in spite of the occurrence of noise. Since many points are detected, a nonmaxima suppression method is used to select the relatively stable points of interest as a representation of the current frame, which gives much better performance, particularly for periodic motion patterns. The algorithm could run at more than 25 frames/s on a PC with Intel Quad 2 2.8-GHz CPU.

A. Nonmaxima Suppression

Since multiple key points could be found in each frame, we use a nonmaxima suppression method [ ] to select a relatively reliable one as the point of interest of this frame.

As shown in Fig. 3, for each detected key point, we define a radius around it and then compare its energy response to those of the points in the neighboring circular area of that radius. If there is no other point whose energy response is higher than that of the central point, the center is considered to be a candidate point of interest in this image. After traversing all the detected points, we select the points with the highest energy response among all the candidate points as the final points of interest in this image. In addition, for each sequence of an action, we define a fixed size of sliding window as a basic unit of data sample. Correspondingly, we pick a fixed number of interest points within the window to describe an action sequence. Although a single point per frame is selected, the overall stability of those points over the image sequences ensures high robustness for the view-invariant feature extracted from consecutive interest points in the neighborhood, as we will see in our tests in later sections.

IV. FEATURE EXTRACTION AND REPRESENTATION

In view-invariant action recognition, it is challenging to find an appropriate feature that is robust to view change while
Fig. 3. Nonmaxima suppression.

preserving discrimination in recognition. The most important characteristic of a view-invariant feature is that it is stable under different views but distinctive between different classes. As known in information theory, the principle of entropy tells that the greater entropy of an observation, the less amount of information it carries. In addition, empirically, the more random things happen, the more uniform the distribution of the data is. Therefore, trying to extract a feature that is absolutely invariant to view change will necessarily lead to a decline of its discrimination power. Here, we consider the tradeoff between invariance and discrimination: we put forward a method that combines motion features and view invariants and use a learning method to obtain a tradeoff between the two aspects.

A. Motion Feature

Since appearance-based features such as Harris, histogram of oriented gradient (HOG), SIFT, Gabor, and shape highly depend on the stability of image processing, they fail to accurately recognize different kinds of actions because of the nonrigid nature of the human body or some other impacts in real applications. Therefore, in our method, after the detection of interest points in videos, we extract motion features from the neighboring area around the interest points and build a representation of the statistical properties of the local area of the image.

a) Optical Flow Extraction: Optical flow takes the form of a 2-D vector representing image pixel velocity in the x- and y-directions. The beginning and ending points of the optical flow vector correspond to the displacement of image pixels.

There are mainly two kinds of methods to extract optical flow from images. The first one is the feature-based method, which calculates the matching score of the feature points between neighboring frames and takes the displacements of the matched points as the start and endpoints of the optical flow vector. However, due to the instability of image edges and the large displacement of the moving human body, the calculated optical flow could hardly exhibit the real movement of the human body. The second one is the gradient-based method, which is widely used in computer vision tasks. Gradient-based methods assume that the gray level in a local area of the images is relatively stable between adjacent frames. More importantly, by calculating the image gradient and optimizing the cost function in an iterative way, they can give a dense field of optical flow.

In this paper, we use the pyramidal Lucas–Kanade (PLK) algorithm [ ] to calculate optical flow from the image sequence. Unlike the Lucas–Kanade method [ ], PLK downsamples images into different scales and then iteratively minimizes the gradient variations of gray level in the local area between neighboring frames.

Since PLK has to compute the derivative of the image intensity on different scales iteratively, we only extract optical flow features of a local area around the STIPs instead of the whole image, by which we avoid high computational cost and meanwhile reduce redundancy in feature extraction.

b) Feature Description of Local Motion: The drawback of optical flow is that it brings much noise in the experiment, particularly when the object is moving fast with large displacement. Therefore, we would rather take advantage of the statistical characteristics of optical flow, for example, the main direction or the histogram of optical flow of an image patch, than think of the optical flow vector as the pixel displacement. In previous work [ ], HOG achieved good results in pedestrian detection mainly because of its effective statistical feature expression strategy. Similarly, we project the magnitude of optical flows into directional histogram bins to describe the local motion information.

After a key point is detected, we apply the PLK algorithm to calculate optical flow in the local area around the key point. Each optical flow vector within an image cell carries a weight in proportion to its magnitude in the projection into the histogram bins. Histograms from different image cells in a block are concatenated together to form a feature vector.

Fig. 4. Histogram projection.

As shown in Fig. 4, we divide a circumference into eight equal bins in our experiment. Each bin collects the voted weights of the magnitudes of the optical flows in the current region of interest. For each optical flow vector, the vote is split into two weights that are set according to the orientation gaps between the optical flow and the boundaries of the bins. The image block is divided into cells, and the flow in each cell is voted into eight histogram bins, as shown in Fig. 4; thus, the projected histogram forms a 128-dimensional feature vector.

It is worth noting that the spatiotemporal volume achieved excellent results in representing motion information in [ ] owing to its ability to model motion information in the local neighborhood. Intrinsically, we also take this advantage. The difference is that the spatiotemporal volumes are cuboids in videos in their method, whereas they are blocks along consecutive interest points in our method.
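The oriented histogram projection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `flow_histogram` and `block_descriptor`, the linear split of each vote between two adjacent bins, and the 4 x 4 cell grid (chosen only because 16 cells x 8 bins gives the 128 dimensions stated in the text) are all assumptions made for illustration. The flow field itself is assumed to have been computed already, for instance by a pyramidal Lucas–Kanade routine.

```python
import numpy as np

def flow_histogram(flow, n_bins=8):
    """Magnitude-weighted orientation histogram of optical flow vectors.

    flow: array of (dx, dy) vectors, any leading shape, last axis of size 2.
    Each vector's magnitude is split linearly between the bin starting at
    the boundary just below its orientation and the next bin (an assumed
    weighting scheme, not taken from the paper).
    """
    flow = np.asarray(flow, dtype=float).reshape(-1, 2)
    mag = np.hypot(flow[:, 0], flow[:, 1])
    ang = np.arctan2(flow[:, 1], flow[:, 0]) % (2 * np.pi)
    pos = ang / (2 * np.pi / n_bins)        # orientation in bin units
    base = np.floor(pos)
    lo = base.astype(int) % n_bins          # bin at the lower boundary
    hi = (lo + 1) % n_bins                  # neighbouring bin (wraps around)
    w_hi = pos - base                       # gap to the lower boundary
    hist = np.zeros(n_bins)
    np.add.at(hist, lo, mag * (1.0 - w_hi))
    np.add.at(hist, hi, mag * w_hi)
    return hist

def block_descriptor(flow_patch, grid=4, n_bins=8):
    """Concatenate per-cell histograms of an (H, W, 2) flow patch.

    grid=4 is a hypothetical choice giving 4 * 4 * 8 = 128 dimensions.
    """
    h, w, _ = flow_patch.shape
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    cells = [
        flow_histogram(flow_patch[ys[i]:ys[i + 1], xs[j]:xs[j + 1]], n_bins)
        for i in range(grid)
        for j in range(grid)
    ]
    return np.concatenate(cells)
```

Because every vote is split between two bins with weights that sum to one, the histogram of a cell always sums to the total flow magnitude in that cell, which keeps the descriptor insensitive to small changes in flow direction near a bin boundary.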
B. View Invariants

Geometric invariants capture invariant information of a geometric configuration under a certain class of transformations. Group theory gives us the theoretical foundation for constructing invariants [ ]. Since they could be measured directly from images without knowing the orientation and position of the camera, they have been widely used in object recognition to tackle the problem of projective distortion caused by viewpoint variations.

In view-invariant action recognition, traditional model-based methods evaluate the fitness between image points and predefined 3-D models. However, it is difficult to detect qualified image points that satisfy the specific geometric configuration required to get the desired invariants.

The cross ratio is the most common invariant. As shown in Fig. 5, the sets of four collinear points with the same permutation lying on different planes form cross ratios with the same value. To get a cross ratio as an invariant, the image points must be collinear in the original 3-D space before projection.

Fig. 5. Sets of four points with identical cross ratios under projective transformations.

To avoid the constraint of collinearity of image points when constructing invariants from an image, we calculate invariants across neighboring frames rather than from a single image. We generalize the cross ratio of four collinear points to cross ratios across frames (CRAFs) using five neighboring coplanar points sampled from trajectories of actions. The only assumption we make is that the five points detected from neighboring frames are approximately coplanar. Under this assumption, the proposed method needs neither a model of the human body nor manual labeling of image points.

a) Invariants Across Frames: In previous work [ ], we assumed that the trajectories of several key joints on the body, such as the hand, foot, or head, could be obtained by feature-tracking techniques. Once we get trajectories from the image sequences, we could construct a pair of cross ratios for every five points sampled from the trajectories. Thus, pairs of cross ratios are transformed to histograms as the feature vectors in SVM classification. In this paper, we take the sequential STIPs as the motion trajectories of an action and extract the view-invariant cross-ratio feature from the key points detected over multiple frames.

The cross ratio is invariant to projective transformations. It is defined as

(4)

Here, the four arguments represent a set of four collinear points, and the value of the cross ratio is preserved by projective transformations.

This precondition of collinearity limits the application of the cross ratio of four collinear points. Therefore, we make a generalization by constructing a pair of cross ratios in the same way as in [ ]. As illustrated in Fig. 6, if we have obtained a trajectory and there are five points on it that are approximately coplanar, we use these five points to generate two groups of four collinear points. With the two groups of collinear points, we compute their cross ratios and denote them as CR_1 and CR_2, respectively. Thus, we get the view-invariant representation of these five points as follows:

$$(\mathrm{CR}_1, \mathrm{CR}_2) \tag{5}$$

where the pair (CR_1, CR_2) denotes our view-invariant representation of five trajectory points. The only precondition of this generalization is the coplanarity of the five points. Empirical tests in Section VI show that the precondition is satisfied in most real cases.

Fig. 6. Sets of four points with identical cross ratios under projective transformations.

Computing CR_1 and CR_2 is straightforward as long as the coordinates of the five points on the image plane are known. Here, we use formulas that have been proven in higher geometry as follows:

(6)

(7)

where each term is the determinant of a 2 × 2 matrix.

Degenerate groups of points might appear while computing CR. For example, two of the lines used in the construction may be parallel, or three of the points may be collinear. In these cases, we either assign a fixed, relatively large number to CR or just ignore the group. Since most of the sampled points are in general position, the degenerate groups do not affect the outputs of the algorithm.

b) CRAF Histograms: For each five points, we get a sequence of pairs of cross ratios. These CR are voted into bins
to form a histogram as the representation of the feature vector for classification. In detail, the value of each histogram bin is defined as

$$h_i = \sum_{j=1}^{N} \delta_i(\mathrm{CR}_j), \quad \text{where} \quad \delta_i(\mathrm{CR}_j) = \begin{cases} 1, & \text{if } l_i \le \mathrm{CR}_j < u_i \\ 0, & \text{elsewhere} \end{cases} \tag{8}$$

where $N$ is the count of CR, and $l_i$ and $u_i$ correspond to the lower and upper bounds of the $i$th bin of the histogram. Practically, the values of cross ratios vary from 0 to 1, and here, we discretize them into eight bins.

V. ACTION MODELING

Up to now, we have obtained the motion feature description and the view invariants of interest points. The remaining problem is how to model the temporal information from sequential data. Unlike object classification in a static image, action recognition should take into account the temporal dependence and pace of an action. In the literature of time series analysis, the most commonly used model is the HMM, particularly in speech recognition and human action recognition, which models the joint probability of observation and state sequences given the model parameters. Once the emission probability is set, the state transition matrix A and the observation matrix B could be learned by an EM algorithm. In the recognition process, the HMM predicts the category of a sequence by calculating the likelihood of the observation given the model.

Although the HMM has a very simple and effective structure, it suffers from several limitations such as conditional independence assumptions between observations, strong prior knowledge of the data, and local optimization. The CRF [ ] is widely used in word segmentation, named-entity recognition, text analysis, and so on. In [ ], hidden states are introduced to model the different parts in object recognition. Because of its excellent performance, it has also been applied to human action recognition [ ], [ ], [ ]–[ ].

To effectively model more complex human actions, a discriminative maximum-entropy model, the hCRF, is used for the motion features of an action sequence [ ]. Another model named latent-dynamic CRF, which can model the dynamics between actions and allows automatic action segmentation [ ], needs much more computation. In this paper, the view-invariant feature is integrated into the hCRF model to maintain an overall high recognition rate, which results in good robustness of the model to changes in view angle.

With the hCRF model, we could make full use of the information extracted from neighboring frames rather than a single static image and meanwhile bring in the view invariants to deal with view changes in action recognition applications. In [ ], the hCRF is used to model the dependence between different image patches. Intuitively, it could also be used to model different phases of human actions in image sequences. The graphic structure of the hCRF is shown in Fig. 7. It contains the observations, representing the optical flow feature; the hidden variables in the middle layer of the graph; and the class label of the action. Overall, the graph outputs a conditional probability, which takes the form of

(9)

where the parameters of the model are unknown, and the graph vertices represent different variables. The edges between the different nodes of the graph describe interactions between different variables. The sets of nodes and edges constitute the graph.

Fig. 7. Structure of hidden CRF.

According to Fig. 7, we can formulate the potential function in (9) as follows:

(10)

The four components on the right-hand side of (10) correspond to four different connections in the graphic structure shown in Fig. 7. The specific definition of each part is customized in our experiment as follows.

1) Observations of motion and hidden states.
This component describes the relation between observations and hidden variables; it measures the matching score between the motion features extracted from images and the hidden states. In our experiment, the observations correspond to the histograms obtained by oriented optical flow projection

(11)

where the parameter measures the compatibility between observation and hidden state.

2) Hidden states and labels

(12)

where the parameter measures the compatibility between hidden states and labels.

3) Different hidden states and labels

(13)
where the parameter measures the compatibility among different hidden states and labels linked by edges in the graph shown in Fig. 7.

4) View invariants and labels
This part of the potential function takes into account the view-invariant cross-ratio feature, which could be computed once the locations of the points of interest are known

(14)

where the parameter measures the compatibility between view invariants and labels.

By normalization, the summed potentials are converted into the conditional probability that we expect.

In the training process, a gradient descent method is applied to obtain the model parameters iteratively. The logarithmic derivatives of the conditional probability with respect to the different parameters are

(15)

The learning process starts with a given initial value of the parameters, and we could obtain a locally optimal solution after a number of iterations by minimizing the logarithmic derivatives. Specifically, in our method, we assume that the graph obeys a tree structure, which could be approximated by using a minimum spanning tree algorithm. In addition, all the expectations in (15) are computed by belief propagation.

VI. EXPERIMENTAL RESULTS AND ANALYSIS

Here, we give a thorough illustration of our experimental results after intensive testing on both the view-invariant features separately and their fusion with motion features.

A. View Invariance

To verify the effectiveness of the cross ratio as a view-invariant feature, we first use the Carnegie Mellon University Motion Capture (MoCap) Database to evaluate its robustness to view change. The MoCap database records 3-D position information captured from sensors on the body of the subject. After projection onto image planes, we could get 2-D trajectories in different views.

To make a comparison with the state of the art, our experiment is run under the same conditions as [ ].

In the projection process, there are 17 synthesized cameras uniformly distributed around a hemisphere. The distribution of the cameras is depicted in Fig. 8. All the actions are performed around the center within the hemisphere. We project the 3-D data onto the images of each viewpoint with the focal length randomly chosen in a range of 1000 ± 300 mm. Fig. 9 gives an example of projected trajectories of the hand. It illustrates the action of jump in each viewpoint with varying appearance caused by projective distortions.

Fig. 8. Camera position distribution [ ].

Fig. 9. Projected trajectories of hand on each viewpoint.

We select five classes of actions, i.e., climb, jump, run, swing, and walk, from the database for testing. For each action, we get the trajectories of the head, left hand, and left foot of the subject. For every five neighboring points on the trajectory, we compute a pair of CR by (6) and (7). We transform the CR of each action to histograms as the view-invariant features of the action.

After projection, we get 200 trajectories for each viewpoint, specifically 12 sequences for climb, 57 sequences for jump, 41 sequences for run, 10 sequences for swing, and 80 sequences for walk. The data provided are unbalanced; hence, a weighted training strategy is applied in the training process. We use an SVM as the classifier. In SVM training, the radial basis function kernel parameters are chosen by grid search. We train one model for each viewpoint and test it on the other viewpoints. The output of each viewpoint is the one with the highest score.

The performance is shown in Fig. 10. Although the recognition rate of some views is a little lower than that in [ ], the average accuracy is about 92.38%, which is much higher than the 81.60% in [ ], demonstrating high stability over the 17 viewpoints.

Theoretically, the cross ratios of five coplanar points in general position remain the same under projective transformations. Since we have assumed that the five points used to compute CR are approximately coplanar, we evaluate the variance of the CR of a group of five neighboring points at different sampling rates.
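The invariance of five coplanar points under projective transformations can be checked numerically with a short sketch. Note that this is not the pair of formulas (6) and (7) used in the paper, which the text describes in terms of 2 × 2 determinants. As a stand-in, the sketch uses the classical textbook pair of five-point projective invariants built from 3 × 3 determinants of homogeneous coordinates. The function names `five_point_invariants` and `apply_homography`, the sample points, and the homography are hypothetical.

```python
import numpy as np

def five_point_invariants(pts):
    """Classical pair of projective invariants of five coplanar points.

    pts: (5, 2) array of image points in general position (no three
    collinear). Uses 3x3 determinants of homogeneous coordinates; this is
    a textbook construction, assumed here for illustration only.
    """
    h = np.hstack([np.asarray(pts, dtype=float), np.ones((5, 1))])

    def m(i, j, k):
        # Determinant of the matrix whose columns are points i, j, k (1-based).
        return np.linalg.det(np.stack([h[i - 1], h[j - 1], h[k - 1]], axis=1))

    i1 = (m(4, 3, 1) * m(5, 2, 1)) / (m(4, 2, 1) * m(5, 3, 1))
    i2 = (m(4, 2, 1) * m(5, 3, 2)) / (m(4, 3, 2) * m(5, 2, 1))
    return np.array([i1, i2])

def apply_homography(H, pts):
    """Map (N, 2) points through a 3x3 homography and dehomogenize."""
    pts = np.asarray(pts, dtype=float)
    q = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return q[:, :2] / q[:, 2:3]

# Hypothetical example: five points and an arbitrary projective transformation.
pts = np.array([[0.0, 0.0], [1.0, 0.2], [2.0, 1.0], [1.5, 2.0], [0.3, 1.2]])
H = np.array([[1.0, 0.2, 3.0], [0.1, 0.9, -2.0], [0.001, 0.002, 1.0]])
before = five_point_invariants(pts)
after = five_point_invariants(apply_homography(H, pts))
print(np.allclose(before, after))
```

Each point index appears equally often in the numerator and denominator of both ratios, so the factor det(H) and the per-point scale factors introduced by the homography cancel. For points that are only approximately coplanar, as in the experiment above, the two values would differ slightly, and the size of that difference is what the variance analysis measures.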
The variance and mean curves with respect to different sampling rates are shown in Fig. 11.

Fig. 10. Recognition rate in different views.

Fig. 11. Mean and variance of CRs in different viewpoint frames.

The mean value of CR is around 0.6. As shown in Fig. 11, the variance is negligible compared to the mean value when the sampling rate is above 25 Hz. That is, the calculated CR is stable as long as the frame rate is above 25 Hz, indicating that our approximate coplanarity assumption is acceptable under real circumstances.

B. Performance on Public Action Data Sets

The cross ratio has shown high stability and robustness to view change on the motion capture data set. In order to evaluate its effectiveness on real data, the cross ratio is combined with the motion feature by a discriminative model on public data sets to see the overall performance.

For single view, we test our method on two public action data sets, namely, the Weizmann action data set [ ] and the KTH action data set [ ].

The Weizmann action data set contains 90 video sequences of 10 natural actions performed by nine different people. For comparison with the state of the art, we selected nine kinds of actions, including running, walking, jumping-jack, jumping-forward-on-two-legs, jumping-in-place-on-two-legs, galloping-sideways, waving-two-hands, waving-one-hand, and bending, similar to what was done in [ ]. The KTH action data set contains six kinds of actions, including walking, jogging, running, boxing, hand waving, and hand clapping, with large variations both in appearance and scenes.

Each of the two data sets is partitioned by subjects into two equal subsets. One subset contains the videos of half of the subjects for training, and the other contains those of the remaining subjects for testing. We then cut all the video sequences in the two data sets into subsequences with 30 frames for each sequence. Half of the subsequences are chosen as the training set. Unlike the tracking and background subtraction in [ ], we use the STIP method and then compute oriented optical flow histograms in the detected local region, as described in Sections III and IV. In the learning process, the cost between different hidden nodes of the hCRF is measured by the Euclidean distance in the spatiotemporal domain. As illustrated in Fig. 12, the performance is comparable to the state of the art [ ], [ ]–[ ]. The average recognition rate (89.7%) is better than that in [ ] (87.6%) because of the effective feature expression by oriented histogram projection. Although the accuracy of our method is a little lower than that in [ ] (91.7%) and in [ ] (91.8%), further evaluation demonstrates that our method remains flexible when actions become complex in a more challenging data set.

Fig. 12. Comparison results on Weizmann and KTH action data sets.

To evaluate the accuracy in different scenarios on the KTH data set, we test our method using four different combinations of training data as in [ ]. The results are given in Fig. 13. s1, s2, s3, and s4 are four different scenarios, i.e., outdoors, outdoors with scale variation, outdoors with different clothes, and indoors, respectively. In Fig. 13, we can see that our method achieves similar results in the different scenarios (s1 is 82.1%; s1 + s4 is 83.2%; s1 + s3 + s4 is 85.6%; and s1 + s2 + s3 + s4 is 89.7%, all over 80%), which shows the robustness of our method.

Fig. 13. Results on different scenarios on KTH data set.

We also test our method on a large multiview action data set, i.e., the Institute of Automation, Chinese Academy of Sciences (CASIA) action data set [ ]. The CASIA action data set contains sequences of human activities captured by video cameras outdoors from different angles of view. There are 1446 sequences in all, containing eight types of single-person actions (walk, run, bend, jump, crouch, faint, wander, and punching a car), each performed by 24 subjects, and seven types of two-person interactions (rob, fight, follow, follow–gather, meet–part, meet–gather, and overtake), performed by every two subjects. All video sequences were simultaneously taken with three noncalibrated stationary cameras from different angles of view (horizontal, angle, and top-down views).

We selected six kinds of actions in our experiment, including bend, crouch, fall, jump, run, and walk, because these actions can be described and compared better with other methods based on feature descriptors. Similarly, we use the videos of 12 subjects for training, and the other videos of the remaining 12 subjects for
testing. We then cut all the video sequences into subsequences with 30 frames for each sequence.

We compared the effectiveness of hCRF with those of SVM and HMM on the horizontal view. In hCRF modeling, we assume a tree structure of the graph and use Euclidean distance to measure the cost between different nodes. Every 30 frames of the video are taken as an action sequence, and for each sequence, 15–20 points of interest are selected. The results are shown in Tables I–III.

TABLE I. RECOGNITION RESULTS USING SVM FRAME BY FRAME (OVERALL, 74.5%)

TABLE II. RECOGNITION RESULTS USING HMM BASED ON CONTOUR FEATURE (OVERALL, 78.7%)

TABLE III. RECOGNITION RESULTS USING HCRF (OVERALL, 84.2%)

hCRF achieves the best results compared with SVM and HMM. In the learning process, the sequential modeling methods, i.e., hCRF and HMM, obtain much better accuracy than SVM at the expense of twice as much training time as SVM; however, their internal structure facilitates more expressive models for complex actions. For example, in recognizing "fall" and "crouch," hCRF and HMM achieve much better results. In addition, they are more capable of modeling actions with obscure discriminative boundaries; for example, "run" and "walk" are easily confused without knowing the pace of their execution.

To verify the effectiveness of the fusion of motion feature and view invariants, we tested our method and state-of-the-art methods [ ], [ ], [ ] on different views in the CASIA multiview database. For each view, we train a model and then test it against the other two views. The results are shown in Table IV.

TABLE IV. RECOGNITION RESULTS OF DIFFERENT VIEWS

In Table IV, the commonly used appearance-based methods [ ], [ ] give better results under the horizontal view, where accuracy values of 78.8% and 70.5% are obtained; however, they seem to be more vulnerable to view change than the motion features of [ ], which achieves higher average precision in side view (54.2%) and top-down view (47.8%) because of its motion feature. Simple cross ratio (STIP + CR + hCRF) outputs lower accuracy than motion (STIP + OF + hCRF) because cross ratio cannot be stable enough in real scenes. Our method, which considers both the optical flow and the cross ratio (STIP + OF + CR + hCRF), obtains the best results in the three views; as compared with [ ], our method achieves an accuracy improvement of nearly 10% in the three views.

Although using view invariants alone for action recognition produces a low recognition rate, it does help to maintain robustness to view change in fusion (STIP + OF + CR + hCRF) because of its inherent invariability among different views, even if the view angle changes drastically. As shown in Table IV, cross ratio improves precision by 6%–10% after fusion with optical flow in the hCRF framework, even when the view becomes top down.

VII. CONCLUSION AND DISCUSSION

In this paper, we have proposed a method for view-invariant action recognition, which naturally encapsulates motion pattern and view invariants. A feature detection method is used to extract motion information from image sequences, which is much more efficient than traditional background modeling methods; for feature representation, a variety of statistical information is fused to overcome much of the noise in motion features. When computing invariants across frames, we generalized the cross ratio of four collinear points so that it could be applied to view-invariant representation of actions; as for the time series modeling, a discriminative probabilistic model, i.e., hCRF, is applied to model temporal motion patterns and view invariants, so that both are considered in one framework. Experimental results demonstrate that the proposed method presents excellent discrimination ability in recognizing different actions with high robustness to view changes in real circumstances.
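As a numerical illustration of the invariance that motivates the cross-ratio feature, the following minimal Python sketch checks that the classical cross ratio of four collinear points is unchanged by a projective transformation. It is not the cross-frame generalization used in our method; the helper names (cross_ratio, apply_homography), the sample points, and the matrix H are assumptions chosen only for illustration.

```python
from math import hypot, isclose


def cross_ratio(a, b, c, d):
    """Unsigned cross ratio (|AC| * |BD|) / (|BC| * |AD|) of four collinear points."""
    def dist(p, q):
        return hypot(p[0] - q[0], p[1] - q[1])
    return (dist(a, c) * dist(b, d)) / (dist(b, c) * dist(a, d))


def apply_homography(h, p):
    """Map a 2-D point p through the 3x3 homography h (nested lists)."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]  # projective scale factor
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical data: four points on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (3.0, 7.0), (7.0, 15.0)]

# Hypothetical homography standing in for a change of viewpoint.
H = [[1.0, 0.2, 3.0],
     [0.1, 0.9, -2.0],
     [0.001, 0.002, 1.0]]

warped = [apply_homography(H, p) for p in points]

cr_before = cross_ratio(*points)
cr_after = cross_ratio(*warped)
print(cr_before, cr_after)

# The two values agree up to floating-point rounding.
assert isclose(cr_before, cr_after, rel_tol=1e-9)
```

The ordering of the four points in the ratio is one common convention; other orderings give different but equally invariant values, so a feature built on the cross ratio must fix the ordering consistently across views.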
However, since hidden states are introduced into the expression of the conditional probability, the objective function is no longer convex. Therefore, we can only obtain a locally optimal solution for hCRF. Moreover, we have to pass and collect messages for each node in the graph during gradient ascent optimization, which adds considerable computational cost to model training.

REFERENCES

A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 3, pp. 257–267, Mar. 2001.
Y. Wang, K. Huang, and T. Tan, "Human activity recognition based on R transform," in Proc. IEEE CVPR, 2007, pp. 1–8.
R. Souvenir and K. Parrigan, "Viewpoint manifolds for action recognition," J. Image Video Process., vol. 1, pp. 1–13, 2009.
A. J. Lipton, H. Fujiyoshi, and R. S. Patil, "Moving target classification and tracking from real-time video," in Proc. 4th IEEE WACV, 1998, pp. 8–14.
H. Fujiyoshi and A. J. Lipton, "Real-time human motion analysis by image skeletonization," in Proc. 4th IEEE WACV, 1998, pp. 15–21.
F. I. Bashir, A. K. Ashfaq, and S. Dan, "View-invariant motion trajectory-based activity classification and recognition," Multimedia Syst., vol. 12, no. 1, pp. 45–54, Aug. 2006.
M. Ahmad and S.-W. Lee, "Human action recognition using shape and clg-motion flow from multi-view image sequences," Pattern Recognit., vol. 41, no. 7, pp. 2237–2252, Jul. 2008.
Y. Wang and G. Mori, "Learning a discriminative hidden part model for human action recognition," in Proc. NIPS, 2008, vol. 21, pp. 1721–1728.
A. A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," in Proc. ICCV, Nice, France, 2003, pp. 726–733.
N. Johnson and D. Hogg, "Learning the distribution of object trajectories for event recognition," in Proc. 6th BMVC, 1995, pp. 583–592.
C. Stauffer and W. E. L. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 747–757, Aug. 2000.
K. Huang, D. Tao, Y. Yuan, X. Li, and T. Tan, "View independent human behavior analysis," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 4, pp. 1028–1035, Aug. 2009.
D. Buzan, S. Sclaroff, and G. Kollios, "Extraction and clustering of motion trajectories in video," in Proc. ICPR, Washington, DC, 2004, pp. 521–524.
W. Hu, D. Xie, and T. Tan, "A hierarchical self-organizing approach for learning the patterns of motion trajectories," IEEE Trans. Neural Netw., vol. 15, no. 1, pp. 135–144, Jan. 2004.
J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," in Proc. IEEE CVPR, 1992, pp. 379–385.
M. Brand, N. Oliver, and A. Pentland, "Coupled hidden Markov models for complex action recognition," in Proc. CVPR, 1997, pp. 994–999.
C. Sminchisescu, A. Kanaujia, and D. Metaxas, "Conditional models for contextual human motion recognition," Comput. Vis. Image Understanding, vol. 104, no. 2/3, pp. 210–220, Nov./Dec. 2006.
J. D. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. 18th ICML, 2001, pp. 282–289.
M. A. Mendoza and N. Pérez De La Blanca, "Applying space state models in human action recognition: A comparative study," in Proc. 5th Int. Conf. AMDO, 2008, pp. 53–62.
D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Comput. Vis. Image Understanding, vol. 104, no. 2/3, pp. 249–257, Nov./Dec. 2006.
D. Weinland, E. Boyer, and R. Ronfard, "Action recognition from arbitrary views using 3D exemplars," in Proc. IEEE ICCV, 2007, pp. 1–7.
Y. Shen and H. Foroosh, "View-invariant action recognition using fundamental ratios," in Proc. IEEE CVPR, 2008, pp. 1–6.
P. Natarajan and R. Nevatia, "View and scale invariant action recognition using multiview shape-flow models," in Proc. IEEE CVPR, 2008, pp. 1–8.
R. Souvenir and J. Babbs, "Learning the viewpoint manifold for action recognition," in Proc. IEEE CVPR, 2008, pp. 1–7.
C. Rao, A. Yilmaz, and M. Shah, "View-invariant representation and recognition of actions," Int. J. Comput. Vis., vol. 50, no. 2, pp. 203–226, Nov. 2002.
V. Parameswaran and R. Chellappa, "View invariants for human action recognition," in Proc. IEEE CVPR, 2003, vol. 2, pp. 613–621.
Y. Zhang, K. Huang, Y. Huang, and T. Tan, "View-invariant action recognition using cross ratios across frames," in Proc. ICIP, 2009, pp. 3549–3552.
D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. ICCV, 2001, pp. 1150–1158.
P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proc. 14th ICCCN, 2005, pp. 65–72.
I. Laptev and T. Lindeberg, "Space–time interest points," in Proc. ICCV, 2003, pp. 432–439.
J. Y. Bouguet, "Pyramidal implementation of the Lucas Kanade feature tracker: Description of the algorithm," in Proc. KLT Implementation OpenCV, 2002, pp. 1–9.
B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. 7th IJCAI, 1981, pp. 674–679.
N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE CVPR, 2005, pp. 886–893.
Geometric Invariance in Computer Vision, J. L. Mundy and A. Zisserman, Eds. Cambridge, MA: MIT Press, 1992.
A. Quattoni, M. Collins, and T. Darrell, "Conditional random fields for object recognition," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 1097–1104.
S. B. Wang, A. Quattoni, L.-P. Morency, and D. Demirdjian, "Hidden conditional random fields for gesture recognition," in Proc. IEEE CVPR, 2006, pp. 1521–1527.
L. Wang and D. Suter, "Recognizing human activities from silhouettes: Motion subspace and factorial discriminative graphical model," in Proc. IEEE CVPR, 2007, pp. 1–8.
J. Zhang and S. Gong, "Action categorization with modified hidden conditional random field," Pattern Recognit., vol. 43, no. 1, pp. 197–203, Jan. 2010.
M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in Proc. IEEE ICCV, 2005, pp. 1395–1402.
C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in Proc. ICPR, 2004, vol. 3, pp. 32–36.
J. C. Niebles and L. Fei-fei, "A hierarchical model of shape and appearance for human action classification," in Proc. IEEE CVPR, 2007, pp. 1–8.
H. Jhuang, T. Serre, L. Wolf, and T. Poggio, "A biologically inspired system for action recognition," in Proc. ICCV, 2007, pp. 1–8.
I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proc. IEEE CVPR, 2008, pp. 1–8.
L. P. Morency, A. Quattoni, and T. Darrell, "Latent-dynamic discriminative models for continuous gesture recognition," in Proc. IEEE CVPR, 2007, pp. 1–8.
S. Wang, A. Quattoni, L. P. Morency, D. Demirdjian, and T. Darrell, "Hidden conditional random fields for gesture recognition," in Proc. IEEE CVPR, 2006, pp. 1521–1527.
S. M. Nixon and A. S. Aguado, Feature Extraction and Image Processing for Computer Vision. New York: Academic, 2008.

Kaiqi Huang (M'05–S'09–SM'09) received the M.S. degree in electrical engineering from Nanjing University of Science and Technology, Nanjing, China, and the Ph.D. degree in signal and information processing from Southeast University, Nanjing. After receiving the Ph.D. degree, he became a Postdoctoral Researcher with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, where he is currently an Associate Professor. He has published more than 80 papers in TPAMI, TIP, TCSVT, TSMCB, CVIU, Pattern Recognition, CVPR, and ECCV. His interests include visual surveillance, image and video analysis, human vision and cognition, computer vision, etc.
Dr. Huang is a Program Committee Member of more than 50 international conferences and workshops and is a board member of the IEEE Systems, Man, and Cybernetics Technical Committee on Cognitive Computing. He is the Deputy Secretary-General of the IEEE Beijing Section.
Yeying Zhang received the B.Sc. degree in electrical engineering in video processing and multimedia communication from Chengdu University, Chengdu, China, in 2008. He is currently working toward the Master's degree in pattern recognition and intelligent system in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
His current research interests include computer vision, pattern recognition, human behavior analysis, etc.

Tieniu Tan (F'03) received the B.Sc. degree in electronic engineering from Xi'an Jiaotong University, Xi'an, China, in 1984 and the M.Sc. and Ph.D. degrees in electronic engineering from Imperial College of Science, Technology, and Medicine, London, U.K., in 1986 and 1989, respectively.
In October 1989, he was with the Computational Vision Group, Department of Computer Science, The University of Reading, Berkshire, U.K., where he was a Research Fellow, Senior Research Fellow, and Lecturer. In January 1998, he returned to China to join the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China, where he is currently a Professor and the Director of the NLPR. He has published more than 200 research papers in refereed journals and conferences in the areas of image processing, computer vision, and pattern recognition. His current research interests include image processing, machine and computer vision, pattern recognition, multimedia, and robotics.
Dr. Tan was a Guest Editor of the International Journal of Computer Vision (June 2000) and is an Associate Editor or member of the Editorial Board of eight international journals, including the TPAMI, TCSVT, and Pattern Recognition. He serves as Referee or Program Committee Member and Chair for many major national and international journals and conferences. He is the Chair of the IAPR Technical Committee on Signal Processing for Machine Intelligence and the Chair of the Fellow Committee of the IEEE Beijing Section.