Pattern Recognition Letters 24 (2003) 1295–1302
www.elsevier.com/locate/patrec

Facial expression recognition: A clustering-based approach

Xue-wen Chen *, Thomas Huang

Beckman Institute for Advanced Technology and Development, University of Illinois at Urbana-Champaign, 405 N. Mathews Avenue, Urbana, IL 61801, USA

Received 4 February 2002; received in revised form 17 September 2002

* Corresponding author. Address: Electrical and Computer Engineering Department, California State University, 18111 Nordhoff Street, Northridge, CA 91330, USA. Tel.: +1-818-677-4755 / +1-217-244-1958; fax: +1-818-677-7062. E-mail addresses: xwchen@csun.edu, xwchen@uiuc.edu (X.-w. Chen).

Abstract

This paper describes a new clustering-based feature extraction method for facial expression recognition. We demonstrate the effectiveness of the method and compare it with the commonly used principal component analysis and linear discriminant analysis methods.
© 2002 Elsevier Science B.V. All rights reserved.

Keywords: Clustering-based discriminant analysis; Facial expression recognition; Human–computer interaction; Principal component analysis; Linear discriminant analysis

0167-8655/03/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0167-8655(02)00371-9

1. Introduction

Facial expression delivers rich information about human emotion and plays an important role in human communication. For intelligent and natural human–computer interaction, it is essential to recognize facial expressions automatically. Various techniques have been developed for automatic facial expression recognition; they differ in the data used (still images vs. video sequences), the feature extraction methods, and the classifiers. For recognition from image sequences, optical flow estimation is typically used to extract features. Mase (1991) selected facial regions manually, used optical flow to estimate the motion of facial muscles, and applied a k-nearest-neighbor rule for recognition. Yacoob and Davis (1994) used optical flow to track the motion of the brows, eyes, nose, and mouth, and a lookup table to classify six basic expressions. Barlett et al. (1996) combined optical flow and principal component analysis (PCA) for facial expression recognition. Essa and Pentland (1997) used optical flow in a physical model of the face within a recursive framework to classify facial expressions. Otsuka and Ohya (1997) computed the 2D Fourier transform coefficients of optical flow and used a hidden Markov model to classify facial expressions. The performance of all of these methods depends on the reliability of the optical flow estimation.

Facial expression recognition from static images is a more difficult problem than recognition from image sequences, because less information about the expression action is available.
Many psychologists have used single images for expression recognition because of the difficulty of obtaining well-controlled video sequences of standard facial expressions. For example, "mug-shot" images that reveal peak expressions were used in most psychological research on facial expression recognition (e.g., Young and Ellis, 1989). In this paper we are concerned particularly with the recognition of facial expressions from still images, with emphasis on feature extraction.

The two most commonly used methods for extracting features from still images are PCA and linear discriminant analysis (LDA) (e.g., Cottrell and Metcalfe, 1991; Lyons et al., 1999; Padgett and Cottrell, 1995). PCA is an unsupervised method: it treats samples of the same class and of different classes in the same way, so when prior knowledge is available in the form of labeled samples, PCA cannot exploit it. LDA is a supervised method that uses the category information associated with each sample to extract the most discriminatory features, and it has been shown to perform well in many applications. However, LDA may fail to separate samples from different classes if multiple clusters per class exist in the input feature space (an illustration of this point is given in Section 2.2), especially if there is little or no difference between the class mean vectors (in which case the LDA projections are not reliable). To utilize the class information and exploit the cluster information associated with the training samples, we propose a new clustering-based feature extraction method, modified from LDA, and apply it to facial expression recognition. Section 2 describes the proposed algorithm, Section 3 presents experimental results, and Section 4 gives our conclusions.

2. Clustering-based discriminant analysis

Each image can be treated as a feature vector by concatenating its rows, using each pixel as a single feature; thus, each image is represented by an n-dimensional vector $x_k$, where n is the number of pixels in the image. Let $\{x_1, x_2, \ldots, x_N\}$ be a set of N images, let $x_k^{(i)}$ ($k = 1, \ldots, N_i$, where $N_i$ is the number of samples in class i) denote the samples in class i, and let $\mu \in \mathbb{R}^n$ be the mean vector of all N samples. For the ith class ($i = 1, \ldots, c$, where c is the number of classes), $\mu_i$ is the mean vector of the samples in that class. Section 2.1 briefly reviews PCA and LDA, Section 2.2 shows that LDA may perform poorly when a class contains multiple clusters, and Section 2.3 describes the proposed clustering-based approach.

2.1. PCA and LDA

PCA (e.g., Turk and Pentland, 1991) projects the original n-dimensional feature vectors $x_k$ into a reduced feature space, producing projected m-dimensional feature vectors $y_k = A^T x_k$. Here $A \in \mathbb{R}^{n \times m}$ is the transformation matrix whose columns are the orthonormal eigenvectors, corresponding to the m largest eigenvalues, of the total scatter matrix

$$S_T = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T.$$

An important property of PCA is that it captures the main scatter directions and is therefore optimal for representation in a reduced space in terms of minimum mean square error. As a result, PCA can drastically reduce the dimensionality of the original space without losing much information in the sense of representation.
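For concreteness, the projection step just described can be sketched in a few lines of numpy. This is an illustrative sketch only; the helper name pca_projection and its interface are our assumptions, not part of the paper:

```python
import numpy as np

def pca_projection(X, m):
    """Project samples onto the m leading eigenvectors of the total
    scatter matrix S_T = sum_k (x_k - mu)(x_k - mu)^T (a sketch of the
    PCA step described above).

    X : (N, n) array, one n-dimensional sample per row.
    Returns the (N, m) projected features and the (n, m) transform A.
    """
    mu = X.mean(axis=0)                 # overall mean vector
    Xc = X - mu                         # centered samples
    S_T = Xc.T @ Xc                     # total scatter matrix (n x n)
    vals, vecs = np.linalg.eigh(S_T)    # eigenvalues in ascending order
    A = vecs[:, ::-1][:, :m]            # keep the m largest directions
    return Xc @ A, A
```

For raw images, n (the number of pixels) is large, so in practice the same eigenvectors are usually obtained from the N × N Gram matrix of the centered samples rather than from the n × n scatter matrix.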
The disadvantage is that PCA may lose information that is important for discriminating between classes, since it projects the original feature space onto axes determined by the eigenvectors of the total scatter matrix $S_T$, which treats samples of different classes in the same way. To illustrate this point, Fig. 1 shows a two-class problem with a two-dimensional feature space, where PCA is used to project the samples onto a one-dimensional space. The PCA-projected samples from the two classes overlap heavily (so a poor recognition result is expected), even though PCA preserves a large total scatter.

To extract features useful for discrimination, LDA projects the original n-dimensional feature vectors $x_k$ into a reduced feature space with projected m-dimensional feature vectors $y_k = A^T x_k$, where $A \in \mathbb{R}^{n \times m}$ is the transformation matrix consisting of the m generalized eigenvectors $v$ corresponding to the m largest eigenvalues of

$$S_B v = \lambda S_w v,$$

where $S_B$ is the between-class scatter matrix,

$$S_B = \sum_{i=1}^{c} (\mu_i - \mu)(\mu_i - \mu)^T,$$

and $S_w$ is the within-class scatter matrix,

$$S_w = \sum_{i=1}^{c} \sum_{k=1}^{N_i} (x_k^{(i)} - \mu_i)(x_k^{(i)} - \mu_i)^T$$

(e.g., Duda and Hart, 1973). LDA takes both the within-class scatter and the between-class scatter into consideration, and its projections therefore carry discriminative information important for classification.
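As an illustrative sketch (not the paper's code), the scatter matrices above can be formed directly and the generalized eigenproblem solved with SciPy; the sketch assumes $S_w$ is nonsingular, which is why the experiments in Section 3 first reduce the dimensionality with PCA:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, m):
    """Solve S_B v = lambda S_w v for the m leading generalized
    eigenvectors (a sketch of the LDA step described above).

    X : (N, n) samples; y : (N,) integer class labels.
    Assumes S_w is nonsingular (e.g., after a PCA reduction).
    """
    n = X.shape[1]
    mu = X.mean(axis=0)
    S_B = np.zeros((n, n))
    S_w = np.zeros((n, n))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        diff = (mu_c - mu)[:, None]
        S_B += diff @ diff.T            # between-class scatter (as in the text)
        D = Xc - mu_c
        S_w += D.T @ D                  # within-class scatter
    vals, vecs = eigh(S_B, S_w)         # generalized eigenvalues, ascending
    A = vecs[:, ::-1][:, :m]            # eigenvectors of the m largest
    return X @ A, A
```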
For the two-dimensional dataset shown in Fig. 1, LDA is also used to project the samples onto a one-dimensional space. As Fig. 1 shows, the LDA-projected samples from different classes are well separated; i.e., LDA carries better information for distinguishing the two classes than PCA does.

Fig. 1. Projections of 2-D samples onto 1-D using PCA and LDA.

2.2. Multiple clusters in facial space

For face or facial expression recognition problems, it is quite possible that multiple clusters exist within one class, owing to different lighting conditions or different expressions. To see this, we project the samples of the neutral expression class (the database is described in detail in Section 3) onto a two-dimensional PCA space. Fig. 2 shows the projected result: there are clearly four clusters in the neutral expression class.

Fig. 2. Projection of the neutral expression class onto 2-D PCA space.

For input data with multiple clusters per class, LDA may fail to generate features that separate samples from different classes, since the class means do not represent the data structure. This can be seen in Fig. 3(a), where one class has three clusters (samples represented by "+") and the other has one cluster (samples represented by "o"). After the LDA projection, many samples from the two classes overlap.

Fig. 3. Comparison of different projection methods: (a) LDA; (b) PCA, LDA, and CDA. Lines are the axes of projection.
Next, we introduce a new method modified from LDA, clustering-based discriminant analysis, for feature extraction.

2.3. Clustering-based discriminant analysis (CDA)

We wish to determine a transform $y_k = A^T x_k$ on an n-dimensional input $x_k$ such that the projections (transformed values) of each class (which may consist of many clusters) are separated. For a c-class problem, denote the number of clusters in the ith class by $d_i$, the mean vector of the jth cluster in the ith class by $\mu_j^{(i)}$ ($j = 1, \ldots, d_i$), and the m projection vectors by $u_t$, $t = 1, \ldots, m$ (i.e., the transformation matrix is $A = [u_1, u_2, \ldots, u_m]$). After projection onto $u_t$, the mean difference between the jth cluster in the ith class and the hth cluster in the lth class is

$$R_{jh}^{(i)(l)} = u_t^T (\mu_j^{(i)} - \mu_h^{(l)})(\mu_j^{(i)} - \mu_h^{(l)})^T u_t,$$

and the variance of the jth cluster in the ith class is

$$C_j^{(i)} = \sum_s u_t^T (x_s - \mu_j^{(i)})(x_s - \mu_j^{(i)})^T u_t,$$

where the summation is over all samples $x_s$ of the jth cluster in the ith class. To separate clusters belonging to different classes without constraining clusters belonging to the same class, we maximize

$$\sum_{i=1}^{c-1} \sum_{l=i+1}^{c} \sum_{j=1}^{d_i} \sum_{h=1}^{d_l} R_{jh}^{(i)(l)} = u_t^T \tilde{R} u_t$$

(e.g., for a two-class problem, $c = 2$, $i = 1$, and $l = 2$; thus we maximize $\sum_{j=1}^{d_1} \sum_{h=1}^{d_2} R_{jh}^{(1)(2)}$, the sum of mean differences from every cluster in class 1 to every cluster in class 2). We also minimize each cluster's scatter so as to keep each cluster compact, i.e., we minimize $\sum_{i=1}^{c} \sum_{j=1}^{d_i} C_j^{(i)} = u_t^T \tilde{C} u_t$. Thus, we maximize the criterion function $J = (u_t^T \tilde{R} u_t)/(u_t^T \tilde{C} u_t)$ for all $u_t$, where

$$\tilde{R} = \sum_{i=1}^{c-1} \sum_{l=i+1}^{c} \sum_{j=1}^{d_i} \sum_{h=1}^{d_l} (\mu_j^{(i)} - \mu_h^{(l)})(\mu_j^{(i)} - \mu_h^{(l)})^T \qquad (1)$$

and

$$\tilde{C} = \sum_{i=1}^{c} \sum_{j=1}^{d_i} \sum_s (x_s - \mu_j^{(i)})(x_s - \mu_j^{(i)})^T. \qquad (2)$$

Setting the derivative of $(u_t^T \tilde{R} u_t)/(u_t^T \tilde{C} u_t)$ with respect to $u_t$ to zero, we obtain the solutions $u_t$, which satisfy

$$\tilde{R} u_t = \lambda_t \tilde{C} u_t, \qquad (3)$$

where $\lambda_t = (u_t^T \tilde{R} u_t)/(u_t^T \tilde{C} u_t) = J$. Thus, the projection vectors of CDA are the m eigenvectors $u_t$ corresponding to the m largest eigenvalues $\lambda_t$ of Eq. (3).

The discussion above assumes that the clusters in each class are known, so a necessary step in the proposed CDA method is clustering. In our application, a fuzzy c-means (FCM) clustering algorithm is used (e.g., Bezdek, 1981); for high-dimensional data, we first reduce the feature space with PCA and perform the clustering in the reduced space. The number of clusters for each class is selected by Xie and Beni's cluster validity measure (e.g., Xie and Beni, 1991). It is worth noting that, in our experiments, the choice of clustering algorithm is not critical to CDA. This is expected, because most clustering algorithms can find well-separated clusters, and for clusters in the same class that lie close to each other, treating them as one cluster or as several does not significantly affect the classification performance.

As a modified version of LDA, CDA is mainly designed for two-class problems, although it can be applied directly to multi-class problems ($c > 2$).
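Eqs. (1)–(3) translate directly into code. The sketch below is illustrative: the paper clusters with FCM and chooses each $d_i$ by the Xie–Beni index, but since the choice of clustering algorithm is not critical (as noted above), scikit-learn's k-means is substituted here for brevity, and the cluster counts are taken as given; the helper name and interface are our assumptions:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans   # stand-in for the paper's fuzzy c-means

def cda_projection(X, y, d, m):
    """CDA sketch per Eqs. (1)-(3): maximize J = (u^T R u)/(u^T C u).

    X : (N, n) samples; y : (N,) class labels;
    d : dict mapping class label -> number of clusters d_i;
    m : number of projection vectors. Assumes C is nonsingular.
    """
    n = X.shape[1]
    means, blocks, cls = [], [], []           # per-cluster statistics
    for c in np.unique(y):
        Xc = X[y == c]
        km = KMeans(n_clusters=d[c], n_init=10).fit(Xc)
        for j in range(d[c]):
            B = Xc[km.labels_ == j]
            blocks.append(B)
            means.append(B.mean(axis=0))
            cls.append(c)
    R = np.zeros((n, n))                      # Eq. (1): cluster-mean scatter
    for a in range(len(means)):
        for b in range(a + 1, len(means)):
            if cls[a] != cls[b]:              # pairs from different classes only
                diff = (means[a] - means[b])[:, None]
                R += diff @ diff.T
    C = np.zeros((n, n))                      # Eq. (2): pooled cluster scatter
    for B, mu_c in zip(blocks, means):
        D = B - mu_c
        C += D.T @ D
    vals, vecs = eigh(R, C)                   # Eq. (3): R u = lambda C u
    A = vecs[:, ::-1][:, :m]                  # m largest eigenvalues
    return X @ A, A
```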
The main difference between CDA and LDA is that LDA is designed to separate classes without considering individual clusters, while CDA exploits cluster information and separates the clusters associated with different classes. Thus, CDA can handle input data with multiple clusters per class (e.g., a mixture of Gaussians) and can work well even if there is little or no difference between the class mean vectors (where the LDA projections are not reliable). For two-class problems with multiple clusters per class, one could also apply multiple discriminant analysis (MDA), an extension of LDA to multiple classes (e.g., Duda and Hart, 1973), by treating different clusters as different classes. MDA finds projections that separate all clusters regardless of their class membership, i.e., it treats clusters in the same class as distinct, which imposes extra constraints; however, it is not necessary to separate clusters (and thus samples) belonging to the same class. CDA separates only clusters belonging to
different classes, without putting any constraint on clusters in the same class.

To compare the performance of PCA, LDA, and CDA, Fig. 3(b) shows the projection vectors for the same data as in Fig. 3(a). It is clear that under the PCA projection, many samples from cluster 2 of class 1 overlap with samples from class 2; under the LDA projection, samples from clusters 1 and 3 of class 1 overlap with samples from class 2; while under the CDA projection, the samples from the two classes separate well.

3. Experimental results

In this section, we apply the proposed CDA algorithm to facial expression recognition and compare it with LDA on two-class classification problems. For the multi-class problem we use a one-to-many classification scheme, in which each class is trained against all other classes. Since the dimensionality of face images is very high, both the within-class matrix $S_w$ in LDA and $\tilde{C}$ in CDA are typically singular. Thus, before applying LDA and CDA, PCA is used to reduce the dimensionality of the original feature space to $N - c$, as was done in Belhumeur et al. (1997) and Swets and Weng (1996).

The face database (Martinez and Benavente, 1998) used in the experiments consists of 1428 images of 119 persons' faces (64 men and 55 women) with three facial expressions (neutral, smile, and anger). Each person has 12 images: half of them were taken in a first session and used as our training set (714 in total); the others were taken under similar conditions two weeks later and used as our test set (714 in total). Fig. 4 shows example images with the three facial expressions from the database; the first column shows neutral images, the second smile, and the third anger.

Fig. 4. Example image sets of the three facial expressions in the face database.

For each person we then manually select a rectangular region containing the brows and eyes and a rectangular region containing the mouth. The pixels inside both rectangular regions are lexicographically ordered and concatenated into a feature vector.
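As a usage sketch tying the pieces together for one two-class problem, the illustrative pca_projection and cda_projection helpers from Section 2 could be combined as follows; the data, cluster counts, and feature numbers below are assumed toy values, not the paper's:

```python
import numpy as np

# Toy stand-in data: 714 training vectors with 50 features and labels
# 0/1 for one two-class problem; the real inputs are the pixel vectors
# from the two facial regions, reduced by PCA as described above.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(714, 50))
y_train = rng.integers(0, 2, size=714)

Z, A_pca = pca_projection(X_train, 40)   # PCA first, so C in CDA is nonsingular
F, A_cda = cda_projection(Z, y_train, d={0: 4, 1: 3}, m=2)  # assumed d_i and m
```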
The transformation matrices A for feature extraction are derived from the training set only, and we use the nearest-neighbor rule for classification. The 3-class classification problem is decomposed into six 2-class classification problems, as shown in Fig. 5. A sample is first classified as neutral vs. non-neutral (classifier 1), smile vs. non-smile (classifier 2), and anger vs. non-anger (classifier 3). If the sample is claimed by a single expression only (e.g., neutral by classifier 1, non-smile by classifier 2, and non-anger by classifier 3), it is assigned that expression (e.g., neutral). If it is claimed by all three expressions (neutral, smile, and anger) or by none (non-neutral, non-smile, and non-anger), it is classified as "uncertain" (this case does not occur in our application). Otherwise, the sample is further classified by the appropriate pairwise classifier: neutral vs. smile (classifier 4), neutral vs. anger (classifier 5), or smile vs. anger (classifier 6), depending on the classification results from the previous classifiers.

Fig. 5. Classification procedures.
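The decision logic of Fig. 5 can be summarized in a short sketch; the classifier interface here is hypothetical, assumed purely for illustration:

```python
def classify_expression(x, clf):
    """Decision scheme of Fig. 5 (sketch with an assumed interface).

    clf['neutral'], clf['smile'], clf['anger'] are classifiers 1-3 and
    return True if they claim the expression for sample x;
    clf[frozenset({a, b})] are the pairwise classifiers 4-6 and return
    the winning label, either a or b.
    """
    claims = [e for e in ('neutral', 'smile', 'anger') if clf[e](x)]
    if len(claims) == 1:              # exactly one expression claims x
        return claims[0]
    if len(claims) in (0, 3):         # all or none claim x -> uncertain
        return 'uncertain'            # (does not occur in the experiments)
    return clf[frozenset(claims)](x)  # resolve the remaining pair
```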
For two-class problems, only one final feature can be extracted with LDA. To determine the number of final CDA features, we compute the classification rate on the training samples as a function of the number of CDA features and retain the number giving the largest rate.

Table 1 lists the test results using LDA and CDA for classifiers 1, 2, and 3. CDA performs better, yielding higher classification rates in all three cases (86.7%, 98.2%, and 89.1% for neutral vs. non-neutral, smile vs. non-smile, and anger vs. non-anger, respectively).

Table 1
Comparison of classification rates

Features   Neutral vs. non-neutral (%)   Smile vs. non-smile (%)   Anger vs. non-anger (%)
LDA        78.2                          97.5                      80.3
CDA        86.7                          98.2                      89.1

The final classification results are summarized in Tables 2 and 3, the confusion matrices for LDA and CDA, respectively. For LDA, most samples with the smile expression (113) are correctly classified; however, LDA has trouble distinguishing the anger expression from the neutral expression (62 images with neutral expressions are classified as anger, and 43 images with anger expressions are classified as neutral). The proposed CDA outperforms LDA: 118 of the 119 samples with the smile expression, 451 samples with the neutral expression, and 96 samples with the anger expression are correctly classified.

Table 2
Confusion matrix for LDA (rows: true expression; columns: assigned expression)

           Neutral   Smile   Anger
Neutral    409       5       62
Smile      5         113     1
Anger      43        2       74

Table 3
Confusion matrix for CDA (rows: true expression; columns: assigned expression)

           Neutral   Smile   Anger
Neutral    451       3       22
Smile      0         118     1
Anger      23        0       96

4. Conclusions

In this paper, we describe a novel discriminant feature extraction method, clustering-based discriminant analysis (CDA), and show its feasibility for facial expression recognition problems. CDA is modified from LDA. Unlike PCA, which is designed for representation, CDA extracts features with discriminative power; unlike LDA, which separates whole-class means, CDA seeks to separate the clusters of different classes. Experimental results show that CDA yields better classification accuracy than LDA. The CDA method provides an efficient feature reduction and extraction scheme useful for facial expression recognition and can be applied to face recognition as well.

Acknowledgements

The authors would like to thank the reviewers for their valuable comments.

References

Barlett, M., Viola, P., Sejnowski, T., Larsen, L., Hager, J., Ekman, P., 1996. Classifying facial action. In: Touretzky, D., Mozer, M., Hasselmo, M. (Eds.), Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA.
Belhumeur, P., Hespanha, J., Kriegman, D., 1997. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 711–720.
Bezdek, J., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, New York.
Cottrell, G., Metcalfe, J., 1991. EMPATH: Face, gender, and emotion recognition using holons. In: Lippman, R., Moody, J., Touretzky, D. (Eds.), Advances in Neural Information Processing Systems, vol. 3, pp. 564–571.
Duda, R., Hart, P., 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Essa, I., Pentland, A., 1997. Coding, analysis, interpretation, and recognition of facial expressions. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 757–763.
Lyons, M., Budynek, J., Akamatsu, S., 1999. Classifying images of facial expression using a Gabor wavelet representation. In: Proceedings, 2nd International Conference on Cognitive Science, Tokyo, Japan, pp. 113–118.
Martinez, A., Benavente, R., 1998. The AR Face Database. CVC Technical Report #24.
Mase, K., 1991. Recognition of facial expression from optical flow. IEICE Trans. E74 (10), 3474–3483.
Otsuka, T., Ohya, J., 1997. Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences. In: Proc. Int. Conf. on Image Processing, pp. 546–549.
Padgett, C., Cottrell, G., 1995. Identifying emotion in static images. In: Proceedings of the Second Joint Symposium on Neural Computation, vol. 5, pp. 91–101.
Swets, D., Weng, J., 1996. Using discriminant eigenfeatures for image retrieval. IEEE Trans. Pattern Anal. Machine Intell. 18 (8), 831–836.
Turk, M., Pentland, A., 1991. Eigenfaces for recognition. J. Cognitive Neurosci. 3 (1), 71–86.
Xie, L., Beni, G., 1991. A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Machine Intell. 13 (8), 841–847.
Yacoob, Y., Davis, L., 1994. Recognizing human facial expressions from long image sequences using optical flow. IEEE Trans. Pattern Anal. Machine Intell. 16 (6), 636–642.
Young, A., Ellis, H., 1989. Handbook of Research on Face Processing. Elsevier, New York.
