Pattern Recognition Letters 24 (2003) 1295–1302
www.elsevier.com/locate/patrec
Facial expression recognition: A clustering-based approach
Xue-wen Chen *, Thomas Huang
Beckman Institute for Advanced Technology and Development, University of Illinois at Urbana-Champaign,
405 N. Mathews Avenue, Urbana, IL 61801, USA
Received 4 February 2002; received in revised form 17 September 2002
Abstract
This paper describes a new clustering-based feature extraction method for facial expression recognition. We demonstrate the effectiveness of this method and compare it with the commonly used principal component analysis (PCA) and linear discriminant analysis (LDA) methods.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: Clustering-based discriminant analysis; Facial expression recognition; Human-computer interaction; Principal component analysis; Linear discriminant analysis
1. Introduction
Facial expression delivers rich information
about human emotion and plays an important role
in human communications. For intelligent and
natural human–computer interaction, it is essential
to recognize facial expression automatically. Various techniques have been developed for automatic
facial expression recognition, which differ in the data used (still images vs. video sequences), the feature extraction methods, and the classifiers used. For facial expression recognition from image sequences, optical flow estimation is typically used to extract features. Mase (1991) selected some facial regions manually and used optical flow to estimate the motion of facial muscles, with a k-nearest-neighbor rule for recognition. Yacoob and Davis (1994) used optical flow to track the motion of the brows, eyes, nose, and mouth; a lookup table was used to classify six basic expressions. Barlett et al. (1996) combined optical flow and principal component analysis (PCA) for facial expression recognition. Essa and Pentland (1997) used optical flow in a physical model of the face with a recursive framework to classify facial expressions. Otsuka and Ohya (1997) computed the 2D Fourier transform coefficients of optical flow, and a hidden Markov model was used to classify facial expressions. The performance of these methods depends on the reliability of the optical flow estimation.

* Corresponding author. Address: Electrical and Computer Engineering Department, California State University, 18111 Nordhoff Street, Northridge, CA 91330, USA. Tel.: +1-818-677-4755/+1-217-244-1958; fax: +1-818-677-7062. E-mail addresses: xwchen@csun.edu, xwchen@uiuc.edu (X.-w. Chen).
0167-8655/03/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0167-8655(02)00371-9

Facial expression recognition from static images is a more difficult problem than recognition from image sequences, due to the fact that less information
during the expression action is available. Many psychologists have used single images for expression recognition because of the difficulty of obtaining well-controlled video sequences of standard facial expressions. For example, "mug-shot" images that reveal peak expressions were used in most psychological research on facial expression recognition (e.g., Young and Ellis, 1989).
In this paper we are concerned particularly with the recognition of facial expressions from still images, with emphasis on feature extraction. The two most commonly used methods for extracting features from still images are PCA and linear discriminant analysis (LDA) (e.g., Cottrell and Metcalfe, 1991; Lyons et al., 1999; Padgett and Cottrell, 1995). PCA is an unsupervised method, which treats samples of the same class and of different classes in the same way. Given prior knowledge in the form of labeled samples, PCA is not able to exploit such information. LDA is a supervised method that uses the category information associated with each sample to extract the most discriminatory features, and it has been shown to perform well in many applications. However, LDA may not be able to separate samples from different classes if multiple clusters per class exist in the input feature space (an illustration of this point is shown in Section 2.2), especially if there is little or no difference in the class mean vectors (in which case the LDA projections are not reliable). To utilize the class information and exploit the cluster information associated with the training samples, we propose a new clustering-based feature extraction method, which is modified from LDA. We then apply this method to facial expression recognition. In Section 2, we describe the proposed algorithm. Experimental results are presented in Section 3. Our conclusion is presented in Section 4.
2. Clustering-based discriminant analysis
Each image can be treated as a feature vector by concatenating the rows of the image together, using each pixel as a single feature. Thus, each image can be represented by an n-dimensional vector x_k, where n is the number of pixels in each image. Let {x_1, x_2, ..., x_N} be a set of N images, x_k^(i) (k = 1, ..., N_i, where N_i is the number of samples in class i) be the samples in class i, and μ ∈ R^n be the mean vector of all N samples. For the ith class (i = 1, ..., c, where c is the number of classes), μ_i is the mean vector of the samples in this class. In this section, we briefly discuss the PCA and LDA methods in Section 2.1; in Section 2.2, we show that LDA may perform poorly given multiple clusters in one class; the proposed clustering-based approach is then described in Section 2.3.
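As a minimal sketch, the image-to-vector representation above amounts to flattening each image row by row and grouping the resulting vectors by class label. The image size and labels below are illustrative assumptions, not the paper's actual data.

```python
import numpy as np

# Toy stand-in for a set of N grayscale images (assumed 24x24 here).
rng = np.random.default_rng(0)
images = rng.random((10, 24, 24))            # N = 10 images
labels = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])

X = images.reshape(len(images), -1)          # row-concatenate: each x_k is n-dim
mu = X.mean(axis=0)                          # global mean vector (mu in the text)
class_means = {i: X[labels == i].mean(axis=0) for i in np.unique(labels)}
```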
2.1. PCA and LDA
PCA (e.g., Turk and Pentland, 1991) projects the original n-dimensional feature vectors x_k to a new reduced feature space, where the new projected m-dimensional feature vectors are y_k = A^T x_k. Here A ∈ R^(n×m) is the transformation matrix consisting of the orthonormal eigenvectors of the total scatter matrix

S_T = Σ_{k=1}^{N} (x_k − μ)(x_k − μ)^T,

corresponding to the m largest eigenvalues. An important property of PCA is that it generates features capturing the main scatter directions and thus is optimal for representation in a reduced space in terms of the minimum mean square error. As a result, PCA can drastically reduce the dimensionality of the original space without much loss of information in the sense of representation. The disadvantage is that PCA may lose information important for discrimination between different classes, since it projects the original feature space onto axes determined by the eigenvectors of the total scatter matrix S_T, which treats samples of different classes in the same way. To illustrate this point, Fig. 1 shows an example of a two-class problem with a two-dimensional feature space, where PCA is used to project samples onto a one-dimensional space. As can be seen, the PCA-projected samples from the two classes overlap heavily (thus, a poor recognition result is expected), although PCA preserves a large total scatter.
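The PCA projection above can be sketched directly from the definition of S_T: form the total scatter matrix of the centered data and keep the eigenvectors of its m largest eigenvalues.

```python
import numpy as np

def pca_transform(X, m):
    """Project rows of X onto the m leading eigenvectors of the
    total scatter matrix S_T = sum_k (x_k - mu)(x_k - mu)^T."""
    mu = X.mean(axis=0)
    Xc = X - mu
    ST = Xc.T @ Xc                           # total scatter matrix S_T
    evals, evecs = np.linalg.eigh(ST)        # symmetric; ascending eigenvalues
    A = evecs[:, ::-1][:, :m]                # m leading orthonormal eigenvectors
    return Xc @ A, A                         # y_k = A^T (x_k - mu)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
Y, A = pca_transform(X, 2)
```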
To extract features useful for discrimination, LDA projects the original n-dimensional feature vectors x_k to a new reduced feature space with the projected m-dimensional feature vectors y_k = A^T x_k. Here A ∈ R^(n×m) is the transformation matrix consisting of the m generalized eigenvectors v corresponding to the m largest eigenvalues of the equation S_B v = λ_i S_w v,
Fig. 2. Projection of neutral expression class onto 2-D PCA
space.
Fig. 1. Projections of 2-D samples onto 1-D using PCA and
LDA.
where S_B is the between-class scatter matrix, defined as

S_B = Σ_{i=1}^{c} (μ_i − μ)(μ_i − μ)^T,

and S_w is the within-class scatter matrix, defined as

S_w = Σ_{i=1}^{c} Σ_{k=1}^{N_i} (x_k^(i) − μ_i)(x_k^(i) − μ_i)^T

(e.g., Duda and Hart, 1973). Apparently, LDA takes both the within-class scatter and the between-class scatter into consideration, and thus it carries discrimination information important for classification. For the two-dimensional dataset shown in Fig. 1, LDA is used to project samples onto a one-dimensional space. As can be seen from Fig. 1, the LDA-projected samples from different classes are well separated; i.e., LDA carries better information for distinguishing the two classes than PCA does.
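A compact sketch of this LDA formulation, using S_B and S_w exactly as defined above (unweighted S_B, as written in the text) and solving the generalized eigenproblem S_B v = λ S_w v:

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, y, m):
    """LDA: keep the m generalized eigenvectors of S_B v = lambda S_w v
    with the largest eigenvalues, per the definitions in the text."""
    n = X.shape[1]
    mu = X.mean(axis=0)
    SB = np.zeros((n, n))
    Sw = np.zeros((n, n))
    for c in np.unique(y):
        Xi = X[y == c]
        mi = Xi.mean(axis=0)
        SB += np.outer(mi - mu, mi - mu)     # between-class scatter term
        D = Xi - mi
        Sw += D.T @ D                        # within-class scatter term
    evals, evecs = eigh(SB, Sw)              # generalized symmetric eigenproblem
    A = evecs[:, ::-1][:, :m]                # m largest eigenvalues
    return X @ A, A

# two well-separated toy classes: the 1-D LDA projection separates them
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 4)), rng.normal(5.0, 1.0, (30, 4))])
y = np.array([0] * 30 + [1] * 30)
Y, A = lda_transform(X, y, 1)
```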
2.2. Multiple clusters of facial space
For face or facial expression recognition problems, it is highly possible that multiple clusters exist in one class due to different lighting conditions or different expressions. To see this, we project the samples in the neutral expression class (a detailed description of the database is given in Section 3) onto a two-dimensional PCA space. Fig. 2 shows the projected result. It is clear that there are four clusters for the neutral expression class.

For input data with multiple clusters per class, LDA may fail to generate features for separating samples from two classes, since the class means do not represent the data structure. This can be seen from Fig. 3(a), where one class has three clusters
Fig. 3. Comparison of different projection methods, (a) LDA, and (b) PCA, LDA, and CDA; lines are the axes of projection.
(samples are represented by "+") and the other has one cluster (samples are represented by "o"). Apparently, after the LDA projection, many samples from the two different classes overlap.

Next, we introduce a new method modified from LDA, clustering-based discriminant analysis, for feature extraction.
2.3. Clustering-based discriminant analysis (CDA)
We wish to determine a transform y_k = A^T x_k on an n-dimensional input x_k such that the projections (transformed values) for each class (which may consist of many clusters) are separated. For a c-class problem, let us denote the number of clusters in the ith class by d_i, the mean vector of the jth cluster in the ith class by μ_j^(i) (j = 1, ..., d_i), and the m projection vectors by u_t, t = 1, ..., m (i.e., the transformation matrix A = [u_1, u_2, ..., u_m]). After projection onto u_t, the mean difference between the jth cluster in the ith class and the hth cluster in the lth class is

R_jh^(i)(l) = u_t^T (μ_j^(i) − μ_h^(l))(μ_j^(i) − μ_h^(l))^T u_t,

and the variance of the jth cluster in the ith class is

C_j^(i) = Σ_s u_t^T (x_s − μ_j^(i))(x_s − μ_j^(i))^T u_t,

where the summation is over all samples x_s of the jth cluster in the ith class. To separate clusters belonging to different classes without constraints on clusters belonging to the same class, we maximize

Σ_{i=1}^{c−1} Σ_{l=i+1}^{c} Σ_{j=1}^{d_i} Σ_{h=1}^{d_l} R_jh^(i)(l) = u_t^T R̃ u_t

(e.g., for two-class problems, c = 2, i = 1, and l = 2; thus, we maximize Σ_{j=1}^{d_1} Σ_{h=1}^{d_2} R_jh^(1)(2), which is the sum of mean differences from every cluster in class 1 to every cluster in class 2); we also minimize each cluster scatter so as to keep each cluster compact (i.e., minimize Σ_{i=1}^{c} Σ_{j=1}^{d_i} C_j^(i) = u_t^T C̃ u_t). Thus, we maximize the criterion function J = (u_t^T R̃ u_t)/(u_t^T C̃ u_t) for all u_t, where

R̃ = Σ_{i=1}^{c−1} Σ_{l=i+1}^{c} Σ_{j=1}^{d_i} Σ_{h=1}^{d_l} (μ_j^(i) − μ_h^(l))(μ_j^(i) − μ_h^(l))^T    (1)

and

C̃ = Σ_{i=1}^{c} Σ_{j=1}^{d_i} Σ_s (x_s − μ_j^(i))(x_s − μ_j^(i))^T.    (2)

By setting the derivative of (u_t^T R̃ u_t)/(u_t^T C̃ u_t) with respect to u_t to zero, we obtain the solutions u_t, which satisfy the following equation:

R̃ u_t = λ_t C̃ u_t    (3)

where λ_t = (u_t^T R̃ u_t)/(u_t^T C̃ u_t) = J. Thus, the projection vectors of CDA are the m eigenvectors u_t corresponding to the m largest eigenvalues λ_t of Eq. (3).
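A sketch of Eqs. (1)-(3), assuming the per-sample cluster labels come from some earlier clustering step: R̃ accumulates outer products of cluster-mean differences across classes, C̃ accumulates within-cluster scatter, and the projections solve R̃u = λC̃u.

```python
import numpy as np
from scipy.linalg import eigh

def cda_transform(X, y, clusters, m):
    """CDA projections per Eqs. (1)-(3); `clusters` holds each sample's
    cluster label within its class (from any clustering algorithm)."""
    n = X.shape[1]
    classes = np.unique(y)
    # cluster means mu_j^(i), grouped by class
    means = {c: [X[(y == c) & (clusters == j)].mean(axis=0)
                 for j in np.unique(clusters[y == c])]
             for c in classes}
    R = np.zeros((n, n))
    C = np.zeros((n, n))
    for a in range(len(classes)):            # pairs of distinct classes
        for b in range(a + 1, len(classes)):
            for mj in means[classes[a]]:
                for mh in means[classes[b]]:
                    d = mj - mh
                    R += np.outer(d, d)      # Eq. (1)
    for c in classes:
        for j in np.unique(clusters[y == c]):
            Xc = X[(y == c) & (clusters == j)]
            D = Xc - Xc.mean(axis=0)
            C += D.T @ D                     # Eq. (2)
    evals, evecs = eigh(R, C)                # Eq. (3), ascending order
    A = evecs[:, ::-1][:, :m]
    return X @ A, A

# toy case like Fig. 3: class 0 has two clusters flanking class 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-4.0, 0.0], 0.5, size=(30, 2)),
               rng.normal([4.0, 0.0], 0.5, size=(30, 2)),
               rng.normal([0.0, 0.0], 0.5, size=(30, 2))])
y = np.array([0] * 60 + [1] * 30)
clusters = np.array([0] * 30 + [1] * 30 + [0] * 30)
Y, A = cda_transform(X, y, clusters, 1)
```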
In the above discussion, we assume that the clusters in each class are known. A necessary procedure in the proposed CDA method is therefore clustering. In our application, a fuzzy c-means (FCM) clustering algorithm (e.g., Bezdek, 1981) is used (for high-dimensional data, we use PCA to reduce the feature space first, and clustering is done in the reduced space); the number of clusters for each class is selected by Xie and Beni's cluster validity method (e.g., Xie and Beni, 1991). It is worth noting that, from our experimental results, the selection of the clustering algorithm is not critical to the CDA algorithm. This is what we expected, because most clustering algorithms can find well-separated clusters. For clusters in the same class which are close to each other, considering them as one cluster or as several clusters does not significantly affect the classification performance.
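The cluster-count selection step can be sketched as follows. Since the text notes that the choice of clustering algorithm is not critical, a plain k-means stands in for FCM here, paired with a hard-assignment version of the Xie-Beni validity index (compactness over minimal center separation; smaller is better).

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Minimal Lloyd k-means; deterministic spread-out init (assumes the
    toy data below is ordered, which keeps the example reproducible)."""
    idx = np.round(np.linspace(0, len(X) - 1, k)).astype(int)
    centers = X[idx].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        lab = d.argmin(axis=1)
        new = np.array([X[lab == i].mean(axis=0) if np.any(lab == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, lab

def xie_beni(X, centers, lab):
    """Hard-assignment Xie-Beni index: within-cluster scatter divided by
    N times the smallest squared distance between centers."""
    compact = sum(((X[lab == i] - c) ** 2).sum() for i, c in enumerate(centers))
    sep = min(((ci - cj) ** 2).sum()
              for i, ci in enumerate(centers)
              for j, cj in enumerate(centers) if i < j)
    return compact / (len(X) * sep)

# pick the cluster count with the best (lowest) validity on toy data
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.2, size=(40, 2)) for m in (0.0, 3.0, 6.0)])
scores = {k: xie_beni(X, *kmeans(X, k)) for k in range(2, 6)}
best = min(scores, key=scores.get)
```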
As a modified version of LDA, CDA is mainly designed for two-class problems, although it can be directly applied to multi-class problems (c > 2). The main difference between CDA and LDA is that LDA is designed to separate different classes without considering individual clusters, while CDA exploits cluster information and separates clusters associated with different classes. Thus, CDA can handle input data with multiple clusters per class (e.g., a mixture of Gaussians) and can work well even if there is little or no difference in the class mean vectors (in which case the LDA projections are not reliable). For two-class problems with multiple clusters per class, one can also apply multiple discriminant analysis (MDA), an extension of LDA to multiple classes (e.g., Duda and Hart, 1973), by treating different clusters as different classes. MDA finds projections that separate all clusters regardless of their class membership, i.e., MDA treats clusters in the same class differently (which imposes extra constraints). However, it is not necessary to separate clusters (and thus samples) in the same class. CDA separates only clusters in different classes, without putting any constraint on clusters in the same class.
To compare the performance of PCA, LDA, and CDA, Fig. 3(b) shows the projection vectors for the same data as in Fig. 3(a). It is clear that for the PCA projection, many samples from cluster 2 of class 1 overlap with samples from class 2; for the LDA projection, samples from both clusters 1 and 3 of class 1 overlap with samples from class 2; while for the CDA projection, the samples from the two classes separate well.
3. Experimental results
In this section, we demonstrate the application of the proposed CDA algorithm to facial expression recognition. We compare the proposed CDA algorithm to the LDA algorithm on two-class classification problems. We use a one-to-many classification scheme for multi-class classification problems, where each class is trained against all other classes. Since the dimensionality of face images is very high, both the within-class matrix S_w in LDA and C̃ in CDA are typically singular. Thus, before we apply LDA and CDA, PCA is used to reduce the dimensionality of the original feature space to N − c, as done in Belhumeur et al. (1997) and Swets and Weng (1996).
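A sketch of this pre-reduction step, with toy sizes in place of real face images: PCA via SVD of the centered data (avoiding the n × n scatter matrix) down to N − c dimensions, after which S_w or C̃ computed in the reduced space is no longer singular.

```python
import numpy as np

def pca_reduce(X, m):
    """PCA to m dimensions via thin SVD of the centered data matrix."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt: PCA axes
    return Xc @ Vt[:m].T, Vt[:m].T, mu

N, n, c = 30, 100, 3                   # toy sizes: 30 images, 100 pixels, 3 classes
rng = np.random.default_rng(3)
X = rng.normal(size=(N, n))
Y, A, mu = pca_reduce(X, N - c)        # centered X has rank < N, so N - c is safe
```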
The face database (Martinez and Benavente, 1998) used in the experiments consists of 1428 images corresponding to 119 persons' faces (64 men and 55 women) with three facial expressions (neutral, smile, and anger). Each person has 12 images: half of them were from the first shot and used as our training sample set (714 in total); the others were taken under similar conditions two weeks later and used as our test sample set (714 in total). Fig. 4 shows example images with the three facial expressions in the database, where the first-column images are neutral, the second-column images are smile, and the last-column images are anger. We then manually select a rectangular region containing the brows and eyes and a rectangular region containing the mouth for each person. Pixels inside both rectangular regions are lexicographically

Fig. 4. Example image sets of three facial expressions in the face database.
ordered and concatenated into a feature vector. The transformation matrices A for feature extraction are then derived from the training set only. We use the nearest-neighbor rule for classification. The 3-class classification problem is decomposed into six 2-class classification problems, as shown in Fig. 5: a sample is first classified as neutral vs. non-neutral (classifier 1), smile vs. non-smile (classifier 2), and anger vs. non-anger (classifier 3). If the sample is classified as a single expression only (e.g., neutral in classifier 1, non-smile in classifier 2, and non-anger in classifier 3), it is classified as that expression in our system (e.g., neutral). If the sample is classified as all three expressions (e.g., neutral in classifier 1, smile in classifier 2, and anger in classifier 3) or as no expression (e.g., non-neutral in classifier 1, non-smile in classifier 2, and non-anger in classifier 3), it is classified as "uncertain" (this case does not happen in our application). Otherwise, the sample is further classified as neutral vs. smile (classifier 4), neutral vs. anger (classifier 5), or smile vs. anger (classifier 6), depending on the classification results from the previous classifiers.

Fig. 5. Classification procedures.
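The decision cascade described above can be sketched as follows. The entries of `clf` are hypothetical binary classifiers (callables returning True when the sample looks like the named expression); they stand in for the six nearest-neighbor classifiers trained in the text.

```python
def classify(sample, clf):
    votes = {
        "neutral": clf["neutral"](sample),   # classifier 1: neutral vs. non-neutral
        "smile": clf["smile"](sample),       # classifier 2: smile vs. non-smile
        "anger": clf["anger"](sample),       # classifier 3: anger vs. non-anger
    }
    positives = [e for e, v in votes.items() if v]
    if len(positives) == 1:
        return positives[0]                  # exactly one expression fired
    if len(positives) in (0, 3):
        return "uncertain"                   # none or all fired
    a, b = positives                         # exactly two fired:
    return a if clf[f"{a}_vs_{b}"](sample) else b  # classifier 4, 5, or 6
```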
For two-class problems, we can extract only one final feature from LDA. To determine the number of final CDA features, we compute the classification rate vs. the number of final CDA features on the training samples and retain the feature number corresponding to the largest classification rate. Table 1 lists the test results using LDA and CDA for classifiers 1, 2, and 3. As can be seen, CDA performs better and yields higher classification rates in all three cases (86.7%, 98.2%, and 89.1% for the classification of neutral vs. non-neutral, smile vs. non-smile, and anger vs. non-anger, respectively).
The final classification results are summarized in Tables 2 and 3, which show the confusion matrices for LDA and CDA, respectively. As can be seen, for LDA, most samples with the smile expression (113) are correctly classified; however, LDA has trouble distinguishing the anger expression from the neutral expression (62 images with neutral expressions are classified as anger and 43 images with anger expressions are classified as neutral). The proposed CDA outperforms LDA: 118 samples out of 119 with the smile expression, 451 samples with the neutral expression, and 96 samples with the anger expression are correctly classified.
4. Conclusions
In this paper, we describe a novel discriminant feature extraction method, clustering-based discriminant analysis (CDA), and show its feasibility for the facial expression recognition problem. CDA is modified from LDA. Unlike PCA, which is designed for representation, the CDA method extracts features with discriminant power; unlike LDA, which separates the overall class means, CDA seeks to separate the clusters of different classes. Experimental results show that CDA produces better results than LDA in terms of classification accuracy. The CDA method provides an efficient feature reduction and extraction scheme useful for facial expression recognition and can be applied to face recognition as well.
Table 1
Comparison of classification rates

Features   Neutral vs. non-neutral (%)   Smile vs. non-smile (%)   Anger vs. non-anger (%)
LDA        78.2                          97.5                      80.3
CDA        86.7                          98.2                      89.1

Table 2
Confusion matrix for LDA (rows: true expression; columns: classified as)

           Neutral   Smile   Anger
Neutral    409       5       62
Smile      5         113     1
Anger      43        2       74

Table 3
Confusion matrix for CDA (rows: true expression; columns: classified as)

           Neutral   Smile   Anger
Neutral    451       3       22
Smile      0         118     1
Anger      23        0       96

Acknowledgements

The authors would like to thank the reviewers for their valuable comments.

References
Barlett, M., Viola, P., Sejnowski, T., Larsen, L., Hager, J., Ekman, P., 1996. Classifying facial action. In: Touretzky, D., Mozer, M., Hasselmo, M. (Eds.), Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA.
Belhumeur, P., Hespanha, J., Kriegman, D., 1997. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 711–720.
Bezdek, J., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, New York.
Cottrell, G., Metcalfe, J., 1991. EMPATH: Face, gender, and emotion recognition using holons. In: Lippman, R., Moody, J., Touretzky, D. (Eds.), Advances in Neural Information Processing Systems, CA, vol. 3, pp. 564–571.
Duda, R., Hart, P., 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Essa, I., Pentland, A., 1997. Coding, analysis, interpretation, and recognition of facial expressions. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 757–763.
Lyons, M., Budynek, J., Akamatsu, S., 1999. Classifying images of facial expression using a Gabor wavelet representation. In: Proceedings, 2nd International Conference on Cognitive Science, Tokyo, Japan, pp. 113–118.
Martinez, A., Benavente, R., 1998. The AR Face Database. CVC Technical Report #24.
Mase, K., 1991. Recognition of facial expression from optical flow. IEICE Trans. E 74 (10), 3474–3483.
Otsuka, T., Ohya, J., 1997. Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequence. In: Proc. Int. Conf. on Image Processing, pp. 546–549.
Padgett, C., Cottrell, G., 1995. Identifying emotion in static images. In: Proceedings of the Second Joint Symposium on Neural Computation, vol. 5, pp. 91–101.
Swets, D., Weng, J., 1996. Using discriminant eigenfeatures for image retrieval. IEEE Trans. Pattern Anal. Machine Intell. 18 (8), 831–836.
Turk, M., Pentland, A., 1991. Eigenfaces for recognition. J. Cognitive Neurosci. 3 (1), 71–86.
Xie, X., Beni, G., 1991. A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Machine Intell. 13 (8), 841–847.
Yacoob, Y., Davis, L., 1994. Recognizing human facial expressions from long image sequences using optical flow. IEEE Trans. Pattern Anal. Machine Intell. 16 (6), 636–642.
Young, A., Ellis, H., 1989. Handbook of Research on Face Processing. Elsevier, New York.