A Neurocontrol for Automatic Reconstruction of Facial Displays
Max H. Garzon and Buvaneshwari Sivakumar, Member, IEEE

Abstract- Anthropomorphic representations of software agents (avatars) are being used as user interfaces in order to enhance communication bandwidth and interaction with human users. They have traditionally been programmed and controlled by ontologies designed according to intuitive and heuristic considerations. More recently, recurrent neural nets have been trained as neurocontrollers for emotional displays in avatars, on a continuous scale of negative, neutral, and positive feedback, that are meaningful to users in the context of tutoring sessions on a particular domain (computer literacy). We report on a new neurocontrol, developed as a recurrent network, that autonomously and dynamically generates and synchronizes the movements of facial features such as lips, eyes, eyebrows, and gaze in order to produce facial displays that convey high information content on nonverbal behavior to untrained human users. The neurocontrol is modular and can be easily integrated with semantic processing modules of larger agents that operate in real time, such as videoconference systems, tutoring systems, and, more generally, user interfaces coupled with affective computing modules for naturalistic communication. A novel technique, cascade inversion, provides an alternative to backpropagation through time where the latter may fail to learn recurrent neural nets from previously learned modules playing a role in the final solution.

I. INTRODUCTION

Avatars and talking heads are anthropomorphic representations of software agents used to facilitate and enhance interaction with humans. They are used, in particular, in applications where the bandwidth of the interaction between the agent and the user is very high, so that a text-only interface would either unduly tax the user or would simply be out of the question. They afford new interfaces that rely on nonverbal behavior to expand the bandwidth and speed of real-time communication between human and computer (Mehrabian, 2007; Garzon and Rajaya, 2003; Cassell et al., 2000; Lester and Stone, 1997; Massaro, 2000). Many such devices have been described in the recent literature and their advantages have been discussed in several places. Notable examples are AutoTutor (http://www.autotutor.org), a software agent capable of tutoring a human user in a restricted domain of expertise, such as computer literacy or physics, at the level of an untrained human tutor; Grace (http://www.cmu.edu/cmnews/020906/020906grace.html), a robot that registered and delivered a speech at the 2002 AAAI conference on Artificial Intelligence; and a videophone model for videoconferencing over low-bandwidth channels (Yan et al., 2006). AutoTutor is an embodied conversational agent consisting of a dialog module that handles the computational intelligence needed to carry on a conversation with the student, and an interface module, embodied by a talking head, to convey nonverbal feedback (Garzon et al., 2002; Garzon, 1999).

Despite their popularity and potential for applications, the design and implementation of talking heads has been primarily a job in heuristics. (A notable exception is Massaro's Baldi (Massaro, 2000), based on a neurofuzzy model, but the primary emphasis of that model is largely on mouth sequences with a degree of realism high enough to permit, for example, deaf people to learn to pronounce the English language adequately.)
The usual design method thus consists in designing an ontology of prototypical facial expressions or animations that presumably reflect the desired types of responses, together with some heuristic ontology to display them based on pre-programmed action sequences. While the result may suffice in many cases, it is clear that the resulting solutions suffer from a number of problems, such as unnaturalness of the expressions, robotic appearance, inappropriate responses, and, perhaps worst of all, a continuing programming effort to produce scripts that do not adapt to a rich variety of circumstances with enough richness to appear naturalistic when interacting with humans.

The videophone model introduced in (Yan et al., 2006) uses a technique that increases the efficiency of the communication at least 1000-fold by replacing video transmission (unrealistic over low-bandwidth channels) with a three-stage process: extraction and coding of facial features into text form, transmission over a low-bandwidth channel, and reconstruction of the visual expression in synch with voice at the receiving end. This development has many advantages (see (Yan et al., 2006) for more details), but it critically relies on a deep analysis of facial features to perform the first and third stages effectively.

Manuscript received February 8, 2010. This work has been partially supported by grants from the National Science Foundation, NSF/KDI-9720314 and NSF/ROLE-0106965. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF. M. H. Garzon is with the Computer Science Department, The University of Memphis, 209 Dunn Hall, Memphis, TN 38152-3240 USA (e-mail: mgarzon@memphis.edu). B. Sivakumar was a graduate student with the Electrical and Computer Engineering Department, The University of Memphis (e-mail: bsvkumar@memphis.edu).
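To make the three-stage process above concrete, the following minimal sketch (our own illustration under stated assumptions, not the system of Yan et al., 2006) encodes one key frame's ten feature values as a short text record, sends it over a simulated channel, and decodes it at the receiving end. The feature abbreviations follow Table 1 below; the comma-separated text encoding and the function names are assumptions.

```python
# Minimal sketch of the three-stage videophone idea (an illustration, not the
# system of Yan et al., 2006): encode a key frame's ten features as a short text
# record, send it over a low-bandwidth channel (simulated here by a list), and
# rebuild the feature vector on arrival.
from typing import Dict, List

# Feature abbreviations as in Table 1 of the paper.
FEATURES = ["LEB", "REB", "D", "LEL", "REL", "HED", "VED", "MH", "MW", "TD"]

def encode_frame(values: List[float]) -> str:
    """Stage 1: code one frame's ten normalized features into text form."""
    assert len(values) == len(FEATURES)
    return ",".join(f"{v:.3f}" for v in values)

def transmit(record: str, channel: List[bytes]) -> None:
    """Stage 2: send the text record over the (simulated) low-bandwidth channel."""
    channel.append(record.encode("ascii"))

def decode_frame(payload: bytes) -> Dict[str, float]:
    """Stage 3: reconstruct the feature vector driving the avatar at the receiver."""
    return dict(zip(FEATURES, (float(x) for x in payload.decode("ascii").split(","))))

channel: List[bytes] = []
transmit(encode_frame([0.4, 0.4, 0.5, 0.7, 0.7, 0.5, 0.5, 0.2, 0.6, 0.1]), channel)
print(decode_frame(channel.pop(0)))
```

Each key frame fits in a few dozen bytes of text, which is the kind of saving behind the reported 1000-fold efficiency gain over transmitting video frames.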
In previous work (Garzon and Rajaya, 2003 and 2002; Garzon et al., 2002b), we successfully trained feedforward and recurrent neural net modules to generate facial expressions that convey meaningful nonverbal emotional content to tutees, in the context of AutoTutor tutoring sessions on a particular domain (computer literacy), on a continuous scale of negative, neutral, and positive feedback. The visual feedback is given in the course of normal conversation to complement speech generated by a synthetic speech engine. The main results were that facial features could be classified as either derived (to a higher or lesser degree, most of the remaining features, such as brow heights, eye widths, etc., for which a neural net can be built that recovers the feature from the other features in a facial display, i.e., a sequence of facial expressions shown dynamically over a short time period, on the order of seconds) or primary (such as mouth positions and time durations, for which such a neural network is not possible). In particular, it was also established that feedforward neural networks cannot be trained to output the proper signals to control the primary features (mouth position and time duration) so as to complete the solution to the original problem.

The performance of this model, while acceptable, did not prove appropriate for a videophone, where the naturalness of the facial displays is far more critical for satisfactory evaluation. The reason was traced back to the quality of the training and testing data, which was more appropriate for systems with low transmission rates such as AutoTutor. The data set had been generated by feedback from human users of a tutoring system, who created on an avatar the facial expression they would have liked to receive in response to their answers in specific situations during tutoring sessions. These problems are exacerbated by the need for multiple concurrent sessions in the videophone application.

In this paper, we report results of much higher quality that are likely to be satisfactory for a videophone. It is desirable to use data directly obtained from human faces. Video clips from the well-known Cohn-Kanade database (Kanade et al., 2000), in which trained actors enacted facial expressions, have been used. Since the human face is extremely rich in emotional displays, it is labor intensive to use the exhaustive set of 44 features to train a neurocontrol. However, we show that a carefully selected set of ten (10) features captures enough information for the videophone. The feature vectors have been extracted by image processing (Gonzalez et al., 2008) from images sampled from the Cohn-Kanade database. The better data translates into much better performance of the corresponding neurocontrol, described in detail in the following sections. The network autonomously controls and synchronizes the movements of various facial features such as lips, eyes, and eyebrows in order to produce facial animations that are valid and meaningful to human users. The autonomous nature of the resulting network makes it possible to easily interface it with semantic processing modules of larger agents that cooperate concurrently in real time, such as a videophone or AutoTutor.

In the course of developing this application, the training procedure required a novel learning algorithm, cascade inversion, of interest in its own right for further research. Cascade inversion provides an alternative to backpropagation through time where the latter may fail to learn recurrent neural nets from previously learned modules playing a role in the final solution. Also, the results obtained confirm the dependencies among facial features suggested by the results in (Garzon and Rajaya, 2003; Garzon et al., 2002), which may be useful for making identification and recognition of people's faces more efficient, as discussed in the conclusions.
II. FACIAL DISPLAYS AND THE FACS SYSTEM

Nonverbal facial behavior may account for as much as 93% of human affective communication content (Mehrabian, 2007), compared to 7% from words. Blushing, eye rolling, eye winking, and tongue sticking are gestures we all use in daily life to communicate feelings, emotions, and attitudes quickly and far more effectively than words. Mehrabian's study (Mehrabian, 2007) further suggests that incongruence between words and facial displays is resolved mostly by nonverbal means (the study suggested a 7%/38%/55% attribution to verbal cues, vocal cues, and facial cues, respectively). Facial features are used by human speakers in a highly dynamic and well orchestrated process whose complexity and information bandwidth rival those already present in verbal discourse. For example, facial animations alone are an old and fertile art for communication, as evidenced by long traditions of popular cartoonists and animators. Here, we restrict our attention to a minimal subset of features that allows for a simplified but automatic generation of facial displays in a talking head for nonverbal communication from a computer to a human user, in a way that is naturalistic, ergonomic, and easily understood by virtually anyone without further training beyond normal human-human interaction. Darwin (Darwin, 1899) and Ekman (Ekman, 1992) were early pioneers in the study of nonverbal facial communication.

Feature vectors               FACS AUs
Mouth width, mouth height     AUs 10, 12, 15, 16, 17
Eyelid opening                AUs 5, 7
Eyebrow height                AUs 1, 4
Eye direction                 AUs 61, 62, 63, 64

Figure 1: Arto, an avatar developed for a tutoring system and videophone, and its action unit vectors.

A. Facial Features, Expressions and Displays

Early work by Ekman provides a full set of 64 facial features in his Facial Action Coding System (FACS) to describe static facial expressions of human faces (Ekman, 2003). This full set of 64 FACS units has been judiciously reduced to ten, namely two eyebrow heights, the distance between the eyebrows, two eyelid openings, two eye directions (horizontal and vertical), as well as mouth height, mouth width, and a frame duration. A facial display is defined as an animation consisting of a sequence of key static frames (ranging from 5 to 15) shown at an appropriate speed to be perceived as motion by the human eye. This reduced information still permits a continuous and smooth animation, as facial expressions can be interpolated to produce a continuous facial display (Garzon et al., 2002).
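To make the reduced representation concrete, the sketch below (our own illustration, not code from the paper) stores one key frame as the ten features just listed, grouped in comments by the FACS action units of Figure 1; the field names are assumptions chosen for readability.

```python
from dataclasses import dataclass, asdict

@dataclass
class KeyFrame:
    """One static key frame of a facial display: the ten reduced features."""
    left_brow: float       # eyebrow heights            (FACS AUs 1, 4)
    right_brow: float
    brow_distance: float   # distance between eyebrows
    left_lid: float        # eyelid openings            (FACS AUs 5, 7)
    right_lid: float
    eye_horizontal: float  # gaze direction             (FACS AUs 61-64)
    eye_vertical: float
    mouth_height: float    # mouth shape                (FACS AUs 10, 12, 15-17)
    mouth_width: float
    duration: float        # how long the frame is held before the next key frame

# A facial display is a short sequence (5 to 15) of such frames.
neutral = KeyFrame(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.2)
display = [neutral] * 5                      # a flat, neutral display
print(list(asdict(neutral).values()))        # the ten values of one frame
```

A facial display is then simply a short list of such frames, which the interpolation step just described turns into a smooth animation.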
The expressions of interest are those that convey positive, negative, and neutral emotions on a coarse scale. The features from these images were extracted manually and rated on a scale of 0 to 1 to indicate negative, neutral, or positive. A set of example frames extracted from several displays is shown in Fig. 2.

Figure 2: Ekman's FACS system describes an arbitrary facial expression (positive, neutral, or negative) as a 64-feature vector.

B. Data Collection

After computing the feature vectors from the sequence of faces describing an expression, the feature vectors so obtained were normalized. Ten feature values for each of four frames in a sequence constitute 40 values in a row, followed by the rating as the 41st value, in a training point. Image processing libraries were used in MATLAB (Gonzalez et al., 2008). Inputs for a neutral expression were obtained by repeating the first frame in all four frames. Other displays were sampled evenly.

Figure 3: Extraction of facial feature vectors from source images in the Cohn-Kanade database using image processing.

III. TRAINING A NEUROCONTROL FOR FACIAL DISPLAYS

The problem at hand is an inverse problem, akin to pole balancing or the inverted pendulum (Skapura, 1994). In the forward problem, a given configuration of facial features of the talking head determines an emotional expression that evaluates to a rating for pedagogical feedback by AutoTutor (e.g., confused, frustrated, negative neutral, neutral, positive neutral, and enthusiastic) when viewed by a student or by a videophone user. Our problem is precisely the inverse problem, in which, ideally, a rating value alone is given as input. The desired solution must translate that value into a set of facial feature values that will evaluate to the original rating when viewed by a human.

A. Primary and Derived Features

Modules that provide partial solutions to the inverse problem for facial gestures were obtained by backpropagation training (Haykin, 2009; Skapura, 1994) based on input features and the rating of the desired emotional attitude. Feedforward nets with two hidden layers [40-10-8-1] were required for learning to converge when training for one feature at a time, with the rating as an input instead. The performance of the trained networks can be seen in Table 1. The rating and all sequence features in the frames, except for the distance between the eyebrows, the mouth width, and the time duration, were shown to be derivable individually.

Feature                            Training   Testing
Rating*                            84.0%      89.4%
Left Eye Brow (LEB)                95.0%      69.0%
Right Eye Brow (REB)               88.0%      86.5%
Distance between eyebrows (D)*     98.7%      85.5%
Left Eye Lid (LEL)                 98.0%      91.0%
Right Eye Lid (REL)                97.5%      93.2%
Horizontal Eye Direction (HED)     98.1%      94.2%
Vertical Eye Direction (VED)       99.0%      97.0%
Mouth Height (MH)                  99.0%      98.0%
Mouth Width (MW)*                  99.0%      86.0%
Time Duration (TD)*                99.0%      80.0%

Table 1: Most selected facial features are derivable, except for the given (primary) ones, here marked *.

Figure 4: Typical performance of the MSE error in BP training; the goal was 0, and the MSE reached about 0.02 in 300 epochs.
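The following sketch mirrors the setup of Sections II.B and III.A in outline only: it assembles 41-value training points (four frames of ten features plus the rating) and trains a [40-10-8-1] feedforward net by plain backpropagation to derive one feature from the rating and the remaining 39 values. The data here is random, the choice of target column is hypothetical, and the learning rate and initialization are our own assumptions, so it illustrates the shapes and training regime rather than the authors' actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_point(frames, rating):
    """Four key frames of ten features -> 40 values, plus the rating as the 41st."""
    x = np.concatenate([np.asarray(f, dtype=float) for f in frames])   # shape (40,)
    return np.append(x, rating)                                        # shape (41,)

# Random stand-in for the normalized Cohn-Kanade-derived data set.
n_points = 200
frames = rng.uniform(0, 1, size=(n_points, 4, 10))
ratings = rng.uniform(0, 1, size=n_points)
data = np.array([make_training_point(f, r) for f, r in zip(frames, ratings)])

# To derive one feature, that column becomes the target and the rating plus the
# remaining 39 values become the 40 inputs (target_col is hypothetical).
target_col = 30
y = data[:, target_col:target_col + 1]
X = np.delete(data, target_col, axis=1)            # shape (n_points, 40)

# A [40-10-8-1] feedforward net trained by plain backpropagation on MSE.
sizes = [40, 10, 8, 1]
W = [rng.normal(0, 0.5, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((1, s)) for s in sizes[1:]]

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(300):                           # about 300 epochs, as in the paper
    acts = [X]
    for Wi, bi in zip(W, b):                       # forward pass through all layers
        acts.append(sig(acts[-1] @ Wi + bi))
    delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])        # output-layer error
    for i in reversed(range(len(W))):              # backward pass and weight update
        gW, gb = acts[i].T @ delta, delta.sum(axis=0, keepdims=True)
        if i > 0:                                  # propagate error before updating W[i]
            delta = (delta @ W[i].T) * acts[i] * (1 - acts[i])
        W[i] -= 0.05 * gW / len(X)
        b[i] -= 0.05 * gb / len(X)

print("final training MSE:", float(np.mean((acts[-1] - y) ** 2)))
```

With real, normalized Cohn-Kanade-derived data in place of the random arrays, this is the kind of single-feature module whose training and testing accuracies are reported in Table 1.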
Backpropagation was still able to produce networks (generally, in 300 epochs) that jointly derive several features at a time, within acceptable error rates (MSE under 0.05), as seen in Tables 2 and 3. The best module was able to jointly generate four features out of a set of input features consisting of the rating and the remaining facial features, as seen in Tables 4-6. The typical MSE behavior during training is shown in Figure 4 above.

Additional feature                 Training   Testing
Right Eye Lid (REL)                98.7%      93.2%
Vertical Eye Direction (VED)       99.6%      92.7%
Horizontal Eye Direction (HED)     98.7%      94.2%
Left Eye Brow (LEB)                92.2%      94.0%
Right Eye Brow (REB)               99.6%      90.4%
Mouth Height (MH)                  89.0%      93.2%

Table 2: Joint derivation of two (2) features at a time, including Left Eye Lid.

Additional feature                 Training   Testing
Horizontal Eye Direction (HED)     98.5%      91.3%
Vertical Eye Direction (VED)       99.1%      91.6%
Left Eye Brow (LEB)                89.4%      93.9%
Right Eye Brow (REB)               90.7%      95.2%
Mouth Height (MH)                  98.9%      92.6%

Table 3: Joint derivation of three (3) features at a time, including Left and Right Eye Lid.

Additional feature                 Training   Testing
Horizontal Eye Direction (HED)     99.1%      92.7%
Left Eye Brow (LEB)                95.0%      96.4%
Right Eye Brow (REB)               99.1%      93.5%
Mouth Height (MH)                  91.1%      93.3%

Table 4: Joint derivation of four (4) features at a time, including Left and Right Eye Lid, and Vertical Eye Direction.

Additional feature                 Training   Testing
Left Eye Brow (LEB)                96.2%      97.0%
Right Eye Brow (REB)               99.4%      93.6%
Mouth Height (MH)                  93.6%      95.2%

Table 5: Joint derivation of five (5) features at a time, including Left and Right Eye Lid, Horizontal and Vertical Eye Direction.

Additional feature                 Training   Testing
Left Eye Brow (LEB)                96.2%      97.0%
Right Eye Brow (REB)               99.4%      93.6%
Mouth Height (MH)                  93.6%      95.2%

Table 6: Joint derivation of seven (7) features at a time, including Left and Right Eye Lid, Horizontal and Vertical Eye Direction, Left and Right Eye Brows.

These results show much better performance than the previous studies (Garzon and Rajaya, 2003; Garzon et al., 2002), demonstrating the much better quality of the training and testing data. Moreover, whereas backpropagation had failed to converge with any network topology or combination to jointly derive more features in the previous studies, it was now able to produce better results en route towards a full solution of the inverse problem.

B. Cascade Inversion Learning

However, a full solution to the inverse problem still requires a recurrent network. This time, although backpropagation through time was able to derive some of the remaining facial features from the input set S of (rating, mouth width, distance between the brows, and frame duration), it was still not able to successfully derive the full set of facial features. Cascade inversion, the novel method developed in (Garzon et al., 2002), was applied again for this work, given the promising results obtained in (Garzon and Rajaya, 2003). Results similar to the previous work were again obtained, with excellent performance, as shown in Tables 7 and 8 below.
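Before turning to those tables, here is a hedged sketch of the cascade-inversion idea as we read it from Tables 7 and 8: train small modules for subsets of the derived features first, then feed their outputs, together with the primary inputs S, into the next module, cascading until all derived features are covered. scikit-learn's MLPRegressor is used only as a generic stand-in trainer, the data is random, and the subset choices mirror the tables rather than the authors' code; this is an illustration of the structure, not the published algorithm itself.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n = 500
S = rng.uniform(0, 1, (n, 4))            # rating, mouth width, brow distance, duration
targets = {                               # synthetic stand-ins for the derived features
    "X": rng.uniform(0, 1, (n, 2)),       # left and right eyelid
    "Y": rng.uniform(0, 1, (n, 2)),       # horizontal and vertical eye direction
    "Z": rng.uniform(0, 1, (n, 3)),       # brows and mouth height (remaining features)
}

def train_module(inputs, outputs):
    """Train one feedforward module of the cascade."""
    net = MLPRegressor(hidden_layer_sizes=(10, 8), max_iter=2000, random_state=0)
    return net.fit(inputs, outputs)

# Stage 1: modules for X and Y are trained directly from the primary inputs S.
net_X = train_module(S, targets["X"])
net_Y = train_module(S, targets["Y"])

# Stage 2: the next module sees S plus the *outputs* of the earlier modules
# (2 + 2 = 4 derived features as extra inputs, echoing Table 7).
S_aug = np.hstack([S, net_X.predict(S), net_Y.predict(S)])
net_Z = train_module(S_aug, targets["Z"])

# Run time: chain the modules to map S to the full set of derived features.
def reconstruct(s_row):
    s = s_row.reshape(1, -1)
    x, y = net_X.predict(s), net_Y.predict(s)
    z = net_Z.predict(np.hstack([s, x, y]))
    return np.hstack([x, y, z]).ravel()    # seven derived feature values

print(reconstruct(S[0]))
```

Training each module on a small, already-solved piece of the problem, and reusing its outputs as inputs downstream, is what the paper credits for succeeding where end-to-end backpropagation through time fails to converge.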
Cascade Inversion (given 2+2=4 derived features as inputs)     Training   Testing
X = Left Eye Lid & Right Eye Lid                               99%        93%
Y = Horizontal & Vertical Eye Direction                        99%        93%
Network N_XY1: Left Eye Brow (LEB)                             99%        99%

Table 7: Neural net learning with given input S = {Rating, Mouth Position, Duration} and giving as output all other features.

Cascade Inversion (given 3+3=6 derived features as inputs)     Training   Testing
X = Left Eye Lid + Right Eye Lid + Right Eye Brow (N_X1)       94%        93%
Y = Horizontal + Vertical Eye Direction                        95%        91%
Network N_XY4 = Inverse Problem Solution                       99%        99%

Table 8: Neural net learning with input features S = {Rating, Mouth Position, Duration} and giving all other features as output.
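At display time, the feature values produced by the trained cascade for each key frame still have to be turned into a smooth animation by interpolation between key frames, as described in Section II. The sketch below is our own minimal illustration of that step (linear interpolation, an assumed 30 frames-per-second display rate, and nine geometric features with the duration driving the timing), not the renderer actually used for Arto.

```python
import numpy as np

FPS = 30  # assumed display rate

def interpolate_display(key_frames, durations):
    """key_frames: (k, 9) feature values; durations: (k-1,) seconds between frames."""
    key_frames = np.asarray(key_frames, dtype=float)
    frames = []
    for a, b, d in zip(key_frames[:-1], key_frames[1:], durations):
        steps = max(1, int(round(d * FPS)))
        for t in np.linspace(0.0, 1.0, steps, endpoint=False):
            frames.append((1 - t) * a + t * b)     # linear blend of two key frames
    frames.append(key_frames[-1])
    return np.array(frames)                         # one row per rendered frame

# Example: a 3-key-frame display, half a second between key frames.
keys = np.array([[0.5] * 9,
                 [0.7, 0.7, 0.5, 0.9, 0.9, 0.5, 0.5, 0.3, 0.8],
                 [0.5] * 9])
clip = interpolate_display(keys, durations=[0.5, 0.5])
print(clip.shape)   # (31, 9): about one second of animation at 30 frames per second
```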
Similar to the conclusion in (Garzon et al., 2002), the mouth width was again found to be a primary feature that proved capable of deriving the remaining features together with the frame duration and emotional attitude. In particular, the height of the mouth was now derivable from these primary features. On the other hand, the distance between the eyebrows also turned out to be a primary feature, for which the network could not be trained and with which other features could be derived.

IV. DISCUSSION AND CONCLUSIONS

Previously, recurrent neural nets had been trained, on synthetically generated data for a few features, to control the generation of facial displays in avatars that are meaningful to humans in ordinary nonverbal or non-keyboard-mediated computer-to-human communication. Although good, the results appeared to leave some room for improvement. The reason was traced to the fact that the original training set was generated by feedback from human users of a tutoring system, who created on an avatar the facial expression they would have liked to receive in response to their answers in specific situations during tutoring sessions (Garzon et al., 2002 and 2002b; Garzon, 1999). It is well known that emotions are, despite their deep roots in human cognition (Mehrabian, 2007; Carmel and Bentin, 2002), the hardest to define and identify, even more so when self-reported (Ekman, 2003).

The results raise at least two questions. First, how valid are they in the context of real-life emotional displays of human subjects? Second, are they optimal in terms of an artificial neurocontroller? Finally, there is the issue of how well the new controllers will perform on an avatar in actual interaction with humans in nonverbal affective communication. Here we have addressed the first two questions. We have selected a training set based on recordings of human facial displays produced by real-life professional actors and captured in a well-known labeled database, the Cohn-Kanade database (Kanade et al., 2000). A neurocontrol for facial expressions was successfully trained with excellent, in fact nearly optimal, results in many cases for a neurocontroller. Since the methodology was kept essentially intact in order to ascertain the effect of the quality of the data, it is clear that the improvement is due to the more naturalistic data in the new training set.

A second contribution of this paper is further validation of a new learning algorithm, so-called cascade inversion, that addresses the problem of catastrophic forgetting in neural nets; the technique is useful where even backpropagation through time may fail to converge. Feedforward backpropagation and backpropagation through time, with only primary features (rating, mouth positions, and frame durations) as input, were unable to learn to produce an appropriate set of remaining facial features that performed well on the Cohn-Kanade database. Cascade inversion, a novel training algorithm to solve inverse fusion problems introduced in (Garzon et al., 2002), proved again to be an efficient method to achieve our desired goal of solving the inverse problem of a neurocontrol for facial expressions, with the selected set of features and a much richer set of data.

We have conducted a preliminary informal evaluation of the quality of the facial expressions autonomously generated by the recurrent neurocontrol produced by this method.
The avatar appears naturalistic enough to be useful, at least for a first pass at conveying emotional attitude in a videophone application (Yan et al., 2006). A more formal, comprehensive evaluation of the overall quality of the facial expressions is required to establish the communication quality of the emotional displays produced by this recurrent network in the context of tutoring sessions and, more generally, of computer-to-human communication.

The results in this paper may have other applications, particularly in related work on facial recognition. The dependencies among facial features place a heavy weight on the information content of the primary facial features for facial displays, namely the dynamics of the mouth (here represented by the mouth positions in the successive frames and the duration of each animation) and the emotional state of the individual (here represented by the rating, or emotional attitude). It was surprising that the mouth height was now a derivable feature, although the mouth width was confirmed to be a primary feature; this is likely due to the generally high correlation between the two features. Further, the results may also have implications for the current debate on the nature of the biological networks that explain the amazing ability of humans to perform nearly instant recognition of emotional states and identity from snapshots of human faces from very early stages of development, perhaps even prenatally (Carmel and Bentin, 2002).

ACKNOWLEDGMENT

We are grateful to the Cohn-Kanade Lab for making available their facial expression database.

REFERENCES

[1] A. Mehrabian, Nonverbal Communication, 2007.
[2] M. H. Garzon, K. Rajaya, "Neural Net Generation of Facial Displays in Talking Heads," Int. Workshop on Artificial Neural Nets IWANN-2003, Springer-Verlag Lecture Notes in Computer Science (Jose Mira, Jose R. Alvarez, eds.), 152-166, 2003.
[3] J. Cassell, J. Sullivan, S. Prevost, E. Churchill (eds.), Embodied Conversational Agents. The MIT Press, 2000.
[4] J. Lester and B. Stone, "Increasing believability in animated pedagogical agents," in Proc. of the First International Conference on Autonomous Agents (W. Lewis Johnson, ed.), ACM Press, February 1997.
[5] D. W. Massaro, M. M. Cohen, J. Beskow, R. A. Cole, "Developing and Evaluating Conversational Agents," in (Cassell et al., 2000), 287-318, 2000.
[6] X. Yan, M. H. Garzon, M. Nolen, "Using Talking Heads for Real-time Virtual Videophone in Wireless Networks," IEEE J. Multimedia, 78-84, 2006.
[7] M. H. Garzon, E. Drumwright, K. Rajaya, "Training a Neurocontrol for Talking Heads," Proc. of the IEEE International Joint Conference on Neural Networks (IJCNN-02), Hawaii, World Congress in Computational Intelligence WCCI (2002), Computer Society Press, 2449-2453, 2002.
[8] M. H. Garzon and The Tutoring Research Group, "On Interactive Computation: Intelligent Tutoring Systems," Proc. Theory and Practice of Informatics SOFSEM-1999 (G. Pavelka, G. Tel, M. Bartosek, eds.), Lecture Notes in Computer Science 1725, 261-264, 1999.
[9] M. H. Garzon, P. Ankaraju, E. Drumwright, R. Kozma, "Neurofuzzy Recognition and Generation of Facial Features in Talking Heads," Proc. of NEUROFUZZY-2002, Hawaii, World Congress in Computational Intelligence WCCI-2002, 926-93, 2002b.
[10] R. C. Gonzalez, R. E. Woods, S. L. Eddins, Digital Image Processing Using MATLAB, 2nd ed., Prentice Hall, 2008.
[11] C. Darwin, The Expression of the Emotions in Man and Animals, Appleton & Co., 1899.
[12] P. Ekman, "An Argument for Basic Emotions," in N. L. Stein and K. Oatley (eds.), Basic Emotions, 169-200, 1992.
[13] D. M. Skapura, Building Neural Networks. Addison-Wesley, 1994.
[14] S. Haykin, Neural Networks and Learning Machines, 3rd ed. New Jersey: Prentice Hall, 2009.
[15] D. Carmel, S. Bentin, "Domain specificity versus expertise: factors influencing distinct processing of faces," Cognition 83, 1-29, 2002.
[16] P. Ekman, Emotions Revealed, Henry Holt, New York, 2003.
[17] T. Kanade, J. F. Cohn, Y. Tian, "Comprehensive Database for Facial Expression Analysis," The 4th IEEE Conference on Automatic Face and Gesture Recognition, 46-53, 2000. See also http://vasc.ri.cmu.edu/idb/html/face/facial_expression/.
