Automatic 3D facial expression recognition



Report for the CG2 course
LCG / COPPE / UFRJ - 10/09/2013



Rafael Monteiro
September 16, 2013
Date Performed: September 10, 2013
Instructors: Claudio Esperança, Ricardo Marroquim

1 Introduction

Facial expressions are an important aspect of human emotional communication. They indicate the emotional state of a subject, his personality, among other features. According to Bettadapura [1], their study began with clinical and psychological purposes, but with recent advances in computer vision, computer science researchers began to show interest in developing systems to automatically detect those expressions.

Automatic facial expression recognition has several applications, as in HCI (Human-Computer Interaction), where interfaces could be developed to respond to certain user expressions, as in games, communication tools, etc. Although humans can easily recognize a specific facial expression, its identification by computer systems is not that easy. There are several challenges involved, such as illumination changes, occlusion, beards, glasses, etc. [2].

In the 1970s, one of the first problems faced by researchers was: how to accurately describe an expression? Paul Ekman, in his research, defined six basic expressions, which he considered universal because they can be identified in any culture: joy, sadness, fear, surprise, disgust and anger [3]. Examples are shown in Figure 1. In 1971, Ekman published a study claiming that facial expressions are universal across different cultures [5]. Later, in 2001, Parrott identified 136 emotional states and categorized them on three levels: primary, secondary and tertiary emotions [4]. Primary emotions are Ekman's six basic emotions, and the other two levels form a hierarchy beneath them.

Figure 1: Universal expressions: joy, sadness, fear, surprise, disgust and anger
In 1977, Ekman and Friesen developed a methodology to measure expressions more precisely by creating FACS (Facial Action Coding System) [6]. FACS defines basic expression components, called Action Units (AUs). They describe small facial movements, such as raising the inner brows (AU1) or wrinkling the nose (AU9), and so on. These action units can be combined to form facial expressions.

A discussion about the universality of human expressions arose in 1994, when Russell questioned Ekman's position and raised several points indicating that human expressions are not universal across different cultures [7]. In the same year, Ekman wrote a paper refuting Russell's arguments one by one [8]. Since then, Ekman's position has been widely accepted, and the claim that human expressions are universal across cultures has been sustained.

Facial expression recognition research has many fields of study. One of them is 3D facial expression recognition. These systems are based on facial surface information obtained by creating a 3D model of the subject's face, and they try to identify the expression in this model. This report discusses some approaches used in this field. There is a major division between static and dynamic studies. Static studies are performed on a single picture of a subject, where the expression is identified, while dynamic studies consider the temporal behavior of expressions (see Figure 2). A good example of dynamic studies is micro-expression analysis. A micro-expression is an expression that happens in a very short instant of time, generally between 1/25th and 1/15th of a second. Micro-expressions generally occur when a subject is trying to conceal an expression but fails, and it appears for a brief moment on the face.

Figure 2: Example of a dynamic facial expression system

One major problem in facial expression studies is capturing spontaneous expressions.
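As an aside to the FACS discussion above, the idea of combining AUs into expressions can be sketched as a simple lookup. The AU sets below are illustrative prototypes of the kind discussed in the FACS literature; the exact combinations vary by source, so treat them as examples rather than the definitive coding.

```python
# Illustrative sketch of how FACS Action Units combine into expressions.
# The AU sets below are example prototypes, not an authoritative coding.
PROTOTYPES = {
    frozenset({6, 12}): "joy",             # cheek raiser + lip corner puller
    frozenset({1, 4, 15}): "sadness",      # inner brow raiser + brow lowerer + lip corner depressor
    frozenset({1, 2, 5, 26}): "surprise",  # brow raisers + upper lid raiser + jaw drop
    frozenset({4, 5, 7, 23}): "anger",     # brow lowerer + lid tighteners + lip tightener
}

def classify_aus(active_aus):
    """Return the expression whose AU prototype matches the active AUs, if any."""
    return PROTOTYPES.get(frozenset(active_aus), "unknown")
```

For instance, `classify_aus({6, 12})` returns `"joy"`, while an AU set with no matching prototype falls back to `"unknown"`.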
Most facial expression databases are composed of simulated expressions, such as the ones displayed in Figure 3. It is easier to ask subjects to display these expressions than to capture expressions generated spontaneously from emotional reactions to real-world stimuli. An interesting development occurred when Sebe et al. addressed this problem using a kiosk with a camera [9]. People would stop by and watch videos while displaying genuine emotions, and their faces were captured by the camera. At the end of the study, subjects were asked whether they would allow their images to be used for academic purposes.

Figure 3: Examples of clearly non-spontaneous facial expressions

2 Facial expression systems

There are many approaches used by facial expression systems. In a recent survey, Sandbach et al. reviewed the state of the art and observed that most systems are organized in three steps: face acquisition, face tracking and alignment, and expression recognition [10].

2.1 Face acquisition

Face acquisition is the step that generates a 3D model of the subject's face. There are several approaches, such as single image reconstruction, structured light, stereo photometry and multi-view stereo acquisition.

Single image reconstruction methods are an emerging research topic because of their simplicity: only a single image is required, taken with an ordinary camera in an unrestricted environment. Blanz and Vetter developed a method called 3D Morphable Models (3DMM), which statistically builds a model combining information from 3D shape and 2D texture [11]. The method can generate linear combinations of different expressions and use them to synthesize expressions and detect them on facial models. Its main disadvantages are that some initialization is required and that it is not robust to partial occlusions.

Structured light techniques are based on projecting a light pattern onto the subject's face, analyzing the pattern deformations and recovering 3D shape information. Figure 4 shows an example of such a system. Hall-Holt and Rusinkiewicz developed a system using multiple patterns, which are alternately displayed on the face [12].
An image without the pattern can also be captured in order to incorporate 2D texture information into the 3D model.
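The stripe boundary codes of Hall-Holt and Rusinkiewicz are more elaborate than this, but the core idea behind temporally coded structured light (each projector column is identified by the bit sequence it displays over successive patterns) can be sketched with plain Gray codes. Everything below is a synthetic illustration, not their actual coding scheme.

```python
import numpy as np

def gray_code(c):
    # binary-reflected Gray code of an integer
    return c ^ (c >> 1)

def make_patterns(width, n_bits):
    # pattern k assigns to projector column c the k-th bit of its Gray code
    cols = np.arange(width)
    return [(gray_code(cols) >> k) & 1 for k in range(n_bits)]

def decode(bit_images, width):
    # reassemble the Gray code observed at each pixel, then invert it via a lookup table
    g = np.zeros_like(bit_images[0])
    for k, bits in enumerate(bit_images):
        g |= bits << k
    lut = {int(gray_code(c)): c for c in range(width)}
    return np.array([lut[int(v)] for v in g])

# synthetic check: with an ideal camera observing the patterns directly,
# decoding recovers every projector column index
width, n_bits = 8, 3
patterns = make_patterns(width, n_bits)
recovered = decode(patterns, width)
```

In a real system the observed bits come from thresholded camera images, and the decoded column index per pixel yields a projector-camera correspondence from which depth is triangulated.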
Figure 4: Illustration of a structured light system

Stereo photometry is a variation of structured light techniques which uses more than one light, each of which can emit a different color, as shown in Figure 5. Such systems can retrieve surface normals, which can be integrated in order to recover 3D shape information. Jones et al. developed a system which uses three lights switching on and off in a cycle around the camera [13]. The system performs well using either visible or infrared light.

Figure 5: Illustration of a stereo photometry system

Multi-view stereo acquisition systems use more than one camera to simultaneously capture images from different angles and combine them to reconstruct the scene. Beeler et al. developed a system which uses high-end cameras and standard illumination, achieving excellent results with sub-millimeter accuracy [14].
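The normal-recovery step mentioned above can be made concrete with textbook Lambertian photometric stereo (not necessarily the exact pipeline of Jones et al. [13]): with three known light directions, each pixel's three intensities give a linear system whose solution is the albedo-scaled normal. The light directions and pixel values below are synthetic.

```python
import numpy as np

# Light directions, one per row (unit vectors); in practice these are calibrated.
L = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
L = L / np.linalg.norm(L, axis=1, keepdims=True)

def recover_normal(intensities):
    """Solve i = L @ (albedo * n) for one pixel under the Lambertian model."""
    g = np.linalg.solve(L, intensities)  # g = albedo * n
    albedo = np.linalg.norm(g)
    return g / albedo, albedo

# synthetic check: render a pixel with a known normal, then recover it
n_true = np.array([0.0, 0.6, 0.8])
albedo_true = 0.5
i = L @ (albedo_true * n_true)
n, albedo = recover_normal(i)
```

Doing this at every pixel yields a normal map, which is then integrated to obtain the 3D surface, as described in the text.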
2.2 Face tracking and alignment

The second step performed by most facial expression systems is face tracking and alignment. Given two meshes, the problem is to align them in 3D space so that they can be tracked over time. There are two kinds of alignment: rigid approaches, which assume similar meshes without large transformations, and non-rigid approaches, which deal with large transformations. Most rigid approaches rely on the traditional ICP (Iterative Closest Point) algorithm [15]. As for non-rigid approaches, there are several different ways to perform the alignment. Amberg et al. created a variant of ICP which adds a stiffness variable to control the rigidity of the transformation at each iteration [16]. The stiffness starts at a high value and is reduced at each iteration, so that the matching gradually allows a non-rigid transformation to be performed. Rueckert et al. used an FFD (Free-Form Deformation) model which performs deformations using control points [17]. By reducing the number of control points, computing time can be reduced as well. See Figure 6 for an example of an FFD model.

Figure 6: Free-Form Deformation model

Wang et al. used harmonic maps to perform the alignment [18]. The face is mapped from 3D space to 2D space by projecting the mesh onto a disc, as shown in Figure 7, thus reducing one dimension. Different discs can then be compared in order to perform the alignment. Sun et al. used a similar technique called conformal mapping, which maps the mesh into 2D space while preserving the angles between edges [19]. Tsalakanidou and Malassiotis modified ASMs (Active Shape Models) [20] to work in 3D, using a face model with the most prominent features, such as the eyes, nose, etc. [21]. Figure 8 shows examples of ASMs plotted on faces.

2.3 Expression recognition

The third and last step of a facial expression system is to recognize the expression.
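The rigid case discussed above can be sketched in a few lines: classic ICP [15] alternates nearest-neighbor matching with a least-squares rigid fit (the Kabsch/Procrustes solution via SVD). This is a minimal brute-force sketch on synthetic point sets, not a production mesh-alignment routine.

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares rotation R and translation t mapping points P onto Q (Kabsch)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # guard against reflections
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp

def icp(src, dst, n_iter=20):
    """Brute-force rigid ICP: match each source point to its nearest target point."""
    P = src.copy()
    for _ in range(n_iter):
        d = np.linalg.norm(P[:, None, :] - dst[None, :, :], axis=2)
        matched = dst[d.argmin(axis=1)]
        R, t = best_rigid_transform(P, matched)
        P = P @ R.T + t
    return P

# synthetic check: a slightly rotated and translated copy of a point set
# should align back onto the original
rng = np.random.default_rng(0)
dst = rng.normal(size=(30, 3))
a = 0.05  # small rotation about z, so nearest neighbors start mostly correct
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0, 0.0, 1.0]])
src = dst @ Rz.T + np.array([0.02, -0.01, 0.03])
aligned = icp(src, dst)
```

Note that plain ICP only converges to a good alignment when the initial poses are close, which is why the non-rigid variants above anneal a stiffness parameter rather than jumping straight to a free deformation.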
In this step, descriptors are extracted, selected and classified using artificial intelligence techniques. Features can be static or dynamic. Static features are mostly used on a single image, whereas dynamic features are stable across time and can be tracked through successive frames in a video analysis. Temporal modeling can be applied to analyze the dynamics of an expression over time; most systems use HMMs (Hidden Markov Models) [22] for this task. Common static features are distance-based features, patch-based features, morphable models and 2D representations.

Distance-based features rely on distances between facial attributes, such as the distance between the corners of the mouth, or between the mouth and an eye, and so on. Soyel and Demirel used 3D distances to recognize expressions [23]. Maalej et al. used patch-based features, where patches are small regions of the mesh represented as surface curves [24], as shown in Figure 9. Patches are compared against templates by computing the geodesic distance between them. Ramanathan et al. used a MEM (Morphable Expression Model), where base expressions are defined and any expression can be modeled as a linear combination of these base expressions through morphing parameters [25]. These parameters define a parameter space in which similar expressions form clusters. A new expression is identified by finding the parameters which generate the closest expression and passing these parameters to a classifier. Berretti et al. used 2D representations, where a depth map of the face is computed, generating a 2D image [26]. Classification is done using SIFT (Scale Invariant Feature Transform) descriptors [27] and SVMs (Support Vector Machines) [28].

Figure 9: Patch-based descriptors

As for dynamic features, there are a few approaches. Le et al. used facial level curves, whose variation through time can be tracked and measured using Chamfer distances [29]. Figure 10 shows an example of such curves. Sandbach et al. used FFDs to model lattice deformation over time and HMMs to perform the temporal analysis [30].

Figure 10: Facial level curves

Feature classification is generally performed using well-known classifiers, such as AdaBoost and its variations [31], k-NN (k-Nearest Neighbors) [32], neural networks [33], SVMs [28], etc.
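The distance-feature-plus-classifier pattern above can be sketched end to end: pairwise distances between 3D landmarks form the feature vector, and a k-NN vote classifies it. The five-landmark layout and "mouth width" cue below are made-up stand-ins, not the actual landmark set or data of Soyel and Demirel [23].

```python
import numpy as np

def distance_features(landmarks):
    """Flatten all pairwise Euclidean distances between 3D landmarks into a vector."""
    n = len(landmarks)
    i, j = np.triu_indices(n, k=1)
    return np.linalg.norm(landmarks[i] - landmarks[j], axis=1)

def knn_predict(train_feats, train_labels, query, k=3):
    """Classify a query vector by majority vote among its k nearest training vectors."""
    d = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.asarray(train_labels)[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[counts.argmax()]

# synthetic check: two "expressions" distinguished only by mouth-corner distance
rng = np.random.default_rng(1)
def fake_face(mouth_width):
    base = np.array([[0, 0, 0], [-1, 1, 0], [1, 1, 0],                      # nose, eyes
                     [-mouth_width / 2, -1, 0], [mouth_width / 2, -1, 0]],  # mouth corners
                    float)
    return base + rng.normal(scale=0.02, size=base.shape)

faces = [fake_face(1.0) for _ in range(5)] + [fake_face(2.0) for _ in range(5)]
labels = ["neutral"] * 5 + ["joy"] * 5
feats = np.stack([distance_features(f) for f in faces])
pred = knn_predict(feats, labels, distance_features(fake_face(1.95)))
```

Using inter-landmark distances rather than raw coordinates makes the features invariant to translation and rotation of the head, which is part of why this family of features is popular.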
3 Future challenges

Research on 3D facial expression recognition is evolving, but there are challenges to consider. One is the construction of more spontaneous expression databases, since most existing databases were built from posed expressions. Furthermore, the development of systems capable of distinguishing a spontaneous expression from a posed one is also desirable. Recognition of expressions beyond Ekman's six universal expressions is important, since most systems focus only on those six. Temporal analysis is still in its infancy; more work is needed in this area, especially on the analysis of micro-expressions, which are very hard to detect. Improving algorithm performance is also crucial: ideally, all systems should work in real time.

References

[1] V. Bettadapura. Face expression recognition and analysis: The state of the art. CoRR, abs/1203.6722, 2012.

[2] M. Pantic and L. J. M. Rothkrantz. Automatic analysis of facial expressions: The state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:1424–1445, 2000.

[3] P. Ekman. Universals and Cultural Differences in Facial Expressions of Emotion. University of Nebraska Press, 1971.

[4] W. G. Parrott. Emotions in Social Psychology: Essential Readings. Key Readings in Social Psychology. Psychology Press, 2001.

[5] P. Ekman and W. V. Friesen. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2):124–129, 1971.

[6] P. Ekman and W. V. Friesen. Manual for the Facial Action Coding System. Consulting Psychologists Press, 1977.

[7] J. A. Russell. Is there universal recognition of emotion from facial expressions? A review of the cross-cultural studies. Psychological Bulletin, 115(1):102–141, 1994.

[8] P. Ekman. Strong evidence for universals in facial expressions: a reply to Russell's mistaken critique. Psychological Bulletin, 115(2):268–287, 1994.

[9] N. Sebe, M. S. Lew, I. Cohen, Y. Sun, T. Gevers, and T. S. Huang. Authentic facial expression analysis. In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on, pages 517–522, 2004.

[10] G. Sandbach, S. Zafeiriou, M. Pantic, and L. Yin. Static and dynamic 3D facial expression recognition: A comprehensive survey. Image and Vision Computing, 30(10):683–697, 2012. 3D Facial Behaviour Analysis and Understanding.
[11] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '99, pages 187–194, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co.

[12] O. Hall-Holt and S. Rusinkiewicz. Stripe boundary codes for real-time structured-light range scanning of moving objects. In Eighth IEEE International Conference on Computer Vision, pages 359–366, 2001.

[13] A. Jones, G. Fyffe, X. Yu, W.-C. Ma, J. Busch, R. Ichikari, M. Bolas, and P. Debevec. Head-mounted photometric stereo for performance capture. In Visual Media Production (CVMP), 2011 Conference for, pages 158–164, 2011.

[14] T. Beeler, B. Bickel, P. Beardsley, B. Sumner, and M. Gross. High-quality single-shot capture of facial geometry. In ACM SIGGRAPH 2010 Papers, SIGGRAPH '10, pages 40:1–40:9, New York, NY, USA, 2010. ACM.

[15] P. J. Besl and N. D. McKay. A method for registration of 3-D shapes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 14(2):239–256, 1992.

[16] B. Amberg, S. Romdhani, and T. Vetter. Optimal step nonrigid ICP algorithms for surface registration. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pages 1–8, 2007.

[17] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. G. Hill, M. O. Leach, and D. J. Hawkes. Nonrigid registration using free-form deformations: Application to breast MR images. IEEE Transactions on Medical Imaging, 18:712–721, 1999.

[18] Y. Wang, M. Gupta, S. Zhang, S. Wang, X. Gu, D. Samaras, and P. Huang. High resolution tracking of non-rigid motion of densely sampled 3D data using harmonic maps. Int. J. Comput. Vision, 76(3):283–300, March 2008.

[19] Y. Sun, X. Chen, M. Rosato, and L. Yin. Tracking vertex flow and model adaptation for three-dimensional spatiotemporal face analysis. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 40(3):461–474, 2010.

[20] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models - their training and application. Computer Vision and Image Understanding, 61(1):38–59, 1995.

[21] F. Tsalakanidou and S. Malassiotis. Real-time facial feature tracking from 2D-3D video streams. In 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), 2010, pages 1–4, 2010.

[22] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
[23] H. Soyel and H. Demirel. Facial expression recognition using 3D facial feature distances. In M. Kamel and A. Campilho, editors, Image Analysis and Recognition, volume 4633 of Lecture Notes in Computer Science, pages 831–838. Springer Berlin Heidelberg, 2007.

[24] A. Maalej, B. Ben Amor, M. Daoudi, A. Srivastava, and S. Berretti. Local 3D shape analysis for facial expression recognition. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 4129–4132, 2010.

[25] S. Ramanathan, A. Kassim, Y. V. Venkatesh, and W. S. Wah. Human facial expression recognition using a 3D morphable model. In Image Processing, 2006 IEEE International Conference on, pages 661–664, 2006.

[26] S. Berretti, B. Ben Amor, M. Daoudi, and A. del Bimbo. 3D facial expression recognition using SIFT descriptors of automatically detected keypoints. The Visual Computer, 27(11):1021–1036, 2011.

[27] D. G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157, 1999.

[28] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[29] V. Le, H. Tang, and T. S. Huang. Expression recognition from 3D dynamic faces using robust spatio-temporal shape features. In Automatic Face and Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 414–421, 2011.

[30] G. Sandbach, S. Zafeiriou, M. Pantic, and D. Rueckert. Recognition of 3D facial expression dynamics. Image and Vision Computing, 30(10):762–773, 2012. 3D Facial Behaviour Analysis and Understanding.

[31] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting, 1995.

[32] D. Bremner, E. Demaine, J. Erickson, J. Iacono, S. Langerman, P. Morin, and G. Toussaint. Output-sensitive algorithms for computing nearest-neighbour decision boundaries. In F. Dehne, J. Sack, and M. Smid, editors, Algorithms and Data Structures, volume 2748 of Lecture Notes in Computer Science, pages 451–461. Springer Berlin Heidelberg, 2003.

[33] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 1998.