1. Facial Expression Recognition
Based on Hybrid Approach
Md. Abdul Mannan1, Antony Lam1, Yoshinori Kobayashi1, 2, and Yoshinori Kuno1
1Graduate School of Science and Engineering, Saitama University, Japan
2Japan Science and Technology Agency (JST), PRESTO, Kawaguchi, Japan
{mannan, antonylam, kobayashi, kuno}@cv.ics.saitama-u.ac.jp
2. Background
Identify the person's expression from his/her face image
Objective
• Combine the appearance and geometric features using
decision level fusion approach
Application
• Human-computer interaction
• Social robots
• Deceit detection
• Behavior monitoring
08/22/2015
3. Our Approach
Combine the appearance and geometric features using decision
level fusion approach to identify facial expressions
Two types of fusion
Feature level fusion [1]
Decision level fusion [2]
1. C. G. M. Snoek, M. Worring, and A. W. M. Smeulders, “Early versus late fusion in semantic video
analysis,” in 13th Annual ACM International Conference on Multimedia, pp. 399–402, 2005.
2. D. Morrison, R. Wang, and L. C. De Silva, “Ensemble methods for spoken emotion recognition in
call-centres,” Speech Communication, vol. 49, no. 2, pp. 98–112, 2007.
4. Challenge of Facial Expression
Recognition
The facial expressions under examination: the six basic expressions (anger,
disgust, fear, happiness, sadness, and surprise) in addition to the neutral one
• Facial expressions are not completely person independent
• No perfect feature exists that is independent of race, age, and gender
• Appearance feature: Gabor wavelet representation, Local Binary Pattern (LBP)
• Geometric feature: Relative position of 58 facial landmark points
5. Workflow of Our Proposed Method
Problem Setting
• Take input image with face
• Face detection and segmentation
• Extract geometric and appearance
features
• Reduce dimensionality of appearance
features
• Use SVM
• Combine the scores of two modalities
using product rule
[Flowchart: Input Images → Face Detection → two branches: (1) Feature Points → Geometric Features → SVM; (2) Salient Regions Detection → LDN Images → Appearance Features → PCA → SVM; both SVM outputs → Fusion → Output]
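The dimensionality-reduction step in the workflow above can be sketched with a small PCA routine. This is only an illustrative sketch: the slides do not state how many components are kept or how PCA is implemented, so `fit_pca`, `apply_pca`, the component count, and the random placeholder data are all assumptions.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return (mean, components) for samples stored row-wise in X (n, d)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, mean, components):
    """Project samples onto the retained principal directions."""
    return (np.asarray(X, dtype=float) - mean) @ components.T

# Placeholder data standing in for appearance feature vectors (hypothetical sizes).
rng = np.random.default_rng(0)
appearance = rng.normal(size=(20, 50))
mean, comps = fit_pca(appearance, n_components=5)
reduced = apply_pca(appearance, mean, comps)
```

In practice the mean and components would be fitted on training data only and then reused unchanged for test images.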
6. Appearance Features
The appearance features: Extracted from four visually salient regions using
the Local Directional Number (LDN) descriptor algorithm.
Advantages of LDN
• It computes edge responses in the neighborhood, in eight different directions,
with a compass mask.
• It uses the information of the entire neighborhood.
• It is more compact, only six bits long.
Coding Method
(x, y) is the central pixel being coded,
i_{x,y} is the directional number of the maximum positive response,
j_{x,y} is the directional number of the minimum negative response.
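The coding method above can be sketched in a few lines of NumPy. The coding formula itself is not legible in the extracted slide, so this sketch assumes the usual LDN form, code = 8·i + j, which matches the six-bit length (three bits per direction number). The Kirsch compass masks, the direction numbering (0 = east, increasing counter-clockwise), and the edge padding are likewise assumptions, and `kirsch_masks` and `ldn_image` are illustrative names.

```python
import numpy as np

# Border cells of a 3x3 mask, clockwise from the top-left corner.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
# Ring values of the east-facing Kirsch mask (5s on the right column).
EAST = np.array([-3, -3, 5, 5, 5, -3, -3, -3])

def kirsch_masks():
    """Build the eight Kirsch masks: 0 = east, 1 = north-east, ..., 7 = south-east."""
    masks = []
    for k in range(8):
        m = np.zeros((3, 3), dtype=np.int64)
        for (r, c), v in zip(RING, np.roll(EAST, -k)):
            m[r, c] = v
        masks.append(m)
    return masks

def ldn_image(gray):
    """Return the per-pixel code 8*i + j for a 2-D grayscale array."""
    gray = np.asarray(gray, dtype=np.int64)
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    responses = np.empty((8, h, w), dtype=np.int64)
    for k, mask in enumerate(kirsch_masks()):
        acc = np.zeros((h, w), dtype=np.int64)
        # Correlate the mask with each 3x3 neighborhood.
        for dr in range(3):
            for dc in range(3):
                acc += mask[dr, dc] * padded[dr:dr + h, dc:dc + w]
        responses[k] = acc
    i = responses.argmax(axis=0)  # direction of the maximum (most positive) response
    j = responses.argmin(axis=0)  # direction of the minimum (most negative) response
    return 8 * i + j

# Toy image with a vertical step edge: dark on the left, bright on the right.
step = np.zeros((5, 5), dtype=int)
step[:, 3:] = 10
codes = ldn_image(step)
```

On ideal step edges several directions can tie for the minimum response; `argmin` then simply returns the lowest direction number, which is a simplification of this sketch.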
8. Geometric Features Extraction
Geometric features
- Defined as distances between facial components
- Euclidean distances D
Geometric features (feature points and the distances between them).
(x1, y1) and (x2, y2) are the coordinates
of any two feature points.
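The formula for D is not legible in the extracted slide text. Assuming the standard Euclidean distance, and writing the two feature points as (x_1, y_1) and (x_2, y_2), it would read:

```latex
D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
```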
9. Experiment and Results
Experiments on CK+ Database
Image sequences from the CK+ database showing angry, surprised, and happy expressions in the first, second, and
third rows, respectively. In each row, the first image is the neutral expression and the last image is the peak
prototypical emotion.
11. Experiment and Results
Experiments on the JAFFE Database
Examples from the JAFFE database. The columns, from left to right, show neutral, angry, disgust, fear, happy,
sadness, and surprise expressions.
14. Conclusion
This paper proposed a method for automatic facial expression
recognition.
Our system is capable of detecting a human face in a static image and
extracting features using a hybrid approach.
Feature selection
SVM classification with linear kernels
Editor's Notes
As described in the cited reference, a feature-level fusion scheme integrates unimodal features before learning concepts. The two main advantages of this scheme are that it needs only one learning stage and that it takes advantage of mutual information in the data.
Decision-level fusion, or fusion of classifiers, consists of processing the classification results of prior classification stages. The main goal of this procedure is to take advantage of the redundancy of a set of independent classifiers to achieve higher robustness by combining their results.
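The slides name the product rule as the way the scores of the two modalities are combined. A minimal sketch of that rule follows, assuming each SVM's output has already been converted to non-negative per-class scores (how that conversion is done is not stated in the source). The score values, the label order, and the name `product_rule` are illustrative only.

```python
import numpy as np

LABELS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def product_rule(scores_a, scores_b):
    """Fuse two per-class score vectors by element-wise product, then renormalize."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    fused = a * b
    return fused / fused.sum()  # assumes at least one class has a non-zero product

# Made-up scores for one image from the two classifiers (not real results).
appearance_scores = np.array([0.05, 0.05, 0.10, 0.50, 0.05, 0.20, 0.05])
geometric_scores = np.array([0.05, 0.05, 0.05, 0.30, 0.05, 0.45, 0.05])

fused = product_rule(appearance_scores, geometric_scores)
predicted = LABELS[int(np.argmax(fused))]
```

Here the two classifiers disagree on the top class, and the product favors the class that both rate reasonably highly.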
Multi-scale LDN images with small s present small-scale, local, more sensitive micro-patterns of face component structures, which helps describe the local detailed features of the face components. In contrast, LDN patterns with large s show better noise resistance and present relatively large-scale, regional macro-patterns of face component structures, but the local details are lost. The features from different scales can therefore provide complementary information to each other. The rationale behind the proposed facial component descriptor is to fuse the LDN features at various scales to obtain a more complete representation of the face components.
We use 46 facial points. The location of the 46 points relative to the face position and size results in a useful 92-dimensional feature vector, generated from both the x- and y-coordinates of each point.
To ensure the features are scale invariant, the distances are normalized to the detected face width and height. Then, the two feature sets (coordinates and distances) are concatenated to produce a vector of length 137.
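A rough sketch of how such a vector could be assembled is shown below. With 92 coordinate values and a total length of 137, the notes imply 45 distances, but they do not say which point pairs are used, so the `pairs` list here is a placeholder. Normalizing x by the face width and y by the face height before taking distances is also an assumption, and `geometric_features` is an illustrative name.

```python
import numpy as np

def geometric_features(points, face_box, pairs):
    """Concatenate normalized coordinates and normalized pairwise distances.

    points   : (N, 2) array of (x, y) landmark positions in pixels
    face_box : (x0, y0, width, height) of the detected face
    pairs    : list of (a, b) landmark index pairs (hypothetical choice)
    """
    pts = np.asarray(points, dtype=float)
    x0, y0, w, h = face_box
    # Position relative to the face, scaled by the face size.
    norm = (pts - [x0, y0]) / [w, h]
    coords = norm.ravel()                        # 2N coordinate values
    idx = np.asarray(pairs)
    diffs = norm[idx[:, 0]] - norm[idx[:, 1]]
    dists = np.hypot(diffs[:, 0], diffs[:, 1])   # Euclidean distances
    return np.concatenate([coords, dists])

# Placeholder landmarks and pairs, only to show the resulting dimensions.
rng = np.random.default_rng(0)
face_box = (10.0, 20.0, 100.0, 120.0)
points = rng.uniform([10.0, 20.0], [110.0, 140.0], size=(46, 2))
pairs = [(k, k + 1) for k in range(45)]          # 45 hypothetical pairs
features = geometric_features(points, face_box, pairs)
```

With 46 points and 45 pairs, `features` has 92 + 45 = 137 entries, matching the length given in the notes.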
The CK+ database provides standard emotion labels for 309 sequences of 106 subjects for the six universal emotions. Thus, we choose the CK+ database for our evaluation.
Specifically, the CK+ database is composed of 123 subjects aged from 18 to 50 years, with 69% female, 81% Euro-American, 13% Afro-American, and 6% other groups. Each session is an image sequence that starts from a neutral emotion and gradually ends at a peak prototypical emotion. Some examples of image sequences are shown in Fig. 4. The individual images are grayscale and of size 490×640 or 480×640. Unlike most other papers, in our experimental setup we do not keep multiple images of the same subject with the same emotion label.
We randomly select 52 first (neutral) frames and pick the last (peak) frames from the 309 labeled sequences, resulting in 361 images: 83 surprise, 28 sadness, 69 happy, 25 fear, 59 disgust, 45 anger, and 52 neutral. In this way, the dataset for the experiment becomes more challenging. The use of this type of dataset makes the experiment identity independent: an image of the same person with the same expression does not appear in both the training and test data.
From each emotion label, we randomly select 90% of the data for training and use the rest for testing. We repeat this protocol 1000 times and calculate the average accuracy. Our framework obtained an average emotion recognition rate of 96.36% using an SVM with a linear kernel. Table 1 shows the confusion matrix of one instance. The performance of state-of-the-art methods is reported in Table 2, together with the performance of our method.
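The evaluation protocol described above (a per-label 90/10 random split, repeated and averaged) could be sketched as follows. This is an illustration under assumptions: `stratified_split` and `repeated_holdout` are hypothetical helper names, the rounding of per-class training counts is a guess, and a nearest-class-mean classifier on synthetic data stands in for the linear SVM so that the example stays self-contained.

```python
import numpy as np

def stratified_split(labels, train_fraction, rng):
    """Randomly pick train_fraction of the indices of each class for training."""
    labels = np.asarray(labels)
    train_mask = np.zeros(len(labels), dtype=bool)
    for cls in np.unique(labels):
        members = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(train_fraction * len(members)))
        train_mask[members[:n_train]] = True
    return np.flatnonzero(train_mask), np.flatnonzero(~train_mask)

def repeated_holdout(X, y, fit_predict, repeats=1000, train_fraction=0.9, seed=0):
    """Average test accuracy over repeated stratified random splits."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    accuracies = []
    for _ in range(repeats):
        tr, te = stratified_split(y, train_fraction, rng)
        predictions = fit_predict(X[tr], y[tr], X[te])
        accuracies.append(np.mean(predictions == y[te]))
    return float(np.mean(accuracies))

def nearest_mean(X_train, y_train, X_test):
    """Stand-in classifier (not the SVM from the slides): nearest class mean."""
    classes = np.unique(y_train)
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    d = ((X_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

# Synthetic, well-separated data purely to exercise the protocol.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 30)
X = y[:, None] * 10.0 + rng.normal(scale=0.1, size=(90, 2))
mean_accuracy = repeated_holdout(X, y, nearest_mean, repeats=20)
```

Any classifier with the same `fit_predict(X_train, y_train, X_test)` shape could be plugged in, including a linear SVM from a library of choice.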
The JAFFE database consists of 213 images of 10 Japanese female subjects, with about three images per subject for each of the six basic facial expressions plus the neutral one.
To evaluate the generalization ability of our proposed system across different databases, we performed cross-database validation between the CK+ and JAFFE datasets: feature selection and classifier training were done on the CK+ database, while performance was tested on the JAFFE database.
Our system takes less than 1 second per image (depending on the face size) to recognize a facial expression. This is sufficient for real applications, as human facial expressions do not change from one state to another in extremely short periods of time. That is why we apply our method to determine facial expressions in video sequences. In the video sequences, each frame is processed independently and both types of features are extracted from it.
For appearance features in this paper, we used the most powerful encoding scheme, LDN, that takes advantage of the structure of the face's textures and encodes it efficiently into a compact code. LDN uses directional information that is more stable against noise than just pixel intensities, to code the different patterns from the face's textures.
In addition, we considered a geometric approach that does not need information on a person's specific neutral expression. State-of-the-art geometry-based approaches require prior knowledge of person-specific neutral expressions; however, such information is not available in real-world scenarios. In contrast, we extract geometric features from just 46 facial points. These features do not rely on person-specific neutral expressions or any temporal information.
We also tested a real-time implementation of our system on real data. These included YouTube videos, privately collected dementia patient videos, and footage from our webcam system. Our system's speed was found to be sufficient for real applications, and the results are promising. We are especially interested in applying our system to the real-time monitoring of dementia patients' reactions to stimuli over a long period of time. This would be of great benefit for tailoring therapies to the patients.