This research aims to identify six emotions (anger, disgust, fear, happiness, sadness, and surprise) in humans, with applications in enhancing the naturalness of human-machine interaction, pain monitoring in patients, and the detection and treatment of anxiety and depression. 1166 video sequences of 42 subjects from the publicly available eNTERFACE'05 database are used for audio-visual emotion recognition. The core of this work comprises segmentation, face detection, and extraction of histogram features from different color spaces (RGB, HSV, and YCbCr) from the video frames, as well as segmentation and PLPC and MFCC feature extraction from the audio. After feature-matrix generation, five-fold cross-validation is performed with a cubic SVM classifier and a KNN classifier for emotion recognition.
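As a sketch, the cross-validation stage described above might look like the following, assuming scikit-learn and a synthetic placeholder feature matrix (the real features are the audio-visual ones described in the slides; a "cubic SVM" is taken here to mean a degree-3 polynomial kernel):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))      # placeholder feature matrix (120 samples, 20 features)
y = rng.integers(0, 6, size=120)    # six emotion labels (0..5)

svm = SVC(kernel="poly", degree=3)  # "cubic SVM": polynomial kernel of degree 3
knn = KNeighborsClassifier(n_neighbors=5)

# Five-fold cross-validation yields five accuracy scores per classifier
svm_scores = cross_val_score(svm, X, y, cv=5)
knn_scores = cross_val_score(knn, X, y, cv=5)
```

With random features the scores are near chance level; the point is only the evaluation protocol, not the numbers.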
Emotion Recognition
1. Supervisor:
Dr. Celia Shahnaz
Professor, Department of EEE
Bangladesh University of Engineering and Technology
Presented By:
Rafat Jamal Tazim (1406068)
Faria Armin (1406151)
2. • Enhancing naturalness in human-machine interaction
• Speech-to-speech translation systems
• Pain monitoring in patients
• Detection and treatment of depression and anxiety
3. The eNTERFACE’05 Audio-Visual Emotion Database[1]
• Public database
• 6 emotions: Anger, Disgust, Fear, Happiness, Sadness, Surprise
• 42 subjects (81% men, 19% women)
• 1166 video sequences
5. Video feature extraction pipeline (block diagram):
VIDEO → SILENT FRAME REMOVAL → PREPROCESSING → FACE DETECTION → RGB HISTOGRAM / HSV HISTOGRAM / YCbCr HISTOGRAM / HOS / IQA PARAMETERS → FINAL FEATURE → TRAIN SUBJECTS / TEST SUBJECTS → CLASSIFICATION → DECISION
6. • Silent frames don't contain any useful information
• Removed using Short-Term Energy (STE) thresholding of the audio
• Decreases redundancy and computational complexity
• Increases accuracy and efficiency
[Figure: waveform alternating between silent regions and speech regions]
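A minimal sketch of STE-based silence detection, assuming a simple relative threshold (the frame size and threshold value here are illustrative; the actual settings are not stated on the slide):

```python
import numpy as np

def short_term_energy(x, frame_len=400, hop=160):
    """Frame-wise short-term energy of a 1-D signal."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(x[i * hop : i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def voiced_mask(x, frame_len=400, hop=160, rel_thresh=0.1):
    """Mark frames whose energy exceeds a fraction of the peak frame energy."""
    e = short_term_energy(x, frame_len, hop)
    return e > rel_thresh * e.max()

# Example: silence, then a louder tone, then silence again (16 kHz assumed)
sig = np.concatenate([np.zeros(1600),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(3200) / 16000),
                      np.zeros(1600)])
mask = voiced_mask(sig)   # True only for the speech-like middle region
```

Frames flagged False would be discarded before any further feature extraction.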
7. • A 3D median filter suppresses noise
• Unsharp masking sharpens the image
• No need for gray-scale conversion, since different color spaces are used for feature extraction
[Figure: filtering and sharpening stages]
8. • The Viola-Jones algorithm is applied to each frame
• The face region (ROI) is segmented, which minimizes background effects
• The segmented image is resized to 100 × 100 × 3 pixels
[Figure: ROI extraction and resizing stages]
9. Why Is the Entire Face Region Considered?
• Concentration levels of hemoglobin and oxygenation under the skin vary with changes in a person's emotional and physical state
• Subtle changes in the hue and saturation components of skin color have been observed[2]
• The entire face is taken instead of only the eye and mouth regions
• The increase in accuracy compensates for the increased complexity
10. Higher Order Statistical (HOS) Features
• Segmented images are converted from the RGB plane to the HSV, Lab, Luv, YCbCr, and NTSC color planes
• Three HOS measures (kurtosis, skewness, variance) are taken from each channel of the different color planes
Why?
• HOS yields a far smaller number of relevant, non-redundant, distinguishable features in comparison to typical statistics like the mean and standard deviation
11. RGB Histogram Features
• Each channel is quantized to 8 levels
• 8 × 8 × 8 = 512 RGB histogram features are obtained
HSV Histogram Features
• The hue channel is quantized to 16 levels
• The saturation and value channels are each quantized to 8 levels
• 16 × 8 × 8 = 1024 HSV histogram features are obtained
YCbCr Histogram Features
• Each channel is quantized to 8 levels
• 512 YCbCr histogram features are obtained
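Since 512 = 8³, the feature counts imply a joint histogram over the three quantized channels rather than three concatenated per-channel histograms. A sketch of that interpretation, using a random placeholder image:

```python
import numpy as np

def quantized_histogram(img, levels=8):
    """Joint color histogram of an HxWx3 uint8 image, `levels` bins per channel."""
    q = (img.astype(np.uint16) * levels) // 256              # quantize each channel to 0..levels-1
    idx = q[..., 0] * levels * levels + q[..., 1] * levels + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=levels ** 3)
    return hist / hist.sum()                                 # normalized feature vector

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8)
rgb_feat = quantized_histogram(face, levels=8)               # 8^3 = 512 features
```

For HSV the same idea with 16 hue bins and 8 bins each for saturation and value gives 16 × 8 × 8 = 1024 features.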
12. Image Quality Analysis (IQA) Parameter Extraction
• A smoothed version of the input image is used as the reference image
• Filtering with a Gaussian kernel (σ = 0.5) generates the smoothed version of the input image
• The quality difference between the two images is measured by the following parameters:
- Structural content
- Mean square error
- Peak signal-to-noise ratio
- Normalized cross-correlation
- Average difference
- Maximum distance
- Normalized absolute error
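A sketch of these seven measures under their commonly used full-reference definitions (the slide does not spell out the exact formulas, so these are assumptions):

```python
import numpy as np

def iqa_parameters(ref, test):
    """Seven full-reference quality measures (assumed standard definitions)."""
    ref = ref.astype(np.float64)
    test = test.astype(np.float64)
    mse = np.mean((ref - test) ** 2)                           # mean square error
    psnr = 10 * np.log10(255.0 ** 2 / mse) if mse > 0 else np.inf
    sc = np.sum(ref ** 2) / np.sum(test ** 2)                  # structural content
    ncc = np.sum(ref * test) / np.sum(ref ** 2)                # normalized cross-correlation
    ad = np.mean(ref - test)                                   # average difference
    md = np.max(np.abs(ref - test))                            # maximum distance
    nae = np.sum(np.abs(ref - test)) / np.sum(np.abs(ref))     # normalized absolute error
    return dict(mse=mse, psnr=psnr, sc=sc, ncc=ncc, ad=ad, md=md, nae=nae)

img = np.arange(100, dtype=np.float64).reshape(10, 10)
params = iqa_parameters(img, img)   # identical images: mse 0, ncc 1, nae 0
```

In the pipeline, `ref` would be the Gaussian-smoothed image and `test` the original frame, yielding 7 scalar features.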
13. Final video feature vector composition:
• HOS features: 15
• RGB histogram features: 512
• HSV histogram features: 1024
• YCbCr histogram features: 512
• IQA parameters: 7
14. Audio feature extraction pipeline (block diagram):
AUDIO → SILENCE REMOVAL → PRE-EMPHASIS → FRAMING & WINDOWING → PLPC / MFCC / WPDC → TEMPORAL SMOOTHING → HOS / TEAGER OPERATOR / PLPC & MFCC (of WPD coefficients) → FINAL FEATURE → TRAIN SUBJECTS / TEST SUBJECTS → CLASSIFICATION → DECISION
15. • Each audio channel provides slightly different values
• This can vary the accuracy
• Left, right, and mono (mean of the two channels) versions are taken
16. • First-order high-pass filter with pre-emphasis coefficient a = −0.9785
• Balances the frequency spectrum
• Improves SNR
H(z) = 1 + a z^(-1)
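In the time domain the filter H(z) = 1 + a z^(-1) is y[n] = x[n] + a·x[n−1], which can be sketched as:

```python
import numpy as np

def pre_emphasis(x, a=-0.9785):
    """Apply H(z) = 1 + a*z^-1, i.e. y[n] = x[n] + a*x[n-1] (y[0] = x[0])."""
    y = np.empty(len(x), dtype=np.float64)
    y[0] = x[0]
    y[1:] = x[1:] + a * x[:-1]
    return y

x = np.ones(5)
y = pre_emphasis(x)   # constant (low-frequency) input is strongly attenuated
```

With a = −0.9785 a DC input is reduced to 1 − 0.9785 = 0.0215 of its amplitude, which is how the filter boosts high frequencies relative to low ones.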
17. • Audio is down-sampled from 48 kHz to 16 kHz
• The signal is segmented into 25 ms frames (400 samples) with a 10 ms frame shift (160 samples), so consecutive frames overlap
• Within a 25 ms frame the signal is quasi-stationary
• Each frame is multiplied by a Hamming window
• Windowing mitigates the Gibbs phenomenon
[Figure: down-sampling and Hamming window]
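The framing and windowing step above can be sketched as follows (16 kHz input assumed, per the slide):

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 16 kHz signal into 25 ms frames with a 10 ms shift, Hamming-windowed."""
    n_frames = 1 + (len(x) - frame_len) // hop
    # Build an index matrix: row i selects samples [i*hop, i*hop + frame_len)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

x = np.random.default_rng(0).normal(size=16000)   # one second of placeholder audio
frames = frame_signal(x)                           # 98 frames of 400 samples each
```

One second of audio yields 1 + (16000 − 400) // 160 = 98 frames, each tapered by the Hamming window before spectral analysis.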
18. Perceptual Linear Predictive Coefficients (PLPC)
• Represent the way the human ear perceives frequency ranges
• Useful for extracting emotion-related information
• 12th-order (length-13) PLPCs are extracted by applying a Bark filter bank
Bark frequency conversion formula:
Bark = 13 arctan(0.76 f / 1000) + 3.5 arctan((f / 7500)^2)
19. Mel-Frequency Cepstral Coefficients (MFCC)
• Mimic the non-linear human-ear perception of sound
• More discriminative at lower frequencies and less discriminative at higher frequencies
• 13 MFCCs are extracted using 13 filters of the Mel-frequency band
Mel frequency conversion formula:
M = 2595 log10(1 + f / 700)
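The two perceptual frequency warpings used by the PLP and MFCC filter banks can be written directly from the formulas above (base-10 logarithm assumed for the Mel formula, as is conventional):

```python
import numpy as np

def hz_to_bark(f):
    """Bark scale used for the PLP filter bank (formula from the slides)."""
    return 13 * np.arctan(0.76 * f / 1000) + 3.5 * np.arctan((f / 7500) ** 2)

def hz_to_mel(f):
    """Mel scale used for the MFCC filter bank: M = 2595 * log10(1 + f/700)."""
    return 2595 * np.log10(1 + f / 700)

m = hz_to_mel(1000.0)   # ~1000 mel at 1 kHz, by construction of the scale
```

Both scales are nearly linear below about 1 kHz and compress higher frequencies, which is what makes the resulting features "more discriminative at lower frequencies."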
20. Temporal Smoothing
• Temporal smoothing of length 3 (taking into account one preceding and one following frame) is applied to each frame
• Removes sudden changes in the features caused by noisy speech samples
Smoothed feature vector:
x_sma(n) = (1/W) * Σ_{i = -(W-1)/2}^{(W-1)/2} x(n + i)
where x(n) is the feature vector and W = 3 is the smoothing window length
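The smoothing formula is a centered moving average over the frame axis. A sketch, with edge frames repeated at the boundaries (the slide does not specify the boundary handling, so that is an assumption):

```python
import numpy as np

def smooth_features(x, W=3):
    """Centered moving average of odd length W over axis 0 (frames x features)."""
    half = (W - 1) // 2
    # Repeat the first and last frames so every frame has a full window
    padded = np.pad(x, ((half, half), (0, 0)), mode="edge")
    return np.mean([padded[i : i + len(x)] for i in range(W)], axis=0)

feats = np.array([[0.0], [3.0], [6.0], [9.0]])   # 4 frames, 1 feature
sm = smooth_features(feats, W=3)                  # interior frames average neighbors
```

For an interior frame, the smoothed value is exactly (x(n−1) + x(n) + x(n+1)) / 3, matching the formula with W = 3.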
21. Higher Order Statistics (HOS) Features
• Three HOS measures (kurtosis, skewness, and variance) are taken
• HOS of the MFCCs and PLPCs of each frame are taken
22. Wavelet Packet Decomposition (WPD)
• Presents another scale of the perceptual frequency range
• A three-level wavelet packet decomposition is performed
• Both Coiflet and Daubechies filters are used
Wavelet packet decomposition tree:
Level 0: (0,0)
Level 1: (1,0), (1,1)
Level 2: (2,0), (2,1)
Level 3: (3,0), (3,1)
*The coefficients of the highlighted nodes (four sets in total) are used
23. Teager Energy Operator
• A non-linear time-domain operator
• Removes sudden changes in the features caused by noisy speech samples
Energy operator:
ψ[x(n)] = x²(n) − x(n−1) · x(n+1)
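The operator is a simple three-sample expression; for a pure sinusoid x(n) = sin(ωn) it evaluates exactly to the constant sin²(ω), which is why it tracks signal energy:

```python
import numpy as np

def teager(x):
    """Teager energy: psi[x(n)] = x(n)^2 - x(n-1)*x(n+1); output is 2 samples shorter."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

n = np.arange(100)
x = np.sin(0.2 * np.pi * n)   # pure sinusoid with digital frequency 0.2*pi
e = teager(x)                 # constant, equal to sin(0.2*pi)**2
```

Because noise spikes break this smooth structure, thresholding or weighting by the Teager energy suppresses them in the feature stream.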
24. PLPC and MFCC of WPD Coefficients
• PLPCs and MFCCs are extracted from each of the four sets of wavelet coefficients
• The combination of the three perceptual frequency scales provides information about emotions at greater scales and in finer detail
33. [1] O. Martin, I. Kotsia, B. Macq, and I. Pitas, "The eNTERFACE'05 Audio-Visual Emotion Database," in Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW), 2006.
[2] G. A. Ramirez, O. Fuentes, S. L. Crites, M. Jimenez, and J. Ordonez, "Color analysis of facial skin: Detection of emotional state," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2014.