This research aims to identify six emotions (anger, disgust, fear, happiness, sadness, and surprise) in humans, with applications in enhancing the naturalness of human-machine interaction, pain monitoring in patients, and the detection and treatment of anxiety and depression. 1166 video sequences of 42 subjects from the publicly available eNTERFACE'05 database are used for audio-visual emotion recognition. The core of this work comprises segmentation, face detection, and extraction of histogram features from different color spaces (RGB, HSV, and YCbCr) from the video frames, as well as segmentation and PLPC and MFCC feature extraction from the audio. After feature-matrix generation, five-fold cross-validation is performed with a cubic SVM classifier and a KNN classifier for emotion recognition.
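As a sketch, the cross-validation stage described above might look like the following, assuming scikit-learn and a synthetic placeholder feature matrix (the real features are the audio-visual ones described in the slides; a "cubic SVM" is taken here to mean a degree-3 polynomial kernel):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))      # placeholder feature matrix (120 samples, 20 features)
y = rng.integers(0, 6, size=120)    # six emotion labels (0..5)

svm = SVC(kernel="poly", degree=3)  # "cubic SVM": polynomial kernel of degree 3
knn = KNeighborsClassifier(n_neighbors=5)

# Five-fold cross-validation yields five accuracy scores per classifier
svm_scores = cross_val_score(svm, X, y, cv=5)
knn_scores = cross_val_score(knn, X, y, cv=5)
```

With random features the scores are near chance level; the point is only the evaluation protocol, not the numbers.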
Emotion Recognition
1. Supervisor:
Dr. Celia Shahnaz
Professor, Department of EEE
Bangladesh University of Engineering and Technology
Presented By:
Rafat Jamal Tazim (1406068)
Faria Armin (1406151)
2. • Enhancing naturalness in human-machine interaction
• Speech-to-speech translation systems
• Pain monitoring in patients
• Detection and treatment of depression and anxiety
3. The eNTERFACE’05 Audio-Visual Emotion Database[1]
• Public database
• 6 emotions: Anger, Disgust, Fear, Happiness, Sadness, Surprise
• 42 subjects (81% men, 19% women)
• 1166 video sequences
5. Video feature extraction pipeline (block diagram):
VIDEO → SILENT FRAME REMOVAL → PREPROCESSING → FACE DETECTION → RGB HISTOGRAM / HSV HISTOGRAM / YCbCr HISTOGRAM / HOS / IQA PARAMETERS → FINAL FEATURE → TRAIN SUBJECTS / TEST SUBJECTS → CLASSIFICATION → DECISION
6. • Silent frames don't contain any useful information
• Removed using Short-Term Energy (STE) thresholding of the audio
• Decreases redundancy and computational complexity
• Increases accuracy and efficiency
[Figure: waveform alternating between silent regions and speech regions]
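A minimal sketch of STE-based silence detection, assuming a simple relative threshold (the frame size and threshold value here are illustrative; the actual settings are not stated on the slide):

```python
import numpy as np

def short_term_energy(x, frame_len=400, hop=160):
    """Frame-wise short-term energy of a 1-D signal."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(x[i * hop : i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def voiced_mask(x, frame_len=400, hop=160, rel_thresh=0.1):
    """Mark frames whose energy exceeds a fraction of the peak frame energy."""
    e = short_term_energy(x, frame_len, hop)
    return e > rel_thresh * e.max()

# Example: silence, then a louder tone, then silence again (16 kHz assumed)
sig = np.concatenate([np.zeros(1600),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(3200) / 16000),
                      np.zeros(1600)])
mask = voiced_mask(sig)   # True only for the speech-like middle region
```

Frames flagged False would be discarded before any further feature extraction.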
7. • A 3D median filter suppresses noise
• Unsharp masking sharpens the image
• No need for gray-scale conversion, since different color spaces are used for feature extraction
[Figure: filtering and sharpening stages]
8. • The Viola-Jones algorithm is applied to each frame
• The face region (ROI) is segmented, which minimizes background effects
• The segmented image is resized to 100 × 100 × 3 pixels
[Figure: ROI extraction and resizing stages]
9. Why Is the Entire Face Region Considered?
• Concentration levels of hemoglobin and oxygenation under the skin vary with changes in a person's emotional and physical state
• Subtle changes in the hue and saturation components of skin color have been observed[2]
• The entire face is taken instead of only the eye and mouth regions
• The increase in accuracy compensates for the increased complexity
10. Higher Order Statistical (HOS) Features
• Segmented images are converted from the RGB plane to the HSV, Lab, Luv, YCbCr, and NTSC color planes
• Three HOS measures (kurtosis, skewness, variance) are taken from each channel of the different color planes
Why?
• HOS yields a far smaller number of relevant, non-redundant, distinguishable features in comparison to typical statistics like the mean and standard deviation
11. RGB Histogram Features
• Each channel is quantized to 8 levels
• 8 × 8 × 8 = 512 RGB histogram features are obtained
HSV Histogram Features
• The hue channel is quantized to 16 levels
• The saturation and value channels are each quantized to 8 levels
• 16 × 8 × 8 = 1024 HSV histogram features are obtained
YCbCr Histogram Features
• Each channel is quantized to 8 levels
• 512 YCbCr histogram features are obtained
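Since 512 = 8³, the feature counts imply a joint histogram over the three quantized channels rather than three concatenated per-channel histograms. A sketch of that interpretation, using a random placeholder image:

```python
import numpy as np

def quantized_histogram(img, levels=8):
    """Joint color histogram of an HxWx3 uint8 image, `levels` bins per channel."""
    q = (img.astype(np.uint16) * levels) // 256              # quantize each channel to 0..levels-1
    idx = q[..., 0] * levels * levels + q[..., 1] * levels + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=levels ** 3)
    return hist / hist.sum()                                 # normalized feature vector

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8)
rgb_feat = quantized_histogram(face, levels=8)               # 8^3 = 512 features
```

For HSV the same idea with 16 hue bins and 8 bins each for saturation and value gives 16 × 8 × 8 = 1024 features.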
12. Image Quality Analysis (IQA) Parameter Extraction
• A smoothed version of the input image is used as the reference image
• Filtering with a Gaussian kernel (σ = 0.5) generates the smoothed version of the input image
• The quality difference between the two images is measured by the following parameters:
- Structural content
- Mean square error
- Peak signal-to-noise ratio
- Normalized cross-correlation
- Average difference
- Maximum distance
- Normalized absolute error
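A sketch of these seven measures under their commonly used full-reference definitions (the slide does not spell out the exact formulas, so these are assumptions):

```python
import numpy as np

def iqa_parameters(ref, test):
    """Seven full-reference quality measures (assumed standard definitions)."""
    ref = ref.astype(np.float64)
    test = test.astype(np.float64)
    mse = np.mean((ref - test) ** 2)                           # mean square error
    psnr = 10 * np.log10(255.0 ** 2 / mse) if mse > 0 else np.inf
    sc = np.sum(ref ** 2) / np.sum(test ** 2)                  # structural content
    ncc = np.sum(ref * test) / np.sum(ref ** 2)                # normalized cross-correlation
    ad = np.mean(ref - test)                                   # average difference
    md = np.max(np.abs(ref - test))                            # maximum distance
    nae = np.sum(np.abs(ref - test)) / np.sum(np.abs(ref))     # normalized absolute error
    return dict(mse=mse, psnr=psnr, sc=sc, ncc=ncc, ad=ad, md=md, nae=nae)

img = np.arange(100, dtype=np.float64).reshape(10, 10)
params = iqa_parameters(img, img)   # identical images: mse 0, ncc 1, nae 0
```

In the pipeline, `ref` would be the Gaussian-smoothed image and `test` the original frame, yielding 7 scalar features.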
13. Final video feature vector composition:
• HOS features: 15
• RGB histogram features: 512
• HSV histogram features: 1024
• YCbCr histogram features: 512
• IQA parameters: 7
14. Audio feature extraction pipeline (block diagram):
AUDIO → SILENCE REMOVAL → PRE-EMPHASIS → FRAMING & WINDOWING → PLPC / MFCC / WPDC → TEMPORAL SMOOTHING → HOS / TEAGER OPERATOR / PLPC & MFCC (of WPD coefficients) → FINAL FEATURE → TRAIN SUBJECTS / TEST SUBJECTS → CLASSIFICATION → DECISION
15. • Each audio channel provides slightly different values
• This can vary the accuracy
• Left, right, and mono (mean of the two channels) versions are taken
16. • First-order high-pass filter with pre-emphasis coefficient a = −0.9785
• Balances the frequency spectrum
• Improves SNR
H(z) = 1 + a z^(-1)
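In the time domain the filter H(z) = 1 + a z^(-1) is y[n] = x[n] + a·x[n−1], which can be sketched as:

```python
import numpy as np

def pre_emphasis(x, a=-0.9785):
    """Apply H(z) = 1 + a*z^-1, i.e. y[n] = x[n] + a*x[n-1] (y[0] = x[0])."""
    y = np.empty(len(x), dtype=np.float64)
    y[0] = x[0]
    y[1:] = x[1:] + a * x[:-1]
    return y

x = np.ones(5)
y = pre_emphasis(x)   # constant (low-frequency) input is strongly attenuated
```

With a = −0.9785 a DC input is reduced to 1 − 0.9785 = 0.0215 of its amplitude, which is how the filter boosts high frequencies relative to low ones.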
17. • Audio is down-sampled from 48 kHz to 16 kHz
• The signal is segmented into 25 ms frames (400 samples) with a 10 ms frame shift (160 samples), so consecutive frames overlap
• Within a 25 ms frame the signal is quasi-stationary
• Each frame is multiplied by a Hamming window
• Windowing mitigates the Gibbs phenomenon
[Figure: down-sampling and Hamming window]
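The framing and windowing step above can be sketched as follows (16 kHz input assumed, per the slide):

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 16 kHz signal into 25 ms frames with a 10 ms shift, Hamming-windowed."""
    n_frames = 1 + (len(x) - frame_len) // hop
    # Build an index matrix: row i selects samples [i*hop, i*hop + frame_len)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

x = np.random.default_rng(0).normal(size=16000)   # one second of placeholder audio
frames = frame_signal(x)                           # 98 frames of 400 samples each
```

One second of audio yields 1 + (16000 − 400) // 160 = 98 frames, each tapered by the Hamming window before spectral analysis.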
18. Perceptual Linear Predictive Coefficients (PLPC)
• Represent the way the human ear perceives frequency ranges
• Useful for extracting emotion-related information
• 12th-order (length-13) PLPCs are extracted by applying a Bark filter bank
Bark frequency conversion formula:
Bark = 13 arctan(0.76 f / 1000) + 3.5 arctan((f / 7500)^2)
19. Mel-Frequency Cepstral Coefficients (MFCC)
• Mimic the non-linear human-ear perception of sound
• More discriminative at lower frequencies and less discriminative at higher frequencies
• 13 MFCCs are extracted using 13 filters of the Mel-frequency band
Mel frequency conversion formula:
M = 2595 log10(1 + f / 700)
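The two perceptual frequency warpings used by the PLP and MFCC filter banks can be written directly from the formulas above (base-10 logarithm assumed for the Mel formula, as is conventional):

```python
import numpy as np

def hz_to_bark(f):
    """Bark scale used for the PLP filter bank (formula from the slides)."""
    return 13 * np.arctan(0.76 * f / 1000) + 3.5 * np.arctan((f / 7500) ** 2)

def hz_to_mel(f):
    """Mel scale used for the MFCC filter bank: M = 2595 * log10(1 + f/700)."""
    return 2595 * np.log10(1 + f / 700)

m = hz_to_mel(1000.0)   # ~1000 mel at 1 kHz, by construction of the scale
```

Both scales are nearly linear below about 1 kHz and compress higher frequencies, which is what makes the resulting features "more discriminative at lower frequencies."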
20. Temporal Smoothing
• Temporal smoothing of length 3 (taking into account one preceding and one following frame) is applied to each frame
• Removes sudden changes in the features caused by noisy speech samples
Smoothed feature vector:
x_sma(n) = (1/W) * Σ_{i = -(W-1)/2}^{(W-1)/2} x(n + i)
where x(n) is the feature vector and W = 3 is the smoothing window length
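The smoothing formula is a centered moving average over the frame axis. A sketch, with edge frames repeated at the boundaries (the slide does not specify the boundary handling, so that is an assumption):

```python
import numpy as np

def smooth_features(x, W=3):
    """Centered moving average of odd length W over axis 0 (frames x features)."""
    half = (W - 1) // 2
    # Repeat the first and last frames so every frame has a full window
    padded = np.pad(x, ((half, half), (0, 0)), mode="edge")
    return np.mean([padded[i : i + len(x)] for i in range(W)], axis=0)

feats = np.array([[0.0], [3.0], [6.0], [9.0]])   # 4 frames, 1 feature
sm = smooth_features(feats, W=3)                  # interior frames average neighbors
```

For an interior frame, the smoothed value is exactly (x(n−1) + x(n) + x(n+1)) / 3, matching the formula with W = 3.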
21. Higher Order Statistics (HOS) Features
• Three HOS measures (kurtosis, skewness, and variance) are taken
• HOS of the MFCCs and PLPCs of each frame are taken
22. Wavelet Packet Decomposition (WPD)
• Presents another scale of the perceptual frequency range
• A three-level wavelet packet decomposition is performed
• Both Coiflet and Daubechies filters are used
Wavelet packet decomposition tree:
Level 0: (0,0)
Level 1: (1,0), (1,1)
Level 2: (2,0), (2,1)
Level 3: (3,0), (3,1)
*The coefficients of the highlighted nodes (four sets in total) are used
23. Teager Energy Operator
• A non-linear time-domain operator
• Removes sudden changes in the features caused by noisy speech samples
Energy operator:
ψ[x(n)] = x²(n) − x(n−1) · x(n+1)
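The operator is a simple three-sample expression; for a pure sinusoid x(n) = sin(ωn) it evaluates exactly to the constant sin²(ω), which is why it tracks signal energy:

```python
import numpy as np

def teager(x):
    """Teager energy: psi[x(n)] = x(n)^2 - x(n-1)*x(n+1); output is 2 samples shorter."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

n = np.arange(100)
x = np.sin(0.2 * np.pi * n)   # pure sinusoid with digital frequency 0.2*pi
e = teager(x)                 # constant, equal to sin(0.2*pi)**2
```

Because noise spikes break this smooth structure, thresholding or weighting by the Teager energy suppresses them in the feature stream.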
24. PLPC and MFCC of WPD Coefficients
• PLPCs and MFCCs are extracted from each of the four sets of wavelet coefficients
• The combination of the three perceptual frequency scales provides information about emotions at greater scales and in finer detail
33. [1] O. Martin, I. Kotsia, B. Macq, and I. Pitas, "The eNTERFACE'05 Audio-Visual Emotion Database," in Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW), 2006.
[2] G. A. Ramirez, O. Fuentes, S. L. Crites, M. Jimenez, and J. Ordonez, "Color analysis of facial skin: Detection of emotional state," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2014.