Audio-Visual Emotion Recognition
Using Cross Correlation and Wavelet Packet Domain Features
Supervisor: Dr. Celia Shahnaz
Professor
Bangladesh University of Engineering and Technology
Shamman Noor (1206053)
Ehsan Ahmed Dhrubo (1206100)
Why Emotion Recognition?
1. Enhancing naturalness in human-machine interaction (for
example: on-board car driving systems, autonomous call center
services, interactive movies, storytelling, e-tutoring, autonomous
psychological therapy, etc.)
2. Speech-to-speech translation system
3. Medical disorder diagnosis
4. Indexing and retrieving audio/video files based on emotions
and so on
[Block diagram, overall system: the AUDIO stream is framed and PLPC, MFCC and wavelet packet features are extracted; the VIDEO stream is framed and VCCR and HCCR features are extracted. Both are combined into the final feature, train and test subjects are separated, and classification produces the decision.]
Proposed Method for
Visual Emotion
Detection
[Block diagram, visual pipeline: VIDEO → frames → RGB-to-gray conversion → silent frames removal → mouth and eye region segmentation → vertical and horizontal CCR → final feature → train/test subjects → classification → decision.]
Pre-processing of video
frames
1. Each video frame is converted
from RGB scale to gray scale
and median filtered.
Why? Color information isn’t needed, as only
geometrical information is extracted.
Median filtering removes noisy pixels
from the image.
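This preprocessing step can be sketched as follows; the median-filter kernel size is an assumption, since the slides do not state the filter length:

```python
import numpy as np
from scipy.ndimage import median_filter

def preprocess_frame(frame_rgb, kernel_size=3):
    # Luminance conversion (ITU-R BT.601 weights); color is discarded since
    # only geometrical information is used downstream.
    gray = (frame_rgb[..., 0] * 0.299
            + frame_rgb[..., 1] * 0.587
            + frame_rgb[..., 2] * 0.114)
    # Median filtering suppresses salt-and-pepper noise while preserving edges.
    return median_filter(gray, size=kernel_size)
```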
Removing silent frames
1. Short time energy is utilized to
remove frames where no
speech is given by the subject.
Why? Gives only the frames where
speech is provided and emotion is
expressed. Reduces redundant frames
and increases accuracy to a great extent.
Why not in audio? Doing so would remove the Vowel
Onset Points (VOP), which carry vital
information about specific emotions.
[Figure: speech waveform with alternating silent and speech regions marked (silent regions 1-4, speech regions 1-2).]
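A minimal sketch of silent-frame removal via short-time energy; the threshold (a fraction of the peak frame energy) is an assumed choice, not stated on the slides:

```python
import numpy as np

def voiced_frame_mask(audio, sr, fps, threshold_ratio=0.1):
    """Mark video frames whose corresponding audio segment has short-time
    energy above a threshold; frames below it are treated as silent."""
    samples_per_frame = int(sr / fps)
    n_frames = len(audio) // samples_per_frame
    # Short-time energy of the audio aligned with each video frame.
    energy = np.array([
        np.sum(audio[i * samples_per_frame:(i + 1) * samples_per_frame] ** 2)
        for i in range(n_frames)
    ])
    return energy > threshold_ratio * energy.max()
```

Only the frames where the mask is True are kept for feature extraction.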
Eye and mouth
segmentation by Viola
Jones
1. The Viola-Jones algorithm is
applied to each frame to select
only the mouth and eye regions
of the face.
Why? Emotion is primarily expressed by
using different shapes of mouth and
eyes and relative location between eyes
and eyebrows.
[Figure: original grayscale image with eye-region and mouth-region extraction.]
Extracting Vertical and
Horizontal Cross
Correlation
1. Cross-correlation features, Vertical
Cross Correlation (between each
two columns) and Horizontal Cross
Correlation (between each two
rows), are extracted from the
segmented regions.
Why? Cross Correlation features give
detailed geometrical shape information
with minimal computational cost.
[Figure: cross-correlation of two signals x(n) and y(n), producing R_xy(m).]

R_xy(m) = Σ_{n=0}^{N−m−1} x(m+n) · y*(n)

Vertical Cross Correlation is computed between columns; Horizontal Cross Correlation is computed between rows.
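A sketch of the feature extraction with NumPy; pairing only adjacent columns/rows is an assumption (the slides say "between each two columns/rows" without fixing the pairing), and how the full lag sequences are reduced to the reported feature lengths (e.g. 57 for HCCR-Eye) is not specified:

```python
import numpy as np

def cross_correlation_features(region, axis=0):
    """Cross-correlate each pair of adjacent columns (axis=0, 'vertical CCR')
    or rows (axis=1, 'horizontal CCR') of a segmented grayscale region.
    Returns one full cross-correlation sequence (all lags) per pair."""
    lines = region.T if axis == 0 else region
    # np.correlate with mode='full' evaluates R_xy(m) over all lags m.
    return np.stack([np.correlate(a, b, mode="full")
                     for a, b in zip(lines[:-1], lines[1:])])
```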
[Figure: example Horizontal CCR curves for the eye and mouth regions, and Vertical CCR curves for the eye and mouth regions.]
Horizontal Cross Correlation - Eye: the left eye is considered due to symmetry.
Horizontal Cross Correlation - Mouth: the left half of the mouth region is considered due to symmetry.
Vertical Cross Correlation - Eye: the left eye is considered due to symmetry.
Vertical Cross Correlation - Mouth: the left half of the mouth region is considered due to symmetry.

Complete visual feature from one video file (length 234):
HCCR Eye (57) | HCCR Mouth (65) | VCCR Eye (44) | VCCR Mouth (68)
Proposed Method for
Emotion Detection from
speech
[Block diagram, audio pipeline: AUDIO → pre-emphasis → framing and windowing → PLPC, MFCC and wavelet packet PLPC/MFCC extraction → mean over frames → final feature → train/test subjects → classification → decision.]
Proposed Features:
1. Perceptual Linear
Predictive Coefficients
2. Mel Frequency Cepstral
Coefficients
3. Perceptual Linear
Predictive Coefficients
and Mel Frequency
Cepstral Coefficients of
Wavelet Packet
Coefficients
Channel Selection and
Endpoint detection
1. Left, right and mono channels
are selected for feature
extraction.
2. Short-time energy is calculated
for thresholding the frames at
the start and end of the audio
signal.
Why? Each channel gives slightly
different accuracies.
Silent regions at the start and end
lower the accuracy.
Pre-Emphasis filtering
1. Audio signal is passed through
a first order high pass filter with
pre-emphasis coefficient
0.9785.
Why? To emphasize formant
information and remove the impact of
the excitation source.
[Figure: source-filter model. The excitation E(z) passes through the vocal tract filter V(z) ≈ 1/H(z) to produce the sound signal S(z); the pre-emphasis filter H(z) compensates this spectral shaping.]

H(z) = 1 − a·z⁻¹

H(z) = pre-emphasis filter
E(z) = excitation source
V(z) = vocal tract filter
S(z) = sound signal
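The time-domain form of the pre-emphasis filter is a one-line difference equation:

```python
import numpy as np

def pre_emphasis(signal, a=0.9785):
    # y[n] = x[n] - a*x[n-1], the time-domain form of H(z) = 1 - a z^-1.
    # The first sample is passed through unchanged.
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```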
Framing and windowing
1. Audio signal is framed into
25ms windows, with 10ms
overlaps and windowed by
Hamming window.
Why? Over a 10-25 ms duration, speech
can be considered quasi-stationary.
Hamming windowing reduces the spectral
leakage (Gibbs phenomenon) caused by framing.
[Figure: block processing of the audio signal into 25 ms frames with 10 ms overlap.]
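A sketch of the framing step; it follows the slide's wording (25 ms frames, 10 ms overlap, i.e. a 15 ms hop), though many MFCC front ends use a 10 ms hop instead:

```python
import numpy as np

def frame_signal(signal, sr, frame_ms=25, overlap_ms=10):
    """Split a signal into overlapping Hamming-windowed frames."""
    frame_len = int(sr * frame_ms / 1000)
    hop = frame_len - int(sr * overlap_ms / 1000)  # 15 ms hop for 10 ms overlap
    window = np.hamming(frame_len)
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    # Each row is one windowed frame.
    return np.stack([signal[i * hop:i * hop + frame_len] * window
                     for i in range(n)])
```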
Perceptual Linear
Predictive Co-efficient
(PLPC) Extraction
1. 12th-order (length-13) PLPCs are
extracted by applying a Bark filter
bank to each 25 ms frame.
Why? Bark filter bank or Bark frequency
scale represents how the human ear
perceives frequency ranges, which is
useful for emotion related information
extraction.
Bark frequency conversion formula:

Bark = 13 · tan⁻¹(0.76f / 1000) + 3.5 · tan⁻¹((f / 7500)²)
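The Bark conversion above translates directly into code:

```python
import numpy as np

def hz_to_bark(f):
    # Bark = 13*arctan(0.76 f / 1000) + 3.5*arctan((f / 7500)^2)
    return (13.0 * np.arctan(0.76 * f / 1000.0)
            + 3.5 * np.arctan((f / 7500.0) ** 2))
```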
Mel-Frequency Cepstral
Coefficient (MFCC)
Extraction
1. 26 MFCCs are extracted from each
25 ms frame using 13 Mel-frequency
filters.
Why? The Mel scale is another model of
how the human ear perceives frequency.
Both the Bark and Mel scales are used
here to better capture emotion-related
information.
Mel frequency conversion formula:

mel frequency = 2595 · log₁₀(1 + f / 700)

[Figure: Mel filter banks.]
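The Mel conversion, and the filter-bank edge frequencies it implies (equally spaced in Mel, mapped back to Hz), can be sketched as:

```python
import numpy as np

def hz_to_mel(f):
    # mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filter_edges(n_filters, f_low, f_high):
    """Edge/center frequencies for a triangular Mel filter bank:
    n_filters + 2 points equally spaced in Mel, converted back to Hz."""
    mels = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
```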
Temporal Smoothing
1. Temporal smoothing (moving-average
filtering) of length 3, taking into
account one previous and one
following frame, is applied to each
frame’s features.
Why? It removes sudden changes in the
features caused by noisy speech samples.
Smoothed feature vector:

x_sma(n) = (1/W) · Σ_{i=−(W−1)/2}^{(W−1)/2} x(n + i)

x(n) = feature vector
W = 3, smoothing window length
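A sketch of the smoothing over a (frames × dimensions) feature matrix; note that with `mode="same"` the first and last frames are averaged against implicit zero padding:

```python
import numpy as np

def smooth_features(features, w=3):
    """Moving-average smoothing along time for each feature dimension,
    window length w (one previous and one following frame for w=3)."""
    kernel = np.ones(w) / w
    out = np.empty(features.shape, dtype=float)
    for d in range(features.shape[1]):
        # 'same' keeps the number of frames unchanged (zero-padded edges).
        out[:, d] = np.convolve(features[:, d], kernel, mode="same")
    return out
```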
Applying Statistical
Functionals
1. The mean of all frames’ features is
taken.
Why? The effect of all frames and their
temporal evolution is taken into account,
as emotions are expressed over long
durations, not just within 25 ms frames.
Statistical functional (mean):

Mean = (1/N) · Σ_{n=0}^{N−1} x(n)

x(n) = feature vector of frame n
N = number of frames
Wavelet Packet
decomposition
1. 3-level wavelet packet
decomposition is performed
using both Coiflet and
Daubechies filters after
down-sampling to 16 kHz.
Why? Wavelet packet decomposition
provides another perceptual
frequency scale. Down-sampling
reduces computational cost.
[Figure: wavelet packet decomposition tree. The root (0, 0) splits into (1, 0) and (1, 1) at level 1, (1, 0) into (2, 0) and (2, 1) at level 2, and (2, 0) into (3, 0) and (3, 1) at level 3. The bold-faced nodes' coefficients, i.e. the four subbands (3, 0), (3, 1), (2, 1) and (1, 1), are used.]
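With PyWavelets this decomposition is a few lines; the mapping of the four used subbands to the paths 'aaa', 'aad', 'ad' and 'd' is our reading of the bolded tree nodes, not stated explicitly on the slide:

```python
import numpy as np
import pywt

def wavelet_packet_bands(audio_16k, wavelet="coif5"):
    """3-level wavelet packet decomposition; returns the coefficient arrays
    of nodes (3,0), (3,1), (2,1) and (1,1) (pywt paths 'aaa','aad','ad','d')."""
    wp = pywt.WaveletPacket(data=audio_16k, wavelet=wavelet, maxlevel=3)
    return [wp[path].data for path in ("aaa", "aad", "ad", "d")]
```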
PLPC and MFCC of
wavelet coefficients
1. PLPCs and MFCCs are extracted
from each of the four wavelet
coefficient subbands using the
method described above.
Why? The combination of three different
perceptual frequency scales provides
emotion-related information in greater
and finer detail.
[Figure: wavelet packet filter banks for level 3, plotted as amplitude vs. frequency; the subband edges fall at fn/8, fn/4, fn/2 and fn across levels 3, 2 and 1.]
Complete audio feature from one video file (length 195):
13 PLPC | 26 MFCC | 156 WPD PLPC and MFCC
- 13 Perceptual Linear Predictive Coefficients: 12th-order filter coefficients using the Bark frequency scale
- 26 Mel-Frequency Cepstral Coefficients: 13 Mel filter banks and a type-2 Discrete Cosine Transform are used
- 156 wavelet packet PLPC and MFCC: 13 PLPCs and 26 MFCCs from each of the 4 wavelet coefficient subbands
Complete audio-visual feature from one video file (length 429):
- Audio feature (length 195): 13 PLPC, 26 MFCC, 156 WPD PLPC and WPD MFCC
- Video feature (length 234): 57 HCCR (Eye), 65 HCCR (Mouth), 44 VCCR (Eye), 68 VCCR (Mouth)
Emotion Recognition results from Speech Features
Channel: Left
[Bar chart: per-emotion recognition accuracy (%) for Angry, Disgust, Fear, Happiness, Sadness and Surprise, comparing PLPC + MFCC against PLPC + MFCC + WPC (coif5).]
Emotion Recognition results from Visual Features
[Bar chart: per-emotion recognition accuracy (%) for Angry, Disgust, Fear, Happiness, Sadness and Surprise, comparing VCCR, HCCR and VCCR + HCCR.]
Emotion Recognition results from Audio-Visual Features
Audio Channel: Left
[Bar chart: per-emotion recognition accuracy (%) for Angry, Disgust, Fear, Happiness, Sadness and Surprise, comparing Audio (Wavelet coif5), Audio (Wavelet db10), Video and Audio + Video.]
Summary of Results:
Audio-Visual
Emotion recognition:
96.67%
Combined features of both
audio and images.
Classifier: KNN
Visual Emotion
Recognition:
87.6%
Features: Horizontal and
Vertical Cross Correlation
Classifier: Ensemble KNN
Speech Emotion
Recognition:
69.83%
Features: PLPC, MFCC and
WPD PLPC-MFCC
Classifier: KNN
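A minimal sketch of the classification stage with scikit-learn; the number of neighbors k and the split seed are assumptions (not stated on the slides), while the 20% hold-out mirrors the reported evaluation protocol:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def train_and_score(features, labels, k=5, holdout=0.2, seed=0):
    """KNN on fused audio-visual feature vectors with a 20% hold-out split.
    Returns the hold-out classification accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=holdout, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```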
Comparison With Previous Works
Authors | Subject Independency | Audio Recognition Rate | Visual Recognition Rate | Audio-Visual Recognition Rate
Datcu et al. [1] | Yes (3-fold cross validation) | 55.90% | 37.70% | 56.30%
Paleari et al. [2] | Yes (no information) | 35.00% | 25.00% | 67.00%
Mansoorizadeh et al. [3] | No (10-fold cross validation) | 33.00% | 37.00% | 71.00%
Gajsek et al. [4] | No (5-fold cross validation) | 62.90% | 54.70% | 71.30%
Wang et al. [5] | No (10-fold cross validation) | 38.00% | 58.00% | 76.00%
Jiang et al. [6] | No (10-fold cross validation) | 52.19% | 46.78% | 66.54%
Huang et al. [7] | Yes (6-fold cross validation) | 48.40% | 54.85% | 61.10%
Zhalehpour et al. [8] | Yes (leave-one-subject-out) | 72.95% | 38.22% | 76.40%
Our Approach | No (20% hold-out validation) | 69.83% | 86.6% | 96.67%
Comparison With Previous Works
[Bar chart: audio, visual and audio-visual recognition rates (%) for each of the works listed above.]
Future Work
Wavelet Decomposition
Level Increase
Effects of wavelet decomposition level, wavelet
decomposition filters on accuracy
Nasolabial region, cheek
region and other regions
Visual features from other significant facial regions and
their effects on accuracy
VOP (Vowel onset point)
Speech features
Instead of using entire speech, only vowel onset point
features and their effects on recognition capability
Speech features from
pitch cycle
Instead of block processing, features can be extracted
from pitch cycles of speech.
References
[1] D. Datcu and L. Rothkrantz, “Multimodal recognition of emotions in car environments,” DCI&I 2009, 2009.
[2] M. Paleari, B. Huet, and S. Antipolis, “Towards Multimodal Emotion Recognition: A New Approach,” pp. 174–181.
[3] M. Mansoorizadeh and N. Moghaddam, “Multimodal information fusion application to human emotion recognition from face and speech,” pp. 277–297, 2010.
[4] V. Štruc, “Multi-modal Emotion Recognition using Canonical Correlations and Acoustic Features,” no. i, pp. 4141–4144, 2010.
[5] S. Zhalehpour, Z. Akhtar, and C. E. Erdem, “Multimodal Emotion Recognition with Automatic Peak Frame Selection,” pp. 0–5, 2014.
[6] K. Huang, H. S. Lin, J. Chan, and Y. Kuo, “Learning Collaborative Decision-Making Parameters for Multimodal Emotion Recognition.”
[7] D. Jiang, Y. Cui, X. Zhang, and P. Fan, “Audio Visual Emotion Recognition Based on Triple-Stream Dynamic Bayesian Network Models,” pp. 609–610.
[8] Y. Wang, L. Guan, and A. N. Venetsanopoulos, “Kernel Cross-Modal Factor Analysis for Information Fusion With Application to Bimodal Emotion Recognition,” vol. 14, no. 3, pp. 597–607, 2012.
Thank You!!

More Related Content

What's hot

Master thesispresentation
Master thesispresentationMaster thesispresentation
Master thesispresentation
Matthew Urffer
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
ijceronline
 
Lossy Compression Using Stationary Wavelet Transform and Vector Quantization
Lossy Compression Using Stationary Wavelet Transform and Vector QuantizationLossy Compression Using Stationary Wavelet Transform and Vector Quantization
Lossy Compression Using Stationary Wavelet Transform and Vector Quantization
Omar Ghazi
 

What's hot (20)

Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
 
Novel Approach of Implementing Psychoacoustic model for MPEG-1 Audio
Novel Approach of Implementing Psychoacoustic model for MPEG-1 AudioNovel Approach of Implementing Psychoacoustic model for MPEG-1 Audio
Novel Approach of Implementing Psychoacoustic model for MPEG-1 Audio
 
Spectral Analysis of Sample Rate Converter
Spectral Analysis of Sample Rate ConverterSpectral Analysis of Sample Rate Converter
Spectral Analysis of Sample Rate Converter
 
Advance Digital Video Watermarking based on DWT-PCA for Copyright protection
Advance Digital Video Watermarking based on DWT-PCA for Copyright protectionAdvance Digital Video Watermarking based on DWT-PCA for Copyright protection
Advance Digital Video Watermarking based on DWT-PCA for Copyright protection
 
Master thesispresentation
Master thesispresentationMaster thesispresentation
Master thesispresentation
 
Analysis of PEAQ Model using Wavelet Decomposition Techniques
Analysis of PEAQ Model using Wavelet Decomposition TechniquesAnalysis of PEAQ Model using Wavelet Decomposition Techniques
Analysis of PEAQ Model using Wavelet Decomposition Techniques
 
FIR Filter Design using Particle Swarm Optimization with Constriction Factor ...
FIR Filter Design using Particle Swarm Optimization with Constriction Factor ...FIR Filter Design using Particle Swarm Optimization with Constriction Factor ...
FIR Filter Design using Particle Swarm Optimization with Constriction Factor ...
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Fractal Compression of an AVI Video File using DWT and Particle Swarm Optimiz...
Fractal Compression of an AVI Video File using DWT and Particle Swarm Optimiz...Fractal Compression of an AVI Video File using DWT and Particle Swarm Optimiz...
Fractal Compression of an AVI Video File using DWT and Particle Swarm Optimiz...
 
Wavelet Based Image Watermarking
Wavelet Based Image WatermarkingWavelet Based Image Watermarking
Wavelet Based Image Watermarking
 
Fractal Image Compression Using Quadtree Decomposition
Fractal Image Compression Using Quadtree DecompositionFractal Image Compression Using Quadtree Decomposition
Fractal Image Compression Using Quadtree Decomposition
 
Lossy Compression Using Stationary Wavelet Transform and Vector Quantization
Lossy Compression Using Stationary Wavelet Transform and Vector QuantizationLossy Compression Using Stationary Wavelet Transform and Vector Quantization
Lossy Compression Using Stationary Wavelet Transform and Vector Quantization
 
Generating a time shrunk lecture video by event
Generating a time shrunk lecture video by eventGenerating a time shrunk lecture video by event
Generating a time shrunk lecture video by event
 
High Speed and Area Efficient 2D DWT Processor Based Image Compression
High Speed and Area Efficient 2D DWT Processor Based Image CompressionHigh Speed and Area Efficient 2D DWT Processor Based Image Compression
High Speed and Area Efficient 2D DWT Processor Based Image Compression
 
W4101139143
W4101139143W4101139143
W4101139143
 
Foteini Setaki - Acoustics by Additive Manufacturing
Foteini Setaki - Acoustics by Additive ManufacturingFoteini Setaki - Acoustics by Additive Manufacturing
Foteini Setaki - Acoustics by Additive Manufacturing
 
Ky2418521856
Ky2418521856Ky2418521856
Ky2418521856
 
International journal of signal and image processing issues vol 2015 - no 1...
International journal of signal and image processing issues   vol 2015 - no 1...International journal of signal and image processing issues   vol 2015 - no 1...
International journal of signal and image processing issues vol 2015 - no 1...
 
Real-time neural text-to-speech with sequence-to-sequence acoustic model and ...
Real-time neural text-to-speech with sequence-to-sequence acoustic model and ...Real-time neural text-to-speech with sequence-to-sequence acoustic model and ...
Real-time neural text-to-speech with sequence-to-sequence acoustic model and ...
 
H010234144
H010234144H010234144
H010234144
 

Similar to Audio Visual Emotion Recognition Using Cross Correlation and Wavelet Packet Domain Features

THE TELEVISION SYSTEM IN INDIA
THE TELEVISION SYSTEM IN INDIATHE TELEVISION SYSTEM IN INDIA
THE TELEVISION SYSTEM IN INDIA
Ishank Ranjan
 
Mphil Transfer
Mphil TransferMphil Transfer
Mphil Transfer
spachoud
 
Computed radiography &digital radiography
Computed radiography &digital radiographyComputed radiography &digital radiography
Computed radiography &digital radiography
Rad Tech
 

Similar to Audio Visual Emotion Recognition Using Cross Correlation and Wavelet Packet Domain Features (20)

Emotion Recognition.pptx
Emotion Recognition.pptxEmotion Recognition.pptx
Emotion Recognition.pptx
 
THE TELEVISION SYSTEM IN INDIA
THE TELEVISION SYSTEM IN INDIATHE TELEVISION SYSTEM IN INDIA
THE TELEVISION SYSTEM IN INDIA
 
N017428692
N017428692N017428692
N017428692
 
Isolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural networkIsolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural network
 
Performance Evaluation of Conventional and Hybrid Feature Extractions Using M...
Performance Evaluation of Conventional and Hybrid Feature Extractions Using M...Performance Evaluation of Conventional and Hybrid Feature Extractions Using M...
Performance Evaluation of Conventional and Hybrid Feature Extractions Using M...
 
Mphil Transfer
Mphil TransferMphil Transfer
Mphil Transfer
 
Audio/Speech Signal Analysis for Depression
Audio/Speech Signal Analysis for DepressionAudio/Speech Signal Analysis for Depression
Audio/Speech Signal Analysis for Depression
 
Realization and design of a pilot assist decision making system based on spee...
Realization and design of a pilot assist decision making system based on spee...Realization and design of a pilot assist decision making system based on spee...
Realization and design of a pilot assist decision making system based on spee...
 
ISSCS2011
ISSCS2011ISSCS2011
ISSCS2011
 
Performance analysis of the convolutional recurrent neural network on acousti...
Performance analysis of the convolutional recurrent neural network on acousti...Performance analysis of the convolutional recurrent neural network on acousti...
Performance analysis of the convolutional recurrent neural network on acousti...
 
Dynamic Audio-Visual Client Recognition modelling
Dynamic Audio-Visual Client Recognition modellingDynamic Audio-Visual Client Recognition modelling
Dynamic Audio-Visual Client Recognition modelling
 
Computed radiography &digital radiography
Computed radiography &digital radiographyComputed radiography &digital radiography
Computed radiography &digital radiography
 
D04812125
D04812125D04812125
D04812125
 
05 comparative study of voice print based acoustic features mfcc and lpcc
05 comparative study of voice print based acoustic features mfcc and lpcc05 comparative study of voice print based acoustic features mfcc and lpcc
05 comparative study of voice print based acoustic features mfcc and lpcc
 
Speaker Recognition Using Vocal Tract Features
Speaker Recognition Using Vocal Tract FeaturesSpeaker Recognition Using Vocal Tract Features
Speaker Recognition Using Vocal Tract Features
 
Emotion Recognition Based On Audio Speech
Emotion Recognition Based On Audio SpeechEmotion Recognition Based On Audio Speech
Emotion Recognition Based On Audio Speech
 
Speaker and Speech Recognition for Secured Smart Home Applications
Speaker and Speech Recognition for Secured Smart Home ApplicationsSpeaker and Speech Recognition for Secured Smart Home Applications
Speaker and Speech Recognition for Secured Smart Home Applications
 
1801 1805
1801 18051801 1805
1801 1805
 
1801 1805
1801 18051801 1805
1801 1805
 
Introduction to deep learning based voice activity detection
Introduction to deep learning based voice activity detectionIntroduction to deep learning based voice activity detection
Introduction to deep learning based voice activity detection
 

Recently uploaded

FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 

Recently uploaded (20)

A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 

Audio Visual Emotion Recognition Using Cross Correlation and Wavelet Packet Domain Features

  • 1. Audio-Visual Emotion Recognition Supervisor: Dr. Celia Shahnaz Professor Bangladesh University of Engineering and Technology Shamman Noor (1206053) Ehsan Ahmed Dhrubo (1206100)
  • 2. Why Emotion Recognition ? 1. Enhancing naturalness in human machine interaction (for example: on-board car driving system, autonomous call center services, interactive movie, story telling, E-tutoring, autonomous psychological therapy etc. ) 2. Speech-to-speech translation system 3. Medical disorder diagnosis 4. Indexing and retrieving audio/video files based on emotions and so on
  • 3. VIDEO AUDIO PLPC MFCC WAVELET PACKET FRAMES VCCR HCCR FRAMESPLPC MFCC FINAL FEATURE TRAIN SUBJECTS CLASSIFICATION TEST SUBJECTS DECISION
  • 4. Proposed Method for Visual Emotion Detection VIDEO MOUTH REGION EYE REGION FRAME RGB TO GRAY CONVERSION VERTICAL CCR HORIZONTAL CCR SILENT FRAMES REMOVAL VERTICAL CCR HORIZONTAL CCR FINAL FEATURE TRAIN SUBJECTS CLASSIFICATION TEST SUBJECTS DECISION
  • 5. Pre-processing of video frames 1. Each video frame is converted from RGB scale to gray scale and median filtered. Why? RGB scale isn’t needed as only geometrical information is extracted. Median filtering removes noisy pixels from image.
  • 6. Removing silent frames 1. Short time energy is utilized to remove frames where no speech is given by the subject. Why? Gives only the frames where speech is provided and emotion is expressed. Reduces redundant frames and increases accuracy to a great extent. Why not in audio? Removes the Vowel Onset Points (VOP), which carry vital information about specific emotions. Silent region 1 Speech region 1 Silent region 2 Silent region3 Speech region 2 Silent region 4
  • 7. Eye and mouth segmentation by Viola Jones 1. Viola Jones algorithm is applied to each frame for selecting only the mouth and eye regions of face. Why? Emotion is primarily expressed by using different shapes of mouth and eyes and relative location between eyes and eyebrows. Eye Region Extraction Mouth Region Extraction Original Gray Scale Image
  • 8. Extracting Vertical and Horizontal Cross Correlation 1. Cross Correlation features, Vertical Cross Correlation (between each two columns) and Horizontal Cross Correlation (between each two rows) are extracted from segmented regions. Why? Cross Correlation features give detailed geometrical shape information with minimal computational cost. x(n) y(n) x(n) y(n) 𝑅 𝑥𝑦(𝑚) 𝑅 𝑥𝑦(𝑚) = 𝑛=0 𝑁 −𝑚 −1 𝑥 𝑚+𝑛 𝑦𝑛 ∗ 𝑅 𝑥𝑦(𝑚) Vertical Cross Correlation Horizontal Cross Correlation
  • 9. Horizontal CCR - Eye Horizontal CCR - Mouth Vertical CCR - Eye Vertical CCR - Mouth
  • 10. Horizonal Cross Correlation-Mouth Left eye is considered for symmetry Left Half of mouth region is considered due to symmetry Verical Cross Correlation - Eye Left eye is considered for symmetricity Verical Cross Correlation - Mouth Left Half of mouth region is considered due to symmetry Complete Visual Feature from one video file (234 length): Horizonal Cross Correlation-Eye HCCR Eye (57) HCCR Mouth (65) VCCR Eye (44) VCCR Mouth (68)
  • 11. Proposed Method for Emotion Detection from speech AUDIO PLPC MFCC WAVELET PACKET FRAMESPLPC MFCC PRE EMPHASIS FRAMING AND WINDOWING MEAN MEAN FINAL FEATURE TRAIN SUBJECTS CLASSIFICATION TEST SUBJECTS DECISION Proposed Features: 1. Perceptual Linear Predictive Coefficients 2. Mel Frequency Cepstral Coefficients 3. Perceptual Linear Predictive Coefficients and Mel Frequency Cepstral Coefficient of Wavelet Packet Coefficients
  • 12. Channel Selection and Endpoint detection 1. Left, right and mono channels are selected for feature extraction. 2. Short time enegy is calculated for threholding the frames at starting and ending of audio signal. Why? Each channel gives slightly different accuracies. Silent regions at starting and ending lowers the accuracy value.
  • 13. Pre-Emphasis filtering 1. Audio signal is passed through a first order high pass filter with pre-emphasis coefficient 0.9785. Why? To emphasis information of formants and remove the impact of excitation source. H(Z) V(Z) 𝟏 𝐇(𝐙) 𝐇 𝐳 = 𝟏 − 𝐚𝐳−𝟏 E(z) S(z) H(z) = Pre Emphasis Filter E(z) = Excitation Source V(z) = Vocal Tract Filter S(s) = Sound Signal
  • 14. Framing and windowing 1. Audio signal is framed into 25ms windows, with 10ms overlaps and windowed by Hamming window. Why? Within 10-25ms duration, speech can be considered quasi-stationary. Hamming windowing prevents Gibbs phenomena. 25ms 25ms 25ms Audio : 10ms … Block Processing
  • 15. Perceptual Linear Predictive Co-efficient (PLPC) Extraction 1. 12 order (length 13) PLPCs are extracted applying Bark filter bank from each 25ms frame. Why? Bark filter bank or Bark frequency scale represents how the human ear perceives frequency ranges, which is useful for emotion related information extraction. 𝐵𝑎𝑟𝑘 = 13 tan−1 ( 0.76𝑓 1000 ) + 3.5 tan−1 ( 𝑓2 75002) Bark Frequency conversion formula:
  • 16. Mel-Frequency Cepstral Coefficient (MFCC) Extraction 1. 26 MFCCs, using 13 filters of Mel-Frequency band are extracted from 25ms frames. Why? Mel scale represents a different entity for human-ear frequency range perception mechanism. Both Bark and Mel scale is utilized here for better understanding information related to emotion. 𝑚𝑒𝑙 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 = 2595 log(1 + 7 700.0 ) Mel Frequency conversion formula: Mel Filter Banks
  • 17. Temporal Smoothing 1. Temporal smoothing or averaging filtering of length 3 (taking into account two previous and two following frames) is applied to each frame’s features. Why? Removes any sudden changes in features due to noisy speech samples. xsma(n) = 1 W i=−(W−1)/2 (W−1)/2 x(n + i) Smoothed feature vector, x(n) = feature vector W = 3, smoothing window
  • 18. Applying Statistical Functionals 1. The mean of all frames' features is taken. Why? The effect of all frames and their temporal evolution is taken into account, since emotions are expressed over long durations, not just within 25 ms frames. Statistical functional (mean): Mean = (1/N) Σ from n = 0 to N−1 of x(n), where x(n) is the feature vector of frame n and N is the number of frames.
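Averaging over the frame axis collapses the whole frame sequence into one utterance-level vector. A tiny illustration; the dimensions (120 frames, 39 coefficients) are made up for the example, not taken from the slides.

```python
import numpy as np

# Illustrative stand-in for the smoothed per-frame features:
# 120 frames, 39 coefficients each (dimensions are arbitrary here).
rng = np.random.default_rng(0)
frame_features = rng.standard_normal((120, 39))

# The mean functional collapses the frame axis, leaving one
# fixed-length vector per utterance regardless of its duration.
utterance_feature = frame_features.mean(axis=0)
```

This is what makes variable-length recordings comparable: every utterance ends up as one vector of the same length, ready for classification.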
  • 19. Wavelet Packet Decomposition 1. 3-level wavelet packet decomposition is performed using both Coiflet and Daubechies filters after down-sampling to 16 kHz. Why? Wavelet packet decomposition provides another perceptual frequency scale. Down-sampling reduces computational cost. [Wavelet packet tree diagram: node (0,0) at level 0; (1,0), (1,1) at level 1; (2,0), (2,1) at level 2; (3,0), (3,1) at level 3; the bold-faced nodes' coefficients are used.]
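A wavelet packet tree splits every node at every level, unlike the plain wavelet transform which only splits the low-pass branch. The sketch below uses the simple Haar filter pair purely for illustration; the slides use coif5 and db10 filters, and the input length is assumed divisible by 2^levels.

```python
import numpy as np

def haar_step(x):
    """One analysis step: Haar lowpass/highpass filtering with
    downsampling by 2. Input length must be even."""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # lowpass branch
    detail = (even - odd) / np.sqrt(2)   # highpass branch
    return approx, detail

def wavelet_packet(x, levels=3):
    """Full wavelet-packet tree: split every node at every level,
    returning the 2**levels leaf coefficient vectors."""
    nodes = [x]
    for _ in range(levels):
        nodes = [half for node in nodes for half in haar_step(node)]
    return nodes
```

Because the Haar pair is orthonormal, the total energy of the leaves equals the energy of the input; the pipeline in the slides then keeps only a subset of nodes (the bold-faced ones in the tree diagram) for feature extraction.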
  • 20. PLPC and MFCC of Wavelet Coefficients 1. PLPCs and MFCCs are extracted from each of the four selected wavelet coefficient sets using the method described above. Why? The combination of three different perceptual frequency scales provides emotion-related information in finer detail. [Diagram: wavelet packet filter banks for level 3, with sub-bands at f_n/8, f_n/4, f_n/2 and f_n.]
  • 21. Complete audio feature from one video file (length 195): 13 PLPC features (12th-order linear predictive coefficients on the Bark frequency scale), 26 MFCC features (13 Mel filter banks and a type-2 Discrete Cosine Transform), and 156 wavelet-packet PLPC and MFCC features (13 PLPCs and 26 MFCCs from each of the 4 wavelet coefficient sets).
  • 22. Complete audio-visual feature from one video file (length 429): Audio feature, length 195 (13 PLPC, 26 MFCC, 156 WPD PLPC and WPD MFCC); Video feature, length 234 (57 HCCR eye, 65 HCCR mouth, 44 VCCR eye, 68 VCCR mouth).
  • 23. Emotion Recognition results from Speech Features (Channel: Left) [Bar chart: per-emotion recognition accuracy (0-100%) for Angry, Disgust, Fear, Happiness, Sadness and Surprise, comparing PLPC + MFCC against PLPC + MFCC + WPC (coif5).]
  • 24. Emotion Recognition results from Visual Features [Bar chart: per-emotion recognition accuracy for Angry, Disgust, Fear, Happiness, Sadness and Surprise, comparing VCCR, HCCR and VCCR + HCCR.]
  • 25. Emotion Recognition results from Audio-Visual Features (Audio Channel: Left) [Bar chart: per-emotion recognition accuracy for Angry, Disgust, Fear, Happiness, Sadness and Surprise, comparing Audio (Wavelet coif5), Audio (Wavelet db10), Video and Audio + Video.]
  • 26. Summary of Results: Speech emotion recognition: 69.83% (features: PLPC, MFCC and WPD PLPC-MFCC; classifier: KNN). Visual emotion recognition: 87.6% (features: horizontal and vertical cross-correlation; classifier: ensemble KNN). Audio-visual emotion recognition: 96.67% (combined audio and visual features; classifier: KNN).
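The KNN classification used throughout can be sketched without a library. This is a minimal numpy-only illustration of k-nearest-neighbour majority voting, not the actual trained system; the function name, k value and toy data are ours.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify feature vector x by majority vote among the labels of
    its k nearest training vectors (Euclidean distance)."""
    dists = np.linalg.norm(train_X - x, axis=1)   # distance to every sample
    nearest = np.argsort(dists)[:k]               # indices of k closest
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]
```

In the pipeline, `train_X` would hold the 429-dimensional audio-visual vectors of the training subjects and `train_y` their emotion labels; each test utterance is classified independently.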
  • 27. Comparison With Previous Works
  Datcu et al. [1] (subject independent, 3-fold cross validation): audio 55.9%, visual 37.70%, audio-visual 56.30%
  Paleari et al. [2] (subject independent, no information on validation): audio 35.00%, visual 25.00%, audio-visual 67.00%
  Mansoorizadeh et al. [3] (subject dependent, 10-fold cross validation): audio 33.00%, visual 37.00%, audio-visual 71.00%
  Gajsek et al. [4] (subject dependent, 5-fold cross validation): audio 62.90%, visual 54.70%, audio-visual 71.30%
  Wang et al. [5] (subject dependent, 10-fold cross validation): audio 38.00%, visual 58.00%, audio-visual 76.00%
  Jiang et al. [6] (subject dependent, 10-fold cross validation): audio 52.19%, visual 46.78%, audio-visual 66.54%
  Huang et al. [7] (subject independent, 6-fold cross validation): audio 48.40%, visual 54.85%, audio-visual 61.10%
  Zhalehpour et al. [8] (subject independent, leave-one-subject-out): audio 72.95%, visual 38.22%, audio-visual 76.40%
  Our approach (subject dependent, 20% hold-out validation): audio 69.83%, visual 86.6%, audio-visual 96.67%
  • 28. Comparison With Previous Works [Bar chart: audio, visual and audio-visual recognition rates (0-100%) for each of the works listed in the previous table.]
  • 29. Future Work 1. Wavelet decomposition level increase: effects of the wavelet decomposition level and decomposition filters on accuracy. 2. Nasolabial region, cheek region and other regions: visual features from other significant facial regions and their effects on accuracy. 3. VOP (Vowel Onset Point) speech features: using only vowel onset point features instead of the entire speech signal, and their effects on recognition capability. 4. Speech features from pitch cycles: instead of block processing, features can be extracted from the pitch cycles of speech.
  • 30. References
  [1] D. Datcu and L. Rothkrantz, "Multimodal recognition of emotions in car environments," DCI&I 2009, 2009.
  [2] M. Paleari, B. Huet, and S. Antipolis, "Towards Multimodal Emotion Recognition: A New Approach," pp. 174-181.
  [3] M. Mansoorizadeh and N. Moghaddam, "Multimodal information fusion application to human emotion recognition from face and speech," pp. 277-297, 2010.
  [4] V. Struc, "Multi-modal Emotion Recognition using Canonical Correlations and Acoustic Features," no. i, pp. 4141-4144, 2010.
  [5] S. Zhalehpour, Z. Akhtar, and C. E. Erdem, "Multimodal Emotion Recognition with Automatic Peak Frame Selection," pp. 0-5, 2014.
  [6] K. Huang, H. S. Lin, J. Chan, and Y. Kuo, "Learning Collaborative Decision-Making Parameters for Multimodal Emotion Recognition."
  [7] D. Jiang, Y. Cui, X. Zhang, and P. Fan, "Audio Visual Emotion Recognition Based on Triple-Stream Dynamic Bayesian Network Models," pp. 609-610.
  [8] Y. Wang, L. Guan, and A. N. Venetsanopoulos, "Kernel Cross-Modal Factor Analysis for Information Fusion With Application to Bimodal Emotion Recognition," vol. 14, no. 3, pp. 597-607, 2012.