Multimodal Emotion Recognition in Conversational Settings
Nastaran Saffaryazdi
Empathic Computing Lab
Motivation
● Emotions are multimodal processes that play a crucial role in our everyday lives
● Recognizing emotions is becoming more essential in a wide range of application
domains
○ Healthcare
○ Education
○ Entertainment
○ Advertisement
○ Customer services
Motivation
● Many application areas involve human-human or human-machine conversations
● Conversational emotion recognition has focused mainly on facial expressions and text
● Human behavior can be controlled or faked
● Designing a general model from behavioral cues is difficult because of cultural and language differences
● Physiological signals are more reliable, but they are very weak
Research Focus
How can we combine various behavioral and physiological
cues to recognize emotions in human conversations and
enhance empathy in human-machine interactions?
Research Questions
● RQ1: How can human body responses be employed to identify emotions?
● RQ2: How can the data be obtained, captured, and processed simultaneously from multiple sensors?
● RQ3: Can a combination of physiological cues be used to recognize emotions in conversations accurately?
● RQ4: Can we increase the level of empathy between humans and machines using neural and physiological signals?
RQ1
What are the human body's responses to emotional stimuli, and how can these diverse responses be employed to identify emotions?
Approach: reviewing and replicating existing research on human emotion recognition using various modalities
Multimodal Emotion Recognition
● Behavioral datasets
  ○ CMU-MOSEI
  ○ SEMAINE
  ○ IEMOCAP
  ○ …
● Multimodal datasets with neural or physiological signals
  ○ DEAP
  ○ MAHNOB-HCI
  ○ RECOLA
  ○ SEED-IV
  ○ AMIGOS
No research has specifically studied brain activity and physiological signals for recognizing emotion in human-human and human-machine conversations.
Study 1: Multimodal Emotion Recognition while Watching Videos
● 23 participants (13 female, 10 male), 21 to 44 years old (μ = 30, σ = 6)
● Setting: watching videos
Saffaryazdi, N., Wasim, S. T., Dileep, K., Nia, A. F., Nanayakkara, S., Broadbent, E., & Billinghurst, M. (2022). Using facial micro-expressions in combination with EEG and physiological signals for emotion recognition. Frontiers in Psychology, 13, 864047.
Sensors
● OpenBCI EEG cap
  ○ EEG
● Shimmer3 GSR+ module
  ○ EDA
  ○ PPG
● RealSense camera
  ○ Facial video
Summary (RQ1)
● Identified various modalities and recognition methods
● Fused facial micro-expressions with EEG and physiological signals
● Identified limitations and challenges
  ○ Acquiring data from multiple sensors simultaneously
  ○ Real-time data monitoring
  ○ Automatic scenario running
  ○ Personality differences
RQ2
How can the required data for emotion recognition be obtained, captured, and processed simultaneously in conversation from multiple sensors?
Approach: developing software for simultaneously acquiring, visualizing, and processing multimodal data
Octopus Sensing
● Octopus-sensing
  ○ Simple unified interface for
    ■ Simultaneous data acquisition
    ■ Simultaneous data recording
  ○ Study design components
● Octopus-sensing-monitoring
  ○ Real-time monitoring
● Octopus-sensing-visualizer
  ○ Offline synchronous data visualizer
● Octopus-sensing-processing
  ○ Real-time processing
Octopus Sensing
● Multiplatform
● Open-source (https://github.com/octopus-sensing)
● https://octopus-sensing.nastaran-saffar.me/
● Supports various sensors
  a. OpenBCI
  b. Brainflow
  c. Shimmer3
  d. Camera
  e. Audio
  f. Network (Unity and MATLAB)
Saffaryazdi, N., Gharibnavaz, A., & Billinghurst, M. (2022). Octopus Sensing: A Python library for human behavior studies. Journal of Open Source Software, 7(71), 4045.
How to use
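A minimal sketch of how a recording session could look with Octopus Sensing is shown below. The module paths, class names, and arguments (DeviceCoordinator, Shimmer3Streaming, CameraStreaming, start_message/stop_message) follow my recollection of the project documentation and may differ from the current API, so treat this as an illustration rather than a reference.

import time

from octopus_sensing.device_coordinator import DeviceCoordinator
from octopus_sensing.devices import Shimmer3Streaming, CameraStreaming
from octopus_sensing.common.message_creators import start_message, stop_message

# Register the sensors once; the coordinator keeps their recordings synchronized
coordinator = DeviceCoordinator()
coordinator.add_devices([
    Shimmer3Streaming(name="shimmer", output_path="./output"),           # EDA + PPG
    CameraStreaming(camera_no=0, name="camera", output_path="./output"), # facial video
])

# Mark the start and end of one stimulus so every device logs the same markers
coordinator.dispatch(start_message("experiment-1", "stimulus-1"))
time.sleep(10)  # present the stimulus / run the conversation segment here
coordinator.dispatch(stop_message("experiment-1", "stimulus-1"))

coordinator.terminate()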
Octopus-sensing-monitoring
● Monitor data from any machine on the same network
● monitoring = MonitoringEndpoint(device_coordinator)
  monitoring.start()
● pipenv run octopus-sensing-monitoring
● http://machine-IP:8080
Octopus-sensing-visualizer
● pipenv run octopus-sensing-visualizer
● http://localhost:8080
● Visualizes raw or processed data using a config file
Octopus-sensing-processing
RQ3
Can a combination of physiological cues be used to recognize emotions in conversations accurately?
Approach: conducting user studies and creating multimodal datasets in various settings to compare human responses and explore recognition methods
Multimodal Emotion Recognition in Conversation
● Creating a conversational setting in which people could feel emotions spontaneously
● 10-minute conversation for each emotion topic
● Self-report at the beginning, middle, and end: arousal, valence, emotion
Study 2: Multimodal Emotion Recognition in Conversation
● 23 participants (13 female, 10 male), 21 to 44 years old (μ = 30, σ = 6)
● Setting: face-to-face conversation
Saffaryazdi, N., Goonesekera, Y., Saffaryazdi, N., Hailemariam, N. D., Temesgen, E. G., Nanayakkara, S., ... & Billinghurst, M. (2022, March). Emotion recognition in conversations using brain and physiological signals. In 27th International Conference on Intelligent User Interfaces (pp. 229-242).
Study 2 - Results: Self-report vs. target emotion
Study 2 - Results: Recognition F-score
Study 3: Face-to-Face vs. Remote Conversations
Comparing various conversational settings
● 15 participants (7 female, 8 male), 21 to 36 years old (μ = 28.6, σ = 5.7)
● Settings: face-to-face and Zoom conversations
Saffaryazdi, N., Kirkcaldy, N., Lee, G., Loveys, K., Broadbent, E., & Billinghurst, M. (2024). Exploring the impact of computer-mediated emotional interactions on human facial and physiological responses. Telematics and Informatics Reports, 14, 100131.
Study 3: Features
● EDA: phasic statistics, tonic statistics, and peak statistics
● PPG: heart rate variability (HRV) time-domain features, extracted with the heartpy and neurokit2 libraries (see the sketch below)
● Face: OpenFace action units 1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 20, 23, 25, 26, 28, and 45
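As a concrete illustration of the EDA and PPG features listed above, the snippet below computes them with neurokit2. The sampling rate and column names are assumptions based on neurokit2's documented conventions, not the study's exact pipeline.

import neurokit2 as nk

SAMPLING_RATE = 128  # assumed sampling rate; the study's sensors may differ

def eda_features(eda_signal):
    # Decompose EDA into tonic and phasic components and count SCR peaks
    signals, _ = nk.eda_process(eda_signal, sampling_rate=SAMPLING_RATE)
    phasic, tonic = signals["EDA_Phasic"], signals["EDA_Tonic"]
    return {
        "phasic_mean": phasic.mean(), "phasic_std": phasic.std(),
        "tonic_mean": tonic.mean(), "tonic_std": tonic.std(),
        "scr_peak_count": int(signals["SCR_Peaks"].sum()),
    }

def ppg_hrv_features(ppg_signal):
    # Detect beats in the PPG signal, then compute time-domain HRV features
    # (a DataFrame with columns such as HRV_MeanNN, HRV_SDNN, HRV_RMSSD)
    _, info = nk.ppg_process(ppg_signal, sampling_rate=SAMPLING_RATE)
    return nk.hrv_time(info["PPG_Peaks"], sampling_rate=SAMPLING_RATE)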
Study 3: Results
● Three-way repeated-measures ART ANOVA
● Physiological differences between face-to-face and remote conversations
  ○ Facial action units
    ■ Action units associated with negative emotions were higher in face-to-face
    ■ Action units associated with positive emotions were higher in remote
  ○ EDA (tonic, phasic, and peak statistics)
    ■ Reactions were substantial and immediate in face-to-face (higher phasic mean in face-to-face)
  ○ PPG (HRV time-domain features)
    ■ Higher HRV in remote conversations -> lower stress level, enhanced emotion regulation, more engagement
Study 3: Results
● One-way and two-way repeated-measures ART ANOVA
● Significant empathy factors
  ○ Interviewer-to-participant empathy
Study 3: Results
● Emotion recognition (setup sketched below)
  ○ Feature extraction
  ○ Random Forest classifier
  ○ Leave-one-subject-out cross-validation
● High accuracy -> high similarity of responses across the two settings
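The classification setup named above (a Random Forest with leave-one-subject-out cross-validation) can be sketched with scikit-learn as follows; the random feature matrix, labels, and macro-F1 scoring are placeholders, not the study's actual data or configuration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Placeholder data: one row of fused features per sample, an emotion label,
# and the subject each sample came from (10 subjects, 20 samples each)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = rng.integers(0, 4, size=200)
subjects = np.repeat(np.arange(10), 20)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
# Leave-one-subject-out: each fold holds out every sample from one subject
scores = cross_val_score(clf, X, y, groups=subjects,
                         cv=LeaveOneGroupOut(), scoring="f1_macro")
print(f"Mean F1 across held-out subjects: {scores.mean():.3f}")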
Findings (RQ3)
● People showed more positive facial expressions in remote conversations
● People felt stronger and more immediate emotions in the face-to-face condition
● People felt a lower level of stress in the remote condition
● Limitations
  ○ Sample size
  ○ Effect of the interviewer
  ○ Familiarity with remote conversations
● Cross-use of the multimodal datasets was quite successful
● Physiological signals are effective for conversational emotion recognition
● These datasets can be used to train models for real-time emotion recognition
RQ4
Can we increase the level of empathy between humans and machines by using neural and physiological signals to detect emotions in real time during conversations?
Approach:
● Developing a real-time emotion recognition system using multimodal data
● Prototyping an empathetic conversational agent by feeding it the detected emotions in real time
Study 4: Interaction with a Digital Human from Soul Machines
Human-computer conversation
● 23 participants (17 female, 6 male), 21 to 44 years old (μ = 30, σ = 6)
● Setting: human-digital human conversation
Human-Machine Conversation (Study 4)
● Interaction between a human and a Digital Human
  ○ Neutral agent
  ○ Empathetic agent
● Real-time emotion recognition based on physiological cues (see the sketch below)
● Evaluating the interaction with the digital human
  ○ Empathy factors
  ○ Quality of Interaction
  ○ Degree of Rapport
  ○ Degree of Liking
  ○ Degree of Social Presence
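The real-time recognition step above can be pictured with the hypothetical loop below. None of it is the actual system: the sensor reader, feature extractor, classifier, and agent interface (read_window, extract_features, update_emotion) are invented placeholder names, and the 5-second window is an assumed choice.

import time

WINDOW_SECONDS = 5  # assumed analysis window length

def realtime_emotion_loop(sensors, extract_features, classifier, agent):
    # Hypothetical pipeline: read a window of EEG/EDA/PPG samples, turn it
    # into a feature vector, predict arousal and valence, and pass them on
    # to the empathetic agent so it can adapt its responses
    while agent.in_conversation():
        window = sensors.read_window(WINDOW_SECONDS)
        features = extract_features(window)
        arousal, valence = classifier.predict([features])[0]
        agent.update_emotion(arousal=arousal, valence=valence)
        time.sleep(WINDOW_SECONDS)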
Study 4 - Results
● Real-time emotion recognition
  ○ Arousal: 69.1%, Valence: 57.3%
● Induction method evaluation
Real-time expression evaluation
● Appropriateness of reflected emotions in different agents
  ○ Appropriate emotion: F(1, 166) = 10, p < 0.002
  ○ Appropriate time: F(1, 166) = 6, p < 0.01
Empathy evaluation
● Overall empathy: F(1, 166) = 27, p < 0.001
● Cognitive empathy: F(1, 166) = 4.7, p < 0.03
● Affective empathy: F(1, 166) = 5.4, p < 0.02
Human-Agent Rapport
● Degree of Rapport (DoR)
○ F(1, 42) = 8.38, p = 0.006
● Degree of Liking (DoL)
○ F(1, 42) = 6.64, p < 0.01
● Degree of Social Presence (DSP)
○ Not significantly different
● Quality of Interaction (QoI)
○ Not significantly different
Physiological Responses
● EEG
  ○ No significant differences
  ○ Shared neural processes
● EDA
  ○ Higher skin conductance when interacting with the empathetic agent
    ■ Higher emotional arousal (excitement, engagement)
● PPG
  ○ Higher HRV when interacting with the empathetic agent
    ■ Better mental health
    ■ Lower anxiety
    ■ Improved regulation of emotional responses
    ■ Higher empathy
    ■ Increased attention / engagement
Conclusions
● Multimodal emotion recognition using physiological modalities is a promising approach
● Four multimodal datasets collected in various settings
● The Octopus Sensing software suite
● Improved empathy between humans and machines using neural and physiological data
Thank you!
