Natural Language Processing:
From Human-Robot Interaction to
Alzheimer’s Detection
Jekaterina Novikova
29 November 2018
Content
1. Spoken Human-Robot Interaction (HRI)
a. Hybrid Chat-Task Dialogue
b. Multimodal dialogue evaluation
2. Evaluation of Natural Language Systems
a. Problems with existing automatic evaluation
b. Human evaluation is not so easy, either
c. Referenceless quality estimation
d. E2E Natural Language Generation (NLG) challenge
3. Alzheimer’s Detection
a. Effect of data
b. Semi-supervised learning
c. Early detection
Spoken Human-Robot Interaction
HRI - Hybrid Chat and Task Dialogue
MDP policy:
• States = [Distance, TaskCompleted, UserEngaged ...]
• Actions = [PerformTask, Greet, Goodbye, Chat,
GiveDirections, Wait, RequestTask, RequestShop]
• Reward function optimises for successful task completion and higher engagement
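The MDP policy above can be sketched with tabular Q-learning; the state encoding, reward weights and toy transition below are illustrative assumptions, not the actual setup from the paper.

```python
from collections import defaultdict

# Action set from the slide; state variables (Distance, TaskCompleted,
# UserEngaged) encoded as a tuple. Reward weights are assumed for illustration.
ACTIONS = ["PerformTask", "Greet", "Goodbye", "Chat",
           "GiveDirections", "Wait", "RequestTask", "RequestShop"]

def reward(state, action):
    """Toy reward: favour successful task completion and user engagement."""
    distance, task_completed, user_engaged = state
    r = 0.0
    if action == "PerformTask" and task_completed:
        r += 10.0          # successful task completion
    if user_engaged and action == "Chat":
        r += 1.0           # keeping the user engaged
    return r

# Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
Q = defaultdict(float)
alpha, gamma = 0.1, 0.9

def q_update(s, a, r, s_next):
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

s = ("near", True, True)        # (Distance, TaskCompleted, UserEngaged)
a = "PerformTask"
q_update(s, a, reward(s, a), s)  # one toy update step
```

In a real system the next state would come from the dialogue environment rather than being the same state, and the policy would be trained over many episodes.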
HRI - Hybrid Chat and Task Dialogue
*I.Papaioannou, C.Dondrup, J.Novikova, O.Lemon. Hybrid Chat and Task Dialogue for More Engaging HRI Using Reinforcement
Learning, In Ro-MAN 2017
With hybrid chat+task dialogue employed:
• the system received significantly higher ratings
• the duration of the interaction was longer (not significantly)
“It was interesting to see that the more I interacted
with the robot the more I could discover new
possible questions and answers. This made me feel
that I could actually try to make a conversation with
the robot.”
HRI - Multimodal Dialogue Evaluation
• Task-related human-robot dialogue with the Pepper robot
• Emotional and dialogue-related features extracted
*J.Novikova, C.Dondrup, I.Papaioannou and O.Lemon. Sympathy Begins with a Smile, Intelligence Begins with a Word: Use of Multimodal Features in Spoken Human-Robot Interaction, ACL workshop RoboNLP, 2017
Emotional features (Happiness, Surprise, Sadness):
• Best predictor for: Likeability
• Highest correlation with: Friendly, Nice, Sensible

Dialogue-related, linguistic features (utterance length, words/utterance, unique words/utterance, lexical diversity, # sentences, words/sentence, unique words/sentence):
• Best predictor for: Perceived intelligence
• Highest correlation with: Conscious, Humanlike, Natural

Dialogue-related, non-linguistic features (speech duration, number of turns, # self-repetitions, # completed tasks, tasks/turn):
• Best predictor for: Perceived intelligence, interpretability
• Highest correlation with: Intelligent / Unintelligent, Knowledgeable / Ignorant
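The dialogue-related linguistic features listed above (word counts, unique words, lexical diversity) can be computed with a short sketch like the following; the naive whitespace tokenisation and punctuation-based sentence splitting are simplifying assumptions.

```python
# Minimal sketch of per-utterance linguistic features; tokenisation is naive.
def linguistic_features(utterance: str) -> dict:
    words = utterance.lower().split()
    sentences = [s for s in
                 utterance.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return {
        "utterance_len": len(utterance),                      # characters
        "words_per_utterance": len(words),
        "unique_words_per_utterance": len(set(words)),
        "lexical_diversity": len(set(words)) / max(len(words), 1),  # TTR
        "num_sentences": len(sentences),
        "words_per_sentence": len(words) / max(len(sentences), 1),
    }

feats = linguistic_features("Hello robot. Can you give me directions?")
```

A production pipeline would use a proper tokeniser and sentence splitter, but the feature definitions stay the same.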
• Emotional features coming from real-time facial expression recognition could be used as an online estimator of dialogue success
• They could also serve as a reward signal for RL-based dialogue
Evaluation of Natural Language
Systems
NLG Systems - Automatic Evaluation
• Up to 60% of NLG research in 2012-2015 relies on automatic metrics
(Gkatzia and Mahamood, 2015)
• Large-scale analysis of correlation between automatic metrics and human
ratings
• 3 NLG systems of different approaches
• 3 datasets in two different domains
• 21 automatic metrics
• Detailed error analysis
Word-overlap metrics (used frequently): BLEU, NIST, TER, ROUGE, LEPOR, CIDEr, METEOR
Grammar-based metrics (not used frequently): Readability (cpw, wps, sps, spw, len, pol, ppw); Grammaticality (prs, msp)
*J.Novikova, O.Dušek, A. Cercas-Curry and V. Rieser. Why We Need New Evaluation Metrics for NLG. In Proceedings of EMNLP
2017
• Do they correlate with human preferences? No*
• BUT can be useful for error analysis: find cases where the system is
performing poorly
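The correlation analysis behind these findings amounts to rank-correlating a metric's scores against human ratings. A minimal sketch of Spearman's rho, computed from ranks; the scores below are made-up illustrative numbers, not results from the paper.

```python
# Spearman's rho = Pearson correlation of the rank-transformed scores.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

bleu_scores   = [0.31, 0.45, 0.12, 0.50, 0.28]   # hypothetical metric scores
human_ratings = [3.0, 2.5, 4.0, 2.0, 3.5]        # hypothetical human scores

rho = spearman(bleu_scores, human_ratings)       # here: perfectly inverted
```

This toy case is deliberately anti-correlated (rho = -1) to illustrate that a high metric score need not mean a high human rating; note the sketch does not handle tied ranks.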
NLG Systems - Human Evaluation
• Experimental design has a significant impact
on the reliability as well as the outcomes of
human evaluation.
• RankME combines relative rankings and magnitude estimation (Bard et al., 1996) with continuous scales.
• The RankME method significantly increases ICC (intra-class correlation, a measure of rater agreement).
• Results are consistent with TrueSkill (Herbrich et al., 2006), but RankME is more flexible and offers richer scales.
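Magnitude estimation lets each rater invent their own scale, so raw scores must be normalised per rater before aggregation. A minimal sketch assuming geometric-mean normalisation (a standard choice for magnitude-estimation data; the exact procedure in RankME may differ).

```python
import math

# Each rater's scores are divided by that rater's geometric mean, so two
# raters with the same preferences but different scales become comparable.
def normalise_rater(scores):
    gmean = math.exp(sum(math.log(s) for s in scores) / len(scores))
    return [s / gmean for s in scores]

rater_a = [50.0, 100.0, 200.0]   # rater using large numbers
rater_b = [1.0, 2.0, 4.0]        # rater using small numbers, same preferences

norm_a = normalise_rater(rater_a)
norm_b = normalise_rater(rater_b)  # identical after normalisation
```

After normalisation both raters yield [0.5, 1.0, 2.0], so their judgements can be averaged directly.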
*J.Novikova, O.Dusek and V.Rieser. RankME: Reliable Human Ratings for Natural Language Generation. In Proceedings of NAACL,
2018
NLG Systems - Referenceless QE
• The NL reference is treated as a gold standard, correct and complete. These assumptions are often invalid.
• Referenceless approach*:
• Based on an RNN
• Only the MR (meaning representation) matters
• Increases correlation with human judgements
*O.Dušek, J.Novikova and V.Rieser. Referenceless Quality Estimation for Natural Language Generation. ICML Workshop on
Learning to Generate Natural Language, 2017
• First QE in NLG, but related to QE in dialogue,
MT, grammatical error correction
• System can generalise to unseen NLG systems
in the same domain to some extent
• Limitations:
• Cross-domain generalisability is poor
• Very small amounts of in-domain / in-system data improve performance a lot
E2E NLG Challenge - Dataset Collection
• Well-known restaurant domain
• Bigger than previous sets
• 50k unaligned, longer MR+ref pairs
Example MR: name[Loch Fyne], eatType[restaurant], food[Japanese], price[cheap], kid-friendly[yes]
Reference 1: "Loch Fyne is a kid-friendly restaurant serving cheap Japanese food."
Reference 2: "Serving low cost Japanese style cuisine, Loch Fyne caters for everyone, including families with small children."

Dataset          Instances   MRs     Refs/MR   Slots/MR   W/Ref   Sent/Ref
E2E              51,426      6,039   8.21      5.73       20.34   1.56
SF Restaurants   5,192       1,914   1.91      2.63       8.51    1.05
BAGEL            404         202     2.00      5.48       11.55   1.03
*J.Novikova, O.Lemon, V.Rieser. Crowd-Sourcing NLG Data: Pictures Elicit Better Data, In Proceedings of INLG 2016
• More diverse & natural
• partially collected using pictorial MRs
• higher MSTTR, more rare words, more complex syntax
• noisier, but compensated by more refs per MR
E2E Dataset comparison
• vs. BAGEL & SFRest:
• Lexical richness
• higher lexical diversity
(Mean Segmental Token-Type Ratio)
• higher proportion of rare words
• Syntactic richness
• more complex sentences (D-Level)
Example references of increasing syntactic complexity:
The Vaults is an Indian restaurant.
Cocum is a very expensive restaurant but the quality is
great.
The coffee shop Wildwood has fairly priced food, while
being in the same vicinity as the Ranch.
Serving cheap English food, as well as having a coffee
shop, the Golden Palace has an average customer
ranking and is located along the riverside.
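MSTTR, the lexical diversity measure used in the comparison, can be sketched in a few lines: split the token stream into fixed-size segments and average the per-segment type-token ratios. The segment size and fallback behaviour below are illustrative choices, not the parameters from the paper.

```python
# Mean Segmental Type-Token Ratio: average TTR over fixed-size segments,
# which makes the measure comparable across texts of different lengths.
def msttr(tokens, segment_size=50):
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens), segment_size)]
    # Drop a short trailing segment so all segments are comparable
    segments = [seg for seg in segments if len(seg) == segment_size]
    if not segments:
        return len(set(tokens)) / len(tokens)  # fall back to plain TTR
    return sum(len(set(seg)) / len(seg) for seg in segments) / len(segments)

tokens = ("the vaults is an indian restaurant "
          "cocum is a very expensive restaurant but the quality is great").split()
score = msttr(tokens, segment_size=6)
```

Unlike plain TTR, MSTTR does not automatically shrink as the text gets longer, which is why it is a fairer way to compare datasets of different sizes.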
E2E NLG Challenge
• New neural NLG: so far limited to small datasets
• e.g. BAGEL, SF Restaurants/Hotels, RoboCup
• simple sentences, delexicalization (placeholders
instead of values)
• “E2E” NLG: Learning from data without
alignments
• no alignment annotation needed → easier to
collect data
• Our goal: replicate rich dialogue & discourse
phenomena
• as targeted by earlier rule-based & data-driven
approaches
• see how well new approaches fare if given enough
training data
Participation:
● 17 participants (⅓
from industry)
● 62 systems,
success!
*J.Novikova, O.Dusek and V.Rieser. The E2E Dataset: New Challenges For End-to-End Generation, In SIGDIAL, 2017 (best paper nominee)
E2E: Lessons learnt
• Semantic control (realizing all slots) is crucial for seq2seq systems
• beam reranking works well, attention-only performs poorly
• Open vocabulary – delexicalization easy & good
• other (copy mechanisms, sub-word/character models) also viable
• Diversity – hand-engineered systems seem better
• options for seq2seq: diverse ensembling, sampling
• might hurt naturalness
• Best method: rule-based or seq2seq with reranking
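Delexicalization, mentioned above as the easy and effective route to an open vocabulary, can be sketched as simple slot-value substitution; the MR format and placeholder style here are illustrative, not the challenge's exact conventions.

```python
# Toy delexicalization: replace slot values with placeholders so the model
# learns templates; relexicalization substitutes the values back afterwards.
def delexicalize(text: str, mr: dict) -> str:
    for slot, value in mr.items():
        text = text.replace(value, f"__{slot}__")
    return text

def relexicalize(template: str, mr: dict) -> str:
    for slot, value in mr.items():
        template = template.replace(f"__{slot}__", value)
    return template

mr = {"name": "Loch Fyne", "food": "Japanese"}
template = delexicalize("Loch Fyne serves cheap Japanese food.", mr)
restored = relexicalize(template, mr)
```

Real systems must also handle values that appear in slightly different surface forms ("cheap" vs "low cost"), which is where copy mechanisms and sub-word models come in.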
*O.Dusek, J.Novikova and V.Rieser. Findings of the E2E NLG challenge. In Proceedings of INLG, 2018
V.Rieser, J.Novikova, O.Dusek. The E2E NLG Challenge. In Journal of Computational Linguistics (in progress)
Alzheimer’s Detection from
Language
Winterlight’s Assessment Tools
AD Detection - Effect of Heterogeneous Data
• What data is useful for AD detection?
• Additional same-task data of healthy
subjects improves ML model performance by
13% (Noorian et al., 2017)
*A.Balagopalan, J.Novikova, F.Rudzicz and M.Ghassemi. The Effect of Heterogeneous Data for Alzheimer's Disease Detection from
Speech. In: NIPS Workshop on Machine Learning for Health ML4H, Montreal, 2018
• We experiment* with additional healthy samples from different tasks (verbal fluency, reading, spontaneous speech), obtaining an increase of up to 9% in F1 score. The effect is especially pronounced when the data come from healthy subjects aged over 60.
AD Detection - Semi-supervised Multimodal Learning
Motivation:
• Input features come from different modalities (acoustic, linguistic, etc.)
• Available training data may be unlabeled
Transductive Consensus Networks (TCN):
• Interpreter: converts each modality into a low-dimensional representation.
• A noise modality is added to prevent the discriminator from looking at only superficial aspects of each data sample.
• Optimization goal: a min-max consensus objective between interpreters and discriminators (see the paper* for the equation)
AD Detection - Semi-supervised Multimodal Learning
Three steps:
1. Interpreters try to produce indistinguishable representations for each modality
2. Discriminators try to recognize modality-specific information retained in the representations
3. A classifier trains the networks to make the correct decision
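The three alternating steps can be laid out as a training-step skeleton; the interpreter, discriminator and classifier below are toy stand-in callables to show the data flow, not the actual networks or losses from the paper.

```python
# Skeleton of one TCN-style training step; all three components are stand-ins.
def interpreter(modality_input):
    """Modality -> low-dimensional representation (toy: just scales input)."""
    return [x * 0.5 for x in modality_input]

def discriminator(representation):
    """Guesses which modality a representation came from (toy heuristic)."""
    return 0 if sum(representation) < 1 else 1

def classifier(representations):
    """Final AD / non-AD decision from the pooled representations (toy)."""
    pooled = sum(sum(r) for r in representations)
    return int(pooled > 0)

def training_step(sample):
    # 1. Interpreters produce per-modality representations
    #    (trained to be indistinguishable across modalities)
    reps = [interpreter(m) for m in sample["modalities"]]
    # 2. Discriminators try to recover modality-specific information
    #    (interpreters maximise this loss; discriminators minimise it)
    guesses = [discriminator(r) for r in reps]
    # 3. The classifier makes the label decision from the representations
    prediction = classifier(reps)
    return reps, guesses, prediction

sample = {"modalities": [[0.2, 0.4], [1.0, 3.0]], "label": 1}
reps, guesses, prediction = training_step(sample)
```

In the real model each component is a neural network and the three steps correspond to alternating gradient updates on the adversarial and classification losses.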
AD Detection - Semi-supervised Multimodal Learning
Results*:
Model                            Modality     macro F1 (80 labeled / DementiaBank)
Semi-supervised TCN              multimodal   .7163 ± .0109
Semi-supervised tri-training     3-modal      .7025 ± .0305
Supervised CN                    multimodal   .6608 ± .0279
Semi-supervised TSVM             unimodal     .6857 ± .0260
Supervised SVM (classic ML)      unimodal     .6851 ± .0395
Supervised logistic regression   unimodal     .6857 ± .0149
*Z. Zhu, J. Novikova, and F. Rudzicz. Semi-supervised classification by reaching consensus among modalities. In: NIPS Workshop on
Interpretability and Robustness in Audio, Speech, and Language IRASL, Montreal, 2018
Early Prediction of AD - Famous People Study
Results*:
Gene Wilder
1933-2016
American actor. Diagnosed
with Alzheimer’s disease in
2013; Died from
complications 3 years later.
Paul Newman
1925-2008
American actor. Died from
lung cancer in 2008, with no
signs of cognitive
impairment.
[Figure: average noun phrase length, number of clauses per sentence, ratio of pronouns to nouns]
Early Prediction of AD - Famous People Study
The Subtraction Module subtracts (i.e. accounts for) the effects of healthy aging.
The Addition Module adds (i.e. combines) the two feature representations and ensures interpretability of each.
Experiments:
1. Base system. Uses raw features.
2. Base system with longitudinal feature
normalisation.
3. Base system with latent
representations learnt while training
on DementiaBank.
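One way to realise the "subtract healthy aging" idea, in the spirit of the longitudinal normalisation experiment above, is to fit a linear age trend on healthy speakers and keep only the residual. This is a hypothetical sketch with made-up data points, not the paper's actual module.

```python
# Fit a least-squares line of feature value vs. age on a healthy cohort,
# then normalise new observations by subtracting the expected healthy value.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Illustrative healthy cohort: pronoun/noun ratio drifting slightly up with age
healthy_age   = [55, 60, 65, 70, 75]
healthy_ratio = [0.50, 0.52, 0.54, 0.56, 0.58]
slope, intercept = fit_line(healthy_age, healthy_ratio)

def normalise(age, value):
    """Residual after removing the expected healthy-aging trend."""
    return value - (slope * age + intercept)

residual = normalise(70, 0.70)   # well above the healthy trend at age 70
```

A large positive residual here flags a feature value that healthy aging alone does not explain, which is the signal the early-detection models look for.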
Early Prediction of AD - Famous People Study
*J.Novikova, A.Balagopalan, M.Yancheva, F. Rudzicz. Early Prediction of Alzheimer’s Disease from Spontaneous Speech. Under
submission
Thank you!
