LCore: A Language Acquisition Framework for Robots
From Grounded Language Acquisition to Spoken Dialogues
2013/12/13
Komei Sugiura and Naoto Iwahashi
National Institute of Information and Communications Technology, Japan
komei.sugiura@nict.go.jp
Open problem: grounded language processing
• Language processing based on non-verbal information (vision,
motion, context, experience, …) is still very difficult
– e.g. “Put the blue cup away”, “Give me the usual”
• What is missing in dialogue processing for robots?
– Physical situatedness / symbol grounding
– Shared experience
(Figure: "blue cup" has multiple candidates in the scene; "the usual" could refer to an umbrella, a remote, a drink, …)
Spoken dialogue system + Robot ≠ Robot dialogue
• Robot dialogue
– Categorization/prediction of real-world information
– Handling real-world properties
– Linguistic interaction
• Why is this difficult?
– Machine learning, CV, manipulation, symbol grounding problem,
speech recognition,…
(Figure: object taxonomy, e.g. Tableware → Cup → Tea cup; Cutlery → Fork, Knife; Plate)
Robot Language Acquisition Framework
[Iwahashi 10, “Robots That Learn to Communicate: A Developmental Approach…”]
• Task: Object manipulation dialogues
• Key features
– Fully grounded vocabulary
– Imitation learning
– Incremental & interactive learning
– Language independent
LCore functions
Phoneme learning
Learning question answering
Word learning
Visual feature learning
Grammar learning
Affordance learning
Disambiguation of word ellipsis
Imitation learning
Utterance understanding
Role reversal imitation
Robot-directed utterance detection
Active-learning-based dialogue
Learning modules
• Word
– Learning nouns/adjectives: learning phoneme sequences and probabilistic distributions of visual features
– Learning verbs: learning phoneme sequences and trajectories
• Grammar
• Motion-object relationship
– Estimation of related objects
Symbol grounding: Learning nouns and adjectives
• Visual features modeled by Gaussians
– Input: visual features of objects
• Out-of-vocabulary word = phoneme sequence + waveform
– Voice conversion (Eigenvoice GMM) to robot voice
(Figure: generative models (Gaussians) of "BLUE" and "RED"; an unknown object is scored against them)
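As a rough sketch of this idea (not the LCore implementation; the class name, the feature extraction, and the regularization constant are assumptions), each colour or object word can be modelled as a Gaussian over visual feature vectors, and an unknown object is scored against every learned word:

import numpy as np
from scipy.stats import multivariate_normal

class GaussianWordModel:
    """Grounds each word (e.g. "blue", "red") as a Gaussian over visual features."""
    def __init__(self):
        self.models = {}

    def learn(self, word, features):
        # features: (N, D) array of visual feature vectors observed with `word`
        mu = features.mean(axis=0)
        cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
        self.models[word] = multivariate_normal(mean=mu, cov=cov)

    def classify(self, feature):
        # Return the word whose Gaussian assigns the highest log-likelihood
        return max(self.models, key=lambda w: self.models[w].logpdf(feature))

# Usage: vocab.learn("blue", blue_feats); vocab.classify(unknown_feature)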
Imitation learning of object manipulation [Sugiura+ 07]
• Difficulty: Clustering trajectories in the world coordinate system does not work
• Proposed method
– Input: Position sequences of all objects
– Estimation of reference point and coordinate system by EM algorithm
– The number of states is optimized by cross-validation
Place A on B
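A toy illustration of this difficulty (not the EM estimation of [Sugiura+ 07]; the function names and the variance-of-endpoints score are made up for exposition): "place A on B" demonstrations look scattered in world coordinates, but become consistent once each trajectory is expressed relative to the right reference object.

import numpy as np

def endpoint_spread(trajectories, object_positions, ref_id):
    # trajectories: list of (T, 2) arrays in the world coordinate system
    # object_positions: one dict {object_id: (2,) position} per demonstration
    rel_ends = [traj[-1] - objs[ref_id]
                for traj, objs in zip(trajectories, object_positions)]
    return np.var(np.stack(rel_ends), axis=0).sum()

def best_reference(trajectories, object_positions, candidate_ids):
    # The candidate whose relative end points are most consistent explains the
    # demonstrations best: "place A on B" always ends near B, wherever B is.
    return min(candidate_ids,
               key=lambda r: endpoint_spread(trajectories, object_positions, r))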
Imitation learning using reference-point-dependent HMMs
[Sugiura+ 07][Sugiura+ 11]
• Searching for the optimal coordinate system
– Candidates: (coordinate system type, reference object ID)
– The candidate and the HMM parameters are estimated so that the likelihood of the demonstrated position sequences is maximized
– Delta parameters (velocity, acceleration) are appended to the position at each time t
* Sugiura, K. et al, “Learning, Recognition, and Generation of Motion by …”, Advanced Robotics, Vol.25, No.17, 2011
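In symbols (my notation, summarizing the slide rather than quoting the paper): the coordinate system type k, the reference object r, and the HMM parameters λ are chosen jointly so that the demonstrated trajectory, expressed in the candidate frame and augmented with delta parameters, has maximum likelihood:

\[
(\hat{k}, \hat{r}, \hat{\lambda}) \;=\; \operatorname*{arg\,max}_{k,\,r,\,\lambda} \; P\bigl(Y^{(k,r)} \mid \lambda\bigr)
\]

where \(Y^{(k,r)}\) denotes the observed position sequence transformed into coordinate system type k centred on reference object r.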
Results: motion learning
• No verb was estimated to have the world coordinate system (WCS)
-> all learned verbs are reference-point dependent
(Figure: learned model of the motion "place-on" (position and velocity), and training-set log likelihoods for place-on, move-closer, raise, jump-over, move-away, rotate, and move-down)
Transformation of reference-point-dependent HMMs [Sugiura+ 11]
• What is the problem?
– Simple HMMs do not generate continuous trajectories
– Trajectories are situation-dependent
• Reference-point-dependent HMM
– Input: (motion ID, object IDs), e.g. <place-on, Object 1, Object 3>
– Output: Maximum likelihood trajectory
(Figure: the HMM for "place-on" is transformed into the world coordinate system according to the current situation to execute "Place X on Y")
* Sugiura, K. (2011), “Learning, Generation, and Recognition of Reference-Point-Dependent Probabilistic…”
Generating continuous trajectory using delta parameters
[Tokuda+ 00]
• Maximum likelihood trajectory

\[
\bar{c} \;=\; \operatorname*{arg\,max}_{c} \, P(Wc \mid q, \lambda) \;=\; \bigl(W^{\top} \Sigma^{-1} W\bigr)^{-1} W^{\top} \Sigma^{-1} \mu
\]

– o = Wc : time series of (position, velocity, acceleration)
– q : state sequence
– λ : HMM parameters
– W : filter (window) matrix that appends the delta parameters
– Σ : block-diagonal matrix of the covariance matrices of each output PDF
– c : time series of positions
– μ : vector of the mean vectors of each output PDF
*Tokuda, K. et al, “Speech parameter generation algorithms for HMM-based speech synthesis”, 2000
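A minimal numerical sketch of this formula (1-D positions, hand-rolled delta windows with clamped boundaries; the per-frame means and variances are assumed to come from the selected HMM state sequence, which is not shown here):

import numpy as np

def ml_trajectory(means, variances):
    # means, variances: (T, 3) arrays over (position, velocity, acceleration)
    T = means.shape[0]
    W = np.zeros((3 * T, T))                    # o = W c stacks static + deltas
    for t in range(T):
        W[3 * t, t] = 1.0                       # static (position)
        W[3 * t + 1, max(t - 1, 0)] += -0.5     # velocity: 0.5 * (c[t+1] - c[t-1])
        W[3 * t + 1, min(t + 1, T - 1)] += 0.5
        W[3 * t + 2, max(t - 1, 0)] += 1.0      # acceleration: c[t-1] - 2 c[t] + c[t+1]
        W[3 * t + 2, t] += -2.0
        W[3 * t + 2, min(t + 1, T - 1)] += 1.0
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)          # diagonal Sigma^{-1}
    A = W.T @ (prec[:, None] * W)               # W' Sigma^{-1} W
    b = W.T @ (prec * mu)                       # W' Sigma^{-1} mu
    return np.linalg.solve(A, b)                # c = (W' S^-1 W)^-1 W' S^-1 mu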
Quantitative results
• Evaluation measure
– Euclidean distance between the subject's trajectory and the generated trajectory
– Normalized by the number of frames T
(Figure: trajectory by subject vs. trajectory by the proposed method)
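Written out (my notation): with T frames, subject positions \(x_t\) and generated positions \(\hat{x}_t\), the measure is

\[
E \;=\; \frac{1}{T} \sum_{t=1}^{T} \bigl\| x_t - \hat{x}_t \bigr\|
\]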
Utterance understanding in LCore (1)
• User utterances are understood by using multimodal
information learned in a statistical learning framework
(Figure: the five modules and their models, integrated into the shared belief)
• Speech: HMM
• Vision: Bayesian learning of a Gaussian
• Motion: HMM
• Motion-object relationship: Bayesian learning of a Gaussian
• Context: MCE learning
Integration of multimodal information
• Shared belief Ψ: weighted sum of five modules
– Arguments: utterance, action, scene, context
– Modules: speech, motion, vision, motion-object relationship, context
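Roughly (the symbols are mine; the precise conditional forms and how the weights γ are trained are given in [Iwahashi 10]): for an utterance s, an action a, a scene O and a context q,

\[
\Psi(s, a, O, q) \;=\; \gamma_{1}\log p_{\mathrm{speech}} + \gamma_{2}\log p_{\mathrm{vision}} + \gamma_{3}\log p_{\mathrm{motion}} + \gamma_{4}\log p_{\mathrm{motion\text{-}object}} + \gamma_{5}\log p_{\mathrm{context}},
\]

i.e. a weighted sum of the log scores of the five modules.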
Grounded utterance disambiguation
• Simple dialogue systems
U: "Place the cup (on the table)."
R: "You said place the cup."
-> Risk of motion failure (Which "cup"? Where to?)
• Generating confirmation utterances using physical information
R: "I'll place the red cup on the table, is it OK?"
Multimodal utterance understanding
(Figure: candidate interpretations of the utterance "Place-on Elmo", ranked 1st to 30th by shared belief score)
Sugiura, K. et al., "Situated Spoken Dialogue with Robots Using Active Learning", Advanced Robotics, 2011
Multimodal utterance understanding
(Figure: the margin between the top-ranked and the competing candidate interpretations of "Place-on Elmo")
Sugiura, K. et al., "Situated Spoken Dialogue with Robots Using Active Learning", Advanced Robotics, 2011
Confirmation by paraphrasing user’s utterance
• Learning phase
– Bayesian logistic regression
– Input: margin d, Output: probability
• Execution phase
– Decision-making on responses based on expected utility
(Figure: probability as a function of the margin)
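A minimal sketch of both phases (toy data and hand-set utilities; LCore uses Bayesian logistic regression, which is approximated here by a point-estimate logistic regression, and the utility values are illustrative, not from the paper):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Learning phase: margins of past interpretations and whether they were correct.
margins = np.array([[0.2], [1.5], [3.0], [0.1], [2.2], [4.0]])
correct = np.array([0, 1, 1, 0, 1, 1])
clf = LogisticRegression().fit(margins, correct)

def respond(margin, u_exec_ok=1.0, u_exec_bad=-3.0, u_confirm=-0.2):
    # Execution phase: pick the response with the higher expected utility.
    p = clf.predict_proba([[margin]])[0, 1]     # estimated P(interpretation correct)
    eu_execute = p * u_exec_ok + (1 - p) * u_exec_bad
    return "execute" if eu_execute > u_confirm else "confirm"

print(respond(0.3))   # low margin  -> confirm (paraphrase and ask)
print(respond(3.5))   # high margin -> execute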
Quantitative result: Risk reduction
(Figure: baseline vs. proposed method on failure rate, rejection rate, confirmation rate, and number of confirmation utterances; the failure rate decreased to 1/4)
Reduction of motion failure in learning phase [Sugiura+ 11]
• So far: learning utterance-understanding probabilities
• Idea: learning-by-asking
– Active learning phase: operator = robot, motion executor = user
– (Passive) learning phase: operator = user, motion executor = robot
– Execution phase: operator = user, motion executor = robot
Sugiura, K. et al, "Situated Spoken Dialogue with Robots Using Active Learning", Advanced Robotics, Vol. 25, No. 17, 2011
Reduction of motion failure in learning phase
• Problem: motion failures are required in the learning phase to avoid over-fitting
(Figure: distribution of motion failures and successes across the active-learning, learning, and execution phases; the passive learning phase uses only "safe" training data)
What kind of commands are effective for learning?
• Proposed method: Active Learning-based command generation
• Objective: Reduce the number of interactions
• [Input = image], [Output = utterance]
• Expected Log Loss Reduction (ELLR) [Roy, 2001] is used to select the optimal utterance
Active learning: a form of supervised learning in which the inputs can be selected by the algorithm
Target action        Robot utterance             Loss
Act=A, Objs=<1,3>    "Place-on Elmo blue box"    35.8
Act=A, Objs=<1,3>    "Place-on Elmo"             12.3
Act=A, Objs=<1,2>    "Place-on Elmo"             28.1
…                    …                           …
Act=B, Objs=<2>      "Raise box"                 332.3
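A generic sketch of ELLR-style query selection (following [Roy, 2001] in spirit; the model interface, the retrain callable, and the use of an unlabeled pool for the loss estimate are assumptions, not the LCore command generator):

import numpy as np

def expected_log_loss(model, pool_X):
    # Average negative log probability the model assigns to its own predictions
    # on the unlabeled pool (a self-estimate of future log loss).
    probs = model.predict_proba(pool_X)
    return -np.mean(np.log(np.max(probs, axis=1)))

def select_utterance(model, X, y, candidates, pool_X, retrain):
    # Pick the candidate (target action / utterance pair) whose expected loss
    # after retraining is smallest, i.e. the most informative command to give.
    best, best_loss = None, np.inf
    for x in candidates:
        posterior = model.predict_proba([x])[0]
        loss = 0.0
        for label, p in enumerate(posterior):   # expectation over possible outcomes
            m = retrain(np.vstack([X, x]), np.append(y, label))
            loss += p * expected_log_loss(m, pool_X)
        if loss < best_loss:
            best, best_loss = x, loss
    return best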
Reduction of motion failure in learning phase
(Figure: test-set likelihood vs. number of episodes and the number of motion failures, for the proposed method and the baseline; the proposed method converges faster and reduces the motion failure risk)