Language acquisition framework for robots: From grounded language acquisition to spoken dialogues

  1. LCore: A Language Acquisition Framework for Robots: From Grounded Language Acquisition to Spoken Dialogues • 2013/12/13 • Komei Sugiura and Naoto Iwahashi • National Institute of Information and Communications Technology, Japan
  2. Open problem: grounded language processing • Language processing based on non-verbal information (vision, motion, context, experience, …) is still very difficult – e.g. “Put the blue cup away”, “Give me the usual” • What is missing in dialog processing for robots? – Physical situatedness / symbol grounding – Shared experience • Examples – “blue cup”: multiple candidates – “the usual”: umbrella, remote, drink, …
  3. Spoken dialogue system + Robot ≠ Robot dialogue • Robot dialogue – Categorization/prediction of real-world information – Handling real-world properties – Linguistic interaction • Why is this difficult? – Machine learning, CV, manipulation, symbol grounding problem, speech recognition, … [Figure: object category labels: Tableware, Cup, Tea cup, Cutlery, Fork, Plate, Knife]
  4. Robot Language Acquisition Framework [Iwahashi 10, “Robots That Learn to Communicate: A Developmental Approach…”] • Task: Object manipulation dialogues • Key features – Fully grounded vocabulary – Imitation learning – Incremental & interactive learning – Language independent
  5. LCore functions • Phoneme learning • Learning question answering • Word learning • Visual feature learning • Grammar learning • Affordance learning • Disambiguation of word ellipsis • Imitation learning • Utterance understanding • Role reversal imitation • Robot-directed utterance detection • Active-learning-based dialogue
  6. Learning modules: Word, Grammar, Motion-object relationship • Learning nouns/adjectives – Learning probabilistic distributions of visual features – Learning phoneme sequences • Learning verbs – Estimation of related objects – Learning trajectories – Learning phoneme sequences
  7. Symbol grounding: Learning nouns and adjectives • Visual features modeled by Gaussians – Input: visual features of objects • Out-of-vocabulary word = phoneme sequence + waveform – Voice conversion (Eigenvoice GMM) to robot voice [Figure: generative models for BLUE and RED, with an unknown object]
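As a rough illustration of "visual features modeled by Gaussians", the sketch below fits one Gaussian per word and names an unknown object by the most likely model. The feature values, the three-dimensional colour-like feature space, and the names `fit_gaussian`, `log_likelihood`, and `best_word` are all assumptions made for this example. It uses plain maximum-likelihood estimates, which may differ from the learning procedure used in LCore.

```python
import numpy as np

def fit_gaussian(features):
    """Return (mean, covariance) for an (n_samples, n_dims) feature array."""
    x = np.asarray(features, dtype=float)
    mean = x.mean(axis=0)
    # A small ridge keeps the covariance invertible for tiny sample sets.
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    return mean, cov

def log_likelihood(x, mean, cov):
    """Log density of a multivariate Gaussian evaluated at x."""
    d = np.asarray(x, dtype=float) - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(mean) * np.log(2 * np.pi) + logdet + maha)

# Hypothetical colour-like features observed while each word was spoken.
samples = {
    "blue": [[0.1, 0.2, 0.9], [0.2, 0.1, 0.8], [0.1, 0.3, 0.95], [0.15, 0.2, 0.85]],
    "red": [[0.9, 0.1, 0.1], [0.8, 0.2, 0.2], [0.95, 0.1, 0.15], [0.85, 0.15, 0.1]],
}
models = {word: fit_gaussian(feats) for word, feats in samples.items()}

def best_word(x):
    """Pick the word whose Gaussian gives the feature vector the highest likelihood."""
    return max(models, key=lambda w: log_likelihood(x, *models[w]))
```

Calling `best_word` on the features of an unknown object returns the word whose model explains those features best.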
  8. Imitation learning of object manipulation [Sugiura+ 07] • Difficulty: Clustering trajectories in the world coordinate system does not work • Proposed method – Input: Position sequences of all objects – Estimation of the reference point and coordinate system by the EM algorithm – The number of states is optimized by cross-validation [Figure: “Place A on B”]
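To see why the world coordinate system is a poor frame for a motion such as “Place A on B”, the toy sketch below compares how tightly the end points of demonstrations cluster in the world frame versus relative to each candidate reference object. This is a deliberate simplification: the slide describes estimation by the EM algorithm over HMM likelihoods, whereas this example swaps in a simple variance comparison. The function name `pick_reference` and all data are hypothetical.

```python
import numpy as np

def pick_reference(final_positions, object_positions):
    """Choose the frame in which demonstration end points cluster most tightly.

    final_positions: (n_demos, 2) end points of the moved object.
    object_positions: dict mapping a candidate reference object name to its
        (n_demos, 2) positions in the same demonstrations.
    Returns a candidate name, or "world" if no object-relative frame is tighter.
    """
    ends = np.asarray(final_positions, dtype=float)
    spread = {"world": ends.var(axis=0).sum()}
    for name, pos in object_positions.items():
        rel = ends - np.asarray(pos, dtype=float)  # end points relative to the object
        spread[name] = rel.var(axis=0).sum()
    return min(spread, key=spread.get)

# Hypothetical demonstrations: the moved object always ends just above the box.
box = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0], [2.0, 4.0]])
cup = np.array([[5.0, 5.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]])
ends = box + np.array([0.0, 0.05])

chosen = pick_reference(ends, {"box": box, "cup": cup})
```

Because the box moves between demonstrations, the end points are scattered in the world frame but nearly identical relative to the box, so the box is selected as the reference.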
  9. Imitation learning using reference-point-dependent HMMs [Sugiura+ 07][Sugiura+ 11] • Searching for the optimal coordinate system [Equation: defined in terms of the coordinate system type, the position at time t, the reference object ID, the HMM parameters, and the delta parameters] * Sugiura, K. et al., “Learning, Recognition, and Generation of Motion by …”, Advanced Robotics, Vol. 25, No. 17, 2011
  10. Results: motion learning • No verb is estimated to have the world coordinate system (WCS) -> reference-point-dependent verbs [Figure: position and velocity for the motion “place-on”; training-set log likelihood for Place-on, Move-closer, Raise, Jump-over, Move-away, Rotate, Move-down]
  11. Transformation of reference-point-dependent HMMs [Sugiura+ 11] • What is the problem? – Simple HMMs do not generate continuous trajectories – Situation-dependent trajectories • Reference-point-dependent HMM – Input: (motion ID, object ID), e.g. <place-on, Object 1, Object 3> – Output: Maximum likelihood trajectory [Figure: the “Place-on” HMM transformed to the world CS for the given situation; “Place X on Y”] * Sugiura, K. (2011), “Learning, Generation, and Recognition of Reference-Point-Dependent Probabilistic…”
  12. Generating continuous trajectory using delta parameters [Tokuda+ 00] • Maximum likelihood trajectory [Equation] • Quantities defined on the slide: time series of (position, velocity, acceleration); state sequence; HMM parameters; filter; matrix of covariance matrices of each OPDF; time series of position; vector of mean vectors * Tokuda, K. et al., “Speech parameter generation algorithms for HMM-based speech synthesis”, 2000
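The maximum likelihood trajectory can be written out in the standard form used for parameter generation with a fixed state sequence, as in Tokuda et al. The symbol names below are assumptions chosen for this sketch, since the slide's own notation is not available: $o$ is the time series of (position, velocity, acceleration), $q$ the state sequence, $\lambda$ the HMM parameters, $W$ the filter, $\Sigma$ the matrix of covariance matrices of each OPDF, $x$ the time series of position, and $\mu$ the vector of mean vectors.

```latex
\begin{align}
  o &= W x \\
  \hat{x} &= \operatorname*{arg\,max}_{x} \; P(W x \mid q, \lambda)
           = \operatorname*{arg\,max}_{x} \; \mathcal{N}(W x ;\, \mu, \Sigma) \\
  W^{\top} \Sigma^{-1} W \, \hat{x} &= W^{\top} \Sigma^{-1} \mu
\end{align}
```

The last line is a linear system in $\hat{x}$. Because the velocity and acceleration rows of $W$ tie neighbouring frames together, the solution is a smooth position sequence rather than a piecewise-constant sequence of state means.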
  13. Quantitative results • Evaluation measure – Euclidean distance – Normalized by frame number T [Figure: trajectory by subject vs. trajectory by proposed method]
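One plausible reading of this evaluation measure is the per-frame Euclidean distance between the two trajectories, summed and divided by the number of frames T. The sketch below follows that reading. The slide does not say whether distances are squared or how trajectories of different lengths are aligned, so equal lengths and unsquared distances are assumptions here, and `normalized_distance` is a hypothetical name.

```python
import numpy as np

def normalized_distance(traj_a, traj_b):
    """Mean per-frame Euclidean distance between two (T, n_dims) trajectories."""
    a = np.asarray(traj_a, dtype=float)
    b = np.asarray(traj_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("trajectories must have the same shape")
    num_frames = a.shape[0]
    # Distance at each frame, summed over time and normalized by T.
    return np.linalg.norm(a - b, axis=1).sum() / num_frames

# Hypothetical two-frame trajectories in a 2-D workspace.
subject = [[0.0, 0.0], [1.0, 0.0]]
generated = [[0.0, 3.0], [1.0, 4.0]]
error = normalized_distance(subject, generated)
```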
  15. Utterance understanding in LCore (1) • User utterances are understood by using multimodal information learned in a statistical learning framework [Figure: Vision (Bayesian learning of a Gaussian); Motion (HMM); Speech (HMM); Motion-object relationship (Bayesian learning of a Gaussian); Shared belief; Context (MCE learning)]
  16. Integration of multimodal information • Shared belief Ψ: weighted sum of five modules – Speech – Motion – Vision – Motion-object relationship – Context • Labels on the equation: utterance, action, scene, context
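The weighted sum can be sketched as follows. The module scores are treated as hypothetical log-likelihood-style values and the weights are set to 1 purely for illustration. The names `shared_belief` and `understand`, the score values, and the candidate actions are assumptions for this example and do not come from LCore.

```python
MODULES = ("speech", "motion", "vision", "motion_object", "context")

def shared_belief(scores, weights):
    """Weighted sum of the five module scores for one candidate action."""
    return sum(weights[m] * scores[m] for m in MODULES)

def understand(candidates, weights):
    """Return the candidate whose shared-belief value is highest."""
    return max(candidates, key=lambda c: shared_belief(c["scores"], weights))

# Hypothetical weights and two candidate interpretations of one utterance.
weights = {m: 1.0 for m in MODULES}
candidates = [
    {"action": "place-on Elmo box",
     "scores": {"speech": -10.0, "motion": -5.0, "vision": -3.0,
                "motion_object": -2.0, "context": -1.0}},
    {"action": "place-on Elmo cup",
     "scores": {"speech": -10.0, "motion": -9.0, "vision": -3.0,
                "motion_object": -6.0, "context": -1.0}},
]
best = understand(candidates, weights)
```

Both candidates match the speech equally well here, so the motion and motion-object relationship scores decide between them.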
  17. Inter-module learning • Multimodal understanding • Confidence learning • Utterance/Motion generation [Figure: user intention expressed as “Place Elmo on box”, “Place Elmo”, “Place it”]
  18. Grounded utterance disambiguation • Simple dialog systems – U: “Place the cup (on the table).” – R: “You said place the cup.” -> Risk of motion failure (Which “cup”? Where to?) • Generating confirmation utterances using physical information – R: “I’ll place the red cup on the table, is it OK?”
  19. Multimodal utterance understanding [Figure: candidate actions for “Place-on Elmo”, ranked 1st, 2nd, …, 30th] Sugiura, K. et al., “Situated Spoken Dialogue with Robots Using Active Learning”, Advanced Robotics, 2011
  20. Multimodal utterance understanding [Figure: candidate actions for “Place-on Elmo”, ranked 1st, 2nd, …, 30th, with the margin marked] Sugiura, K. et al., “Situated Spoken Dialogue with Robots Using Active Learning”, Advanced Robotics, 2011
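A minimal sketch of the margin, assuming it is the gap between the best and second-best candidate scores. The slide only shows the ranked candidates with a margin marked, so this definition is an assumption, and the scores below are hypothetical.

```python
def margin(scores):
    """Gap between the highest and second-highest candidate scores."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two candidates to compute a margin")
    return ranked[0] - ranked[1]

# Hypothetical scores for the ranked candidate actions.
candidate_scores = [-29.0, -21.0, -40.5]
d = margin(candidate_scores)
```

A large margin means the top interpretation clearly beats the runner-up, while a small one signals an ambiguous utterance.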
  21. Confirmation by paraphrasing user’s utterance • Learning phase – Bayesian logistic regression – Input: margin (d), output: probability • Execution phase – Decision-making on responses based on expected utility [Figure: probability vs. margin]
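The two phases can be sketched as a margin-to-probability mapping followed by an expected-utility comparison. This example uses an ordinary logistic function with fixed, made-up coefficients in place of Bayesian logistic regression, and the utility values are hypothetical. It also considers only two responses (execute or confirm), which may be fewer than the system actually uses.

```python
import math

def success_probability(d, w0=-1.0, w1=0.5):
    """Logistic mapping from margin d to the probability of a correct action.

    w0 and w1 are placeholder coefficients; in practice they would be learned.
    """
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * d)))

def choose_response(d, u_success=1.0, u_failure=-4.0, u_confirm=-0.5):
    """Execute directly if its expected utility beats the cost of confirming."""
    p = success_probability(d)
    eu_execute = p * u_success + (1.0 - p) * u_failure
    return "execute" if eu_execute >= u_confirm else "confirm"

confident = choose_response(8.0)  # large margin
ambiguous = choose_response(1.0)  # small margin
```

With these placeholder utilities the robot executes without asking only when the estimated success probability is at least 0.7. Making failures more costly raises that threshold and leads to more confirmation utterances.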
  22. Quantitative result: Risk reduction [Figure: baseline vs. proposed, annotated “Decreased to 1/4”; labels: failure rate, rejection rate, confirmation rate, number of confirmation utterances]
  23. Reduction of motion failure in learning phase [Sugiura+ 11] • So far… – Learning utterance understanding probabilities • Idea – Learning-by-asking • Roles by phase (operator -> motion executor) – Active learning: Robot -> User – (Passive) learning: User -> Robot – Execution: User -> Robot Sugiura, K. et al., “Situated Spoken Dialogue with Robots Using Active Learning”, Advanced Robotics, Vol. 25, No. 17, 2011
  24. Reduction of motion failure in learning phase • Problem: – Motion failure is required in the learning phase to avoid over-fitting [Figure: active learning phase, learning phase, and execution phase, with motion failure / motion success outcomes and “safe” training data]
  25. What kind of commands are effective for learning? • Proposed method: Active-learning-based command generation • Objective: Reduce the number of interactions • [Input = image], [Output = utterance] • Expected Log Loss Reduction (ELLR) [Roy, 2001] is used to select the optimal utterance • Active learning: a form of supervised learning in which inputs can be selected by the algorithm • Table (target action | robot utterance | loss) – Act=A, Objs=<1,3> | “Place-on Elmo blue box” | 35.8 – Act=A, Objs=<1,3> | “Place-on Elmo” | 12.3 – Act=A, Objs=<1,2> | “Place-on Elmo” | 28.1 – … – Act=B, Objs=<2> | “Raise box” | 332.3
  26. Utterance generation by ELLR
  27. Reduction of motion failure in learning phase • Fast convergence • Motion failure risk reduced [Figures: test-set likelihood vs. number of episodes for (1) proposed and (2) baseline; number of motion failures for proposed and baseline]