
A Multimodal Classifier Generative Adversarial Network for Carry and Place Tasks from Ambiguous Language Instructions


This paper focuses on a multimodal language understanding method for carry-and-place tasks with domestic service robots. We address the case of ambiguous instructions, that is, when the target area is not specified. For instance, "put away the milk and cereal" is a natural instruction in which the target area is ambiguous, given typical daily-life environments. Conventionally, such an instruction can be disambiguated through a dialogue system, but at the cost of time and cumbersome interaction. Instead, we propose a multimodal approach in which the instructions are disambiguated using the robot's state and environment context. We develop the MultiModal Classifier Generative Adversarial Network (MMC-GAN) to predict the likelihood of different target areas, considering the robot's physical limitations and the target clutter. Our approach, MMC-GAN, significantly improves accuracy compared with baseline methods that use instructions only or simple deep neural networks.
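
Before the slide transcript, here is a minimal sketch of the prediction task the abstract describes: given an instruction, a context sentence about the robot's state, and a depth image of a candidate target area, score how likely that area is as a placement target. The names CarryAndPlaceSample and predict_target_area are hypothetical and the scorer is a placeholder, not the authors' model; only the A1 ("very likely") and A4 ("unlikely") label meanings come from the dataset slide below.

# Hedged sketch of the task interface only; the scoring logic is a placeholder,
# not the MMC-GAN model itself.
from dataclasses import dataclass
from typing import Dict

import numpy as np

# Target-area classes; A1 = very likely and A4 = unlikely follow the dataset slide,
# the intermediate classes A2/A3 are left undefined here.
TARGET_AREAS = ["A1", "A2", "A3", "A4"]

@dataclass
class CarryAndPlaceSample:
    instruction: str          # e.g. "Put the coke bottle on the table"
    context: str              # e.g. "the bottle has been grasped"
    depth_image: np.ndarray   # depth map of the candidate target area

def predict_target_area(sample: CarryAndPlaceSample) -> Dict[str, float]:
    """Placeholder: a trained MMC-GAN would return the likelihood of each class."""
    uniform = 1.0 / len(TARGET_AREAS)
    return {label: uniform for label in TARGET_AREAS}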


  1. Multimodal Language Understanding for Carry and Place Tasks. Aly Magassouba, Komei Sugiura and Hisashi Kawai, National Institute of Information and Communications Tech., Japan
  2. Our target: service robots that understand ambiguous speech
     Social background
     • Shortage of manpower that can physically support people with disabilities
     Challenge
     • Understanding ambiguous instructions from the linguistic and visual context in an end-to-end approach
     Ambiguity
     • "Put away the sugar and milk bottle"
     • Meaning: "Put the sugar on the kitchen shelf and the milk in the fridge"
  3. The difference between our approach and the literature: GAN data augmentation in latent space
     Related work:
     • Dialogue-based approach [Kollar10]: time consuming
     • End-to-end approach [Hatori18]: grasping task / large dataset
     • LAC-GAN [Sugiura17]: single modality
     Novelty:
     • Multimodal spoken language understanding with GAN data augmentation
     Key technology:
     • GAN data augmentation in latent space
     • Different from the classic GAN [Goodfellow14] used for generation [Bousmalis17], [Zhang17]
     (Figure: classic GAN diagram, a generator producing fake samples and a discriminator judging real vs. fake; a minimal latent-space augmentation sketch is given after the transcript)
  4. Theoretical background of the MultiModal Classifier GAN (MMC-GAN)
     (Figures: cost function of the extractor, cost function of the generator based on the Wasserstein method, and cost function of the discriminator; a hedged reconstruction is given after the transcript)
     • Data augmentation in latent space makes training more data-efficient [Sugiura17]
     • In [Sugiura17] the extractor was fully connected and not adapted to visual and multimodal inputs
  5. Structure of the Extractor (architecture figure; a rough sketch is given after the transcript)
  6. Building a Carry-and-Place multimodal dataset for validating our method
     Input (a)
     • Instruction: "Put the coke bottle on the table"
     • Context: "the bottle has been grasped"
     • Depth image
     Output label
     • A1 = very likely target area
     Input (b)
     • Instruction: "Bring this towel to the kitchen shelf"
     • Context: "the robot is holding the towel"
     • Depth image
     Output label
     • A4 = unlikely target area
     Dataset distribution: A1: 212, A2: 432, A3: 398, A4: 240, total: 1282
  7. MMC-GAN is more accurate thanks to the data augmentation property
     Metric = test-set accuracy

     Method           GAN type   Instruction   Instruction+Context   Image   Instruction+Context+Image
     CNN (baseline)   -          59.4          60.2                  61.1    82.2
     MMC-GAN          GAN        57.5*         59.5*                 58.1    85.3
     MMC-GAN          CGAN       56.4*         56.7*                 58.2    86.2
     MMC-GAN          WGAN       61.8          62.7                  59.7    84.4
     *Not all trials converge
  8. (Same results table as slide 7) Takeaway: MMC-GAN outperforms the classic DNN baseline.
  9. (Same results table as slide 7) Takeaway: a multimodal approach is required to solve the carry-and-place task.
  10. (Same results table as slide 7) Takeaway: WGAN is more stable.
  11. Sample results: MMC-GAN emphasizes the relationship between linguistic and visual features
      (Figure: example correct and incorrect predictions, and the confusion matrix)
  12. Summary
      • Contribution: multimodal spoken language understanding with GAN data augmentation
      • Method: a GAN based on latent-space features that classifies target areas from ambiguous instructions
      • Results: our method outperforms a plain DNN; multimodal inputs are required to solve carry-and-place tasks
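
For slide 3, the key technology is GAN data augmentation in latent space. The following is a minimal sketch, assuming illustrative dimensions (a 128-dimensional latent feature and 32-dimensional noise), of how a latent-space generator differs from a classic GAN generator; it is not the paper's exact architecture.

# Hedged sketch: classic GAN generation vs. latent-space augmentation.
# Dimensions are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

LATENT_DIM = 128   # size of the extractor's multimodal feature (assumed)
NOISE_DIM = 32

# Classic GAN [Goodfellow14]: the generator synthesizes raw data, e.g. a 64x64 image.
classic_generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, 64 * 64), nn.Tanh(),
)

# Latent-space augmentation (the MMC-GAN idea): the generator synthesizes extra
# feature vectors in the extractor's latent space, and the discriminator/classifier
# is trained on these fake features alongside the real extracted ones.
latent_generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, LATENT_DIM),
)

z = torch.randn(4, NOISE_DIM)
print(classic_generator(z).shape)  # torch.Size([4, 4096]) -> would be reshaped into images
print(latent_generator(z).shape)   # torch.Size([4, 128])  -> fed to the discriminator/classifier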
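
For slide 4, the cost functions appear only as figures and did not survive the transcript. As a hedged reconstruction in LaTeX, assuming a standard WGAN-style formulation in which the extractor E produces real latent features from the inputs, the generator G produces fake ones from noise z, and the discriminator combines a Wasserstein critic D_w with a target-area classifier D_c (the weight \lambda is an assumed hyperparameter; these are not the authors' exact equations):

% Hedged sketch of WGAN-style costs over latent features (not the authors' exact equations).
% i, c, v: instruction, context and depth-image inputs; y: target-area label.
\begin{align}
  J_D &= \mathbb{E}_{z}\big[D_w(G(z))\big] - \mathbb{E}_{x \sim E(i,c,v)}\big[D_w(x)\big]
         - \lambda\, \mathbb{E}_{x \sim E(i,c,v)}\big[\log D_c(y \mid x)\big] \\
  J_G &= -\,\mathbb{E}_{z}\big[D_w(G(z))\big] \\
  J_E &= -\,\mathbb{E}_{x \sim E(i,c,v)}\big[\log D_c(y \mid x)\big]
\end{align}

Under this assumed form, the critic is trained to separate real from generated latent features, while the classification term keeps the latent features discriminative with respect to the target-area classes.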
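
For slide 5, the extractor's structure is shown only as a figure. As a rough sketch of what a multimodal extractor of this kind could look like, the layer sizes, the GRU text encoder and the small depth-image CNN below are assumptions, not the authors' exact network.

# Hedged sketch of a multimodal extractor (assumed layer sizes, not the paper's exact network).
# It fuses a depth-image feature with an encoded instruction+context sentence into one latent vector.
import torch
import torch.nn as nn

class Extractor(nn.Module):
    def __init__(self, vocab_size=3000, embed_dim=64, latent_dim=128):
        super().__init__()
        # Small CNN over a single-channel depth image (e.g. 1 x 64 x 64).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),          # -> 32 * 4 * 4 = 512
        )
        # Recurrent encoder over the tokenized instruction + context sentence.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, 64, batch_first=True)
        # Fusion into a single latent feature used by the discriminator/classifier.
        self.fuse = nn.Sequential(nn.Linear(512 + 64, latent_dim), nn.ReLU())

    def forward(self, depth_image, token_ids):
        img_feat = self.cnn(depth_image)                    # (B, 512)
        _, h = self.rnn(self.embed(token_ids))              # h: (1, B, 64)
        return self.fuse(torch.cat([img_feat, h.squeeze(0)], dim=1))  # (B, latent_dim)

# Minimal usage example with random tensors.
if __name__ == "__main__":
    extractor = Extractor()
    x = extractor(torch.randn(2, 1, 64, 64), torch.randint(0, 3000, (2, 12)))
    print(x.shape)  # torch.Size([2, 128])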
