
[GTC 2019] Bringing Personal Robots Home: Integrating Computer Vision and Human–Robot Interaction for Real-World Applications

In this talk, we’ll discuss our latest achievements and challenges in developing personal robot systems. The main focus is our recently announced autonomous tidying-up robot system. We’ll describe how we integrated cutting-edge speech recognition, natural language processing, and computer vision technologies to build an autonomous system that handles complex real-world applications with high accuracy. The system also deploys our latest object detection model, which was trained on 512 NVIDIA Tesla V100 GPUs and won second place in the Google AI Open Images – Object Detection Track in August 2018.

Interactive picking: https://pfnet.github.io/interactive-robot/
Tidying-up robot: https://projects.preferred.jp/tidying-up-robot/en/


[GTC 2019] Bringing Personal Robots Home: Integrating Computer Vision and Human–Robot Interaction for Real-World Applications

  1. 1. Bringing Personal Robots Home [S9360] Integrating Computer Vision & Human–Robot Interaction for Real-World Applications NVIDIA GTC 2019 (Mar 18, 2019) Jun Hatori, Preferred Networks
  2. 2. Requirements for Robots (Industrial vs. Personal) — Cost: high vs. low; Environment: fixed, known, structured vs. dynamic, unstructured, unseen; Users: experts vs. non-experts; Goal: automation vs. intelligence and personalization
  3. 3. Requirements for Robots (Industrial vs. Personal, with key technology) — Cost: high vs. low (key technology: hardware); Environment: fixed, known, structured vs. dynamic, unstructured, unseen (computer vision); Users: experts vs. non-experts (human–robot interaction); Goal: automation vs. intelligence and personalization (task planning)
  4. 4. Requirements for Robots (Industrial vs. Personal, with key technology) — Cost: high vs. low (key technology: hardware); Environment: fixed, known, structured vs. dynamic, unstructured, unseen (computer vision); Users: experts vs. non-experts (human–robot interaction); Goal: automation vs. intelligence and personalization (task planning)
  5. 5. A variety of real-world environments
  6. 6. PR1: Wyrobek et al. 2008
  7. 7. Key Technologies ● Computer Vision: generalization to different environments and tasks ○ Object detection over thousands of categories ○ Support for unseen environments and unseen objects ● Human–Robot Interaction: interface between humans and robots ○ Intuitive interface with spoken and visual language interpretation ○ Spoken and visual feedback from robots
  8. 8. Two Projects ● Interactive picking robot ● Autonomous tidying-up robot
  9. 9. Interactively Picking Real-World Objects https://projects.preferred.jp/interactive-robot/
  10. 10. Challenges ● Variety of Expressions “a bear doll”, “the animal plushie”, “that fluffy thing”, “up-side-down grizzly” “grab X”, “bring together X and Y”, “move X to a diagonal box” ● Ambiguity and errors “that brown one”, “a dog doll?”
  11. 11. (Example dialogue) Human: “hey can you move that brown fluffy thing to the bottom right?” Robot: “which one do you mean?” Human: “the one next to the eraser box.” Robot: “I got it.”
  12. 12. Proposed Model (architecture diagram): the spoken instruction — e.g. “pick the brown fluffy thing and put in the lower bin.” — is transcribed and passed through a word embedding + LSTM; SSD detects objects in the RGB image, and each cropped image goes through a CNN (+ features) and an MLP; MLP heads then predict the target object and the destination. (See the sketch after the slide list.)
  13. 13. Handling Ambiguous Commands ● Trained with a hinge loss over correct sentence–object pairs [Yu+ 2017] ● An instruction is considered ambiguous if the score margin between the 1st- and 2nd-ranked candidate objects is below a threshold (diagram: CNN/MLP image branches and LSTM text branch scoring “pick the brown fluffy thing and put it in the lower right bin.”; see the sketch after the slide list)
  14. 14. Interactive Picking Dataset ● Example instructions: “grab the human face labeled object and …”, “move the pop red can from the top …”, “move the pink horse plushie …”, “put the box with a 50 written on it that is …” ● 1,200 scenes (26k objects in total), 100 types of commodities, 73k unconstrained instructions (vocabulary size: 5,000) ● Publicly available as the PFN-PIC dataset: https://github.com/pfnet-research/picking-instruction
  15. 15. Results ● Accuracy of target object matching: 88.0% with a single instruction
  16. 16. Results ● Accuracy of target object matching: 88.0% with a single instruction vs. 92.7% with interactive clarification ● A 4.7-point improvement (the error rate drops from 12.0% to 7.3%, i.e. a 39% error reduction)
  17. 17. Summary ● We proposed an interactive picking system that can be controlled by unconstrained spoken language instructions. ● We achieved an object matching accuracy of 92.7%. ● Accuracies for unseen objects are not yet sufficient (~70%). * Hatori+ 2018. Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions. ICRA 2018, Best Paper Award on HRI.
  18. 18. Tidying-up Robot https://projects.preferred.jp/tidying-up-robot/
  19. 19. CEATEC JAPAN 2018 (Oct 16–19, 2018)
  20. 20. Environment ● Furnished living room ○ Coffee table, couch, bookshelf, trash bins, laundry bag, toy box ● Two Toyota HSRs working in parallel
  21. 21. Object Recognition ● Sensors ○ HSR’s head camera (RGBD) ○ 4 ceiling cameras (RGB) ● Supported objects: ~300 ● PFDet as CNN base model ○ 2nd place accuracy at Google AI Open Images Challenge – Object Detection (Sep, 2018)
  22. 22. PFDet: Basic Architecture [1] ● Feature Pyramid Network (FPN) (SENet-154 and SE-ResNeXt-101 backbones) ● Multi-node batch normalization ● Non-maximum weighted (NMW) suppression [2] (see the sketch after the slide list) ● Global context ○ Additional FPN block ○ PSP (pyramid spatial pooling) module ○ Context head [3] [1] Akiba+ 2018. PFDet: 2nd Place Solution to Open Images Challenge 2018 Object Detection Track. [2] Zhou+. CAD: Scale invariant framework for real-time object detection. ICCVW 2017. [3] Zhu+. CoupleNet: Coupling global structure with local parts for object detection. ICCV 2017.
  23. 23. PFDet: High Scalability ● Hardware: in-house GPU cluster, 512 × NVIDIA Tesla V100 (32 GB) with InfiniBand interconnect ● Scalability results: training of 16 epochs completed in 33 hours; scaling efficiency of 83% compared to 8 GPUs (see the sketch after the slide list) ● Software framework
  24. 24. Data Collection
  25. 25. System Performance ● Object detection ○ Accuracy: 0.90 mIoU (segmentation masks; see the sketch after the slide list) ● Robot system (actual measurements at CEATEC) ○ Tidying-up speed: 1.9 objects / minute ○ Grasp success rate: ~90%
  26. 26. Robustness of Object Detection (example detections on sparse and dense scenes)
  27. 27. Typical Errors ● Mango vs. lemon ● Mis-recognition on humans ● Whiteout ● False negative in clutter
  28. 28.
  29. 29. Human–Robot Interaction (HRI) ● From user to robot ○ Update where the current item should be stored ○ Inquire about object locations ● From robot to user ○ Spoken and audio feedback ○ Tablet App for monitoring ■ User can also provide feedback ■ AR-based visualization ● Technologies involved: speech recognition, NLP, gesture, AR
  30. 30. Needs English subtitles
  31. 31. Needs English subtitles
  32. 32. Tablet UI
  33. 33. Remaining Challenges with Tidying-up ● Standalone computation (no external sensor or computer) ● Recognition of unlimited items in domestic environments ● Generalization to unseen environments ● Easy setup
  34. 34. Robots as an Interface with the Physical World ● Domestic robots can track household items while tidying up, connecting everything in the physical world to the virtual world. ● Potential applications: ○ E-commerce ○ Recommendations on item purchase or disposal
  35. 35. Key Takeaways ● Robust computer vision and an intuitive human–robot interface are prerequisites for successful personal robot applications. ● Some simple domestic tasks, like tidying up, are getting close to production level. ● Robots are an interface to the physical world, computerizing household items and connecting them to online services.
  36. 36. Thank you! Interactive picking: https://pfnet.github.io/interactive-robot/ Tidying-up robot: https://projects.preferred.jp/tidying-up-robot/en/ Related talks ● S9380 - The Frontier of Define-by-Run Deep Learning Frameworks Wed, Mar 20, 11:00 AM - 11:50 AM – SJCC Room 210E ● S9738 - Using GPU Power for NumPy-syntax Calculations Tue, Mar 19, 2:00 PM - 02:50 PM – SJCC Room 210F
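
The sketches below expand on a few of the technical points in the slides above. First, the proposed model from slide 12: a minimal PyTorch-style sketch of a two-branch architecture in that spirit — SSD-detected crops embedded by a CNN, the transcribed instruction embedded by an LSTM, and MLP heads scoring the target object and predicting the destination. The layer sizes, head structure, and names here are illustrative assumptions, not PFN's implementation.

```python
# Minimal PyTorch-style sketch of a two-branch model in the spirit of slide 12
# (illustrative only, not PFN's implementation): an SSD-style detector provides
# candidate object crops, a CNN embeds each crop, an LSTM embeds the transcribed
# instruction, and MLP heads score the target object and predict the destination.
import torch
import torch.nn as nn

class PickingModelSketch(nn.Module):
    def __init__(self, vocab_size, text_dim=256, img_dim=512, hidden=256, n_bins=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, text_dim)
        self.lstm = nn.LSTM(text_dim, hidden, batch_first=True)
        self.img_mlp = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.target_head = nn.Sequential(nn.Linear(hidden * 2, hidden), nn.ReLU(),
                                         nn.Linear(hidden, 1))     # score per candidate crop
        self.dest_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_bins))  # destination bin logits

    def forward(self, crop_feats, token_ids):
        # crop_feats: (num_crops, img_dim) CNN features of SSD-detected crops
        # token_ids:  (1, seq_len) token ids of the transcribed spoken instruction
        _, (h, _) = self.lstm(self.embed(token_ids))
        text = h[-1]                                     # (1, hidden) instruction embedding
        img = self.img_mlp(crop_feats)                   # (num_crops, hidden)
        pair = torch.cat([img, text.expand_as(img)], dim=1)
        target_scores = self.target_head(pair).squeeze(1)  # one score per candidate object
        dest_logits = self.dest_head(text).squeeze(0)       # which bin to place it in
        return target_scores, dest_logits
```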
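Next, the ambiguity handling from slide 13, sketched under the same assumptions: a hinge loss over correct vs. incorrect sentence–object pairs at training time, and a clarification question triggered at inference when the margin between the top two candidate scores falls below a threshold (the threshold value here is arbitrary).

```python
# Sketch of the margin-based ambiguity handling described on slide 13 (illustrative):
# training uses a hinge loss over correct vs. incorrect sentence-object pairs, and at
# inference the instruction is treated as ambiguous (triggering a clarification
# question) when the margin between the top two candidate scores is below a threshold.
import torch
import torch.nn.functional as F

def hinge_loss(scores, correct_idx, margin=1.0):
    # scores: (num_crops,) model scores for all candidate objects in one scene
    mask = torch.ones_like(scores)
    mask[correct_idx] = 0.0                              # exclude the correct pair itself
    gap = F.relu(margin - (scores[correct_idx] - scores))
    return (gap * mask).sum()

def is_ambiguous(scores, threshold=0.5):
    # Threshold value is arbitrary here; in practice it would be tuned on held-out data.
    top2 = torch.topk(scores, k=2).values
    return (top2[0] - top2[1]).item() < threshold
```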
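Slide 22 lists non-maximum weighted (NMW) suppression [2]. Below is a rough NumPy sketch of one common formulation — overlapping boxes are merged by a score- and IoU-weighted average rather than simply discarded as in standard NMS; the weighting scheme and thresholds may differ from PFDet's actual implementation.

```python
# Rough NumPy sketch of non-maximum weighted (NMW) suppression [2], per class:
# instead of discarding boxes that overlap the top-scoring box (as in standard NMS),
# the overlapping cluster is merged into one box by a score- and IoU-weighted average.
# The exact weighting used in PFDet may differ; scores are assumed positive.
import numpy as np

def box_iou(box, boxes):
    # box: (4,) as [x1, y1, x2, y2]; boxes: (N, 4)
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nmw(boxes, scores, iou_thresh=0.5):
    # boxes: (N, 4), scores: (N,) detections of a single class
    order = np.argsort(scores)[::-1]
    merged_boxes, merged_scores = [], []
    while order.size > 0:
        top = order[0]
        ious = box_iou(boxes[top], boxes[order])
        cluster = order[ious >= iou_thresh]          # boxes merged with the top-scoring one
        weights = scores[cluster] * box_iou(boxes[top], boxes[cluster])
        merged_boxes.append((weights[:, None] * boxes[cluster]).sum(0) / weights.sum())
        merged_scores.append(scores[top])
        order = order[ious < iou_thresh]             # keep processing the remaining boxes
    return np.array(merged_boxes), np.array(merged_scores)
```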
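Slide 23 reports 83% scaling efficiency relative to 8 GPUs. The slide does not give the formula; a common definition is actual speedup over the small-scale baseline divided by the ideal linear speedup, as in this sketch. The 8-GPU training time in the example is hypothetical, chosen only so that the numbers reproduce the reported figure with the 33-hour, 512-GPU run.

```python
# One common definition of scaling efficiency (the slide does not spell out the
# formula): actual speedup over the 8-GPU baseline divided by the ideal linear
# speedup. The 8-GPU time below is hypothetical, chosen only so that the example
# reproduces the reported ~83% with the 33-hour, 512-GPU run.
def scaling_efficiency(baseline_hours, scaled_hours, baseline_gpus=8, scaled_gpus=512):
    speedup = baseline_hours / scaled_hours
    ideal_speedup = scaled_gpus / baseline_gpus
    return speedup / ideal_speedup

print(f"{scaling_efficiency(1750, 33):.2f}")  # ~0.83
```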
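Finally, slide 25 reports 0.90 mIoU on segmentation masks. Here is a minimal sketch of how mask IoU is typically computed and averaged over matched objects; the exact evaluation protocol (how predictions are matched to ground truth, per-class vs. per-instance averaging) is not given on the slide.

```python
# Minimal sketch of mask IoU and its mean over matched objects. How predictions are
# matched to ground truth (and whether averaging is per class or per instance) is not
# specified on the slide, so this only illustrates the metric itself.
import numpy as np

def mask_iou(pred_mask, gt_mask):
    # pred_mask, gt_mask: boolean arrays of identical shape (H, W)
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0

def mean_iou(matched_pairs):
    # matched_pairs: list of (predicted mask, ground-truth mask) tuples
    return float(np.mean([mask_iou(p, g) for p, g in matched_pairs]))
```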
