Wei Xu at AI Frontiers: Language Learning in an Interactive and Embodied Setting
A Developmental Approach to Machine Intelligence
1. It might be easier than solving all the tasks a human adult can do
2. Learn skills and knowledge unspecified at design time
3. Gradually proceed from easy tasks to difficult tasks
“Instead of trying to produce a program to simulate the adult mind, why
not rather try to produce one which simulates the child's? If this were then
subjected to an appropriate course of education one would obtain the adult
brain.” - Alan Turing (1950)
Language learning in an interactive and embodied setting
Learn from the experiences coming from the
machine’s interactions with its environment
Learn common sense through observation of
and interaction with the environment
Meaning emerges by “grounding” language in
modalities in our environment
Human driving: < 1000 miles
Self-driving: >10 million miles
A useful robot needs to be able to understand
and communicate effectively
It is easier for humans to teach machines directly
using language than by writing code
Humans are great teachers
Learn the effects of speaking by observing
feedback from a conversational partner
Learn human values through interaction
Answering Questions and Following Commands
1. Is it possible to learn to follow commands using
end-to-end reinforcement learning without any
pretraining for vision or language?
2. Can learning question answering help the agent follow commands?
3. Can the machine understand words under new
context not seen in training?
Haonan Yu, Haichao Zhang, Wei Xu, “Interactive Grounded Language
Acquisition and Generalization in a 2D World”, ICLR 2018
Answering questions and following commands
“east” and “avocado” never
appear together in training
“watermelon” only appears in
answers during training
No QA training
We can generalize to word combinations
never seen in training
We can generalize to questions containing
words never seen in training
Held out X (%): X% of words/word combinations are held out from training
Much longer delay of reward
More visual variations
“Navigate to the dog!”
Navigation in a 3D Environment
Guided Feature Transformation
Haonan Yu, Xiaochen Lian, Haichao Zhang, Wei Xu, “Guided Feature Transformation (GFT):
A Neural Language Grounding Module for Embodied Agents”, CoRL 2018
Navigation in 3D environment
Demo
the object besides candle is your target .
please move to the object that is front of the basketball
can you reach the object right of toilet ?
go to the object to the right of bike please .
reach the location between car and trampoline please.
please navigate to the grid between gift and tower .
please navigate to the grid between bucket and chair .
please move to the object that is front of basketball .
Learning to Speak and Remember
1. How to learn to speak by talking with other people?
2. What information should be remembered?
3. How to utilize knowledge in memory?
Haichao Zhang, Haonan Yu, Wei Xu “Interactive Language Acquisition with One-Shot
Visual Concept Learning through a Conversation Game” ACL 2018
Rewards are given for each learner response based on its appropriateness
Learning to speak and remember
Memory Augmented Imitation + Behavior Shaping
What is this? It is a bird.
Trained end-to-end using gradient descent over Imitation Cost + Reinforce Cost
T: Virtual teacher
L: Learner (machine)
T: i see grape
L: watermelon grape watermelon
T: tell what you see
L: see see see see see
T: there is grape
L: grape grape watermelon
T: i can observe coconut
L: fox watermelon watermelon
What we have now:
Learning to understand and use simple
language, memorize useful information, and
execute simple commands through
interactions with a virtual teacher in a virtual
environment
What we will do in the future:
Simple → complex
Virtual → real
AI Research at Horizon Robotics
About the company
A leading technology powerhouse for edge AI platforms
Provides algorithms, processors and hardware jointly optimized for high-performance,
low-power and low-cost edge AI capabilities
CES 2019 Innovation Award
General AI Lab @ Silicon Valley
Research towards the company’s long term vision for artificial general intelligence
Build machines that can learn skills and knowledge unspecified at design time
Applied AI Lab @ Silicon Valley
Applied research focusing on near term needs
Developing novel AI technologies that are critical to our current products
Good afternoon, everyone. I am Wei Xu from Horizon Robotics. Today I am going to talk about our recent work on language learning in an interactive and embodied setting.
In 1950, in the same article where the famous Turing test was proposed, Turing also proposed a solution: “Instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain.” There are several advantages to this approach. First, there are so many things that a human adult can do that it would be too expensive and difficult to solve each one of them individually. Second, emphasizing that all the skills and knowledge of the machine are acquired through its own learning ensures that the machine will be able to learn new skills and new knowledge unspecified at design time. Third, learning in a developmental way lets the machine gradually proceed from easier tasks to more difficult tasks, which can make the learning easier. This is like curriculum learning, which has been found to be effective in many difficult learning problems.
For embodied learning, the learning experiences come from the machine’s physical interactions with its environment. By actually doing things and observing the effects, the machine can learn a lot of commonsense knowledge about the environment. This kind of knowledge is typically very hard to capture with rules or a static dataset. The self-driving car is a great example. Waymo recently announced that the total mileage of their cars has exceeded 10 million miles. Yet they are still not fully ready for deployment. On the other hand, we all know from our experience that a human can learn to drive very well with a few hundred miles of practice. A key difference between the self-driving car and the human is that the human has a lot of commonsense knowledge about the world. For example, even without learning to drive, a human driver knows which situations are unsafe, which obstacles should be avoided, and so on. But for self-driving, all of this commonsense knowledge has to be either coded as rules or obtained from a huge amount of driving data.
Embodied learning is also very helpful for understanding language. In order for the machine to understand and use language, it needs to connect word sequences with the actual objects and events in the environment. ……..
Why should the machine learn in an interactive way? There are several reasons. First, a useful robot needs to interact with humans, so it should be able to understand and communicate effectively with them. Second, it is easier for humans to teach machines directly using language than by writing code. And humans are great teachers because they are good at adjusting their teaching based on the state of the learner. Third, in order to be able to use language, the machine needs to learn the effects of speaking by observing feedback from its conversational partner. Finally, through interaction with humans, the machine can learn human values, which is very important for making sure that what it does is consistent with those values.
So I’ve talked about our motivation for learning language in an interactive and embodied setting. In the rest of the talk, I will talk about our recent work in this direction.
The first one is about learning to answer questions and follow commands. This work was published at this year’s ICLR conference. There are two problems we want to study in this paper.
Here is the problem setup. We developed a 2D simulator. For each session, we generate a random map, a question, and an instruction. The answer to the question is provided as direct supervision. The agent is given a reward based on whether it successfully executes the instruction. At test time, the agent is given commands with words or word combinations never seen in training commands or questions.
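To make the held-out evaluation concrete, here is a minimal sketch of how word combinations could be split between training and test sessions. The vocabulary, the command template, and the function names are all assumptions made for illustration; they are not taken from our simulator.

```python
import itertools
import random


def split_word_pairs(directions, objects, held_out_fraction, seed=0):
    """Split all (direction, object) pairs into a training set and a held-out set."""
    pairs = list(itertools.product(directions, objects))
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_held_out = int(len(pairs) * held_out_fraction)
    return set(pairs[n_held_out:]), set(pairs[:n_held_out])


def make_command(direction, obj):
    """Hypothetical command template, not the simulator's actual grammar."""
    return f"please move to the {direction} of the {obj}"


# Hypothetical toy vocabulary, chosen only for illustration.
directions = ["east", "west", "north", "south"]
objects = ["avocado", "watermelon", "grape", "coconut"]

train_pairs, test_pairs = split_word_pairs(directions, objects, held_out_fraction=0.25)

# Training sessions draw only from train_pairs; test sessions only from test_pairs.
train_commands = [make_command(d, o) for d, o in sorted(train_pairs)]
test_commands = [make_command(d, o) for d, o in sorted(test_pairs)]
```

With this kind of split, a pair such as “east” and “avocado” can be kept out of every training command while each word may still appear in other combinations. A random split like this one does not guarantee that every individual word still occurs in training, so a real setup would need to check that separately.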
This is the high-level structure of our model. I won’t go into the details of the model. What I want to say here is that we designed the structure with a focus on its generalization ability.
This is a short video demo showing how the agent navigates by following commands. The current command is “please move to the object that is front of basketball”. The agent needs to approach the toilet paper from the direction where it is in front of the basketball. After it finishes a task, a new map and command are generated.
So far our agent is able to understand some language. In this work, we want the agent to learn to use language through conversation.
Here is the problem setup. Initially, the agent has zero language ability: it can neither understand nor use language, just like a newborn baby.
This is the high-level structure of our model. First, it needs to have a memory module because it needs to remember information coming from teacher utterances and images. The vision module generates the visual representation. The interpreter module is for understanding the teacher’s utterance and deciding whether to store things into memory. The speaker module is responsible for generating responses based on its understanding of the teacher’s utterances and information retrieved from memory. The whole system is trained by predicting the teacher’s word sequence and by using the rewards that indicate the appropriateness of the response.
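As a rough illustration of what the memory module is for, the sketch below stores a visual feature together with a word from the teacher and later retrieves the word whose stored feature is most similar to a new image feature. This is a deliberately simplified, hard nearest-neighbour lookup; the class name, method names, and feature vectors are assumptions for illustration, and our actual model is a learned neural network trained end-to-end.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EpisodicMemory:
    """Hypothetical key-value memory: visual features are keys, words are values."""

    def __init__(self):
        self.keys = []
        self.values = []

    def write(self, visual_feature, word):
        """Store a word the teacher used together with what was seen at that time."""
        self.keys.append(list(visual_feature))
        self.values.append(word)

    def read(self, query_feature):
        """Return the stored word whose key is most similar to the query, or None."""
        if not self.keys:
            return None
        similarities = [cosine(query_feature, key) for key in self.keys]
        best = max(range(len(similarities)), key=similarities.__getitem__)
        return self.values[best]


memory = EpisodicMemory()
memory.write([1.0, 0.0], "bird")   # teacher: "It is a bird."
memory.write([0.0, 1.0], "grape")  # teacher: "there is grape"
answer = memory.read([0.9, 0.1])   # later question: "What is this?"
```

Writing one entry per new object is what allows a concept to be used after a single exposure: nothing about the object has to be learned into the network weights before it can be named.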
I will skip the details of the model. Just note that it’s trained end-to-end using gradient descent over Imitation Cost + Reinforce Cost.
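For readers who want a concrete picture of the combined objective, here is a minimal scalar sketch. It assumes the imitation cost is the negative log-likelihood of the teacher's words under the learner's model, and the reinforce cost is the standard REINFORCE surrogate for a sampled response. The function names, the baseline, and the weighting term are assumptions for illustration; the exact formulation in the paper may differ.

```python
import math


def imitation_cost(teacher_word_probs):
    """Negative log-likelihood of the teacher's words under the learner's model."""
    return -sum(math.log(p) for p in teacher_word_probs)


def reinforce_cost(sampled_word_probs, reward, baseline=0.0):
    """REINFORCE surrogate: minimizing it raises the probability of
    responses whose reward is above the baseline."""
    log_prob = sum(math.log(p) for p in sampled_word_probs)
    return -(reward - baseline) * log_prob


def total_cost(teacher_word_probs, sampled_word_probs, reward,
               baseline=0.0, rl_weight=1.0):
    """Sum of the two costs; rl_weight is a hypothetical balancing factor."""
    return (imitation_cost(teacher_word_probs)
            + rl_weight * reinforce_cost(sampled_word_probs, reward, baseline))


# Probabilities the learner assigned to each teacher word and to each word
# of its own sampled response (made-up numbers).
cost = total_cost(teacher_word_probs=[0.5, 0.5],
                  sampled_word_probs=[0.5],
                  reward=1.0)
```

In an actual training loop these probabilities would come from the network, and an automatic differentiation library would supply the gradient of the total cost with respect to the network parameters.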
Here I show some dialog examples. This is a dialog before learning. The agent just generates some garbage responses, just like a newborn baby. Then here are dialogs after learning. I want to mention that the machine never saw these types of objects during training. From these dialogs we can see that the machine learned several things. It can confirm statements from the teacher. It can actively seek information by asking questions. It can remember relevant information provided by the teacher so that it can later use it for answering questions. And it can also answer the teacher’s questions if it knows the answer. It somehow learned to use shape as the major cue to differentiate objects. I want to emphasize that, unlike most chatbots, where the behavior of the bot is pretty much designed by humans, here none of these behaviors are programmed. The machine learned all these behaviors through its interaction with the teacher, similar to how a baby learns from its parents.
In this final slide, I am going to say a little bit about Horizon Robotics. It’s a leading technology powerhouse for edge AI platforms. Its current focus is providing algorithms, processors and hardware jointly optimized for high-performance, low-power and low-cost edge AI capabilities. And I want to share some good news with you: we just received the CES 2019 Innovation Award in Vehicle Intelligence and Self-Driving Technology. We have two AI labs in Silicon Valley. One is the General AI Lab. It’s doing the kind of research I just talked about, building machines that can learn skills and knowledge unspecified at design time. We also have the Applied AI Lab. It’s doing applied research focusing on the near-term needs of the company, developing novel AI technologies that are critical to our current products. We are actively hiring. If you are interested, please visit either of these two websites.