Adversarial actor-critic method for task and motion planning problems using planning experience

Beomjoon Kim,
Leslie Pack Kaelbling,
Tomas Lozano-Perez
Massachusetts Institute of Technology

Class of problems: sequential manipulation
·Example problem:

Action-space representation: high-level operators
·Pick(O,P,G)
O: object to be picked
P: robot base pose
G: grasp
Pick motion

Approaches
·Plan everything
·Planning is hard; learning is hard

An approach for planning
·Random sampling + graph search
·At every node, sample a finite number of actions
·Search with the sampled actions
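
The sampling-and-search loop above can be sketched in a few lines. This is a minimal illustration on a toy 1-D problem, not the planner used in the talk: `sampled_search`, `toy_step`, the uniform sampler, and the goal test are all assumptions made for the example.

```python
import random
from collections import deque


def sampled_search(start, is_goal, sample_action, step,
                   n_samples=5, max_nodes=5000, seed=0):
    """Breadth-first search that expands a finite set of sampled actions per node.

    sample_action(state, rng) -> a continuous action (hypothetical sampler)
    step(state, action)       -> next state, or None if the action is infeasible
    """
    rng = random.Random(seed)
    queue = deque([(start, [])])
    expanded = 0
    while queue and expanded < max_nodes:
        state, plan = queue.popleft()
        if is_goal(state):
            return plan
        expanded += 1
        for _ in range(n_samples):           # finite number of actions per node
            action = sample_action(state, rng)
            next_state = step(state, action)
            if next_state is not None:       # keep feasible edges only
                queue.append((next_state, plan + [action]))
    return None


def toy_step(state, action):
    """Toy dynamics: move along a line, staying inside [-1, 5]."""
    next_state = state + action
    return next_state if -1.0 <= next_state <= 5.0 else None


plan = sampled_search(
    start=0.0,
    is_goal=lambda s: s >= 2.0,
    sample_action=lambda s, rng: rng.uniform(-1.0, 1.0),  # uniform sampler
    step=toy_step,
)
```

The `sample_action` argument is the only place where the sampler enters, so a learned stochastic policy could be passed in instead of the uniform one without changing the search.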

Challenge 1: infinite branching factor
·Pick operator example

P: robot base pose
G: grasp
Pick motion

Challenge 2: expensive edge evaluation

P: robot base pose
G: grasp
Pick motion
IF: PathExists(P), IKExists(G,O)
THEN: Picked(O)
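
The Pick operator and its preconditions can be written as a small data structure plus a check. This is a sketch under stated assumptions: `Pick`, `apply_pick`, and the two toy checks are hypothetical stand-ins, where a real system would call an IK solver and a motion planner. The counters only make visible that each edge evaluation costs one or two such calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pick:
    obj: str           # O: object to be picked
    base_pose: tuple   # P: robot base pose, e.g. (x, y, theta)
    grasp: tuple       # G: grasp parameters


def apply_pick(op, path_exists, ik_exists, stats):
    """Return 'Picked(O)' if both preconditions hold, otherwise None.

    path_exists and ik_exists stand in for a motion planner and an IK solver;
    stats counts how many of these calls the edge evaluations cost.
    """
    stats["ik_calls"] += 1
    if not ik_exists(op.grasp, op.obj):
        return None
    stats["path_calls"] += 1
    if not path_exists(op.base_pose):
        return None
    return f"Picked({op.obj})"


# Placeholder checks (assumptions, not real feasibility tests).
def toy_ik_exists(grasp, obj):
    return grasp[0] > 0.0


def toy_path_exists(base_pose):
    return abs(base_pose[0]) < 1.0


stats = {"ik_calls": 0, "path_calls": 0}
ok = apply_pick(Pick("box", (0.5, 0.0, 0.0), (0.1,)),
                toy_path_exists, toy_ik_exists, stats)
bad = apply_pick(Pick("box", (0.5, 0.0, 0.0), (-0.1,)),
                 toy_path_exists, toy_ik_exists, stats)
```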


Approaches
·Plan everything: slow
·Learn a complete solution: not robust
·Learn a stochastic policy to guide planning

Operator policy learning problem formulation
·Assume that we are given a set of high-level operators
·Our objective is to learn a set of operator policies
·That maximizes
Continuous parameters of the operator

Use an RL algorithm?
·Our problems are...

Using RL algorithms is difficult in our problems
·Expensive to generate data
·IK and motion planning calls for each edge
·Difficult to explore
·Small feasible action regions make it difficult to get a meaningful reward
·Long-term action dependencies

Solution: Use planning experience

Planning experience dataset

Two good things about Dpl
·Only feasible operator instances are included
·Exploration is guided by the planning strategy

Using Dpl naively: Learning from demonstrations
·Apply supervised learning:
$$\min_{\theta} \sum_{(s_i,\kappa_i)\in D_{pl}} \|\pi_\theta(s_i) - \kappa_i\|^2$$
·Problem: actions that are close in Euclidean space do not necessarily have the same feasibility
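
A toy example shows why a Euclidean loss can be a poor fit for feasibility. The 1-D action space, the obstacle interval, and the demonstration values below are invented for illustration. For a single state, the squared-error minimizer is the mean of the demonstrated actions, and that mean can land in the infeasible region even when every demonstration is feasible.

```python
import numpy as np


# Hypothetical 1-D action space: an obstacle occupies (-0.5, 0.5),
# so only actions with |action| >= 0.5 are feasible.
def is_feasible(action):
    return abs(action) >= 0.5


# Feasible demonstrations for the same state: go left or go right of the obstacle.
demos = np.array([-1.0, -0.9, 0.9, 1.0])
all_demos_feasible = all(is_feasible(a) for a in demos)

# For one state, the squared-error minimizer is the mean of the targets.
mse_action = demos.mean()
mse_action_feasible = bool(is_feasible(mse_action))
```

Here every demonstration is feasible, yet the regression output sits in the middle of the obstacle.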

Better LfD: Adversarial training
·Discriminator training (loss function learning):
$$\max_{\alpha} \sum_{(s_i,\kappa_i)\in D_{pl}} \hat{Q}_\alpha(s_i,\kappa_i) - \hat{Q}_\alpha(s_i,\pi_\theta(s_i))$$
·Generator training:

Adversarial training: the good
·The good: Loss function is learned

Adversarial training: the bad
·The bad: not all data points lead to a goal state
(suboptimal demonstrations)

Adversarial Monte-Carlo (AdMon):
Actor-critic + suboptimal demonstrations
·Adversarial Q-function training:
·Policy training:
·Annotated terms: regression on Q, adversarial LfD

Experiments - hypotheses
1. AdMon is more data-efficient than pure RL or pure LfD
2. Learning can improve planning efficiency
3. Planning makes the learned policy robust

Domain description
·Reward function:
·-1 if the obstacle is not cleared
·1 if the obstacle is cleared
·0 if it is simply picked
·Problem instance:
·number of obstacles
·poses of obstacles
·shapes of obstacles
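
The reward function above fits in a few lines. This sketch makes one assumption that the slide does not state: the +1 / -1 case applies to the step that follows a pick, called "place" here. The episode is made up.

```python
def reward(operator, obstacle_cleared):
    """0 for a pick; otherwise +1 if the obstacle is cleared and -1 if not.

    Assumption: +1 / -1 applies to the step after a pick ("place" here).
    """
    if operator == "pick":
        return 0
    return 1 if obstacle_cleared else -1


# Hypothetical episode: two obstacles cleared, then one failed attempt.
episode = [("pick", False), ("place", True),
           ("pick", False), ("place", True),
           ("pick", False), ("place", False)]
total = sum(reward(op, cleared) for op, cleared in episode)
```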

Executing learned policy by itself: Pure LfD
Pure LfD: GAIL

Executing learned policy by itself: Pure RL
Pure RL: DDPG, PPO

Policy by itself: AdMon is more data-efficient
Both RL + LfD: AdMon
>10x improvement in data efficiency

Learning makes planning efficient
·Raw planner vs. AdMon + planning
·>2x improvement in planning efficiency

Planning makes the learned policy robust
·Baseline: AdMon w/o planner
·>2x improvement in solution quality

Bad local optima without special treatment of Dpl

Learned policies without planning

Learning improves planning efficiency

reached 95% optimal around 600s

reached 95% optimal around 1500s

Possible question - problem setup
·What about which object to pick?
·Assumed to be given
·How do you define rewards?
·1 if object cleared, -1 if not, 0 if simply picked
·What is the state space?
·It is represented by an approximation of the configuration-space
obstacles that we call key configurations: we represent the
state by the collisions at sparsely yet carefully chosen configurations

Possible criticisms
·Your planner is awful
·Yes indeed. But if you give me a good sampling-based planner,
which typically uses a uniform sampler, then as we have shown
in our experiments, we can do much better by learning the
stochastic policy and using it instead of the uniform sampler

Possible criticisms
·Does learning a policy always help in improving planning
efficiency?
·Short answer: No. It helps only if your problem is
hard enough that uniform sampling suffers.
·This hardness can be measured by two properties:
expensive edge checking, and a small ratio of the solution
region to the entire action space. These two characteristics are
what make a uniform policy a bad idea, because in
expectation you need to try a lot of samples, and each
trial takes a lot of time
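
A back-of-the-envelope calculation makes the two hardness properties concrete. Assume, for illustration only, that uniform samples are independent, that a fraction $p$ of the action space is feasible, and that each edge check costs $c$ seconds. The number of samples $N$ until the first feasible one is then geometric, and the expected time $T$ per feasible action is:

$$\mathbb{E}[N] = \frac{1}{p}, \qquad \mathbb{E}[T] = \frac{c}{p}$$

With made-up values $p = 0.01$ and $c = 0.5$ s, this gives $\mathbb{E}[N] = 100$ samples and $\mathbb{E}[T] = 50$ s per feasible action. Under the same assumptions, a sampler that raised $p$ to $0.1$ would bring this down to 5 s.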

Possible questions
·Why does AdMon perform better?
·1. Local optima
·2. Better exploration
·Why does GAIL perform worse than other methods in the
second domain?
·The second domain has many more suboptimal branches than
the first. GAIL treats every branch as an optimal demonstration,
so the data are very noisy, which hurts performance

Possible questions
·Your Q-function learning is biased
·Yes it is. If we had an infinite amount of resources, we
would perhaps not need the adversarial term. But,
as in a typical statistical learning setup, we are trying
to make up for the limited amount of data by guiding
the training of the policy with an adversarial bias.
·There is also a good reason to add the term: it
pushes the state-action distribution toward the one
encountered in the Dpl dataset.

Possible questions
·How come other learning approaches are worse than
uniform?
·A uniform policy, if sampled enough times, guarantees
probabilistic completeness.
·The other algorithms fall into local optima (e.g. moving
the first few objects but not all) that prevent them from making
progress

Relation to guided policy search
·Guided policy search:
·Requires a differentiable reward function because it performs
trajectory optimization, whereas our domains have
discontinuous reward functions (e.g. step functions)
·In GPS, trajectory optimization yields at least locally optimal
trajectories, so a policy can be learned from them by
supervised learning. In our case, each branch in the tree is not
guaranteed to be even locally optimal. All we know is that each
operator instance is feasible; the branches are, in fact, suboptimal
demonstrations whose degree of suboptimality varies a lot

Relation to other pure actor critic algorithms
·We use planning experience to explore the reachable
state space, by learning a generator for feasible operator
instances
·Other actor-critic algorithms do not have this capability,
and they waste a lot of exploration effort trying to
learn which actions are infeasible

Relation to GAIL
·GAIL is an inverse reinforcement learning algorithm that
assumes that the demonstrations that you have are optimal
·However, our planning experience dataset is not optimal:
only one branch leads to the solution. So if we treat all
of the dataset as optimal demonstrations, then the dataset
will be very noisy
·Alternatively, we can use just the branches that actually led
to the goal. However, because there are very few of them
(e.g. 5 out of 50 in the conveyor belt domain), it is difficult
to learn anything meaningful

GAIL
·Works by:
·Learning a reward function that maximizes the rewards of the
demonstration dataset and minimizes the rewards of the current
policy's actions
·Taking a DDPG gradient step to maximize the sum of the
learned rewards

Relation to AlphaX
·AlphaGo and AlphaZero are systems for solving two-player
games
·If you view AlphaX as a solver for two-player games, then
every difference between a sequential mobile manipulation
problem and a two-player game applies:
·Continuous vs. discrete action spaces
·Expensive feasibility checks vs. lookup in the rulebook
·Key configuration representation vs. image plane on the game
board
·Source of data: calling an external planner to solve the problem vs.
generating RL episodes with the minimax trick

AlphaX as a general learning algorithm
·If you view AlphaX as a general learning
algorithm for guiding search, then there are some
similarities in terms of the algorithm
·Learning from planning experience
·AlphaZero learns a network that predicts both a value and a
stochastic policy. We also learn both, although there is a
bias term in our value function
·Once learning is done, we use the result in a heuristic
forward search, whereas AlphaZero uses MCTS. We could also
use it in MCTS.

·One key similarity:
·The policy data, pi, for AlphaZero is generated from the result of the search in the Monte Carlo tree
·In our case, all of the policy data is the result of feasibility checks
