
20181125 pybullet

Slides presented at 大江橋Pythonの会 #4 (Oebashi Python meetup #4).

  1. 1. Reinforcement learning with bullet simulator 25 Nov 2018 Taku Yoshioka
  2. 2. Disclaimer • The equations in these slides are notationally inconsistent; many are adapted from the textbook by Sutton and Barto, while some come from other documents.
  3. 3. Framework of reinforcement learning: Markov decision process, value function, objective
     Reinforcement learning methods: action value function, function approximation, parametrized policy, actor-critic method, advanced learning methods
     Python modules: OpenAI Gym, PyBullet, TensorFlow Agents, OpenAI Baselines
     Example: installation with python venv, running PyBullet, training with TensorFlow Agents on a pendulum environment
  4. 4. Framework of reinforcement learning: Markov decision process, value function, objective
     Reinforcement learning methods: action value function, function approximation, parametrized policy, actor-critic method, advanced learning methods
     Python modules: OpenAI Gym, PyBullet, TensorFlow Agents, OpenAI Baselines
     Example: installation with python venv, running PyBullet, training with TensorFlow Agents on a pendulum environment
  5. 5. Markov Decision Process • Agent's policy: stochastic action selection conditioned on the current state • Environment: stochastic state transition and stochastic reward conditioned on the state-action pair (the slide gives formulas for the distribution of state transition and reward and for the policy; see below)
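A hedged reconstruction of the two quantities named on this slide, in the notation of Sutton and Barto (the original equation images are not part of this text):

    p(s', r \mid s, a) = \Pr\{ S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a \}   % distribution of state transition and reward
    \pi(a \mid s) = \Pr\{ A_t = a \mid S_t = s \}                                   % policy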
  6. 6. State value function and expected reward • State value function: expectation of the discounted sum of future rewards obtained under the MDP, starting from an initial state • Expected reward: expectation of the state value function over initial states (formulas below) • Training of the agent: maximize the expected reward with respect to the policy
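A hedged reconstruction of the two definitions, again following Sutton and Barto; the symbol \rho_0 for the initial-state distribution is my own choice, not from the slide:

    v_\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s \right]   % state value function
    J(\pi) = \mathbb{E}_{s_0 \sim \rho_0}\!\left[ v_\pi(s_0) \right]                                        % expected reward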
  7. 7. Example: pole balancing • State: • angle of the pole • angular velocity of the pole • vision (the image shown on the slide) • Action: force to move the cart right or left • Reward: +1 while the pole stays within a threshold angle of vertical
  8. 8. Reinforcement learning objective • Optimize the policy of the agent through interaction with the environment • Reward function is given (e.g., game) or designed (e.g., robot) • MDP model might be (partially) given (e.g., Go) or not given (e.g., robot)
  9. 9. Framework of reinforcement learning: Markov decision process, value function, objective
     Reinforcement learning methods: action value function, function approximation, parametrized policy, actor-critic method, advanced learning methods
     Python modules: OpenAI Gym, PyBullet, TensorFlow Agents, OpenAI Baselines
     Example: installation with python venv, running PyBullet, training with TensorFlow Agents on a pendulum environment
  10. 10. Action value function • Expected reward given current state-action pair • Can be used to select an action by maximizing it with respect to action • Implicit representation of policy (value iteration) • Brute-force for discrete (small) action space • Gradient-free optimization like CEM • Can be used as a guide to improve policy (policy iteration, later)
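A hedged sketch of the action value function and the greedy action selection it enables, in Sutton and Barto's notation:

    q_\pi(s, a) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s, A_t = a \right]
    a^* = \operatorname*{arg\,max}_a \, q(s, a)   % implicit (greedy) policy derived from the value function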
  11. 11. Bellman equation • Recursive relationship of value function with a fixed policy pi • For optimal policy • Basis of learning value function (DP, Q-learning, actor-critic)
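The Bellman equations referred to here, reconstructed in Sutton and Barto's notation (the slide's own formulas are not in this text):

    v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]   % fixed policy \pi
    v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]                     % optimal policy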
  12. 12. Bellman equation for action value function • For optimal policy • Relationship with state value function
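A hedged reconstruction of the corresponding equations for the action value function:

    q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]   % Bellman optimality equation for q
    v_*(s) = \max_a q_*(s, a)                                                                     % relationship with the state value function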
  13. 13. Updating the action value function • SARSA: using the sampled next state and action (on-policy) • Q-learning: using the greedy action under the current value estimate (off-policy); the update rules are written below
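The two tabular update rules as given by Sutton and Barto, with learning rate \alpha (a hedged reconstruction):

    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]    % SARSA (on-policy)
    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]   % Q-learning (off-policy)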
  14. 14. Function approximation for value function • DQN: training deep network that represents action value function https://towardsdatascience.com/deep-double-q-learning-7fca410b193a • Select an action based on the output of the deep network https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/deep_q_learning.html
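A minimal sketch of selecting an action from the output of a Q-network; this is not the code from the linked articles, and the network shape, obs_dim and n_actions are hypothetical. It uses tf.keras as shipped with the tensorflow==1.12 installed later in these slides.

    import numpy as np
    import tensorflow as tf

    obs_dim = 4      # hypothetical observation dimension (e.g. CartPole)
    n_actions = 2    # hypothetical number of discrete actions

    # Q-network: maps an observation to one estimated action value per action
    q_network = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(obs_dim,)),
        tf.keras.layers.Dense(n_actions),
    ])

    def select_action(obs, epsilon=0.1):
        """Epsilon-greedy selection based on the Q-network output."""
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)       # explore
        q_values = q_network.predict(obs[None, :])    # shape (1, n_actions)
        return int(np.argmax(q_values[0]))            # exploit: maximize over actions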
  15. 15. Parametrized policy • Representation of policy by parametric function Example: linear function + softmax • Policy gradient theorem (PGT) • Applicable to continuous action space
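A hedged reconstruction of the slide's two formulas: the linear-softmax policy example and the policy gradient theorem; \phi(s, a) denotes a feature vector, a symbol I introduce here:

    \pi(a \mid s, \theta) = \frac{\exp\!\left(\theta^\top \phi(s, a)\right)}{\sum_b \exp\!\left(\theta^\top \phi(s, b)\right)}          % linear function + softmax
    \nabla_\theta J(\theta) \propto \mathbb{E}_\pi\!\left[ q_\pi(S_t, A_t) \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \right]       % policy gradient theorem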
  16. 16. Actor-critic methods • Plugging a parametric value function, instead of the sampled reward, into the policy gradient equation, using the n-step cumulative reward • The difference between the n-step return and the current value estimate is regarded as the improvement obtained by the sampled action over the value of the current policy (advantage function) • This update rule is used in advantage actor-critic (A2C); see the sketch below
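A hedged reconstruction of the update: the n-step return with a parametric value estimate \hat{v} replaces the sampled reward, and its difference from \hat{v}(S_t) is the advantage:

    \hat{A}_t = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n \hat{v}(S_{t+n}) - \hat{v}(S_t)      % n-step advantage estimate
    \theta \leftarrow \theta + \alpha \, \hat{A}_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)    % actor (policy) update used in A2C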
  17. 17. Deterministic policy gradient • More efficient than training a stochastic policy with the PGT • Deterministic policy gradient theorem (DPGT) (Silver et al., 2014) • Applied in an extensive study of controlling robots with deep networks (Lillicrap et al., 2015) • Exploration via action noise or parameter noise, since the policy is deterministic
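The deterministic policy gradient theorem of Silver et al. (2014), written for a deterministic policy \mu_\theta(s) (a hedged reconstruction, not copied from the slide):

    \nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_\theta \mu_\theta(s) \, \nabla_a q^{\mu}(s, a) \big|_{a = \mu_\theta(s)} \right]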
  18. 18. TRPO and PPO • Ensuring improvement of policy with a lower bound of the objective • Trust region policy optimization (TRPO): maximize the lower bound in a trust region, which is close to the current policy parameters (Schulman et al., 2015) • Proximal policy optimization (PPO): using simple clipped constraint instead of KL divergence (Schulman et al., 2017)
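The PPO clipped surrogate objective from Schulman et al. (2017), with r_t(\theta) the probability ratio between the new and old policies (a hedged reconstruction):

    r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
    L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta) \hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t \right) \right]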
  19. 19. (Derivation of the TRPO objective; the slide's equations are summarized here by their captions) • Expected reward • Difference after policy update • Written with the state distribution of the current policy • Lower bound of the expected reward • Constrained maximum • Importance sampling • TRPO objective
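The endpoint of the derivation listed above is the importance-sampling surrogate with a trust-region (KL) constraint from Schulman et al. (2015); a hedged reconstruction:

    \max_\theta \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \, \hat{A}_t \right]
    \text{subject to} \quad \mathbb{E}_t\!\left[ D_{\text{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\middle\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta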
  20. 20. Framework of reinforcement learning: Markov decision process, value function, objective
     Reinforcement learning methods: action value function, function approximation, parametrized policy, actor-critic method, advanced learning methods
     Python modules: OpenAI Gym, PyBullet, TensorFlow Agents, OpenAI Baselines
     Example: installation with python venv, running PyBullet, training with TensorFlow Agents on a pendulum environment
  21. 21. OpenAI Gym • Provides the interface of an RL environment through the base class Env

     class Env(object):
         """The main OpenAI Gym class."""

         def step(self, action):
             """Run one timestep of the environment's dynamics.
             Returns observation, reward, done and info.
             """
             raise NotImplementedError

         def reset(self):
             """Resets the state of the environment and returns an initial observation."""
             raise NotImplementedError

         def render(self, mode='human'):
             """Renders the environment."""
             raise NotImplementedError

         def seed(self, seed=None):
             """Sets the seed for this env's random number generator(s)."""
             logger.warn("Could not seed environment %s", self)
  22. 22. • Collection of environments to compare RL algorithms • Minimal example interacting with the CartPole-v0 environment • python scripts/cartpole.py

     # https://gym.openai.com/docs/#environments
     import gym

     env = gym.make('CartPole-v0')
     env.reset()
     for _ in range(1000):
         env.render()
         env.step(env.action_space.sample())  # take a random action
  23. 23. • Show the list of registered environments

     from gym import envs
     print(envs.registry.all())
     #> [EnvSpec(DoubleDunk-v0), EnvSpec(InvertedDoublePendulum-v0),
     #   EnvSpec(BeamRider-v0), EnvSpec(Phoenix-ram-v0), EnvSpec(Asterix-v0),
     #   EnvSpec(TimePilot-v0), EnvSpec(Alien-v0), EnvSpec(Robotank-ram-v0),
     #   EnvSpec(CartPole-v0), EnvSpec(Berzerk-v0), EnvSpec(Berzerk-ram-v0),
     #   EnvSpec(Gopher-ram-v0), ...
  24. 24. OpenAI baselines • Set of high-quality implementations of RL algorithms: A2C, ACER, ACKTR, DDPG, DQN, GAIL, HER, PPO, TRPO • There is a fork “Stable Baselines”: unified structure for algorithms, PEP8 compliant, documented, more tests & more code coverage
  25. 25. TensorFlow Agents • Optimized infrastructure for reinforcement learning • Multiple parallel environments, batch PPO • Validated with environments
  26. 26. PyBullet • Python wrapper of Bullet physics simulator • URDF/SDF support, forward dynamics, inverse kinematics, collision check, 2D/depth cameras, virtual reality • Used in simulation-to-real transfer of controller for quadruped robot (Tan et al., 2018)
  27. 27. Framework of reinforcement learning: Markov decision process, value function, objective
     Reinforcement learning methods: action value function, function approximation, parametrized policy, actor-critic method, advanced learning methods
     Python modules: OpenAI Gym, PyBullet, TensorFlow Agents, OpenAI Baselines
     Example: installation with python venv, running PyBullet, training with TensorFlow Agents on a pendulum environment
  28. 28. Installation and testing • Python 3.6, venv

     $ brew install cmake openmpi
     $ cd $WORKDIR
     $ python3 -m venv pybullet-env
     $ source pybullet-env/bin/activate
     $ pip install tensorflow==1.12
     $ pip install gym==0.10.9
     $ git clone https://github.com/openai/baselines.git
     $ cd baselines
     $ pip install -e .
     $ cd ..
     $ pip install pybullet==2.3.8
     $ pip install ruamel-yaml==0.15.76
     $ pip install stable-baselines==2.2.1
     $ brew install ffmpeg  # for making video

  • Test pybullet and gym

     $ cd $WORKDIR/pybullet-env/lib/python3.6/site-packages/pybullet_envs/examples
     $ python kukaGymEnvTest.py
     $ python kukaCamGymEnvTest.py  # much slower
  29. 29. • python scripts/hello_pybullet.py

     import pybullet as p
     import time
     import pybullet_data

     physicsClient = p.connect(p.GUI)  # or p.DIRECT for non-graphical version
     p.setAdditionalSearchPath(pybullet_data.getDataPath())  # optionally
     p.setGravity(0, 0, -10)
     planeId = p.loadURDF("plane.urdf")
     cubeStartPos = [0, 0, 1]
     cubeStartOrientation = p.getQuaternionFromEuler([0, 0, 0])
     boxId = p.loadURDF("r2d2.urdf", cubeStartPos, cubeStartOrientation)
     for i in range(10000):
         p.stepSimulation()
         time.sleep(1. / 240.)
     cubePos, cubeOrn = p.getBasePositionAndOrientation(boxId)
     print(cubePos, cubeOrn)
     p.disconnect()
  30. 30. kukaGymEnvTest.py
  31. 31. kukaCamGymEnvTest.py
  32. 32. kukaGymEnvTest.py (note: code modified from the original for comparison)

     # Gym environment
     environment = KukaGymEnv(renders=True, isDiscrete=False)  # GUI

     motorsIds = []
     dv = 0.01
     motorsIds.append(environment._p.addUserDebugParameter("posX", -dv, dv, 0))
     motorsIds.append(environment._p.addUserDebugParameter("posY", -dv, dv, 0))
     motorsIds.append(environment._p.addUserDebugParameter("posZ", -dv, dv, 0))
     motorsIds.append(environment._p.addUserDebugParameter("yaw", -dv, dv, 0))
     motorsIds.append(environment._p.addUserDebugParameter("fingerAngle", 0, 0.3, .3))

     done = False
     while (not done):
         action = []
         for motorId in motorsIds:
             action.append(environment._p.readUserDebugParameter(motorId))

         # 1 step forward
         state, reward, done, info = environment.step(action)
         obs = environment.getExtendedObservation()  # Get more state info
  33. 33. kukaCamGymEnvTest.py (note: code modified from the original for comparison)

     # Gym environment with cameras
     environment = KukaCamGymEnv(renders=True, isDiscrete=False)  # GUI

     motorsIds = []
     dv = 1
     motorsIds.append(environment._p.addUserDebugParameter("posX", -dv, dv, 0))
     motorsIds.append(environment._p.addUserDebugParameter("posY", -dv, dv, 0))
     motorsIds.append(environment._p.addUserDebugParameter("posZ", -dv, dv, 0))
     motorsIds.append(environment._p.addUserDebugParameter("yaw", -dv, dv, 0))
     motorsIds.append(environment._p.addUserDebugParameter("fingerAngle", 0, 0.3, .3))

     done = False
     while (not done):
         action = []
         for motorId in motorsIds:
             action.append(environment._p.readUserDebugParameter(motorId))

         # 1 step forward
         state, reward, done, info = environment.step(action)
         obs = environment.getExtendedObservation()  # Get more state info
  34. 34. Run PPO for the pendulum environment • PyBullet + OpenAI Gym + TensorFlow Agents • Train a policy for the pendulum environment with PPO • Save the results as TensorFlow files (log, model)

     $ python -m pybullet_envs.agents.train_ppo --config=pybullet_pendulum --logdir=pendulum

     # In another terminal
     $ tensorboard --logdir=pendulum --port=2222
  35. 35. • Learning curve (mean score)
  36. 36. Create a video of an episode with the trained policy

     $ python -m pybullet_envs.agents.visualize_ppo --logdir=pendulum/xxxx-pybullet_pendulum/ --outdir=pendulum_video
  37. 37. Configurations • Environment parameters (pybullet-env/lib/python3.6/site-packages/pybullet_envs/agents/configs.py)

     def pybullet_pendulum():
         locals().update(default())
         env = 'InvertedPendulumBulletEnv-v0'
         max_length = 200
         steps = 5e7  # 50M
         return locals()

  • Register the gym-compatible pybullet environment (pybullet-env/lib/python3.6/site-packages/pybullet_envs/__init__.py)

     register(
         id='InvertedPendulumBulletEnv-v0',
         entry_point='pybullet_envs.gym_pendulum_envs:InvertedPendulumBulletEnv',
         max_episode_steps=1000,
         reward_threshold=950.0,
     )
  38. 38. Configurations • Algorithm (PPO) parameters (pybullet-env/lib/python3.6/site-packages/pybullet_envs/agents/configs.py)

     def default():
         """Default configuration for PPO."""
         # General
         algorithm = ppo.PPOAlgorithm
         num_agents = 30
         eval_episodes = 30
         use_gpu = False
         # Network
         network = networks.feed_forward_gaussian
         weight_summaries = dict(
             all=r'.*', policy=r'.*/policy/.*', value=r'.*/value/.*')
         policy_layers = 200, 100
         value_layers = 200, 100
         init_mean_factor = 0.1
         init_logstd = -1
         # Optimization
         update_every = 30
         update_epochs = 25
         optimizer = tf.train.AdamOptimizer
         update_epochs_policy = 64
         update_epochs_value = 64
         learning_rate = 1e-4
         # Losses
         discount = 0.995
         kl_target = 1e-2
         kl_cutoff_factor = 2
         kl_cutoff_coef = 1000
         kl_init_penalty = 1
         return locals()
  39. 39. Gym-compatible pendulum environment (pybullet-env/lib/python3.6/site-packages/pybullet_envs/gym_pendulum_envs.py) • Make a subclass of Env
  40. 40. # Modified from the original code to use the PyBullet API directly, to explain the basic idea

     import numpy as np

     class InvertedPendulumBulletEnv:

         def __init__(self):
             # Load the robot model and wrap the interfaces
             self.robot = InvertedPendulum()

         def _step(self, a):
             # self.robot.apply_action(a)
             # index_slider: index of the slider joint of the pendulum cart
             self._p.setJointMotorControl2(self.robot.objects[0],
                                           jointIndex=index_slider,
                                           controlMode=self._p.TORQUE_CONTROL,
                                           force=a)
             self._p.stepSimulation()

             # Return value is ((x, y, z), (a, b, c, d))
             state = self._p.getBasePositionAndOrientation(self.robot.objects[0])

             if self.robot.swingup:
                 reward = np.cos(self.robot.theta)
                 done = False
             else:
                 reward = 1.0
                 done = np.abs(self.robot.theta) > .2
             return state, reward, done, {}
  41. 41. • python scripts/hello_stable_baselines.py

     import gym

     from stable_baselines.common.policies import MlpPolicy
     from stable_baselines.common.vec_env import DummyVecEnv
     from stable_baselines import PPO2

     env = gym.make('CartPole-v1')
     # The algorithms require a vectorized environment to run
     env = DummyVecEnv([lambda: env])

     model = PPO2(MlpPolicy, env, verbose=1)
     model.learn(total_timesteps=10000)

     obs = env.reset()
     for i in range(1000):
         action, _states = model.predict(obs)
         obs, rewards, dones, info = env.step(action)
         env.render()
