Reinforcement Learning using OpenAI Gym
Muhammad Aleem Siddiqui
What is REINFORCEMENT LEARNING?
• In reinforcement learning, an agent takes actions in an environment and receives rewards. The ultimate goal is to maximize the total reward over time.
• A reinforcement learning algorithm, or agent, learns by interacting with its environment. The agent receives rewards for performing correctly and penalties for performing incorrectly. The agent learns without human intervention by maximizing its reward and minimizing its penalty.
Exploration And Exploitation
The only way to uncover the correct signal is to assume nothing, try out different things (explore), and learn to act optimally (exploit) based on the environment's feedback. Balancing exploration and exploitation is what reinforcement learning is all about.
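To make the trade-off concrete, here is a minimal sketch (not from the slides) of an epsilon-greedy rule on a toy three-armed bandit; the win rates and the value of epsilon are made-up numbers for illustration only.
CODE:
import random

# Toy bandit: three actions with different win rates (unknown to the agent).
true_win_rates = [0.3, 0.5, 0.8]   # hypothetical values, for illustration only
estimates = [0.0, 0.0, 0.0]        # running reward estimate per action
counts = [0, 0, 0]
epsilon = 0.1                      # fraction of steps spent exploring

for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(3)              # explore: try a random action
    else:
        action = estimates.index(max(estimates))  # exploit: best action so far
    reward = 1.0 if random.random() < true_win_rates[action] else 0.0
    counts[action] += 1
    # Incremental average of the rewards observed for this action
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # the estimates should roughly approach the true win rates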
What is an Agent, Environment, Action, Policy and Reward?
Agent: an algorithm, a robot, or a game
Environment: what the agent interacts with
Action: what the agent can do
Policy: which action to choose in a given state
Reward: what the agent receives for performing correctly
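As a rough, hypothetical sketch of how these five terms map onto objects in OpenAI Gym (the toolkit introduced later in this deck, using the classic pre-0.26 API):
CODE:
import gym

env = gym.make('CartPole-v0')    # Environment: what the agent interacts with
observation = env.reset()        # the agent's view of the environment's state

def policy(observation):
    # Policy: the rule for choosing an action given an observation.
    # Here it is just a random choice, to illustrate the interface.
    return env.action_space.sample()

action = policy(observation)                        # Action: what the agent can do
observation, reward, done, info = env.step(action)  # Reward: feedback for that action
env.close()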
Agent and Environment
Agent:
An agent can be a program or a robot that receives inputs from the environment and performs some action based on those inputs.
Environment:
An environment is the actual setting the agent interacts with. An environment needs to be represented in a way the agent can understand. In examples it is often a game, but it can be any real-world or artificial environment.
Action, Policy and Reward
Action:
The actual interaction an agent performs on the environment: moving in an environment, choosing the next move in a game, etc.
Policy:
The policy is the strategy for choosing an action given a state, in expectation of better outcomes.
Reward:
The metric that lets an agent understand whether its previous set of actions helped or hurt its overall goal.
Reinforcement Learning Process
Reinforcement Learning is the science of making optimal decisions from experience. Breaking it down, the process of Reinforcement Learning involves these simple steps:
1. Observing the environment
2. Deciding how to act using some strategy
3. Acting accordingly
4. Receiving a reward or penalty
5. Learning from the experience and refining our strategy
6. Iterating until an optimal strategy is found
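A minimal sketch of this loop against a Gym environment (introduced on the next slide); the random action choice stands in for a real strategy, and the learning step is left as a placeholder comment.
CODE:
import gym

env = gym.make('CartPole-v0')
for episode in range(5):
    observation = env.reset()               # 1. observe the environment
    done = False
    while not done:
        action = env.action_space.sample()  # 2.-3. decide how to act (here: randomly) and act
        observation, reward, done, info = env.step(action)  # 4. receive a reward or penalty
        # 5. a real agent would refine its strategy from (observation, reward) here
env.close()                                 # 6. iterate over episodes until the strategy is good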
OpenAI Gym
OpenAI is a non-profit AI research company, discovering and enacting the path to safe artificial general intelligence.
Gym is a toolkit for developing and comparing reinforcement learning algorithms.
The OpenAI Gym library is a Python library with a collection of environments that can be used with reinforcement learning algorithms.
Link: gym.openai.com
OpenAI Gym’s Environment
Here is an example of getting something running. This will run an instance of the "CartPole-v0" environment for 1000 time-steps, rendering the environment at each step.
CODE:
import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action
env.close()
Environment (Contd.)
Gym's "CartPole-v0" environment returns observations as a numpy array of 4 floating-point values:
1. Horizontal position of the cart
2. Horizontal velocity of the cart
3. Angle of the pole
4. Angular velocity of the pole
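As a short illustration, the four values can be unpacked directly from a CartPole-v0 observation (classic Gym API); the variable names match the hard-coded-policy example later in the deck.
CODE:
import gym

env = gym.make('CartPole-v0')
observation = env.reset()

# CartPole-v0 observation: [cart position, cart velocity, pole angle, pole angular velocity]
cart_pos, cart_vel, pole_ang, ang_vel = observation
print("cart position:", cart_pos, "cart velocity:", cart_vel)
print("pole angle:", pole_ang, "angular velocity:", ang_vel)
env.close()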
Functions of Gym
make(): Creates an environment.
reset(): Resets the environment to its default starting state and returns an initial observation.
render(): Opens a pop-up window displaying a simulation of the agent interacting with the environment.
step(): Performs the action chosen by the agent and returns a 4-tuple <observation, reward, done, info>; for CartPole the observation itself is a 4-valued numpy array.
sample(): Samples a random action from the environment's action space.
close(): Closes the environment after the actions have been performed.
CODE:
import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())
env.close()
OpenAI Gym’s Observations
Observations are the environment-specific information variables:
Observation (object): An environment-specific object representing your observation of the environment. For example, joint angles and joint velocities of a robot, or the board state in a board game.
Reward (float): Amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
Done (boolean): Whether it's time to reset the environment again. Most tasks are divided into well-defined episodes, and done being True indicates the episode has terminated. For example, perhaps the pole tipped too far, or you lost your last life.
Info (dict): Diagnostic information useful for debugging. It can sometimes be useful for learning. For example, it might contain the raw probabilities behind the environment's last state change. However, official evaluations of your agent are not allowed to use this for learning.
Observations (Contd.)
The process gets started by calling reset(), which returns an initial observation. A more proper way of writing the previous code, taking episodes and the done flag into account:
CODE:
import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} time steps".format(t+1))
            break
env.close()
OUTPUT:
[-0.03327757 0.5649743 -0.0374682 -0.87239967]
[-0.02197809 0.7605852 -0.05491619 -1.17662316]
[-0.00676638 0.95637585 -0.07844866 -1.48600365]
[ 0.01236114 1.15236136 -0.10816873 -1.80211802]
[ 0.03540836 1.3485132 -0.14421109 -2.12636579]
[ 0.06237863 1.15508926 -0.18673841 -1.88149287]
Episode finished after 11 time steps
Making a Hard-Coded Policy for the Agent
CODE:
import gym

env = gym.make('CartPole-v0')
observation = env.reset()
for t in range(1000):
    env.render()
    # Defining a hard-coded policy
    cart_pos, cart_vel, pole_ang, ang_vel = observation
    # Move the cart right if the pole is falling to the right;
    # the angle is measured off the straight vertical line
    if pole_ang > 0:
        # Move right
        action = 1
    else:
        # Move left
        action = 0
    # Perform the action
    observation, reward, done, info = env.step(action)
env.close()
Using Neural Network in Reinforcement Learning
[Figure: ReLU activation function, f(x) = max(0, x)]
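For reference, a short sketch of the ReLU activation, assuming the standard definition f(x) = max(0, x):
CODE:
import numpy as np

def relu(x):
    # ReLU passes positive values through and clamps negative values to zero
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]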
Using Neural Network In TensorFlow for Reinforcement Learning (Contd.)
Let's design a simple neural network that takes in the observation array, passes it through a hidden layer, and outputs the probability of going left (probability of going right = 1 - left).
CODE:
import tensorflow as tf
import gym
import numpy as np

# PART ONE: NETWORK VARIABLES #
# Observation space has 4 inputs
num_inputs = 4
num_hidden = 4
# Output is the probability the cart should go left
num_outputs = 1

initializer = tf.contrib.layers.variance_scaling_initializer()

# PART TWO: NETWORK LAYERS #
X = tf.placeholder(tf.float32, shape=[None, num_inputs])
hidden_layer_one = tf.layers.dense(X, num_hidden, activation=tf.nn.relu, kernel_initializer=initializer)
hidden_layer_two = tf.layers.dense(hidden_layer_one, num_hidden, activation=tf.nn.relu, kernel_initializer=initializer)
# Probability to go left (the output is fed from the first hidden layer;
# hidden_layer_two is defined but not used in this simple version)
output_layer = tf.layers.dense(hidden_layer_one, num_outputs, activation=tf.nn.sigmoid, kernel_initializer=initializer)
# [ prob to go left , prob to go right ]
probabilities = tf.concat(axis=1, values=[output_layer, 1 - output_layer])
# Sample one action at random based on the probabilities
action = tf.multinomial(probabilities, num_samples=1)

init = tf.global_variables_initializer()

# PART THREE: SESSION #
saver = tf.train.Saver()
epi = 50
step_limit = 500
avg_steps = []
env = gym.make("CartPole-v1")
with tf.Session() as sess:
    init.run()
    for i_episode in range(epi):
        obs = env.reset()
        for step in range(step_limit):
            env.render()
            action_val = action.eval(feed_dict={X: obs.reshape(1, num_inputs)})
            obs, reward, done, info = env.step(action_val[0][0])
            if done:
                avg_steps.append(step)
                print('Done after {} steps'.format(step))
                break
print("After {} episodes the average cart steps before done was {}".format(epi, np.mean(avg_steps)))
env.close()