Reinforcement Learning using OpenAI Gym
Muhammad Aleem Siddiqui
What is REINFORCEMENT LEARNING?
• In reinforcement learning, an agent takes actions in an environment and receives rewards. The ultimate goal is to maximize the total reward over time.
• A reinforcement learning algorithm, or agent, learns by interacting with its environment. The agent receives rewards for performing correctly and penalties for performing incorrectly. The agent learns without human intervention by maximizing its reward and minimizing its penalty.
Exploration And Exploitation
The only way to uncover the correct signal is to assume nothing, try out different things (explore), and learn to act optimally (exploit) based on the environment's feedback. Balancing exploration and exploitation is what reinforcement learning is all about.
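To make the trade-off concrete, here is a minimal sketch (not from the slides) of an epsilon-greedy rule on a toy three-armed bandit; the win rates and the value of epsilon are made-up numbers for illustration only.
CODE:
import random

# Toy bandit: three actions with different win rates (unknown to the agent).
true_win_rates = [0.3, 0.5, 0.8]   # hypothetical values, for illustration only
estimates = [0.0, 0.0, 0.0]        # running reward estimate per action
counts = [0, 0, 0]
epsilon = 0.1                      # fraction of steps spent exploring

for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(3)              # explore: try a random action
    else:
        action = estimates.index(max(estimates))  # exploit: best action so far
    reward = 1.0 if random.random() < true_win_rates[action] else 0.0
    counts[action] += 1
    # Incremental average of the rewards observed for this action
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # the estimates should roughly approach the true win rates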
What is an Agent, Environment, Action, Policy and Reward?
Agent: an algorithm, a robot, or a game
Environment: what the agent interacts with
Action: what the agent can do
Policy: which action to choose in a given state
Reward: what the agent receives for performing correctly
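As a rough, hypothetical sketch of how these five terms map onto objects in OpenAI Gym (the toolkit introduced later in this deck, using the classic pre-0.26 API):
CODE:
import gym

env = gym.make('CartPole-v0')    # Environment: what the agent interacts with
observation = env.reset()        # the agent's view of the environment's state

def policy(observation):
    # Policy: the rule for choosing an action given an observation.
    # Here it is just a random choice, to illustrate the interface.
    return env.action_space.sample()

action = policy(observation)                        # Action: what the agent can do
observation, reward, done, info = env.step(action)  # Reward: feedback for that action
env.close()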
Agent and Environment
Agent:
An agent can be a program or a robot that receives inputs from the environment and performs some action based on those inputs.
Environment:
An environment is the actual setting the agent interacts with. An environment needs to be represented in a way the agent can understand. In examples it is often a game, but it can be any real-world or artificial environment.
Action, Policy and Reward
Action:
The actual interaction an agent performs on the environment: moving in an environment, choosing the next move in a game, etc.
Policy:
The policy is the strategy for choosing an action given a state, in expectation of better outcomes.
Reward:
The metric that lets an agent understand whether its previous set of actions helped or hurt its overall goal.
Reinforcement Learning Process
Reinforcement Learning is the science of making optimal decisions from experience. Breaking it down, the process of Reinforcement Learning involves these simple steps:
1. Observing the environment
2. Deciding how to act using some strategy
3. Acting accordingly
4. Receiving a reward or penalty
5. Learning from the experience and refining our strategy
6. Iterating until an optimal strategy is found
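A minimal sketch of this loop against a Gym environment (introduced on the next slide); the random action choice stands in for a real strategy, and the learning step is left as a placeholder comment.
CODE:
import gym

env = gym.make('CartPole-v0')
for episode in range(5):
    observation = env.reset()               # 1. observe the environment
    done = False
    while not done:
        action = env.action_space.sample()  # 2.-3. decide how to act (here: randomly) and act
        observation, reward, done, info = env.step(action)  # 4. receive a reward or penalty
        # 5. a real agent would refine its strategy from (observation, reward) here
env.close()                                 # 6. iterate over episodes until the strategy is good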
OpenAI Gym
OpenAI is a non-profit AI research company, discovering and enacting the path to safe artificial general intelligence.
Gym is a toolkit for developing and comparing reinforcement learning algorithms.
The OpenAI Gym library is a Python library with a collection of environments that can be used with reinforcement learning algorithms.
Link: gym.openai.com
OpenAI Gym’s Environment
Here is an example of getting something running. This will run an instance of the "CartPole-v0" environment for 1000 time-steps, rendering the environment at each step.
CODE:
import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action
env.close()
Environment (Contd.)
Gym's "CartPole-v0" environment returns observations as a numpy array of 4 floating-point values:
1. Horizontal position of the cart
2. Horizontal velocity of the cart
3. Angle of the pole
4. Angular velocity of the pole
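As a short illustration, the four values can be unpacked directly from a CartPole-v0 observation (classic Gym API); the variable names match the hard-coded-policy example later in the deck.
CODE:
import gym

env = gym.make('CartPole-v0')
observation = env.reset()

# CartPole-v0 observation: [cart position, cart velocity, pole angle, pole angular velocity]
cart_pos, cart_vel, pole_ang, ang_vel = observation
print("cart position:", cart_pos, "cart velocity:", cart_vel)
print("pole angle:", pole_ang, "angular velocity:", ang_vel)
env.close()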
Functions of Gym
make(): Creates an environment.
reset(): Resets the environment to its default starting state and returns an initial observation.
render(): Opens a pop-up window displaying a simulation of the agent interacting with the environment.
step(): Performs the action chosen by the agent and returns a 4-tuple <observation, reward, done, info>; for CartPole the observation itself is a 4-valued numpy array.
sample(): Samples a random action from the environment's action space.
close(): Closes the environment after the actions have been performed.
CODE:
import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())
env.close()
OpenAI Gym’s Observations
Observations are the environment-specific information variables:
Observation (object): An environment-specific object representing your observation of the environment. For example, joint angles and joint velocities of a robot, or the board state in a board game.
Reward (float): Amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
Done (boolean): Whether it's time to reset the environment again. Most tasks are divided into well-defined episodes, and done being True indicates the episode has terminated. For example, perhaps the pole tipped too far, or you lost your last life.
Info (dict): Diagnostic information useful for debugging. It can sometimes be useful for learning. For example, it might contain the raw probabilities behind the environment's last state change. However, official evaluations of your agent are not allowed to use this for learning.
Observations (Contd.)
The process gets started by calling reset(), which returns an initial observation. A more proper way of writing the previous code, taking episodes and the done flag into account:
CODE:
import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} time steps".format(t+1))
            break
env.close()
OUTPUT:
[-0.03327757 0.5649743 -0.0374682 -0.87239967]
[-0.02197809 0.7605852 -0.05491619 -1.17662316]
[-0.00676638 0.95637585 -0.07844866 -1.48600365]
[ 0.01236114 1.15236136 -0.10816873 -1.80211802]
[ 0.03540836 1.3485132 -0.14421109 -2.12636579]
[ 0.06237863 1.15508926 -0.18673841 -1.88149287]
Episode finished after 11 time steps
Making a Hard-Coded Policy for the Agent
CODE:
import gym

env = gym.make('CartPole-v0')
observation = env.reset()
for t in range(1000):
    env.render()
    # Defining a hard-coded policy
    cart_pos, cart_vel, pole_ang, ang_vel = observation
    # Move the cart right if the pole is falling to the right;
    # the angle is measured off the straight vertical line
    if pole_ang > 0:
        # Move right
        action = 1
    else:
        # Move left
        action = 0
    # Perform the action
    observation, reward, done, info = env.step(action)
env.close()
Using Neural Network in Reinforcement Learning
[Figure: ReLU activation function, f(x) = max(0, x)]
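For reference, a short sketch of the ReLU activation, assuming the standard definition f(x) = max(0, x):
CODE:
import numpy as np

def relu(x):
    # ReLU passes positive values through and clamps negative values to zero
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]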
Using Neural Network In TensorFlow for Reinforcement Learning (Contd.)
Let's design a simple neural network that takes in the observation array, passes it through a hidden layer, and outputs the probability of going left (probability of going right = 1 - left).
CODE:
import tensorflow as tf
import gym
import numpy as np

# PART ONE: NETWORK VARIABLES #
# Observation space has 4 inputs
num_inputs = 4
num_hidden = 4
# Output is the probability the cart should go left
num_outputs = 1

initializer = tf.contrib.layers.variance_scaling_initializer()

# PART TWO: NETWORK LAYERS #
X = tf.placeholder(tf.float32, shape=[None, num_inputs])
hidden_layer_one = tf.layers.dense(X, num_hidden, activation=tf.nn.relu, kernel_initializer=initializer)
hidden_layer_two = tf.layers.dense(hidden_layer_one, num_hidden, activation=tf.nn.relu, kernel_initializer=initializer)
# Probability to go left (the output is fed from the first hidden layer;
# hidden_layer_two is defined but not used in this simple version)
output_layer = tf.layers.dense(hidden_layer_one, num_outputs, activation=tf.nn.sigmoid, kernel_initializer=initializer)
# [ prob to go left , prob to go right ]
probabilities = tf.concat(axis=1, values=[output_layer, 1 - output_layer])
# Sample one action at random based on the probabilities
action = tf.multinomial(probabilities, num_samples=1)

init = tf.global_variables_initializer()

# PART THREE: SESSION #
saver = tf.train.Saver()
epi = 50
step_limit = 500
avg_steps = []
env = gym.make("CartPole-v1")
with tf.Session() as sess:
    init.run()
    for i_episode in range(epi):
        obs = env.reset()
        for step in range(step_limit):
            env.render()
            action_val = action.eval(feed_dict={X: obs.reshape(1, num_inputs)})
            obs, reward, done, info = env.step(action_val[0][0])
            if done:
                avg_steps.append(step)
                print('Done after {} steps'.format(step))
                break
print("After {} episodes the average cart steps before done was {}".format(epi, np.mean(avg_steps)))
env.close()