DigiPen Machine Learning
Internship
Summer 2017
Christopher Eicher | Aakash Chotrani | Johann Saumer
Contents
Exploratory research
    Goal
    Work
    Wrap-up
Tic-Tac-Toe
    Goal
    Work
        Agents
        Environment
        State
        Recap
        Unintended consequences: Dangers of Reinforcement Learning
        Example
        Tic-Tac-Toe
    Wrap-up
Dodging Agent
    Goal
    Work Done
    Wrap-up
Frozen Lake
    Goal
    Work Done
    Wrap-up
Cart-Pole
    Goal
    Work Done
        TensorFlow Notes
        Final thoughts
    Wrap-up
Lunar Lander [In progress]
    Goal
    Work Done
    Wrap-up
Resources
    Resources Websites
    Tutorials
    Articles
    White Papers
    Data sets
    Books
Exploratory research
Chris, Aakash, Hans
Goal
To compile resources for future projects and figure out what kind of environment, libraries, and resources
the team will use moving forward.
Work
Decided to use TensorFlow because it is open source, well documented, and has plenty of tutorials on its website and on YouTube. We want to use the GPU version so that we can run experiments faster.
Considered the R programming language because it is commonly used in data science.
Decided to use Python because it is used heavily in machine learning and is so widely used that there are lots of learning materials for it; it is also great for personal development. Had some difficulties managing Python; this turned out to be the result of having multiple versions of Python installed (Python 2 vs. Python 3, 32-bit vs. 64-bit).
Found OpenAI, a non-profit research company with a library called Gym, which was made to fill a need in the reinforcement learning research community for benchmarks and standardized environments. There were some installation problems that took a while to fix, mostly around using the correct versions of Anaconda and CMake.
Decided to use Anaconda to manage our Python libraries, since TensorFlow and OpenAI Gym rely on a lot of libraries and Anaconda makes them easy to manage.
We've begun to compile a list of websites that have relevant tutorials for TensorFlow, Gym, Python, and reinforcement learning in general.
Started compiling a list of white papers related to machine learning, specifically reinforcement learning.
Started taking important notes on the whiteboards and uploading them to Slack. This is a great way to share very specific and technical information with the team.
Compiled a list of websites that host data sets; these will be useful if we get to unsupervised and supervised learning.
Wrap-up
We will continue to update our lists of resources.
Tic-Tac-Toe
Aakash
Goal
The goal of the project was to apply reinforcement learning to play Tic-Tac-Toe instead of using hard-coded rules, and to explore a specific type of reinforcement learning called Q-learning.
Work
Reinforcement learning is very different from supervised learning and unsupervised learning.
Supervised learning interface: fit(X, Y) and predict(X).
Unsupervised learning interface: fit(X) and sometimes transform(X), which turns the input X into a different representation Z.
The interface for reinforcement learning is broader: it is an entire environment (real world or simulated world).
Supervised learning needs data labelled by humans, which is time consuming and costly.
Reinforcement learning needs no hand-labelled data.
Agents
RL agents train in a completely different way, with many references to psychology:
- model animal behavior
- the objective is a goal
AlphaGo's goal is to win Go. The goal of a video game AI is to win the game and achieve the highest score.
Animals/humans: the "selfish gene" (Richard Dawkins). Evolutionary psychologists have said that our genes are selfish and only want to make more of themselves.
Example: why do people want to be rich? Wealth leads to better healthcare and social status, which helps the genes maximize their goal. Wealth has no physical relationship to genes, yet it is a novel solution to the problem.
Environment
The agent gets feedback by interacting with the environment.
State
Humans and AIs alike never sense the entire world/universe at once. We have sensors (sight, sound, touch) which feed signals from the environment to our brain. The measurements we get from these sensors make up a "state".
Tic-tac-toe game: how many states? Each location has 3 possibilities (empty, X, or O) and there are 9 locations on the board, so
#states = 3 * 3 * ... * 3 = 3^9 = 19,683
Recap:
1) Agent
2) Environment
3) State
4) Rewards/punishments: how well or badly the AI is doing; always a real number
5) Actions: a finite set of actions. In a 2D video game: up, down, left, right, jump
Unintended consequences: Dangers of Reinforcement Learning
Example:
Goal: minimize human deaths.
The AI decides that since the number of humans grows exponentially, more people will die in the future, so it is best to destroy everyone now to minimize dying in the future.
SAR triples
(State, Action, Reward)
Notation: (s, a, r)
Timing is important in RL: taking action A(t) in state S(t) leads to state S(t+1).
Notation: (s, a, s')
Tic-Tac-Toe
How would a first-year computer science student program a tic-tac-toe game? By programming all the general rules.
Example: if the board is empty, the first move should be the middle or a corner.
Example: if the opponent has two pieces in a row, block the third position so that they don't win the game.
Example: if we have two pieces in a row, add a third to win the game.
It will look like a bunch of if-else statements, and the agent will only be able to play tic-tac-toe, which goes against the idea of machine learning. We want one algorithm that can play different games, so we need something better: reinforcement learning.
New terms:
Episode: one run of a tic-tac-toe game, until a win, loss, or draw.
Our RL agent will learn over 1,000, 10,000, or 100,000 episodes (it depends on how long the game is and how complicated its states are).
Terminal state: no more actions can be taken; the episode has ended.
How do we reward good behavior and punish bad behavior? Try not to build any prior knowledge into the AI: tell the agent WHAT you want it to achieve, not HOW you want it to be achieved.
Intro to scenarios
Planning scenario:
Suppose there is an exam tomorrow.
Hang out with friends -> feel happy (positive)
Study -> feel bored (negative)
Why study? We don't think only of immediate rewards but of future rewards too. Hence we want to assign a value to the current state that reflects the future as well; call this the "value function".
Credit assignment scenario:
Suppose you got your dream job at a company. What actions did you take in the past so that you are receiving the reward right now?
Delayed rewards
Two directions of thinking about a delayed reward:
Credit assignment: present (receiving the reward) <- because of an action in the past
Planning: present (do the action now) -> to receive a reward in the future
Value function: a measure of the future rewards we might get. The value tells us the future goodness of a state.
Reward vs. value
Value is a measure of future goodness. Example: standing in front of a Goomba puts you in a position to jump on it in the next few states.
Reward is immediate goodness. Example: jumping on a Goomba immediately increases your score.
Reward is the goal, but we can't use reward alone to guide actions because it doesn't tell us anything about future rewards.
V(s) = E[all future rewards | S(t) = s]
where
s = state (the input)
E = expected value, E[X] = the average of X
Finding V(s)
Algorithm:
Step 1) Initialize V(s):
    V(s) = 1 if s is a winning state
    V(s) = 0 if s is a losing state or a draw
    V(s) = 0.5 otherwise
Step 2) Update V(s) after each episode:
    V(s) <- V(s) + alpha * (V(s') - V(s))
Note: the terminal state never gets updated since it has no next state.
Pseudocode:
for t in range(max_iterations):          # loop over episodes
    state_history = play_game()
    s = state_history[0]
    for s_next in state_history[1:]:
        V[s] = V[s] + learning_rate * (V[s_next] - V[s])
        s = s_next
Playing the game
How do we actually play the game? Take random actions? No! We have a value function.
Pseudocode:
maxV = 0
maxA = None
for a, s_next in possible_next_states:
    if V[s_next] > maxV:
        maxV = V[s_next]
        maxA = a
perform action maxA
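Always taking the highest-value move can get stuck on the first decent line of play the agent finds. A common refinement is epsilon-greedy selection; this is not part of the algorithm above, just a hedged sketch, where epsilon, a value dict V, and possible_next_states are assumed names:

import random

def choose_action(V, possible_next_states, epsilon=0.1):
    # With probability epsilon, explore: take a random legal move.
    if random.random() < epsilon:
        return random.choice([a for a, s_next in possible_next_states])
    # Otherwise exploit: pick the move whose resulting state has the highest value.
    best_action, best_value = None, float('-inf')
    for a, s_next in possible_next_states:
        if V.get(s_next, 0.5) > best_value:   # unseen states keep the 0.5 initial value
            best_value = V.get(s_next, 0.5)
            best_action = a
    return best_action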
Wrap-up
By implementing this project I came across different types of learning strategies and how each one differs from the others. I learned about reinforcement learning terms such as agent, environment, and state, why we need to consider future rewards instead of relying only on the current reward, and how to implement a value function for learning. In the future I would like to improve the presentation of the project's output by implementing a better-looking GUI.
Dodging Agent
Hans
Goal:
The goal was to develop a small, discrete problem that could be easily solved with machine learning
so that I could focus on understanding the implementation details.
Work Done:
The problem that I made was confined to a small grid, 2 rows by 3 columns. An agent would inhabit
the top row, and could move between the adjacent spaces. It could also choose to do nothing and
not move. A projectile would be spawned in the bottom row and would move into the top row on the
next update. The agent would then try to dodge the arrow by moving out of the way or not moving
into the arrow’s path.
The first thing I did was conduct some research into the topic of reinforcement learning. I only had a
basic understanding of the topic. I studied some of the general problems involved in every machine
learning problem, such as the trade-off between exploration and exploitation.
I also researched extensively the use of genetic algorithms in reinforcement learning problems. I was very interested in how this approach adapts the ideas of reinforcement learning into its own strategy of learning: the agent is rewarded by giving it a fitness score, and exploration is done through crossover and mutation operations, each controlled by its own rate. As much as this topic intrigued me, I decided it would be easier and more beneficial to start with a more traditional approach to solving reinforcement problems.
The topic of exploration and exploitation seemed crucial to developing a decent learning program, so I took some extra time to explore the various methods and how they impacted the learning. As explained later on, I wanted a method that explores heavily early on, but exploits more often once the program has exhausted its options.
I also researched Markov Decision Processes because random actions without some sort of
probability distribution will lead to poor performance. MDPs are used in these cases as they
incorporate probability into the action/reward system.
After learning a good portion of the material, I got started on my application. I chose to implement it in C++ because it is the language I am most familiar with. I had to program both the problem and the learning system, but since the problem was simple it did not take long to write up the whole thing. I also wrote a driver so that I could run different tests and collect data from them.
The first iteration of the system is as was described in the abstract: an agent trying to dodge a single
projectile. Every timestep, the agent will either explore a random action or perform the most
rewarding action based on a function that determines the explore rate. For this problem I am using
the given explore rate raised to the timestep, so that it diminishes over time. A value of 1 for the
base explore rate results in the agent always choosing a random action. An explore rate of 0 results
in the agent always choosing the best action. If the agent tries to exploit but does not know a single
best action, it will select a random action from the best actions available.
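The project itself was written in C++; the sketch below is just a Python illustration of the diminishing explore rate described above, assuming the action/reward table is a nested dict called q (the names here are placeholders, not the actual code):

import random

def pick_action(q, state, actions, base_explore_rate, timestep):
    # Explore probability: the base explore rate raised to the timestep, so it decays over time.
    explore_prob = base_explore_rate ** timestep
    if random.random() < explore_prob:
        return random.choice(actions)                       # explore: random action
    best = max(q[state][a] for a in actions)                # exploit: highest total reward so far
    best_actions = [a for a in actions if q[state][a] == best]
    return random.choice(best_actions)                      # tie-break randomly among the best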
To determine the best action, I paired every action with the total reward from choosing that action. These rewards were all initialized to 0 to avoid building bias into the system. In a discrete problem it would be simple to input the values that we consider correct; however, the goal is to allow the program to learn, so I did not input these values. The reward gets updated after every timestep, when the system determines whether the agent got hit. If the agent did not get hit, the reward is incremented by 1; if it did get hit, the reward is decremented by 1.
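A minimal sketch of that bookkeeping (again a Python illustration of the C++ logic; q is the same hypothetical table as above):

def update_reward(q, state, action, got_hit):
    # Every (state, action) total starts at 0 to avoid bias, then we
    # add 1 when the agent dodges and subtract 1 when it gets hit.
    q[state][action] += -1 if got_hit else 1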
The agent knew what position it was in and the position of the projectile. It did not know what the next state would be. In this iteration of the problem, that did not present any issues: the agent did not have to know the probability of the next state it would end up in. It was very short-sighted, deciding how to move based only on the current state. This allowed the problem to remain extremely small.
Having a very small, discrete, short-sighted problem is the main reason I chose a method that causes the program to stop exploring after a small number of iterations. I knew that all of the options would be exhausted very quickly while exploring, so the agent would soon be able to start choosing the best option.
I ran a few tests to see whether the agent was learning correctly. I first ran it with an explore factor of 1 to analyze what happens with random inputs. The results appeared as I suspected: about a third of the time the agent got hit. I also ran tests with an explore factor of 0. These results were also as I had suspected. The agent would get hit between 0 and 9 times; if it did get hit it would try a different action and be successful, and every time after that it would perform the successful action. I also ran tests with an explore factor of 0.99 to allow the explore algorithm to work its magic. This allowed the agent to experiment with finding alternative solutions other than the first solution it finds, as it did in the previous test. The data appears very similar; this is due to the fact that it is always training and updating the rates of success, which causes it to tend towards a single solution as well. This test also reveals that, with the current method of exploring, the agent will eventually not get hit at all once it has found a solution.
Wrap-up:
This project was a decent entry point into the topic of machine learning. I was able to implement basic reinforcement learning that allowed an agent to learn how to optimally solve a problem. The first thing I would do to move this project forward is to write up different methods of selecting actions and different methods of exploring, because I believe how these methods are implemented can greatly influence the outcome of the problem.
I would also try to make the problem more robust by allowing additional projectiles to spawn. This adds many more states and overall creates a more interesting problem. The agent would then have to think about how its current action will impact the next action it takes, and it will have to use the probability of advancing to each state in its decision. Currently the probability of a projectile being in a given space is distributed equally, but I would like to investigate the effects of using a different probability distribution for spawning projectiles.
By implementing these changes I would be able to get a much larger grasp of how these parameters
can affect the agent’s capability of reaching an optimal solution.
Frozen Lake
Chris
Goal
To dive head first into machine learning. I found a tutorial that used Q-learning to solve the Frozen Lake problem.
Work Done
Spent some time learning how Q-learning works and understanding a variation of the Bellman equation, to understand intuitively why it works. Refactored, tweaked, and played with some example code from a tutorial that implements Q-learning with a table.
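A minimal sketch of that tabular approach, in the spirit of the tutorial code I worked from (the hyperparameter values here are placeholders, not the tutorial's exact numbers):

import gym
import numpy as np

env = gym.make('FrozenLake-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))   # one row per state, one column per action
lr, gamma, num_episodes = 0.8, 0.95, 2000

for episode in range(num_episodes):
    s = env.reset()
    done = False
    while not done:
        # Pick the greedy action plus decaying random noise, so early episodes explore more.
        a = np.argmax(Q[s, :] + np.random.randn(env.action_space.n) / (episode + 1))
        s_next, reward, done, _ = env.step(a)
        # Bellman-style update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
        Q[s, a] += lr * (reward + gamma * np.max(Q[s_next, :]) - Q[s, a])
        s = s_next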
Used Matplotlib to visualize the Q-learning process. This allowed me to see how the agent moved around its environment and made me realize there were some interesting problems with how it was making decisions, like the fact that it doesn't care about the shortest path. It can feel incentivised to move around in circles or make effectively no-op moves because, in the simple implementation, it doesn't get penalized for that.
Learned a lot about how to use Matplotlib. I plan on using what I know to help visualize data for us going forward, so that as we run experiments we can see how the algorithms are working over time.
Refactored and tweaked code that used TensorFlow to do nearly the same thing; it was basically wrapping a network around the table, except it was far less accurate because we couldn't update the table directly and had to go through an optimizer.
Wrap-up
This was a good primer for learning how to use the Bellman equation and TensorFlow. I wanted to tweak its parameters more and use a few more special techniques to see if I could get the success rate higher, but I felt it was important to move on to using TensorFlow to solve more interesting problems with the rest of the team.
Cart-Pole
Aakash
Goal
To use TensorFlow to build a neural net that solves OpenAI Gym’s Cart Pole.
Work Done
Started by exploring OpenAI Gym's classic control environments. Read all the documentation to get started and to install Gym. Initially I had a lot of problems getting OpenAI Gym set up on my machine because there were multiple versions of Python installed, so I had to remove the previous versions.
I had to follow the pip install procedure in the documentation:

git clone https://github.com/openai/gym
cd gym
pip install -e .  # minimal install

I couldn't download all of the Gym packages this way: the previous snippet installs the classic control version of Gym, but if we need all of the Gym packages, such as the Atari games and the Box2D package, we need to run pip install gym[all].
After getting everything set up, I copy-pasted the code snippet from the OpenAI documentation which creates an environment and plays 20 random episodes. The documentation clearly explains what observation, reward, done, and info mean.
import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()   # take a random action
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t + 1))
            break
There is an OpenAI leaderboard which shows the different algorithms people have uploaded to solve the problem. Most of the solutions used neural networks and were difficult to grasp, so I decided to install TensorFlow and explore its getting-started documentation.
Then I had to follow a bunch of TensorFlow tutorials online on how to install it. There are two different TensorFlow packages: a GPU version and a CPU version. The CPU version is easy to install but slower; the GPU version requires an Nvidia graphics card. I also had to update the graphics card drivers and install the Nvidia CUDA package before installing TensorFlow.
After getting TensorFlow working, I followed its documentation to learn the basic terms and took notes: https://www.tensorflow.org/get_started/get_started
I also explored the MNIST program, which is the "hello world" of TensorFlow.
TensorFlow Notes
TensorFlow provides multiple APIs. The lower-level API is TensorFlow Core. The higher-level APIs are built on top of the lower-level one; they are easier to learn and make repetitive tasks easier to implement.
TENSOR: the central unit of data in TensorFlow; a set of primitive values shaped into an array of any number of dimensions.
RANK of a TENSOR:
3                            # rank 0: a scalar, shape []
[1, 2, 3]                    # rank 1: a vector, shape [3]
[[1, 2, 3], [4, 5, 6]]       # rank 2: a matrix, shape [2, 3]
[[[1, 2, 3]], [[7, 8, 9]]]   # rank 3, shape [2, 1, 3]
TensorFlow Core programs have two sections:
1) Building the computational graph
2) Running the computational graph
Computational graph: a series of TensorFlow operations arranged into a graph of nodes. Each node takes zero or more tensors as input and produces a tensor as output. A constant node takes no input; its output is a value that is stored internally.
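For example, the getting-started guide builds and runs a tiny graph of constant nodes like this (TensorFlow 1.x style, as used at the time):

import tensorflow as tf

node1 = tf.constant(3.0, dtype=tf.float32)   # constant node: no input, stores its value internally
node2 = tf.constant(4.0)
node3 = tf.add(node1, node2)                 # takes two tensors as input, outputs one tensor

sess = tf.Session()                          # running the graph requires a session
print(sess.run([node1, node2]))              # [3.0, 4.0]
print(sess.run(node3))                       # 7.0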
MNIST is like the "hello world" for starting TensorFlow: it consists of a set of labelled images, and our goal is to build a TensorFlow model to predict the labels.
The data is split into three parts:
1) 55,000 data points of training data
2) 10,000 data points of test data
3) 5,000 data points of validation data
The MNIST dataset has two parts:
1) an image of a handwritten digit (X)
2) the corresponding label (Y)
Each image is 28*28 pixels, hence a big array of numbers (28*28 == 784 numbers).
mnist.train.images is a tensor (an n-dimensional array) with a shape of [55000, 784]
The first dimension is an index into the list of images and the second dimension is the index for each pixel in
each image.
Each entry in the tensor is a pixel intensity between 0 and 1, for a particular pixel in a particular image.
Each image in MNIST has a corresponding label, a number between 0 and 9 representing the digit drawn in
the image.
mnist.train.labels is a [55000, 10] array of floats.
Softmax Regression
If you want to assign probabilities to an object being one of several different things, softmax is the thing to use, because softmax gives us a list of values between 0 and 1 that add up to 1. Even later on, when we train more sophisticated models, the final step will be a layer of softmax.
The two steps of softmax regression:
1) Add up the evidence of our input being in certain classes (note: a weighted sum of pixel intensities; if a pixel matches, the weight is positive, otherwise negative).
2) Convert that evidence into probabilities.
Softmax: exponentiate the inputs and then normalize them. Exponentiation means that one more unit of evidence increases the weight given to a hypothesis multiplicatively. Softmax then normalizes these weights so that they add up to one, forming a valid probability distribution.
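To make "exponentiate, then normalize" concrete, here is a small NumPy illustration (my own sketch, not part of the tutorial):

import numpy as np

def softmax(evidence):
    exps = np.exp(evidence - np.max(evidence))   # subtract the max for numerical stability
    return exps / np.sum(exps)                   # normalize so the outputs sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))        # approximately [0.659 0.242 0.099]

The tutorial then builds a one-layer softmax model in TensorFlow: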
x = tf.placeholder(tf.float32, [None, 784])   # input images, each flattened to 784 pixel values
W = tf.Variable(tf.zeros([784, 10]))          # weights
b = tf.Variable(tf.zeros([10]))               # biases
y = tf.nn.softmax(tf.matmul(x, W) + b)        # predicted probabilities for the 10 digit classes
In machine learning we typically define what it means for a model to be bad. We call this the cost, or the loss, and it represents how far off our model is from our desired outcome. We try to minimize that error, and the smaller the error margin, the better our model is.
CROSS-ENTROPY: measures how inefficient our predictions are at describing the truth.
Using small batches of random data is called stochastic training -- in this case, stochastic gradient descent.
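In the MNIST tutorial, the cross-entropy loss and the stochastic training loop look roughly like the following (TensorFlow 1.x, continuing from the model above; the exact constants may differ):

y_ = tf.placeholder(tf.float32, [None, 10])   # the true one-hot labels
# Cross-entropy between the true distribution y_ and the predicted distribution y.
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)   # a small random batch: stochastic training
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})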
Final thoughts
It is still difficult to understand softmax regression and cross-entropy. I started learning more about the algorithm, but it required a higher-level understanding of fuzzy logic.
After getting familiar with TensorFlow and the basic concepts of neural networks, I implemented a neural network in my algorithm to solve the Cart-Pole problem.
The algorithm plays 1000 games by taking random actions, moving the cart left or right at each frame. If a particular game scores more than 50, the algorithm saves the observations and the actions the cart took to achieve that score. This data serves as the training data for the neural network. Note: I did not render any games during training because it would be very slow to render 1000 games.
Then I created a model containing 5 layers with 128, 256, 512, 256, and 128 neurons, using softmax regression as the learning algorithm. The training data is then used to train the neural network. After the network is trained, it plays 10 games, and we can see how well it performs by rendering each game. At the end, the algorithm prints the final score and how many times it took each action, that is, how many times it decided to go left or right.
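As a rough sketch of the shape of such a network (written here with TFLearn, which is one way to build it; the layer sizes follow the description above, the other names are placeholders, and the actual code is in the GitHub link below):

import tflearn

def build_model(input_size=4, n_actions=2, learning_rate=1e-3):
    net = tflearn.input_data(shape=[None, input_size])
    for width in [128, 256, 512, 256, 128]:          # the five hidden layers described above
        net = tflearn.fully_connected(net, width, activation='relu')
    net = tflearn.fully_connected(net, n_actions, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=learning_rate,
                             loss='categorical_crossentropy')
    return tflearn.DNN(net)

# model = build_model()
# model.fit(training_observations, training_actions, n_epoch=5, show_metric=True)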
It was very difficult and time consuming to tweak each parameter to find a better output. I tried different learning rates to see how they affected the final result.
In the end I uploaded my solution to open ai gym official website:
https://gym.openai.com/evaluations/eval_HhJcddvETPu16QBl1hjYw
You can take a look at my solution on github:
https://github.com/akuchotrani/MyOpenAIGym/blob/master/CartPoleWithTensorFlow.py
Wrap-up
I still do not fully understand how softmax regression works. I looked at various tutorials online, but it requires a deeper understanding of fuzzy logic and neural networks. I will try to apply the same algorithm to Lunar Lander by changing the environment, the set of actions, and the minimum score requirement.
Lunar Lander [In progress]
Chris, Aakash, Hans
Goal
We want to use TensorFlow to build a neural net that solves OpenAI Gym’s Lunar Lander.
Work Done
Got Lunar Lander and TensorFlow working on everyone's machine. Got a simple neural network running. We set up a GitHub repository so we can collaborate.
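To confirm the environment works on each machine, random rollouts along the same lines as the earlier Cart-Pole snippet are enough (a sketch, not our training code):

import gym

env = gym.make('LunarLander-v2')   # needs the Box2D extras, e.g. pip install gym[box2d]
for episode in range(5):
    observation = env.reset()
    done, total_reward = False, 0.0
    while not done:
        env.render()
        action = env.action_space.sample()   # one of the 4 discrete actions
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print("Episode {} reward: {:.1f}".format(episode, total_reward))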
Wrap-up
Resources
Resources Websites:
R (https://cran.r-project.org/ )
Tensor flow https://www.tensorflow.org/
OpenAI Gym https://gym.openai.com/
OpenAI Universe https://blog.openai.com/universe/
Kaggle - Data science competitions https://www.kaggle.com/
RStudio (https://www.rstudio.com/products/rstudio/download2/ )
Anaconda (https://www.continuum.io/downloads#windows Python 3.6 version)
Keras: The Python Deep Learning library https://keras.io/
Practical Deep Learning For Coders—18 hours of lessons for free http://course.fast.ai/
OpenCV (Open Source Computer Vision Library) http://opencv.org/
Python Programming Tutorials https://pythonprogramming.net/
Python 3.6.2rc1 Documentation https://docs.python.org/3/
TFLearn: Deep learning library featuring a higher-level API for TensorFlow http://tflearn.org/
Matplotlib 1.5.1 documentation https://matplotlib.org/1.5.1/index.html
NumPy http://www.numpy.org/
SciPy https://scipy.org/
Tutorials:
Q-Learning https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0
Pacman and reinforcement learning
https://inst.eecs.berkeley.edu/~cs188/sp12/projects/reinforcement/reinforcement.html
YouTube Python tutorial: https://www.youtube.com/watch?v=kkQlyDMa-h0
PyCharm: https://www.jetbrains.com/pycharm/
Articles:
Artificial neural network https://en.wikipedia.org/wiki/Artificial_neural_network
Reinforcement learning https://en.wikipedia.org/wiki/Reinforcement_learning
Markov decision process https://en.wikipedia.org/wiki/Markov_decision_process
Deep Learning https://en.wikipedia.org/wiki/Deep_learning
Q-learning https://en.wikipedia.org/wiki/Q-learning
Pac-Man:
http://www.ias.tu-darmstadt.de/uploads/Site/EditPublication/Hochlaender_BScThesis_2014.pdf
Demystifying Deep Reinforcement Learning https://www.intelnervana.com/demystifying-deep-reinforcement-learning/
White Papers:
Continuous Control with Deep Reinforcement Learning https://arxiv.org/pdf/1509.02971.pdf
Playing Atari with Deep Reinforcement Learning https://arxiv.org/pdf/1312.5602.pdf
Mining Muscle Use Data for Fatigue Reduction in IndyCar http://www.sloansportsconference.com/wp-content/uploads/2017/02/1622.pdf
Real-Time Decision Making in Motorsports: Analytics for Improving Professional Car Race Strategy https://dspace.mit.edu/bitstream/handle/1721.1/100310/931596281-MIT.pdf?sequence=1
Data sets:
ImageNet http://www.image-net.org/
Titanic https://www.kaggle.com/c/titanic
Quandl Financial, Economic and Alternative Data https://www.quandl.com/
Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)
https://github.com/niderhoff/nlp-datasets
List of datasets for machine learning research - Wikipedia
https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
Books:
AI techniques for game programming by Mat Buckland
Bayesian Reasoning and Machine Learning by David Barber
Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code, Third Edition, Video-Enhanced Edition, by Zed A. Shaw http://proquestcombo.safaribooksonline.com/book/programming/python/9780133124316
More Related Content

Similar to Reinforcement Learning

 Towards Reproducible Data Analysis Using Cloud and Container Technologies
 Towards Reproducible Data Analysis Using Cloud and Container Technologies Towards Reproducible Data Analysis Using Cloud and Container Technologies
 Towards Reproducible Data Analysis Using Cloud and Container Technologiesinside-BigData.com
 
Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...
Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...
Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...Ingo Frommholz
 
Final report 1.0 - Good Practice Report
Final report 1.0 - Good Practice ReportFinal report 1.0 - Good Practice Report
Final report 1.0 - Good Practice ReportMike KEPPELL
 
HJohansen (Publishable)
HJohansen (Publishable)HJohansen (Publishable)
HJohansen (Publishable)Henry Johansen
 
O'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeO'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeVasu S
 
Dnle final project
Dnle final projectDnle final project
Dnle final projectMatthieu Cisel
 
Microsoft cloud migration and modernization playbook 031819 (1) (2)
Microsoft cloud migration and modernization playbook 031819 (1) (2)Microsoft cloud migration and modernization playbook 031819 (1) (2)
Microsoft cloud migration and modernization playbook 031819 (1) (2)didicadoida
 
Enterprise Ontology and Semantics
Enterprise Ontology and SemanticsEnterprise Ontology and Semantics
Enterprise Ontology and Semanticscurioz
 
EMDT_2
EMDT_2EMDT_2
EMDT_2PMI2011
 
Computational thinking v0.1_13-oct-2020
Computational thinking v0.1_13-oct-2020Computational thinking v0.1_13-oct-2020
Computational thinking v0.1_13-oct-2020Gora Buzz
 
2016 05-20-clariah-wp4
2016 05-20-clariah-wp42016 05-20-clariah-wp4
2016 05-20-clariah-wp4CLARIAH
 
Using Bets, Boards and Missions to Inspire Org-wide Agility
Using Bets, Boards and Missions to Inspire Org-wide AgilityUsing Bets, Boards and Missions to Inspire Org-wide Agility
Using Bets, Boards and Missions to Inspire Org-wide AgilityC4Media
 
Story points considered harmful - or why the future of estimation is really i...
Story points considered harmful - or why the future of estimation is really i...Story points considered harmful - or why the future of estimation is really i...
Story points considered harmful - or why the future of estimation is really i...Vasco Duarte
 
Group Partners approach to solving the right problem
Group Partners approach to solving the right problemGroup Partners approach to solving the right problem
Group Partners approach to solving the right problemHazel Tiffany
 
2009 Approach(N)
2009 Approach(N)2009 Approach(N)
2009 Approach(N)John Caswell
 
A Better Way to Design & Build Immersive E Learning
A Better Way to Design & Build Immersive E LearningA Better Way to Design & Build Immersive E Learning
A Better Way to Design & Build Immersive E Learningnarchambeau
 

Similar to Reinforcement Learning (20)

 Towards Reproducible Data Analysis Using Cloud and Container Technologies
 Towards Reproducible Data Analysis Using Cloud and Container Technologies Towards Reproducible Data Analysis Using Cloud and Container Technologies
 Towards Reproducible Data Analysis Using Cloud and Container Technologies
 
Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...
Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...
Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...
 
Final report 1.0 - Good Practice Report
Final report 1.0 - Good Practice ReportFinal report 1.0 - Good Practice Report
Final report 1.0 - Good Practice Report
 
HJohansen (Publishable)
HJohansen (Publishable)HJohansen (Publishable)
HJohansen (Publishable)
 
O'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeO'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data Lake
 
Dnle final project
Dnle final projectDnle final project
Dnle final project
 
Microaccess 2007
Microaccess 2007Microaccess 2007
Microaccess 2007
 
Microsoft cloud migration and modernization playbook 031819 (1) (2)
Microsoft cloud migration and modernization playbook 031819 (1) (2)Microsoft cloud migration and modernization playbook 031819 (1) (2)
Microsoft cloud migration and modernization playbook 031819 (1) (2)
 
Enterprise Ontology and Semantics
Enterprise Ontology and SemanticsEnterprise Ontology and Semantics
Enterprise Ontology and Semantics
 
EMDT_2
EMDT_2EMDT_2
EMDT_2
 
Handbook of e Learning Strategy
Handbook of e Learning StrategyHandbook of e Learning Strategy
Handbook of e Learning Strategy
 
Computational thinking v0.1_13-oct-2020
Computational thinking v0.1_13-oct-2020Computational thinking v0.1_13-oct-2020
Computational thinking v0.1_13-oct-2020
 
DigiSeniors Curriculum - Leaders Guide
DigiSeniors Curriculum - Leaders GuideDigiSeniors Curriculum - Leaders Guide
DigiSeniors Curriculum - Leaders Guide
 
2016 05-20-clariah-wp4
2016 05-20-clariah-wp42016 05-20-clariah-wp4
2016 05-20-clariah-wp4
 
Using Bets, Boards and Missions to Inspire Org-wide Agility
Using Bets, Boards and Missions to Inspire Org-wide AgilityUsing Bets, Boards and Missions to Inspire Org-wide Agility
Using Bets, Boards and Missions to Inspire Org-wide Agility
 
Story points considered harmful - or why the future of estimation is really i...
Story points considered harmful - or why the future of estimation is really i...Story points considered harmful - or why the future of estimation is really i...
Story points considered harmful - or why the future of estimation is really i...
 
Work History Narrative-TraceyJackson
Work History Narrative-TraceyJacksonWork History Narrative-TraceyJackson
Work History Narrative-TraceyJackson
 
Group Partners approach to solving the right problem
Group Partners approach to solving the right problemGroup Partners approach to solving the right problem
Group Partners approach to solving the right problem
 
2009 Approach(N)
2009 Approach(N)2009 Approach(N)
2009 Approach(N)
 
A Better Way to Design & Build Immersive E Learning
A Better Way to Design & Build Immersive E LearningA Better Way to Design & Build Immersive E Learning
A Better Way to Design & Build Immersive E Learning
 

More from Aakash Chotrani

Efficient Backpropagation
Efficient BackpropagationEfficient Backpropagation
Efficient BackpropagationAakash Chotrani
 
What is goap, and why is it not already mainstream
What is goap, and why is it not already mainstreamWhat is goap, and why is it not already mainstream
What is goap, and why is it not already mainstreamAakash Chotrani
 
Deep q learning with lunar lander
Deep q learning with lunar landerDeep q learning with lunar lander
Deep q learning with lunar landerAakash Chotrani
 
Course recommender system
Course recommender systemCourse recommender system
Course recommender systemAakash Chotrani
 
Artificial Intelligence in games
Artificial Intelligence in gamesArtificial Intelligence in games
Artificial Intelligence in gamesAakash Chotrani
 
Simple & Fast Fluids
Simple & Fast FluidsSimple & Fast Fluids
Simple & Fast FluidsAakash Chotrani
 
Supervised Unsupervised and Reinforcement Learning
Supervised Unsupervised and Reinforcement Learning Supervised Unsupervised and Reinforcement Learning
Supervised Unsupervised and Reinforcement Learning Aakash Chotrani
 

More from Aakash Chotrani (7)

Efficient Backpropagation
Efficient BackpropagationEfficient Backpropagation
Efficient Backpropagation
 
What is goap, and why is it not already mainstream
What is goap, and why is it not already mainstreamWhat is goap, and why is it not already mainstream
What is goap, and why is it not already mainstream
 
Deep q learning with lunar lander
Deep q learning with lunar landerDeep q learning with lunar lander
Deep q learning with lunar lander
 
Course recommender system
Course recommender systemCourse recommender system
Course recommender system
 
Artificial Intelligence in games
Artificial Intelligence in gamesArtificial Intelligence in games
Artificial Intelligence in games
 
Simple & Fast Fluids
Simple & Fast FluidsSimple & Fast Fluids
Simple & Fast Fluids
 
Supervised Unsupervised and Reinforcement Learning
Supervised Unsupervised and Reinforcement Learning Supervised Unsupervised and Reinforcement Learning
Supervised Unsupervised and Reinforcement Learning
 

Recently uploaded

BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaPraksha3
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...SĂ©rgio Sacani
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett SquareIsiahStephanRadaza
 

Recently uploaded (20)

BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett Square
 

Reinforcement Learning

  • 1. DigiPen Machine Learning Internship Summer 2017 Christopher Eicher | Aakash Chotrani | Johann Saumer
  • 2. Contents
Exploratory research ... 4
    Goal ... 4
    Work ... 4
    Wrap-up ... 4
Tic-Tac-Toe ... 4
    Goal ... 4
    Work ... 5
        Agents ... 5
        Environment ... 5
        State ... 5
        Recap ... 5
        Unintended consequences: Dangers of Reinforcement Learning ... 6
        Example ... 6
        Tic-Tac-Toe ... 6
    Wrap-up ... 8
Dodging Agent ... 8
    Goal ... 8
    Work Done ... 8
    Wrap-up ... 10
Frozen Lake ... 10
    Goal ... 10
    Work Done ... 10
    Wrap-up ... 11
Cart-Pole ... 11
    Goal ... 11
    Work Done ... 11
        TensorFlow Notes ... 12
        Final thoughts ... 14
    Wrap Up ... 15
Lunar Lander [In progress] ... 15
    Goal ... 15
  • 3. Work Done ... 15
    Wrap-up ... 15
Resources ... 16
    Resources Websites ... 16
    Tutorials ... 16
    Articles ... 16
    White Papers ... 17
    Data sets ... 17
    Books ... 17
  • 4. Exploratory research
Chris, Aakash, Hans
Goal
To compile resources for future projects and decide which environment, libraries, and resources the team will use moving forward.
Work
Decided to use TensorFlow because it is open source, well documented, and has plenty of tutorials on its website and on YouTube. We want to use the GPU version so that we can run experiments faster.
Considered the R programming language because it is commonly used in data science.
Decided to use Python because it is widely used in machine learning, so there are plenty of learning materials available, and because it is valuable for personal development in general.
Had some difficulties managing Python; this turned out to be the result of having multiple versions of Python installed (Python 2 vs. Python 3, 32-bit vs. 64-bit).
Found OpenAI, a non-profit research company whose Gym library was made to fill the reinforcement learning community's need for benchmarks and standardized environments. There were some installation problems that took a while to fix, mostly around using the correct versions of Anaconda and cmake.
Decided to use Anaconda to manage our Python libraries, since TensorFlow and OpenAI Gym rely on many libraries and Anaconda makes them easy to manage.
Began compiling a list of websites with relevant tutorials for TensorFlow, Gym, Python, and reinforcement learning in general.
Started compiling a list of white papers related to machine learning, specifically reinforcement learning.
Started taking important notes on the whiteboards and uploading them to Slack. This is a great way to share very specific, technical information with the team.
Compiled a list of websites that host data sets; these will be useful if we move on to supervised and unsupervised learning.
Wrap-up
We will continue to update our lists of resources.
Tic-Tac-Toe
Aakash
Goal
The goal of the project was to apply reinforcement learning to solve a game of Tic-Tac-Toe instead of using hard-coded rules, and to explore a specific type of reinforcement learning called Q-Learning.
  • 5. Work
Reinforcement learning is quite different from supervised and unsupervised learning.
Supervised learning interface: def fit(X, Y) and def predict(X)
Unsupervised learning interface: def fit(X) and sometimes def transform(X), which turns the input X into a different representation Z
The interface for reinforcement learning is broader: it is an entire environment (a real or simulated world).
Supervised learning needs data labelled by humans, which is time consuming and costly. Reinforcement learning has no need for hand-labelled data.
Agents:
RL agents train in a completely different way, with many references to psychology:
- they model animal behavior
- the objective is a goal. AlphaGo's goal is to win Go; the goal of a video game AI is to win the game and achieve the highest score.
Animals/humans: the "selfish gene" (Richard Dawkins). Evolutionary psychologists have argued that our genes are selfish and only want to make more of themselves. Example: why do people want to be rich? Wealth leads to better healthcare and social status, which helps the genes maximize their goal. Wealth has no physical relationship to genes, yet it is a novel solution to the problem.
Environment:
The agent gets feedback by interacting with the environment.
State:
Humans and AIs alike never sense the entire world/universe at once. We have sensors (sight, sound, touch) which feed signals from the environment to our brain. The measurements we get from these sensors make up a "state".
Tic-tac-toe: how many states? Each location has 3 possibilities (empty, X, O) and there are 9 locations on the board, so #states = 3 * 3 * ... * 3 = 3^9 = 19,683. (A minimal sketch of encoding a board as one of these states appears after the recap below.)
Recap:
1) Agent
2) Environment
3) State
4) Rewards/punishments: how well or badly the AI is doing; always a real number
5) Actions: a finite set of actions. In a 2D video game: up, down, left, right, jump
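To make the 3^9 state count concrete, here is a minimal sketch (illustrative only, not the project's code) that encodes a board as a single integer in [0, 3^9) by treating it as a base-3 number; the names board_to_state and EMPTY/X/O are mine.

# Illustrative: encode a tic-tac-toe board as a base-3 integer.
EMPTY, X, O = 0, 1, 2          # three possibilities per cell

def board_to_state(board):
    """board is a list of 9 cells, each EMPTY, X, or O; returns an int in [0, 3**9)."""
    state = 0
    for cell in board:
        state = state * 3 + cell
    return state

# 9 cells with 3 values each -> 3**9 = 19683 possible states
assert board_to_state([O] * 9) == 3**9 - 1
print(3**9)  # 19683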
  • 6. Unintended consequences: Dangers of Reinforcement Learning
Example:
Goal: minimize human deaths. The AI reasons that since the number of humans grows exponentially, more people will die in the future, so it is best to destroy everyone now to minimize deaths in the future.
SAR triples:
(State, Action, Reward). Notation: (s, a, r).
Timing is important in RL: S(t), A(t) --------> S(t+1). Notation: (s, a, s').
Tic-Tac-Toe
How would a first-year computer science student program a tic-tac-toe game? By programming all the general rules.
Example: if the board is empty, the first move should be the middle or a corner.
Example: if the opponent has two pieces in a row, block the third position so they don't win the game.
Example: if we have two pieces in a row, add a third to win the game.
It will look like a bunch of if-else statements, and the agent will only ever be able to play tic-tac-toe, which goes against the idea of machine learning. We want one algorithm that can play different games, so we need something better: reinforcement learning.
New terms:
Episode: one run of the tic-tac-toe game, until a win, loss, or draw. Our RL agent will learn over 1,000, 10,000, or 100,000 episodes (it depends on how long the game is and how complicated its states are).
Terminal state: no more actions can be taken; the episode has ended.
How do we reward good behavior and punish bad behavior without building any prior knowledge into the AI? Tell the agent WHAT you want to achieve, not HOW you want it to be achieved.
Intro to scenario:
Planning scenario:
Suppose there is an exam tomorrow.
Hang out with friends --------> feel happy (positive)
Study ------------> feel bored (negative)
Why study? We don't think only of immediate rewards but of future rewards too. Hence we want to assign a value to the current state that reflects the future as well. Call this the "Value Function".
  • 7. Credit/Assignment Scenario:
Suppose you got your dream job at a company. What actions did you take in the past so that you are receiving the reward right now?
Delayed Rewards:
Two directions of thinking about a delayed reward:
Credit assignment: present (receiving the reward) ----------> because of an action in the past
Planning: present (do the action now) ---------> to receive a reward in the future
Value Function: a measure of the future rewards we might get. The value tells us the future goodness of a state.
Reward vs. Value:
Value is a measure of future goodness. Example: standing in front of a Goomba puts you in a position to jump on it in the next few states.
Reward is immediate goodness. Example: jumping on a Goomba immediately increases your score.
Reward is the goal, but we can't use the reward alone to guide actions because it doesn't tell us about future rewards.
V(s) = E[all future rewards | S(t) = s]
where s = state (input) and E = expected value, E[X] = the average of X.
Finding V(s) Algorithm:
Step 1) Initialize V(s):
V(s) = 1 if s is a winning state
V(s) = 0 if s is a losing or drawn state
V(s) = 0.5 otherwise
Step 2) Update V(s) in each episode:
V(s) <- V(s) + alpha * (V(s') - V(s))
Note: a terminal state never gets updated, since it has no next state.
Pseudocode:

# V: dict mapping state -> value (initialized as in step 1); play_game() returns the states visited
for t in range(max_iterations):              # loop over episodes
    state_history = play_game()
    s = state_history[0]
    for s_next in state_history[1:]:
        V[s] = V[s] + learning_rate * (V[s_next] - V[s])
        s = s_next

Playing the game:
  • 8. How do we actually play the game? Take a random action? No! We have a value function.
Pseudocode:

maxV = float('-inf')                         # start below any possible value so an action is always chosen
maxA = None
for a, s_next in possible_next_states:       # (action, resulting state) pairs
    if V[s_next] > maxV:
        maxV = V[s_next]
        maxA = a
perform_action(maxA)

Wrap-up
By implementing this project I came across different types of learning strategies and how each differs from the others. I learned the basic terms of reinforcement learning (agent, environment, state), why we need to consider future rewards instead of relying only on the current reward, and how to implement a value function for learning. In the future I would like to improve the presentation of the project's results by implementing a better-looking GUI.
Dodging Agent
Hans
Goal:
The goal was to develop a small, discrete problem that could be easily solved with machine learning so that I could focus on understanding the implementation details.
Work Done:
The problem that I made was confined to a small grid, 2 rows by 3 columns. An agent inhabits the top row and can move between adjacent spaces; it can also choose to do nothing and not move. A projectile is spawned in the bottom row and moves into the top row on the next update. The agent then tries to dodge the projectile by moving out of the way or by not moving into its path.
The first thing I did was conduct some research into reinforcement learning, since I only had a basic understanding of the topic. I studied some of the general problems involved in every machine learning problem, such as the trade-off between exploration and exploitation. I also researched the use of genetic algorithms in reinforcement learning problems extensively. I was very interested in how this approach transforms reinforcement learning into its own strategy of learning: the agent is rewarded by giving it a fitness score, and exploration is done through crossover and mutation operations, each affected by its own rate. As much as this topic intrigued me, I decided it would be easier and more beneficial to start with a more traditional approach to solving reinforcement learning problems.
  • 9.
The topic of exploration versus exploitation seemed crucial to developing a decent learning program, so I took some extra time to explore the various methods and how they impact learning. As explained later on, I wanted to use a method that is keen to explore early on but exploits more often once the program has exhausted its options. I also researched Markov Decision Processes, because random actions without some sort of probability distribution lead to poor performance; MDPs are used in these cases because they incorporate probability into the action/reward system.
After learning a good portion of the material I got started on my application. I chose to implement it in C++ because it is the language I am most familiar with. I had to program both the problem and the learning system, but since the problem was simple it did not take long to write the whole thing. I also wrote a driver so that I can run different tests and collect data from them.
The first iteration of the system is as described above: an agent trying to dodge a single projectile. Every timestep, the agent either explores a random action or performs the most rewarding action, based on a function that determines the explore rate. For this problem I use the given base explore rate raised to the power of the timestep, so that it diminishes over time. A base explore rate of 1 results in the agent always choosing a random action; an explore rate of 0 results in the agent always choosing the best action. If the agent tries to exploit but there is no single best action, it selects a random action from among the best actions available.
To determine the best action, I paired every action with the total reward from choosing that action. These rewards were all initialized to 0 to avoid bias in the system. In a discrete problem it would be simple to input the values we consider correct, but the goal is to allow the program to learn, so I did not. The reward gets updated after every timestep, when the system determines whether the agent got hit: if the agent did not get hit, the reward is incremented by 1; if it did get hit, the reward is decremented by 1. The agent knew what position it was in and the position of the projectile, but it did not know what the next state would be. In this iteration of the problem this did not cause any issues: the agent did not have to know the probability of the next state it would end up in. It was very short-sighted and decided how to move based only on the current state, which allowed the problem to remain extremely small. Having a very small, discrete, short-sighted problem is the main reason I chose a method that causes the program to stop exploring after a small number of iterations; I knew that all of the options would be exhausted very quickly during exploration, so the agent could then start choosing the best option.
I ran a few tests to check that the agent was learning correctly. I first ran it with an explore factor of 1 to see what happens with purely random actions. The results were as I expected: the agent got hit about a third of the time. I also ran tests with an explore factor of 0. These results were also as I had expected: the agent would get hit between 0 and 9 times, and if it did get hit it would try a different action and be successful.
Every time after that it would perform the successful action. I also ran tests with an explore factor of 0.99 to let the explore algorithm work its magic. This allowed the agent to experiment and find alternative solutions, rather than sticking with the first solution it found as in the previous test. The data looks very similar, because the agent is always training and updating its success rates, which causes it to tend towards a single solution as well. A minimal sketch of this explore-then-exploit selection rule is given below.
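Since the project itself was written in C++, the following is only an illustrative Python sketch of the selection rule described above; the names (choose_action, a rewards table keyed by (state, action), base_explore_rate) are mine, not the project's.

import random

def choose_action(rewards, state, actions, base_explore_rate, timestep):
    """rewards: dict mapping (state, action) -> total reward so far (all start at 0)."""
    # base explore rate raised to the timestep, so exploration decays over time
    explore_rate = base_explore_rate ** timestep if base_explore_rate > 0 else 0.0
    if random.random() < explore_rate:
        return random.choice(actions)                    # explore: random action
    best = max(rewards[(state, a)] for a in actions)     # exploit: highest total reward
    best_actions = [a for a in actions if rewards[(state, a)] == best]
    return random.choice(best_actions)                   # break ties randomly, as described above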
  • 10. This test also reveals that with the current method of exploring, the agent will eventually stop getting hit entirely once it has found a solution.
Wrap-up:
This project was a decent entry point into machine learning. I was able to implement basic reinforcement learning that allowed an agent to learn how to solve a problem optimally. The first thing I would do to move this project forward is write up different methods of selecting actions and different methods of exploring; I believe how these methods are implemented can greatly influence the outcome. I would also try to make the problem more robust by allowing additional projectiles to spawn. This adds many more states and creates a more interesting problem overall: the agent would then have to think about how its current action impacts its next action, and it would have to use the probability of advancing to each state in its choice of action. Currently the probability of a projectile appearing in a space is distributed equally, but I would like to investigate the effects of using a different spawning distribution. By implementing these changes I would get a much better grasp of how these parameters affect the agent's ability to reach an optimal solution.
Frozen Lake
Chris
Goal
To dive head first into machine learning. I found a tutorial that uses Q-Learning to solve the Frozen Lake problem.
Work Done
Spent some time learning how Q-learning works and understanding a variation of the Bellman equation, to understand intuitively why it works.
Refactored, tweaked, and played with example code from a tutorial that implements Q-Learning with a table (a minimal sketch of this approach follows below).
Used Matplotlib to visualize the Q-Learning process. This let me see how the agent moved around its environment and made me realize there were some interesting problems with how it was making decisions, like the fact that it doesn't care about the shortest path. It can be incentivised to move around in circles or make effectively no-op moves, because in the simple implementation it doesn't get penalized for that.
Learned a lot about how to use Matplotlib; I plan to use it to help visualize data for us going forward, so that as we run experiments we can see how the algorithms are working over time.
Refactored and tweaked code that used TensorFlow to do nearly the same thing, basically wrapping around a table, except it was far less accurate because we couldn't update the table directly and had to go through an optimizer.
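For reference, here is a minimal sketch of tabular Q-learning on FrozenLake-v0, in the spirit of the Q-Learning tutorial linked in the Resources section; the hyperparameter values and the noise-based exploration are illustrative, not necessarily what was used in the project.

import gym
import numpy as np

env = gym.make('FrozenLake-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))   # one row per state, one column per action
learning_rate, discount = 0.8, 0.95                            # illustrative values

for episode in range(2000):
    s = env.reset()
    done = False
    while not done:
        # greedy action plus decaying random noise, so the agent keeps exploring early on
        a = np.argmax(Q[s, :] + np.random.randn(env.action_space.n) / (episode + 1))
        s_next, reward, done, info = env.step(a)
        # Bellman-style update toward reward + discounted best future value
        Q[s, a] += learning_rate * (reward + discount * np.max(Q[s_next, :]) - Q[s, a])
        s = s_next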
  • 11. Wrap-up
This was a good primer for learning how to use the Bellman equation and TensorFlow. I wanted to tweak its parameters more and use a few more special techniques to see if I could push the success rate higher, but I felt it was more important to move on to using TensorFlow to solve more interesting problems with the rest of the team.
Cart-Pole
Aakash
Goal
To use TensorFlow to build a neural net that solves OpenAI Gym's Cart Pole.
Work Done
Started by exploring OpenAI Gym's classic control environments and read the documentation on getting started and installing Gym. Initially I had a lot of problems getting OpenAI Gym set up on my machine because there were multiple versions of Python installed, so I had to remove the previous versions. I followed the pip install procedure from the documentation:

git clone https://github.com/openai/gym
cd gym
pip install -e .  # minimal install

This does not download every Gym package. The snippet above installs the classic control version of Gym; if we need all the Gym packages, such as the Atari games and the Box2D package, we need to run pip install gym[all].
After getting everything set up, I copied the code snippet from the OpenAI documentation which creates an environment and plays 20 random games. The documentation clearly explains what observation, reward, done, and info mean.

import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()              # take a random action
        observation, reward, done, info = env.step(action)
        if done:
  • 12. print("Episode finished after {} timesteps".format(t+1)) break There is open ai leaderboard which shows different algorithms which are uploaded by people to solve the problem. Most of the solutions were using neural network hence it was difficult to grasp. Hence decided to install tensor flow and explore getting started documentation. Then had to follow buch of tensor flow tutorials online on how to install it. We have 2 different types of tensorflow packages gpu version and cpu version. The cpu version is easy to download but is slower. The gpu version requires Nvidia graphics card. Also I had to update all the graphics card driver and had to install Nvidia cuda package before installing tensorflow. After getting tensorflow working followed their documentation to learn the basic terms and took notes: https://www.tensorflow.org/get_started/get_started And explored MNIST program which is hello world for tensor flow. Tensor Flow Notes:- Tensorflow provides multiple APIs. Lower level api -------------------> tensorflow core Higher level api built on top of lower level api and are easier to learn and make repetitive task easier to implement. TENSOR: central unit of data in tensorflow. Set of primitive values into an array of any number of dimensions. RANK of TENSOR:- 3-------------------------------------------> #rank 0, scalar shape[] [1,2,3]-------------------------------------> #rank 1, vector shape[3] [[1,2,3],[4,5,6]]---------------------------> #rank 2, matrix shape[2,3] [[[1,2,3]],[[7,8,9]]]-----------------------> #rank 3, shape[2,1,3] Tensorflow core programs 2 sections:- 1) Building computational graph 2) Running computational graph Computational graph: series of tensorflow operations arrranged into a graph of nodes. Each node takes 0 or more tensors as input-------------> tensor as output
  • 13.
Constant node: no input; the output is a value that is stored internally.
MNIST is the "hello world" for starting with TensorFlow. It consists of a number of labelled images, and our goal is to build a TensorFlow model that predicts the labels.
The data is split into 3 parts:
1) 55,000 data points of training data
2) 10,000 data points of test data
3) 5,000 data points of validation data
The MNIST dataset has 2 parts:
1) An image of a handwritten digit --------- X
2) The corresponding label ------------------ Y
Each image is 28*28 pixels, hence a big array of numbers (28*28 == 784 numbers). mnist.train.images is a tensor (an n-dimensional array) with a shape of [55000, 784]. The first dimension is an index into the list of images and the second dimension is the index of each pixel in an image. Each entry in the tensor is a pixel intensity between 0 and 1, for a particular pixel in a particular image.
Each image in MNIST has a corresponding label, a number between 0 and 9 representing the digit drawn in the image. mnist.train.labels is a [55000, 10] array of floats.
Softmax Regression:
If you want to assign probabilities to an object being one of several different things, softmax is the thing to use, because softmax gives us a list of values between 0 and 1 that add up to 1. Even later on, when we train more sophisticated models, the final step will be a softmax layer.
The 2 steps of softmax regression are:
1) Add up the evidence of our input being in each class (a weighted sum of pixel intensities: if the pixel supports the class the weight is positive, otherwise negative)
2) Convert that evidence into probabilities (a small numeric example follows below)
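To make step 2 concrete, here is a small illustrative computation (not taken from the MNIST tutorial) of softmax over three made-up evidence scores:

import numpy as np

def softmax(evidence):
    # exponentiate, then normalize so the outputs sum to 1
    exps = np.exp(evidence - np.max(evidence))   # subtracting the max keeps it numerically stable
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))
# -> roughly [0.66, 0.24, 0.10]: higher evidence gets a larger share of the probability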
  • 14.
Softmax: exponentiate the inputs and then normalize them. The exponentiation means that one more unit of evidence increases the weight given to a hypothesis multiplicatively. Softmax then normalizes these weights so that they add up to one, forming a valid probability distribution.

x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)

In machine learning we typically define what it means for a model to be bad. We call this the cost, or the loss, and it represents how far off our model is from the desired outcome. We try to minimize that error; the smaller the error margin, the better the model.
CROSS-ENTROPY: a measure of how inefficient our predictions are at describing the truth.
Using small batches of random data is called stochastic training -- in this case, stochastic gradient descent.
Final thoughts:
It is still difficult for me to understand softmax regression and cross entropy. I started learning more about the algorithm, but it required a higher-level understanding of fuzzy logic.
After getting familiar with TensorFlow and the basic concepts of neural networks, I implemented a neural network in my algorithm to solve the cart pole problem. The algorithm plays 1000 games by taking random actions, moving the cart left or right at each frame. If a particular game scores more than 50, the algorithm saves the observations and the actions the cart took to achieve that score. This data serves as the training data for the neural network. NOTE: I did not render any games during training because rendering 1000 games would be very slow.
Then I created a model which contains 5 layers with 128, 256, 512, 256, and 128 neurons respectively. It uses softmax regression as the learning algorithm (a sketch of a network of this shape is given below).
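The actual code is linked further down; purely as an illustration, a network with that layer structure could be sketched with TFLearn (listed in Resources) roughly as follows, where the input size of 4 (Cart-Pole's observation), the dropout, and the other hyperparameters are assumptions, not the project's exact settings:

import tflearn
from tflearn.layers.core import input_data, fully_connected, dropout
from tflearn.layers.estimator import regression

def build_model():
    net = input_data(shape=[None, 4])                     # 4 observation values per Cart-Pole frame (assumed)
    for width in [128, 256, 512, 256, 128]:               # the 5 layers described above
        net = fully_connected(net, width, activation='relu')
        net = dropout(net, 0.8)
    net = fully_connected(net, 2, activation='softmax')   # probabilities for left / right
    net = regression(net, optimizer='adam', learning_rate=1e-3,
                     loss='categorical_crossentropy')
    return tflearn.DNN(net)

Training would then be a call along the lines of model.fit(observations, actions) on the data saved from the high-scoring random games.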
  • 15.
The training data is then used to train the neural network. After the network is trained, it is used to play 10 games, and we can see how well it performs by rendering each game. At the end, the algorithm prints out the final score and how many times it took each action, i.e. how many times it decided to go left or right.
It was very difficult and time consuming to tweak each parameter to find a better output. I tried different learning rates to see how they affected the final result. In the end I uploaded my solution to the official OpenAI Gym website: https://gym.openai.com/evaluations/eval_HhJcddvETPu16QBl1hjYw
You can take a look at my solution on GitHub: https://github.com/akuchotrani/MyOpenAIGym/blob/master/CartPoleWithTensorFlow.py
Wrap Up
I still do not fully understand how softmax regression works. I looked at various tutorials online, but it requires a higher understanding of fuzzy logic and neural networks. I will try to apply the same algorithm to Lunar Lander by changing the environment, the set of action spaces, and the minimum score requirement.
Lunar Lander [In progress]
Chris, Aakash, Hans
Goal
We want to use TensorFlow to build a neural net that solves OpenAI Gym's Lunar Lander.
Work Done
Got Lunar Lander and TensorFlow working on everyone's machine. Got a simple neural network running. We set up a GitHub repository so we can collaborate.
Wrap-up
  • 16. Resources
Resources Websites:
R https://cran.r-project.org/
TensorFlow https://www.tensorflow.org/
OpenAI Gym https://gym.openai.com/
OpenAI Universe https://blog.openai.com/universe/
Kaggle (data science competitions) https://www.kaggle.com/
RStudio https://www.rstudio.com/products/rstudio/download2/
Anaconda (Python 3.6 version) https://www.continuum.io/downloads#windows
Keras: The Python Deep Learning library https://keras.io/
Practical Deep Learning For Coders (18 hours of lessons for free) http://course.fast.ai/
OpenCV (Open Source Computer Vision Library) http://opencv.org/
Python Programming Tutorials https://pythonprogramming.net/
Python 3.6.2rc1 Documentation https://docs.python.org/3/
TFLearn: Deep learning library featuring a higher-level API for TensorFlow http://tflearn.org/
Matplotlib 1.5.1 documentation https://matplotlib.org/1.5.1/index.html
NumPy http://www.numpy.org/
SciPy https://scipy.org/
Tutorials:
Q-Learning https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0
Pac-Man and reinforcement learning https://inst.eecs.berkeley.edu/~cs188/sp12/projects/reinforcement/reinforcement.html
YouTube Python tutorial https://www.youtube.com/watch?v=kkQlyDMa-h0
PyCharm https://www.jetbrains.com/pycharm/
Articles:
Artificial neural network https://en.wikipedia.org/wiki/Artificial_neural_network
Reinforcement learning https://en.wikipedia.org/wiki/Reinforcement_learning
Markov decision process https://en.wikipedia.org/wiki/Markov_decision_process
  • 17.
Deep Learning https://en.wikipedia.org/wiki/Deep_learning
Q-learning https://en.wikipedia.org/wiki/Q-learning
Pac-Man http://www.ias.tu-darmstadt.de/uploads/Site/EditPublication/Hochlaender_BScThesis_2014.pdf
Demystifying Deep Reinforcement Learning https://www.intelnervana.com/demystifying-deep-reinforcement-learning/
White Papers:
Continuous Control with Deep Reinforcement Learning https://arxiv.org/pdf/1509.02971.pdf
Playing Atari with Deep Reinforcement Learning https://arxiv.org/pdf/1312.5602.pdf
Mining Muscle Use Data for Fatigue Reduction in IndyCar http://www.sloansportsconference.com/wp-content/uploads/2017/02/1622.pdf
Real-Time Decision Making in Motorsports: Analytics for Improving Professional Car Race Strategy https://dspace.mit.edu/bitstream/handle/1721.1/100310/931596281-MIT.pdf?sequence=1
Data sets:
ImageNet http://www.image-net.org/
Titanic https://www.kaggle.com/c/titanic
Quandl: Financial, Economic and Alternative Data https://www.quandl.com/
Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP) https://github.com/niderhoff/nlp-datasets
List of datasets for machine learning research (Wikipedia) https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
Books:
AI Techniques for Game Programming by Mat Buckland
Bayesian Reasoning and Machine Learning by David Barber
Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code (Third Edition, Video-Enhanced Edition) by Zed A. Shaw http://proquestcombo.safaribooksonline.com/book/programming/python/9780133124316