The document presents an agenda for a talk on reinforcement learning and creating a bot to play FlappyBird. The agenda includes introducing reinforcement learning concepts like Markov decision processes, value functions, and deep Q-learning. It also demonstrates using OpenAI Gym to build a bot that learns to play FlappyBird through trial and error without being explicitly programmed.
5. No supervisor, only the reward signal.
Feedback is delayed, not instantaneous.
Sequential, non-i.i.d. data: time really matters.
Agent’s actions affect the subsequent data it receives.
Difficulties of RL
7. History: Ht = O1, R1, A1, O2, R2, A2, …, At−1, Ot, Rt
State is the information used to determine what happens next
St = f(Ht)
Agent state vs Environment state (S^a_t vs S^e_t)
Fully Observable and Partially Observable environment.
State
8. Policy
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a|St = s]
Value function
vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + … | St = s]
Model
P^a_ss' = P[St+1 = s' | St = s, At = a]
R^a_s = E[Rt+1 | St = s, At = a]
Major components of an agent
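As a concrete sketch of the policy component: a deterministic policy is just a state→action mapping, while a stochastic policy stores one distribution π(a|s) per state. The states, actions, and probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical toy problem: states {0, 1}, actions {0, 1, 2}.

def deterministic_policy(s):
    # a = π(s): each state maps to exactly one action
    return {0: 2, 1: 0}[s]

# Stochastic policy: π(a|s) = P[At = a | St = s], one distribution per state.
pi = {
    0: np.array([0.1, 0.2, 0.7]),
    1: np.array([0.5, 0.5, 0.0]),
}

rng = np.random.default_rng(0)

def sample_action(s):
    # draw an action according to π(·|s)
    return rng.choice(len(pi[s]), p=pi[s])

print(deterministic_policy(0))  # 2
```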
9. Value based
Value function
No policy (Implicit)
Policy based
No value function
Policy
Actor Critic
Value function
Policy
Categorizing RL agents
10. Model free
Value function and/or policy
No model
Model based
Value function and/or policy
Model
Categorizing RL agents
11. Exploration finds more information about the environment
Exploitation exploits known information to maximize reward
Exploration vs Exploitation
import numpy as np

# Epsilon-greedy: explore with probability eps, otherwise exploit.
if np.random.uniform() < eps:
    action = random_action()      # explore: try a random action
else:
    action = get_best_action()    # exploit: current best-known action
12. Markov state contains all useful information from the history.
P[St+1 | St] = P[St+1 | S1,…, St]
Some examples:
The environment state S^e_t is Markov.
The history Ht is Markov.
Markov state (Information state)
13. A Markov Decision Process is a tuple (S, A, P, R, γ).
S: a finite set of states.
A: a finite set of actions
P: a state transition probability matrix
P^a_ss' = P[St+1 = s' | St = s, At = a]
R: reward function
R^a_s = E[Rt+1 | St = s, At = a]
γ: discount factor, γ ∈ [0, 1]
Markov Decision Process (MDP)
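For a finite MDP, all five elements of the tuple fit in plain arrays. A minimal sketch — the states, actions, and numbers below are invented for illustration:

```python
import numpy as np

# A tiny hypothetical MDP (S, A, P, R, γ) with 3 states and 2 actions.
n_states, n_actions = 3, 2
gamma = 0.9

# P[a, s, s'] = probability of moving s -> s' under action a
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.8, 0.2, 0.0],
        [0.0, 0.9, 0.1],
        [0.0, 0.0, 1.0]]
P[1] = [[0.1, 0.9, 0.0],
        [0.0, 0.1, 0.9],
        [0.0, 0.0, 1.0]]

# R[s, a] = E[Rt+1 | St = s, At = a]
R = np.array([[0.0, 1.0],
              [0.5, 2.0],
              [0.0, 0.0]])

# Each row of every transition matrix must sum to 1.
assert np.allclose(P.sum(axis=2), 1.0)
```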
15. The state-value function vπ(s) is the expected return
starting from state s, and then following policy π.
The action-value function qπ(s, a) is the expected return
starting from state s, taking action a, and then following policy
π.
vπ(s) = Eπ [Gt | St = s]
qπ(s, a) = Eπ [Gt | St = s, At = a]
Gt = Rt+1 + γRt+2 + γ²Rt+3 + …
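For a finite reward sequence, the return Gt can be computed by folding from the back; a small sketch:

```python
def discounted_return(rewards, gamma):
    """Gt = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + …, for a finite episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # fold backwards: g accumulates the discounted tail
    return g

# Example: rewards [1, 1, 1] with γ = 0.5 → 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```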
Value function of MDP
Real world reinforcement learning: learn from experience to maximize the rewards.
The dog watches the trainer's actions, hears her command, and reacts based on that information.
If the reaction is good, the dog receives a reward (a lure, a compliment…). If the reaction is not good, the dog receives no reward.
The dog learns from its experience to find a way to get as many rewards as possible.
AlphaGo: defeated Ke Jie (other game playing: Atari, chess…)
Waymo: Self driving car (Google)
DeepMind AI Reduces Google Data Centre Cooling Bill by 40% (https://goo.gl/JbcH5n)
Robotics
SpaceX reuses rocket.
Financial (Investment)
How does this differ from supervised and unsupervised learning?
We usually don't receive the reward immediately. When playing chess, we win or lose because of moves made earlier in the game. In the self-driving car problem, right before an accident the driver often hits the brake — but the braking is not what caused the crash.
Observation -> action -> reward -> new observation -> new action -> new reward.
The agent's actions can change the environment and affect future observations.
At step t: do action At, see new observation Ot and receive reward Rt
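The loop above can be sketched with a Gym-style reset/step interface. The toy environment and the naive agent here are pure illustrations, not the FlappyBird setup:

```python
import random

class CoinFlipEnv:
    """Toy environment: the observation is a coin flip; guessing it pays reward 1."""
    def reset(self):
        self.coin = random.randint(0, 1)
        return self.coin                    # initial observation
    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        self.coin = random.randint(0, 1)    # environment moves on
        return self.coin, reward            # new observation, reward

env = CoinFlipEnv()
obs = env.reset()
total = 0.0
for t in range(100):
    action = obs                       # naive agent: repeat the last observation
    obs, reward = env.step(action)     # act At, see Ot+1, receive Rt+1
    total += reward
```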
History is a series of observations, rewards and actions from the beginning to current time.
State is a function of history.
The environment state is the environment's private representation, usually not visible to the agent. Even when it is visible, it may contain irrelevant information.
In a fully observable environment, the agent directly observes the environment state (S^a_t = S^e_t).
In a partially observable environment, the agent observes the environment only indirectly (S^a_t ≠ S^e_t).
Policy is the agent’s behavior, it maps from state to action.
A value function is a prediction of future reward, used to evaluate the goodness/badness of states and to choose between actions.
A model predicts what the environment will do next
P predicts the next state.
R predicts the next immediate reward (only its expected value, not the actual Rt+1).
If γ = 0, the agent cares only about the immediate reward; if γ = 1, future rewards are not discounted at all.
Categorizing : value based, policy based, actor critic
Categorizing : model free, model based
Reinforcement learning is like trial-and-error learning
The agent discovers a good policy from its experiences of the environment without losing too much reward along the way.
Reduce epsilon during training time.
When at test mode, just choose the best action.
Epsilon is a small number (e.g. decayed from 1 down to 0.1).
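One common way to reduce epsilon during training is a linear anneal over the first part of training; a sketch (the schedule values below are assumptions, not from the talk):

```python
def epsilon_at(step, eps_start=1.0, eps_end=0.1, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start down to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)          # progress through the schedule
    return eps_start + frac * (eps_end - eps_start)

print(epsilon_at(0))       # 1.0 at the start of training
print(epsilon_at(20_000))  # stays at the floor (~0.1) after decay_steps
```

At test time, set epsilon to 0 and always take the best action.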
When the state is known, the history can be thrown away.
Can convert or create the Markov state by adding more information.
Some more examples: a chess board (plus knowing which player moves next) is Markov; when driving a car, we just need the current conditions (position, speed…) and don't need to care about the history.
Why do we need the gamma discount factor?
The discount γ is the present value of future rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future
Like money in the bank, a reward today is worth more than the same reward tomorrow.
Animal/human behavior shows a preference for immediate reward.
The example is from David Silver’s course.
Circles and squares are states (square: terminal state)
Some actions: Facebook, Quit, Study…
From the 3rd state, if we choose the action Pub, it may end in different states.
From state s, we can do many action, the probability of each action is π(a|s)
After that, we receive a reward, then move to another state s' with probability P^a_ss'.
From state s, we choose action a, receive reward R^a_s, then can move to many new states.
After that, we can take further actions according to π(a'|s').
The optimal state-value function v∗(s) is the maximum value function over all policies
The optimal action-value function q∗(s, a) is the maximum action-value function over all policies
An MDP is “solved” when we know the optimal value function. The optimal value function specifies the best possible performance in the MDP.
If we know q∗(s, a), we immediately have the optimal policy: act greedily with respect to q∗.
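Acting greedily on a known q∗ table is a one-liner: π∗(s) = argmax over a of q∗(s, a). The q-values below are made up for illustration:

```python
import numpy as np

# Hypothetical q*(s, a) table: 3 states (rows) x 2 actions (columns).
q_star = np.array([[1.0, 2.0],
                   [0.5, 0.1],
                   [3.0, 3.5]])

# Greedy policy: for each state, pick the action with the highest q-value.
optimal_policy = q_star.argmax(axis=1)
print(optimal_policy)  # [1 0 1]
```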
Input: the state.
Output: a vector of Q-values, one per action (size: nb_actions).
Dueling DQN: the network splits into two streams. The first estimates the value function V(s), which says simply how good it is to be in a given state. The second estimates the advantage function A(s, a), which says how much better taking a certain action is compared to the others. Q is then the combination of V and A.
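The combination step can be sketched in plain NumPy; subtracting the mean advantage is the standard trick that keeps the V/A decomposition identifiable (otherwise a constant could shift freely between the two streams). The numbers are illustrative:

```python
import numpy as np

def dueling_q(v, advantages):
    """Combine a state value V(s) and per-action advantages A(s, a) into Q(s, a)."""
    a = np.asarray(advantages, dtype=float)
    # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
    return v + (a - a.mean())

q = dueling_q(v=2.0, advantages=[1.0, -1.0, 0.0])
print(q)  # [3. 1. 2.]
```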