1. Reinforcement learning concepts like state, action, reward, and policy are introduced. Dynamic programming and Monte Carlo methods are covered as ways to estimate value functions through bootstrapping or sampling.
2. The Bellman equation and methods like Q-learning, SARSA, and TD learning are discussed for estimating value functions. These combine aspects of dynamic programming and Monte Carlo methods.
3. Game theory concepts are introduced, including games, strategies, and Nash equilibrium. Examples such as the prisoner's dilemma and the battle of the sexes are covered. Impartial games like Nim are analyzed in depth.
Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
In real-world scenarios, decision making can be a very challenging task, even for modern computers. Generalized reinforcement learning (GRL) was developed to facilitate complex decision making in highly dynamical systems through flexible policy generalization mechanisms using kernel-based methods. GRL combines the use of sampling, kernel functions, stochastic processes, non-parametric regression, and functional clustering.
Deep Reinforcement Learning talk at PI School, covering the following contents:
1- Deep Reinforcement Learning
2- Q-Learning
3- Deep Q-Learning (DQN)
4- Google DeepMind paper (DQN for Atari)
In some applications, the output of the system is a sequence of actions. In such cases a single action is not important by itself;
an example is game playing, where a single move on its own is not decisive. When the agent acts on its environment, it receives some evaluation of its action (reinforcement),
but is not told which action is the correct one to achieve its goal.
We propose a distributed deep learning model to learn control policies directly from high-dimensional sensory input using reinforcement learning (RL). We adapt the DistBelief software framework to efficiently train the deep RL agents using the Apache Spark cluster computing framework.
Presenter: Donghyun Kwak (곽동현; Ph.D. student at Seoul National University, currently at NAVER Clova)
An overview of reinforcement learning and an introduction to recent deep-learning-based RL trends.
Presentation video:
http://tv.naver.com/v/2024376
https://youtu.be/dw0sHzE1oAc
Review :: Demystifying deep reinforcement learning (original article by Tambet Matiisen), reviewed by Hogeon Seo
The link of the original article: https://ai.intel.com/demystifying-deep-reinforcement-learning/
This review summarizes:
How do I learn reinforcement learning?
Reinforcement Learning is Hot!
What is RL?
General approach to model the RL problem
Maximize the total future reward
A function Q(s, a) = the maximum discounted future reward (DFR)
How to get Q-function?
Deep Q Network
Experience Replay
Exploration-Exploitation
Maximum Entropy Reinforcement Learning (Stochastic Control), by Dongmin Lee
I reviewed the following papers.
- T. Haarnoja, et al., “Reinforcement Learning with Deep Energy-Based Policies", ICML 2017
- T. Haarnoja, et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", ICML 2018
- T. Haarnoja, et al., “Soft Actor-Critic Algorithms and Applications", arXiv preprint 2018
Thank you.
Exploration Strategies in Reinforcement Learning, by Dongmin Lee
I presented "Exploration Strategies in Reinforcement Learning" at AI Robotics KR.
- Exploration strategies in RL
1. Epsilon-greedy
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Information theoretic exploration (e.g., Entropy Regularization in RL)
Thank you.
Financial Trading as a Game: A Deep Reinforcement Learning Approach, by 謙益 黃
An automatic program that generates constant profit from the financial market is lucrative for every market practitioner. Recent advances in deep reinforcement learning provide a framework toward end-to-end training of such a trading agent. In this paper, we propose a Markov Decision Process (MDP) model suitable for the financial trading task and solve it with the state-of-the-art deep recurrent Q-network (DRQN) algorithm. We propose several modifications to the existing learning algorithm to make it more suitable for the financial trading setting. 1. We employ a substantially smaller replay memory (only a few hundred entries) compared to the ones used in modern deep reinforcement learning algorithms (often millions in size). 2. We develop an action augmentation technique to mitigate the need for random exploration by providing extra feedback signals for all actions to the agent. This enables us to use a greedy policy over the course of learning and shows strong empirical performance compared with the more commonly used ε-greedy exploration. However, this technique is specific to financial trading under a few market assumptions. 3. We sample a longer sequence for recurrent neural network training. A side product of this mechanism is that we can now train the agent every T steps, which greatly reduces training time since the overall computation goes down by a factor of T. We combine all of the above into a complete online learning algorithm and validate our approach on the spot foreign exchange market.
This talk gives a brief and simple overview of Reinforcement Learning and its combination with neural networks (Deep Reinforcement Learning).
This talk was an introduction to Reinforcement Learning based on the book by Richard S. Sutton and Andrew Barto. We explained the main components of an RL problem and detailed the tabular and approximate solution methods.
"Multi-armed bandit queen" is likely a playful twist on the term "multi-armed bandit problem." The multi-armed bandit problem is a classic dilemma in probability theory and decision making. It is named after a hypothetical scenario in which a gambler faces multiple slot machines (bandits) with different payout probabilities and must decide which machines to play in order to maximize their total reward over time.
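To make the exploration-exploitation trade-off concrete, here is a minimal Python sketch (not from the slideshow) of an ε-greedy agent playing a few simulated Bernoulli slot machines; the payout probabilities, ε, and number of steps are illustrative assumptions.

```python
import random

def run_bandit(payout_probs, epsilon=0.1, steps=1000, seed=0):
    """Play a Bernoulli multi-armed bandit with an epsilon-greedy rule."""
    rng = random.Random(seed)
    n = len(payout_probs)
    counts = [0] * n      # pulls per arm
    values = [0.0] * n    # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:                       # explore
            arm = rng.randrange(n)
        else:                                            # exploit the current best estimate
            arm = max(range(n), key=lambda i: values[i])
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return values, total

# Illustrative machines with hidden payout probabilities.
estimates, total_reward = run_bandit([0.2, 0.5, 0.7])
print(estimates, total_reward)
```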
An Analytical Study of Puzzle Selection Strategies for the ESP Game, by Academia Sinica
“Human Computation” represents a new paradigm of applications that take advantage of people’s desire to be entertained and produce useful metadata as a by-product. By creating games with a purpose, human computation has shown promise in solving a variety of problems that computer computation cannot currently resolve completely. Using the ESP game as an example, we propose a metric, called system gain, for evaluating the performance of human computation systems, and also use analysis to study the properties of the ESP game. We argue that human computation systems should be played with a strategy. To this end, we implement an Optimal Puzzle Selection Strategy (OPSA) based on our analysis to improve human computation. Using a comprehensive set of simulations, we demonstrate that the proposed OPSA approach can effectively improve the system gain of the ESP game, as long as the number of puzzles in the system is sufficiently large.
AlphaZero: A General Reinforcement Learning Algorithm that Masters Chess, Sho..., by Joonhyung Lee
An introduction to DeepMind's newest board-game playing AI, AlphaZero.
I have improved significantly on my previous presentation in https://www.slideshare.net/ssuserc416e2/alphago-zero-mastering-the-game-of-go-without-human-knowledge, which had several errors (some rather glaring, such as the temperature equation for simulated annealing). Also, DeepMind released far more details in their new Science paper for AlphaZero.
One comment I would like to add is that the AlphaGo Zero used for comparison in this paper is a very weak version, not the final version. Thus, AlphaGo Zero is still SOTA for Go.
This presentation explains a model-free reinforcement learning algorithm named "Q-learning".
Idea in Simple Words
In Technical Terms
Parameters
Examples with Real-Life Usage of the Algorithm
Implementation (Q-Table)
Process
Complications
Summary
Anyone who is interested in Reinforcement Learning can have a look. It covers Markov Reward Processes, Markov Decision Processes and Dynamic Programming!
Reinforcement Learning (RL) is a particular type of learning. It is useful when we try to learn from an unknown environment, which means that our model has to explore the environment in order to collect the data needed for its training. The model is represented as an Agent trying to achieve a certain goal in a particular environment. The Agent affects this environment by taking actions that change the state of the environment and generate rewards produced by the environment.
The learning relies on the generated rewards, and the goal is to maximize them. To choose which actions to apply, the agent uses a policy. It can be defined as the process the agent uses to choose the actions that allow it to optimize the overall reward. In this course, we will see two methods used to develop these policies: policy gradient and Q-Learning. We will implement our examples using the following libraries: OpenAI Gym, Keras, TensorFlow, and keras-rl.
[Notebook 1](https://colab.research.google.com/drive/1395LU6jWULFogfErI8CIYpi35Y00YiRj)
[Notebook 2](https://colab.research.google.com/drive/1MpDS5rj-PwzzLIZtAGYnZ_jjEwhWZEdC)
I am using deep learning and actor-critic tools to solve a variational inference (VI) problem. The intriguing part is that the likelihood has a Beta distribution; thus we handle both VI issues and an uncommon distribution.
Richard's entangled adventures in wonderland, by Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Cancer cell metabolism: special reference to the lactate pathway, by AADYARAJPANDEY1
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy they need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cell utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules of a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to “burn” the pyruvate made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis - Kreb's - oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELLS:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Introduction to the WARBURG PHENOMENON:
WARBURG EFFECT: Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose than normal cells do.
Otto Heinrich Warburg (8 October 1883 – 1 August 1970) was awarded the Nobel Prize in Physiology or Medicine in 1931 for his "discovery of the nature and mode of action of the respiratory enzyme."
WARBURG EFFECT: The tendency of cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg observed that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Multi-source connectivity as the driver of solar wind variability in the heli..., by Sérgio Sacani
The ambient solar wind that fills the heliosphere originates from multiple sources in the solar corona and is highly structured. It is often described as high-speed, relatively homogeneous plasma streams from coronal holes and slow-speed, highly variable streams whose source regions are under debate. A key goal of ESA/NASA's Solar Orbiter mission is to identify solar wind sources and understand what drives the complexity seen in the heliosphere. By combining magnetic field modelling and spectroscopic techniques with high-resolution observations and measurements, we show that the solar wind variability detected in situ by Solar Orbiter in March 2022 is driven by spatio-temporal changes in the magnetic connectivity to multiple sources in the solar atmosphere. The magnetic field footpoints connected to the spacecraft moved from the boundaries of a coronal hole to one active region (12961) and then across to another region (12957). This is reflected in the in situ measurements, which show the transition from fast to highly Alfvénic then to slow solar wind that is disrupted by the arrival of a coronal mass ejection. Our results describe solar wind variability at 0.5 au but are applicable to near-Earth observatories.
Seminar on U.V. Spectroscopy, by SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
The increased availability of biomedical data, particularly in the public domain, offers the opportunity to better understand human health and to develop effective therapeutics for a wide range of unmet medical needs. However, data scientists remain stymied by the fact that data remain hard to find and to productively reuse because data and their metadata i) are wholly inaccessible, ii) are in non-standard or incompatible representations, iii) do not conform to community standards, and iv) have unclear or highly restricted terms and conditions that preclude legitimate reuse. These limitations require a rethink of how data can be made machine- and AI-ready - the key motivation behind the FAIR Guiding Principles. Concurrently, while recent efforts have explored the use of deep learning to fuse disparate data into predictive models for a wide range of biomedical applications, these models often fail even when the correct answer is already known, and fail to explain individual predictions in terms that data scientists can appreciate. These limitations suggest that new methods to produce practical artificial intelligence are still needed.
In this talk, I will discuss our work in (1) building an integrative knowledge infrastructure to prepare FAIR and "AI-ready" data and services along with (2) neurosymbolic AI methods to improve the quality of predictions and to generate plausible explanations. Attention is given to standards, platforms, and methods to wrangle knowledge into simple but effective semantic and latent representations, and to make these available through standards-compliant and discoverable interfaces that can be used in model building, validation, and explanation. Our work, and that of others in the field, creates a baseline for building trustworthy and easy-to-deploy AI models in biomedicine.
Bio
Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University, founder and executive director of the Institute of Data Science, and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research explores socio-technological approaches for responsible discovery science, which includes collaborative multi-modal knowledge graphs, privacy-preserving distributed data mining, and AI methods for drug discovery and personalized medicine. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon Europe, the European Open Science Cloud, the US National Institutes of Health, and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.
2. RL Objectives
• State (s) - The situation the agent is in right now.
Examples:
1. A position on a chess board
2. A potential customer on a sales website
• Action (a) - An action that the agent can take while it is in a state.
Examples:
1. A knight or pawn captures a bishop
2. The user buys a ticket
• Reward (r) - The reward that is obtained due to the action
Examples:
1. A better or worse position
2. More money or more clicks
3. Basic Concepts
• Policy (π) - The “strategy” by which the agent decides which action to
take. Abstractly speaking, the policy is simply a probability function over actions that is
defined for each state.
• Episode – A sequence of states and their actions
The Objective:
Predicting the expected future reward given the current state (s) :
1. Which actions should we take in order to maximize our gain
2. Which actions should we take in order to maximize the click rate
Value function (V_π(s)) - The expected reward of the episode given a policy
π and a state s
=> Bellman Equation
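For reference, a standard textbook form of the value function and of the Bellman expectation equation the slide points to (the notation here is assumed, not taken from the slides):

```latex
V_{\pi}(s)
  = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s\right]
  = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[r(s,a,s') + \gamma\, V_{\pi}(s')\bigr]
```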
5. Optimal Function
We wish to find the optimal policy:
V*(s) = max_π V_π(s)
Our aim:
1. Finding the optimal policy
2. Finding the value of this policy for each state.
6. Dynamic Programming
V(s_t) = E[r_{t+1} + γV(s_{t+1}) | s_t = s]
• Learning is based on a perfect model of the environment
(assuming a Markov decision process)
• We update V_π(s_t) using values of V_π(s_{t+1}), where both values are
estimated: bootstrapping
• Navigation mostly follows ε-greedy methods
• Policy improvement
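As an illustration of this bootstrapped update when a perfect model is available, here is a minimal iterative policy evaluation sketch; the two-state MDP, the rewards, and γ are toy assumptions, not content from the talk.

```python
# Iterative policy evaluation: V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [r + gamma * V(s')]
GAMMA = 0.9

# Toy model: model[state][action] = list of (prob, next_state, reward)
model = {
    "A": {"go": [(1.0, "B", 0.0)]},
    "B": {"go": [(1.0, "A", 1.0)]},
}
policy = {"A": {"go": 1.0}, "B": {"go": 1.0}}

V = {s: 0.0 for s in model}
for _ in range(100):                  # sweep until approximately converged
    for s in model:
        V[s] = sum(
            policy[s][a] * sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)
            for a, outcomes in model[s].items()
        )
print(V)  # bootstrapped estimates of V("A") and V("B")
```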
7. Estimating R_t using Monte Carlo
We estimate the value function by averaging the
obtained returns. These methods are used in episodic
tasks.
We need to complete the episode before updating the value
function, so learning is not “step by step” but episode by
episode.
8. Monte Carlo (Cont.)
• Algorithm:
1. Choose a policy.
2. Run an episode under the policy.
In each step get the reward r_t and add the return to the list Returns(s_t).
3. V(s) = average(Returns(s)) (the return is denoted R_t)
The basic incremental formula:
V(s_t) = V(s_t) + α[R_t - V(s_t)]
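A minimal sketch of the averaging idea on pre-collected episodes (every-visit Monte Carlo; the episode format and γ = 1 are assumptions for illustration, not the presenter's code):

```python
from collections import defaultdict

GAMMA = 1.0  # undiscounted episodic task, for simplicity

def mc_value_estimate(episodes):
    """episodes: list of [(state, reward), ...] trajectories.
    Returns V(s) = average of the returns R_t observed from s."""
    returns = defaultdict(list)                 # Returns(s)
    for episode in episodes:
        g = 0.0
        # walk backwards so g accumulates the return from each step onward
        for state, reward in reversed(episode):
            g = reward + GAMMA * g
            returns[state].append(g)
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}

episodes = [[("s0", 0.0), ("s1", 1.0)], [("s0", 1.0), ("s1", 0.0)]]
print(mc_value_estimate(episodes))
```

The incremental formula V(s_t) = V(s_t) + α[R_t - V(s_t)] is the running-average version of the same computation.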
9. Summary
1. Dynamic programming - bootstrapping: estimators are learned
from other estimators.
Requires assumptions on (a model of) the environment.
2. Monte Carlo - learning from the experience of complete
episodes. The target for the update is R_t.
10. TD Methods
• The motivation is to combine sampling with bootstrapping (DP with MC). The simplest formula is
TD(0):
V(s_t) = V(s_t) + α[r_t + γV(s_{t+1}) - V(s_t)]
We do both bootstrapping (using estimated values of V) and sampling (using r_t).
Advantages of TD over DP
• No model of the environment is needed (reward probabilities)
Advantages of TD over MC
• They can work online
• They are independent of the length of the episode
• They don’t need to ignore or discount episodes with exploratory actions, since they learn from every
visit
• Convergence is guaranteed
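A minimal tabular TD(0) sketch; env_reset, env_step, and choose_action are placeholder callables assumed for illustration (they are not defined in the talk).

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9

def td0(env_reset, env_step, choose_action, num_episodes=500):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].
    env_step(s, a) is assumed to return (next_state, reward, done)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env_reset()
        done = False
        while not done:
            a = choose_action(s)                      # e.g. epsilon-greedy
            s2, r, done = env_step(s, a)
            # bootstrapping (V[s2]) combined with sampling (r)
            target = r if done else r + GAMMA * V[s2]
            V[s] += ALPHA * (target - V[s])
            s = s2
    return V
```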
11. The Q-Function
• We discussed value functions V which are defined on
the episode states
• In some applications we wish to learn these functions
on the state-action space (s, a).
• The analysis for DP & MC is similar (you can check)
12. SARSA
• We learn each state-action pair from the sequence
S -> A -> R -> S -> A:
Q(s_t, a_t) = Q(s_t, a_t) + α[r_t + γQ(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]
We use the action at time t+1 to learn the action at time t.
• We learn Q_π
online and use the visits it produces to update.
The policy moves toward greediness.
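A sketch of the on-policy SARSA loop; the environment interface (env_reset/env_step) and the ε-greedy helper are illustrative assumptions, not code from the slides.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def epsilon_greedy(Q, state, actions):
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env_reset, env_step, actions, num_episodes=500):
    """On-policy: the target uses Q(s', a') for the action a' actually chosen in s'."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env_reset()
        a = epsilon_greedy(Q, s, actions)
        done = False
        while not done:
            s2, r, done = env_step(s, a)
            a2 = epsilon_greedy(Q, s2, actions)
            target = r if done else r + GAMMA * Q[(s2, a2)]
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```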
13. Off-Policy: Q-Learning
• The basic formula:
Q(s_t, a_t) = Q(s_t, a_t) + α[r_t + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
• The idea is that we update the previous state-action pair not with the action the
policy actually takes next, but with the best (greedy) action in the next state.
• Action selection still works in the same way, namely it follows
ε-greedy.
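The off-policy counterpart differs only in the target, which uses max_a Q(s', a) instead of the action the behavior policy takes next; a sketch under the same assumed environment interface:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def q_learning(env_reset, env_step, actions, num_episodes=500):
    """Off-policy Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env_reset()
        done = False
        while not done:
            if random.random() < EPSILON:            # explore
                a = random.choice(actions)
            else:                                    # exploit
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env_step(s, a)
            best_next = 0.0 if done else max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s2
    return Q
```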
14. Game Theory
• What is a game?
1. Two or more players
2. Strategies
3. Rewards
18. Nash Equilibrium
• Let f be the payoff function and x_1, x_2, ..., x_n the strategies. Then x* = (x_1*, ..., x_n*) is an
equilibrium if for every player i and every strategy x_i:
f_i(x_i*, x_-i*) ≥ f_i(x_i, x_-i*)
Existence Theorem
Every finite game (a finite number of players, each with finitely many pure strategies) has a Nash equilibrium in
mixed strategies.
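A small sketch that checks this definition on the prisoner's dilemma: a profile is a pure Nash equilibrium when neither player can gain by deviating unilaterally. The payoff numbers are the usual textbook values, assumed here for illustration.

```python
from itertools import product

# payoffs[(row_action, col_action)] = (row_payoff, col_payoff); C = cooperate, D = defect
payoffs = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
actions = ["C", "D"]

def is_pure_nash(profile):
    r, c = profile
    row_ok = all(payoffs[(r, c)][0] >= payoffs[(alt, c)][0] for alt in actions)
    col_ok = all(payoffs[(r, c)][1] >= payoffs[(r, alt)][1] for alt in actions)
    return row_ok and col_ok

print([p for p in product(actions, actions) if is_pure_nash(p)])  # [('D', 'D')]
```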
21. Impartial Games
• Games in which in each state both players can take the same actions
and get the same rewards
Chess, Go?
No! In these games each player has his own pieces.
What is the difference between the players in impartial games?
• Nothing! Simply, one player moves first and the other does not!
• Both players have complete information
22. Impartial Games (Cont.)
Basic Concepts
Normal Games - The last player to play wins
Misère Games - The last player to play loses
N- and P-positions
N-position - The set of states in which the first player wins
P-position - The set of states in which the second player wins
A word on terminology: We assume that each state is a starting state, thus the first
player is the Next player and the second player is the Previous player
23. Sprague-Grundy Function
• Let G be the graph of the game.
1. Every N-position has at least one follower that is a P-position.
2. Every follower of a P-position is an N-position.
The Sum of Games
If G_1 = (V_1, E_1) and G_2 = (V_2, E_2) are games, then their sum is
G = (V_1 × V_2, E), where for v_1, w_1 ∈ V_1 and v_2, w_2 ∈ V_2:
E = {((v_1, v_2), (w_1, v_2)) : (v_1, w_1) ∈ E_1} ∪ {((v_1, v_2), (v_1, w_2)) : (v_2, w_2) ∈ E_2}
24. Sprague-Grundy Function
What can we say about the sum?
P+P->P
N+P->N
N+N->?
We should do better in order to know the game!
25. The mex Function
• Let N be the set of non-negative integers and let S ⊊ N. Then
mex(S) = min(N \ S)
• Given a graph G = (V, E), we define a function
g: V -> N
g(v) = mex{ g(w) | (v, w) ∈ E }
• g(v) = 0 ⇔ v is a P-position
• For component games G_1, G_2 and their sum G we have
g(v) = g(v_1) ⊕ g(v_2) for v = (v_1, v_2)
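A minimal sketch of computing g via mex, using a single-heap subtraction game as the game graph (the move set {1, 2, 3} is an illustrative assumption, not the Nim engine from the talk):

```python
from functools import lru_cache

def mex(values):
    """Minimum excludant: the smallest non-negative integer not in the set."""
    s, m = set(values), 0
    while m in s:
        m += 1
    return m

MOVES = (1, 2, 3)  # remove 1, 2, or 3 items from the heap

@lru_cache(maxsize=None)
def grundy(n):
    """Grundy value of a heap of size n; grundy(n) == 0 means a P-position."""
    return mex([grundy(n - k) for k in MOVES if k <= n])

print([grundy(n) for n in range(10)])  # 0, 1, 2, 3, 0, 1, 2, 3, 0, 1
# For a sum of independent games, the Grundy value is the XOR of the component values.
```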
26. Impartial Games (Cont.)
Sprague-Grundy Theorem
Every normal impartial game is equivalent to Nim!
(Hence Nim is not only a game but a class of games.)
27. What is Nim?
• A two-player game, originally from China.
• It has been played in Europe since the 16th century.
• In 1901 Bouton gave the game its name and
a nice mathematical study.
28. How do we play?
• Two players
• N heaps of items, where the number of items in heap i is k_i
• In each turn a player must remove some items:
1. He must remove at least one item!
2. He can remove items from only a single heap!
• Winning strategy - simply XOR the heap sizes (proved by induction); the player to move wins iff the XOR is nonzero
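A sketch of the XOR strategy: compute the nim-sum of the heaps and, when it is nonzero, move so that the nim-sum becomes zero (the starting heaps are just an example):

```python
from functools import reduce
from operator import xor

def winning_move(heaps):
    """Return (heap_index, new_size) that makes the XOR of all heaps zero,
    or None if the position is already a P-position (XOR == 0)."""
    nim_sum = reduce(xor, heaps, 0)
    if nim_sum == 0:
        return None                    # the player to move is losing
    for i, h in enumerate(heaps):
        target = h ^ nim_sum           # heap size that zeroes the overall XOR
        if target < h:                 # must actually remove items
            return i, target
    return None

print(winning_move([3, 4, 5]))  # (0, 1): reduce heap 0 from 3 to 1, as in the play two slides below
```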
29. Examples
• A single heap with M items.
Who is going to win?
• Two heaps, one of K items and one of 1.
• Two heaps, one of M items and one of K (M > K)
30. The Three Heaps Example
• We have three heaps with 3, 4, 5 items
Let’s Play
(1,4,5)
(1,2,5)
(1,2,3)
(1,2,2)
(0,2,2)
…
31. Nim & Q-Learning
• There is a natural similarity between games and reinforcement learning:
1. A collection of states S
2. For each s ∈ S there are potential actions that a player can take.
3. Every step provides a reward
32. Q-Learning for Nim - Design & implementation
We build a Nim engine for three heaps (Eric Jarleberg)
State
• A set of actions that are available
• A Boolean variable that indicates whether this is a goal state (true for (0,0,0))
• A function that performs the XOR
• A function that performs an action and thus jumps to another state
Actions
• A number that indicates the heap in which we perform the action
• A number that indicates the number of items to remove
33. Q-Learning for Nim - Design & implementation
Agent
Agent Types
1. Random Agent - simply gets the state and randomly picks an action
2. Optimal Agent - the one that plays the optimal strategy
3. Q-learning Agent - an agent that learns to play through Q functions
Agent Implementation
1. A function that, given a state, recommends an action for the agent
2. A function that provides feedback
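A compact sketch of how these pieces could fit together: a tabular Q-learning agent for three heaps, trained against a random opponent under normal play. The state encoding, the +1/-1 reward scheme, and the hyperparameters are illustrative assumptions, not the original engine.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.3, 0.9, 0.1
START = (3, 4, 5)

def legal_actions(state):
    """All (heap_index, items_to_remove) pairs available from this state."""
    return [(i, k) for i, h in enumerate(state) for k in range(1, h + 1)]

def apply_action(state, action):
    i, k = action
    heaps = list(state)
    heaps[i] -= k
    return tuple(heaps)

def choose(Q, state, eps):
    actions = legal_actions(state)
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def train(episodes=20000):
    """Q-learning agent vs. a random opponent; the last player to move wins."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state = START
        while True:
            action = choose(Q, state, EPSILON)
            after_me = apply_action(state, action)
            if sum(after_me) == 0:                  # agent took the last item: win
                Q[(state, action)] += ALPHA * (1.0 - Q[(state, action)])
                break
            after_opp = apply_action(after_me, random.choice(legal_actions(after_me)))
            if sum(after_opp) == 0:                 # opponent took the last item: loss
                Q[(state, action)] += ALPHA * (-1.0 - Q[(state, action)])
                break
            best_next = max(Q[(after_opp, a)] for a in legal_actions(after_opp))
            Q[(state, action)] += ALPHA * (GAMMA * best_next - Q[(state, action)])
            state = after_opp
    return Q

Q = train()
print(choose(Q, START, eps=0.0))  # greedy move from (3, 4, 5)
```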
34. Q-Learning for Nim - Design & implementation
• What is our objective?
1. We wish to learn as fast as possible
2. We wish to have an accurate Q function
Remarks
1. The learning process is sensitive to opponents
2. We may obtain an optimal policy without having a good Q function
35. Q-Learning for Nim
• Loss function:
We learn a policy π, and for each game state s we have:
U_π(s) = Σ_{i=0}^{∞} γ^i R(s_i)
• Originally we have for each s its U*, and we evaluate the learning by
minimizing:
|U_π - U*| (i.e., min L_1).
36. How to Estimate?
1. L_1 - simply minimize |U_π - U*|
2. For every state there is only one “right” action, so we can measure
the number of failures rather than the utility function. What we do
is count the wrong actions for each s ∈ N. (Due to remark 2)
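A sketch of the second metric, reusing legal_actions, apply_action, and the trained Q from the sketch above: enumerate the N-positions, take the agent's greedy action in each, and count how often it fails to reach a position with nim-sum zero. The bound on heap sizes and the XOR test are assumptions based on the winning strategy described earlier.

```python
from functools import reduce
from itertools import product
from operator import xor

def nim_sum(state):
    return reduce(xor, state, 0)

def count_wrong_actions(Q, max_heap=5):
    """Count N-positions where the greedy action does not lead to a P-position.
    Positions never visited during training will typically count as wrong."""
    wrong, total = 0, 0
    for state in product(range(max_heap + 1), repeat=3):
        if sum(state) == 0 or nim_sum(state) == 0:
            continue                    # skip the terminal state and P-positions
        total += 1
        greedy = max(legal_actions(state), key=lambda a: Q[(state, a)])
        if nim_sum(apply_action(state, greedy)) != 0:
            wrong += 1
    return wrong, total

print(count_wrong_actions(Q))  # (number of wrong greedy moves, number of N-positions checked)
```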
43. Remarks
1. The most important factor is the opponent
2. The Q-function assigns positive values to states in N,
i.e., learning to play is equivalent to classifying states into N & P
Corollary: We can apply Q-learning to any impartial game
3. An interesting next step could be to use actor-critic methods and compare
policies through KL divergence.
44. Good Sources
• https://github.com/jamesbroadhead/Axelrod/blob/master/axelrod/strategies/qlearner.py
• https://github.com/NimQ/NimRL
• http://www.csc.kth.se/utbildning/kth/kurser/ - RL for Nim, Erik Jarelberg
• https://papers.nips.cc/paper/2171-reinforcement-learning-to-play-an-optimal-nash-equilibrium-in-team-markov-games.pdf
• https://www.researchgate.net/profile/J_Enrique_Agudo/publication/221174538_Reinforcement_Learning_for_the_N-Persons_Iterated_Prisoners'_Dilemma/links/5535ec0c0cf268fd0015f0ac/Reinforcement-Learning-for-the-N-Persons-Iterated-Prisoners-Dilemma.pdf
• Gabriel Nivasch –Ariel University
• Sutton & Barto