This document is a final report for a CS799 course that explores using reinforcement learning to train an agent to play a chasing game. The author defines the game environment and mechanics, then uses Q-learning with an epsilon-greedy exploration strategy to train an agent to maximize its score by collecting vegetables while avoiding walls, minerals, and other players. The agent is trained in multiple phases to first avoid walls, then minerals, and finally other players while collecting vegetables. Results are presented comparing training with different exploration vs exploitation settings.
CS799 Final Report
Reinforcement Learning to Train Agent
Abhanshu Gupta
Department of Computer Sciences
University of Wisconsin Madison
Email: abhanshu@cs.wisc.edu
Abstract—Reinforcement learning is the learning of a mapping from situations to actions so as to maximize a scalar reward or reinforcement signal. The learner is not told which action to take, as in most forms of learning, but instead must discover which actions yield the highest reward by trying them. In the most interesting and challenging cases, actions affect not only the immediate reward but also the next situation, and through that all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the most important distinguishing features of reinforcement learning. In this report, I present a study of applying reinforcement learning to design an automatic agent that plays a chasing game on the given Agent-World test-bed. One of the challenges in the game is how to handle the agent in the complex and dynamic game environment. By abstracting the game environment into a state vector defined by sensor readings, and by using Q-learning, an algorithm oblivious to transition probabilities, I achieve tractable computation time and fast convergence. In the initial phase, my agent is trained to avoid walls. In the next phase, the agent is trained to avoid minerals and to maximize its score by collecting vegetables. In the last phase of training, the agent is trained to avoid other players, minerals, and walls while maximizing its score by collecting vegetables. I also compared and analysed the results of training the agent with different exploration versus exploitation choices, keeping the learning rate α and the discount factor γ constant.
I. INTRODUCTION
Using artificial intelligence (AI) and machine learning (ML) algorithms to play computer games has been widely discussed and investigated, because valuable observations can be made by comparing the ML play pattern with that of a human player, and such observations provide knowledge on how to improve the algorithms. Agent-World [1] provides the framework to play the classic chasing game.
Reinforcement Learning (RL) [2] is one widely studied and promising ML method for implementing agents that can simulate the behaviour of a player. The reinforcement learning problem is summarized in Figure 1. On some short time cycle, a learning agent receives sensory information from its environment and chooses an action to send to the environment. In addition, the learning agent receives a special signal from the environment called the reward. Unlike the sensory information, which may be a large feature vector, or the action, which may also have many components, the reward is a single real-valued scalar number. The goal of learning is the maximization of the cumulative reward received over time. Reinforcement learning systems can be defined as learning systems designed for, and that perform well on, this problem. Informally, we define reinforcement learning as learning by trial and error from performance feedback, i.e., from feedback that evaluates the behaviour generated by the learning agent but does not indicate correct behaviour.

Fig. 1: The Reinforcement Learning Problem. The goal is to maximize the cumulative reward [3].
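The interaction loop just described can be sketched in a few lines of Python. This is a minimal sketch only: the environment and agent objects and their methods (reset, step, choose_action, update) are hypothetical stand-ins, not the actual Agent-World interface.

```python
# Minimal sketch of the reinforcement learning interaction loop.
# `environment` and `agent` are hypothetical stand-ins for the Agent-World
# test-bed and the learning agent described in this report.

def run_episode(environment, agent, max_steps=1000):
    """Run one episode and return the cumulative reward the agent collected."""
    state = environment.reset()                      # initial sensor readings
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.choose_action(state)          # pick a move direction
        next_state, reward, done = environment.step(action)
        agent.update(state, action, reward, next_state)  # learn from feedback
        total_reward += reward
        state = next_state
        if done:                                     # allotted game time is over
            break
    return total_reward
```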
In this project, I study how to construct an RL controller agent that can learn from the game environment. One of the difficulties of using RL is how to define the state, action, and reward. In addition, playing the game within the framework requires real-time response, so the state space cannot be too large. I use a state representation, sensed through the agent's sensors, that abstracts the whole environment description into several discrete-valued key attributes. I use the Q-learning algorithm to evolve the decision strategy that aims to maximize the reward.
The rest of this report is organized as follows: Section 2 provides a brief overview of the Agent-World test-bed and the Q-learning algorithm; Section 3 explains how I define the state, action, and reward used in the RL algorithm; Section 4 provides evaluation results; Section 5 summarizes current results and discusses continued future work on the project.
II. BACKGROUND
In this section, I briefly introduce the Agent-World framework interface and the Q-learning algorithm I used.
A. Game Mechanics and Agent-World Test-bed
The goal of the game is to control the agent to maximize its score while playing, by collecting vegetables and hitting other animals with minerals. An agent senses the environment using its sensors and then makes an intelligent decision. In each iteration, the agent moves a unit step in a particular direction to maximize its score; the set of permissible directions is aligned with the directions of the sensors. The game is over when the allotted time for each game ends. At the end, the player with the maximum score is designated the winner.
The provided test-bed, Agent-World, is conducive to testing artificial intelligence techniques in a simplified universe. The universe of this test-bed consists of four types of entities: animals (also known as 'agents' or 'players'), minerals, vegetables, and walls. One or more agents will be in a universe at a time. For each action an agent takes, it receives a reward. Negative rewards include running into walls, pushing minerals, bumping into other agents, and running into sliding minerals. Positive rewards include eating vegetables and pushing minerals into other agents. During the game, when performing each step, the Agent-World framework interface call returns the complete observation of the environment through sensor readings. The number of sensors can be configured, and each sensor has a fixed range. The environment of this test-bed, along with the sensor locations, is shown in Figure 2. The sensor readings are an array containing the positions and types of enemies/items/platforms within this range. The framework interface call can also provide the reward for the agent for being in the current state. This is the whole information available to my agent.

Fig. 2: Agent-World Test-Bed Environment
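As a concrete illustration of the information available at each step, the sketch below models one observation as an array of per-sensor readings (type and distance of the sensed object) together with the scalar reward returned by the framework. The class and field names are illustrative assumptions, not the actual Agent-World API.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical representation of what the framework returns at each step:
# one reading per configured sensor plus a scalar reward for the current state.

OBJECT_TYPES = ("nothing", "animal", "vegetable", "mineral", "wall")

@dataclass
class SensorReading:
    object_type: str   # one of OBJECT_TYPES
    distance: float    # distance to the sensed object, within the sensor range

@dataclass
class Observation:
    readings: List[SensorReading]  # one entry per sensor (36 in this project)
    reward: float                  # reward for being in the current state
```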
B. ε-Greedy Q-Learning
Q-learning treats the learning environment as a state machine and performs value iteration to find the optimal policy. It maintains a value of the expected total (current and future) reward, denoted by Q, for each (state, action) pair in a table. It may at first seem surprising that one can choose globally optimal action sequences by reacting repeatedly to the local values of Q for the current state. This means the agent can choose the optimal action without ever conducting a lookahead search to explicitly consider what state results from the action. Part of the beauty of Q-learning is that the evaluation function is defined to have precisely this property: the value of Q for the current state and action summarizes in a single number all the information needed to determine the discounted cumulative reward that will be gained in the future if action a is selected in state s. Thus the Q-value for each state-action transition equals the reward value for this transition plus the maximum value for the resulting state, discounted by γ. So for each action in a particular state, a reward r will be given and the Q-value is updated by the following rule for the deterministic case:

Q(s_t, a_t) ← r + γ max_a Q(s_{t+1}, a)    (1)
Above, Q-learning was considered in deterministic environments, but the environment in most games is non-deterministic, so the state resulting from action a_t in state s_t is not known in advance. The training rule derived for the deterministic case (1) fails to converge in this non-deterministic setting: a non-deterministic reward function r(s, a) produces different rewards each time the transition (s, a) is repeated, so the rule would repeatedly alter the values of Q(s, a) even if the Q table were initialized to the correct Q function. In brief, that training rule does not converge. This difficulty can be overcome by modifying the rule so that it takes a decaying weighted average of the current Q value and the revised estimate. Writing Q_n to denote the agent's estimate on the nth iteration of the algorithm, the following revised training rule is sufficient to assure convergence of Q:

Q(s_t, a_t) ← (1 − α_{s,a}) Q(s_t, a_t) + α_{s,a} (r + γ max_{a'} Q(s_{t+1}, a'))    (2)
In the Q-learning algorithm there are four main factors: the current state, the chosen action, the reward, and the future state. In (2), Q(s_t, a_t) denotes the Q-value of the current state-action pair and Q(s_{t+1}, a') denotes the Q-value of the future state. α ∈ [0, 1] is the learning rate, γ ∈ [0, 1] is the discount rate, and r is the reward. Equation (2) shows that for each current state, I update the Q-value as a combination of the current value, the current reward, and the maximum possible future value.
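As a minimal sketch of how update rule (2) can be implemented for a tabular representation (the per-action perceptron approximation actually used in this project is described in Section III), the following Python fragment uses illustrative names such as q_table; it is not part of the Agent-World framework.

```python
from collections import defaultdict

GAMMA = 0.9  # discount rate used in this report

# Q table keyed by (state, action); missing entries default to 0. Illustrative structure only.
q_table = defaultdict(float)

def q_update(s_t, a_t, reward, s_next, actions, alpha):
    """Update rule (2): a weighted average of the old estimate and the
    new sample r + gamma * max_a' Q(s_{t+1}, a')."""
    best_next = max(q_table[(s_next, a)] for a in actions)
    sample = reward + GAMMA * best_next
    q_table[(s_t, a_t)] = (1 - alpha) * q_table[(s_t, a_t)] + alpha * sample
```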
I chose Q-learning for two reasons:
1) Although I have modelled the game as an approximate Markov model, the specific transition probabilities between states are not known. Had I used standard value iteration, I would also have had to learn the state transition probabilities. Q-learning, on the other hand, can converge without using state transition probabilities (it is "model free"), so it suits my needs well.
2) When updating a value, standard value iteration needs to compute the expected future state value, which requires reading the entire state table. In comparison, Q-learning only needs to fetch two rows (the values for s_t and s_{t+1}) in the Q table. With a Q table whose dimension is in the thousands, a Q-learning update is much faster, which also means that, given the computation time and memory constraints, using a Q table allows a larger state-space design.
The learning rate α affects how fast learning converges. I use a decreasing learning rate α_{s,a} that differs across (s, a) pairs. The value of α is given by:

α_{s_t, a_t} = 1 / (1 + visited(s_t, a_t))    (3)

This equation is chosen based on the criteria proposed in Watkins' original Q-learning paper [4], which shows that the following properties of α are sufficient for the Q values to converge:
1) α(s_t, a_t) → 0 as t → ∞.
2) α(s_t, a_t) decreases monotonically with t.
3) Σ_{t=1}^{∞} α(s_t, a_t) = ∞.
One can easily verify that the sequence (3) satisfies all of these properties.
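A small sketch of the visit-count-based learning rate (3) follows; the counter name is illustrative and not an Agent-World API.

```python
from collections import defaultdict

visit_count = defaultdict(int)  # visited(s, a): number of times (s, a) has been updated

def learning_rate(s, a):
    """Equation (3): alpha = 1 / (1 + visited(s, a)). It tends to 0 as the
    visit count grows, decreases monotonically, and its sum over visits
    diverges (harmonic series), matching the three properties above."""
    return 1.0 / (1.0 + visit_count[(s, a)])

def record_visit(s, a):
    visit_count[(s, a)] += 1
```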
The discount factor γ denotes how much the future state is taken into account during optimization. I evaluated several values and chose 0.9 as the final value. When training my agent, I used ε-greedy Q-learning to explore more states. The algorithm is a small variation of Q-learning: at each step the algorithm chooses the best action according to the Q table with probability (1 − ε), and a random action with probability ε. After performing the action, the Q table is updated as in (2).
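A minimal sketch of the ε-greedy action selection is given below; the callable q_value is an illustrative stand-in for the per-action Q estimators described in Section III.

```python
import random

EPSILON = 0.2  # exploration probability used during training in this report

def epsilon_greedy_action(state, actions, q_value, epsilon=EPSILON):
    """Pick a random action with probability epsilon (explore); otherwise
    pick the action with the highest estimated Q value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_value(state, a))
```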
III. AGENT CONTROLLER DESIGN
In this section, I describe the design of the state, actions, and rewards used in the Q-learning algorithm.
A. Agent State
The state of the agent at any point in the game is defined as the combined vector of the readings of all sensors. The state vector is computed using a one-of-k encoding of the value sensed by each sensor. The magnitude of each entry of this vector measures the proximity of the sensed object to the agent; it is computed from the distance between the agent and the sensed object after normalizing by the range of the sensor. In this project, I have used 36 sensors with a uniform angular separation of 10°. Each sensor is configured to sense five different possibilities {nothing, animal, vegetable, mineral, wall} in a unique direction, and together the sensors define the agent's environment in a particular state. This state vector is necessary and sufficient to define the physical state of the agent at any point in time. The Q-function values for each of the actions are then computed from the state using a perceptron model, trained with the back-propagation rule and a learning rate η of 0.1. Thus I initialize one perceptron per action, each generating different Q values for different state input vectors based on what it has learned.
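The following sketch illustrates one plausible reading of the state encoding and of the per-action perceptron Q estimators (using the ReLU activation and squared-error objective described in Section IV). The sensor-reading format, the proximity convention (closer objects mapping to larger entries), and all names are assumptions for illustration, not the Agent-World interface.

```python
import numpy as np

N_SENSORS = 36
CATEGORIES = ["nothing", "animal", "vegetable", "mineral", "wall"]
STATE_DIM = N_SENSORS * len(CATEGORIES)   # 36 x 5 = 180-dimensional state vector

def encode_state(readings, sensor_range):
    """One-of-k encoding of the sensor array. `readings` is assumed to be a
    list of (object_type, distance) pairs, one per sensor. The active entry
    holds a proximity value derived from the distance normalized by the
    sensor range (here: closer object -> value nearer to 1)."""
    state = np.zeros(STATE_DIM)
    for i, (obj_type, distance) in enumerate(readings):
        k = CATEGORIES.index(obj_type)
        if obj_type == "nothing":
            state[i * len(CATEGORIES) + k] = 1.0     # nothing sensed in this direction
        else:
            state[i * len(CATEGORIES) + k] = 1.0 - min(distance, sensor_range) / sensor_range
    return state

class PerceptronQ:
    """One single-layer perceptron per action: Q(s, a) = relu(w_a . s + b_a),
    trained on a squared-error target by back-propagation with eta = 0.1."""
    def __init__(self, eta=0.1):
        self.w = np.random.uniform(-0.1, 0.1, STATE_DIM)  # Phase I initialization range
        self.b = np.random.uniform(-0.1, 0.1)
        self.eta = eta

    def predict(self, state):
        return max(0.0, float(np.dot(self.w, state) + self.b))  # ReLU output

    def update(self, state, target):
        z = float(np.dot(self.w, state) + self.b)
        if z > 0:                                  # ReLU gradient is zero when inactive
            err = max(0.0, z) - target             # derivative of 0.5 * (pred - target)^2
            self.w -= self.eta * err * state
            self.b -= self.eta * err

q_nets = [PerceptronQ() for _ in range(36)]        # one estimator per action
```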
B. Action
In this game, an action is defined as a step taken by the agent in a particular direction, which leads to a transition in its state from s_t to s_{t+1}. In this project I have considered 36 valid actions, each corresponding to a step taken in the direction of one of the 36 sensors.
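As a small illustration of this action set, the mapping below converts an action index into a unit step along the corresponding sensor direction; the angle convention (action 0 along the positive x-axis) is an assumption.

```python
import math

N_ACTIONS = 36  # one action per sensor direction, 10 degrees apart

def action_to_step(action_index, step_size=1.0):
    """Map an action index to a unit displacement along the corresponding
    sensor direction."""
    angle = math.radians(action_index * 10.0)
    return step_size * math.cos(angle), step_size * math.sin(angle)
```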
C. Rewards
The reward function for this game is provided by the test-bed framework interface. For each action an agent takes, it receives a reward. The reward r(s_t, a_t) is determined by the test-bed based on the new state s_{t+1} that the agent will be in after taking action a_t in the current state s_t. Negative rewards include running into walls (-1), pushing minerals (-2), bumping into other agents (-3), and running into sliding minerals (-25). Positive rewards include eating vegetables (+5) and pushing minerals into other agents (+25). If the agent does not run into any other object, it receives a reward of zero.
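For reference, the reward values above can be summarized in a small table; the event names are illustrative labels, not identifiers from the test-bed interface.

```python
# Reward values reported by the Agent-World test-bed (Section III-C).
REWARDS = {
    "ran_into_wall": -1,
    "pushed_mineral": -2,
    "bumped_into_agent": -3,
    "ran_into_sliding_mineral": -25,
    "ate_vegetable": +5,
    "pushed_mineral_into_agent": +25,
    "no_collision": 0,
}
```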
IV. EVALUATION RESULTS
For training the ε-greedy Q-learning algorithm, I have used perceptrons with a squared-error objective function and a ReLU activation function to compute the Q values for each action. The training of the agent is done in three phases, with each phase using the learned network of the previous phase as its starting point. For the first phase of training, the perceptron weights for each action are initialized from a uniform distribution over [-0.1, 0.1]. Each training game is defined as a set of 5000 iterations. Each game of training was done with 36 perceptrons and parameter values of η = 0.1, ε = 0.2, and γ = 0.9. The training terrain used for each phase, along with its motivation, is described below:
Fig. 3: Learning Curve for Phase-I Training
A. Phase I
The goal of this phase of training is to train the agent to avoid walls, so training is done on a barren field for around 50 games using a constant α value of 1.0. The learning curve for this training is shown in Figure 3. The curve is generated by taking the learned network weights after every 10 games and then testing those networks with α = 0 for 5 games, averaging the score. This phase trained the agent to avoid walls.
B. Phase II
The goal of this phase of training is to train the agent to avoid walls and minerals while maximizing its score by taking vegetables. Training is therefore done on a sparse field with 50 minerals and 50 vegetables for around 200 games, using a constant α value of 0.8 for the first 100 games and 0.4 for the last 100 games. The learning curve for this training is shown in Figure 4. The curve is generated by taking the learned network weights after every 10 games and then testing those networks with α = 0 for 5 games, averaging the score. This phase trained the agent to take as many vegetables as possible while avoiding minerals and walls.
C. Phase III
The goal of this phase of training is to train the agent to maximize its score by taking vegetables and hitting other players while avoiding collisions with walls, minerals, and other players. For this training I selected a topography with 100 minerals and 100 vegetables along with a few players: 2 anonymous players, 1 random walker, and a smart player to train my RL-player against.
Fig. 4: Learning Curve for Phase-II Training
Fig. 5: Learning Curve for Phase-III Training
The training for this phase was carried out for 500 games, using a constant α value of 1.0 for the first 100 games; thereafter the value was decremented by 0.05 after every 20 games of training for the rest of the training. The learning curve for this training is shown in Figure 5. The curve is generated by taking the learned network weights after every 20 games and then testing those networks with α = 0 for 10 games, averaging the score. This phase trained the agent to take as many vegetables as possible while hitting other players, and it helped the agent learn to recognize other players and avoid colliding with them.
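One reading of the Phase III learning-rate schedule is sketched below; the exact game at which the first decrement takes effect and the lower floor of 0.0 are assumptions, since the report does not state them.

```python
def phase3_alpha(game_index):
    """alpha is held at 1.0 for the first 100 games, then reduced by 0.05
    after each further block of 20 games, never going below 0.0."""
    if game_index < 100:
        return 1.0
    decrements = (game_index - 100) // 20 + 1
    return max(0.0, 1.0 - 0.05 * decrements)
```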
Fig. 6: Learning Curve Comparison for Exploration v/s
Exploitation
D. Exploitation v/s Exploration
For this comparison I trained a new model following the same three-phase training procedure described above, but with ε = 1.0. This value of ε ensures that every step the agent takes is in a random direction, so the agent only explores the environment during this type of training. After the completion of the third phase of training for this method, I calculated the data points of its learning curve by taking the trained network weights at intervals of 20 games and then computing the average score for those weights by testing on 10 games, with the agent moving in the direction of maximum Q value and α = 0.0. The difference between the two learning curves can be seen in Figure 6.
V. CONCLUSION
In this project, I designed an automatic agent that uses Q-learning to play the game on the Agent-World test-bed. My learning algorithm demonstrates fast convergence to the optimal Q-value with a high success rate. The resulting policy achieves a good positive score and can be improved further by training the agent over more games. In addition, my results show that the state description is general enough that my policy can tackle different, random environments. Further, I observe that long-term reward maximization outperforms short-term reward maximization. I draw the following conclusions about Q-learning from my experiments on the Agent-World test-bed:
• Using a decaying learning rate (α) converges to a better policy than using a fixed learning rate.
• Many more training episodes are needed to train the agent than were used in this study.
• Training the agent in phases divides the learning task into stages and thus helps the weights converge at a much faster rate.
• Exploration is a good option for training in the initial stages but is not beneficial later on, owing to very slow convergence compared to the exploitation scheme.
I believe that my work provides a good introduction to this problem and will benefit people interested in using reinforcement learning to play computer games. Some continued future work on the project includes:
• Training the agent for more episodes to obtain a more detailed view of the learning curve.
• Using neural networks with hidden layers instead of single perceptrons for Q-value computation and comparing the effect.
• Introducing other types of smart players and RL-players into the training process and observing their effect on the agent's training.
REFERENCES
[1] Framework: http://pages.cs.wisc.edu/~shavlik/cs540/html/agent-world.html
[2] Tom M. Mitchell, Machine Learning, McGraw-Hill Science, 1997.
[3] Richard S. Sutton, Reinforcement Learning Architectures, GTE Laboratories Incorporated, Waltham, MA.
[4] Watkins and Dayan, "Q-learning," Machine Learning, 1992.
[5] J. Shavlik, lecture notes, CS 760 - Machine Learning, Univ. of Wisconsin - Madison, 2010.
[6] David Page, lecture notes, CS 760 - Machine Learning, Univ. of Wisconsin - Madison, 2015.
[7] Mark Craven, lecture notes, CS 760 - Machine Learning, Univ. of Wisconsin - Madison, 2016.