University Of Freiburg
Department Of Computer Science
Machine Learning Lab
Master’s Thesis
Echo State Fitted-Q Iteration
Approach To Learn Delayed
Control Systems
Ramin Zohouri
16. December 2015
First Reviewer: Dr. Joschka Boedecker
Second Reviewer: Prof. Dr. Moritz Diehl
Supervisor: Thomas Lampe
Machine Learning Lab
Department Of Computer Science
Systems Control and Optimization Laboratory Department of Microsystem
Engineering
University Of Freiburg
Author : Ramin Zohouri
Master’s Degree Program in Computer Science
Master’s Thesis
Echo State Fitted-Q Iteration Approach To Learn Delayed Control Systems
First Reviewer: Dr. Joschka Boedecker
Second Reviewer: Prof. Dr. Moritz Diehl
Supervisor: Thomas Lampe
Submitted : 16.12.2015
ABSTRACT
Stability and performance of dynamically controlled systems tend to degrade in the presence of delays in their applied actions or measured system states. Such delays magnify the difficulty of learning control systems, particularly when their dynamic models are not known to us. In this work, we introduce an effective and efficient Q-learning algorithm, Echo State Fitted-Q Iteration (ESFQ), to learn delayed control systems. Our method employs reservoir computing to hold the history of the raw system state measurements and to estimate a proper Q-value for the applied actions. To configure our algorithm and achieve high performance, we use hyper-parameter optimization tools. Experimental results on different simulated benchmarks with various delay lengths show improved performance in comparison to the standard tapped delay-line algorithm. Furthermore, our results illustrate the benefits of a nonlinear readout layer for the echo state Q-function on learning delayed control tasks with complex dynamics.
07.2015 - 12.2015
Ramin Zohouri
ACKNOWLEDGEMENT
I have put considerable effort into this project. However, it would not have been possible without the kind support and help of many individuals and organizations. I would like to extend my sincere thanks to all of them. I would like to express my appreciation and special thanks to my advisers, Dr. Joschka Boedecker and Mr. Thomas Lampe; you have helped me a tremendous amount! I would like to thank you for encouraging my research ideas and guiding me to grow as a researcher. Your advice on my research and the extended discussions we had in the past few months were priceless to me, and I am truly grateful for everything; to you in the first place, but also to the University of Freiburg for providing such a fruitful and motivating environment. I would like to express my gratitude towards Prof. Dr. Moritz Diehl for his kind cooperation and encouragement, which helped me in the completion of this project. I would also like to thank my friends and the members of the Machine Learning Lab, Manuel Blum, Jost Tobias Springenberg, Manuel Watter, and Jan Wülfing, who helped me a lot by giving me fresh ideas and discussing my problems when I needed it the most. Last but not least, I want to give my special thanks to Thorsten Engesser, Robin Tibor Schirrmeister, and Martin Goth, who proofread my thesis and kindly helped me with translating the abstract of this report into German.
CONTENTS
1 Introduction
1.1 Dynamic Control Systems And Reinforcement Learning
1.2 Goals and Contributions
1.3 Outline
2 Related Works
2.1 Non-Markovian Reinforcement Learning
2.1.1 Markov Decision Process With Delays And Asynchronous Cost Collection
2.1.2 Learning And Planning In Environments With Delayed Feedback
2.1.3 Control Delay in Reinforcement Learning for Real-Time Dynamic Systems
2.2 Recurrent Neural Networks In Reinforcement Learning
3 Reinforcement Learning In Feedback Control
3.1 Learning In Feedback Control
3.2 Markov Decision Process
3.3 Q-Learning
3.3.1 Batch Reinforcement Learning
3.3.2 Neural Fitted Q-Iteration (NFQ)
3.3.3 Resilient Propagation (Rprop)
4 ESN For Function Approximation
4.1 Recurrent Neural Networks
4.2 A Brief Overview On Echo State Networks
4.2.1 Learning In Echo State Networks
4.2.2 Spectral Radius and Echo State Property
4.3 Echo State Fitted-Q Iteration
4.4 Types Of Readout Layers For Echo State Q-Function
4.4.1 Ridge Regression
4.4.2 Multilayer Perceptron Readout Layer
4.5 SMAC For Hyper Parameter Optimization
5 Experiments And Results
5.1 Experimental Setup
5.2 Simulation Benchmarks
5.2.1 Mountain Car
5.2.2 Inverted Pendulum
5.3 Experimental Procedure And Parametrisation
5.3.1 Echo State Q-Function
5.3.2 Hyper-parameter Optimization
5.4 Comparisons To Delay-Line NFQ Algorithm
5.5 Why Not Linear Readout Layers
5.5.1 Ridge Regression On Mountain Car
5.5.2 Ridge Regression On Inverted Pendulum
5.5.3 Very Large Reservoir
6 Conclusion
6.1 Summary and Discussion
6.2 Future Works
CHAPTER 1
INTRODUCTION
1.1 Dynamic Control Systems And Reinforcement Learning
There is an increasing demand for building automatic controllers for systems with
complex dynamics such as aircraft, magnetic levitation trains, and chemical plants, but
also in objects of daily life such as air conditioners and computer drives [51, 18]. These
systems are characterized as having outputs or feedback that can be measured, inputs
that can be manipulated, and internal dynamics. Feedback control involves computing
and applying suitable control inputs to minimize the differences between the observed
and desired behaviour of a dynamic system. Feedback control design approaches include
classical design methods for linear systems, multivariate control, nonlinear control, op-
timal control, robust control, H-infinity control, adaptive control, and others [77, 8, 35].
Designing a controller has the advantage of being general and applicable to various sys-
tems. However, the design of good controllers is a tedious and demanding job, due
to unknown dynamics and underactuation, modelling errors, and various sorts of dis-
turbances, uncertainties, and noise. Furthermore, since there is no learning involved, for
a new control task with relatively similar dynamics a new system model has to be created.
In contrast to the classical design process, reinforcement learning is geared towards
learning appropriate closed-loop controllers by simply interacting with the process and
incrementally improving control behaviour. Reinforcement learning is learning to act by
trial and error. In this paradigm, an agent can perceive its state and perform actions.
After each action, a numerical reward is given. The goal of the agent is to maximize the
total reward it receives over time. Reinforcement learning has been successful in appli-
cations as diverse as autonomous helicopter flight, legged robot locomotion, cell-phone
network routing, marketing strategy selection, factory control, and efficient web page
indexing. Despite these successes, there are some challenges and problems which make
learning a control task difficult.
In an online control system, the data is sent and received by network nodes of different
types and manufacturers. Sensor nodes measure process values and transmit these over
the communication network. Actuator nodes receive new values for the process inputs
over the communication network and apply these to the process input. Controller nodes
read process values from sensor nodes. A control algorithm calculates the control signals
and sends them to the actuator nodes. Communication networks inevitably introduce
delays, both due to limited bandwidth, but also due to overhead in the communicating
nodes and the network. In many systems, there are various delay types. From the control
perspective, a control system with varied delay types will no longer be time-invariant.
Therefore, standard optimal control theory cannot be used to analyze and design
these systems. We can categorize different types of control delays based on their place of
occurrence.
• Communication delay between the sensor and the controller.
• Computational delay in the controller.
• Communication delay between the controller and the actuator.
The effect of time delay on the stability and performance of control systems has drawn
the attention of many investigators in different engineering disciplines, including optimal control systems [1, 45, 10, 43, 76] and reinforcement learning [5, 39]. In general, time delay in active control systems causes an unsynchronized application of the control forces, and this unsynchronized control not only degrades the system performance but also causes instability in the system response [61]. Dynamic computational models with delays require the ability to store and access the time history of their inputs and outputs, and there are various difficulties in learning and controlling such systems. The first problem is that there is no explicit teacher signal that indicates the correct output at each time step.
In addition, the temporal delay of the reward signal implies that the learning system
must assign temporal reward/cost to each of the states and actions that resulted in the
final outcome of the sequence.
To overcome this restriction, several, more general approaches are considered that re-
tain state information over time. We refer to these as stored-state methods. The simplest
of these approaches is one which augments immediate sensory inputs with a delay line to
achieve a crude form of short term memory [37]. This approach has been successful in
certain speech recognition tasks [69]. The most common dynamic neural architecture is the time delayed neural network [70], which couples delay lines with a nonlinear static architecture where all the parameters (weights) are adapted
with the Backpropagation algorithm. Another alternative is called the method of pre-
dictive distinctions [2, 13, 59]. Following this approach, the system learns a predictive
model of the sensory inputs i.e. environmental observables, and then uses the internal
state of this model to drive action selection.
A third approach uses a recurrent neural network [50, 21, 73, 57, 28, 19] in combina-
tion with existing reinforcement learning methods to learn a recurrent state-dependent
control policy directly [26]. Since they naturally keep a history of the system states, recurrent neural networks are powerful tools that can be employed to learn delayed control systems. Formerly, training of recurrent neural networks was performed
using Backpropagation through time [73], which actually means unfolding the network in
time and constructing a much bigger network, then performing Backpropagation on this
new network. However, besides the fact that this process is very slow, it does not always
guarantee a good solution because of the vanishing gradient problem.
1.2 Goals and Contributions
As we described in the previous section, in this work we aimed to learn the behaviour
of control systems which have unknown delays without knowing their underlying dy-
namics. For this task, Q-learning, a particular model-free reinforcement learning algorithm, is a feasible solution. Basically, the Q-learning algorithm uses function approximation to choose the best control action given the current system state and minimizes the overall cost of reaching the goal area. The presence of delay in control systems violates the Markov property, which is a fundamental assumption in formulating a successful reinforcement learning algorithm. The Markov property simply says that the conditional probability distribution of future states of the process depends only on its present state. Hence, for a delayed task the function approximator, i.e. the Q-function, needs to be able to preserve the Markov property by holding the history of
a system state. Furthermore, the learning algorithm for a control system, in general,
should be able to deal with problems of a long learning time and convergence. Last but
not least, the learning algorithm we described above will have various parameters which
are interdependent and must be selected carefully to boost the learning performance.
To achieve the objectives we mentioned above, we designed and implemented the Echo
State Fitted-Q Iteration (ESFQ) algorithm. It is a type of offline batch reinforcement
learning algorithm. The ESFQ algorithm has several advantages which make it a suit-
able tool to learn control tasks with delays. First, it is a batch reinforcement learning
algorithm, which is able to use all the previously seen control trajectories, which accelerates the learning procedure. Second, it employs echo state networks (ESNs) as a function approximator to estimate delayed targets. The ESN is a specific type of recurrent neural network whose short-term memory capacity enables it to hold the history of the system states. Therefore, it is able to preserve the Markov property for a reinforcement learning algorithm. The third contribution of our method is the ability to use different readout layers for training the echo state Q-function. This ability helps the function approximator to learn complex dynamic systems. Our algorithm benefits from its unconventional way of preserving the history of system states and from the ability to train different readout layers, which makes it suitable for a particular control task. However, it has several parameters which need to be carefully tuned in order to maximize its performance. To achieve this goal we employed hyper-parameter optimization tools to automatically determine good choices and remove a manual configuration step.
1.3 Outline
In the following, we first cover the major works related to our method and briefly compare and contrast each of them with our approach in chapter 2. Then we describe fundamental
features of our algorithms, reinforcement learning with feedback control in chapter 3,
and echo state networks for function approximation in chapter 4. After that, we proceed
by presenting the results of our method on different standard benchmarks in chapter
5. There also we compare our results to the standard tapped delay-line reinforcement
5
CHAPTER 1. INTRODUCTION 1.3. OUTLINE
learning algorithm. At the end, in chapter 6 we summarize this work by mentioning our
major contributions and addressing the future work that could be done.
CHAPTER 2
RELATED WORKS
2.1 Non-Markovian Reinforcement Learning
The Markov decision process (MDP) is a popular framework for formulating control problems [22] in reinforcement learning. In summary, given a certain state of a system, the agent selects an action that brings the system to a new state and induces a cost; the new state is observed and the cost is collected, then the decision maker selects a new action, and so on. However, the basic MDP framework, as it is defined in [34], makes a number of restrictive assumptions that may limit its applicability:
• the system’s current state is always available to the agent;
• the agent’s actions always take effect immediately;
• the cost induced by an action is always collected without delay.
The Markov property is usually assumed in reinforcement learning and no longer holds when any of the described conditions is violated. It follows that if an agent must learn to control such a system, there will be periods of time when the internal representation of system states is inadequate. The decision task will therefore be non-Markovian, which challenges the performance of reinforcement learning algorithms. The presence of delay in measured states or applied actions causes this violation of the Markov property. In the following, we review three major state-of-the-art approaches for reinforcement learning with delay.
2.1.1 Markov Decision Process With Delays And Asynchronous
Cost Collection
In the first approach, the state space of the learning agent is augmented with the actions
that were executed during the delay interval. In [16, 34] the authors showed how (discrete-time, total-expected-cost, infinite-horizon) Markov decision processes with observation, action, and cost delays can be reduced to a Markov decision process without delays.
They drew connections amongst the three delay types and demonstrated that the cost
structure of the process without delays is the same as that of the original process with
constant or random delays. They considered an embedded process that behaves similarly
to a process with constant delays. However, their results are based on the intuition of
asynchronous cost collection. That means the costs may be induced and collected at
different decision stages and policies can still be compared properly as long as costs are
discounted accordingly. While this approach works well, the state space increase can
cause a large increase in learning time and memory requirements.
2.1.2 Learning And Planning In Environments With Delayed
Feedback
The second approach tried to learn a model of the underlying non-delayed process and
used this model to base control actions on the future state after the delay, predicted
by the model [71]. In this work, the authors evaluated algorithms for environments with constant observation and reward delay. They covered three different approaches for dealing with the constant delay Markov decision process (CDMDP). First, the "wait agent", which waits for k steps until the current observation comes through, and then acts using the optimal action in the non-delayed MDP. Unfortunately, policies derived from this strategy will not, in general, provide satisfactory solutions to the CDMDP planning problem. Instead, the agent's resulting policy will likely be suboptimal, as it is essentially losing potential reward on every wait step. Their second solution was a memoryless policy, which treated the CDMDP as an MDP and used a memoryless policy for the non-delayed MDP. Their third solution was the traditional augmented approach, which involves explicitly constructing an MDP equivalent to the original CDMDP in a larger state space. They augmented each state with the k previous actions to form a new state representation. The authors formed a new transition probability and reward function, but such a solution adds the extra burden of acquiring the model of the system, while the added computational complexity may actually increase the delay itself.
2.1.3 Control Delay in Reinforcement Learning for Real-Time
Dynamic Systems
The third approach for dealing with delays in MDPs was introduced in [60] and is an improvement on the older approaches. The authors introduced two new memoryless
solutions and the most important one was an online algorithm named dSARSA(λ). Such
methods base the next control action only on the most recent observation. The downside
of memoryless approaches is that they are likely to learn a suboptimal policy because
they have no means of predicting the state in which the control action will take effect.
Furthermore, SARSA(λ) does not take the presence of the delay into account in its
learning updates. While their complexity remains comparable to that of SARSA(λ),
they exploited the knowledge about the length of the delay to improve their performance.
Then, the authors presented an extension to these algorithms which was applicable where
the delay length is not an integer multiple of the time step.
2.2 Recurrent Neural Networks In Reinforcement Learning
There have been various attempts to combine dynamic programming approaches with recurrent neural networks (RNNs) to tackle reinforcement learning problems. They formulate their solution as adaptive critic designs (ACDs) [52]. The ACDs have their roots in dynamic programming and are suitable for learning in noisy, nonlinear, and non-stationary environments. In their application, the ACDs first define what to approximate through the critic, and then how to adapt the actor in response to the information coming from the critic. Schmidhuber took one of the earliest approaches [59] in this direction. He
modeled the dynamics and control of a system with two separate networks and trained
them both in parallel or sequentially. However, the proposed fully recurrent networks struggled in their flexibility and adaptation ability. One of the major difficulties he faced in his work was the inability of Backpropagation through time and the ACDs to deal with the long time lag between an applied action and its execution. In another work, Bakker
combined the long short term memory [4, 3, 15] with the actor-critic method to learn par-
tially observable reinforcement learning problems. He utilized the strength of long short term memory in learning long-term temporal dependencies to infer states in partially observable tasks. He developed an integrated method, which learns the system's underlying
dynamics and the optimal policy at the same time. Compared to our method, which is a data-efficient batch learning approach designed to learn non-Markovian problems, Bakker's approach had problems in dealing with high dimensionality, partial observability, continuous state and action spaces, and a limited amount of training data. In a more recent approach, Schäfer [58] applied recurrent neural reinforcement learning approaches to identify
and control a high-dimensional dynamic system with continuous state and action spaces
in partially unknown environments like a gas turbine. He introduced a hybrid recurrent
neural network approach that combines system identification and determination of an
optimal policy in one network. Furthermore, in contrast to our reinforcement learning
methods, it determines the optimal policy directly without making use of a value func-
tion. His approach is model-based and by constructing the model of the system it is able
to learn high dimensional and partially observable reinforcement learning problems with
continuous state and action spaces in a data efficient manner.
As shown by Lin and Mitchell [38], recurrent neural networks are robust tools for function approximation in partially observable systems. In their work the authors defined
three different architectures, recurrent-model, recurrent-Q, and window-Q architectures.
The first one learns an action model for the history features. The second approach learns
a Q-function approximation using indirectly observable training examples. And the third
method learns the Q-value approximation by taking a known number of state-action pairs, the window size, into its memory. As they have shown, these architectures are all
capable of learning some non-Markovian problems, but they have their own advantages
and disadvantages. Later on, in the work by K. Bush [11], the author used echo state networks (ESNs) to address the problem of non-Markovian reinforcement learning and showed how they could successfully learn some standard benchmark problems. The au-
thor experimentally validated the positive performance and dynamic attributes of the
echo state networks on modeling a system in the non-Markovian reinforcement learning
domains. The major difference between this work and our approach is that we use a nonlinear readout layer to learn complex dynamic systems, whereas Bush introduced a memory consolidation method using a mixture of experts (MoE) and tested it for stationary
dynamic systems. This framework preserves the ability to train the readout layers via
linear regression. The author showed that echo state networks exhibit low-mobility learn-
ing through trajectory-based features. Later on, echo state networks were also successfully used and tested as a Q-function approximation in work done by Oubbati et al. [47]. Their
work sees the control task as a result of the interaction between brains, bodies, and
environments. There, the authors utilized reservoir computing as an efficient tool to
understand how the system behaviour emerges from an interaction. Inspired by imita-
tion learning designs, they presented reservoir computing models, to extract the essential
components of the system dynamics which are the result of the agent-environment in-
teraction. They validated their learning architectures by experimenting with a mobile
robot in a collision avoidance scenario. Compared to this work, we trained our policy offline with a nonlinear readout layer and applied discrete control actions, whereas they first did a short pre-training and then used an online linear approach to train the echo state networks, applying continuous actions to control an autonomous driving robot to follow a line.
CHAPTER 3
REINFORCEMENT LEARNING IN FEEDBACK CONTROL
3.1 Learning In Feedback Control
The classical feedback control loop basically describes a control mechanism that uses
information from measurements of a process to control it. In each time interval, the process communicates a system state, a vector of measured dynamic variables, to the controller, which applies the control commands. In feedback control, the control variables are measured and compared to the target values. The feedback control mechanism therefore manipulates the input variables of the system in a continuous loop to minimize the differences between the measured values and the desired targets. This allows for the formulation of a broad range of challenging control applications. The direct approach to obtaining a feedback controller is to design a model of the system. However, for a nonlinear control system the task of model identification becomes tedious and complicated. An alternative approach is to learn to control the feedback control system, which means we need an intelligent controller component that learns to control the subject dynamic process using experience gained from interaction.
Reinforcement learning (RL) is a type of machine learning method which tries to learn
an appropriate closed-loop controller by simply interacting with the process and incre-
mentally improving the control behaviour. The goal of reinforcement learning algorithms
is to maximize a numerical reward signal by discovering which control commands i.e.
actions yield the most reward. Using reinforcement learning algorithms, a controller can
be learned with only a small amount of prior knowledge of the process. Reinforcement
learning aims at learning control policies for a system in situations where the training
information is basically provided in terms of judging success or failure of the observed
system behaviour [62]. Because this is a very general scenario, a wide range of different
application areas can be addressed. Successful applications are known from such differ-
ent areas as game playing [9], dispatching and scheduling [6], robot control [49, 56], and
autonomic computing [64].
3.2 Markov Decision Process
The type of control problems we are trying to learn in this work are discrete-time control problems and can be formulated as a Markov decision process (MDP) [62]. An MDP has four components: a set S of states, a set A of actions, a stochastic transition probability function p(s, a, s') describing the system behaviour, and an immediate reward or cost function c : S × A → R. The state of the system at time t characterizes the current situation of the agent in the world and is denoted by s(t). The action chosen by the agent at time step t is denoted by a(t). The immediate reward or cost is the consequence of the taken action and is a function of state and action. Since the rewards for the taken actions can be formulated as costs, the goal of the control agent is to find an optimal policy π* : S → A that minimizes the cumulated cost over all states. Basically, in reinforcement learning we try to choose actions over time to minimize/maximize the expected value of the total cost/reward:

E[R(s_0) + R(s_1) + R(s_2) + ...]
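To make these components concrete, the following minimal Python sketch represents a toy MDP and estimates the discounted cumulated cost of one rollout; the states, actions, transition probabilities, and cost values are invented purely for illustration and are not part of the thesis.

import random

# Hypothetical toy MDP: states, actions, stochastic transitions, and a cost function.
STATES = ["far", "near", "goal"]
ACTIONS = ["push", "wait"]

# p(s, a, s'): probability of reaching s' when applying a in s (illustrative numbers).
P = {
    ("far", "push"): {"far": 0.2, "near": 0.8},
    ("far", "wait"): {"far": 1.0},
    ("near", "push"): {"near": 0.3, "goal": 0.7},
    ("near", "wait"): {"far": 0.5, "near": 0.5},
    ("goal", "push"): {"goal": 1.0},
    ("goal", "wait"): {"goal": 1.0},
}

def cost(s, a):
    """Immediate cost c : S x A -> R; zero only once the goal is reached."""
    return 0.0 if s == "goal" else 1.0

def step(s, a):
    """Sample a successor state s' according to p(s, a, .)."""
    succ = P[(s, a)]
    return random.choices(list(succ.keys()), weights=list(succ.values()))[0]

def rollout_cost(policy, s0, gamma=0.95, horizon=50):
    """Discounted cumulated cost c(s0,a0) + gamma*c(s1,a1) + ... for one rollout."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * cost(s, a)
        s = step(s, a)
        discount *= gamma
    return total

print(rollout_cost(lambda s: "push", "far"))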
3.3 Q-Learning
There are various reinforcement learning approaches that can be formulated based on the MDP [62], e.g. value iteration and policy iteration, where the transition model and the reward function of the control task are known. However, in many real-world problems the state transition probabilities and the reward functions are not given explicitly; only a set of states S and a set of actions A are known, and we have to learn the dynamic system behaviour by interacting with it. Methods of temporal differences were invented to
perform learning and optimization in exactly these circumstances. There are two principal
flavors of temporal difference methods. First, an actor-critic scheme [63, 62], which
parallels the policy iteration methods, and has been suggested as being implemented in
biological reinforcement learning. Second, a method called Q-learning [72], which parallels the value iteration methods. In this work, we use Q-learning as our principal reinforcement learning algorithm. The basic idea in Q-learning is to iteratively learn the value function, the Q-function, that maps state-action pairs to expected optimal path costs. The update rule of the Q-learning algorithm is given by:

Q_{k+1}(s, a) := (1 − α) Q_k(s, a) + α (r(s, a) + γ min_{a'} Q_k(s', a'))

where s denotes the system state where the transition starts, a is the action that is applied, and s' is the successor system state. The learning rate α has to be decreased in the course of learning in order to fulfill the conditions of stochastic approximation, and the discounting factor is denoted by γ [62]. It can be shown that under mild assumptions Q-learning converges for finite state and action spaces, as long as the Q-value for every state-action pair is updated infinitely often. Then, in the limit, the optimal Q-function is reached.
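As a minimal illustration of this update rule, the following Python function performs one tabular Q-learning update on a dictionary-based Q-table; the cost-minimizing form with the minimum over successor actions mirrors the equation above, and the example transition is hypothetical.

from collections import defaultdict

# Q-table: maps (state, action) pairs to estimated path costs.
Q = defaultdict(float)

def q_update(Q, s, a, cost, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step for a cost-minimizing agent:
    Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(cost + gamma * min_a' Q(s',a'))."""
    best_next = min(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (cost + gamma * best_next)

# Example transition: applying "push" in state "near" costs 1.0 and leads to "goal".
q_update(Q, "near", "push", 1.0, "goal", actions=["push", "wait"])
print(Q[("near", "push")])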
3.3.1 Batch Reinforcement Learning
As formulated above, the standard Q-learning protocol considers an agent operating in discrete time. At each time point t it observes the environment state s_t, takes an action a_t, and receives feedback from the environment, namely the next state s_{t+1} and the instantaneous reward r_t. The sole information that we assume available to learn the problem is the one obtained from the observation of a certain number of one-step system transitions (from t to t + 1). The agent interacts with the control system in the environment and gathers state transitions in a set of four-tuples (s_t, a_t, r_t, s_{t+1}). Except for very special conditions, it is not possible to exactly determine an optimal control policy from a finite set of transition samples. In the literature there is a method called batch mode reinforcement learning [17, 46, 36], which aims at computing an approximation of such an optimal policy π* from a set of four-tuples:

D = {(s_t^l, a_t^l, r_t^l, s_{t+1}^l), l = 1, ..., #D}
This set could be generated by gathering samples corresponding to one single trajec-
tory (or episode) as well as by considering several independently generated trajectories or
multi-step episodes. In the work by Lange et al. [36] the authors covered various batch-mode reinforcement learning algorithms. Among them, we chose the growing batch method for training our learning algorithm. Training algorithms with a growing batch has two major benefits. First, from the interaction perspective, it is very similar to the 'pure' online approach. Second, from the learning point of view, it is similar to an offline approach in that all the trajectory samples are used for training the algorithm. The
main idea in growing batch is to alternate between phases of exploration, where a set of
training examples is grown by interacting with the system, and phases of learning, where
the whole batch of observations is used. The distribution of the state transitions in the
provided batch must resemble the ’true’ transition probabilities of the system in order
to allow the derivation of good policies. In practice, exploration cultivates the quality of
learned policies by providing more variety in the distribution of the trajectory samples.
Furthermore, it is often necessary to have a rough idea of a good policy in order to explore
interesting regions that are not in the direct vicinity of the starting states. If 'important' states, i.e. states close to the goal, are not covered by any of the trajectory samples, then it is obviously not possible to learn a good policy from the batch data. This happens because the system would not know which series of actions lead to the goal area.
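The growing batch idea can be summarized in a short sketch; the environment interface (reset/step), the ε-greedy exploration, and the fit_q helper are assumptions chosen for illustration, not the implementation used in this thesis.

import random

def growing_batch(env, actions, fit_q, n_episodes=100, horizon=200, epsilon=0.2):
    """Alternate exploration and learning: grow a batch of (s, a, r, s') tuples
    and refit the Q-function on the whole batch after every episode."""
    batch, q = [], None
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(horizon):  # exploration phase
            if q is None or random.random() < epsilon:
                a = random.choice(actions)                  # explore
            else:
                a = min(actions, key=lambda a_: q(s, a_))   # greedy w.r.t. cost
            s_next, r, done = env.step(a)
            batch.append((s, a, r, s_next))
            s = s_next
            if done:
                break
        q = fit_q(batch)  # learning phase on the full batch (e.g. fitted-Q iteration)
    return q, batch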
3.3.2 Neural Fitted Q-Iteration (NFQ)
Model-free learning methods like Q-learning are appealing from a conceptual point of
view and have been very successful when applied to problems with small, discrete state
spaces. But when it comes to applying them to real-world systems with larger and possibly continuous state spaces, these algorithms face some limiting factors. For a relatively small or finite state and action space, the Q-function can be represented in tabular form and is straightforward to approximate. However, when dealing with continuous or very large discrete state and/or action spaces, the Q-function cannot be represented by a table with one entry for each state-action pair. In this respect, three problems can be identified:
• the ’exploration overhead’, causing slow learning in practice
• inefficiencies due to the stochastic approximation
• stability issues when using function approximation
A common factor in modern batch reinforcement learning algorithms is that these
algorithms typically address all three issues and offer specific solutions to each of them.
The Fitted-Q iteration algorithm [17] is an efficient batch mode reinforcement learning algorithm that can learn from a sufficiently rich set of generated trajectories. In this algorithm, the Q-function approximation is done on an infinite or finite horizon optimal control problem with discounted rewards. In each step, this algorithm uses the batch data together with the Q-values computed at the previous step to determine a new training set. Then it applies a regression method on the training data to compute the next Q-value of the sequence. Among several approaches to approximate the Q-function, neural networks are considered suitable tools [23, 67] because they provide a nonlinear mapping from input to output data. The neural Q-function needs an error function which measures the difference between the target Q-value and the estimated Q-value, for example a squared error measure like

error = (Q(s, a) − (c(s, a) + γ min_{a'} Q(s', a')))²
To train the neural Q-function and minimize the estimation error, common gradient descent learning rules like Backpropagation [20] can be applied. In general, such neural Q-learning algorithms are trained online, which requires thousands of samples and a long training time to learn a control task. Riedmiller [53] proposes an alternative approach, neural fitted-Q iteration (NFQ), which performs an offline update step considering the entire set of transitions. The standard NFQ benefits from the growing batch method to collect the trajectory samples for training a multilayer perceptron Q-function. It has been shown [54] that a 2-layer neural network has sufficient approximation capacity to generalize well for closed-loop control. The pseudocode for the NFQ algorithm is shown in Algorithm 1:
Algorithm 1 Main loop of NFQ. k counts the number of iterations, k_max denotes the maximum number of iterations. init_MLP() returns a multilayer perceptron with randomly initialized weights. Backpropagation_training(P) takes a pattern set P and returns a multilayer perceptron that has been trained on P using backpropagation as a supervised training method.

procedure NFQ_Main()
  input: a set of transition samples D; output: Q-value function Q_N;
  k = 0;
  init_MLP() → Q_0;
  while k < k_max do
    generate pattern set P = {(input^l, target^l), l = 1, ..., #D} where:
      input^l = (s^l, a^l)
      target^l = c(s^l, a^l) + γ min_{a'} Q_k(s'^l, a')
    Backpropagation_training(P) → Q_{k+1}
    k := k + 1
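The following Python sketch mirrors the main loop of Algorithm 1; mlp_init and mlp_train stand in for the multilayer perceptron initialization and the supervised Backpropagation (or Rprop) training routine and are hypothetical helpers, not the thesis implementation.

import numpy as np

def nfq_main(transitions, actions, mlp_init, mlp_train, gamma=0.95, k_max=50):
    """NFQ main loop: repeatedly build a pattern set from the full batch of
    transitions (s, a, c, s_next) and retrain the Q-network on it (cf. Algorithm 1).
    States and actions are assumed to be NumPy arrays."""
    q = mlp_init()  # randomly initialized MLP, q.predict(state, action) -> Q-value
    for _ in range(k_max):
        inputs, targets = [], []
        for s, a, c, s_next in transitions:
            # target = immediate cost + discounted minimal Q-value of the successor state
            best_next = min(q.predict(s_next, a_next) for a_next in actions)
            inputs.append(np.concatenate([s, a]))
            targets.append(c + gamma * best_next)
        q = mlp_train(np.array(inputs), np.array(targets))  # supervised fit
    return q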
3.3.3 Resilient Propagation (Rprop)
The original NFQ uses the Backpropagation [20] algorithm combined with an optimization method, i.e. gradient descent, for training its multilayer perceptron neural network. As the equation below shows, the gradient of the loss function is computed with respect to all the weights in the network, and the weight values are then updated with respect to the computed gradient:

w_ij(t + 1) = w_ij(t) − ε ∂E/∂w_ij(t)

The partial derivative of the error function E with respect to the neural network weights is computed based on the chain rule. The learning rate parameter ε affects the convergence of the algorithm. If it is too small, the system will take very many steps to converge, and with a large learning rate, the system will oscillate around local minima and fail to fall into the desired value range. Regular Backpropagation is a slow and
inefficient process. In order to accelerate this supervised learning process, it is possible to
use advanced techniques like Rprop [55], which has a more reliable and faster convergence
than the regular gradient descent method. The pseudocode in Algorithm 2 shows the core part of the adaptation and learning rule for Rprop. Using Rprop to train the NFQ algorithm also reduces its overall complexity by removing the necessity of parameter tuning for the neural Q-function. Rprop stands for Resilient Propagation, and it is an efficient learning scheme that in every iteration updates the network's weights based on changes in the sign of the partial derivatives of the error function E. Here each network weight is updated individually by an update value ∆_ij based on its local gradient information. So if the last weight update was so large that the algorithm jumped over a local minimum, then it needs to change its direction accordingly. In this adaptive method, the new value for ∆_ij is computed by multiplying it with η− or η+ for a negative or positive sign change of the partial derivative. The values 0 < η− < 1 < η+ are parameters that can be set to some constant. Using the update value ∆_ij, the Rprop
algorithm increases or decreases individual weight values given the sign of the derivative.
Algorithm 2 The core part of the Rprop algorithm. The minimum (maximum) operator is supposed to deliver the minimum (maximum) of two numbers; the sign operator returns +1 if the argument is positive, -1 if the argument is negative, and 0 otherwise. ∆_ij determines the size of the individual update value for each weight.

For all weights and biases {
  if (∂E/∂w_ij(t − 1) ∗ ∂E/∂w_ij(t) > 0) then
    ∆_ij(t) = minimum(∆_ij(t − 1) ∗ η+, ∆_max)
    ∆w_ij(t) = −sign(∂E/∂w_ij(t)) ∗ ∆_ij(t)
    w_ij(t + 1) = w_ij(t) + ∆w_ij(t)
  else if (∂E/∂w_ij(t − 1) ∗ ∂E/∂w_ij(t) < 0) then
    ∆_ij(t) = maximum(∆_ij(t − 1) ∗ η−, ∆_min)
    w_ij(t + 1) = w_ij(t) − ∆w_ij(t − 1)
    ∂E/∂w_ij(t) = 0
  else if (∂E/∂w_ij(t − 1) ∗ ∂E/∂w_ij(t) = 0) then
    ∆w_ij(t) = −sign(∂E/∂w_ij(t)) ∗ ∆_ij(t)
    w_ij(t + 1) = w_ij(t) + ∆w_ij(t)
}
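As a runnable counterpart to Algorithm 2, the following NumPy sketch performs one vectorized Rprop update with weight backtracking; the array-based interface and the default constants for η−, η+, ∆_min, and ∆_max are assumptions chosen for the example.

import numpy as np

def rprop_step(w, grad, grad_prev, delta, dw_prev, eta_minus=0.5, eta_plus=1.2,
               delta_min=1e-6, delta_max=50.0):
    """One Rprop update for a weight array w given the current and previous gradients.
    delta and dw_prev carry the per-weight step sizes and last weight changes."""
    sign_change = grad * grad_prev
    grad = grad.copy()
    dw = np.zeros_like(w)

    # Same sign: grow the step size and step against the gradient sign.
    grow = sign_change > 0
    delta[grow] = np.minimum(delta[grow] * eta_plus, delta_max)
    dw[grow] = -np.sign(grad[grow]) * delta[grow]

    # Sign flip: we jumped over a minimum, shrink the step and revert the last change.
    shrink = sign_change < 0
    delta[shrink] = np.maximum(delta[shrink] * eta_minus, delta_min)
    dw[shrink] = -dw_prev[shrink]
    grad[shrink] = 0.0  # forces the "zero" branch on the next iteration

    # Zero product (first step or right after a sign flip): step with the current sign.
    zero = sign_change == 0
    dw[zero] = -np.sign(grad[zero]) * delta[zero]

    return w + dw, grad, delta, dw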
CHAPTER 4
ECHO STATE NETWORKS FOR FUNCTION
APPROXIMATION
4.1 Recurrent Neural Networks
Our goal in this work is to learn a feedback control system with delays without knowing
its dynamic model. As we described in the previous chapter, we chose the Neural Fitted-Q Iteration (NFQ) algorithm, a specific Q-learning algorithm, to learn our model-free task. At its core, the NFQ algorithm uses multilayer perceptrons as a function approximator to capture the system dynamics and estimate the proper Q-value for each action. In a control task with delays the Markov property is violated, and the ordinary multilayer perceptron will fail to approximate the desired targets. Therefore, we need a Q-function that is capable of holding the history of the system states in its memory for robust approximation and preserving the Markov property. One possible option is to use recurrent neural networks (RNNs) for the function approximation.
Recurrent neural networks (RNNs) have a structure similar to biological neural net-
works and they are able to hold the history information of the dynamic system. A
recurrent neural network contains (at least) one cyclic path of synaptic connections. This
feature makes RNNs an excellent tool for function approximation in delayed control sys-
tems. Mathematically, recurrent neural networks implement and approximate dynamic systems, and they are applied to a variety of tasks, for example system identification and inverse system identification, pattern classification, and stochastic sequence modeling [58, 47, 30]. However, training RNNs is not an easy, straightforward task. Generally speaking, for training RNNs an extended version of the Backpropagation algorithm, Backpropagation through time, has been used [73, 74], but only with partial success.
One of the conceptual limitations of the Backpropagation methods for the recurrent neu-
ral networks is that bifurcations can make the training non-converging [14]. Even when
they do converge, this convergence is slow, computationally expensive, and can lead to
poor local minima.
In a more recent attempt to train recurrent neural networks, the reservoir computing
approach [41] was introduced. These networks are in fact dynamic systems driven by the
input signal, or from another point of view they are nonlinear filters of the input signal.
The idea of reservoir computing was discovered and investigated independently under the name of echo state networks (ESNs) [28] in machine learning, and liquid state machines (LSMs) [42] in computational neuroscience. The work on the liquid state
machines is rooted in the biological setting of continuous time, spiking networks while
the ideas of the echo state networks were first conceived in the framework of discrete-
time, non-spiking networks in engineering applications. It was shown that the reservoir
computing approaches often work well enough even without full adaptation of all the
network weights. The reservoir computing methods usually use a supervised learning
scheme for training. In this work, we employ echo state networks as a Q-function for
their simplicity in training and capability in approximating the targets in the presence of
delay in dynamic systems [29]. Perhaps surprisingly, this approach yielded an excellent
performance in many benchmark tasks, e.g. [27, 32, 33, 68].
4.2 A Brief Overview On Echo State Networks
As we mentioned in the previous section, we employed discrete time echo state networks
for the function approximation. The echo state network has M input units, N hidden units (reservoir neurons), and L output units. The data at each time point t are fed to the network through the input units in the form of a vector U(t) = [u_1(t), u_2(t), ..., u_M(t)]^T. The internal values of the reservoir are denoted by X(t) = [x_1(t), x_2(t), ..., x_N(t)]^T, and the output of the network at time point t is denoted by Y(t) = [y_1(t), y_2(t), ..., y_L(t)]^T. Figure 4.1 depicts the three-layered topology of the echo state network: the input layer, the hidden units or reservoir, and the readout layer. There are four types of
weight matrices which connect different echo state network layers together. Three of
these weight matrices are initialized once and will stay constant during the training
time. These constant weights are described in the following. The input weight matrix W^in, with N × M dimensions, connects the input data to the hidden units. The reservoir weight matrix W, with N × N dimensions, connects the hidden units to each other and thus defines the recurrent connections. The feedback weight matrix W^fb, with N × L dimensions, connects the outputs back to the reservoir. The initial values of these constant weight matrices have to be chosen carefully to increase the performance of the echo state network. There are various initialization and hyper-parameter optimization techniques that can be used to initialize the W^fb, W^in, and W matrices in a more sophisticated way according to the dynamic task. We will cover some of the important methods along this chapter to initialize and optimize these weight values. In echo state networks only the output weights W^out, with L × (M + N) dimensions, are trained. They map the reservoir states to the estimated targets. Later in this chapter, we will describe various methods for training the weights W^out. The update rule for changing the state of the reservoir is:
X̃(t + 1) = f(W^in U(t + 1) + W X(t) + W^fb Y(t))
X(t + 1) = (1 − α) X(t) + α X̃(t + 1)
Here α is the leaky integrator factor of each hidden unit. As shown in the equations above, echo state networks (ESNs) are composed of simple additive units with a nonlinear activation function f. A reservoir with leaky integrator units has individual state dynamics, which can be exploited in various ways to accommodate the network to store temporal characteristics of a dynamic system. The leaking rate α is the amount of excitation (signal) that a neuron discards; basically it implements the concept of leakage. This has the effect of smoothing the network dynamics, yielding an increased modelling capacity of the network, for example in dealing with a low frequency sine wave. The activation function f = (f_1, f_2, ..., f_N) for each hidden unit in the reservoir is usually a sigmoid function, e.g. the logistic activation f(x) = 1/(1 + e^(−x)) or the hyperbolic tangent f(x) = (e^x − e^(−x))/(e^x + e^(−x)). It can also be a rectified linear unit:

f(x) = x if x > 0, and 0 otherwise
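The reservoir update above can be written compactly in NumPy. The following sketch is illustrative only: the tanh activation, the presence of the feedback term, and the random initialization are assumptions consistent with this section, not the exact configuration used later in the experiments.

import numpy as np

def esn_update(x, u_next, y, W_in, W, W_fb, alpha=0.3, f=np.tanh):
    """One leaky-integrator reservoir update:
    x_tilde = f(W_in @ u_next + W @ x + W_fb @ y)
    x_new   = (1 - alpha) * x + alpha * x_tilde
    Shapes: x (N,), u_next (M,), y (L,), W_in (N, M), W (N, N), W_fb (N, L)."""
    x_tilde = f(W_in @ u_next + W @ x + W_fb @ y)
    return (1.0 - alpha) * x + alpha * x_tilde

# Example with random matrices (purely illustrative sizes).
rng = np.random.default_rng(0)
M, N, L = 3, 50, 1
W_in = rng.uniform(-0.5, 0.5, (N, M))
W = rng.uniform(-0.5, 0.5, (N, N))
W_fb = rng.uniform(-0.5, 0.5, (N, L))
x, y = np.zeros(N), np.zeros(L)
x = esn_update(x, rng.standard_normal(M), y, W_in, W, W_fb)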
Figure 4.1: The topological illustration of the echo state network. Its structure consists of three major parts: first, an input layer containing the input weight matrix W^in; second, the middle layer containing the reservoir state matrix and its weight matrix W; and third, the output layer, which contains the readout weight matrix W^out and the readout value Y^target.
The initial state of the reservoir is usually set to random values or, in most cases, to X(0) = 0, and this introduces an unnatural starting state which is not normally visited once the network has "warmed up" to the task. Therefore, when we input long sequences of trajectory samples to the reservoir, we exclude a few computed states of each trajectory from the reservoir state. The number of discarded states depends on the memory of the reservoir and usually is an integer number equal to or less than 10% of the input data. The output values of the echo state network are computed according to:

Y(t + 1) = f^out(W^out [U(t + 1); X(t + 1)])

Here [·;·] denotes concatenation and f^out = (f^out_1, f^out_2, ..., f^out_L) is usually a linear function. Depending on the problem, it is also possible to concatenate the input data to the computed reservoir state at each time point, using direct shortcut connections, and then compute the ESN outputs. Shortcut connections are useful when the echo state networks are trained in generative mode, where they try to output signal values similar to the input data. But for the purpose of this work we will not use them.
4.2.1 Learning In Echo State Networks
In this work we want to use echo state networks for the function approximation in the Q-learning algorithm because they can hold the history of the dynamic system in their state. Therefore, the ESN output Y(n) ∈ R^L will be the Q-value of the function approximation, and we try to minimize the error measure E(Y, Y^target) between the estimated output Y(n) and the target output Y^target(n). More importantly, we need to make sure that our approximation generalizes well to unseen data. The error measure E can typically be a mean square error (MSE), for example a root mean square error (RMSE):

E(Y, Y^target) = (1/L) Σ_{i=1}^{L} sqrt( (1/T) Σ_{n=1}^{T} (y_i(n) − y_i^target(n))² )
which is here averaged over the L dimensions of the output data. The RMSE can be dimension-wise normalized (divided) by the variance of the target data Y^target(n), producing a normalized root mean square error (NRMSE). The NRMSE has an absolute interpretation: it does not depend on the arbitrary scaling of the target data Y^target(n), and the value of 1 can be achieved with a simple constant output Y(n) set to the mean value of Y^target(n). This suggests that a reasonable model of a stationary process should achieve an NRMSE accuracy between zero and one. As we mentioned earlier, the only adaptable weight matrix in echo state networks is the output weight matrix W^out, and it is possible to train these weights with different algorithms. There are different methods to train the linear readout weights W^out [48, 28, 30, 66, 31]; using the least squares method is a standard offline procedure, formulated in the following steps (a code sketch is given after the list):
• Generate the reservoir with a specified set of parameters.
• Feed the input data U(n) and compute and collect the reservoir states X(n).
• Compute W^out from the reservoir states using linear regression, minimizing the mean square error between Y(n) and Y^target(n).
• Compute Y(n) again from the trained network by employing W^out and feeding the input data U(n).
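A minimal NumPy sketch of this offline procedure, assuming a tanh reservoir without feedback connections and a plain least-squares solve; the washout length and the array shapes are illustrative, not the values used in the experiments.

import numpy as np

def train_esn_readout(U, Y_target, W_in, W, alpha=0.3, washout=10):
    """Collect reservoir states for an input sequence U (T x M) and fit the readout
    W_out (L x (M+N)) by least squares against Y_target (T x L)."""
    T, M = U.shape
    N = W.shape[0]
    x = np.zeros(N)
    states = []
    for t in range(T):
        x_tilde = np.tanh(W_in @ U[t] + W @ x)
        x = (1.0 - alpha) * x + alpha * x_tilde
        states.append(np.concatenate([U[t], x]))          # extended state [U(n); X(n)]
    Z = np.array(states)[washout:]                        # discard the burn-in states
    Yt = Y_target[washout:]
    W_out, *_ = np.linalg.lstsq(Z, Yt, rcond=None)        # minimize ||Z W_out - Y_target||
    return W_out.T                                        # shape (L, M+N)

def esn_predict(U, W_in, W, W_out, alpha=0.3):
    """Recompute the outputs Y(n) = W_out [U(n); X(n)] with the trained readout."""
    x = np.zeros(W.shape[0])
    outputs = []
    for u in U:
        x_tilde = np.tanh(W_in @ u + W @ x)
        x = (1.0 - alpha) * x + alpha * x_tilde
        outputs.append(W_out @ np.concatenate([u, x]))
    return np.array(outputs)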
4.2.2 Spectral Radius and Echo State Property
Echo state networks have two major functionalities that make them feasible tools for function approximation in the presence of delays: first, acting as a nonlinear filter in order to expand the input signal U(t) into a higher dimensional space X(t); second, acting as a memory to gather the temporal features of the input signal, an ability that in the literature is called the short-term memory capacity of the echo state network. The combination of these two capabilities enriches the computed reservoir state matrix X, which contains the history information about the dynamic behaviour of a control system. However, the computed state of the reservoir is influenced by various parameters that have to be set judiciously. Those parameters or hyper-parameters are: the size of the reservoir N, the sparsity of the weight matrices, the distribution of the nonzero elements of the weight matrices, the spectral radius of the reservoir weight matrix ρ(W), the scaling of the input weight matrix W^in, and the leaking rate α. M. Lukoševičius [40] gives an informative review of how to set these parameters and explains their effects on the system performance. In
this work we mainly focus on the memory aspect of the echo state networks and try to
choose these parameters according to the memory size needed for the delayed control task.
According to Herbert Jaeger [28], a network has the echo state property if the network state X(t) is uniquely determined by any left-infinite input sequence U^{−∞}. After the burn-in phase, i.e. after removing the initial transient states, the reservoir becomes synchronised to the input signal U(t) and the current reservoir state X(t), and exhibits an involved nonlinear response to the new input signal. This phenomenon can be explained by the echo state property: the network will forget its random initial state after the burn-in phase and the reservoir state becomes a function of the input signal. This is an essential property for successful training of a reservoir. Yildiz et al. [75] defined a simple recipe to preserve the echo state property. That procedure is explained in the following steps (see the sketch after the list):
• Generate a weight matrix W_0 from a uniform or Gaussian distribution.
• Normalize W_0 with its spectral radius λ_max: W_1 = (1/λ_max) W_0.
• Scale W_1 with a factor 0 < α < 1 and use it as the reservoir matrix W = αW_1.
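A short NumPy sketch of this recipe, assuming a sparse uniform initialization; the sparsity and the chosen scaling factor are example values only.

import numpy as np

def make_reservoir(n, spectral_radius=0.9, sparsity=0.1, seed=0):
    """Build a reservoir weight matrix W with a prescribed spectral radius:
    draw W0, normalize by its largest absolute eigenvalue, then rescale."""
    rng = np.random.default_rng(seed)
    W0 = rng.uniform(-1.0, 1.0, (n, n))
    W0 *= rng.random((n, n)) < sparsity           # keep only a sparse fraction of weights
    lambda_max = np.max(np.abs(np.linalg.eigvals(W0)))
    W1 = W0 / lambda_max                           # spectral radius 1
    return spectral_radius * W1                    # spectral radius < 1 preserves the ESP

W = make_reservoir(100, spectral_radius=0.8)
print(np.max(np.abs(np.linalg.eigvals(W))))        # approximately 0.8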
The spectral radius of the reservoir weight matrix W is one of the most important
global parameters of the ESN and it has to be chosen according to, in our case, the needs
of the dynamic control task. Usually the spectral radius is related to the input signal, in the sense that if shorter time-scale dynamics are expected (fast oscillating signals) then a lower spectral radius might be sufficient. However, if longer memory is necessary then
a higher spectral radius will be required. The downside of a larger spectral radius is
the longer time for the settling down of the network oscillations. From an experimental
point of view it means having a smaller region of optimality when searching for a good
echo state network with respect to some dataset. Basically, if the spectral radius value
is larger than one, it can lead to reservoirs hosting multiple fixed points that violate the
echo state property.
4.3 Echo State Fitted-Q Iteration
As we described earlier, an important aspect of echo state networks is their ability to
serve as a memory to capture and preserve the temporal behaviours of their input sig-
nal. For that reason, we substituted the multilayer perceptron Q-function in the NFQ
algorithm with echo state networks and introduced the Echo State Fitted-Q Iteration
(ESFQ) algorithm. Compared to Algorithm 1, we changed the MLP Q-function to the
ESN Q-function, and the Backpropagation training is replaced by a particular training method for echo state networks. The advantage of the ESFQ algorithm is that the cumulative history of the state-action pairs in the reservoir helps the Q-function to deal with the unknown delays in the control task. In order to train the echo state Q-function for a dynamic control task, first the weight matrices need to be initialized. Then the state-action pairs accumulated in the growing batch are fed to the reservoir at the end of each episode. The reservoir dynamics at each point in time are represented in the activation values of its hidden units. These activation values form the reservoir state matrix, or reservoir design matrix, whose rows correspond to the system state-action pairs at each cycle of each trajectory sample and whose columns correspond to the computed features. In the burn-in phase, to remove the early noisy and unknown dynamics of the reservoir for each trajectory sample, we remove a few of the beginning states of the reservoir matrix. Then we compute the readout weights W^out using the teacher values. These teacher values for every state-action sample in a trajectory are generated from the sum of the immediate cost and the discounted Q-value of the successor state. Finally, having the new readout weights W^out, we compute the estimated targets.
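The description above can be condensed into a sketch of one ESFQ iteration. This is an illustration only: the trajectory format, the discrete action encoding, the washout length, and the plain least-squares fit of the linear readout are assumptions, not the exact thesis implementation.

import numpy as np

def reservoir_step(x, s, a, W_in, W, alpha=0.3):
    """Advance the reservoir with the state-action input u = [s; a]."""
    u = np.concatenate([s, a])
    x_new = (1.0 - alpha) * x + alpha * np.tanh(W_in @ u + W @ x)
    return x_new, np.concatenate([u, x_new])       # extended state [U; X]

def esfq_iteration(episodes, actions, W_in, W, W_out, gamma=0.95, washout=5):
    """One ESFQ iteration: replay each trajectory through the reservoir, build
    NFQ-style targets with the current linear readout W_out, and refit it.
    episodes: list of trajectories, each a list of (s, a, c, s_next) tuples."""
    feats, targets = [], []
    for traj in episodes:
        x = np.zeros(W.shape[0])
        for t, (s, a, c, s_next) in enumerate(traj):
            x, z = reservoir_step(x, s, a, W_in, W)
            # Q-value of the successor: try every candidate action from the current
            # reservoir state and take the cheapest one.
            q_next = min(float(W_out @ reservoir_step(x, s_next, a_next, W_in, W)[1])
                         for a_next in actions)
            if t >= washout:                       # drop burn-in states of the trajectory
                feats.append(z)
                targets.append(c + gamma * q_next)
    sol, *_ = np.linalg.lstsq(np.array(feats), np.array(targets), rcond=None)
    return sol                                     # new linear readout weights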
4.4 Types Of Readout Layers For Echo State Q-Function
In the Echo State Fitted-Q Iteration algorithm, which is an offline policy learning algo-
rithm, the Q-Function plays the role of a supervised regression method. Compared to the
original Neural Fitted-Q Iteration algorithm described in section 3.3.2, we use an echo
state Q-function with linear least squares on the readout layer. Although linear least squares makes the learning fast, it sometimes struggles in learning more complex nonlinear problems or higher input dimensions. However, because of the echo state network topology and the fixed values of the input and reservoir weight matrices, it is possible to train the echo state Q-function with different types of readout layers. In the following, we describe two methods for training the output layer of the echo state Q-function that help it to estimate more accurate targets for the purpose of reinforcement learning. In this regard, we investigated the possibility of using ridge regression and multilayer perceptrons.
4.4.1 Ridge Regression
In the standard formulation, the reservoir state matrix, or design matrix, X is overdetermined,
because the number of trajectory samples T is much larger than the number of hidden
units N (N ≪ T). Therefore, the solution W^out = X^{-1} Y^target does not exist exactly and
needs to be approximated. To compute a stable solution we apply ridge regression, also
known as regression with Tikhonov regularization:

W^out = (X^T X + βI)^{-1} X^T Y^target
The regularization parameter β prevents the output weights from growing arbitrarily
large, which would cause overfitting and instability in the system. This approach penalizes
the squared length of the weight vector and turns the error minimization task into a convex
optimization problem that can be solved in analytical closed form. The optimal value of
β can vary by several orders of magnitude, depending on the exact instance of the reservoir
and the length of the training data. When choosing the regularization parameter β by
simple exhaustive search, it is therefore recommended to search on a logarithmic grid.
Furthermore, β can be found using cross-validation or Gaussian processes. It is preferable
not to set β to zero, in order to avoid numerical instability when computing the matrix
inverse (X^T X)^{-1}. An interpretation of the linear readout gives an alternative criterion
for setting β directly [12]. Because determining the readout weights is a linear task,
standard recursive algorithms from adaptive filtering for minimizing the mean squared
error can be used for online readout weight estimation [30]. The recursive least squares
algorithm is widely used in signal processing when fast convergence is of prime priority.
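A minimal sketch of this closed-form readout computation, assuming the design matrix X and the teacher values Y^target have already been assembled as described in section 4.3, is given below; solving the linear system is used instead of forming the explicit inverse, which is numerically preferable but otherwise equivalent.

```python
import numpy as np

def train_readout(X, y_target, beta=1e-2):
    """Ridge regression: W_out = (X^T X + beta I)^{-1} X^T Y_target."""
    n_features = X.shape[1]
    A = X.T @ X + beta * np.eye(n_features)
    b = X.T @ y_target
    # np.linalg.solve is more stable than computing the matrix inverse explicitly
    return np.linalg.solve(A, b)

def predict(X, w_out):
    return X @ w_out
```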
4.4.2 Multilayer Perceptron Readout Layer
The first and major reason we employed echo state networks as a Q-function approximation
was their ability to hold the history of changes in the dynamic system behaviour.
We need this history to deal with delays in the executed actions of the control problem.
However, the Q-values estimated by a linear readout layer of the echo state Q-function
might not suffice for complex nonlinear dynamics. The linear least squares readout layer
returns a global minimum, which means the estimated targets have the minimum possible
mean squared error with respect to the original targets. But a small error value does not
always mean we have a successful policy for reinforcement learning. Also, it has been
reported [29] that a larger reservoir, even with high randomness in its weight matrices,
combined with a linear readout layer might not be able to learn nonlinear dynamic systems.
Therefore, we need to utilize more advanced readout layers to learn policies for nonlinear
dynamic systems. To do so, we place a multilayer perceptron on the readout layer of the
echo state Q-function and train it via the Backpropagation algorithm. The novelty of this
approach is that the reservoir expands the system states into a higher-dimensional feature
space and also preserves the history of the states. The Backpropagation algorithm then
trains the multilayer perceptron to learn the value function, which estimates the proper
Q-values for controlling the delayed dynamic system. Figure 4.2 illustrates a topological
example of the Echo State Fitted-Q Iteration algorithm.
Figure 4.2: Topological illustration of the Echo State Fitted-Q Iteration algorithm with a
multilayer perceptron readout layer. It consists of two major parts: first, the memory part,
containing the echo state input weight and reservoir weight matrices and their corresponding
neurons; second, the readout layer, containing one input layer, two hidden layers of a
feedforward neural network, and one output layer. Notice that the applied action goes only
through the readout layer, not through the reservoir itself.
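A sketch of the forward pass through this topology, assuming already-trained readout weights, is given below; the state is first expanded by the fixed reservoir, while the action is appended only at the readout layer, matching Figure 4.2. The logistic activation matches the experimental setup used later in chapter 5; the function names are our own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def q_value(reservoir_state, action, mlp_weights):
    """mlp_weights: list of (W, b) tuples for the readout layers (assumed trained)."""
    # the applied action bypasses the reservoir and enters only the readout layer
    h = np.concatenate([reservoir_state, np.atleast_1d(action)])
    for W, b in mlp_weights[:-1]:
        h = sigmoid(W @ h + b)          # hidden layers with logistic activation
    W_last, b_last = mlp_weights[-1]
    return float(W_last @ h + b_last)   # linear output unit for the Q-value
```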
When we compare the Echo State Fitted-Q Iteration (ESFQ) algorithm to the NFQ
algorithm, it can be seen that our method is a specific version of the NFQ algorithm. As
shown in more detail in Algorithm 3, after collecting trajectory samples we first feed them
into a reservoir with pre-initialized weight matrices. The reservoir collects the expanded
trajectory samples in its state matrix and then, jointly with their corresponding actions,
passes them to a multilayer perceptron, i.e. the Q-function in Algorithm 3. The Rprop
algorithm trains the multilayer perceptron and tries to minimize the estimation error with
respect to the target values. This novel structure is computationally efficient as long as
the reservoir has a fairly small size. The immediate cost/reward for each action is based
on the system state, not the reservoir state.
Algorithm 3 The main loop of Echo State Fitted-Q Iteration. k counts the number of iterations,
k_max denotes the maximum number of iterations. init_MLP() returns a multilayer perceptron
with randomly initialized weights. Rprop_training(P) takes pattern set P and returns a multilayer
perceptron that has been trained on P using Rprop as the supervised training method.

procedure ESFQ_Main()
    input: a set of transition samples D; output: Q-value function Q_N
    k = 0
    init_MLP() → Q_0
    init_ESN()
    while k < k_max do
        generate pattern set P = {(input^l, target^l), l = 1, ..., #D} where:
            input^l = (ESN(s^l), a^l)
            target^l = c(s^l, a^l) + γ min_a Q_k(input'^l, a)
        Rprop_training(P) → Q_(k+1)
        k := k + 1

Here input'^l denotes the reservoir-expanded successor state, in line with the teacher values described in section 4.3.
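For concreteness, the sketch below restates Algorithm 3 in Python-like code. The helpers run_reservoir, init_mlp, and rprop_training are hypothetical placeholders with the obvious semantics, and for brevity the reservoir is applied sample by sample here, whereas in practice its state is propagated along whole trajectories as described in section 4.3.

```python
import numpy as np

def esfq(D, actions, run_reservoir, init_mlp, rprop_training,
         gamma=0.98, k_max=100):
    """D: list of (s, a, c, s_next) transition samples; actions: discrete action set."""
    Q = init_mlp()                          # readout with randomly initialized weights
    for k in range(k_max):
        inputs, targets = [], []
        for (s, a, c, s_next) in D:
            x, x_next = run_reservoir(s), run_reservoir(s_next)
            inputs.append(np.concatenate([x, np.atleast_1d(a)]))
            # target: immediate cost plus discounted minimum Q over successor actions
            q_next = min(Q(np.concatenate([x_next, np.atleast_1d(b)]))
                         for b in actions)
            targets.append(c + gamma * q_next)
        Q = rprop_training(np.array(inputs), np.array(targets), Q)
    return Q
```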
4.5 SMAC For Hyper Parameter Optimization
There are a number of parameters that need to be carefully selected in order to boost
the performance of the Echo State Fitted-Q Iteration algorithm. These are parameters
related to the short-term memory capacity of the echo state Q-function. In the following,
we list those parameters whose effects need to be studied. First is the spectral radius of
the reservoir, which directly affects its short-term memory capacity. Second, the leaking
rate, which determines how fast the neurons in the hidden layer shed their dynamics.
Third, the reservoir size, i.e. the number of hidden units, usually chosen considerably
larger than the dimension of the input signal. Fourth, the sparsity percentage of the
input and reservoir weight matrices, which affects the computation time and, to some
degree, the performance. Fifth, the reservoir activation function, which is usually a
logistic, hyperbolic-tangent, or rectifier unit. As we can see, the echo state Q-function is
a black-box function with various parameters. Most of these parameters take continuous
real values, and exhaustively searching the combinatorial space of parameter settings to
tune them is inefficient and leads to unsatisfactory outcomes.
Recently, automated approaches for solving this algorithm configuration problem have
led to substantial improvements in the state of the art for various problems. Sequential
Bayesian optimization allows for global optimization of such black-box functions using as
few trials as possible. Mockus et al. [44] applied Bayesian methods to cases with linear
and nonlinear constraints and to multi-objective optimization. They studied interactive
procedures and the reduction of multidimensional data in connection with global
optimization using Bayesian methods. Various implementations of this sequential
optimization match or surpass manually tuned performance for tasks such as the
satisfiability problem [25] or object classification [7]. In this work we employed the SMAC
hyper-parameter optimization software [25], which uses sequential Bayesian optimization
methods. The extended version of the SMAC software [24] makes it possible to optimize
various types of parameters, e.g. categorical parameters and sets of instances.
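SMAC defines its search space through its own configuration interface; as a stand-in, the sketch below only illustrates the shape of the problem, a black-box loss evaluated over mixed continuous and categorical ESN parameters, and replaces the sequential Bayesian optimization with a plain random search for illustration. The parameter names and ranges here are illustrative, and evaluate_loss is a hypothetical callback that trains and tests ESFQ for one configuration.

```python
import random

PARAM_SPACE = {
    "spectral_radius":    lambda: random.uniform(0.0, 1.0),
    "leaking_rate":       lambda: random.uniform(0.0, 1.0),
    "reservoir_size":     lambda: random.choice([10, 20, 30, 40, 50, 100, 200]),
    "input_sparsity":     lambda: random.uniform(0.2, 1.0),
    "reservoir_sparsity": lambda: random.uniform(0.2, 1.0),
    "activation":         lambda: random.choice(["logistic", "tanh", "rectifier"]),
}

def tune(evaluate_loss, n_trials=80):
    """evaluate_loss(config) trains ESFQ and returns a loss (lower is better)."""
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: draw() for name, draw in PARAM_SPACE.items()}
        loss = evaluate_loss(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss
```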
CHAPTER 5
EXPERIMENTS AND RESULTS
5.1 Experimental Setup
In this section, we present the results of our experiments, which evaluate the performance
of the Echo State Fitted-Q Iteration (ESFQ) algorithm on learning delayed control tasks.
The presence of a delay, in particular an action delay, violates the Markov property of a
control system. The main goal of this work was to show that echo state networks with a
nonlinear readout layer are a practical memory-based function approximator in the
non-Markovian reinforcement learning domain. Therefore, we compare the performance
of our method to tapped delay-line memory-based algorithms as a baseline and discuss
their respective advantages and drawbacks. We chose two standard simulated benchmarks,
mountain car and inverted pendulum, for evaluating our model-free learning algorithms.
We tested our algorithms on plants containing delays in their applied actions with lengths
in the range of 1 to 30. We show that the solution we propose with the Echo State
Fitted-Q Iteration algorithm can be a very stable and reliable learning tool for systems
with unknown delay lengths and complex dynamics.
In the rest of this chapter, we introduce our selected benchmarks and their parameter
settings for each experiment. After that, we describe the settings for our algorithms,
the general scheme of our experiments, and the hyper-parameter optimization for the
ESFQ algorithm. Then we compare and analyze the computed results based on their
accuracy, efficiency, and ability to deal with complex dynamic systems with delays.
We implemented our core methods and algorithms in C++ using the Eigen framework
and integrated them into the CLSquare simulation system. For the hyper-parameter tuning,
we employed SMAC [24], a versatile tool for optimizing algorithm parameters.
5.2 Simulation Benchmarks
In the realm of reinforcement learning, there are various standard simulated benchmarks
[62] used to evaluate the performance of algorithms. In this work, we used two of these
benchmarks to evaluate the performance of our algorithms and collected measurements to
compare their ability to deal with delays in control systems. The benchmarks were
mountain car and inverted pendulum [62]. The dynamics of balancing a pendulum at an
unstable position appear in applications such as controlling walking robots or rocket
thrusters. These highly nonlinear dynamics make the inverted pendulum a suitable yet
difficult task for our algorithm to solve. Our second chosen benchmark, the mountain
car problem, is a second-order, nonlinear dynamic system with low-dimensional and
continuous-valued state and action spaces. The simulated plants are provided by the
CLSquare tool, developed and implemented by the Machine Learning Lab of the
University of Freiburg. In the following, we briefly introduce each benchmark and its
parameter settings for our experiments.
5.2.1 Mountain Car
Mountain car is a standard testing domain in reinforcement learning. The problem is
formulated such that an under-actuated car must drive up a steep, parabolically shaped
hill. Starting from any initial state in the middle of the valley, the goal is to drive the
car up the valley's side and escape in as few steps as possible. Since gravity is stronger
than the car's engine, even at full throttle the car cannot simply accelerate up the steep
slope. The car must rock back and forth along the bottom of the valley to build enough
momentum to escape. The allowed actions are {−4, 4} Newton, driving the system in
the left and right directions. The state vector has two dimensions: the car position along
the x-axis in meters and the car velocity in m/s. The allowed area for the car is
[−1.2, 0.7], and the system receives a transition cost of 0.01 in this range. The terminal
area is the range [−∞, −1.2]; there the agent receives a terminal cost of 1.01 and the
episode ends. The target area is [0.7, ∞]; there the agent receives no cost and the episode
ends. Figure 5.1 depicts the mountain car visualization in the CLSquare simulation.
Figure 5.1: The simulated mountain car visualization. The allowed position for the car is in
the range of [−1.2, 0.7] meters and the goal area is in the range of [0.7, ∞] meters. The terminal
area for the car is in the range of [−∞, −1.2] meters.
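As a concrete illustration of this cost structure, a minimal sketch of the immediate cost function is given below; the function name and the returned (cost, terminal) convention are our own, while the costs and ranges are taken from the description above.

```python
def mountain_car_cost(position):
    """Return (immediate cost, episode-ends flag) for a given car position in meters."""
    if position < -1.2:
        return 1.01, True    # terminal area: terminal cost, episode ends
    if position > 0.7:
        return 0.0, True     # goal area: no cost, episode ends
    return 0.01, False       # allowed area: transition cost, episode continues
```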
5.2.2 Inverted Pendulum
The inverted pendulum benchmark requires balancing a pendulum of unknown length
and mass at the upright position by applying forces to the cart it is mounted on. The
allowed actions are {−10, 10} Newton, moving the cart in the left or right direction.
Actions can be given at intervals of 0.05 s. The system state is described by a
four-dimensional vector: the pole angle θ, the pole angular velocity ˙θ, the cart position P,
and the cart translational velocity ˙P. The cart can move in the range of [−2.4, 2.4] meters;
anything outside this range is considered terminal area and the episode ends. The pole is
allowed to rotate within [−0.7, 0.7] radians. While the inverted pendulum moves in the
allowed range, the system receives a transition cost of 0.01; the cost for entering the
terminal area is 1.01. The target area for the pole is defined as the range [−0.05, 0.05]
radians, and for the cart as the range [−0.05, 0.05] meters. There, the system receives no
transition cost and the episode ends. Figure 5.2 depicts the inverted pendulum
visualization in the CLSquare simulation.
Figure 5.2: The inverted pendulum visualization in the CLSquare simulation, with the goal
areas for the pole and the cart indicated. The allowed area for the cart is in the range of
[−2.4, 2.4] meters and for the pole in the range of [−0.7, 0.7] radians; outside the allowed area
is the terminal area. The goal area for the cart is in the range of [−0.05, 0.05] meters and the
goal area for the pole is in the range of [−0.05, 0.05] radians.
5.3 Experimental Procedure And Parametrisation
The process by which the Echo State Fitted-Q Iteration algorithm learns a control
task has the following steps. First, we initialize the input and reservoir weight matrices
of the echo state network, W^in and W. The trajectory samples generated in each episode,
containing the plant state variables in vector form, are passed through the reservoir at the
end of the episode and expanded to a higher-dimensional feature space with the same
dimensionality as the reservoir state. These expanded samples and their corresponding
actions are then accumulated in a growing batch. The trajectory samples are generated
by a simulated plant containing unknown delay lengths in its executed actions, at time
intervals of 0.05 s. At the end of each episode, the batch is fed to a multilayer perceptron
whose number of input units equals the number of reservoir hidden units. The data flows
through two additional hidden layers with 20 neurons each and one output layer for the
Q-value. The activation function of the multilayer perceptron is the logistic function.
Finally, the Rprop algorithm updates the weights of the multilayer perceptron. We
designed our finite-horizon experiments on the introduced benchmarks such that the
reinforcement learning algorithm tries to minimize the time the agent needs to reach
the goal area.
The discount factor γ of our Q-learning algorithm was set to 0.98, and no exploration
strategy was used during training.
5.3.1 Echo State Q-Function
As we described in the previous section, the reservoir of the echo state network works as
the memory of the Q-function, and the multilayer perceptron serves as the readout layer
in the Echo State Fitted-Q Iteration algorithm. For our memory-based approach it was
essential to have sufficient memory capacity for holding the history of delayed states.
Here we benefit from the short-term memory capacity of the echo state Q-function.
However, the Q-function parameters need to be specified according to the length of the
delay and the dynamics of the control task at hand.
Table 5.1: The list of parameters for initializing echo state networks.

Name                        | Description and value range                             | Optimized
Spectral radius             | Real value in [0, 1], adjusts memory size               | True
Leaking rate                | Real value in [0, 1], adjusts dynamics of each neuron   | True
Reservoir size              | Integer value in {10, 20, 30, 40, 50, 100, 200}         | True
Transition steps            | Integer, set to 5% or 10% of the cycle length           | False
Regularization coefficient  | Real value in [10e-3, 10e3]                             | True
Activation function         | Logistic, Tanh, Rectifier                               | True
Input weight sparsity       | Real value in [0.2, 1.0]                                | True
Reservoir weight sparsity   | Real value in [0.2, 1.0]                                | True
A list of the echo state network parameters is given in Table 5.1. The spectral radius
and the leaking rate together affect the short-term memory length of the network.
Basically, for a system with fast oscillations like the inverted pendulum, we need a shorter
memory length than for slower systems like the mountain car. A high spectral radius
and/or a very small leaking rate increase the short-term memory length of the echo state
network. The transition steps are the number of time steps in each trajectory sample
that are used for burning in the reservoir; these states are removed at learning time.
The burn-in phase helps the reservoir shed its noisy dynamics and become input-driven
rather than influenced by its arbitrary initial state. The length of the transition steps
depends linearly on the trajectory length, usually 5 to 10 percent of the average trajectory
length. The reservoir size depends on the dimension of the input signal and the complexity
of the system dynamics; for some tasks it needs to be set to a very large number. The
regularization coefficient and the activation function need to be chosen with respect to
the control system at hand. The standard choice for the activation function is the
hyperbolic tangent. A higher value of the regularization coefficient prevents the linear
readout weights from growing excessively large. Typically it is a real number chosen on
a logarithmic scale in the range of [0.00001, 1]. In the following, we discuss the
initialization of the echo state network weight matrices, which play a major role in
capturing the system dynamics.
Figure 5.3: On the left, the reservoir weight matrix is initialized from a normal distribution
with zero mean and unit variance and then normalized to a spectral radius of 0.54. On the
right, the reservoir weight matrix is initialized from a normal distribution with zero mean and
unit variance; an orthogonal matrix is then computed using the Schur decomposition
implemented in the Eigen library and normalized to a spectral radius of 0.54. Finally, 40
percent of its coefficients are set to zero.
The sparsity of the echo state weight matrices boosts performance by extracting more
distinguishable features from the input signals. Figure 5.3 illustrates reservoir weight
matrices for 20 hidden units. The matrix on the left is initialized from a normal
distribution with zero mean and unit variance, normalized to a spectral radius of 0.54,
and fully connected. In fully connected reservoirs, all reservoir neurons are directly
connected to each other, with all connection weights being nonzero. The more sparsely
connected matrix in the right graph was generated as follows. First, a fully connected
reservoir was generated in the same way as the first one. From that, an orthogonal matrix
was computed using the Schur decomposition. Finally, given the sparsity percentage, 40%
in this example, randomly chosen coefficients were set to 0. We need to consider that if
the sparsity percentage is set very high, the reservoir no longer exists as a group of
mutually connected neurons. Usually, larger reservoirs (more hidden units) tolerate a
lower connectivity and still fulfill this requirement. The advantage of a sparse orthogonal
reservoir weight matrix is that it has linearly independent columns, and the echo states
computed with this matrix can better capture the dynamics of the system. The reservoir
weight matrix W functions as a dynamic basis for the original input signal; these basis
vectors decompose the input signals into pertinent sub-components to maximize their
differences.
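A sketch of this sparse orthogonal construction is given below, using SciPy's Schur decomposition in place of the Eigen implementation mentioned in the caption of Figure 5.3; the parameter values mirror that example and are otherwise free choices.

```python
import numpy as np
from scipy.linalg import schur

def sparse_orthogonal_reservoir(n_units, spectral_radius=0.54, sparsity=0.4, seed=0):
    rng = np.random.default_rng(seed)
    w0 = rng.standard_normal((n_units, n_units))   # zero mean, unit variance
    _, z = schur(w0)                               # Z is an orthogonal matrix
    w = spectral_radius * z                        # orthogonal: all |eigenvalues| = 1,
                                                   # so scaling sets the spectral radius
    # set the given fraction of randomly chosen coefficients to zero
    # (the sparsification perturbs the exact spectral radius slightly)
    mask = rng.random((n_units, n_units)) < sparsity
    w[mask] = 0.0
    return w
```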
In summary, we suggest that the performance of the echo state Q-function highly depends
on a proper balance between the internal dynamics and the influence of the external
activities, mediated via the input weights, and the strengths of the input and target
activities. Seeing that the external activities cannot be directly manipulated, the
initialization of the reservoir and input weight matrices is important. If the input weight
values are set too high, they will dominate the dynamics unfolding in the reservoir and
quickly overrule any other useful internal dynamics. A proper choice of the weight
initialization intervals is necessary to generate a well-performing echo state Q-function.
5.3.2 Hyper-parameter Optimization
The choice of parameters depends on the delay length, which is unknown to the Echo
State Fitted-Q Iteration algorithm. Therefore, we had to search over the parameter space
to find out which configurations provide sufficient memory length for our algorithm.
Hence, we used the hyper-parameter optimization tool SMAC to search over the following
parameters: spectral radius, leaking rate, reservoir size, the nonlinear function of the
reservoir neurons, input weight matrix sparsity, and reservoir matrix sparsity. These
parameters are interdependent and affect one another. The parameter optimization
algorithm tries to minimize a loss function over sets of discrete and continuous values.
To compute the loss value, we trained our algorithm for 120 episodes of 200 cycles each
on the inverted pendulum task with an unknown action delay, given a particular
parameter set. At the end of each episode, the algorithm was tested 100 times with
different initial positions. A successful test was one that ended up in the goal area within
the finite horizon. The loss value was derived from the average success rate after 120
episodes, such that a higher success rate yields a lower loss. SMAC searched over this
parameter space using Bayesian methods of global optimization to find settings with
minimum expected deviation from the global minimum. Parameter tuning over a
reinforcement learning algorithm is a computationally expensive process and has to be
done separately for each delay length. Therefore, within the scope of this work, we only
ran a few optimization iterations, roughly 60 to 80 per delay length and 2500 experiments
in total, to find different good settings.
Figure 5.4 shows the Pearson correlation coefficients for the 6 hyper-parameters optimized
via SMAC. Each cell shows the correlation coefficient between the parameters of its
corresponding column and row; the labels on both axes are the parameter names. The
coefficient values range between [−1, 1], with a higher absolute value indicating a stronger
correlation and the sign indicating a direct or inverse correlation. They are computed
from samples with at least 80% goal reach. The value range for each parameter is given
in Table 5.1. Our results showed that the most frequently chosen activation function was
the logistic activation; therefore, we excluded it from the correlation computation. The
goal of this optimization was to find configurations that give the echo state Q-function
enough short-term memory capacity to learn the delayed control task. Therefore, we
included the corresponding delay length of each experiment to observe its correlation
with the other parameters.
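A minimal sketch of this correlation analysis is given below, assuming the successful configurations have already been collected into a matrix with one row per experiment and one column per parameter (plus the delay length); the NumPy helper is illustrative, not the tool actually used to produce Figure 5.4.

```python
import numpy as np

def pairwise_pearson(samples, names):
    """samples: array of shape (n_experiments, n_parameters)."""
    corr = np.corrcoef(samples, rowvar=False)   # Pearson product-moment coefficients
    return {(names[i], names[j]): corr[i, j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
```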
Figure 5.4: This plot shows the pairwise correlation matrix of the 6 optimized hyper-parameters,
computed using the Pearson product-moment method. The labels in the figure are assigned to
their corresponding parameters. The delay length, in the third column, was kept constant during
each optimization iteration, and its value ranges between [1, 30]. The dataset from which the
correlations are computed consists of samples with 80% or more goal reach, 1561 experiments,
for the simulated inverted pendulum plant.
There are a few observations we can make by looking at the bivariate correlation matrix
in Figure 5.4. First, the reservoir matrix and input matrix sparsities are positively
correlated. This means that, across experiments with different delay lengths, their values
increase and decrease together. However, this correlation is not strong. Contrary to that,
the leaking rate and the spectral radius are inversely correlated. Although this is a weak
correlation, it implies that a reservoir with a high spectral radius needs a lower leaking
rate for holding the history of delayed actions. According to the update equation of the
reservoir, a high leaking rate means the new state is influenced less by the previous state,
and a high spectral radius means the previous reservoir state has more influence on the
next computed state. Another observation is that the delay length in this matrix has a
weak negative correlation with the spectral radius, the leaking rate, and the input weight
matrix sparsity. Our bivariate correlation matrix also shows that the delay length and
the reservoir size have almost zero correlation. Furthermore, we can see that the reservoir
size and the spectral radius have a positive correlation. Both of these observations are
counter-intuitive, because, as described before, for a reservoir with a large number of
hidden units we would expect a relatively small spectral radius, and for a longer delay
length we need a higher spectral radius to provide enough short-term memory capacity.
Therefore, we can safely conclude that these parameters are interdependent and have
multivariate dependencies that cannot be understood using only bivariate correlation.
To analyze the effects of these parameters on the ESFQ performance in detail, we would
need to run thorough hyper-parameter optimization experiments with an extensive number
of iterations for each delay length. Then, knowing their multivariate correlations, we
could find an efficient way of selecting them. However, within the scope of this work, for
selecting these parameters we need to rely on computationally expensive hyper-parameter
optimization done over our reinforcement learning algorithm.
Figure 5.5: Histogram plots of the goal-reaching percentage for the spectral radius, leaking
rate, reservoir weight matrix sparsity, and input weight matrix sparsity. Each bin shows how
frequently, and with what percentage, the system reaches the goal given the corresponding
subject parameter on the X-axis. There are in total 2500 experiments on the simulated
inverted pendulum containing action delays in the range of [1, 30].
In Figure 5.5 we see histogram plots for the spectral radius, leaking rate, reservoir weight
matrix sparsity, and input weight matrix sparsity parameters relative to reaching the goal
area. In total, there were 1561 configuration samples with more than 80% success rate,
out of 5600 different configurations that we tried, with on average 50 samples per delay
length. As we can see in the top-left plot, spectral radius values in the range of
[0.0001, 0.3] dominantly reached the goal area with a high success rate. For the leaking
rate in the top-right plot, the successful experiments cover a broader range of [0.0001, 0.9].
For both weight matrix sparsities, the successful experiments frequently occur in the range
of [0.4, 0.8]. This illustration helps us elicit some general rules for initializing our
hyper-parameters, constrain their value ranges, and make the search more efficient.
Figure 5.6 depicts the mean and variance of the four echo state hyper-parameters relative
to the goal reach percentage for experiments above 80% success rate. In the top row, we
compare the spectral radius and the leaking rate against the goal-reaching percentage;
the bottom row compares the reservoir and input weight matrix sparsities. The frequency
of successful experiments for each of these values can be seen in the corresponding
histogram plots in Figure 5.5. As we can see, there are successful experiments for a wide
range of these subject parameters. This is further support for
the presence of multivariate interdependencies between these parameters.
Figure 5.6: The figure shows the mean and variance of the goal reach percentage on the Y-axis,
given the value of the particular subject parameter on the X-axis. The subject parameters are
the spectral radius, leaking rate, reservoir weight matrix sparsity, and input weight matrix
sparsity. The samples are selected from experiments with more than 80% goal reach. There are
in total 2500 experiments on the simulated inverted pendulum containing action delays in the
range of [1, 30].
Our goal in this section was to suggest that there are rules for efficiently choosing some
of the hyper-parameters of the ESFQ algorithm. According to our parameter optimization
results, deducing such rules is not a straightforward task due to the multivariate
dependencies between these parameters; it would require an extensive study of the
parameter space, which is beyond the scope of this work. Still, in order to derive some
general constraints for setting up and computing a reservoir for a given task, our results
suggest the following:
• The reservoir size, i.e. the number of hidden units, can be up to 10 times larger than
the size of the input vector. This is usually a large enough number to grasp and reflect
the system dynamics.
• The input weight matrix and the reservoir weight matrix should be sparse; as we saw
in Figure 5.5, sparsity values between 40 and 60 percent work best.
• Our experimental results suggest that the logistic activation works well for the reservoir
activation function, appearing in 1514 out of 1561 successful experiments.
• The leaking rate and the spectral radius have a weak inverse correlation, but it is
beyond the scope of this work to comment on the choice of their values.
Basically, these results suggest that it is possible to choose the algorithm parameters
according to the characteristics of a given dynamic control task. But to do so we would
need to run more experiments and collect enough samples per delay length. We could,
for example, run 300 SMAC iterations for each delay length and then study how the
parameters change relative to how well the controller learns the task. It would then be
possible to fix the values of some of these parameters and only search over particular ones.
In the last chapter of this work, we mention some possible approaches for optimizing
these parameters and better understanding their dependencies.
5.4 Comparisons To Delay-Line NFQ Algorithm
In this section, we compare the results of our method to the standard delay-line Neural
Fitted-Q Iteration algorithm. Here we used a Q-function with two layers of hidden units,
20 neurons each, proven [54] to be a feasible structure for learning dynamic control tasks.
In the delay-line method, the action delay lengths are known to the algorithm. The
system history is a collection of observed states: the most recent state-action pair plus as
many previous states as the delay length. This means that if there is a 3-step action
delay in the simulated plant, the input vector contains the current state-action pair and
the past three states. The NFQ algorithm was trained on the simulated inverted
pendulum benchmark for 200 episodes of 200 cycles each, using Rprop and one update
per episode. For the sake of comparison, we added the same nonlinear structure as in the
NFQ algorithm on top of our reservoir in the Echo State Fitted-Q Iteration algorithm.
As described in chapter 4, the difference between NFQ and our method is that we
preserve the memory in our method by using the reservoir of the echo state network, and
we employed the hyper-parameter optimization tool SMAC to find a suitable memory
capacity for the echo state Q-function. In simple terms, we compared the performance
of the reservoir as a memory to that of classical tapped delay-line memories [65].
Although the Echo State Fitted-Q Iteration algorithm is not aware of the delay length,
with the aid of the hyper-parameter optimization tool SMAC it manages to find the best
configuration. This approach could be considered equivalent to giving the delay length to
the system.
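To make the baseline concrete, the following sketch shows how the tapped delay-line input described above could be assembled; the function is illustrative and assumes the delay length is known to the algorithm, as in the delay-line NFQ setting.

```python
import numpy as np

def delay_line_input(state_history, action, delay):
    """state_history: list of past state vectors, most recent last (at least delay+1)."""
    window = state_history[-(delay + 1):]   # current state plus the d most recent past states
    return np.concatenate([np.concatenate(window), np.atleast_1d(action)])
```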
Figure 5.7: Maximum percentage of goal reach achieved by the delay-line NFQ and ESFQ
algorithms for different delay lengths. The results were obtained on the simulated inverted
pendulum with action delays in the range of [1, 30]. The X-axis shows the delay lengths and the
Y-axis shows the maximum percentage of goal reach.
In Figures 5.7 and 5.8 we present a comparison between the delay-line Neural Fitted-Q
Iteration and the Echo State Fitted-Q Iteration algorithms in the test phase with
different initial values and different delay lengths. In the first graph, we can observe
that the ESFQ algorithm has the advantage of achieving the maximum value even for
long delay lengths. On the other hand, the delay-line NFQ faces difficulties after action
delays of more than 10 steps, and its peak performance decays down to 40%. Such
observations suggest that the reservoir is a very suitable memory-based approach for
control systems with longer delay lengths. For every delay length it reaches the goal
area, from different initial positions in the test phase, on average more than 90% of the
time, and in some cases 100% of the time. Furthermore, in the second plot, Figure 5.8,
we can observe that the ESFQ algorithm is fairly robust to randomness in the
initialization of its fixed weight matrices. Here, for every delay length, we trained both
algorithms, initialized with 10 different random seeds, and plotted the mean and standard
deviation of the goal reach percentage over delay lengths. As we see, the performance of
the NFQ algorithm is prone to decay when it is trained with different random seeds,
especially for longer delay lengths. On the contrary, the ESFQ algorithm performs more
robustly in the face of such random initialization, even for longer delay lengths. In this
regard, we need to consider that the echo state performance would presumably be
expected to decay, because it has extra random parameters in its weight matrices.
However, given the same configuration for initializing the reservoir over various random
seeds, our algorithm shows much better overall performance. Furthermore, the comparable
range of performance variances in both algorithms is further evidence for the robustness
of the ESFQ algorithm toward randomness in its parameter initialization. With these
experiments, we aim to support the idea that the echo state Q-function is a fairly reliable
memory-based function approximator for dynamic systems with delays, and the
observations about randomness support the idea that the memory capacity of the echo
state Q-function is less likely to be affected by the random initialization of its parameters.
Figure 5.8: Mean and standard deviation of the goal reach percentage achieved by the delay-line
NFQ and ESFQ algorithms for different delay lengths. The results were obtained from
experiments with the simulated inverted pendulum containing action delays in the range of
[1, 30]. The X-axis shows the delay lengths and the Y-axis shows the mean and standard
deviation of the goal reach percentage, averaged over experiments with 10 different random
seeds. The mean values are computed from the maximum percentage of goal reach of each
experiment.
In Figure 5.9 we compare the learning curves of the ESFQ and NFQ algorithms on the
experiments with a delay length of 27. The values are averaged over 10 different random
seeds. Here we can see that, although the echo state Q-function has more random
parameters, overall the ESFQ algorithm performs better than the NFQ algorithm. In the
early episodes the delay-line NFQ shows some minor improvement relative to the ESFQ,
but after 60 episodes the average performance of the NFQ stays the same while the ESFQ
performance keeps increasing. Furthermore, we can see that the ESFQ has a higher
standard deviation compared to the delay-line NFQ.
Figure 5.9: The learning curves of the delay-line NFQ and the ESFQ algorithms. The results
were obtained from experiments with the simulated inverted pendulum containing a 27-step
action delay. The X-axis shows the episode index and the Y-axis shows the mean and standard
deviation of the goal reach percentage, averaged over experiments with 10 different random
seeds. The mean values are computed from the percentage of goal reach of the corresponding
episodes over all experiments.
In Figure 5.10 we compare how fast each algorithm reaches a success rate of 80% or more.
As we can see, the Echo State Fitted-Q Iteration algorithm needs on average 100 episodes
to reach the minimum 80% success rate. Although the delay-line NFQ algorithm reaches
the maximum success rates faster for some short delay lengths, it fails for most delay
lengths larger than 15. This indicates that the memory provided by the reservoir contains
information that helps the Echo State Fitted-Q Iteration algorithm to perform more
successfully and to reach higher performance more quickly. However, we should remember
that in order to find a suitable parameter setting for the ESFQ algorithm we still need to
use hyper-parameter optimization tools, as explained in the previous section. From all
these observations we can conclude that the Echo State Fitted-Q Iteration algorithm with
a nonlinear readout layer is a reliable algorithm for learning dynamic control tasks with
delays, and that its performance stays relatively stable regardless of randomness in the
parameter initialization.
Figure 5.10: In this figure, we see how many episodes the ESFQ and the NFQ algorithms take
to reach a success rate above 80%. The X-axis corresponds to the delay length and the Y-axis
to the episode number. The experiments were done on the simulated inverted pendulum for a
given delay length, for 120 episodes of 200 cycles each. For some delay lengths the delay-line
NFQ did not manage to reach the minimum of 80% success rate.
5.5 Why Not Linear Readout Layers
In the previous section, we showed the advantages of the Echo State Fitted-Q Iteration
algorithm with a nonlinear readout layer and compared its performance to the tapped
delay-line NFQ algorithm. As described in section 4.4, it is possible to train different
readout layers for the ESFQ algorithm, and in the following we present results for training
the algorithm with various readout layers. Here we show the performance evaluation for
ridge regression and very large reservoirs. Our aim is to illustrate that a linear method
for training the echo state Q-function, in offline growing-batch mode, performs well for
control tasks with simpler dynamics such as the mountain car but fails to learn a proper
policy on more complex dynamic tasks such as the inverted pendulum.
5.5.1 Ridge Regression On Mountain Car
The first choice for training the echo state Q-function is the ridge regression algorithm.
We designed an experiment with the simulated mountain car containing a 2-step action
delay. In Figure 5.11 we can see how the dynamics of the mountain car are captured by
the reservoir and how they change over time given the input signal, i.e. the car position,
the car velocity, and the applied action. The ESFQ algorithm with a linear readout layer
achieves 100% test success in reaching the goal area, starting from 100 different initial
positions, after 17 episodes of 300 cycles each. The effects of the delayed actions are
reflected in the reservoir states in the top plot. The hidden units adapt their states
according to the changes in the car position and velocity. It took 290 cycles for the
controller to reach the goal area. The other parameters of this experiment were set as
follows: the spectral radius is 0.40, the reservoir size is 20, the regularization coefficient
is 10, 30 steps of initial transitions, γ is 0.98, with a 10% greedy exploration rate, and
  • 10. CHAPTER 1. INTRODUCTION 1.2. GOALS AND CONTRIBUTIONS guarantee a good solution because of the fading gradient issue. 1.2 Goals and Contributions As we described in the previous section, in this work we aimed to learn the behaviour of control systems which have unknown delays without knowing their underlying dy- namics. For the purpose of this task Q-learning, a particular model-free reinforcement learning algorithm would be a feasible solution. Basically, The Q-learning algorithm uses function approximation to choose the best control action given the current system state and minimizes the overall cost of the reaching the goal area. The presence of delay in control systems violates the Markov property which is a fundamental method for formu- lating a successful reinforcement learning algorithm. The Markov property simply says that a conditional probability distribution of future states of the process only depends on its present state. Hence, for a delayed task the proper function approximation i.e. Q-function needs to be able to preserve the Markov property by holding the history of a system state. Furthermore, the learning algorithm for a control system, in general, should be able to deal with problems of a long learning time and convergence. Last but not least, the learning algorithm we described above will have various parameters which are interdependent and must be selected carefully to boost the learning performance. To achieve the objectives we mentioned above, we designed and implemented the Echo State Fitted-Q Iteration (ESFQ) algorithm. It is a type of offline batch reinforcement learning algorithm. The ESFQ algorithm has several advantages which make it a suit- able tool to learn control tasks with delays. First, it is a batch reinforcement learning algorithm, which is able to use all the previously seen control trajectories which acceler- ate the learning procedure. Second it employs echo state networks (ESN) as a function approximation to estimate delayed targets. The ESN is a specific type of recurrent neu- ral networks which its short-term capacity enables it to hold the history of the system states. Therefore, it is able to preserve the Markov property for a reinforcement learning algorithm. The third contribution of our method is the ability to have different readout layers for training the Echo State Q-function. This ability helps function approximation to learn complex dynamic systems. Our algorithm benefits from its unconventional way of preserving the history of system states and the ability to train different readout layers which make it suitable for a particular control task. However, it has several parameters which need to be carefully tuned in order to maximize its performance. To achieve this goal we employed hyper-parameter optimization tools to automatically determine good choices and take out a manual step of the configuration. 1.3 Outline In the following, we first cover the major related works to our method and briefly compare and contrast each of them with our method in chapter 2. Then we describe fundamental features of our algorithms, reinforcement learning with feedback control in chapter 3, and echo state networks for function approximation in chapter 4. After that, we proceed by presenting the results of our method on different standard benchmarks in chapter 5. There also we compare our results to the standard tapped delay-line reinforcement 5
  • 11. CHAPTER 1. INTRODUCTION 1.3. OUTLINE learning algorithm. At the end, in chapter 6 we summarize this work by mentioning our major contributions and addressing the future works that could be done. 6
  • 12. CHAPTER 2 RELATED WORKS 2.1 Non-Markovian Reinforcement Learning Markov decision process(MDP) is a popular framework for training control problems[22] in reinforcement learning. In summary, given a certain state of a system, the agent selects an action that brings the system to a new state and induces a cost, the new state is observed and the cost is collected, then the decision maker selects a new action, and so on. However, the basic MDP framework as it is defined in [34], makes a number of restrictive assumptions that may limit its applicability: • the system’s current state is always available to the agent. • the agent’s actions always take effect immediately • the cost induced by an action is always collected without delay The Markov property is usually assumed in reinforcement learning and, therefore, it vanishes when any of the described conditions is violated. It follows that if an agent must learn to control a system, there will be periods of time when the internal representation of system states will be inadequate. Therefore, the decision task will be non-Markovian which challenges the performance of reinforcement learning algorithms. The presence of delay in measured states or applied actions causes the violation of the Markov property. In the following, we viewed three major state-of-the-art approaches for reinforcement learning with delay. 2.1.1 Markov Decision Process With Delays And Asynchronous Cost Collection In the first approach, the state space of the learning agent is augmented with the actions that were executed during the delay interval. In this work [16, 34] authors showed how (discrete-time, total-expected-cost, infinite horizon) Markov decision process with obser- vation, action and delayed cost are reduced to a Markov decision process without delays. They drew connections amongst the three delay types and demonstrated that the cost structure of the process without delays is the same as that of the original process with 7
  • 13. CHAPTER 2. RELATED WORKS 2.1. NON-MARKOVIAN REINFORCEMENT LEARNING constant or random delays. They considered an embedded process that behaves similarly to a process with constant delays. However, their results are based on the intuition of asynchronous cost collection. That means the costs may be induced and collected at different decision stages and policies can still be compared properly as long as costs are discounted accordingly. While this approach works well, the state space increase can cause a large increase in learning time and memory requirements. 2.1.2 Learning And Planning In Environments With Delayed Feedback The second approach tried to learn a model of the underlying non-delayed process and used this model to base control actions on the future state after the delay, predicted by the model[71]. In this work, the authors evaluated algorithms for environments with constant observation and reward delay. They covered three different approaches for deal- ing with the constant delay Markov decision process(CDMDP). First, ”the wait agent”, which waits for k steps, until the current observation comes through, and then acts us- ing the optimal action in the non-delayed MDPs. Unfortunately, policies derived from this strategy will not, in general, provide satisfactory solutions to the CDMDP planning problem. Instead, the agent’s resulting policy will likely be suboptimal as it is essen- tially losing potential reward on every wait step. Their second solution was a memoryless policy, which treated CDMSP as MDP and used memoryless policy for the non-delayed MDP. Their third solution was the traditional augmented approach, which involves ex- plicitly constructing an MDP equivalent to original CDMDP in a larger state space. They augmented each state and applied k previous actions to form a new state representation. Authors formed new transition probability and reward function but such a solution adds the extra burden of acquiring the model of the system while the added computational complexity may actually increase the delay itself. 2.1.3 Control Delay in Reinforcement Learning for Real-Time Dynamic Systems The third approach for dealing with delays in MDP has been introduced in [60] which is an improvement on the older approaches. The authors introduced two new memoryless solutions and the most important one was an online algorithm named dSARSA(λ). Such methods base the next control action only on the most recent observation. The downside of memoryless approaches is that they are likely to learn a suboptimal policy because they have no means of predicting the state in which the control action will take effect. Furthermore, SARSA(λ) does not take the presence of the delay into account in its learning updates. While their complexity remains comparable to that of SARSA(λ), they exploited the knowledge about the length of the delay to improve their performance. Then, the authors presented an extension to these algorithms which was applicable where the delay length is not an integer multiple of the time step. 8
  • 14. CHAPTER 2. RELATED WORKS 2.2. RECURRENT NEURAL NETWORKS IN REINFORCEMENT LEARNING 2.2 Recurrent Neural Networks In Reinforcement Learn- ing There are various attempts to combine the dynamic programming approaches with recur- rent neural networks(RNN) to tackle reinforcement learning problems. They formulate their solution in adaptive critic designs(ACDs) [52]. The ACDs have their root in dy- namic programming and are suitable for learning in noisy, nonlinear, and non-stationary environments. The ACDs in their application first define what to approximate through the critic, and then how to adapt the actor in response to the information coming from the critic. Schmidhuber took one of the earliest approaches [59] toward this direction. He modeled the dynamics and control of a system with two separate networks and trained them both in parallel or sequential. However, the proposed fully recurrent networks struggled in their flexibility and adaptation ability. One of the major difficulties he faced in his work was the disability of Backpropagation through time and the ACDs to deal with the long time lag between applied action and its execution. In another work, Bakker combined the long short term memory [4, 3, 15] with the actor-critic method to learn par- tially observable reinforcement learning problems. He utilized long short term memory strength in learning long-term temporal dependencies to infer states in partially observ- able tasks. He developed an integrated method, which learns the system’s underlying dynamics and the optimal policy at the same time. Compared to our method which is a data efficient batch learning and designed to learn non-Markovian problems, Bakker’s approach had problems in dealing with high dimensionality, partial observability, contin- uous state and action space, and a limited amount of training data. In a recent approach by Sch¨afer [58], he applied recurrent neural reinforcement learning approaches to identify and control a high-dimensional dynamic system with continuous state and action spaces in partially unknown environments like a gas turbine. He introduced a hybrid recurrent neural network approach that combines system identification and determination of an optimal policy in one network. Furthermore, in contrast to our reinforcement learning methods, it determines the optimal policy directly without making use of a value func- tion. His approach is model-based and by constructing the model of the system it is able to learn high dimensional and partially observable reinforcement learning problems with continuous state and action spaces in a data efficient manner. As it shown by Lin and Mitchell [38] recurrent neural networks are robust tools for functional approximation in partially observable systems. In their work authors defined three different architectures, recurrent-model, recurrent-Q, and window-Q architectures. The first one learns an action model for the history features. The second approach learns a Q-function approximation using indirectly observable training examples. And the third method learns the Q-value approximation by taking a known number of state-action pairs, window size, into its memory. As they have shown, these architectures are all capable of learning some non-Markovian problems, but they have their own advantages and disadvantages. Later on in the work by K. 
Bush [11], the author used echo state networks(ESNs) to address the problem of non-Markovian reinforcement learning and showed how they could successfully learn some standard benchmarks problems. The au- thor experimentally validated the positive performance and dynamic attributes of the echo state networks on modeling a system in the non-Markovian reinforcement learning domains. The major difference between this work and our approach is, we tried a non- linear readout layer to learn complex dynamic systems, but Bush introduced a memory 9
  • 15. CHAPTER 2. RELATED WORKS 2.2. RECURRENT NEURAL NETWORKS IN REINFORCEMENT LEARNING consolidation method using the mixture of expert (MoE) and tested it for stationary dynamic systems. This framework preserves the ability to train the readout layers via linear regression. The author showed that echo state networks exhibit low-mobility learn- ing through trajectory-based features. Later on, echo state networks are also successfully used and tested as Q-function approximation in work done by Oubbati et.al. [47]. Their work sees the control task as a result of the interaction between brains, bodies, and environments. There, the authors utilized reservoir computing as an efficient tool to understand how the system behaviour emerges from an interaction. Inspired by imita- tion learning designs, they presented reservoir computing models, to extract the essential components of the system dynamics which are the result of the agent-environment in- teraction. They validated their learning architectures by experimenting with a mobile robot in a collision avoidance scenario. Compared to this work, we trained our policy offline and with non-linearity on readout layer and applied discrete control actions. But they first did a short pre-training and then used online linear approach to train echo state networks and applied a continuous action to control an autonomous driving robot in order to follow a line. 10
  • 16. CHAPTER 3 REINFORCEMENT LEARNING IN FEEDBACK CONTROL 3.1 Learning In Feedback Control The classical feedback control loop basically describes a control mechanism that uses information from measurements of a process to control it. In each time interval, the process communicates a system state, vector of measured dynamic variables, to the con- troller that applies the control commands. In feedback control, the control variables are measured and compared to the target values. Therefore, the feedback control mecha- nism manipulates the input variables of the system in a continuous loop to minimize the differences between estimation and desired targets. This allows for the formulation of a broad range of challenging control applications. The direct approach to have a feedback controller is by designing a model of the system. However for a nonlinear control sys- tem, the task of model identification becomes tedious and complicated. An alternative approach is to learn to control the feedback control systems. Which means, we need an intelligent controller component that learns to control a subjected dynamic process using experiences made from an interaction. Reinforcement learning (RL) is a type of machine learning method which tries to learn an appropriate closed-loop controller by simply interacting with the process and incre- mentally improving the control behaviour. The goal of reinforcement learning algorithms is to maximize a numerical reward signal by discovering which control commands i.e. actions yield the most reward. Using reinforcement learning algorithms, a controller can be learned with only a small amount of prior knowledge of the process. Reinforcement learning aims at learning control policies for a system in situations where the training information is basically provided in terms of judging success or failure of the observed system behaviour [62]. Because this is a very general scenario, a wide range of different application areas can be addressed. Successful applications are known from such differ- ent areas as game playing [9], dispatching and scheduling [6], robot control [49, 56], and autonomic computing [64]. 11
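As a concrete picture of this interaction, the following sketch shows one closed-loop episode in which a controller repeatedly maps the measured state to an action and collects the resulting cost. It is written in C++ (the implementation language used later in this work); the Plant and Controller interfaces are hypothetical stand-ins for the simulation framework, not its actual API.

#include <Eigen/Dense>

// Hypothetical interfaces, only for illustration.
struct Plant {
    virtual Eigen::VectorXd reset() = 0;                          // initial state
    virtual Eigen::VectorXd step(double action) = 0;              // apply action, return next state
    virtual double cost(const Eigen::VectorXd& s, double a) = 0;  // immediate cost of (state, action)
    virtual bool terminal(const Eigen::VectorXd& s) = 0;
    virtual ~Plant() = default;
};

struct Controller {
    virtual double selectAction(const Eigen::VectorXd& state) = 0;  // greedy or exploratory policy
    virtual ~Controller() = default;
};

// One closed-loop episode: measure the state, compute a control input,
// apply it, and accumulate the cost.
double runEpisode(Plant& plant, Controller& ctrl, int maxCycles) {
    Eigen::VectorXd s = plant.reset();
    double totalCost = 0.0;
    for (int t = 0; t < maxCycles && !plant.terminal(s); ++t) {
        double a = ctrl.selectAction(s);   // feedback: the action depends on the measured state
        totalCost += plant.cost(s, a);
        s = plant.step(a);                 // the plant advances to the successor state
    }
    return totalCost;
}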
3.2 Markov Decision Process
The control problems we try to learn in this work are discrete-time control problems and can be formulated as a Markov decision process (MDP) [62]. An MDP has four components: a set S of states, a set A of actions, a stochastic transition probability function p(s, a, s′) describing the system behaviour, and an immediate reward or cost function c : S × A → R. The state of the system at time t characterizes the current situation of the agent in the world and is denoted by s(t). The action chosen by the agent at time step t is denoted by a(t). The immediate reward or cost is the consequence of the taken action and a function of state and action. Since the rewards for the taken actions can be formulated as costs, the goal of the control agent is to find an optimal policy π* : S → A that minimizes the cumulated cost for all states. Basically, in reinforcement learning we try to choose actions over time to minimize/maximize the expected value of the total cost/reward:
E[R(s_0) + R(s_1) + R(s_2) + ...]
3.3 Q-Learning
There are various reinforcement learning approaches that can be formulated based on the MDP [62], e.g. value iteration and policy iteration, where the transition model and the reward function of the control task are known. However, in many real-world problems the state transition probabilities and the reward functions are not given explicitly; only a set of states S and a set of actions A are known, and we have to learn the dynamic system behaviour by interacting with it. Methods of temporal differences were invented to perform learning and optimization in exactly these circumstances. There are two principal flavors of temporal difference methods. The first is an actor-critic scheme [63, 62], which parallels the policy iteration methods and has been suggested as being implemented in biological reinforcement learning. The second is a method called Q-learning [72], which parallels the value iteration methods. In this work, we use Q-learning as our principal reinforcement learning algorithm. The basic idea in Q-learning is to iteratively learn the value function, the Q-function, that maps state-action pairs to expected optimal path costs. The update rule of the Q-learning algorithm is given by:
Q_{k+1}(s, a) := (1 − α) Q_k(s, a) + α (r(s, a) + γ min_{a′} Q_k(s′, a′))
where s denotes the system state where the transition starts, a is the action that is applied, and s′ is the successor system state. The learning rate α has to be decreased in the course of learning in order to fulfill the conditions of the stochastic approximation, and the discounting factor is denoted by γ [62]. It can be shown that under mild assumptions Q-learning converges for finite state and action spaces, as long as the Q-value for every state-action pair is updated infinitely often. Then, in the limit, the optimal Q-function is reached.
3.3.1 Batch Reinforcement Learning
As formulated above, the standard Q-learning protocol considers an agent operating in discrete time. At each time point t it observes the environment state s_t, takes an action
a_t, and receives feedback from the environment, including the next state s_{t+1} and the instantaneous reward r_t. The sole information that we assume available to learn the problem is the one obtained from observing a certain number of one-step system transitions (from t to t + 1). The agent interacts with the control system in the environment and gathers state transitions as a set of four-tuples (s_t, a_t, r_t, s_{t+1}). Except for very special conditions, it is not possible to exactly determine an optimal control policy from a finite set of transition samples. In the literature there is a method called batch mode reinforcement learning [17, 46, 36], which aims at computing an approximation of such an optimal policy π* from a set of four-tuples:
D = {(s_t^l, a_t^l, r_t^l, s_{t+1}^l), l = 1, ..., #D}
This set can be generated by gathering samples corresponding to one single trajectory (or episode) as well as by considering several independently generated trajectories or multi-step episodes. In the work by Lange et al. [36] the authors covered various batch-mode reinforcement learning algorithms. Among them, we chose the growing batch method for training our learning algorithm. Training algorithms with a growing batch has two major benefits. First, from the interaction perspective, it is very similar to the 'pure' online approach. Second, from the learning point of view, it is similar to an offline approach in which all the trajectory samples are used for training the algorithm. The main idea of the growing batch is to alternate between phases of exploration, where the set of training examples is grown by interacting with the system, and phases of learning, where the whole batch of observations is used. The distribution of the state transitions in the provided batch must resemble the 'true' transition probabilities of the system in order to allow the derivation of good policies. In practice, exploration improves the quality of the learned policies by providing more variety in the distribution of the trajectory samples. Furthermore, it is often necessary to have a rough idea of a good policy in order to explore interesting regions that are not in the direct vicinity of the starting states. If 'important' states, i.e. states close to the goal, are not covered by any of the trajectory samples, then it is obviously not possible to learn a good policy from the batch data, because the system would not know which series of actions lead to the goal area.
3.3.2 Neural Fitted Q-Iteration (NFQ)
Model-free learning methods like Q-learning are appealing from a conceptual point of view and have been very successful when applied to problems with small, discrete state spaces. But when it comes to applying them to real-world systems with larger and possibly continuous state spaces, these algorithms face some limiting factors. For a relatively small or finite state and action space, the Q-function can be represented in tabular form and is straightforward to approximate. However, when dealing with continuous or very large discrete state and/or action spaces, the Q-function cannot be represented by a table with one entry for each state-action pair. In this respect, three problems can be identified:
• the 'exploration overhead', causing slow learning in practice
• inefficiencies due to the stochastic approximation
• stability issues when using function approximation
A minimal sketch of the transition batch these methods operate on, and of the Q-target computed from it, is given below.
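The following C++ sketch illustrates the one-step transition four-tuples of the growing batch and how a Q-learning target is formed from them; the type and function names are illustrative, not the thesis' actual data structures, and the Q-function is treated as an opaque callable.

#include <Eigen/Dense>
#include <algorithm>
#include <limits>
#include <vector>

// One-step experience (s, a, c, s') as collected by the growing batch.
struct Transition {
    Eigen::VectorXd state;
    double action;
    double cost;               // immediate cost c(s, a)
    Eigen::VectorXd nextState;
    bool terminal;
};

using Batch = std::vector<Transition>;   // grows after every exploration episode

// Q-learning target for one transition, assuming a function approximator
// q(s, a), a finite action set, and a discount factor gamma (costs are minimized).
template <typename QFunction>
double qTarget(const Transition& t, const QFunction& q,
               const std::vector<double>& actions, double gamma) {
    if (t.terminal) return t.cost;       // no successor cost beyond a terminal state
    double best = std::numeric_limits<double>::max();
    for (double a : actions)             // min over successor actions
        best = std::min(best, q(t.nextState, a));
    return t.cost + gamma * best;
}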
A common factor in modern batch reinforcement learning algorithms is that they typically address all three issues and offer specific solutions to each of them. The Fitted-Q iteration algorithm [17] is an efficient batch mode reinforcement learning algorithm that can learn from a sufficiently rich set of generated trajectories. In this algorithm, the Q-function approximation is done on an infinite or finite horizon optimal control problem with discounted rewards. In each step, the algorithm uses the batch data together with the Q-values computed at the previous step to determine a new training set. Then it applies a regression method on the training data to compute the next Q-function of the sequence. Among several approaches to approximate the Q-function, neural networks are considered suitable tools [23, 67] because they provide a nonlinear mapping from input to output data. The neural Q-function needs to define an error function which measures the difference between the target Q-value and the estimated Q-value, for example a squared error measure like
error = (Q(s, a) − (c(s, a) + γ min_{a′} Q(s′, a′)))^2.
To train the neural Q-function and minimize the estimation error, common gradient descent learning rules like Backpropagation [20] can be applied. In general, fitted-Q iteration algorithms are trained online, which requires thousands of samples and a long training time to learn a control task. Riedmiller [53] proposes an alternative approach, neural fitted-Q iteration (NFQ), which performs an offline update step considering the entire set of transitions. The standard NFQ uses the growing batch method to collect the trajectory samples for training a multilayer perceptron Q-function. It has been proven [54] that two-layer neural networks have sufficient approximation capacity to generalize well for closed-loop control. The pseudocode for the NFQ algorithm is shown in Algorithm 1.
Algorithm 1: Main loop of NFQ. k counts the number of iterations, k_max denotes the maximum number of iterations. init_MLP() returns a multilayer perceptron with randomly initialized weights. Backpropagation_training(P) takes a pattern set P and returns a multilayer perceptron that has been trained on P using backpropagation as a supervised training method.
procedure NFQ_Main()
    input: a set of transition samples D; output: Q-value function Q_N
    k = 0
    init_MLP() → Q_0
    while k < k_max do
        generate pattern set P = {(input^l, target^l), l = 1, ..., #D} where:
            input^l = (s^l, a^l)
            target^l = c(s^l, a^l) + γ min_{a′} Q_k(s′^l, a′)
        Backpropagation_training(P) → Q_{k+1}
        k := k + 1
3.3.3 Resilient Propagation (Rprop)
The original NFQ uses the Backpropagation [20] algorithm combined with an optimization method, i.e. gradient descent, for training its multilayer perceptron neural network.
As the equation below shows, the loss function is differentiated with respect to all the weights in the network, and the weight values are then updated with respect to the computed gradient:
w_ij(t + 1) = w_ij(t) − ε ∂E/∂w_ij(t)
The partial derivative of the error function E with respect to the neural network weights is computed using the chain rule. The learning rate parameter ε affects the convergence of the algorithm: if it is too small, the system takes many steps to converge, and with a large learning rate the system oscillates around local minima and fails to settle in the desired value range. Regular Backpropagation is a slow and inefficient process. In order to accelerate this supervised learning process, it is possible to use advanced techniques like Rprop [55], which has a more reliable and faster convergence than the regular gradient descent method. The pseudocode in Algorithm 2 shows the core part of the adaptation and learning rule of Rprop. Using Rprop to train the NFQ algorithm also reduces its overall complexity by removing the need for parameter tuning of the neural Q-function. Rprop stands for Resilient Propagation; it is an efficient learning scheme that in every iteration updates the network's weights based on sign changes of the partial derivatives of the error function E. Each network weight is updated individually by ∆_ij based on its local gradient information. So if the last weight update was so large that the algorithm jumped over a local minimum, the update value is adapted accordingly. In this adaptive method, the new value for ∆_ij is computed by multiplying it with η+ or η−, depending on the sign of the product of the current and previous partial derivatives of E. The values 0 < η− < 1 < η+ are parameters that can be set to some constant. Using the update value ∆_ij, the Rprop algorithm increases or decreases each individual weight according to the sign of the derivative.
Algorithm 2: The core part of the Rprop algorithm. The minimum (maximum) operator delivers the minimum (maximum) of two numbers; the sign operator returns +1 if the argument is positive, −1 if the argument is negative, and 0 otherwise. ∆_ij determines the size of the individual update value for each weight.
for all weights and biases {
    if (∂E/∂w_ij(t − 1) · ∂E/∂w_ij(t) > 0) then
        ∆_ij(t) = minimum(∆_ij(t − 1) · η+, ∆_max)
        ∆w_ij(t) = −sign(∂E/∂w_ij(t)) · ∆_ij(t)
        w_ij(t + 1) = w_ij(t) + ∆w_ij(t)
    else if (∂E/∂w_ij(t − 1) · ∂E/∂w_ij(t) < 0) then
        ∆_ij(t) = maximum(∆_ij(t − 1) · η−, ∆_min)
        w_ij(t + 1) = w_ij(t) − ∆w_ij(t − 1)
        ∂E/∂w_ij(t) = 0
    else if (∂E/∂w_ij(t − 1) · ∂E/∂w_ij(t) = 0) then
        ∆w_ij(t) = −sign(∂E/∂w_ij(t)) · ∆_ij(t)
        w_ij(t + 1) = w_ij(t) + ∆w_ij(t)
}
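A minimal C++ sketch of the per-weight update in Algorithm 2 is given below, assuming the weights and gradients are stored in Eigen matrices. For brevity it follows the simpler variant without the weight-backtracking step of the second branch, and the constants are common default values rather than the settings actually used in this work.

#include <Eigen/Dense>
#include <algorithm>

inline double sign(double x) { return (x > 0.0) - (x < 0.0); }

// Per-weight Rprop state: previous gradient and current step size Delta_ij.
struct RpropState {
    Eigen::MatrixXd prevGrad;   // dE/dw from the previous iteration
    Eigen::MatrixXd delta;      // individual update values, e.g. initialized to 0.1
};

// One Rprop iteration for a single weight matrix.
void rpropStep(Eigen::MatrixXd& W, Eigen::MatrixXd& grad, RpropState& st,
               double etaPlus = 1.2, double etaMinus = 0.5,
               double deltaMin = 1e-6, double deltaMax = 50.0) {
    for (int i = 0; i < W.rows(); ++i) {
        for (int j = 0; j < W.cols(); ++j) {
            const double prod = st.prevGrad(i, j) * grad(i, j);
            if (prod > 0.0) {           // same sign as before: enlarge the step
                st.delta(i, j) = std::min(st.delta(i, j) * etaPlus, deltaMax);
                W(i, j) -= sign(grad(i, j)) * st.delta(i, j);
            } else if (prod < 0.0) {    // sign change: jumped over a minimum, shrink the step
                st.delta(i, j) = std::max(st.delta(i, j) * etaMinus, deltaMin);
                grad(i, j) = 0.0;       // suppress an update in the next iteration
            } else {                    // product is zero: plain step with the current Delta
                W(i, j) -= sign(grad(i, j)) * st.delta(i, j);
            }
        }
    }
    st.prevGrad = grad;
}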
  • 21. CHAPTER 4 ECHO STATE NETWORKS FOR FUNCTION APPROXIMATION 4.1 Recurrent Neural Networks Our goal in this work is to learn a feedback control system with delays without knowing its dynamic model. As we described in the previous chapter, we chose Neural Fitted-Q Iteration (NFQ) algorithm, a specific Q-learning algorithm, to learn our model-free task. In its core, the NFQ algorithm uses multi-later perceptrons as a function approximator to learn a model of the system dynamics and estimate the proper Q-value for each ac- tion. In the control task with delays the Markov property will be violated, the ordinary multilayer perceptron will fail to approximate the desired targets. Therefore, we need a Q-function that is capable of holding the history of the system states in its memory for robust approximation and preserving the Markov property. One possible option is to use recurrent neural networks(RNNs) for the function approximation. Recurrent neural networks (RNNs) have a structure similar to biological neural net- works and they are able to hold the history information of the dynamic system. A recurrent neural network contains (at least) one cycle path of synaptic connections. This feature makes RNNs an excellent tool for function approximation in delayed control sys- tems. Mathematically recurrent neural networks implement and approximate dynamic systems and they are applied to the variety of tasks, for example, system identification and inverse system identification, pattern classification, stochastic sequence modeling [58, 47, 30]. However, it is not an easy straightforward task to train RNNs. Gener- ally speaking, for training RNNs an extended version of the Backpropagation algorithm, Backpropagation through time has been used [73, 74], but only with partial successes. One of the conceptual limitations of the Backpropagation methods for the recurrent neu- ral networks is that bifurcations can make the training non-converging [14]. Even when they do converge, this convergence is slow, computationally expensive, and can lead to poor local minima. In more recent attempt to train recurrent neural networks the reservoir computing approach [41] was introduced. These networks are in fact dynamic systems driven by the input signal, or from another point of view they are nonlinear filters of the input signal. 16
The idea of reservoir computing has been discovered and investigated independently under the names of echo state networks (ESNs) [28] in machine learning and liquid state machines (LSMs) [42] in computational neuroscience. The work on liquid state machines is rooted in the biological setting of continuous-time, spiking networks, while the ideas of echo state networks were first conceived in the framework of discrete-time, non-spiking networks in engineering applications. It was shown that reservoir computing approaches often work well enough even without full adaptation of all the network weights. Reservoir computing methods usually use a supervised learning scheme for training. In this work, we employ echo state networks as a Q-function for their simplicity in training and their capability of approximating the targets in the presence of delay in dynamic systems [29]. Perhaps surprisingly, this approach has yielded excellent performance in many benchmark tasks, e.g. [27, 32, 33, 68].
4.2 A Brief Overview On Echo State Networks
As mentioned in the previous section, we employed discrete-time echo state networks for the function approximation. The echo state network has M input units, N hidden units (reservoir neurons), and L output units. The data at each time point t is fed to the network through the input units in the form of a vector U(t) = [u_1(t), u_2(t), ..., u_M(t)]^T. The internal values of the reservoir are denoted by X(t) = [x_1(t), x_2(t), ..., x_N(t)]^T, and the output of the network at time point t is denoted by Y(t) = [y_1(t), y_2(t), ..., y_L(t)]^T. Figure 4.1 depicts the three-layered topology of the echo state network: the input layer, the hidden units or reservoir, and the readout layer. There are four types of weight matrices which connect the different echo state network layers together. Three of these weight matrices are initialized once and stay constant during training. These constant weights are described in the following. The input weight matrix Win with N × M dimensions connects the input data to the hidden units. The reservoir weight matrix W with N × N dimensions connects the hidden units to each other and thus defines the recurrent connections. The feedback weight matrix Wfb with N × L dimensions connects the outputs back to the reservoir. The initial values for these constant weight matrices have to be chosen carefully to increase the performance of the echo state network. There are various initialization and hyper-parameter optimization techniques that can be used to initialize the Wfb, Win, and W matrices in a more sophisticated way according to the dynamic task. We will cover some of the important methods for initializing and optimizing these weight values along this chapter. In echo state networks only the output weights Wout with L × (M + N) dimensions are trained. They map the reservoir states to the estimated targets. Later in this chapter, we describe various methods for training the weights Wout. The update rule for changing the state of the reservoir is:
X̃(t + 1) = f(Win U(t + 1) + W X(t) + Wfb Y(t))
X(t + 1) = (1 − α) X(t) + α X̃(t + 1)
Here α is the leaky integrator factor of each hidden unit. As the equations above show, echo state networks (ESNs) are composed of simple additive units with a nonlinear activation function f.
A reservoir with leaky-integrator units has individual state dynamics, which can be exploited in various ways to help the network store the temporal characteristics of a dynamic system. The leaking rate α is the amount of
excitation (signal) that a neuron discards; basically it implements the concept of leakage. This has the effect of smoothing the network dynamics, yielding an increased modelling capacity of the network, for example when dealing with a low-frequency sine wave. The activation function f = (f_1, f_2, ..., f_N) of each hidden unit in the reservoir is usually a sigmoid function, e.g. the logistic activation f(x) = 1 / (1 + e^(−x)) or the hyperbolic tangent f(x) = (e^x − e^(−x)) / (e^x + e^(−x)). It can also be a linear rectifier unit: f(x) = x if x > 0, and 0 otherwise.
Figure 4.1: The topological illustration of the echo state network. Its structure consists of three major parts: first, an input layer containing the input weight matrix Win; second, the middle layer containing the reservoir state matrix and its weight matrix W; and third, the output layer, which contains the readout weight matrix Wout and the readout value Ytarget.
The initial state of the reservoir is usually set to random values, in most cases X(0) = 0, and this introduces an unnatural starting state which is not normally visited once the network has "warmed up" to the task. Therefore, when we feed a sequence of long trajectory samples to the reservoir, we exclude a few computed states of each trajectory from the reservoir state matrix. The number of discarded states depends on the memory of the reservoir and is usually an integer equal to or less than 10% of the input data. The output values of the echo state network are computed according to:
Y(t + 1) = fout(Wout (U(t + 1), X(t + 1)))
Here fout = (fout_1, fout_2, ..., fout_L) is usually a linear function. Depending on the problem, it is also possible to concatenate the input data onto the computed reservoir state at each time point, using direct shortcut connections, and then compute the ESN outputs. Shortcut connections are useful when echo state networks are trained in generative mode, where they try to output the same signal values as the input data. For the purpose of this work, however, we will not use them.
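The following sketch, written in C++ with Eigen as in our implementation, shows a minimal leaky-integrator reservoir implementing the update and output equations above. The feedback weights Wfb, sparsity, and input scaling are omitted for brevity, and the class interface is illustrative rather than the actual code of this work.

#include <Eigen/Dense>
#include <Eigen/Eigenvalues>
#include <cmath>

// Minimal leaky-integrator echo state network; matrix names follow the text.
struct EchoStateNetwork {
    Eigen::MatrixXd Win;   // N x M input weights (fixed after initialization)
    Eigen::MatrixXd W;     // N x N reservoir weights (fixed after initialization)
    Eigen::MatrixXd Wout;  // L x (M + N) readout weights (the only trained part)
    Eigen::VectorXd x;     // current reservoir state X(t)
    double leak;           // leaking rate alpha

    EchoStateNetwork(int M, int N, double spectralRadius, double leakRate)
        : Win(Eigen::MatrixXd::Random(N, M)),
          W(Eigen::MatrixXd::Random(N, N)),
          x(Eigen::VectorXd::Zero(N)),
          leak(leakRate) {
        // Rescale W so that its spectral radius equals the requested value.
        Eigen::EigenSolver<Eigen::MatrixXd> es(W);
        double rho = es.eigenvalues().cwiseAbs().maxCoeff();
        W *= spectralRadius / rho;
    }

    void resetState() { x.setZero(); }   // X(0) = 0 at the start of every trajectory

    // X(t+1) = (1 - alpha) X(t) + alpha f(Win U(t+1) + W X(t)), with f = tanh.
    const Eigen::VectorXd& update(const Eigen::VectorXd& u) {
        Eigen::VectorXd pre = Win * u + W * x;
        Eigen::VectorXd fx = pre.unaryExpr([](double v) { return std::tanh(v); });
        x = (1.0 - leak) * x + leak * fx;
        return x;
    }

    // Y(t) = Wout [U(t); X(t)] with a linear readout.
    Eigen::VectorXd readout(const Eigen::VectorXd& u) const {
        Eigen::VectorXd z(u.size() + x.size());
        z << u, x;
        return Wout * z;
    }
};

When a trajectory is driven through update(), the first few reservoir states (the burn-in) are simply not appended to the design matrix, which implements the washout described above.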
4.2.1 Learning In Echo State Networks
In this work we want to use echo state networks for the function approximation in the Q-learning algorithm because they can hold the history of the dynamic system in their state. The ESN output Y(n) ∈ R^L will therefore be the Q-value of the function approximation, and we try to minimize the error measure E(Y, Ytarget) between the estimated output Y(n) and the target output Ytarget(n). More importantly, we need to make sure that our approximation generalizes well to unseen data. The error measure E can typically be a mean squared error (MSE), for example a root mean square error (RMSE):
E(Y, Ytarget) = (1/L) Σ_{i=1}^{L} sqrt( (1/T) Σ_{n=1}^{T} (y_i(n) − ytarget_i(n))^2 )
which is here averaged over the L dimensions of the output data. The RMSE can be dimension-wise normalized (divided) by the variance of the target data Ytarget(n), producing a normalized root mean square error (NRMSE). The NRMSE has an absolute interpretation: it does not depend on the arbitrary scaling of the target data Ytarget(n), and the value of 1 can be achieved with a simple constant output Y(n) set to the mean value of Ytarget(n). This suggests that a reasonable model of a stationary process should achieve an NRMSE accuracy between zero and one. As mentioned earlier, the only adaptable weight matrix in echo state networks is the output weight matrix Wout, and it is possible to train these weights with different algorithms. In order to train the linear readout weights Wout there are different methods [48, 28, 30, 66, 31]; using the least squares method is a standard offline procedure, formulated as follows:
• Generate the reservoir with a specified set of parameters.
• Feed in the input data U(t) and compute and collect the reservoir states X(t).
• Compute Wout from the reservoir using linear regression, minimizing the mean squared error between Y(n) and Ytarget(n).
• Compute Y(n) again from the trained network by employing Wout and feeding in the input data U(t).
4.2.2 Spectral Radius and Echo State Property
Echo state networks have two major functionalities that make them feasible tools for function approximation in the presence of delays: first, acting as a nonlinear filter that expands the input signal U(t) into a higher-dimensional space X(t); second, acting as a memory that gathers the temporal features of the input signal, an ability that in the literature is called the short-term capacity of the echo state network. The combination of these two capabilities enriches the computed reservoir state matrix X, which contains the history information about the dynamic behaviour of a control system. However, the computed state of the reservoir is influenced by various parameters that have to be set judiciously. These parameters, or hyper-parameters, are: the size of the reservoir N, the sparsity of the weight matrices, the distribution of the nonzero elements of the weight matrices, the spectral radius of the reservoir weight matrix ρ(W), the scaling of the input weight matrix Win, and the leaking rate α. M. Lukosevicius [40] gives an informative review of how to set these parameters and explains their effects on the system performance.
In this work we mainly focus on the memory aspect of the echo state networks and try to choose these parameters according to the memory size needed for the delayed control task. According to Herbert Jaeger [28], a network has the echo state property if the network state X(t) is uniquely determined by any left-infinite input sequence U^{−∞}. After the burn-in phase, which removes the initial transient states, the reservoir becomes synchronized with the input signal U(t), and the current reservoir state X(t) exhibits an involved nonlinear response to new input signals. This phenomenon can be explained by the echo state property: the network forgets its random initial state after the burn-in phase, and the reservoir state becomes a function of the input signal. This is an essential property for the successful training of a reservoir. Yildiz et al. [75] defined a simple recipe to preserve the echo state property, explained in the following steps:
• Generate a weight matrix W0 from a uniform or Gaussian distribution.
• Normalize W0 by its spectral radius λmax: W1 = (1/λmax) W0.
• Scale W1 with a factor 0 < α < 1 and use it as the reservoir matrix W = α W1.
The spectral radius of the reservoir weight matrix W is one of the most important global parameters of the ESN, and it has to be chosen according to, in our case, the needs of the dynamic control task. Usually the spectral radius is related to the input signal, in the sense that if lower time-scale dynamics is expected (fast oscillating signals) then a lower spectral radius might be sufficient. However, if longer memory is necessary then a higher spectral radius will be required. The downside of a larger spectral radius is the longer time needed for the network oscillations to settle down. From an experimental point of view this means a smaller region of optimality when searching for a good echo state network with respect to some dataset. Basically, if the spectral radius is larger than one, it can lead to reservoirs hosting multiple fixed points that violate the echo state property.
4.3 Echo State Fitted-Q Iteration
As we described earlier, an important aspect of echo state networks is their ability to serve as a memory that captures and preserves the temporal behaviour of their input signal. For that reason, we substituted the multilayer perceptron Q-function in the NFQ algorithm with echo state networks and introduced the Echo State Fitted-Q Iteration (ESFQ) algorithm. Compared to Algorithm 1, we changed the MLP Q-function to the ESN Q-function, and the Backpropagation training is replaced by a particular training method for echo state networks. The advantage of the ESFQ algorithm is that the cumulative history of the state-action pairs in the reservoir helps the Q-function deal with the unknown delays in the control task. In order to train the echo state Q-function for a dynamic control task, first the weight matrices need to be initialized. Then the state-action pairs accumulated in the growing batch are fed to the reservoir at the end of each episode. The reservoir dynamics at each point in time are represented by the activation values of its hidden units. These activation values form the reservoir state matrix, or reservoir design matrix, whose rows correspond to the system state-action pairs
at each cycle of each trajectory sample, and whose columns correspond to the computed features. In the burn-in phase, to remove the early noisy and unknown dynamics of the reservoir for each trajectory sample, we remove a few beginning states from the reservoir matrix. Then we compute the readout weights Wout using the teacher values. These teacher values for every state and action sample in a trajectory are generated from the sum of the immediate cost and the Q-value of the successor state. Finally, having the new readout weights Wout, we compute the estimated targets.
4.4 Types Of Readout Layers For Echo State Q-Function
In the Echo State Fitted-Q Iteration algorithm, which is an offline policy learning algorithm, the Q-function plays the role of a supervised regression method. Compared to the original Neural Fitted-Q Iteration algorithm described in section 3.3.2, we use an echo state Q-function with linear least squares on the readout layer. Although linear least squares makes the learning fast, it sometimes struggles with more complex nonlinear problems or higher input dimensions. However, because of the echo state network topology and the fixed values of the input and reservoir weight matrices, it is possible to train the echo state Q-function with different types of readout layers. In the following, we describe two methods for training the output layer of the echo state Q-function that help it to estimate more accurate targets for the purpose of reinforcement learning. In this regard, we investigated the possibility of using ridge regression and multilayer perceptrons.
4.4.1 Ridge Regression
In the standard formulation, the reservoir state matrix, or design matrix, is overdetermined, because the number of trajectory samples T is much larger than the number of hidden units N (N ≪ T). Therefore, the solution of the equation Wout = X^(−1) Ytarget needs to be approximated. To compute a stable solution we apply ridge regression, also known as regression with Tikhonov regularization:
Wout = (X^T X + βI)^(−1) X^T Ytarget
The regularization parameter β prevents the output weights from growing arbitrarily large, which would cause overfitting and instability in the system. This approach penalizes the squared length of the weight vector and transforms the error minimization task into a convex optimization problem that can be solved in analytical closed form. The optimal value of β can vary by many orders of magnitude, depending on the exact instance of the reservoir and the length of the training data. When choosing the value of the regularization parameter β by a simple exhaustive search, it is recommended to search on a logarithmic grid. Furthermore, β can be found using cross-validation or Gaussian processes. It is preferable not to set β to zero, to avoid numerical instability when computing the inverse of the matrix (X^T X)^(−1). Interpretation of the linear readout gives an alternative criterion for setting β directly [12]. Because determining the readout weights is a linear task, the standard recursive algorithms from adaptive filtering for minimizing the mean squared error can be used for online readout weight estimation [30]. The recursive least squares algorithm is widely used in signal processing when fast convergence is of prime priority.
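A sketch of the ridge-regression readout and of the NRMSE measure from section 4.2.1 is given below, assuming the washed-out reservoir (design) matrix X and the targets are stored row-wise in Eigen matrices; it is illustrative only, not the implementation used in the experiments.

#include <Eigen/Dense>
#include <cmath>

// Ridge-regression readout: Wout = (X^T X + beta I)^-1 X^T Ytarget.
// X is the T x N design matrix of collected reservoir features, Ytarget is T x L.
Eigen::MatrixXd trainReadoutRidge(const Eigen::MatrixXd& X,
                                  const Eigen::MatrixXd& Ytarget,
                                  double beta) {
    const int N = X.cols();
    Eigen::MatrixXd A = X.transpose() * X
                      + beta * Eigen::MatrixXd::Identity(N, N);      // regularized normal equations
    // Solve A * Wout^T = X^T Ytarget instead of forming an explicit inverse.
    Eigen::MatrixXd WoutT = A.ldlt().solve(X.transpose() * Ytarget);  // N x L
    return WoutT.transpose();                                         // L x N
}

// Normalized root mean square error, averaged over the L output dimensions.
double nrmse(const Eigen::MatrixXd& Y, const Eigen::MatrixXd& Ytarget) {
    double acc = 0.0;
    for (int i = 0; i < Y.cols(); ++i) {
        double mse  = (Y.col(i) - Ytarget.col(i)).squaredNorm() / Y.rows();
        double mean = Ytarget.col(i).mean();
        double var  = (Ytarget.col(i).array() - mean).square().sum() / Ytarget.rows();
        acc += std::sqrt(mse / var);   // dimension-wise normalization by the target variance
    }
    return acc / Y.cols();
}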
4.4.2 Multilayer Perceptron Readout Layer
The first and major reason we employed echo state networks as a Q-function approximation was their ability to hold the history of changes in the dynamic system behaviour. We need this history to deal with delays in the executed actions of the control problem. However, the Q-value estimated by a linear readout layer of the echo state Q-function might not suffice for complex nonlinear dynamics. The linear least squares readout layer returns a global minimum, which means the estimated targets have the minimum possible mean squared error with respect to the original targets. But a small error value does not always mean we have a successful policy for reinforcement learning. Also, it has been reported [29] that a larger reservoir, even with high randomness in its weight matrices, combined with a linear readout layer might not be able to learn nonlinear dynamic systems. Therefore, we need to utilize more advanced readout layers to learn policies for nonlinear dynamic systems. To do so, we use a multilayer perceptron on the readout layer of the echo state Q-function and train it via the Backpropagation algorithm. The novelty of this approach is that the reservoir expands the system states into a higher-dimensional feature space and also preserves the history of the states. The Backpropagation algorithm then trains the multilayer perceptron to learn the value function, which estimates the proper Q-values for controlling the delayed dynamic system. Figure 4.2 illustrates a topological example of the Echo State Fitted-Q Iteration algorithm.
Figure 4.2: The topological illustration of the Echo State Fitted-Q Iteration algorithm with a multilayer perceptron readout layer. It consists of two major parts. First, the memory part containing the echo state input weight and reservoir weight matrices and their corresponding neurons. Second, the readout layer containing one input layer, two hidden layers of feedforward neural networks, and one output layer. Notice that the applied action goes only through the readout layer, not the reservoir itself.
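To make the interplay of reservoir and nonlinear readout concrete, the following sketch outlines one pattern-generation pass of the kind formalized in Algorithm 3 below: the reservoir is driven sequentially along each stored trajectory, burn-in states are discarded, and the readout (here an MLP treated as an opaque regressor with an assumed evaluate() interface) supplies the successor Q-values for the targets. The Transition type is the one sketched in chapter 3; all names and interfaces are illustrative.

#include <Eigen/Dense>
#include <algorithm>
#include <limits>
#include <vector>

struct Pattern { Eigen::VectorXd input; double target; };

// One trajectory = the time-ordered transitions of a single episode.
using Trajectory = std::vector<Transition>;

template <typename Reservoir, typename Readout>
std::vector<Pattern> generatePatterns(const std::vector<Trajectory>& batch,
                                      Reservoir& esn, const Readout& mlp,
                                      const std::vector<double>& actions,
                                      double gamma, int washout) {
    std::vector<Pattern> P;
    for (const Trajectory& traj : batch) {
        if (traj.empty()) continue;
        esn.resetState();                                        // X(0) = 0 for every episode
        Eigen::VectorXd feat = esn.update(traj.front().state);   // features of the first state
        for (std::size_t k = 0; k < traj.size(); ++k) {
            const Transition& t = traj[k];
            Eigen::VectorXd featNext = esn.update(t.nextState);  // the reservoir carries the history
            if (static_cast<int>(k) >= washout) {                // discard burn-in states
                double best = std::numeric_limits<double>::max();
                for (double a : actions)                         // min_a' Q_k(s', a')
                    best = std::min(best, mlp.evaluate(featNext, a));
                double target = t.terminal ? t.cost : t.cost + gamma * best;
                Eigen::VectorXd in(feat.size() + 1);
                in << feat, t.action;                            // the action only enters the readout
                P.push_back({in, target});
            }
            feat = featNext;
        }
    }
    return P;
}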
When we compare the Echo State Fitted-Q Iteration (ESFQ) algorithm to the NFQ algorithm, it can be seen that our method is a specific version of the NFQ algorithm. As shown in more detail in Algorithm 3, after collecting trajectory samples we first feed them into a reservoir with pre-initialized weight matrices. The reservoir collects the expanded trajectory samples in its state matrix and then, jointly with their corresponding actions, fits them to a multilayer perceptron, i.e. the Q-function in Algorithm 3. The Rprop algorithm trains the multilayer perceptron and tries to minimize the estimation error with respect to the target values. This novel structure is computationally efficient as long as the reservoir has a fairly small size. The immediate cost/reward for each action is based on the system state, not the reservoir state.
Algorithm 3: The main loop of Echo State Fitted-Q Iteration. k counts the number of iterations, k_max denotes the maximum number of iterations. init_MLP() returns a multilayer perceptron with randomly initialized weights. Rprop_training(P) takes a pattern set P and returns a multilayer perceptron that has been trained on P using Rprop as the supervised training method.
procedure ESFQ_Main()
    input: a set of transition samples D; output: Q-value function Q_N
    k = 0
    init_MLP() → Q_0
    init_ESN()
    while k < k_max do
        generate pattern set P = {(input^l, target^l), l = 1, ..., #D} where:
            input^l = (ESN(s^l), a^l)
            target^l = c(s^l, a^l) + γ min_{a′} Q_k(input′^l, a′)
        Rprop_training(P) → Q_{k+1}
        k := k + 1
4.5 SMAC For Hyper Parameter Optimization
There are a number of parameters that need to be carefully selected in order to boost the performance of the Echo State Fitted-Q Iteration algorithm. These are parameters related to the short-term memory capacity of the echo state Q-function. In the following, we name the parameters whose effects need to be studied. The first is the spectral radius of the reservoir, which directly affects the short-term memory capacity of the reservoir. The second is the leaking rate, which determines how fast the neurons in the hidden units shed their dynamics. The third is the reservoir size, or number of hidden units, usually chosen considerably larger than the dimension of the input signal. The fourth is the sparsity percentage of the input and reservoir weight matrices, which affects the computation time and, to some degree, the performance. The fifth is the reservoir activation function, which is usually a logistic, hyperbolic tangent, or rectifier unit. As we can see, the echo state Q-function is a black-box function with various parameters. Most of these parameters take continuous real values, and searching a combinatorial grid of parameter settings to tune them is inefficient and leads to unsatisfactory outcomes. Recently, automated approaches for solving this algorithm configuration problem have led to substantial improvements in the state of the art for various problems.
Sequential Bayesian optimization allows for global optimization of such black-box functions using as few trials as possible. Mockus et al. [44] applied Bayesian methods to cases with linear and nonlinear constraints and to multi-objective optimization. They studied interactive procedures and the reduction of multidimensional data in connection with global optimization using Bayesian methods. Various implementations of this sequential optimization match or surpass manually tuned performance on tasks such as the satisfiability problem [25] or object classification [7]. In this work we employed the SMAC hyper-parameter optimization software, which is implemented in [25] and uses sequential Bayesian optimization methods. In the extended version of the SMAC software [24] it is possible to optimize various types of parameters, e.g. categorical parameters and sets of instances.
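As an illustration of the resulting configuration space, the following sketch lists the ESN hyper-parameters exposed to the optimizer; the ranges in the comments are plausible examples only, not the bounds actually used in our experiments, and evaluateConfiguration() is a hypothetical stand-in for the training-and-evaluation run that SMAC calls.

// ESN hyper-parameters handed to the configuration optimizer (illustrative).
struct EsnHyperParams {
    double spectralRadius;   // e.g. searched in (0, 1.5]; controls short-term memory
    double leakingRate;      // e.g. in (0, 1]; how quickly reservoir neurons forget
    int    reservoirSize;    // number of hidden units N, e.g. 50 ... 1000
    double sparsity;         // fraction of non-zero entries in Win and W
    int    activation;       // 0 = logistic, 1 = tanh, 2 = rectifier (categorical)
};

// SMAC repeatedly proposes a configuration, the learner is trained and evaluated
// with it, and the resulting cost is reported back as the optimization objective.
double evaluateConfiguration(const EsnHyperParams& p) {
    // Train ESFQ with p, run test episodes, and return e.g. the average
    // cost to reach the goal area (stub for illustration).
    return 0.0;
}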
CHAPTER 5 EXPERIMENTS AND RESULTS
5.1 Experimental Setup
In this section, we present the results of our experiments, evaluating the performance of the Echo State Fitted-Q Iteration (ESFQ) algorithm on learning delayed control tasks. The presence of a delay, in particular an action delay, violates the Markov property of a control system. The main goal of this work was to show that echo state networks with a nonlinear readout layer are a practical memory-based function approximation in the non-Markovian reinforcement learning domain. Therefore, we compare the performance of our method to the tapped delay-line memory-based algorithm as a baseline and discuss its advantages and drawbacks. We chose two standard simulated benchmarks, mountain car and inverted pendulum, for evaluating our model-free learning algorithms. We tested our algorithms on plants containing delays in their applied actions with lengths in the range of 1 to 30. We show that the solution we propose with the Echo State Fitted-Q Iteration algorithm can be a very stable and reliable learning tool for systems with unknown delay lengths and complex dynamics. In the rest of this chapter, we introduce our selected benchmarks and their parameter settings for each experiment. After that, we describe the settings of our algorithms, the general scheme of our experiments, and the hyper-parameter optimization for the ESFQ algorithm. Then we compare and analyze the computed results based on their accuracy, efficiency, and ability to deal with complex dynamic systems with delays. We implemented our core methods and algorithms in C++ using the Eigen framework and integrated them into the CLSquare simulation system. For the hyper-parameter tuning, we employed SMAC [24], a versatile tool for optimizing algorithm parameters.
5.2 Simulation Benchmarks
In the realm of reinforcement learning, there are various standard simulated benchmarks [62] used to evaluate algorithm performance. In this work, we used two of these benchmarks to evaluate the performance of our algorithms and collected measurements to compare their abilities in dealing with delay in control systems. The benchmarks were
mountain car and inverted pendulum [62]. The dynamics of balancing a pendulum at an unstable position are relevant to applications such as controlling walking robots or rocket thrusters. These highly nonlinear dynamics make the inverted pendulum a suitable and difficult task for our algorithm to solve. Our second chosen benchmark, the mountain car problem, is a second-order, nonlinear dynamic system with low-dimensional and continuous-valued state and action spaces. The simulated plants are provided by the CLSquare tool, developed and implemented by the Machine Learning Lab of the University of Freiburg. In the following, we briefly introduce each benchmark and its parameter settings for our experiments.

5.2.1 Mountain Car

Mountain car is a standard testing domain in reinforcement learning. The problem is formulated such that an under-actuated car must drive up a steep, parabola-shaped hill. Starting from any initial state in the middle of the valley, the goal is to drive the car up the valley's side and escape in as few steps as possible. Since gravity is stronger than the car's engine, even at full throttle the car cannot simply accelerate up the steep slope. The car must rock back and forth along the bottom of the valley to build enough momentum to escape. The allowed actions are {−4, 4} Newton, driving the car to the left or to the right. The state vector has two dimensions: the car position along the x-axis in meters and the car velocity in m/s. The allowed area for the car is between [−1.2, 0.7] and the system receives a transition cost of 0.01 in this range. The terminal area is in the range of [−∞, −1.2]; there the agent receives a terminal cost of 1.01 and the episode is ended. The target area is in the range of [0.7, ∞]; there the agent receives no cost and the episode is ended. Figure 5.1 depicts the mountain car visualization in the CLSquare simulation.

Figure 5.1: The simulated mountain car visualization. The allowed position for the car is in the range of [−1.2, 0.7] meters and the goal area is in the range of [0.7, ∞] meters. The terminal area for the car is in the range of [−∞, −1.2] meters.

5.2.2 Inverted Pendulum

The inverted pendulum benchmark requires balancing a pendulum of unknown length and mass at the upright position by applying forces to the cart it is mounted
on. The allowed actions are {−10, 10} Newton to move the cart to the left or to the right. Actions can be given at intervals of 0.05 s. The system state is described by a four-dimensional vector: pole angle θ, pole angular velocity θ̇, cart position P, and cart translational velocity Ṗ. The cart can move in the range of [−2.4, 2.4] meters; anything outside this range is considered terminal area and the episode is ended. The pole is allowed to rotate within [−0.7, 0.7] radians. While the inverted pendulum moves within the allowed range, the system receives a transition cost of 0.01; the cost for entering the terminal area is 1.01. The target area for the pole is defined as the range [−0.05, 0.05] radians, and for the cart as the range [−0.05, 0.05] meters. There, the system receives no transition cost and the episode ends. Figure 5.2 depicts the inverted pendulum visualization in the CLSquare simulation.

Figure 5.2: The inverted pendulum visualization in the CLSquare simulation. The goal areas for the pole and the cart are indicated. The allowed area for the cart is in the range of [−2.4, 2.4] meters and for the pole in the range of [−0.7, 0.7] radians. Outside the allowed area is considered the terminal area. The goal area for the cart is in the range of [−0.05, 0.05] meters and the goal area for the pole is in the range of [−0.05, 0.05] radians.

5.3 Experimental Procedure And Parametrisation

The process by which the Echo State Fitted-Q Iteration algorithm learns a control task consists of the following steps. First, we initialize the echo state network's input and reservoir weight matrices, W_in and W. The trajectory samples generated in each episode, containing the plant state variables in vector form, are passed through the reservoir at the end of the episode and expanded to a higher-dimensional feature space with the same dimensionality as the reservoir state. These expanded samples and their corresponding actions are accumulated in a growing batch. The trajectory samples are generated by a simulated plant containing unknown delay lengths in its executed actions, at time intervals of 0.05 s. At the end of each episode, the batch is fed to a multilayer perceptron whose number of input units equals the number of reservoir hidden units. The data flows through two additional hidden layers with 20 neurons each and one output layer for the Q-value. The activation function of the multilayer perceptron is logistic. Finally, the Rprop algorithm updates the weights of the multilayer perceptron. We designed our finite-horizon experiments on the introduced benchmarks such that the reinforcement learning algorithm tries to minimize the time the agent needs to reach the goal area.
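The state-expansion step of this procedure can be sketched as a standard leaky-integrator echo state update. The exact update used in this work is the one defined in Chapter 4, so the helper below should be read as an illustrative assumption rather than a copy of the implementation; Win and W denote the fixed input and reservoir weight matrices, alpha the leaking rate, and transientSteps the burn-in length discussed in the next subsection.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>
#include <Eigen/Dense>

// Drives a leaky-integrator reservoir along one trajectory of plant states and
// returns the sequence of reservoir states (a sketch; Win and W are the fixed
// input and reservoir weight matrices, alpha is the leaking rate).
std::vector<Eigen::VectorXd> expandTrajectory(
        const std::vector<Eigen::VectorXd>& states,
        const Eigen::MatrixXd& Win, const Eigen::MatrixXd& W,
        double alpha, int transientSteps) {
    Eigen::VectorXd x = Eigen::VectorXd::Zero(W.rows());
    std::vector<Eigen::VectorXd> expanded;
    for (std::size_t t = 0; t < states.size(); ++t) {
        Eigen::VectorXd pre = Win * states[t] + W * x;
        Eigen::VectorXd activated =
            pre.unaryExpr([](double v) { return std::tanh(v); });
        x = (1.0 - alpha) * x + alpha * activated;  // leaky integration
        // The first few steps only wash out the arbitrary initial reservoir
        // state ("burn-in") and are not used for learning.
        if (static_cast<int>(t) >= transientSteps)
            expanded.push_back(x);
    }
    return expanded;
}
```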
The discount factor γ for our Q-learning algorithm was set to 0.98 and no exploration strategy was used during training.

5.3.1 Echo State Q-Function

As we described in the previous section, in the Echo State Fitted-Q Iteration algorithm the reservoir of the echo state network works as memory for the Q-function and the multilayer perceptron as its readout layer. For our memory-based approach it was essential to have sufficient memory capacity for holding the history of delayed states. Here we benefited from the short-term memory capacity of the echo state Q-function. However, the Q-function parameters needed to be specified according to the length of the delay and the dynamics of the control task at hand.

Table 5.1: The list of parameters for initializing echo state networks.

  Name                        | Description and value range                                | Optimized
  Spectral radius             | Real value in [0, 1], adjusts memory size                  | True
  Leaking rate                | Real value in [0, 1], adjusts the dynamics of each neuron  | True
  Reservoir size              | Integer value in {10, 20, 30, 40, 50, 100, 200}            | True
  Transition steps            | Integer, set to 5% or 10% of the cycle length              | False
  Regularization coefficient  | Real value in [10e-3, 10e3]                                | True
  Activation function         | Logistic, Tanh, Rectifier                                  | True
  Input weight sparsity       | Real value in [0.2, 1.0]                                   | True
  Reservoir weight sparsity   | Real value in [0.2, 1.0]                                   | True

A list of the echo state network parameters is given in Table 5.1. The spectral radius and the leaking rate together affect the short-term memory length of the network. Basically, for a system with fast oscillations like the inverted pendulum we need a shorter memory length than for slower systems like the mountain car. A high spectral radius and/or a very small leaking rate increases the short-term memory length of the echo state network. The transition steps are the number of time steps at the beginning of each trajectory that are used to burn in the reservoir; these states are removed before learning. The burn-in phase helps the reservoir wash out its initial transient dynamics and become input-driven rather than influenced by its arbitrary initial state. The length of the transition steps depends linearly on the trajectory length, usually 5 to 10 percent of the average trajectory length. The reservoir size depends on the dimension of the input signal and the complexity of the system dynamics; for some tasks it needs to be set to a very large number. The regularization coefficient and the activation function need to be searched for with respect to the control system at hand. The standard choice for the activation function is the hyperbolic tangent. A higher value of the regularization coefficient is used to prevent the linear readout weights from growing exponentially. Typically it is a real number on a logarithmic scale in the range
of [0.00001, 1]. In the following we discuss the initialization of the echo state network weight matrices, which play a major role in capturing the system dynamics.

Figure 5.3: On the left, the reservoir weight matrix is initialized from a normal distribution with zero mean and unit variance and then normalized to a spectral radius of 0.54. On the right, the reservoir weight matrix is initialized from a normal distribution with zero mean and unit variance; then an orthogonal matrix is computed using the Schur decomposition implemented in the Eigen library and normalized to a spectral radius of 0.54. Finally, 40 percent of its coefficients are set to zero.

The sparsity of the echo state weight matrices boosts performance by extracting more distinguishable features from the input signals. In Figure 5.3 reservoir weight matrices for 20 hidden units are illustrated. The matrix on the left is initialized using a normal distribution with zero mean and unit variance, normalized to a spectral radius of 0.54, and fully connected. In fully connected reservoirs, all reservoir neurons are directly connected to each other, with all connection weights being nonzero. The more sparsely connected matrix in the right graph was generated as follows. First, a fully connected reservoir was generated in the same way as the first one. Then an orthogonal matrix was computed from it using the Schur decomposition. Finally, given the sparsity percentage, 40% in this example, randomly chosen coefficients were set to 0. We need to consider that if the sparsity percentage is set very high, the reservoir no longer exists as a group of mutually connected neurons. Usually, larger reservoirs (more hidden units) tolerate a lower connectivity and still fulfill this requirement. The advantage of a sparse orthogonal reservoir weight matrix is that it has linearly independent columns, and the echo states computed with this matrix can better capture the dynamics of the system. The reservoir weight matrix W functions as a dynamic basis for the original input signal; these basis directions decompose the input signals into pertinent sub-components to maximize their differences.
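The initialization recipe described above and in Figure 5.3 (normally distributed entries, the orthogonal factor of a Schur decomposition, rescaling to a target spectral radius, and random sparsification) can be sketched with Eigen roughly as follows. This helper is an illustrative reconstruction, not the thesis code; the sparsity argument denotes the fraction of coefficients set to zero.

```cpp
#include <random>
#include <Eigen/Dense>

// Builds a sparse, approximately orthogonal reservoir weight matrix:
// 1) draw entries from N(0, 1), 2) take the orthogonal factor of a real Schur
// decomposition, 3) rescale to the desired spectral radius, 4) zero out a
// given fraction of the coefficients.
Eigen::MatrixXd initReservoir(int size, double spectralRadius,
                              double sparsity, unsigned seed) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> normal(0.0, 1.0);

    Eigen::MatrixXd W = Eigen::MatrixXd::NullaryExpr(
        size, size, [&]() { return normal(gen); });

    // Orthogonal factor U of the real Schur decomposition W = U T U^T.
    Eigen::RealSchur<Eigen::MatrixXd> schur(W);
    W = schur.matrixU();

    // Rescale so that the largest absolute eigenvalue equals spectralRadius.
    Eigen::EigenSolver<Eigen::MatrixXd> es(W);
    double rho = es.eigenvalues().cwiseAbs().maxCoeff();
    W *= spectralRadius / rho;

    // Randomly set the requested fraction of coefficients to zero.
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    for (int i = 0; i < size; ++i)
        for (int j = 0; j < size; ++j)
            if (uniform(gen) < sparsity)
                W(i, j) = 0.0;
    return W;
}
```

Note that zeroing coefficients after the rescaling slightly changes the effective spectral radius; this matches the order of operations described for Figure 5.3.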
In summary, we suggest that the performance of the echo state Q-function highly depends on a proper balance between the internal dynamics and the influence of the external activities, mediated via the input weights, and on the strengths of the input and target activities. Since the external activities cannot be directly manipulated, the initialization of the reservoir and input weight matrices is important. If the input weight values are set too high, they dominate the dynamics unfolding in the reservoir and quickly overrule any other useful internal dynamics. A proper choice of the weight initialization intervals is necessary to generate a well-performing echo state Q-function.

5.3.2 Hyper-parameter Optimization

The choice of parameters depends on the delay length, which is unknown to the Echo State Fitted-Q Iteration algorithm. Therefore, we had to search over the parameter space to find out which configuration provides sufficient memory length for our algorithm. Hence, we used the hyper-parameter optimization tool SMAC to search over the following parameters: spectral radius, leaking rate, reservoir size, non-linear function for the reservoir neurons, input weight matrix sparsity, and reservoir weight matrix sparsity. These parameters are interdependent and affect one another. The parameter optimization algorithm tries to minimize a loss function over sets of discrete and continuous values. To compute the loss value we trained our algorithm for 120 episodes of 200 cycles each on the inverted pendulum task with an unknown action delay, given a particular parameter set. The algorithm was tested 100 times with different initial positions at the end of each episode. A successful test was one that ended up in the goal area in the finite-horizon task. The averaged success rate after 120 episodes, subtracted from one, was returned as the loss value. SMAC searched over this parameter space using Bayesian methods of global optimization to find settings that provide minimum expected deviation from the global minimum. Parameter tuning for a reinforcement learning algorithm is computationally expensive and has to be done separately for each delay length. Therefore, in the scope of this work, we only ran a few optimization iterations, roughly 60 to 80 per delay length and 2500 experiments in total, to find different good settings.

Figure 5.4 shows the Pearson correlation coefficients for the 6 hyper-parameters optimized via SMAC. Each cell shows the correlation coefficient between the parameters of its corresponding column and row; the labels on both axes are the parameter names. Coefficient values range between [−1, 1], with a higher absolute value indicating stronger correlation and the sign indicating direct or inverse correlation. The coefficients are computed from samples with at least 80% goal reach. The value range for each parameter is given in Table 5.1. Our results showed that the most frequently chosen activation function was the logistic activation, therefore we exclude it from the correlation computation. The goal of this optimization was to find the configuration which gives the echo state Q-function enough short-term memory capacity to learn the delayed control task. Therefore, we include the corresponding delay length for each experiment to observe its correlation with the other parameters.
Figure 5.4: This plot shows the pairwise correlation matrix for the 6 optimized hyper-parameters, computed using the Pearson product-moment method. The labels in the figure are assigned to their corresponding parameters. The delay length, third column, was kept constant during each optimization iteration and its value ranges between [1, 30]. The dataset from which the correlations are computed consists of the samples with 80% or more goal reach, 1561 experiments, for the simulated inverted pendulum plant.

There are a few observations we can make by looking at the bivariate correlation matrix in Figure 5.4. First, the reservoir and input weight matrix sparsities are positively correlated. This means that across experiments with different delay lengths their values increase and decrease together; however, this correlation is not strong. Contrary to that, the leaking rate and the spectral radius are inversely correlated. Although this is a weak correlation, it implies that a reservoir with a high spectral radius needs a lower leaking rate for holding the history of delayed actions. According to the update equation of the reservoir, a high leaking rate means the new reservoir state is influenced less by the previous state, while a high spectral radius means the previous reservoir state has more influence on the next computed state. Another observation is that the delay length in this matrix has a weak negative correlation with the spectral radius, the leaking rate, and the input weight matrix sparsity. Our bivariate correlation matrix also shows that the delay length and the reservoir size have almost zero correlation. Furthermore, we can see that the reservoir size and the spectral radius have a positive correlation. Both of these observations are counter-intuitive because, as described before, for a reservoir with a large number of hidden units we would expect a relatively small spectral radius, and for a longer delay length we would need a higher spectral radius to provide enough short-term memory capacity. Therefore, we can safely conclude that these parameters are interdependent and have a multivariate dependency among them which cannot be understood using only a bivariate correlation computation. To analyze the effects of these parameters on the ESFQ performance in detail we would need to run thorough hyper-parameter optimization experiments with an extensive number of iterations for each delay length. Then, by knowing their multivariate correlations, we could find an efficient way of selecting them. However, in the scope of this work, for selecting these parameters
we need to rely on computationally expensive hyper-parameter optimization performed over our reinforcement learning algorithm.

Figure 5.5: The figure shows histogram plots of the goal-reaching percentage for the spectral radius, leaking rate, reservoir weight matrix sparsity, and input weight matrix sparsity. Each bin on the Y-Axis shows how frequently and with what percentage the system reaches the goal given the corresponding subject parameter on the X-Axis. There are in total 2500 experiments on the simulated inverted pendulum containing action delays in the range of [1, 30].

In Figure 5.5 we see histogram plots for the spectral radius, leaking rate, reservoir weight matrix sparsity, and input weight matrix sparsity relative to reaching the goal area. In total, there were 1561 configuration samples with more than 80% success rate out of the 5600 different configurations that we tried, on average about 50 samples per delay length. As we can see in the top-left plot, spectral radius values in the range of [0.0001, 0.3] dominantly reached the goal area with a high success rate. For the leaking rate, in the top-right plot, the successful experiments cover a broader range of [0.0001, 0.9]. For both weight matrix sparsities, the successful experiments frequently occur in the range of [0.4, 0.8]. This illustration helps us elicit some general rules for initializing our hyper-parameters, put constraints on their value ranges, and make the search more efficient. Figure 5.6 depicts the mean and variance of the four echo state hyper-parameters relative to the goal reach percentage for experiments above 80% success rate. The top row compares the spectral radius and the leaking rate against the goal-reaching percentage; the bottom row compares the reservoir and input weight matrix sparsities. The frequency of successful experiments for each of these values can be seen in the corresponding histogram plots in Figure 5.5. As we can see, for a wide range of these subject parameters there are successful experiments. This further supports
the presence of multivariate interdependencies between these parameters.

Figure 5.6: The figure shows the mean and variance of the goal reach percentage on the Y-Axis, given the value of the particular subject parameter on the X-Axis. The subject parameters are the spectral radius, leaking rate, reservoir weight matrix sparsity, and input weight matrix sparsity. The samples are selected from experiments with more than 80% goal reach. There are in total 2500 experiments on the simulated inverted pendulum containing action delays in the range of [1, 30].

Our goal in this section was to suggest that there are rules for efficiently choosing some of the hyper-parameters of the ESFQ algorithm. According to our parameter optimization results, deducing such rules is not a straightforward task due to the multivariate dependencies between these parameters, and it would require an extensive study of the parameter space which is beyond the scope of this work. Still, in order to derive some general constraints for setting up and computing a reservoir for a particular task, our results suggest the following (a configuration sketch reflecting these constraints follows the list):

• Reservoir size: the number of hidden units can be set up to 10 times larger than the size of the input vector. This is usually a large enough number to capture and reflect the system dynamics.
• The input weight matrix and the reservoir weight matrix should be sparse; as we saw in Figure 5.5, sparsity values between 40 and 60 percent work well.
• Our experimental results suggest that the logistic activation works well for the reservoir activation function (1514 out of 1561 successful experiments).
• The leaking rate and the spectral radius have a weak inverse correlation, but it is beyond the scope of this work to comment on the choice of their values.
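A minimal configuration sketch that encodes these rough constraints is given below. The default values are only loose starting points read off the histograms in Figure 5.5 and should be treated as assumptions; in particular, the spectral radius and leaking rate still have to be tuned jointly, as discussed above.

```cpp
#include <string>

// Rough starting constraints for an echo state Q-function configuration,
// derived from the observations around Figure 5.5 (a sketch, not tuned values).
struct EsfqConfig {
    int inputDimension = 0;                // plant state dimension
    int reservoirSize = 0;                 // up to ~10x the input dimension
    double inputSparsity = 0.5;            // zeroed fraction, roughly 0.4-0.6
    double reservoirSparsity = 0.5;        // zeroed fraction, roughly 0.4-0.6
    std::string activation = "logistic";   // most frequent choice in our runs
    double spectralRadius = 0.2;           // tune jointly with the leaking rate
    double leakingRate = 0.5;              // tune jointly with the spectral radius
};

// Example: a configuration for the inverted pendulum (4-dimensional state).
EsfqConfig makePendulumConfig() {
    EsfqConfig cfg;
    cfg.inputDimension = 4;
    cfg.reservoirSize = 10 * cfg.inputDimension;
    return cfg;
}
```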
Basically, these results suggest that it is possible to choose the algorithm parameters according to the characteristics of a given dynamic control task. But to do so we would need to run more experiments and collect enough samples per delay length. We might run 300 SMAC iterations for each delay length and then study the change in the parameters relative to how well the controller learns the task. Then it would be possible to fix the values of some of these parameters and only search over particular ones. In the last chapter of this work, we mention some probable solutions for optimizing these parameters and better understanding their dependencies.

5.4 Comparisons To Delay-Line NFQ Algorithm

In this section, we compare the results of our method to the standard delay-line Neural Fitted-Q Iteration algorithm. Here we used a Q-function with two layers of hidden units, 20 neurons each, proven [54] to be a feasible structure for learning dynamic control tasks. In the delay-line method, the action delay lengths are known to the algorithm. The system history is a collection of observed states consisting of the most recent state-action pair and a number of previous states equal to the delay length. This means that if there is a 3-step action delay in the simulated plant, the input vector contains the current state-action pair and the past three states, as illustrated in the sketch below. The NFQ algorithm was trained on the simulated inverted pendulum benchmark for 200 episodes of 200 cycles each, using Rprop and one update per episode. For the sake of comparison, we added the same nonlinear structure as in the NFQ algorithm on top of our reservoir in the Echo State Fitted-Q Iteration algorithm. As described in chapter 4, the difference between NFQ and our method is that we preserve the memory by using the reservoir of the echo state network, and we employed the hyper-parameter optimization tool SMAC to find a suitable memory capacity for the echo state Q-function. In simple terms, we compared the performance of the reservoir as a memory to the classical tapped delay-line memories [65]. Although the Echo State Fitted-Q Iteration algorithm is not aware of the delay length, with the aid of the hyper-parameter optimization tool SMAC it manages to find the best configuration. This approach can be considered equivalent to giving the delay length to the system.
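The history window of the delay-line baseline described above can be sketched as a simple concatenation. The helper below is an illustrative assumption of how such an input vector might be assembled; it is not the CLSquare or NFQ implementation, and it presupposes that the delay length is known, exactly as in the baseline setup.

```cpp
#include <deque>
#include <Eigen/Dense>

// Builds the tapped delay-line input for the baseline NFQ: the current
// state-action pair concatenated with the previous `delay` states.
// `history` is assumed to hold at least `delay` past states, newest last.
Eigen::VectorXd delayLineInput(const Eigen::VectorXd& currentState,
                               double currentAction,
                               const std::deque<Eigen::VectorXd>& history,
                               int delay) {
    const int stateDim = static_cast<int>(currentState.size());
    Eigen::VectorXd input(stateDim * (delay + 1) + 1);
    input.segment(0, stateDim) = currentState;
    input(stateDim) = currentAction;
    // Append the `delay` most recent past states, newest first.
    for (int k = 0; k < delay; ++k) {
        const Eigen::VectorXd& past = history[history.size() - 1 - k];
        input.segment(stateDim + 1 + k * stateDim, stateDim) = past;
    }
    return input;
}
```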
Figure 5.7: Maximum percentage of goal reach achieved by the delay-line NFQ and ESFQ algorithms for different delay lengths. The results were obtained on the simulated inverted pendulum with action delays in the range of [1, 30]. The X-Axis shows the delay length and the Y-Axis shows the maximum percentage of goal reach.

In Figures 5.7 and 5.8 we present a comparison between the delay-line Neural Fitted-Q Iteration and the Echo State Fitted-Q Iteration algorithms in the test phase, with different initial values and different delay lengths. In the first graph, we can observe that the ESFQ algorithm has the advantage of achieving the maximum value even for long delay lengths. On the other hand, we can observe that the delay-line NFQ faces difficulties beyond 10-step action delays, with its peak performance decaying down to 40%. Such observations suggest that the reservoir is a very suitable memory-based approach for control systems with longer delay lengths. For every delay length it reaches the goal area, from different initial positions in the test phase, on average more than 90% of the time, and in some cases 100% of the time. Furthermore, in the second plot (Figure 5.8) we can observe that the ESFQ algorithm is fairly robust to randomness in the initialization of its fixed weight matrices. Here, for every delay length, we trained both algorithms initialized with 10 different random seeds and plotted the mean and standard deviation of the goal reach percentage over the delay lengths. As we can see, the performance of the NFQ algorithm is prone to decay when it is trained with different random seeds, especially for longer delay lengths. On the contrary, the ESFQ algorithm performs more robustly in the face of such random initialization even for longer delay lengths. In this regard, we need to consider that the echo state performance might be expected to decay because it has extra random parameters in its weight matrices. However, given the same configuration for initializing the reservoir over various random seeds, our algorithm shows much better overall performance. Furthermore, the similar range of performance variances of both algorithms is further evidence for the robustness of the ESFQ algorithm toward randomness in its parameter initialization. With our experiments, we aim to support the idea that the echo state Q-function is a fairly reliable memory-based function approximation for dynamic systems with delays, and the observations about randomness support the idea that the memory capacity of the echo state Q-function is unlikely to be strongly affected by random initialization of its parameters.
Figure 5.8: Mean and standard deviation of the goal reach percentage achieved by the delay-line NFQ and ESFQ algorithms for different delay lengths. The results were obtained from experiments with the simulated inverted pendulum containing action delays in the range of [1, 30]. The X-Axis shows the delay length and the Y-Axis shows the mean and standard deviation of the goal reach percentage, averaged over experiments with 10 different random seeds. The mean values are computed from the maximum percentage of goal reach of each experiment.

In Figure 5.9 we compare the learning curves of the ESFQ and NFQ algorithms for the experiments with a delay length of 27. The values are averaged over 10 different random seeds. Here we can see that, although the echo state Q-function has more random parameters, overall the ESFQ algorithm performs better than the NFQ algorithm. In the early episodes the delay-line NFQ shows some minor improvement relative to the ESFQ, but after 60 episodes the average performance of the NFQ stays the same while the ESFQ performance keeps increasing. Furthermore, we can see that the ESFQ has a higher standard deviation than the delay-line NFQ.
Figure 5.9: The learning curves of the delay-line NFQ and ESFQ algorithms. The results were obtained from experiments with the simulated inverted pendulum containing a 27-step action delay. The X-Axis shows the episode index and the Y-Axis shows the mean and standard deviation of the goal reach percentage, averaged over experiments with 10 different random seeds. The mean values are computed from the goal reach percentage of the corresponding episodes over all experiments.

In Figure 5.10 we compare how fast each algorithm reaches a success rate of 80% or more. As we can see, the Echo State Fitted-Q Iteration algorithm on average needs 100 episodes to reach the minimum 80% success rate. Although the delay-line NFQ algorithm reaches the maximum success rate faster for some short delay lengths, it fails for most delay lengths larger than 15. This indicates that the memory provided by the reservoir contains information that helps the Echo State Fitted-Q Iteration algorithm to perform more successfully and to reach higher performance more quickly. However, we should remember that in order to find a suitable parameter setting for the ESFQ algorithm we still need to use hyper-parameter optimization tools, as explained in the previous section. From all these observations we can conclude that the Echo State Fitted-Q Iteration algorithm with a nonlinear readout layer is a reliable algorithm for learning dynamic control tasks with delays, and its performance stays relatively stable regardless of the randomness in the parameter initialization.
Figure 5.10: The number of episodes the ESFQ and NFQ algorithms take to reach a success rate above 80%. The X-Axis corresponds to the delay length and the Y-Axis to the episode number. The experiments were done on the simulated inverted pendulum for each given delay length, for 120 episodes of 200 cycles each. For some delay lengths the delay-line NFQ did not manage to reach the minimum 80% success rate.

5.5 Why Not Linear Readout Layers

In the previous section, we showed the advantages of the Echo State Fitted-Q Iteration algorithm with a nonlinear readout layer and compared its performance to the tapped delay-line NFQ algorithm. As described in section 4.4, it is possible to train different readout layers for the ESFQ algorithm, and in the following we present results for training the algorithm with various readout layers. Here we show the performance evaluation for ridge regression and very large reservoirs. Our aim is to illustrate that a linear method for training the echo state Q-function, in offline growing-batch mode, performs well on control tasks with simpler dynamics such as the mountain car but fails to learn a proper policy on more complex dynamic tasks such as the inverted pendulum.

5.5.1 Ridge Regression On Mountain Car

The first choice for training the echo state Q-function is the ridge regression algorithm. We designed an experiment with the simulated mountain car containing a 2-step action delay. In Figure 5.11 we can see how the dynamics of the mountain car are captured by the reservoir and how they change over time given the input signal, i.e. the car position, the car velocity, and the applied action. The ESFQ algorithm with a linear readout layer achieves 100% test success in reaching the goal area, starting from 100 different initial positions, after 17 episodes of 300 cycles each. The effects of the delayed actions are reflected in the reservoir states in the top plot. The hidden units adapt their states according to the changes in the car position and velocity. It took 290 cycles for the controller to reach the goal area. The other parameters of this experiment were set as follows: the spectral radius is 0.40, the reservoir size is 20, the regularization coefficient is 10, 30 steps of initial transitions, γ is 0.98, with a 10% greedy exploration rate, and