The document discusses reinforcement learning and the exploration-exploitation tradeoff that agents face. It proposes learning exploration-exploitation strategies rather than relying on predefined formulas. The approach defines training problems, candidate strategies parameterized by formulas, a performance criterion, and optimizes strategies on training problems using an estimation of distribution algorithm. Simulation results show learned strategies outperform strategies from common formulas on matching test problems.
Learning for exploration-exploitation in reinforcement learning. The dusk of the small formulas’ reign.
1. Learning for exploration-exploitation
in reinforcement learning.
The dusk of the small formulas’ reign.
Damien Ernst
University of Li`ege, Belgium
25th of November 2011
SEQUEL project, INRIA Lille - Nord Europe.
2. Exploration-exploitation in RL
Reinforcement Learning (RL): an agent interacts with an environment in which it perceives information about its current state and takes actions. The environment, in return, provides a reward signal. The agent's objective is to maximize the (expected) cumulative reward signal. It may have a priori knowledge of its environment before taking an action and has limited computational resources.
Exploration-exploitation: The agent needs to take actions so as to gather information that may help it to discover how to obtain high rewards in the long term (exploration) and, at the same time, it has to exploit its current information as well as possible to obtain high short-term rewards.
3. Pure exploration: The agent first interacts with its environment
to collect information without paying attention to the rewards.
The information is used afterwards to take actions leading to
high rewards.
Pure exploration with model of the environment: Typical
problem met when an agent with limited computational
resources computes actions based on (pieces of) simulated
trajectories.
4. Example 1: Multi-armed bandit problems
Definition: A gambler (agent) has T coins, and at each step he
may choose among one of K slots (or arms) to allocate one of
these coins, and then earns some money (his reward)
depending on the response of the machine he selected.
Rewards are sampled from an unknown probability distribution
dependent on the selected arm. Goal of the agent: to collect
the largest cumulated reward once he has exhausted his coins.
Standard solutions: Index based policies; use an index for
ranking the arms and pick at each play the arm with the highest
index; the index for each arm k is a small formula that takes
for example as input the average rewards collected (rk ), the
standard deviations of these rewards (σk ), the total number of
plays t, the number of times arm k has been played (Tk ), etc.
Upper Confidence Bound index as example: r̄_k + √(2 ln t / T_k).
5. Example 2: Tree search
Tree search: Pure exploration problem with model of the
environment. The agent represents possible strategies by a
tree. He cannot explore the tree exhaustively because of
limited computational resources. When computational
resources are exhausted, he needs to take an action. Often
used with time receding horizon strategies.
Example of tree exploration algorithms: A*, Monte Carlo Tree Search, optimistic planning for deterministic systems, etc. In all these algorithms, terminal nodes to be developed are chosen based on small formulas.
6. Example 3: Exploration-exploitation in a MDP
Markov Decision Process (MDP): At time t, the agent is in state
xt ∈ X and may choose any action ut ∈ U. The environment
responds by randomly moving into a new state
xt+1 ∼ P(·|xt , ut ), and giving the agent the reward rt ∼ ρ(xt , ut ).
An approach for interacting with MDPs when P and ρ are unknown: a layer that learns (directly or indirectly) an approximate state-action value function Q : X × U → R, on top of which comes the exploration/exploitation policy. Typical exploration-exploitation policies are made of small formulas.
ǫ-Greedy: u_t = arg max_{u∈U} Q(x_t, u) with probability 1 − ǫ, and u_t taken at random with probability ǫ.
Softmax policy: u_t ∼ P_sm(·) where P_sm(u) = exp(Q(x_t, u)/τ) / Σ_{u'∈U} exp(Q(x_t, u')/τ).
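To make the two policies concrete, here is a minimal Python sketch (not from the slides), assuming a tabular Q stored as a dictionary keyed by (state, action); the function names are illustrative.

```python
import math
import random

def epsilon_greedy(Q, x, actions, epsilon=0.1):
    """With probability 1 - epsilon play the greedy action, otherwise a uniformly random one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda u: Q[(x, u)])

def softmax_policy(Q, x, actions, tau=1.0):
    """Sample an action with probability proportional to exp(Q(x, u) / tau)."""
    weights = [math.exp(Q[(x, u)] / tau) for u in actions]
    threshold = random.random() * sum(weights)
    acc = 0.0
    for u, w in zip(actions, weights):
        acc += w
        if threshold <= acc:
            return u
    return actions[-1]

# Example: Q = {(0, "left"): 0.2, (0, "right"): 1.0}; epsilon_greedy(Q, 0, ["left", "right"])
```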
7. Pros and cons for small formulas
Pros: (i) Lend themselves to a theoretical analysis (ii) No
computing time required for designing a solution (iii) No a priori
knowledge on the problem required.
Cons: (i) Formulas published not used as such in practice and
solutions often engineered to the problem at hand (typically the
case for Monte-Carlo Tree Search (MCTS) techniques) (ii)
Computing time often available before starting the
exploration-exploitation task (iii) A priori knowledge on the
problem often available and not exploited by the formulas.
8. Learning for exploration-(exploitation) in RL
1. Define a set of training problems (TP) for
exploration-(exploitation). A priori knowledge can (should) be
used for defining the set TP.
2. Define a rich set of candidate exploration-(exploitation)
strategies (CS) with numerical parametrizations/with formulas.
3. Define a performance criterion PC : CS × TP → R such that
PC(strategy, training problem) gives the performance of
strategy on training problem.
4. Solve arg max_{strategy ∈ CS} Σ_{training prob. ∈ TP} PC(strategy, training prob.) using an optimisation tool.
NB: approach inspired by current methods for tuning the parameters of
existing formulas; differs by the fact that (i) it searches in much richer spaces of
strategies (ii) the search procedure is much more systematic.
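As a minimal sketch of what steps 3-4 amount to computationally (illustrative names, not the authors' code): the learned strategy maximizes the performance criterion summed over the training problems; in practice a global optimizer such as the EDA discussed later performs the search.

```python
def total_performance(strategy, training_problems, PC):
    """Objective of step 4: sum of PC(strategy, problem) over the training set TP."""
    return sum(PC(strategy, problem) for problem in training_problems)

def learn_strategy(candidate_strategies, training_problems, PC):
    """Brute-force version of step 4 for a finite candidate set CS; a continuous
    parameter space instead calls for a global optimizer (e.g. an EDA)."""
    return max(candidate_strategies,
               key=lambda s: total_performance(s, training_problems, PC))
```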
9. Example 1: Learning for multi-armed bandit
problems
Multi-armed bandit problem: Let i ∈ {1, 2, . . . , K} be the K
arms of the bandit problem. νi is the reward distribution of arm
i and µi its expected value. T is the total number of plays. bt is
the arm played by the agent at round t and r_t ∼ ν_{b_t} is the reward it obtains.
Information: Ht = [b1, r1, b2, r2, . . . , bt , rt ] is a vector that
gathers all the information the agent has collected during the
first t plays. H the set of all possible histories of any length t.
Policy: The agent’s policy π : H → {1, 2, . . . , K} processes at
play t the information Ht−1 to select the arm bt ∈ {1, 2, . . . , K}:
bt = π(Ht−1).
10. Notion of regret: Let µ* = max_k µ_k be the expected reward of the optimal arm. The regret of π is: R^π_T = T·µ* − Σ_{t=1}^{T} r_t. The expected regret is E[R^π_T] = Σ_{k=1}^{K} E[T_k(T)]·(µ* − µ_k), where T_k(T) is the number of times the policy has drawn arm k on the first T rounds.
Objective: Find a policy π* that for a given K minimizes the expected regret, ideally for any T and any {ν_i}_{i=1}^{K} (equivalent to maximizing the expected sum of rewards obtained).
Index-based policy: (i) During the first K plays, play sequentially each arm (ii) For each t > K, compute for every machine k the score index(H^k_{t−1}, t), where H^k_{t−1} is the history of rewards for machine k (iii) Play the arm with the largest score.
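A minimal sketch of such an index-based policy (illustrative, not the authors' implementation): each arm is a callable returning a sampled reward, and index_fn scores an arm from its observed reward history and the current play count.

```python
import math
import random

def play_bandit(arms, T, index_fn):
    """Run an index-based policy for T plays and return the cumulated reward."""
    K = len(arms)
    histories = [[] for _ in range(K)]
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1                                                   # (i) play each arm once
        else:
            k = max(range(K), key=lambda j: index_fn(histories[j], t))  # (ii)-(iii)
        r = arms[k]()
        histories[k].append(r)
        total += r
    return total

def ucb1_index(history, t, C=2.0):
    """UCB-style index from the earlier slide: mean reward plus sqrt(C ln t / T_k)."""
    return sum(history) / len(history) + math.sqrt(C * math.log(t) / len(history))

# Example: two Bernoulli arms with expectations 0.4 and 0.7.
arms = [lambda: float(random.random() < 0.4), lambda: float(random.random() < 0.7)]
print(play_bandit(arms, T=100, index_fn=ucb1_index))
```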
11. 1. Define a set of training problems (TP) for
exploration-exploitation. A priori knowledge can (should) be
used for defining the set TP.
In our simulations a training set is made of 100 bandit problems with Bernoulli distributions, two arms and the same horizon T. Every Bernoulli distribution is generated by selecting its expectation at random in [0, 1].
Three training sets are generated, one for each value of the training horizon T ∈ {10, 100, 1000}.
NB: In the spirit of supervised learning, the learned strategies are evaluated on
test problems different from the training problems. The first three test sets are
generated using the same procedure. The second three test sets are also
generated in a similar way but by considering truncated Gaussian distributions
in the interval [0, 1] for the rewards. The mean and the standard deviation of
the Gaussian distributions are selected at random in [0, 1]. The test sets count
10,000 elements.
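A minimal sketch of how such a training set could be generated (illustrative; the slide only fixes the distribution family, the number of arms and the horizons).

```python
import random

def make_training_set(n_problems=100, n_arms=2, horizon=10, seed=0):
    """Each problem: Bernoulli expectations drawn uniformly in [0, 1] plus the horizon T."""
    rng = random.Random(seed)
    return [{"means": [rng.random() for _ in range(n_arms)], "T": horizon}
            for _ in range(n_problems)]

# One training set per training horizon, as in the slides.
training_sets = {T: make_training_set(horizon=T, seed=T) for T in (10, 100, 1000)}
```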
12. 2. Define a rich set of candidate exploration-exploitation
strategies (CS).
Candidate exploration-exploitation strategies are index-based
policies.
Set of index-based policies contains all the policies that can be
represented by: index(H^k_t, t) = θ · φ(H^k_t, t), where θ is the vector of parameters to be optimized and φ(·, ·) a hand-designed feature extraction function.
We suggest using as feature extraction function the one that leads to the indexes:
Σ_{i=0}^{p} Σ_{j=0}^{p} Σ_{k=0}^{p} Σ_{l=0}^{p} θ_{i,j,k,l} · (√ln t)^i · (1/√T_k)^j · (r̄_k)^k · (σ_k)^l
where the θ_{i,j,k,l} are parameters.
p = 2 in our simulations ⇒ 81 parameters to be learned.
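A sketch of this parametrized index, under the assumption that r̄_k and σ_k are the empirical mean and standard deviation of arm k's observed rewards and T_k its play count; with p = 2 the feature vector has 3^4 = 81 entries, matching the 81 parameters. Such a learned_index can be plugged into the play_bandit sketch above in place of ucb1_index.

```python
import math

def index_features(history, t, p=2):
    """phi(H_k, t): all products (sqrt(ln t))^i * (1/sqrt(T_k))^j * (mean_k)^k * (std_k)^l."""
    Tk = len(history)
    mean = sum(history) / Tk
    std = math.sqrt(sum((r - mean) ** 2 for r in history) / Tk)
    base = [math.sqrt(math.log(t)), 1.0 / math.sqrt(Tk), mean, std]
    return [base[0] ** i * base[1] ** j * base[2] ** k * base[3] ** l
            for i in range(p + 1) for j in range(p + 1)
            for k in range(p + 1) for l in range(p + 1)]

def learned_index(theta, history, t, p=2):
    """index(H_k, t) = theta . phi(H_k, t); theta is the parameter vector being optimized."""
    return sum(w * f for w, f in zip(theta, index_features(history, t, p)))
```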
13. 3. Define a performance criterion PC : CS × TP → R such that
PC(strategy, training problem) gives the performance of
strategy on training problem.
Performance criterion of an index-based policy on a
multi-armed bandit problem is chosen as the expected regret
obtained by this policy.
NB: Again in the spirit of supervised learning, a regularization term could be
added to the above mentioned performance criterion to avoid overfitting.
14. 4. Solve arg max_{strategy ∈ CS} Σ_{training prob. ∈ TP} PC(strategy, training prob.) using an optimisation tool.
The objective function has a complex relation with the parameters and may contain many local optima. Global optimisation approaches are advocated.
Estimation of Distribution Algorithms (EDA) are used here as
global optimization tool. EDAs rely on a probabilistic model to
describe promising regions of the search space and to sample
good candidate solutions. This is performed by repeating
iterations that first sample a population of np candidates using
the current probabilistic model and then fit a new probabilistic
model given the b < np best candidates.
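A minimal sketch of such an EDA with an independent Gaussian model over the parameters (illustrative; score(theta) would return, for example, minus the average expected regret of the induced index policy over the training problems, so that larger is better).

```python
import random

def gaussian_eda(score, dim, n_pop=100, n_best=10, n_iters=50, seed=0):
    """Sample n_pop candidates from the current Gaussian model, keep the b = n_best
    best, refit the per-coordinate mean and standard deviation, and repeat."""
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [1.0] * dim
    best, best_score = None, float("-inf")
    for _ in range(n_iters):
        pop = [[rng.gauss(mean[d], std[d]) for d in range(dim)] for _ in range(n_pop)]
        ranked = sorted(((score(c), c) for c in pop), key=lambda sc: sc[0], reverse=True)
        if ranked[0][0] > best_score:
            best_score, best = ranked[0]
        elite = [c for _, c in ranked[:n_best]]
        for d in range(dim):
            vals = [c[d] for c in elite]
            mean[d] = sum(vals) / n_best
            std[d] = max((sum((v - mean[d]) ** 2 for v in vals) / n_best) ** 0.5, 1e-3)
    return best
```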
15. Simulation results
Policy quality on a test set measured by its average expected
regret on this test set.
Learned index-based policies are compared with 5 other
index-based policies: UCB1, UCB1-BERNOULLI, UCB2,
UCB1-NORMAL, UCB-V.
These 5 policies are made of small formulas which may have
parameters. Two cases studied: (i) default values for the
parameters or (ii) parameters optimized on a training set.
16. Expected regret on Bernoulli test problems
Policy            Training   Parameters      T=10   T=100   T=1000
Policies based on small formulas, default parameters:
UCB1              -          C = 2           1.07   5.57    20.1
UCB1-BERNOULLI    -          -               0.75   2.28    5.43
UCB1-NORMAL       -          -               1.71   13.1    31.7
UCB2              -          α = 10^-3       0.97   3.13    7.26
UCB-V             -          c = 1, ζ = 1    1.45   8.59    25.5
ǫn-GREEDY         -          c = 1, d = 1    1.07   3.21    11.5
Learned policies (81 parameters):
Learned           T=10       ...             0.72   2.37    15.7
Learned           T=100      ...             0.76   1.82    5.81
Learned           T=1000     ...             0.83   2.07    3.95
Observations: (i) UCB1-BERNOULLI is the best policy based on small formulas. (ii) The learned policy for a training horizon is always better than any policy based on a small formula if the test horizon is the same. (iii) Performances of the learned policies with respect to the formulas based policies degrade when the training horizon is not equal to the test horizon.
17. Expected regret on Bernoulli test problems
Policy       Training   Parameters                T=10   T=100   T=1000
Policies based on small formulas, optimized parameters:
UCB1         T=10       C = 0.170                 0.74   2.05    4.85
UCB1         T=100      C = 0.173                 0.74   2.05    4.84
UCB1         T=1000     C = 0.187                 0.74   2.08    4.91
UCB2         T=10       α = 0.0316                0.97   3.15    7.39
UCB2         T=100      α = 0.000749              0.97   3.12    7.26
UCB2         T=1000     α = 0.00398               0.97   3.13    7.25
UCB-V        T=10       c = 1.542, ζ = 0.0631     0.75   2.36    5.15
UCB-V        T=100      c = 1.681, ζ = 0.0347     0.75   2.28    7.07
UCB-V        T=1000     c = 1.304, ζ = 0.0852     0.77   2.43    5.14
ǫn-GREEDY    T=10       c = 0.0499, d = 1.505     0.79   3.86    32.5
ǫn-GREEDY    T=100      c = 1.096, d = 1.349      0.95   3.19    14.8
ǫn-GREEDY    T=1000     c = 0.845, d = 0.738      1.23   3.48    9.93
Learned policies (81 parameters):
Learned      T=10       ...                       0.72   2.37    15.7
Learned      T=100      ...                       0.76   1.82    5.81
Learned      T=1000     ...                       0.83   2.07    3.95
Main observation: The learned policy for a training horizon is always better than any policy based on a small formula if the test horizon is the same.
18. Expected regret on Gaussian test problems
Policy       Training   Parameters                T=10   T=100   T=1000
Policies based on small formulas, optimized parameters:
UCB1         T=10       C = 0.170                 1.05   6.05    32.1
UCB1         T=100      C = 0.173                 1.05   6.06    32.3
UCB1         T=1000     C = 0.187                 1.05   6.17    33.0
UCB2         T=10       α = 0.0316                1.28   7.91    40.5
UCB2         T=100      α = 0.000749              1.33   8.14    40.4
UCB2         T=1000     α = 0.00398               1.28   7.89    40.0
UCB-V        T=10       c = 1.542, ζ = 0.0631     1.01   5.75    26.8
UCB-V        T=100      c = 1.681, ζ = 0.0347     1.01   5.30    27.4
UCB-V        T=1000     c = 1.304, ζ = 0.0852     1.13   5.99    27.5
ǫn-GREEDY    T=10       c = 0.0499, d = 1.505     1.01   7.31    67.57
ǫn-GREEDY    T=100      c = 1.096, d = 1.349      1.12   6.38    46.6
ǫn-GREEDY    T=1000     c = 0.845, d = 0.738      1.32   6.28    37.7
Learned policies (81 parameters):
Learned      T=10       ...                       0.97   6.16    55.5
Learned      T=100      ...                       1.05   5.03    29.6
Learned      T=1000     ...                       1.12   5.61    27.3
Main observation: A learned policy is always better than any policy based on a small formula if the test horizon is equal to the training horizon.
19. Example 2: Learning for tree search. The case of
look-ahead trees for deterministic planning
Deterministic planning: At time t, the agent is in state xt ∈ X
and may choose any action ut ∈ U. The environment responds
by moving into a new state xt+1 = f(xt , ut ), and giving the
agent the reward r_t = ρ(x_t, u_t). The agent wants to maximize its return, defined as Σ_{t=0}^{∞} γ^t·r_t, where γ ∈ [0, 1[. Functions f and ρ are known.
Look-ahead tree: Particular type of policy that represents at
every instant t the set of all possible trajectories starting from
xt by a tree. A look-ahead tree policy explores this tree until the
computational resources (measured here in terms of number of
tree nodes developed) are exhausted. When exhausted, it
outputs the first action of the branch along which the highest
discounted rewards are collected.
20. [Figure: a look-ahead tree rooted at the current state x_t, with branches for the actions u^A and u^B, the successor states x_{t+1}, x_{t+2}, ... they lead to, and the associated rewards ρ(x, u) labelling the edges.]
Scoring the terminal nodes for exploration: The tree is developed incrementally by always expanding first the terminal node which has the highest score. Scoring functions are usually small formulas (a “best-first search” tree-exploration approach).
Objective: Find an exploration strategy for the look-ahead tree
policy that maximizes the return of the agent, whatever the
initial state of the system.
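A minimal sketch of such a look-ahead tree policy (illustrative; f, rho and score_fn stand for the known dynamics, the reward function and the terminal-node scoring function, and the budget counts developed nodes): the tree is grown best-first and, once the budget is exhausted, the first action of the branch with the highest discounted return is applied.

```python
import heapq
import itertools

def lookahead_action(x0, actions, f, rho, score_fn, budget, gamma=0.9):
    """Best-first look-ahead tree search from state x0; returns the action to apply."""
    tie = itertools.count()  # breaks ties in the heap without comparing states
    # Frontier entries: (-score, tie, state, depth, discounted return so far, first action)
    frontier = [(-score_fn(x0, 0, 0.0), next(tie), x0, 0, 0.0, None)]
    best_return, best_action = float("-inf"), actions[0]
    for _ in range(budget):
        if not frontier:
            break
        _, _, x, depth, ret, first = heapq.heappop(frontier)
        for u in actions:
            x_next, r = f(x, u), rho(x, u)
            ret_next = ret + gamma ** depth * r
            first_u = u if first is None else first
            if ret_next > best_return:
                best_return, best_action = ret_next, first_u
            heapq.heappush(frontier, (-score_fn(x_next, depth + 1, r), next(tie),
                                      x_next, depth + 1, ret_next, first_u))
    return best_action
```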
21. 1. Define a set of training problems (TP) for
exploration-(exploitation). A priori knowledge can (should) be
used for defining the set TP.
The training problems are made of the elements of the
deterministic planning problem. They only differ by the initial
state.
If information is available on the true initial state of the problem,
it should be used for defining the set of initial states X0 used for
defining the training set.
If the initial state is known perfectly well, one may consider only
one training problem.
22. 2. Define a rich set of candidate exploration strategies (CS).
The candidate exploration strategies all grow the tree incrementally, always expanding the terminal node which has the highest score according to a scoring function.
The set of scoring functions contains all the functions that take as input a tree and a terminal node and that can be represented by θ · φ(tree, terminal node), where θ is the vector of parameters to be optimized and φ(·, ·) a hand-designed feature extraction function.
For problems where X ⊂ R^m, we use as φ: (x[1], ..., x[m], dx[1], ..., dx[m], rx[1], ..., rx[m]), where x is the state associated with the terminal node, d is its depth and r is the reward collected before reaching this node.
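A sketch of this scoring function, under the assumption (not spelled out on the slide) that dx[i] and rx[i] denote the products of the depth d and the incoming reward r with the state components x[i]; it can be plugged into the lookahead_action sketch above as score_fn = lambda x, d, r: node_score(theta, x, d, r).

```python
def node_features(x, depth, reward):
    """phi(tree, terminal node) for a state x in R^m: (x, depth * x, reward * x)."""
    return list(x) + [depth * xi for xi in x] + [reward * xi for xi in x]

def node_score(theta, x, depth, reward):
    """score = theta . phi, maximized when choosing the next terminal node to expand."""
    return sum(w * f for w, f in zip(theta, node_features(x, depth, reward)))
```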
23. 3. Define a performance criterion PC : CS × TP → R such that
PC(strategy, training problem) gives the performance of
strategy on training problem.
The performance criterion of a scoring function is the return, on the training problem, of the look-ahead policy it leads to.
The computational budget for the tree development when
evaluating the performance criterion should be chosen equal to
the computational budget available when controlling the ’real
system’.
24. 4. Solve arg max_{strategy ∈ CS} Σ_{training prob. ∈ TP} PC(strategy, training prob.) using an optimisation tool.
The objective function has a complex relation with the parameters and may contain many local optima. Global optimisation approaches are advocated.
Estimation of Distribution Algorithms (EDA) are used here as the global optimization tool.
NB: Gaussian processes for global optimisation have also been tested. They are more complex optimisation tools and they have been found to require significantly fewer evaluations of the objective function to identify a near-optimal solution.
25. Simulation results
Small-formula scoring functions used for comparison:
score_mindepth = −(depth of terminal node)
score_greedy1 = reward obtained before reaching the terminal node
score_greedy2 = sum of discounted rewards from the top to the terminal node
score_optimistic = score_greedy2 + (γ^(depth of terminal node) / (1 − γ)) · Br
where Br is an upper bound on the reward function.
In the results hereafter, performance of a scoring function
evaluated by averaging the return of the agent over a set of
initial states. Results generated for different computational
budgets. The score function has always been optimized for the
budget used for performance evaluation.
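For comparison, a sketch of the greedy and optimistic baseline scores under the same conventions as the look-ahead sketch above (discounted return accumulated from the root, node depth, discount factor γ and reward upper bound Br).

```python
def score_greedy2(discounted_return, depth, gamma=0.9, Br=1.0):
    """Sum of discounted rewards collected from the root to the terminal node."""
    return discounted_return

def score_optimistic(discounted_return, depth, gamma=0.9, Br=1.0):
    """Greedy score plus an optimistic bound on all rewards still collectable below the node."""
    return discounted_return + gamma ** depth / (1.0 - gamma) * Br
```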
26. 2-dimensional test problem
Dynamics: (y_{t+1}, v_{t+1}) = (y_t, v_t) + 0.1·(v_t, u_t); X = R^2; U = {−1, +1}; ρ((y_t, v_t), u_t) = max(1 − y_{t+1}^2, 0); γ = 0.9. The initial states are chosen at random in [−1, 1] × [−2, 2] for building the training problems. The same states are used for evaluation.
Main observations: (i) Optimized trees outperform the other methods (ii) A small computational budget is needed for reaching near-optimal performances.
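A sketch of this benchmark's dynamics and reward (illustrative rollout with a crude hand-coded action rule, just to show the interface).

```python
def step(state, u, dt=0.1):
    """(y, v) <- (y, v) + dt * (v, u); reward = max(1 - y_next^2, 0)."""
    y, v = state
    y_next, v_next = y + dt * v, v + dt * u
    return (y_next, v_next), max(1.0 - y_next ** 2, 0.0)

state, ret, gamma = (-0.5, 1.0), 0.0, 0.9
for t in range(50):
    u = -1.0 if state[0] > 0 else 1.0   # crude heuristic: push the position back towards 0
    state, r = step(state, u)
    ret += gamma ** t * r
print(ret)
```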
27. HIV infection control problem
A problem with six state variables, two control actions and highly nonlinear dynamics. A single training ’initial state’ is used. Policies are first evaluated when the system starts from this state.
28. Main observations: (i) Optimized look-ahead trees clearly outperform the other methods. (ii) A very small computational budget is required for reaching near-optimal performances.
29. Policies evaluated when the system starts from a state rather
far from the initial state used for training:
Main observation: Optimized look-ahead trees perform very well even when the ’training state’ is not the test state.
30. A side note on these optimized look-ahead trees
Optimizing look-ahead trees is a direct policy search method
for which parameters to be optimized are those of the tree
exploration procedure rather than those of standard function
approximators (e.g., neural networks, RBFs), as is usually the case.
Require very little memory (compared to dynamic
programming methods and even other policy search
techniques); not that many parameters need to be optimized;
robust with respect to the initial states chosen for training, etc.
Right problem statement for comparing different techniques:
“How to compute the best policy given an amount of off-line
and on-line computational resources knowing fully the
environment and not the initial state?”
31. Complement to other techniques rather than competitor:
look-ahead tree policies could be used as an “on-line
correction mechanism” for policies computed off-line.
[Figure: a look-ahead tree over the actions u^A and u^B used for tree exploration and selection of the action u_t, with another policy, computed off-line, used for scoring the nodes.]
32. Example 3: Learning for exploration-exploitation
in finite MDPs
The instantiation of the general learning paradigm for exploration-exploitation in RL to this case is very similar to what has been described for multi-armed bandit problems.
Good results have already been obtained by considering a large set of candidate exploration-exploitation policies made of small formulas.
The next step would be to experiment with the approach on MDPs with continuous spaces.
33. Automatic learning of (small) formulas for
exploration-exploitation?
Remember: there are pros to using small formulas!
The learning approach for exploration-exploitation in RL could
help to find new (small) formulas people have not thought of
before by considering as candidate strategies a large set of
(small) formulas.
An example of a formula for index-based policies found by such an approach to perform very well on bandit problems: r̄_k + C/T_k.
34. Conclusion and future works
Learning for exploration/exploitation: excellent performances
that suggest a bright future for this approach.
Several directions for future research:
Optimisation criterion: Approach targets an
exploration-exploitation policy that gives the best average
performances on a set of problems. Risk-averse criteria could
also be thought of. Regularization terms could also be added
to avoid overfitting.
Design of customized optimisation tools: Particularly relevant
when the set of candidate policies is made of formulas.
Theoretical analysis: e.g., what are the properties of the learned strategies in generalization?
35. Presentation based on (order of appearance)
“Learning to play K-armed bandit problems”. F. Maes, L. Wehenkel and D.
Ernst. In Proceedings of the 4th International Conference on Agents and
Artificial Intelligence (ICAART 2012), Vilamoura, Algarve, Portugal, 6-8
February 2012. (8 pages).
“Optimized look-ahead tree search policies”. F. Maes, L. Wehenkel and D.
Ernst. In Proceedings of the 9th European Workshop on Reinforcement
Learning (EWRL 2011), Athens, Greece, September 9-11, 2011. (12 pages).
“Learning exploration/exploitation strategies for single trajectory reinforcement
learning”. M. Castronovo, F. Maes, R. Fonteneau and D. Ernst. In Proceedings
of the 10th European Workshop on Reinforcement Learning (EWRL 2012),
Edinburgh, Scotland, June 30-July 1 2012. (8 pages).
“Automatic discovery of ranking formulas for playing with multi-armed bandits”.
F. Maes, L. Wehenkel and D. Ernst. In Proceedings of the 9th European
Workshop on Reinforcement Learning (EWRL 2011), Athens, Greece,
September 9-11, 2011. (12 pages).
“Optimized look-ahead tree policies: a bridge between look-ahead tree policies
and direct policy search”. T. Jung, D. Ernst, L.Wehenkel and F. Maes.
Submitted.
“Learning exploration/exploitation strategies: the multi-armed bandit case” F.
Maes, L. Wehenkel and D. Ernst. Submitted.