Learning for exploration-exploitation
in reinforcement learning.
The dusk of the small formulas’ reign.
Damien Ernst
University of Liège, Belgium
25th of November 2011
SEQUEL project, INRIA Lille - Nord Europe.
Exploration-exploitation in RL
Reinforcement Learning (RL): an agent interacts with an
environment in which it perceives information about its current
state and takes actions. The environment, in return, provides a
reward signal. The agent's objective is to maximize the
(expected) cumulative reward signal. It may have a priori
knowledge of its environment before taking an action and has
limited computational resources.
Exploration-exploitation: The agent needs to take actions so as
to gather information that may help it discover how to
obtain high rewards in the long term (exploration) and, at the
same time, it has to exploit its current information as well as
possible to obtain high short-term rewards.
Pure exploration: The agent first interacts with its environment
to collect information without paying attention to the rewards.
The information is used afterwards to take actions leading to
high rewards.
Pure exploration with model of the environment: Typical
problem met when an agent with limited computational
resources computes actions based on (pieces of) simulated
trajectories.
2
Example 1: Multi-armed bandit problems
Definition: A gambler (agent) has T coins, and at each step he
may choose one of K slots (or arms) to allocate one of
these coins to, and then earns some money (his reward)
depending on the response of the machine he selected.
Rewards are sampled from an unknown probability distribution
that depends on the selected arm. Goal of the agent: collect
the largest cumulative reward once he has exhausted his coins.
Standard solutions: Index-based policies; use an index for
ranking the arms and pick at each play the arm with the highest
index; the index for each arm k is a small formula that takes
as input, for example, the average reward collected (r̄_k), the
standard deviation of these rewards (σ_k), the total number of
plays t, the number of times arm k has been played (T_k), etc.
Upper Confidence Bound index as example: r̄_k + √(2 ln t / T_k)
3
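To make the index-based policy concrete, here is a minimal Python sketch using the UCB1 index above; the two Bernoulli arms and the horizon are made up for the example.

```python
import math
import random

def ucb1_index(mean_reward, n_plays_arm, n_plays_total):
    # UCB1 index: empirical mean plus an exploration bonus sqrt(2 ln t / T_k)
    return mean_reward + math.sqrt(2.0 * math.log(n_plays_total) / n_plays_arm)

def play_bandit(arms, T):
    # Index-based policy: play each arm once, then always play the arm
    # with the highest index. `arms` is a list of reward samplers.
    K = len(arms)
    counts, means, total = [0] * K, [0.0] * K, 0.0
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1  # initialisation: play each arm once
        else:
            k = max(range(K), key=lambda a: ucb1_index(means[a], counts[a], t))
        r = arms[k]()  # sample a reward from the chosen arm
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]
        total += r
    return total

# Example: a two-armed Bernoulli bandit with success probabilities 0.4 and 0.6.
arms = [lambda p=p: float(random.random() < p) for p in (0.4, 0.6)]
print(play_bandit(arms, T=1000))
```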
Example 2: Tree search
Tree search: Pure exploration problem with a model of the
environment. The agent represents possible strategies by a
tree. It cannot explore the tree exhaustively because of
limited computational resources. When the computational
resources are exhausted, it needs to take an action. Often
used with receding-horizon strategies.
Examples of tree exploration algorithms: A*, Monte Carlo Tree
Search, optimistic planning for deterministic systems, etc. In all
these algorithms, the terminal nodes to be developed are chosen
based on small formulas.
4
Example 3: Exploration-exploitation in a MDP
Markov Decision Process (MDP): At time t, the agent is in state
x_t ∈ X and may choose any action u_t ∈ U. The environment
responds by randomly moving into a new state x_{t+1} ∼ P(·|x_t, u_t)
and giving the agent the reward r_t ∼ ρ(x_t, u_t).
An approach for interacting with MDPs when P and ρ are unknown:
a layer that learns (directly or indirectly) an approximate
state-action value function Q : X × U → R, on top of which comes
the exploration/exploitation policy. Typical
exploration-exploitation policies are made of small formulas.
ε-Greedy: u_t = arg max_{u∈U} Q(x_t, u) with probability 1 − ε, and
u_t chosen at random with probability ε.
Softmax policy: u_t ∼ P_sm(·) where
P_sm(u) = exp(Q(x_t, u)/τ) / Σ_{u′∈U} exp(Q(x_t, u′)/τ).
5
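A minimal sketch of these two small-formula policies, assuming the learned values are stored in a dictionary Q indexed by (state, action) pairs (a representation chosen only for the example):

```python
import math
import random

def epsilon_greedy(Q, x, actions, epsilon=0.1):
    # With probability 1 - epsilon take the greedy action, otherwise a random one.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda u: Q[(x, u)])

def softmax(Q, x, actions, tau=1.0):
    # Sample an action with probability proportional to exp(Q(x, u) / tau).
    weights = [math.exp(Q[(x, u)] / tau) for u in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```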
Pros and cons for small formulas
Pros: (i) Lend themselves to a theoretical analysis (ii) No
computing time required for designing a solution (iii) No a priori
knowledge on the problem required.
Cons: (i) Published formulas are often not used as such in practice,
and solutions are engineered to the problem at hand (typically the
case for Monte-Carlo Tree Search (MCTS) techniques) (ii)
Computing time is often available before starting the
exploration-exploitation task (iii) A priori knowledge on the
problem is often available and not exploited by the formulas.
6
Learning for exploration-(exploitation) in RL
1. Define a set of training problems (TP) for
exploration-(exploitation). A priori knowledge can (should) be
used for defining the set TP.
2. Define a rich set of candidate exploration-(exploitation)
strategies (CS) with numerical parametrizations/with formulas.
3. Define a performance criterion PC : CS × TP → R such that
PC(strategy, training problem) gives the performance of
strategy on training problem.
4. Solve arg max_{strategy ∈ CS} Σ_{training prob. ∈ TP} PC(strategy, training prob.)
using an optimisation tool.
NB: the approach is inspired by current methods for tuning the parameters of
existing formulas; it differs in that (i) it searches in much richer spaces of
strategies and (ii) the search procedure is much more systematic.
7
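In pseudo-Python, the four steps reduce to the following minimal sketch of the generic approach; candidate_strategies, training_problems and performance are placeholders standing for CS, TP and PC.

```python
def learn_strategy(candidate_strategies, training_problems, performance):
    # Step 4: return the candidate strategy with the highest total performance
    # over the training problems (steps 1-3 define the three arguments).
    def total_performance(strategy):
        return sum(performance(strategy, problem) for problem in training_problems)
    # In practice CS is a continuous parametric family, so this arg max is carried
    # out with a global optimisation tool (e.g. an EDA) rather than by enumeration.
    return max(candidate_strategies, key=total_performance)
```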
Example 1: Learning for multi-armed bandit
problems
Multi-armed bandit problem: Let i ∈ {1, 2, . . . , K} index the K
arms of the bandit problem. ν_i is the reward distribution of arm
i and μ_i its expected value. T is the total number of plays. b_t is
the arm played by the agent at round t and r_t ∼ ν_{b_t} is the
reward it obtains.
Information: H_t = [b_1, r_1, b_2, r_2, . . . , b_t, r_t] is a vector that
gathers all the information the agent has collected during the
first t plays. H is the set of all possible histories of any length t.
Policy: The agent's policy π : H → {1, 2, . . . , K} processes at
play t the information H_{t−1} to select the arm b_t ∈ {1, 2, . . . , K}:
b_t = π(H_{t−1}).
8
Notion of regret: Let μ* = max_k μ_k be the expected reward of
the optimal arm. The regret of π is R^π_T = T μ* − Σ_{t=1}^{T} r_t. The
expected regret is E[R^π_T] = Σ_{k=1}^{K} E[T_k(T)] (μ* − μ_k), where
T_k(T) is the number of times the policy has drawn arm k during
the first T rounds.
Objective: Find a policy π* that, for a given K, minimizes the
expected regret, ideally for any T and any {ν_i}_{i=1}^{K} (equivalent to
maximizing the expected sum of rewards obtained).
Index-based policy: (i) During the first K plays, play each arm
once, sequentially (ii) For each t > K, compute for every
machine k the score index(H^k_{t−1}, t), where H^k_{t−1} is the history of
rewards for machine k (iii) Play the arm with the largest score.
9
1. Define a set of training problems (TP) for
exploration-exploitation. A priori knowledge can (should) be
used for defining the set TP.
In our simulations a training set is made of 100 bandit
problems with Bernoulli distributions, two arms and the same
horizon T. Every Bernoulli distribution is generated by
selecting its expectation at random in [0, 1].
Three training sets are generated, one for each value of the
training horizon T ∈ {10, 100, 1000}.
NB: In the spirit of supervised learning, the learned strategies are evaluated on
test problems different from the training problems. The first three test sets are
generated using the same procedure. Three further test sets are generated in a
similar way but by considering truncated Gaussian distributions on the interval
[0, 1] for the rewards. The mean and the standard deviation of the Gaussian
distributions are selected at random in [0, 1]. Each test set counts
10,000 elements.
10
2. Define a rich set of candidate exploration-exploitation
strategies (CS).
Candidate exploration-exploitation strategies are index-based
policies.
The set of index-based policies contains all the policies that can be
represented by index(H^k_t, t) = θ · φ(H^k_t, t), where θ is the vector
of parameters to be optimized and φ(·, ·) a hand-designed
feature extraction function.
We suggest using as feature extraction function the one that
leads to the indexes:
Σ_{i=0}^{p} Σ_{j=0}^{p} Σ_{k=0}^{p} Σ_{l=0}^{p} θ_{i,j,k,l} · (√(ln t))^i · (1/√T_k)^j · (r̄_k)^k · (σ_k)^l
where the θ_{i,j,k,l} are parameters.
p = 2 in our simulations ⇒ 81 parameters to be learned.
11
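A sketch of this parametric index for p = 2; theta is assumed to be a (p+1)×(p+1)×(p+1)×(p+1) array, and the history of arm k is summarised by its empirical mean r_k, standard deviation sigma_k and play count T_k (names chosen for the example):

```python
import math

def parametric_index(theta, r_k, sigma_k, T_k, t, p=2):
    # index(H_t^k, t) = sum_{i,j,k,l} theta[i][j][k][l] * (sqrt(ln t))^i
    #                   * (1/sqrt(T_k))^j * (r_k)^k * (sigma_k)^l
    a = math.sqrt(math.log(t))
    b = 1.0 / math.sqrt(T_k)
    score = 0.0
    for i in range(p + 1):
        for j in range(p + 1):
            for k in range(p + 1):
                for l in range(p + 1):
                    score += theta[i][j][k][l] * a**i * b**j * r_k**k * sigma_k**l
    return score  # 81 terms (and parameters) when p = 2
```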
3. Define a performance criterion PC : CS × TP → R such that
PC(strategy, training problem) gives the performance of
strategy on training problem.
Performance criterion of an index-based policy on a
multi-armed bandit problem is chosen as the expected regret
obtained by this policy.
NB: Again in the spirit of supervised learning, a regularization term could be
added to the above-mentioned performance criterion to avoid overfitting.
12
4. Solve arg max_{strategy ∈ CS} Σ_{training prob. ∈ TP} PC(strategy, training prob.)
using an optimisation tool.
The objective function has a complex relation with the parameters
and may contain many local optima. Global optimisation
approaches are advocated.
Estimation of Distribution Algorithms (EDAs) are used here as the
global optimization tool. EDAs rely on a probabilistic model to
describe promising regions of the search space and to sample
good candidate solutions. This is done by repeating
iterations that first sample a population of n_p candidates using
the current probabilistic model and then fit a new probabilistic
model to the b < n_p best candidates.
13
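A minimal sketch of such an EDA, with an independent Gaussian model over the parameter vector; the population sizes, the number of iterations and the objective f to maximise are placeholders:

```python
import random

def gaussian_eda(f, dim, n_iters=50, n_pop=100, n_best=20):
    # Maximise f over R^dim: sample a population from the current Gaussian model,
    # keep the n_best candidates, refit the mean and standard deviation on them.
    mu, sigma = [0.0] * dim, [1.0] * dim
    best = None
    for _ in range(n_iters):
        pop = [[random.gauss(mu[d], sigma[d]) for d in range(dim)]
               for _ in range(n_pop)]
        pop.sort(key=f, reverse=True)
        elite = pop[:n_best]
        if best is None or f(elite[0]) > f(best):
            best = elite[0]
        for d in range(dim):
            mu[d] = sum(x[d] for x in elite) / n_best
            var = sum((x[d] - mu[d]) ** 2 for x in elite) / n_best
            sigma[d] = max(var ** 0.5, 1e-3)  # keep some spread to avoid collapse
    return best
```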
Simulation results
Policy quality on a test set is measured by its average expected
regret on this test set.
Learned index-based policies are compared with 5 other
index-based policies: UCB1, UCB1-BERNOULLI, UCB2,
UCB1-NORMAL, UCB-V.
These 5 policies are made of small formulas which may have
parameters. Two cases are studied: (i) default values for the
parameters or (ii) parameters optimized on a training set.
14
Bernoulli test problems (average expected regret)
Policy | Training horizon | Parameters | T=10 | T=100 | T=1000
Policies based on small formulas, default parameters:
UCB1 | - | C = 2 | 1.07 | 5.57 | 20.1
UCB1-BERNOULLI | - | - | 0.75 | 2.28 | 5.43
UCB1-NORMAL | - | - | 1.71 | 13.1 | 31.7
UCB2 | - | α = 10^-3 | 0.97 | 3.13 | 7.26
UCB-V | - | c = 1, ζ = 1 | 1.45 | 8.59 | 25.5
εn-GREEDY | - | c = 1, d = 1 | 1.07 | 3.21 | 11.5
Learned policies (81 parameters):
Learned | T=10 | - | 0.72 | 2.37 | 15.7
Learned | T=100 | - | 0.76 | 1.82 | 5.81
Learned | T=1000 | - | 0.83 | 2.07 | 3.95
Observations: (i) UCB1-BERNOULLI is the best policy based on small
formulas. (ii) The learned policy for a training horizon is always better than any
policy based on a small formula if the test horizon is the same. (iii) The
performance of the learned policies relative to the formula-based policies
degrades when the training horizon is not equal to the test horizon.
15
Bernoulli test problems (average expected regret)
Policy | Training horizon | Parameters | T=10 | T=100 | T=1000
Policies based on small formulas, optimized parameters:
UCB1 | T=10 | C = 0.170 | 0.74 | 2.05 | 4.85
UCB1 | T=100 | C = 0.173 | 0.74 | 2.05 | 4.84
UCB1 | T=1000 | C = 0.187 | 0.74 | 2.08 | 4.91
UCB2 | T=10 | α = 0.0316 | 0.97 | 3.15 | 7.39
UCB2 | T=100 | α = 0.000749 | 0.97 | 3.12 | 7.26
UCB2 | T=1000 | α = 0.00398 | 0.97 | 3.13 | 7.25
UCB-V | T=10 | c = 1.542, ζ = 0.0631 | 0.75 | 2.36 | 5.15
UCB-V | T=100 | c = 1.681, ζ = 0.0347 | 0.75 | 2.28 | 7.07
UCB-V | T=1000 | c = 1.304, ζ = 0.0852 | 0.77 | 2.43 | 5.14
εn-GREEDY | T=10 | c = 0.0499, d = 1.505 | 0.79 | 3.86 | 32.5
εn-GREEDY | T=100 | c = 1.096, d = 1.349 | 0.95 | 3.19 | 14.8
εn-GREEDY | T=1000 | c = 0.845, d = 0.738 | 1.23 | 3.48 | 9.93
Learned policies (81 parameters):
Learned | T=10 | - | 0.72 | 2.37 | 15.7
Learned | T=100 | - | 0.76 | 1.82 | 5.81
Learned | T=1000 | - | 0.83 | 2.07 | 3.95
Main observation: The learned policy for a training horizon is always better
than any policy based on a small formula if the test horizon is the same.
16
Gaussian test problems (average expected regret)
Policy | Training horizon | Parameters | T=10 | T=100 | T=1000
Policies based on small formulas, optimized parameters:
UCB1 | T=10 | C = 0.170 | 1.05 | 6.05 | 32.1
UCB1 | T=100 | C = 0.173 | 1.05 | 6.06 | 32.3
UCB1 | T=1000 | C = 0.187 | 1.05 | 6.17 | 33.0
UCB2 | T=10 | α = 0.0316 | 1.28 | 7.91 | 40.5
UCB2 | T=100 | α = 0.000749 | 1.33 | 8.14 | 40.4
UCB2 | T=1000 | α = 0.00398 | 1.28 | 7.89 | 40.0
UCB-V | T=10 | c = 1.542, ζ = 0.0631 | 1.01 | 5.75 | 26.8
UCB-V | T=100 | c = 1.681, ζ = 0.0347 | 1.01 | 5.30 | 27.4
UCB-V | T=1000 | c = 1.304, ζ = 0.0852 | 1.13 | 5.99 | 27.5
εn-GREEDY | T=10 | c = 0.0499, d = 1.505 | 1.01 | 7.31 | 67.57
εn-GREEDY | T=100 | c = 1.096, d = 1.349 | 1.12 | 6.38 | 46.6
εn-GREEDY | T=1000 | c = 0.845, d = 0.738 | 1.32 | 6.28 | 37.7
Learned policies (81 parameters):
Learned | T=10 | - | 0.97 | 6.16 | 55.5
Learned | T=100 | - | 1.05 | 5.03 | 29.6
Learned | T=1000 | - | 1.12 | 5.61 | 27.3
Main observation: A learned policy is always better than any policy based on
a small formula if the test horizon is equal to the training horizon.
17
Example 2: Learning for tree search. The case of
look-ahead trees for deterministic planning
Deterministic planning: At time t, the agent is in state x_t ∈ X
and may choose any action u_t ∈ U. The environment responds
by moving into a new state x_{t+1} = f(x_t, u_t) and giving the
agent the reward r_t = ρ(x_t, u_t). The agent wants to maximize
its return, defined as Σ_{t=0}^{∞} γ^t r_t where γ ∈ [0, 1[. The functions f
and ρ are known.
Look-ahead tree: A particular type of policy that represents, at
every instant t, the set of all possible trajectories starting from
x_t by a tree. A look-ahead tree policy explores this tree until the
computational resources (measured here in terms of the number of
tree nodes developed) are exhausted. When they are exhausted, it
outputs the first action of the branch along which the highest
discounted rewards are collected.
18
[Figure: a look-ahead tree rooted at the current state x_t (reached through x_{t−2}, x_{t−1} with actions u_{t−2}, u_{t−1}); from x_t, branches for actions u^A and u^B lead to successor states x_{t+1}, x_{t+2}, . . . , each edge labelled with its reward ρ(·, ·).]
Scoring the terminal nodes for exploration: The tree is
developed incrementally by always expanding first the terminal
node with the highest score. Scoring functions are
usually small formulas (the "best-first search" tree-exploration
approach).
Objective: Find an exploration strategy for the look-ahead tree
policy that maximizes the return of the agent, whatever the
initial state of the system.
19
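A minimal sketch of a best-first look-ahead tree policy of this kind, assuming a known deterministic model f, reward function rho and action set U; the scoring function here is assumed to see an open node through its state, depth and the discounted return collected along its branch (a simplification of score(tree, terminal node)), and budget counts node expansions:

```python
import heapq
import itertools

def lookahead_tree_action(x_t, f, rho, U, score, budget, gamma=0.9):
    # Grow the tree best-first: always expand the open node with the highest
    # score, then return the first action of the branch along which the highest
    # sum of discounted rewards was collected.
    tie = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(-score(x_t, 0, 0.0), next(tie), x_t, 0, 0.0, None)]
    best_return, best_action = float("-inf"), None
    for _ in range(budget):
        if not frontier:
            break
        _, _, x, depth, ret, first_u = heapq.heappop(frontier)
        for u in U:
            x_next, r = f(x, u), rho(x, u)
            child_ret = ret + gamma**depth * r
            child_first = first_u if first_u is not None else u
            if child_ret > best_return:
                best_return, best_action = child_ret, child_first
            heapq.heappush(frontier, (-score(x_next, depth + 1, child_ret),
                                      next(tie), x_next, depth + 1,
                                      child_ret, child_first))
    return best_action
```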
1. Define a set of training problems (TP) for
exploration-(exploitation). A priori knowledge can (should) be
used for defining the set TP.
The training problems are made of the elements of the
deterministic planning problem. They only differ by the initial
state.
If information is available on the true initial state of the problem,
it should be used for defining the set of initial states X0 used for
defining the training set.
If the initial state is known perfectly well, one may consider only
one training problem.
20
2. Define a rich set of candidate exploration strategies (CS).
The candidate exploration strategies all grow the tree
incrementally, always expanding the terminal node with
the highest score according to a scoring function.
The set of scoring functions contains all the functions that take
as input a tree and a terminal node and that can be
represented by θ · φ(tree, terminal node), where θ is the vector
of parameters to be optimized and φ(·, ·) a hand-designed
feature extraction function.
For problems where X ⊂ R^m, we use as φ:
(x[1], . . . , x[m], d·x[1], . . . , d·x[m], r·x[1], . . . , r·x[m]), where x is the
state associated with the terminal node, d is its depth and r is
the reward collected before reaching this node.
21
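A sketch of this feature map and the resulting linear scoring function, assuming each terminal node is summarised by its state vector x, its depth d and the last reward r (how a node stores these quantities is left open here):

```python
def phi(x, d, r):
    # phi(tree, terminal node) = (x[1..m], d*x[1..m], r*x[1..m])
    return list(x) + [d * xi for xi in x] + [r * xi for xi in x]

def linear_score(theta, x, d, r):
    # score(tree, terminal node) = theta . phi(tree, terminal node)
    return sum(t * feat for t, feat in zip(theta, phi(x, d, r)))
```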
3. Define a performance criterion PC : CS × TP → R such that
PC(strategy, training problem) gives the performance of
strategy on training problem.
The performance criterion of a scoring function is the return, on
the considered training problem, of the look-ahead policy it leads to.
The computational budget for the tree development when
evaluating the performance criterion should be chosen equal to
the computational budget available when controlling the 'real
system'.
22
4. Solve arg max_{strategy ∈ CS} Σ_{training prob. ∈ TP} PC(strategy, training prob.)
using an optimisation tool.
The objective function has a complex relation with the parameters
and may contain many local optima. Global optimisation
approaches are advocated.
Estimation of Distribution Algorithms (EDAs) are used here as the
global optimization tool.
NB: Gaussian processes for global optimisation have also been tested. They
are more complex optimisation tools, and they have been found to require
significantly fewer evaluations of the objective function to identify a
near-optimal solution.
23
Simulation results
Small-formula scoring functions used for comparison:
score_mindepth = −(depth of terminal node)
score_greedy1 = reward obtained before reaching the terminal node
score_greedy2 = sum of discounted rewards from the top to the terminal node
score_optimistic = score_greedy2 + (γ^{depth of terminal node} / (1 − γ)) · B_r
where B_r is an upper bound on the reward function.
In the results hereafter, the performance of a scoring function is
evaluated by averaging the return of the agent over a set of
initial states. Results are generated for different computational
budgets. The scoring function has always been optimized for the
budget used for performance evaluation.
24
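For reference, minimal sketches of these four baseline scores, using the same node summary as above (depth d, last reward r, discounted return ret from the root) and the reward bound B_r:

```python
def score_mindepth(d):
    return -d                      # breadth-first flavour: prefer shallow nodes

def score_greedy1(r):
    return r                       # last reward only

def score_greedy2(ret):
    return ret                     # discounted return from the root

def score_optimistic(ret, d, gamma, B_r):
    # discounted return so far plus an optimistic bound on all future rewards
    return ret + gamma**d / (1.0 - gamma) * B_r
```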
2-dimensional test problem
Dynamics: (y_{t+1}, v_{t+1}) = (y_t, v_t) + 0.1·(v_t, u_t); X = R²;
U = {−1, +1}; ρ((y_t, v_t), u_t) = max(1 − y²_{t+1}, 0); γ = 0.9. The
initial states are chosen at random in [−1, 1] × [−2, 2] for building
the training problems. The same states are used for evaluation.
Main observations: (i) Optimized trees outperform the other methods (ii) A small
computational budget is needed for reaching near-optimal performance.
25
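The 2-dimensional test problem can be transcribed directly; together with the look-ahead sketch above, lookahead_tree_action((y0, v0), f, rho, ACTIONS, score, budget) would then select an action for a given initial state:

```python
def f(state, u):
    # (y_{t+1}, v_{t+1}) = (y_t, v_t) + 0.1 * (v_t, u_t)
    y, v = state
    return (y + 0.1 * v, v + 0.1 * u)

def rho(state, u):
    # rho((y_t, v_t), u_t) = max(1 - y_{t+1}^2, 0)
    y_next, _ = f(state, u)
    return max(1.0 - y_next**2, 0.0)

GAMMA = 0.9
ACTIONS = (-1.0, +1.0)
```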
HIV infection control problem
A problem with six state variables, two control actions and
highly nonlinear dynamics. A single training 'initial state' is
used. Policies are first evaluated when the system starts from this
state.
26
Main observations: (i) Optimized look-ahead trees clearly outperform the other
methods. (ii) A very small computational budget is required for reaching
near-optimal performance.
27
Policies evaluated when the system starts from a state rather
far from the initial state used for training:
Main observation: Optimized look-ahead trees perform very well even when
the 'training state' is not the test state.
28
A side note on these optimized look-ahead trees
Optimizing look-ahead trees is a direct policy search method
in which the parameters to be optimized are those of the
tree-exploration procedure rather than those of standard function
approximators (e.g., neural networks, RBFs), as is usually the
case.
These policies require very little memory (compared to dynamic
programming methods and even other policy search
techniques); not that many parameters need to be optimized;
they are robust with respect to the initial states chosen for training; etc.
Right problem statement for comparing different techniques:
“How to compute the best policy given an amount of off-line
and on-line computational resources knowing fully the
environment and not the initial state?”
29
Complement to other techniques rather than a competitor:
look-ahead tree policies could be used as an "on-line
correction mechanism" for policies computed off-line.
[Figure: the same look-ahead tree, used for tree exploration and the selection of action u_t, with another policy (computed off-line) used for scoring the nodes.]
30
Example 3: Learning for exploration-exploitation
in finite MDPs
The instantiation of the general learning paradigm for
exploration-exploitation in RL to this case is very similar
to what has been described for multi-armed bandit problems.
Good results have already been obtained by considering a large set of
candidate exploration-exploitation policies made of small
formulas.
The next step would be to experiment with the approach on MDPs with
continuous spaces.
31
Automatic learning of (small) formulas for
exploration-exploitation?
Remember: there are pros to using small formulas!
The learning approach for exploration-exploitation in RL could
help to find new (small) formulas people have not thought of
before, by considering as candidate strategies a large set of
(small) formulas.
An example of a formula for index-based policies found to
perform very well on bandit problems by such an approach:
r̄_k + C/T_k.
32
Conclusion and future works
Learning for exploration/exploitation: excellent performances
that suggest a bright future for this approach.
Several directions for future research:
Optimisation criterion: The approach targets an
exploration-exploitation policy that gives the best average
performance on a set of problems. Risk-averse criteria could
also be considered. Regularization terms could also be added
to avoid overfitting.
Design of customized optimisation tools: particularly relevant
when the set of candidate policies is made of formulas.
Theoretical analysis: e.g., what are the properties of the
learned strategies in generalization?
33
Presentation based on (order of appearance)
“Learning to play K-armed bandit problems”. F. Maes, L. Wehenkel and D.
Ernst. In Proceedings of the 4th International Conference on Agents and
Artificial Intelligence (ICAART 2012), Vilamoura, Algarve, Portugal, 6-8
February 2012. (8 pages).
“Optimized look-ahead tree search policies”. F. Maes, L. Wehenkel and D.
Ernst. In Proceedings of the 9th European Workshop on Reinforcement
Learning (EWRL 2011), Athens, Greece, September 9-11, 2011. (12 pages).
“Learning exploration/exploitation strategies for single trajectory reinforcement
learning”. M. Castronovo, F. Maes, R. Fonteneau and D. Ernst. In Proceedings
of the 10th European Workshop on Reinforcement Learning (EWRL 2012),
Edinburgh, Scotland, June 30-July 1 2012. (8 pages).
“Automatic discovery of ranking formulas for playing with multi-armed bandits”.
F. Maes, L. Wehenkel and D. Ernst. In Proceedings of the 9th European
Workshop on Reinforcement Learning (EWRL 2011), Athens, Greece,
September 9-11, 2011. (12 pages).
“Optimized look-ahead tree policies: a bridge between look-ahead tree policies
and direct policy search”. T. Jung, D. Ernst, L. Wehenkel and F. Maes.
Submitted.
“Learning exploration/exploitation strategies: the multi-armed bandit case” F.
Maes, L. Wehenkel and D. Ernst. Submitted.
34
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
bank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdfbank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdf
 
A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...
 

• 7. Pros and cons for small formulas
Pros: (i) they lend themselves to theoretical analysis; (ii) no computing time is required for designing a solution; (iii) no a priori knowledge on the problem is required.
Cons: (i) published formulas are rarely used as such in practice, and solutions are often engineered to the problem at hand (typically the case for Monte-Carlo Tree Search (MCTS) techniques); (ii) computing time is often available before the exploration-exploitation task starts, but the formulas do not use it; (iii) a priori knowledge on the problem is often available and not exploited by the formulas.
• 8. Learning for exploration-(exploitation) in RL
1. Define a set of training problems (TP) for exploration-(exploitation). A priori knowledge can (and should) be used for defining TP.
2. Define a rich set of candidate exploration-(exploitation) strategies (CS), either numerically parametrized or made of formulas.
3. Define a performance criterion PC : CS × TP → R such that PC(strategy, training problem) gives the performance of the strategy on the training problem.
4. Solve $\arg\max_{strategy \in CS} \sum_{training\ prob. \in TP} PC(strategy, training\ prob.)$ using an optimisation tool.
NB: the approach is inspired by current methods for tuning the parameters of existing formulas; it differs in that (i) it searches much richer spaces of strategies and (ii) the search procedure is much more systematic.
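To make the recipe concrete, here is a minimal Python sketch of steps 3-4 (all function names are ours, not from the cited papers): a candidate strategy is identified with its parameter vector θ, and the objective handed to a generic global optimizer is the aggregated performance criterion over the training problems.

```python
def objective(theta, training_problems, performance_criterion):
    """Step-4 objective: aggregate PC(strategy_theta, problem) over the training set."""
    return sum(performance_criterion(theta, p) for p in training_problems)

def learn_strategy(training_problems, performance_criterion, global_optimizer, dim):
    """Hand the aggregated objective to a generic global optimizer (e.g. an EDA,
    as on the later slides) and return the best parameter vector found."""
    return global_optimizer(lambda theta: objective(theta, training_problems,
                                                    performance_criterion), dim)
```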
• 9. Example 1: Learning for multi-armed bandit problems
Multi-armed bandit problem: let i ∈ {1, 2, . . . , K} index the K arms of the bandit problem. ν_i is the reward distribution of arm i and µ_i its expected value; T is the total number of plays. b_t is the arm played by the agent at round t and r_t ∼ ν_{b_t} is the reward it obtains.
Information: H_t = [b_1, r_1, b_2, r_2, . . . , b_t, r_t] is a vector that gathers all the information the agent has collected during the first t plays. H is the set of all possible histories of any length t.
Policy: the agent's policy π : H → {1, 2, . . . , K} processes at play t the information H_{t−1} to select the arm b_t ∈ {1, 2, . . . , K}: b_t = π(H_{t−1}).
• 10. Notion of regret
Let µ* = max_k µ_k be the expected reward of the optimal arm. The regret of π is $R^\pi_T = T\mu^* - \sum_{t=1}^{T} r_t$. The expected regret is $E[R^\pi_T] = \sum_{k=1}^{K} E[T_k(T)]\,(\mu^* - \mu_k)$, where T_k(T) is the number of times the policy has drawn arm k during the first T rounds.
Objective: find a policy π* that, for a given K, minimizes the expected regret, ideally for any T and any {ν_i}, i = 1, . . . , K (equivalent to maximizing the expected sum of rewards obtained).
Index-based policy: (i) during the first K plays, play each arm once sequentially; (ii) for each t > K, compute for every arm k the score index(H^k_{t−1}, t), where H^k_{t−1} is the history of rewards collected from arm k; (iii) play the arm with the largest score.
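A generic index-based policy can be sketched as follows; this is a minimal illustration with our own names (`arms_sample` stands in for the unknown reward distributions ν_k), not code from the cited papers.

```python
import numpy as np

def play_bandit(arms_sample, K, T, index, rng=np.random.default_rng()):
    """Run an index-based policy: arms_sample(k) draws one reward from nu_k,
    index(rewards_k, t) scores arm k given its reward history at round t."""
    histories = [[] for _ in range(K)]      # H^k: rewards collected from arm k
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1                       # play each arm once first
        else:
            k = int(np.argmax([index(histories[j], t) for j in range(K)]))
        r = arms_sample(k)
        histories[k].append(r)
        total += r
    return total
```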
• 11. Step 1: Define a set of training problems (TP) for exploration-exploitation. A priori knowledge can (and should) be used for defining TP.
In our simulations, a training set is made of 100 bandit problems with Bernoulli distributions, two arms and the same horizon T. Every Bernoulli distribution is generated by selecting its expectation at random in [0, 1]. Three training sets are generated, one for each value of the training horizon T ∈ {10, 100, 1000}.
NB: in the spirit of supervised learning, the learned strategies are evaluated on test problems different from the training problems. The first three test sets are generated using the same procedure. The second three test sets are generated in a similar way, but with rewards drawn from Gaussian distributions truncated to the interval [0, 1]; the mean and the standard deviation of each Gaussian are selected at random in [0, 1]. Each test set contains 10,000 problems.
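For illustration, such a training set of two-armed Bernoulli problems could be generated as below (a sketch under the stated assumptions; the helper names and seed handling are ours).

```python
import numpy as np

def make_bernoulli_training_set(n_problems=100, n_arms=2, seed=0):
    """Each problem is a vector of arm expectations drawn uniformly in [0, 1];
    arm k of a problem pays 1 with probability p[k], else 0."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, 1.0, size=(n_problems, n_arms))

training_set = make_bernoulli_training_set()
reward_rng = np.random.default_rng(1)
sample_reward = lambda problem, k: float(reward_rng.random() < problem[k])  # one Bernoulli draw
```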
• 12. Step 2: Define a rich set of candidate exploration-exploitation strategies (CS).
The candidate exploration-exploitation strategies are index-based policies. The set of index-based policies contains all the policies whose index can be represented as $index(H^k_t, t) = \theta \cdot \phi(H^k_t, t)$, where θ is the vector of parameters to be optimized and φ(·, ·) a hand-designed feature extraction function.
We suggest using as feature extraction function the one that leads to indices of the form
$\sum_{i=0}^{p}\sum_{j=0}^{p}\sum_{k=0}^{p}\sum_{l=0}^{p} \theta_{i,j,k,l}\,(\sqrt{\ln t})^{i}\cdot\Big(\tfrac{1}{\sqrt{T_k}}\Big)^{j}\cdot(r_k)^{k}\cdot(\sigma_k)^{l}$
where the $\theta_{i,j,k,l}$ are parameters. With p = 2 in our simulations, this gives $3^4 = 81$ parameters to be learned.
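A direct transcription of this parametric index (with p = 2, hence 81 coefficients) might look like the following sketch; the flattening order of θ and the helper names are our own choices.

```python
import numpy as np

P = 2  # exponent range {0, 1, 2} for each of the four factors -> 3**4 = 81 features

def phi_bandit(history_k, t):
    """Feature vector for arm k at round t, built from sqrt(ln t), 1/sqrt(T_k),
    the empirical mean r_k and standard deviation sigma_k of its rewards."""
    T_k = len(history_k)
    r_k = float(np.mean(history_k))
    s_k = float(np.std(history_k))
    base = [np.sqrt(np.log(t)), 1.0 / np.sqrt(T_k), r_k, s_k]
    return np.array([base[0]**i * base[1]**j * base[2]**k * base[3]**l
                     for i in range(P + 1) for j in range(P + 1)
                     for k in range(P + 1) for l in range(P + 1)])

def learned_index(theta):
    """index(H^k_t, t) = theta . phi(H^k_t, t)"""
    return lambda history_k, t: float(np.dot(theta, phi_bandit(history_k, t)))
```

The resulting `learned_index(theta)` can be plugged directly into the `play_bandit` harness sketched above.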
• 13. Step 3: Define a performance criterion PC : CS × TP → R such that PC(strategy, training problem) gives the performance of the strategy on the training problem.
The performance criterion of an index-based policy on a multi-armed bandit problem is chosen as the expected regret obtained by this policy.
NB: again in the spirit of supervised learning, a regularization term could be added to this performance criterion to avoid overfitting.
• 14. Step 4: Solve $\arg\max_{strategy \in CS} \sum_{training\ prob. \in TP} PC(strategy, training\ prob.)$ using an optimisation tool.
The objective function has a complex relation with the parameters and may contain many local optima, so global optimisation approaches are advocated. Estimation of Distribution Algorithms (EDAs) are used here as the global optimization tool. EDAs rely on a probabilistic model to describe promising regions of the search space and to sample good candidate solutions. This is performed by repeating iterations that first sample a population of n_p candidates using the current probabilistic model and then fit a new probabilistic model to the b < n_p best candidates.
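A very simple EDA of this kind, using a diagonal Gaussian as probabilistic model (the population sizes and the Gaussian choice are our assumptions; the cited papers may use a different model), could look like:

```python
import numpy as np

def gaussian_eda(objective, dim, n_iterations=50, n_pop=100, n_best=20, seed=0):
    """Maximize `objective` by iteratively sampling candidates from a diagonal
    Gaussian and refitting it on the b best candidates of each population."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    best_theta, best_score = mean.copy(), -np.inf
    for _ in range(n_iterations):
        pop = rng.normal(mean, std, size=(n_pop, dim))      # sample n_p candidates
        scores = np.array([objective(theta) for theta in pop])
        elite = pop[np.argsort(scores)[-n_best:]]            # keep the b best
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
        if scores.max() > best_score:
            best_score = scores.max()
            best_theta = pop[int(np.argmax(scores))].copy()
    return best_theta
```

Such a routine plays the role of the `global_optimizer` in the generic recipe sketched earlier.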
• 15. Simulation results
Policy quality on a test set is measured by its average expected regret on this test set. The learned index-based policies are compared with 5 other index-based policies: UCB1, UCB1-BERNOULLI, UCB2, UCB1-NORMAL and UCB-V. These 5 policies are made of small formulas which may have parameters. Two cases are studied: (i) default values for the parameters, or (ii) parameters optimized on a training set.
• 16. Average expected regret on the Bernoulli test problems, policies based on small formulas with default parameters vs. learned policies:

Policy             Training   Parameters        T=10   T=100   T=1000
UCB1               -          C = 2             1.07   5.57    20.1
UCB1-BERNOULLI     -          -                 0.75   2.28    5.43
UCB1-NORMAL        -          -                 1.71   13.1    31.7
UCB2               -          α = 10^-3         0.97   3.13    7.26
UCB-V              -          c = 1, ζ = 1      1.45   8.59    25.5
ǫn-GREEDY          -          c = 1, d = 1      1.07   3.21    11.5
Learned policy     T=10       81 parameters     0.72   2.37    15.7
Learned policy     T=100      81 parameters     0.76   1.82    5.81
Learned policy     T=1000     81 parameters     0.83   2.07    3.95

Observations: (i) UCB1-BERNOULLI is the best policy based on small formulas. (ii) The learned policy for a given training horizon is always better than any policy based on a small formula when the test horizon equals the training horizon. (iii) The performance of the learned policies relative to the formula-based policies degrades when the training horizon differs from the test horizon.
• 17. Average expected regret on the Bernoulli test problems, policies based on small formulas with optimized parameters vs. learned policies:

Policy             Training   Parameters               T=10   T=100   T=1000
UCB1               T=10       C = 0.170                0.74   2.05    4.85
UCB1               T=100      C = 0.173                0.74   2.05    4.84
UCB1               T=1000     C = 0.187                0.74   2.08    4.91
UCB2               T=10       α = 0.0316               0.97   3.15    7.39
UCB2               T=100      α = 0.000749             0.97   3.12    7.26
UCB2               T=1000     α = 0.00398              0.97   3.13    7.25
UCB-V              T=10       c = 1.542, ζ = 0.0631    0.75   2.36    5.15
UCB-V              T=100      c = 1.681, ζ = 0.0347    0.75   2.28    7.07
UCB-V              T=1000     c = 1.304, ζ = 0.0852    0.77   2.43    5.14
ǫn-GREEDY          T=10       c = 0.0499, d = 1.505    0.79   3.86    32.5
ǫn-GREEDY          T=100      c = 1.096, d = 1.349     0.95   3.19    14.8
ǫn-GREEDY          T=1000     c = 0.845, d = 0.738     1.23   3.48    9.93
Learned policy     T=10       81 parameters            0.72   2.37    15.7
Learned policy     T=100      81 parameters            0.76   1.82    5.81
Learned policy     T=1000     81 parameters            0.83   2.07    3.95

Main observation: the learned policy for a given training horizon is always better than any policy based on a small formula when the test horizon equals the training horizon.
• 18. Average expected regret on the Gaussian test problems, policies based on small formulas with optimized parameters vs. learned policies:

Policy             Training   Parameters               T=10   T=100   T=1000
UCB1               T=10       C = 0.170                1.05   6.05    32.1
UCB1               T=100      C = 0.173                1.05   6.06    32.3
UCB1               T=1000     C = 0.187                1.05   6.17    33.0
UCB2               T=10       α = 0.0316               1.28   7.91    40.5
UCB2               T=100      α = 0.000749             1.33   8.14    40.4
UCB2               T=1000     α = 0.00398              1.28   7.89    40.0
UCB-V              T=10       c = 1.542, ζ = 0.0631    1.01   5.75    26.8
UCB-V              T=100      c = 1.681, ζ = 0.0347    1.01   5.30    27.4
UCB-V              T=1000     c = 1.304, ζ = 0.0852    1.13   5.99    27.5
ǫn-GREEDY          T=10       c = 0.0499, d = 1.505    1.01   7.31    67.57
ǫn-GREEDY          T=100      c = 1.096, d = 1.349     1.12   6.38    46.6
ǫn-GREEDY          T=1000     c = 0.845, d = 0.738     1.32   6.28    37.7
Learned policy     T=10       81 parameters            0.97   6.16    55.5
Learned policy     T=100      81 parameters            1.05   5.03    29.6
Learned policy     T=1000     81 parameters            1.12   5.61    27.3

Main observation: a learned policy is always better than any policy based on a small formula when the test horizon equals the training horizon.
• 19. Example 2: Learning for tree search. The case of look-ahead trees for deterministic planning
Deterministic planning: at time t, the agent is in state x_t ∈ X and may choose any action u_t ∈ U. The environment responds by moving into a new state x_{t+1} = f(x_t, u_t) and giving the agent the reward r_t = ρ(x_t, u_t). The agent wants to maximize its return, defined as $\sum_{t=0}^{\infty} \gamma^t r_t$ with γ ∈ [0, 1[. The functions f and ρ are known.
Look-ahead tree: a particular type of policy that represents, at every instant t, the set of all possible trajectories starting from x_t by a tree. A look-ahead tree policy explores this tree until the computational resources (measured here by the number of tree nodes developed) are exhausted. It then outputs the first action of the branch along which the highest discounted rewards are collected.
• 20. [Figure: a look-ahead tree rooted at the current state x_t, whose branches are labelled by the actions u^A, u^B and by the associated rewards ρ(x_t, u^A), ρ(x_t, u^B), ρ(x_{t+1}, ·), etc.; the past states x_{t−1}, x_{t−2} and past actions u_{t−1}, u_{t−2} are shown above the root.]
Scoring the terminal nodes for exploration: the tree is developed incrementally by always expanding first the terminal node with the highest score. Scoring functions are usually small formulas (a "best-first search" tree-exploration approach).
Objective: find an exploration strategy for the look-ahead tree policy that maximizes the return of the agent, whatever the initial state of the system.
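The best-first development of the look-ahead tree can be sketched as follows; this is a minimal illustration with our own node representation, where `f`, `rho` and the scoring function are supplied by the user.

```python
import heapq, itertools

def lookahead_action(x0, actions, f, rho, score, budget, gamma=0.9):
    """Grow a look-ahead tree from x0 by always expanding the open node with the
    highest score; return the first action of the best discounted-reward branch.
    A node is the tuple (state, depth, last reward, discounted return, first action)."""
    counter = itertools.count()              # tie-breaker so nodes are never compared
    open_nodes, best, developed = [], None, 0

    def push(x, d, r, ret, first_u):
        node = (x, d, r, ret, first_u)
        heapq.heappush(open_nodes, (-score(node), next(counter), node))
        return node

    for u in actions:                        # create the children of the root
        r = rho(x0, u)
        node = push(f(x0, u), 1, r, r, u)
        developed += 1
        if best is None or node[3] > best[3]:
            best = node
    while open_nodes and developed < budget:
        _, _, (x, d, r, ret, first_u) = heapq.heappop(open_nodes)
        for u in actions:                    # expand the highest-scoring open node
            r_u = rho(x, u)
            node = push(f(x, u), d + 1, r_u, ret + gamma**d * r_u, first_u)
            developed += 1
            if node[3] > best[3]:
                best = node
    return best[4]                           # first action along the best branch
```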
• 21. Step 1: Define a set of training problems (TP). A priori knowledge can (and should) be used for defining TP.
The training problems are made of the elements of the deterministic planning problem; they differ only by their initial state. If information is available on the true initial state of the problem, it should be used for defining the set of initial states X_0 from which the training set is built. If the initial state is known perfectly well, one may consider only one training problem.
• 22. Step 2: Define a rich set of candidate exploration strategies (CS).
The candidate exploration strategies all grow the tree incrementally, always expanding the terminal node with the highest score according to a scoring function. The set of scoring functions contains all the functions that take as input a tree and a terminal node and that can be represented by $\theta \cdot \phi(tree, terminal\ node)$, where θ is the vector of parameters to be optimized and φ(·, ·) a hand-designed feature extraction function.
For problems where X ⊂ R^m, we use as φ the vector (x[1], . . . , x[m], d·x[1], . . . , d·x[m], r·x[1], . . . , r·x[m]), where x is the state associated with the terminal node, d is its depth and r is the reward collected before reaching this node.
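Under that reading (state components, state components scaled by the node depth, and state components scaled by the last reward; this interpretation of the feature vector is our assumption), a linear scoring function over the node representation used in the sketch above could be:

```python
import numpy as np

def phi_node(node):
    """node = (x, d, r, ret, first_u): features are x, d*x and r*x (3m in total)."""
    x, d, r = np.asarray(node[0], dtype=float), node[1], node[2]
    return np.concatenate([x, d * x, r * x])

def make_score(theta):
    """Linear scoring function theta . phi(node) for the best-first development."""
    return lambda node: float(np.dot(theta, phi_node(node)))
```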
• 23. Step 3: Define a performance criterion PC : CS × TP → R such that PC(strategy, training problem) gives the performance of the strategy on the training problem.
The performance criterion of a scoring function is the return, on the training problem, of the look-ahead tree policy it leads to. The computational budget for developing the tree when evaluating the performance criterion should be chosen equal to the computational budget available when controlling the 'real' system.
• 24. Step 4: Solve $\arg\max_{strategy \in CS} \sum_{training\ prob. \in TP} PC(strategy, training\ prob.)$ using an optimisation tool.
The objective function has a complex relation with the parameters and may contain many local optima, so global optimisation approaches are advocated. Estimation of Distribution Algorithms (EDAs) are again used as the global optimization tool.
NB: Gaussian processes for global optimisation have also been tested. They are more complex optimisation tools, but they have been found to require significantly fewer evaluations of the objective function to identify a near-optimal solution.
• 25. Simulation results
Scoring functions based on small formulas, used for comparison:
score_mindepth = −(depth of the terminal node)
score_greedy1 = reward obtained before reaching the terminal node
score_greedy2 = sum of discounted rewards from the top of the tree to the terminal node
score_optimistic = score_greedy2 + $\frac{\gamma^{d}}{1 - \gamma} B_r$, where d is the depth of the terminal node and $B_r$ is an upper bound on the reward function.
In the results hereafter, the performance of a scoring function is evaluated by averaging the return of the agent over a set of initial states. Results are generated for different computational budgets; the scoring function has always been optimized for the budget used for performance evaluation.
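These baselines translate directly into scoring functions over the node representation (state x, depth d, last reward r, discounted return) used in the sketches above; `B_r` is the assumed known bound on the reward function.

```python
def score_mindepth(node):                 # expand shallow nodes first
    return -node[1]

def score_greedy1(node):                  # reward obtained just before the node
    return node[2]

def score_greedy2(node):                  # discounted rewards from the top to the node
    return node[3]

def make_score_optimistic(gamma, B_r):
    # greedy2 plus an optimistic bound on everything still collectable below depth d
    return lambda node: node[3] + (gamma ** node[1]) / (1.0 - gamma) * B_r
```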
• 26. 2-dimensional test problem
Dynamics: $(y_{t+1}, v_{t+1}) = (y_t, v_t) + 0.1\,(v_t, u_t)$; X = R^2; U = {−1, +1}; reward $\rho((y_t, v_t), u_t) = \max(1 - y_{t+1}^2, 0)$; γ = 0.9.
The initial states are chosen at random in [−1, 1] × [−2, 2] for building the training problems; the same states are used for evaluation.
Main observations: (i) optimized trees outperform the other methods; (ii) a small computational budget suffices to reach near-optimal performance.
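This benchmark is simple enough to state fully in a few lines matching the interface of the look-ahead sketch above; the 0.1 time step, the reward clipping and γ = 0.9 come straight from the slide, while the test state and budget in the usage line are arbitrary choices of ours.

```python
def f(x, u):
    """Deterministic dynamics: x = (y, v), u in {-1, +1}, time step 0.1."""
    y, v = x
    return (y + 0.1 * v, v + 0.1 * u)

def rho(x, u):
    """Reward depends on the next position: max(1 - y_{t+1}^2, 0)."""
    y_next, _ = f(x, u)
    return max(1.0 - y_next ** 2, 0.0)

# One decision of an optimistic look-ahead tree policy from state (y, v) = (0.5, 1.0),
# with a budget of 200 developed nodes (B_r = 1 bounds this reward function).
u0 = lookahead_action((0.5, 1.0), (-1.0, +1.0), f, rho,
                      make_score_optimistic(0.9, 1.0), budget=200, gamma=0.9)
```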
• 27. HIV infection control problem
A problem with six state variables, two control actions and highly nonlinear dynamics. A single training 'initial state' is used. The policies are first evaluated when the system starts from this state.
• 28. Main observations: (i) optimized look-ahead trees clearly outperform the other methods; (ii) a very small computational budget is required for reaching near-optimal performance.
• 29. The policies are then evaluated when the system starts from a state rather far from the initial state used for training. Main observation: optimized look-ahead trees perform very well even when the 'training state' is not the test state.
• 30. A side note on these optimized look-ahead trees
Optimizing look-ahead trees is a direct policy search method in which the parameters to be optimized are those of the tree-exploration procedure rather than those of standard function approximators (e.g., neural networks, RBFs), as is usually the case. These policies require very little memory (compared to dynamic programming methods and even other policy search techniques); not many parameters need to be optimized; they are robust with respect to the initial states chosen for training; etc.
The right problem statement for comparing different techniques: "How to compute the best policy given an amount of off-line and on-line computational resources, knowing the environment fully but not the initial state?"
• 31. Complement to other techniques rather than competitor: look-ahead tree policies could be used as an "on-line correction mechanism" for policies computed off-line.
[Figure: the same look-ahead tree rooted at x_t, used for tree exploration and selection of the action u_t, where another policy computed off-line is used for scoring the nodes.]
• 32. Example 3: Learning for exploration-exploitation in finite MDPs
The instantiation of the general learning paradigm for exploration-exploitation in RL to this case is very similar to what has been described for multi-armed bandit problems. Good results have already been obtained by considering a large set of candidate exploration-exploitation policies made of small formulas. The next step would be to test the approach on MDPs with continuous state spaces.
• 33. Automatic learning of (small) formulas for exploration-exploitation?
Remember: there are pros for using small formulas! The learning approach for exploration-exploitation in RL could help to find new (small) formulas people have not thought of before, by considering as candidate strategies a large set of (small) formulas. An example of an index formula found to perform very well on bandit problems by such an approach: $r_k + \frac{C}{T_k}$.
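Read this way (the fraction C/T_k is our reconstruction of the garbled layout, and C a constant to be tuned), the discovered index is a one-liner that plugs into the bandit harness sketched earlier:

```python
def discovered_index(C):
    """Index r_k + C / T_k: empirical mean plus an exploration bonus that
    decays with the number of times T_k the arm has been played."""
    return lambda history_k, t: sum(history_k) / len(history_k) + C / len(history_k)
```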
• 34. Conclusion and future works
Learning for exploration/exploitation achieves excellent performances, which suggests a bright future for this approach. Several directions for future research:
Optimisation criterion: the approach targets an exploration-exploitation policy that gives the best average performance on a set of problems. Risk-averse criteria could also be considered, and regularization terms could be added to avoid overfitting.
Design of customized optimisation tools: particularly relevant when the set of candidate policies is made of formulas.
Theoretical analysis: e.g., what are the properties of the learned strategies in generalization?
• 35. Presentation based on (in order of appearance):
"Learning to play K-armed bandit problems". F. Maes, L. Wehenkel and D. Ernst. In Proceedings of the 4th International Conference on Agents and Artificial Intelligence (ICAART 2012), Vilamoura, Algarve, Portugal, 6-8 February 2012. (8 pages)
"Optimized look-ahead tree search policies". F. Maes, L. Wehenkel and D. Ernst. In Proceedings of the 9th European Workshop on Reinforcement Learning (EWRL 2011), Athens, Greece, September 9-11, 2011. (12 pages)
"Learning exploration/exploitation strategies for single trajectory reinforcement learning". M. Castronovo, F. Maes, R. Fonteneau and D. Ernst. In Proceedings of the 10th European Workshop on Reinforcement Learning (EWRL 2012), Edinburgh, Scotland, June 30 - July 1, 2012. (8 pages)
"Automatic discovery of ranking formulas for playing with multi-armed bandits". F. Maes, L. Wehenkel and D. Ernst. In Proceedings of the 9th European Workshop on Reinforcement Learning (EWRL 2011), Athens, Greece, September 9-11, 2011. (12 pages)
"Optimized look-ahead tree policies: a bridge between look-ahead tree policies and direct policy search". T. Jung, D. Ernst, L. Wehenkel and F. Maes. Submitted.
"Learning exploration/exploitation strategies: the multi-armed bandit case". F. Maes, L. Wehenkel and D. Ernst. Submitted.