- The document discusses using function approximators for batch reinforcement learning problems where the only available information is a set of trajectories.
- It argues that function approximators have limitations in addressing risk-sensitive criteria, safety, optimal use of trajectories, and generating new experiments.
- An alternative approach called "rebuilding trajectories" is proposed, which does not use function approximators. It involves analyzing and recombining pieces of the original trajectories to compute policies and estimates.
Data Fusion is the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source. We show how this may be accomplished in the Bayesian paradigm by constructing non-exchangeable hierarchical models with submodels for each of the several data sources. In the UQ setting, where we wish to synthesize evidence from large and slow Simulation models and possibly other data sources, it can be much more efficient to construct Gaussian Process Emulators of the Simulation models, and perform Data Fusion in the Emulators rather than the Simulators. We introduce an abstract model setting for Fusion, and illustrate several examples from a single case study: the forecasting of hazard from Pyroclastic Density Currents (PDCs) near an active volcano.
These are the slides for my Master's course on Monte Carlo Statistical Methods, given in conjunction with the book Monte Carlo Statistical Methods, co-written with George Casella.
Conventional tools in array signal processing have traditionally relied on the availability of a large number of samples acquired at each sensor or array element (antenna, hydrophone, microphone, etc.). Large sample size assumptions typically guarantee the consistency of estimators, detectors, classifiers and multiple other widely used signal processing procedures. However, practical scenario and array mobility conditions, together with the need for low latency and reduced scanning times, impose strong limits on the total number of observations that can be effectively processed. When the number of collected samples per sensor is small, conventional large sample asymptotic approaches are no longer relevant. Recently, large random matrix theory tools have been proposed in order to address the small sample support problem in array signal processing. In fact, it has been shown that the most important and longstanding problems in this field can be reformulated and studied according to this asymptotic paradigm. By exploiting the latest advances in large random matrix theory and high dimensional statistics, a novel and unconventional methodology can be established, which provides an unprecedented treatment of the finite sample-per-sensor regime. In this talk, we will see that random matrix theory establishes a unifying framework for the study of array signal processing techniques under the constraint of a small number of observations per sensor, which has radically changed the way in which array processing methodologies have been traditionally established. We will show how this unconventional way of revisiting classical array processing has led to major advances in the design and analysis of signal processing techniques for multidimensional observations.
The main contribution is an on-line heuristic law that sets the training process and modifies the NN topology, based on the Levenberg-Marquardt method.
An Area Predictor Filter, a nonlinear autoregressive model based on neural networks for time series forecasting, is introduced.
The core of the proposal is to analyze the roughness (long- or short-term stochastic dependence) of a time series, as evaluated by the Hurst parameter (H).
The proposed law adapts the topology of the filter in real time at each stage of the time series, changing the number of patterns, the number of iterations and the input vector length.
The main results show good performance of the predictor, in particular for time series whose H parameter indicates high signal roughness, as evaluated by the HS and HA estimators, respectively.
These results encourage continued work on new adjustment algorithms for time series modeling natural phenomena.
Knowledge of cause-effect relationships is central to the field of climate science, supporting mechanistic understanding, observational sampling strategies, experimental design, model development and model prediction. While the major causal connections in our planet's climate system are already known, there is still potential for new discoveries in some areas. The purpose of this talk is to make this community familiar with a variety of available tools to discover potential cause-effect relationships from observed or simulation data. Some of these tools are already in use in climate science; others have only emerged in recent years. None of them are miracle solutions, but many can provide important pieces of information to climate scientists. An important way to use such methods is to generate cause-effect hypotheses that climate experts can then study further. In this talk we will (1) introduce key concepts important for causal analysis; (2) discuss some methods based on the concepts of Granger causality and Pearl causality; (3) point out some strengths and limitations of these approaches; and (4) illustrate such methods using a few real-world examples from climate science.
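To give a concrete flavor of the Granger-causality family of methods mentioned above, here is a minimal sketch in Python, assuming the statsmodels package is available; the synthetic series x and y and the lag order are illustrative, not from the talk.

```python
# Minimal Granger-causality sketch: does x help predict y beyond y's own past?
# Synthetic data in which y depends on x lagged by 2 steps.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.roll(x, 2) + 0.5 * rng.normal(size=500)  # circular shift, fine for a demo

# grangercausalitytests expects a 2-column array [effect, candidate cause];
# small p-values at some lag suggest x "Granger-causes" y.
grangercausalitytests(np.column_stack([y, x]), maxlag=4)
```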
This 10-hour class is intended to give students the foundations for solving statistical problems empirically. Talk 1 serves as an introduction to the statistical software R, and presents how to calculate basic measures such as the mean, variance, correlation and Gini index. Talk 2 shows how the central limit theorem and the law of large numbers work empirically. Talk 3 presents the point estimate, the confidence interval and the hypothesis test for the most important parameters. Talk 4 introduces the linear regression model and Talk 5 the bootstrap world. Talk 5 also presents a simple example of a Markov chain.
All the talks are supported by scripts written in the R language.
For the discovery of a regression relationship between y and x, a vector of p potential predictors, the flexible nonparametric nature of BART (Bayesian Additive Regression Trees) allows for a much richer set of possibilities than restrictive parametric approaches. To exploit the potential monotonicity of the predictors, we introduce mBART, a constrained version of BART that incorporates monotonicity with a multivariate basis of monotone trees, thereby avoiding the further confines of a full parametric form. Using mBART to estimate such effects yields (i) function estimates that are smoother and more interpretable, (ii) better out-of-sample predictive performance and (iii) less post-data uncertainty. By using mBART to simultaneously estimate both the increasing and the decreasing regions of a predictor, mBART opens up a new approach to the discovery and estimation of the decomposition of a function into its monotone components.
(This is joint work with H. Chipman, R. McCulloch and T. Shively).
Covariance matrices are central to many adaptive filtering and optimisation problems. In practice, they have to be estimated from a finite number of samples; on this, I will review some known results from spectrum estimation and multiple-input multiple-output communications systems, and how properties that are assumed to be inherent in covariance and power spectral densities can easily be lost in the estimation process. I will discuss new results on space-time covariance estimation, and how the estimation from finite sample sets will impact on factorisations such as the eigenvalue decomposition, which is often key to solving the introductory optimisation problems. The purpose of the presentation is to give you some insight into estimating statistics as well as to provide a glimpse on classical signal processing challenges such as the separation of sources from a mixture of signals.
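To make the small-sample issue tangible, here is a short numerical sketch (my own illustration, not from the talk): even when the true covariance is the identity, the eigenvalues of a sample covariance estimated from few snapshots scatter widely, which is exactly the kind of structure loss discussed above.

```python
# Eigenvalue spread of a sample covariance under small sample support.
import numpy as np

rng = np.random.default_rng(1)
M, N = 8, 16                      # M array elements, N snapshots (N/M small)
X = rng.normal(size=(M, N))       # snapshots drawn from N(0, I): true cov = I

sample_cov = (X @ X.T) / N        # sample covariance estimate
print(np.linalg.eigvalsh(sample_cov))  # true eigenvalues are all 1;
                                       # the estimates spread far from 1
```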
This presentation covers the following topics:
What is Animation?
History
Types of Animation
Advantages & Disadvantages
Uses
Any request for a presentation is accepted: nbhavsar506@gmail.com
Professional or business presentations will be charged.
An overview of gradient descent optimization algorithms, by Hakky St
These slides are based on a paper about gradient descent.
These are the slides for a study meeting on gradient descent.
I used the paper below, which is a very good source of information on gradient descent.
https://arxiv.org/abs/1609.04747
This presentation covers the various types of multimedia, the advantages and disadvantages of their use as well as how multimedia can be used in education.
Research internship on optimal stochastic theory with financial application u..., by Asma Ben Slimene
This is a presentation of my second-year internship on optimal stochastic theory, how it can be applied to some financial applications, and how such problems can be solved using finite difference methods.
Enjoy it!
Presentation on stochastic control problem with financial applications (Merto..., by Asma Ben Slimene
This is an introduction to optimal stochastic control theory with two applications in finance: the Merton portfolio problem and the investment/consumption problem, with numerical results obtained using a finite differences approach.
Random Matrix Theory and Machine Learning - Part 3, by Fabian Pedregosa
ICML 2021 tutorial on random matrix theory and machine learning.
Part 3 covers: (1) motivation: average-case versus worst-case in high dimensions; (2) algorithm halting times (runtimes); (3) outlook.
International Journal of Mathematics and Statistics Invention (IJMSI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJMSI publishes research articles and reviews within the whole field of Mathematics and Statistics, new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The published papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
A walk through the intersection between machine learning and mechanistic mode..., by JuanPabloCarbajal3
Talk at EURECOM, France.
It overviews regression in several of its forms: regularized, constrained, and mixed. It builds the bridge between machine learning and dynamical models.
This is a talk where Prof. Damien ERNST explains how Reinforcement Learning could be used to tackle various problems in the field of energy markets and for the energy transition.
This is the conference I gave for the kickoff meeting of AI4Energy.
You can also access a power point version of the presentation with better quality images and the video from this URL address: http://hdl.handle.net/2268/252633
In this presentation, we discuss ambitious solutions for fighting climate change and present the first results of the Katabata project. The goal of this project was to install weather stations in very windy parts of Greenland, as a first step to harvest wind energy there.
Download a power point version of this presentation with higher quality images at the following address: https://orbi.uliege.be/handle/2268/251827
In this presentation, we discuss several major engineering projects that should be put in place for fighting climate change at a cheap cost. Among others: a global electrical grid, carbon capture technologies, power-to-gas devices.
Harvesting wind energy in Greenland: a project for Europe and a huge step tow..., by Université de Liège (ULg)
Current global environmental challenges require vigorous and diverse actions in the energy sector. One solution that has recently attracted interest consists in harnessing high-quality variable renewable energy resources in remote locations, while using transmission links to transport the power to end users. In this context, a comparison of western European and Greenland wind regimes is proposed. By leveraging a regional atmospheric model specifically designed to accurately capture polar phenomena, local climatic features of southern Greenland are identified to be particularly conducive to extensive renewable electricity generation from wind. A methodology to assess how connecting remote locations to major demand centres would benefit the latter from a resource availability standpoint is introduced and applied to the aforementioned Europe-Greenland case study, showing superior and complementary wind generation potential in the considered region of Greenland with respect to selected European sites.
This document presents the decree on renewable energy communities that will soon come into force in Wallonia, the French-speaking part of Belgium. It is a piece of avant-garde legislation in the field of electrical grids.
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim..., by Université de Liège (ULg)
This presentation explores the potential of power-to-gas technologies for a deep decarbonization of our economies. A case study carried out on the Belgian energy system is discussed.
This is a summary of the work I have done over these last years in power systems and in reinforcement learning (including my very latest work on neuromodulation in reinforcement learning).
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o..., by Université de Liège (ULg)
Presentation given on the 3rd of December 2018 in Paris, at the event where the SEE awarded me the prestigious Blondel Medal for my research work. This presentation gives a brief overview of the work I have carried out over these last twenty years.
This is the material of a 4 hour class given in the framework of a EES-UETP class (Electrical Energy Systems - University Enterprise Training Partnership). The first part gives a brief overview of the applications of reinforcement learning for solving decision-making problems related to electrical systems. The second part explains how to build intelligent agents using reinforcement learning.
Electricity retailing in Europe: remarkable events (with a special focus on B..., by Université de Liège (ULg)
This short presentation gives a quick overview of the significant changes that have happened in the electricity retailing business in Europe, since the liberalization of the sector.
Very nice presentation done on the Belgian offshore wind potential. It has been done by a student in the "Sustainable energy class" that I am giving at the ULiège. http://blogs.ulg.ac.be/damien-ernst/genv0002-1-sustainable-energy/
Time to make a choice between a fully liberal or fully regulated model for th..., by Université de Liège (ULg)
Prof. Damien Ernst explaining why it is time to make a choice between a fully liberal or fully regulated model for the electrical industry. Link to a video of the conference given at the end of the presentation.
Conference on the electrification of the Democratic Republic of the Congo. Link to a video of the conference at the end of the presentation (in French).
We have compiled the most important slides from each speaker's presentation. This year’s compilation, available for free, captures the key insights and contributions shared during the DfMAy 2024 conference.
Water billing management system project report.pdf, by Kamal Acharya
Our project, entitled “Water Billing Management System”, aims to generate water bills with all the charges and penalties. The manual system currently employed is extremely laborious and quite inadequate; it only makes the process more difficult.
The aim of our project is to develop a system meant to partially computerize the work performed in the Water Board, such as generating the monthly water bill, recording the units of water consumed, and storing customer records and previous unpaid records.
We used HTML/PHP as the front end and MySQL as the back end for developing our project. HTML is primarily a visual design environment: we can create an application by designing the forms that make up the user interface, adding application code to the forms and to objects such as buttons and text boxes, and adding any required support code in additional modules.
MySQL is a free, open-source database that facilitates the effective management of databases by connecting them to the software. It is a stable, reliable and powerful solution with advanced features and advantages, such as data security.
ACEP Magazine 4th edition, launched on 05.06.2024, by Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on lifetime achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT, by jpsjournal1
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon reserves and the ancient silk trade route, along with China's diplomatic endeavours in the area, has been referred to as the "New Great Game." This research centres on the power struggle, considering geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil politics, and conventional and nontraditional security are all explored and explained by the researcher. Using Mackinder's Heartland, Spykman's Rimland, and Hegemonic Stability theories, it examines China's role in Central Asia. This study adheres to the empirical epistemological method and has taken care of objectivity. It critically analyzes primary and secondary research documents to elaborate the role of China's geo-economic outreach in Central Asian countries and its future prospects. According to this study, China is seeing significant success in trade, pipeline politics, and gaining influence over other governments; this success may be attributed to the effective utilisation of key tools such as the Shanghai Cooperation Organisation and the Belt and Road Economic Initiative.
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines, by Christina Lin
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
HEAP SORT ILLUSTRATED WITH HEAPIFY AND BUILD-HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique based on the Binary Heap data structure. It is similar to selection sort: we first find the extreme element and place it at its final position, then repeat the same process for the remaining elements.
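A minimal runnable sketch of the idea just described (a max-heap variant sorting in ascending order; the function names are illustrative, not from the presentation):

```python
# Heap sort: build a max-heap, then repeatedly move the max to its final slot.
def heapify(a, n, i):
    # Sift the element at index i down so the subtree rooted at i
    # satisfies the max-heap property (largest element at the root).
    largest, left, right = i, 2 * i + 1, 2 * i + 2
    if left < n and a[left] > a[largest]:
        largest = left
    if right < n and a[right] > a[largest]:
        largest = right
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        heapify(a, n, largest)

def heap_sort(a):
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):   # build the heap bottom-up
        heapify(a, n, i)
    for end in range(n - 1, 0, -1):       # move the max to the end, shrink heap
        a[0], a[end] = a[end], a[0]
        heapify(a, end, 0)

data = [5, 1, 4, 2, 8]
heap_sort(data)
print(data)  # [1, 2, 4, 5, 8]
```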
Low power architecture of logic gates using adiabatic techniques, by nooriasukmaningtyas
The growing significance of portable systems to limit power consumption in ultra-large-scale-integration chips of very high density has recently led to rapid and inventive progress in low-power design. The most effective technique is adiabatic logic circuit design in energy-efficient hardware. This paper presents two adiabatic approaches for the design of low power circuits: modified positive feedback adiabatic logic (modified PFAL) and direct current diode based positive feedback adiabatic logic (DC-DB PFAL). Logic gates are the preliminary components in any digital circuit design; by improving the performance of basic gates, one can improve the performance of the whole system. In this paper, proposed circuit designs of the low power architecture of OR/NOR, AND/NAND, and XOR/XNOR gates are presented using the said approaches, and their results are analyzed for power dissipation, delay, power-delay product and rise time, and compared with other adiabatic techniques as well as conventional complementary metal oxide semiconductor (CMOS) designs reported in the literature. It has been found that the designs using the DC-DB PFAL technique outperform the modified PFAL technique, with improvements of 65% for the NOR gate, 7% for the NAND gate and 34% for the XNOR gate at 10 MHz.
Beyond function approximators for batch mode reinforcement learning: rebuilding trajectories
1. Beyond function approximators for
batch mode reinforcement learning:
rebuilding trajectories
Damien Ernst
University of Liège, Belgium
2010 NIPS Workshop on
“Learning and Planning from Batch Time Series Data”
2. Batch mode Reinforcement Learning ≃
Learning a high-performance policy for a sequential decision
problem where:
• a numerical criterion is used to define the performance of
a policy. (An optimal policy is the policy that maximizes
this numerical criterion.)
• “the only” (or “most of the”) information available on the
sequential decision problem is contained in a set of
trajectories.
Batch mode RL stands at the intersection of three worlds:
optimization (maximization of the numerical criterion),
system/control theory (sequential decision problem) and
machine learning (inference from a set of trajectories).
3. A typical batch mode RL problem
Discrete-time dynamics:
$x_{t+1} = f(x_t, u_t, w_t)$, $t = 0, 1, \ldots, T-1$, where $x_t \in X$, $u_t \in U$ and $w_t \in W$. $w_t$ is drawn at every time step according to $P_w(\cdot)$.
Reward observed after each system transition:
$r_t = \rho(x_t, u_t, w_t)$ where $\rho : X \times U \times W \rightarrow \mathbb{R}$ is the reward function.
Type of policies considered: $h : \{0, 1, \ldots, T-1\} \times X \rightarrow U$.
Performance criterion: expected sum of the rewards observed over the T-length horizon:
$PC^h(x) = J^h(x) = \mathbb{E}_{w_0, \ldots, w_{T-1}}\left[\sum_{t=0}^{T-1} \rho(x_t, h(t, x_t), w_t)\right]$ with $x_0 = x$ and $x_{t+1} = f(x_t, h(t, x_t), w_t)$.
Available information: a set of elementary pieces of trajectories $F_n = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$, where $y^l$ is the state reached after taking action $u^l$ in state $x^l$ and $r^l$ is the instantaneous reward associated with the transition. The functions $f$, $\rho$ and $P_w$ are unknown.
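To fix ideas, here is a minimal sketch (all names are mine, not from the slides) of the only objects a batch mode RL algorithm gets to see in this setting:

```python
# The batch setting: a set F_n of one-step transitions and a policy signature.
from typing import Callable, List, Tuple

Transition = Tuple[float, float, float, float]   # (x^l, u^l, r^l, y^l)
Policy = Callable[[int, float], float]           # h(t, x) -> u, t = 0..T-1

F_n: List[Transition] = [
    # (state, action, observed reward, successor state) tuples collected
    # beforehand; f, rho and P_w themselves are never available.
    (0.1, -0.05, 0.15, 0.02),
    (0.2, 0.10, 0.12, 0.35),
]
```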
4. Batch mode RL and function approximators
Training function approximators (radial basis functions, neural
nets, trees, etc) using the information contained in the set of
trajectories is a key element to most of the resolution schemes
for batch mode RL problems with state-action spaces having a
large (infinite) number of elements.
Two typical uses of FAs for batch mode RL:
• the FAs model the sequential decision problem (in our typical problem $f$, $\rho$ and $P_w$). The model is afterwards exploited as if it were the real problem to compute a high-performance policy.
• the FAs represent (state-action) value functions which are used in iterative schemes so as to converge to a (state-action) value function from which a high-performance policy can be computed. Iterative schemes based on the dynamic programming principle (e.g., LSPI, FQI, Q-learning).
5. Why look beyond function approximators ?
FAs based techniques: mature, can successfully solve many
real life problems but:
1. not well adapted to risk sensitive performance criteria
2. may lead to unsafe policies - poor performance
guarantees
3. may make suboptimal use of near-optimal trajectories
4. offer little clues about how to generate new experiments in
an optimal way
6. 1. not well adapted to risk sensitive performance criteria
An example of risk sensitive performance criterion:
$PC^h(x) = \begin{cases} -\infty & \text{if } P\left(\sum_{t=0}^{T-1} \rho(x_t, h(t, x_t), w_t) < b\right) > c \\ J^h(x) & \text{otherwise.} \end{cases}$
FAs with dynamic programming: very problematic because
(state-action) value functions need to become functions that
take as values “probability distributions of future rewards” and
not “expected rewards”.
FAs with model learning: more likely to succeed; but what
about the challenges of fitting the FAs to model the distribution
of future states reached (rewards collected) by policies and not
only an average behavior ?
7. 2. may lead to unsafe policies - poor performance
guarantees
• Benchmark: puddle world. RL algorithm: FQI with trees.
• Trajectory set covering the puddle ⇒ optimal policy.
• Trajectory set not covering the puddle ⇒ suboptimal (unsafe) policy.
Typical performance guarantee in the deterministic case for
FQI = (estimated return by FQI of the policy it outputs minus
constant×(’size’ of the largest area of the state space not
covered by the sample)).
8. 3. may make suboptimal use of near-optimal
trajectories
Suppose a deterministic batch mode RL problem and that in the set of trajectories there is a trajectory:
$(x_0^{\text{opt. traj.}}, u_0, r_0, x_1, u_1, r_1, x_2, \ldots, x_{T-2}, u_{T-2}, r_{T-2}, x_{T-1}, u_{T-1}, r_{T-1}, x_T)$
where the $u_t$'s have been selected by an optimal policy.
Question: Which batch mode RL algorithms will output a policy which is optimal for the initial state $x_0^{\text{opt. traj.}}$ whatever the other trajectories in the set? Answer: Not that many, and certainly not those using parametric FAs.
In my opinion: batch mode RL algorithms can only be
successful on large-scale problems if (i) in the set of
trajectories, many trajectories have been generated by
(near-)optimal policies (ii) the algorithms exploit very well the
information contained in those (near-)optimal trajectories.
9. 4. offer little clues about how to generate new
experiments in an optimal way
Many real-life problems are variants of batch mode RL
problems for which (a limited number of) additional trajectories
can be generated (under various constraints) to enrich the
initial set of trajectories.
Question: How should these new trajectories be generated ?
Many approaches based on the analysis of the FAs produced
by batch mode RL methods have been proposed; results are
mixed.
10. Rebuilding trajectories
We conjecture that mapping the set of trajectories into FAs generally leads to the loss of essential information for addressing these four issues ⇒ We have developed a new line of research for solving batch mode RL that does not use FAs at all.
This line of research is articulated around the rebuilding of artificial (likely "broken") trajectories using the set of trajectories given as input of the batch mode RL problem; a rebuilt trajectory is defined by the elementary pieces of trajectory it is made of.
The rebuilt trajectories are analysed to compute various things:
a high-performance policy, performance guarantees, where to
sample, etc.
11. [Figure] BLUE ARROW = elementary piece of trajectory. Left: set of trajectories given as input of the batch RL problem. Right: examples of 5-length rebuilt trajectories made from elements of this set.
12. Model-Free Monte Carlo Estimator
Building an oracle that estimates the performance of a policy is an important problem in batch mode RL.
Indeed, if such an oracle is available, the problem of estimating a high-performance policy can be reduced to an optimization problem over a set of candidate policies.
If a model of the sequential decision problem is available, a Monte Carlo estimator (i.e., rollouts) can be used to estimate the performance of a policy.
We detail an approach that estimates the performance of a
policy by rebuilding trajectories so as to mimic the behavior of
the Monte Carlo estimator.
13. Context in which the approach is presented
Discrete-time dynamics: $x_{t+1} = f(x_t, u_t, w_t)$, $t = 0, 1, \ldots, T-1$, where $x_t \in X$, $u_t \in U$ and $w_t \in W$. $w_t$ is drawn at every time step according to $P_w(\cdot)$.
Reward observed after each system transition: $r_t = \rho(x_t, u_t, w_t)$ where $\rho : X \times U \times W \rightarrow \mathbb{R}$ is the reward function.
Type of policies considered: $h : \{0, 1, \ldots, T-1\} \times X \rightarrow U$.
Performance criterion: expected sum of the rewards observed over the T-length horizon: $PC^h(x) = J^h(x) = \mathbb{E}_{w_0, \ldots, w_{T-1}}\left[\sum_{t=0}^{T-1} \rho(x_t, h(t, x_t), w_t)\right]$ with $x_0 = x$ and $x_{t+1} = f(x_t, h(t, x_t), w_t)$.
Available information: a set of elementary pieces of trajectories $F_n = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$. $f$, $\rho$ and $P_w$ are unknown.
The approach is aimed at estimating $J^h(x)$ from $F_n$.
14. Monte Carlo Estimator
Generate nbTraj T-length trajectories by simulating the system starting from the initial state $x_0$; for every trajectory compute the sum of rewards collected; average these sums of rewards over the nbTraj trajectories to get an estimate $MCE^h(x_0)$ of $J^h(x_0)$.
[Figure: illustration with nbTraj = 3 and T = 5. Three simulated trajectories start from $x_0$; e.g. $r_3 = \rho(x_3, h(3, x_3), w_3)$ and $x_4 = f(x_3, h(3, x_3), w_3)$ with $w_3 \sim P_w(\cdot)$; sum rew. traj. $1 = \sum_{i=0}^{4} r_i$; $MCE^h(x_0) = \frac{1}{3}\sum_{i=1}^{3}$ sum rew. traj. $i$.]
Bias $MCE^h(x_0) = \mathbb{E}_{nbTraj \cdot T \text{ rand. var. } w \sim P_w(\cdot)}\left[MCE^h(x_0) - J^h(x_0)\right] = 0$
Var. $MCE^h(x_0) = \frac{1}{nbTraj}$ (variance of the sum of rewards along a trajectory)
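A sketch of this estimator (assuming, unlike the batch setting, that a simulator f, rho, draw_w is available; all names are mine):

```python
# Monte Carlo estimator MCE^h(x0): average return of nb_traj simulated rollouts.
import numpy as np

def mc_estimate(x0, h, f, rho, draw_w, T, nb_traj, rng):
    returns = []
    for _ in range(nb_traj):
        x, total = x0, 0.0
        for t in range(T):
            w = draw_w(rng)            # w_t ~ P_w(.)
            u = h(t, x)
            total += rho(x, u, w)      # accumulate the reward
            x = f(x, u, w)             # step the dynamics
        returns.append(total)
    return np.mean(returns)            # unbiased estimate of J^h(x0)
```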
15. Description of Model-free Monte Carlo Estimator
(MFMC)
Principle: to rebuild nbTraj T-length trajectories using the elements of the set $F_n$ and to average the sums of rewards collected along the rebuilt trajectories to get an estimate $MFMCE^h(F_n, x_0)$ of $J^h(x_0)$.
Trajectories rebuilding algorithm: trajectories are sequentially rebuilt; an elementary piece of trajectory can only be used once; trajectories are grown in length by selecting at every instant $t = 0, 1, \ldots, T-1$ the elementary piece of trajectory $(x, u, r, y)$ that minimizes the distance $\Delta((x, u), (x^{end}, h(t, x^{end})))$, where $x^{end}$ is the ending state of the already rebuilt part of the trajectory ($x^{end} = x_0$ if $t = 0$).
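A sketch of the rebuilding algorithm just described (my own rendering of the slide's procedure; `dist(x1, u1, x2, u2)` plays the role of the distance Δ):

```python
# Model-free Monte Carlo estimator: rebuild nb_traj trajectories from F_n.
import numpy as np

def mfmc_estimate(F_n, x0, h, T, nb_traj, dist):
    pieces = list(F_n)                 # each (x, u, r, y) may be used only once
    returns = []
    for _ in range(nb_traj):
        x_end, total = x0, 0.0
        for t in range(T):
            u_target = h(t, x_end)
            # pick the unused piece whose (x, u) is closest to (x_end, h(t, x_end))
            k = min(range(len(pieces)),
                    key=lambda i: dist(pieces[i][0], pieces[i][1], x_end, u_target))
            x, u, r, y = pieces.pop(k)
            total += r                 # collect the piece's stored reward
            x_end = y                  # the rebuilt trajectory grows by one step
        returns.append(total)
    return np.mean(returns)            # MFMCE^h(F_n, x0)
```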
16. Remark: when sequentially selecting the pieces of trajectories, no information on the value of the disturbance $w$ "behind" the new piece of elementary trajectory $(x, u, r = \rho(x, u, w), y = f(x, u, w))$ that is going to be selected is given if only $(x, u)$ and the previously selected elementary pieces of trajectories are known. This is important for having a meaningful estimator!
[Figure: illustration with nbTraj = 3, T = 5 and $F_{24} = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{24}$. Three rebuilt trajectories start from $x_0$, each made of five elementary pieces such as $(x^{18}, u^{18}, r^{18}, y^{18})$; e.g. sum rew. re. traj. $1 = r^3 + r^{18} + r^{21} + r^7 + r^9$; $MFMCE^h(F_n, x_0) = \frac{1}{3}\sum_{i=1}^{3}$ sum rew. re. traj. $i$.]
17. Analysis of the MFMC
A random set $\tilde{F}_n$ is defined as follows: it is made of $n$ elementary pieces of trajectory where the first two components of an element, $(x^l, u^l)$, are given by the first two components of the $l$th element of $F_n$, and the last two are generated by drawing for each $l$ a disturbance signal $w^l$ at random from $P_w(\cdot)$ and taking $r^l = \rho(x^l, u^l, w^l)$ and $y^l = f(x^l, u^l, w^l)$.
$F_n$ is a realization of the random set $\tilde{F}_n$.
The bias and variance of the MFMCE are defined as:
Bias $MFMCE^h(\tilde{F}_n, x_0) = \mathbb{E}_{w^1, \ldots, w^n \sim P_w}\left[MFMCE^h(\tilde{F}_n, x_0) - J^h(x_0)\right]$
Var. $MFMCE^h(\tilde{F}_n, x_0) = \mathbb{E}_{w^1, \ldots, w^n \sim P_w}\left[\left(MFMCE^h(\tilde{F}_n, x_0) - \mathbb{E}_{w^1, \ldots, w^n \sim P_w}\left[MFMCE^h(\tilde{F}_n, x_0)\right]\right)^2\right]$
We provide bounds on the bias and variance of this estimator.
18. Assumptions
1] The functions $f$, $\rho$ and $h$ are Lipschitz continuous: $\exists L_f, L_\rho, L_h \in \mathbb{R}^+$ such that $\forall (x, x', u, u', w) \in X^2 \times U^2 \times W$:
$\|f(x, u, w) - f(x', u', w)\|_X \le L_f (\|x - x'\|_X + \|u - u'\|_U)$
$|\rho(x, u, w) - \rho(x', u', w)| \le L_\rho (\|x - x'\|_X + \|u - u'\|_U)$
$\|h(t, x) - h(t, x')\|_U \le L_h \|x - x'\|_X \quad \forall t \in \{0, 1, \ldots, T-1\}$.
2] The distance $\Delta$ is chosen such that: $\Delta((x, u), (x', u')) = \|x - x'\|_X + \|u - u'\|_U$.
19. Characterization of the bias and the variance
Theorem.
Bias $MFMCE^h(\tilde{F}_n, x_0) \le C \cdot \text{sparsity}_{F_n}(nbTraj \cdot T)$
Var. $MFMCE^h(\tilde{F}_n, x_0) \le \left(\sqrt{\text{Var. } MCE^h(x_0)} + 2C \cdot \text{sparsity}_{F_n}(nbTraj \cdot T)\right)^2$
with $C = L_\rho \sum_{t=0}^{T-1} \sum_{i=0}^{T-t-1} [L_f(1+L_h)]^i$, and with the sparsity of $F_n$, $\text{sparsity}_{F_n}(k)$, defined as the minimal radius $r$ such that all balls in $X \times U$ of radius $r$ contain at least $k$ state-action pairs $(x^l, u^l)$ of the set $F_n = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$.
20. Test system
Discrete-time dynamics: $x_{t+1} = \sin\left(\frac{\pi}{2}(x_t + u_t + w_t)\right)$ with $X = [-1, 1]$, $U = [-\frac{1}{2}, \frac{1}{2}]$, $W = [-\frac{0.1}{2}, \frac{0.1}{2}]$ and $P_w(\cdot)$ a uniform pdf.
Reward observed after each system transition: $r_t = \frac{1}{2\pi} e^{-\frac{1}{2}(x_t^2 + u_t^2)} + w_t$
Performance criterion: expected sum of the rewards observed over a 15-length horizon ($T = 15$).
We want to evaluate the performance of the policy $h(t, x) = -\frac{x}{2}$ when $x_0 = -0.5$.
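The benchmark is simple enough to write out; a sketch (the RNG, the sample sizes and the reading of the noise interval as [-0.1/2, 0.1/2] are mine):

```python
# Test system: dynamics, reward, noise and the evaluated policy h(t, x) = -x/2.
import numpy as np

def f(x, u, w):
    return np.sin(np.pi / 2 * (x + u + w))

def rho(x, u, w):
    return np.exp(-(x**2 + u**2) / 2) / (2 * np.pi) + w

def draw_w(rng):
    return rng.uniform(-0.05, 0.05)    # P_w uniform on W = [-0.1/2, 0.1/2]

def h(t, x):
    return -x / 2

# Plain Monte Carlo reference for J^h(-0.5) over a 15-step horizon.
rng = np.random.default_rng(0)
returns = []
for _ in range(1000):
    x, total = -0.5, 0.0
    for t in range(15):
        w = draw_w(rng)
        total += rho(x, h(t, x), w)    # same w drives reward and transition
        x = f(x, h(t, x), w)
    returns.append(total)
print(np.mean(returns))
```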
21. Simulations for nbTraj = 10 and size of $F_n$ = 100, ..., 10000. [Figure: Model-free Monte Carlo Estimator vs. Monte Carlo Estimator.]
22. Simulations for nbTraj = 1, ..., 100 and size of $F_n$ = 10,000. [Figure: Model-free Monte Carlo Estimator vs. Monte Carlo Estimator.]
23. Remember what was said about RL + FAs:
1. not well adapted to risk sensitive performance criteria
Suppose the risk sensitive performance criterion:
$PC^h(x) = \begin{cases} -\infty & \text{if } P(J^h(x) < b) > c \\ J^h(x) & \text{otherwise} \end{cases}$
where $J^h(x) = \mathbb{E}\left[\sum_{t=0}^{T-1} \rho(x_t, h(t, x_t), w_t)\right]$.
MFMCE adapted to this performance criterion: rebuild nbTraj trajectories starting from $x_0$ using the set $F_n$ as done with the MFMCE estimator. Let sum_rew_traj_i be the sum of rewards collected along the $i$th rebuilt trajectory. Output as estimation of $PC^h(x_0)$:
$\begin{cases} -\infty & \text{if } \frac{1}{nbTraj}\sum_{i=1}^{nbTraj} I\{\text{sum\_rew\_traj}_i < b\} > c \\ \frac{1}{nbTraj}\sum_{i=1}^{nbTraj} \text{sum\_rew\_traj}_i & \text{otherwise.} \end{cases}$
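The adaptation costs one extra test on top of the rebuilding step; a sketch (reusing the rebuilt returns exactly as the plain MFMCE would produce them; all names are mine):

```python
# Risk-sensitive MFMCE: -inf if too many rebuilt returns fall below b.
import numpy as np

def risk_sensitive_mfmce(rebuilt_returns, b, c):
    # rebuilt_returns: sums of rewards of the nb_traj rebuilt trajectories
    r = np.asarray(rebuilt_returns, dtype=float)
    if np.mean(r < b) > c:        # empirical estimate of P(J^h(x0) < b) > c
        return -np.inf
    return r.mean()               # otherwise the usual MFMCE average
```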
24. MFMCE in the deterministic case
We consider from now on that: $x_{t+1} = f(x_t, u_t)$ and $r_t = \rho(x_t, u_t)$.
One single trajectory is sufficient to compute exactly $J^h(x_0)$ by Monte Carlo estimation.
Theorem. Let $[(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1}$ be the trajectory rebuilt by the MFMCE when using the distance measure $\Delta((x, u), (x', u')) = \|x - x'\| + \|u - u'\|$. If $f$, $\rho$ and $h$ are Lipschitz continuous, we have
$|MFMCE^h(x_0) - J^h(x_0)| \le \sum_{t=0}^{T-1} L_{Q_{T-t}} \Delta((y^{l_{t-1}}, h(t, y^{l_{t-1}})), (x^{l_t}, u^{l_t}))$
where $y^{l_{-1}} = x_0$ and $L_{Q_N} = L_\rho \left(\sum_{t=0}^{N-1} [L_f(1+L_h)]^t\right)$.
25. The previous theorem extends to any rebuilt trajectory:
Theorem. Let $[(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1}$ be any rebuilt trajectory. If $f$, $\rho$ and $h$ are Lipschitz continuous, we have
$\left|\sum_{t=0}^{T-1} r^{l_t} - J^h(x_0)\right| \le \sum_{t=0}^{T-1} L_{Q_{T-t}} \Delta((y^{l_{t-1}}, h(t, y^{l_{t-1}})), (x^{l_t}, u^{l_t}))$
where $\Delta((x, u), (x', u')) = \|x - x'\| + \|u - u'\|$, $y^{l_{-1}} = x_0$ and $L_{Q_N} = L_\rho \left(\sum_{t=0}^{N-1} [L_f(1+L_h)]^t\right)$.
[Figure: a trajectory generated by policy h (rewards $r_0, \ldots, r_5$, with e.g. $r_2 = \rho(x_2, h(2, x_2))$) compared with a rebuilt trajectory (rewards $r^{l_0}, \ldots, r^{l_4}$); e.g. $\Delta_2 = L_{Q_3}(\|y^{l_1} - x^{l_2}\| + \|h(2, y^{l_1}) - u^{l_2}\|)$ and $|\sum_{t=0}^{4} r_t - \sum_{t=0}^{4} r^{l_t}| \le \sum_{t=0}^{4} \Delta_t$.]
26. Computing a lower bound on a policy
From the previous theorem, we have for any rebuilt trajectory $[(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1}$:
$J^h(x_0) \ge \sum_{t=0}^{T-1} r^{l_t} - \sum_{t=0}^{T-1} L_{Q_{T-t}} \Delta((y^{l_{t-1}}, h(t, y^{l_{t-1}})), (x^{l_t}, u^{l_t}))$
This suggests finding the rebuilt trajectory that maximizes the right-hand side of the inequality to compute a tight lower bound on $J^h(x_0)$. Let:
$\text{lower\_bound}(h, x_0, F_n) \triangleq \max_{[(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1}} \sum_{t=0}^{T-1} r^{l_t} - \sum_{t=0}^{T-1} L_{Q_{T-t}} \Delta((y^{l_{t-1}}, h(t, y^{l_{t-1}})), (x^{l_t}, u^{l_t}))$
27. A tight upper bound on $J^h(x)$ can be defined and computed in a similar way:
$\text{upper\_bound}(h, x_0, F_n) \triangleq \min_{[(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1}} \sum_{t=0}^{T-1} r^{l_t} + \sum_{t=0}^{T-1} L_{Q_{T-t}} \Delta((y^{l_{t-1}}, h(t, y^{l_{t-1}})), (x^{l_t}, u^{l_t}))$
Why are these bounds tight? Because $\exists C \in \mathbb{R}^+$ such that:
$J^h(x) - \text{lower\_bound}(h, x, F_n) \le C \cdot \text{sparsity}_{F_n}(1)$
$\text{upper\_bound}(h, x, F_n) - J^h(x) \le C \cdot \text{sparsity}_{F_n}(1)$
The functions lower_bound(h, x0, Fn) and upper_bound(h, x0, Fn) can be implemented in a "smart" way by seeing the problem as one of finding the shortest path in a graph. The complexity is linear in T and quadratic in |Fn|.
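A sketch of that shortest-path computation (a Viterbi-style pass over stages t = 0..T-1 with the pieces of F_n as nodes; the constants L_Q[t] = L_{Q_{T-t}} and the distance function are inputs, and all names are mine):

```python
# lower_bound(h, x0, F_n): best rebuilt trajectory under the penalized return.
# Runs in O(T * |F_n|^2), i.e. linear in T and quadratic in |F_n|.
import numpy as np

def lower_bound(F_n, x0, h, T, L_Q, dist):
    # L_Q[t] is the constant L_{Q_{T-t}}; dist(x1, u1, x2, u2) plays the role of Delta.
    n = len(F_n)
    best = np.full(n, -np.inf)   # best penalized partial return ending in piece l
    for l, (x, u, r, y) in enumerate(F_n):          # stage t = 0, starting at x0
        best[l] = r - L_Q[0] * dist(x0, h(0, x0), x, u)
    for t in range(1, T):
        new = np.full(n, -np.inf)
        for l, (x, u, r, y) in enumerate(F_n):
            # best predecessor piece k, paying the distance penalty from y^k
            new[l] = r + max(
                best[k] - L_Q[t] * dist(F_n[k][3], h(t, F_n[k][3]), x, u)
                for k in range(n))
        best = new
    return best.max()  # the upper bound is analogous, with + penalties and min
```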
28. Remember what was said about RL + FAs:
2. may lead to unsafe policies - poor performance
guarantees
Let H be a set of candidate high-performance policies. To obtain a policy with good performance guarantees, we suggest solving the following problem:
$h \in \arg\max_{h \in H} \text{lower\_bound}(h, x_0, F_n)$
If H is the set of open-loop policies, solving the above optimization problem can be seen as identifying an "optimal" rebuilt trajectory and outputting, as the open-loop policy, the sequence of actions taken along this rebuilt trajectory.
29. [Figure: puddle world policies. Top: trajectory set covering the puddle; FQI with trees vs. $h \in \arg\max_{h \in H} \text{lower\_bound}(h, x_0, F_n)$. Bottom: trajectory set not covering the puddle; same comparison.]
30. Remember what was said about RL + FAs:
3. may make suboptimal use of near-optimal
trajectories
Suppose a deterministic batch mode RL problem and that in $F_n$ you have the elements of the trajectory:
$(x_0^{\text{opt. traj.}}, u_0, r_0, x_1, u_1, r_1, x_2, \ldots, x_{T-2}, u_{T-2}, r_{T-2}, x_{T-1}, u_{T-1}, r_{T-1}, x_T)$
where the $u_t$'s have been selected by an optimal policy.
Let H be the set of open-loop policies. Then the sequence of actions $h \in \arg\max_{h \in H} \text{lower\_bound}(h, x_0^{\text{opt. traj.}}, F_n)$ is an optimal one whatever the other trajectories in the set.
Actually, the sequence of actions $h$ outputted by this algorithm tends to be an append of subsequences of actions belonging to optimal trajectories.
31. Remember what was said about RL + FAs:
4. offer little clues about how to generate new
experiments in an optimal way
The functions lower bound(h, x0, Fn) and
upper bound(h, x0, Fn) can be exploited for generating new
trajectories.
For example, suppose that you can sample the state-action
space several times so as to generate m new elementary
pieces of trajectories to enrich your initial set Fn. We have
proposed a technique to determine m “interesting” sampling
locations based on these bounds.
This technique - which is still preliminary - targets sampling
locations that lead to the largest bound width decrease for
candidate optimal policies.
32. Closure
Rebuilding trajectories: interesting concept for solving many
problems related to batch mode RL.
Actually, the solution outputted by many RL algorithms (e.g.,
model-learning with kNN, fitted Q iteration with trees) can be
characterized by a set of “rebuilt trajectories”.
⇒ I suspect that this concept of rebuilt trajectories could lead
to a general paradigm for analyzing and designing RL
algorithms.
33. Presentation based on (in order of appearance):
“Model-free Monte Carlo-like policy evaluation”. R. Fonteneau, S.A. Murphy, L.
Wehenkel and D. Ernst. In Proceedings of The Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS 2010), JMLR
W&CP Volume 9, pages 217-224, Chia Laguna, Sardinia, Italy, May 2010.
“Inferring bounds on the performance of a control policy from a sample of
trajectories”. R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. In
Proceedings of the IEEE International Symposium on Adaptive Dynamic
Programming and Reinforcement Learning (ADPRL-09), pages 117-123.
Nashville, United States, March 30 - April 2, 2009.
“A cautious approach to generalization in reinforcement learning”. R.
Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. In Proceedings of the 2nd
International Conference on Agents and Artificial Intelligence (ICAART 2010),
Valencia, Spain, January 2010. (10 pages).
“Generating informative trajectories by using bounds on the return of control
policies”. R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. In
Proceedings of the Workshop on Active Learning and Experimental Design
2010 (in conjunction with AISTATS 2010), Italy, May 2010. (2 pages).