"The Metropolis adjusted Langevin Algorithm
for log-concave probability measures in high
dimensions", talk by Andreas Elberle at the BigMC seminar, 9th June 2011, Paris
AACIMP 2010 Summer School lecture by Leonidas Sakalauskas. "Applied Mathematics" stream. "Stochastic Programming and Applications" course. Part 5.
More info at http://summerschool.ssa.org.ua
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R..., by Chiheb Ben Hammouda
In biochemically reactive systems with small copy numbers of one or more reactant molecules, the dynamics are dominated by stochastic effects. To approximate those systems, discrete state-space and stochastic simulation approaches have been shown to be more relevant than continuous state-space and deterministic ones. These stochastic models constitute the theory of Stochastic Reaction Networks (SRNs). In systems characterized by having simultaneously fast and slow timescales, existing discrete state-space stochastic path simulation methods, such as the stochastic simulation algorithm (SSA) and the explicit tau-leap (explicit-TL) method, can be very slow. In this talk, we propose a novel implicit scheme, split-step implicit tau-leap (SSI-TL), to improve numerical stability and provide efficient simulation algorithms for those systems. Furthermore, to estimate statistical quantities related to SRNs, we propose a novel hybrid Multilevel Monte Carlo (MLMC) estimator in the spirit of the work by Anderson and Higham (SIAM Multiscale Model. Simul. 10(1), 2012). This estimator uses the SSI-TL scheme at levels where the explicit-TL method is not applicable due to numerical stability issues, and then, starting from a certain interface level, it switches to the explicit scheme. We present numerical examples that illustrate the gains achieved by our proposed approach in this context.
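For readers unfamiliar with multilevel Monte Carlo, the sketch below shows only the generic telescoping-sum estimator; it is not the authors' hybrid SSI-TL/explicit-TL scheme, and the `coupled_sampler` interface is an assumption for illustration.

```python
import numpy as np

def mlmc_estimate(coupled_sampler, samples_per_level):
    """Generic multilevel Monte Carlo estimator.
    coupled_sampler(level) is assumed to return a coupled pair (fine, coarse)
    of simulated quantities at resolutions `level` and `level - 1`
    (with coarse == 0.0 at the coarsest level)."""
    estimate = 0.0
    for level, n in enumerate(samples_per_level):
        # Monte Carlo estimate of the correction term E[P_level - P_{level-1}]
        diffs = [f - c for f, c in (coupled_sampler(level) for _ in range(n))]
        estimate += float(np.mean(diffs))
    return estimate
```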
"The Metropolis adjusted Langevin Algorithm
for log-concave probability measures in high
dimensions", talk by Andreas Elberle at the BigMC seminar, 9th June 2011, Paris
AACIMP 2010 Summer School lecture by Leonidas Sakalauskas. "Applied Mathematics" stream. "Stochastic Programming and Applications" course. Part 5.
More info at http://summerschool.ssa.org.ua
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...Chiheb Ben Hammouda
Β
In biochemically reactive systems with small copy numbers of one or more reactant molecules, the dynamics are dominated by stochastic effects. To approximate those systems, discrete state-space and stochastic simulation approaches have been shown to be more relevant than continuous state-space and deterministic ones. These stochastic models constitute the theory of Stochastic Reaction Networks (SRNs). In systems characterized by having simultaneously fast and slow timescales, existing discrete space-state stochastic path simulation methods, such as the stochastic simulation algorithm (SSA) and the explicit tau-leap (explicit-TL) method, can be very slow. In this talk, we propose a novel implicit scheme, split-step implicit tau-leap (SSI-TL), to improve numerical stability and provide efficient simulation algorithms for those systems. Furthermore, to estimate statistical quantities related to SRNs, we propose a novel hybrid Multilevel Monte Carlo (MLMC) estimator in the spirit of the work by Anderson and Higham (SIAM Multiscal Model. Simul. 10(1), 2012). This estimator uses the SSI-TL scheme at levels where the explicit-TL method is not applicable due to numerical stability issues, and then, starting from a certain interface level, it switches to the explicit scheme. We present numerical examples that illustrate the achieved gains of our proposed approach in this context.
Random Matrix Theory and Machine Learning - Part 1, by Fabian Pedregosa
ICML 2021 tutorial on random matrix theory and machine learning. Part 1 covers: 1. A brief history of Random Matrix Theory, 2. Classical Random Matrix Ensembles (basic building blocks)
Runtime Analysis of Population-based Evolutionary Algorithms, by PK Lehre
Populations are at the heart of evolutionary algorithms (EAs). They provide the genetic variation which selection acts upon. A complete picture of EAs can only be obtained if we understand their population dynamics. A rich theory on runtime analysis (also called time-complexity analysis) of EAs has been developed over the last 20 years. The goal of this theory is to show, via rigorous mathematical means, how the performance of EAs depends on their parameter settings and the characteristics of the underlying fitness landscapes. Initially, runtime analysis of EAs was mostly restricted to simplified EAs that do not employ large populations, such as the (1+1) EA. This tutorial introduces more recent techniques that enable runtime analysis of EAs with realistic population sizes.
The tutorial begins with a brief overview of the population-based EAs that are covered by the techniques. We recall the common stochastic selection mechanisms and how to measure the selection pressure they induce. The main part of the tutorial covers in detail widely applicable techniques tailored to the analysis of populations. We discuss random family trees and branching processes, drift and concentration of measure in populations, and level-based analyses.
To illustrate how these techniques can be applied, we consider several fundamental questions: When are populations necessary for efficient optimisation with EAs? What is the appropriate balance between exploration and exploitation and how does this depend on relationships between mutation and selection rates? What determines an EA's tolerance for uncertainty, e.g. in form of noisy or partially available fitness?
This tutorial was presented at the 2015 IEEE Congress on Evolutionary Computation at Sendai, Japan, May 25th 2015.
Basic concepts and how to measure price volatility
Presented by Carlos Martins-Filho at the AGRODEP Workshop on Analytical Tools for Food Prices
and Price Volatility
June 6-7, 2011, Dakar, Senegal
For more information on the workshop or to see the latest version of this presentation visit: http://www.agrodep.org/first-annual-workshop
I give some ideas on how to use asymptotic series and expansions to prove the Riemann Hypothesis, solve integral equations, and even define a regularized integral of powers.
This presentation describes my experience with nRF24L01, Arduino, Bus Pirate and various other hardware toys when somebody who does software gets into contact with "real stuff".
A Glimpse into Developing Software-Defined Radio by Python, by Albert Huang
Software-defined radio (SDR) has been emerging for many years in various fields, including military and commercial communication systems and scientific research, e.g. space exploration. GNU Radio is an open source SDR framework written in Python. This talk introduces the basic concepts of software-defined radio and various front-end hardware, and then illustrates how to use Python to develop SDR.
Presentation on Financial Crimes. Money is one of the most important motives behind all forms of crime, whether cyber or Internet crimes, physical or theft crimes. With the advancement of technology, crime has not decelerated but only escalated, and many new techniques have been developed by people popularly called black-hat hackers. In this presentation we give an overview of the whole scenario.
Overview and analysis of the economic model of crime by Becker (1968), including a case study of the Three Strikes Law in California, USA, using a differences-in-differences methodology.
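As a generic illustration of that methodology (not taken from the presentation itself), a differences-in-differences effect can be estimated with an OLS regression on a treatment dummy, a post-period dummy, and their interaction; the interaction coefficient is the DiD estimate. The data and variable names below are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: crime rates for treated and control jurisdictions,
# before and after a policy change.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "treated": np.repeat([0, 1], 200),               # 1 = jurisdiction subject to the law
    "post":    np.tile(np.repeat([0, 1], 100), 2),   # 1 = period after the law
})
df["crime_rate"] = (50 - 5 * df["treated"] - 2 * df["post"]
                    - 4 * df["treated"] * df["post"]
                    + rng.normal(0, 3, len(df)))

# The coefficient on treated:post is the differences-in-differences estimate.
model = smf.ols("crime_rate ~ treated + post + treated:post", data=df).fit()
print(model.params["treated:post"])
```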
Radio related project ideas using a Raspberry Pi. Covers use of RTL SDR USB, WSPR using WsprryPi, and Packet Radio using Direwolf, ax25 and PiLinBPQ (BPR32)
This presentation by Prof. John M. Connor from Purdue University, West Lafayette, US was made during the discussion on "Sanctions in Anti-trust cases" held at the 15th Global Forum on Competition on 2 December 2016. More papers and presentations on the topic can be found at www.oecd.org/competition/globalforum/competition-and-sanctions-in-antitrust-cases.htm
Capital Asset Pricing Model (CAPM)
A model that describes the relationship between risk and expected return. The general idea behind CAPM is that investors need to be compensated in two ways: the time value of money and risk. The time value of money is represented by the risk-free (rf) rate in the formula and compensates investors for placing money in any investment over a period of time. The other half of the formula represents risk and calculates the amount of compensation the investor needs for taking on additional risk. This is calculated by taking a risk measure (beta), which compares the returns of the asset to those of the market over a period of time, and multiplying it by the market premium (Rm - rf).
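For reference, the relationship described above is usually written E[Ri] = rf + beta_i (E[Rm] - rf). The short sketch below just plugs illustrative numbers into that formula; the figures are made up, not from the original slide.

```python
def capm_expected_return(risk_free, beta, market_return):
    """Expected return under CAPM: the risk-free rate plus beta times the market risk premium."""
    return risk_free + beta * (market_return - risk_free)

# Illustrative values: 2% risk-free rate, beta of 1.3, 8% expected market return.
print(capm_expected_return(0.02, 1.3, 0.08))  # 0.098 -> 9.8% expected return
```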
This presentation by Prof. Hwang LEE from the Korea University School of Law was made during the discussion on "Sanctions in Anti-trust cases" held at the 15th Global Forum on Competition on 2 December 2016. More papers and presentations on the topic can be found at www.oecd.org/competition/globalforum/competition-and-sanctions-in-antitrust-cases.htm
The slides provide a firm understanding of the formation and functioning of the British economy.
Highlights:
Foundation of British Economy
Nature of The Economy
Britain's Current Economic Scenario
London Stock Exchange
London vs. Economy
Role of The Government
Involvement in International Trade
Forecast on British Economy
Reinforcement learning: hidden theory, and new super-fast algorithms
Lecture presented at the Center for Systems and Control (CSC@USC) and Ming Hsieh Institute for Electrical Engineering,
February 21, 2018
Stochastic Approximation algorithms are used to approximate solutions to fixed point equations that involve expectations of functions with respect to possibly unknown distributions. The most famous examples today are TD- and Q-learning algorithms. The first half of this lecture will provide an overview of stochastic approximation, with a focus on optimizing the rate of convergence. A new approach to optimize the rate of convergence leads to the new Zap Q-learning algorithm. Analysis suggests that its transient behavior is a close match to a deterministic Newton-Raphson implementation, and numerical experiments confirm super fast convergence.
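As a rough illustration of the stochastic approximation viewpoint described above (a generic tabular sketch, not the Zap Q-learning algorithm from the talk), Q-learning drives an estimate toward a fixed point of the Bellman equation using sampled transitions and a decaying step size. The environment interface assumed here (reset() returning a state, step(a) returning next state, reward, and a done flag) is hypothetical.

```python
import numpy as np

def q_learning(env, num_states, num_actions, episodes=500, gamma=0.95):
    """Generic tabular Q-learning as a stochastic approximation to the Bellman fixed point."""
    Q = np.zeros((num_states, num_actions))
    visits = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            a = np.random.randint(num_actions) if np.random.rand() < 0.1 else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]              # decaying step size, as in classical SA
            target = r + gamma * Q[s_next].max()    # sampled fixed-point target
            Q[s, a] += alpha * (target - Q[s, a])   # move the estimate toward the target
            s = s_next
    return Q
```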
Based on
@article{devmey17a,
  title   = {Fastest Convergence for {Q-learning}},
  author  = {Devraj, Adithya M. and Meyn, Sean P.},
  journal = {NIPS 2017 and ArXiv e-prints},
  year    = {2017}
}
Simplified Runtime Analysis of Estimation of Distribution Algorithms, by Per Kristian Lehre
We demonstrate how to estimate the expected optimisation time of UMDA, an estimation of distribution algorithm, using the level-based theorem. The talk was given at the GECCO 2015 conference in Madrid, Spain.
Simplified Runtime Analysis of Estimation of Distribution Algorithms, by PK Lehre
We describe how to estimate the optimisation time of the UMDA, an estimation of distribution algorithm, using the level-based theorem. The paper was presented at GECCO 2015 in Madrid.
I am Ben R. I am a Statistics Assignment Expert at statisticshomeworkhelper.com. I hold a Ph.D. in Statistics, from University of Denver, USA. I have been helping students with their homework for the past 5 years. I solve assignments related to Statistics.
Visit statisticshomeworkhelper.com or email info@statisticshomeworkhelper.com.
You can also call on +1 678 648 4277 for any assistance with Statistics Assignment.
Mean Absolute Percentage Error for regression models, presentation of the paper published in Neurocomputing, 2016.
http://www.sciencedirect.com/science/article/pii/S0925231216003325
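For readers unfamiliar with the metric, a minimal implementation of the mean absolute percentage error follows; this is just the textbook definition, not code from the paper.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; assumes no true value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

print(mape([100, 200, 400], [110, 190, 360]))  # ~8.3
```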
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALES, by Tahia ZERIZER
In this article we study a general model of nonlinear difference equations including small parameters of multiple scales. For two kinds of perturbations, we describe algorithmic methods giving asymptotic solutions to boundary value problems.
The problem of existence and uniqueness of the solution is also addressed.
This talk considers parameter estimation in two-component symmetric Gaussian mixtures in $d$ dimensions with $n$ independent samples. We show that, even in the absence of any separation between components, with high probability, the EM algorithm converges to an estimate in at most $O(\sqrt{n} \log n)$ iterations which is within $O((d/n)^{1/4} (\log n)^{3/4})$ in Euclidean distance of the true parameter, provided that $n=\Omega(d \log^2 d)$. This is within a logarithmic factor of the minimax optimal rate of $(d/n)^{1/4}$. The proof relies on establishing (a) a non-linear contraction behavior of the population EM mapping and (b) concentration of the EM trajectory near its population version, to prove that random initialization works. This is in contrast to the previous analysis of Daskalakis, Tzamos, and Zampetakis (2017), which requires sample splitting and restarts the EM iteration after normalization, and of Balakrishnan, Wainwright, and Yu (2017), which requires strong conditions on both the separation of the components and the quality of the initialization. Furthermore, we obtain asymptotically efficient estimation when the signal is stronger than the minimax rate.
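For concreteness, one EM iteration for the symmetric mixture 0.5 N(theta, I) + 0.5 N(-theta, I) reduces to a tanh-weighted average of the samples. The sketch below is the generic textbook form of that update with random initialization, not the authors' analysis or code.

```python
import numpy as np

def em_step(X, theta):
    """One EM iteration for the mixture 0.5*N(theta, I) + 0.5*N(-theta, I).
    The posterior weight of the positive component is sigmoid(2 x.theta), so
    the M-step reduces to a tanh-weighted average of the samples."""
    w = np.tanh(X @ theta)                 # equals 2*posterior - 1 for each sample
    return (w[:, None] * X).mean(axis=0)

def fit(X, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=X.shape[1])    # random initialization, as studied in the talk
    for _ in range(iters):
        theta = em_step(X, theta)
    return theta
```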
A New Nonlinear Reinforcement Scheme for Stochastic Learning Automata, by infopapers
Dana Simian, Florin Stoica, A New Nonlinear Reinforcement Scheme for Stochastic Learning Automata, Proceedings of the 12th WSEAS International Conference on AUTOMATIC CONTROL, MODELLING & SIMULATION, 29-31 May 2010, Catania, Italy, ISSN 1790-5117, ISBN 978-954-92600-5-2, pp. 450-454
Financial Trading as a Game: A Deep Reinforcement Learning Approach
An automatic program that generates constant profit from the financial market is lucrative for every market practitioner. Recent advances in deep reinforcement learning provide a framework toward end-to-end training of such a trading agent. In this paper, we propose a Markov Decision Process (MDP) model suitable for the financial trading task and solve it with the state-of-the-art deep recurrent Q-network (DRQN) algorithm. We propose several modifications to the existing learning algorithm to make it more suitable for the financial trading setting, namely: 1. We employ a substantially smaller replay memory (only a few hundred entries) compared to those used in modern deep reinforcement learning algorithms (often millions in size). 2. We develop an action augmentation technique to mitigate the need for random exploration by providing extra feedback signals for all actions to the agent. This enables us to use a greedy policy over the course of learning and shows strong empirical performance compared to the more commonly used ε-greedy exploration. However, this technique is specific to financial trading under a few market assumptions. 3. We sample a longer sequence for recurrent neural network training. A side product of this mechanism is that we can now train the agent every T steps, which greatly reduces training time since the overall computation is reduced by a factor of T. We combine all of the above into a complete online learning algorithm and validate our approach on the spot foreign exchange market.
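The action-augmentation idea can be illustrated with a toy reward model; this is only a guess at the general mechanism under common trading assumptions, not the paper's exact formulation. When the reward of every position can be computed from the observed price change, the agent can receive a feedback signal for all actions, not only the one taken.

```python
import numpy as np

def augmented_rewards(price_change, prev_position, positions=(-1, 0, 1), cost=1e-4):
    """Hypothetical reward model: profit of holding each candidate position over the
    observed price change, minus a transaction cost for changing position. Because
    the reward depends only on the action and the price change, it can be evaluated
    for every action, giving extra feedback signals (the 'action augmentation' idea)."""
    positions = np.array(positions, dtype=float)
    return positions * price_change - cost * np.abs(positions - prev_position)

print(augmented_rewards(price_change=0.002, prev_position=0))
```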
Transcript: Selling digital books in 2024: Insights from industry leaders - T..., by BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What's changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Connector Corner: Automate dynamic content and events by pushing a button, by DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there's more:
In a second workflow supporting the same use case, youβll see:
Your campaign sent to target colleagues for approval
If the "Approve" button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But if the "Reject" button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo..., by James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti..., by Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview, by Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties - USA
Expansion of bot farms - how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks - Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
100 things I know
1. 100 things I know.
Part I of III
Reinaldo Uribe M
Mar. 4, 2012
2. SMDP Problem Description.
1. In a Markov Decision Process, a (learning) agent is embedded in an environment and takes actions that affect that environment.
States: $s \in S$.
Actions: $a \in A_s$; $A = \bigcup_{s \in S} A_s$.
(Stationary) system dynamics: transition from $s$ to $s'$ after taking $a$, with probability $P^a_{ss'} = p(s' \mid s, a)$.
Rewards: $R^a_{ss'}$. Def. $r(s,a) = E[R^a_{ss'} \mid s, a]$.
At time $t$, the agent is in state $s_t$, takes action $a_t$, transitions to state $s_{t+1}$ and observes reinforcement $r_{t+1}$ with expectation $r(s_t, a_t)$.
3. SMDP Problem Description.
2. Policies, value and optimal policies.
An element $\pi$ of the policy space $\Pi$ indicates what action, $\pi(s)$, to take at each state.
The value of a policy from a given state, $v^\pi(s)$, is the expected cumulative reward received starting in $s$ and following $\pi$:
$$v^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, \pi(s_t)) \,\Big|\, s_0 = s, \pi\right]$$
$0 < \gamma \le 1$ is a discount factor.
An optimal policy, $\pi^*$, has maximum value at every state:
$$\pi^*(s) \in \operatorname*{argmax}_{\pi \in \Pi} v^\pi(s) \quad \forall s$$
$$v^*(s) = v^{\pi^*}(s) \ge v^\pi(s) \quad \forall \pi \in \Pi$$
4. SMDP Problem Description.
3. Discount
Makes infinite-horizon value bounded if rewards are bounded.
Ostensibly makes rewards received sooner more desirable than those received later.
But exponential terms make analysis awkward and harder...
... and $\gamma$ has unexpected, undesirable effects, as shown in Uribe et al. 2011.
Therefore, hereon $\gamma = 1$.
See section Discount, at the end, for discussion.
5. SMDP Problem Description.
4. Average reward models.
A more natural long term measure of optimality exists for such cyclical tasks, based on maximizing the average reward per action. Mahadevan 1996
$$\rho^\pi(s) = \lim_{n \to \infty} \frac{1}{n}\, E\left[\sum_{t=0}^{n-1} r(s_t, \pi(s_t)) \,\Big|\, s_0 = s, \pi\right]$$
Optimal policy:
$$\rho^{\pi^*}(s) \ge \rho^\pi(s) \quad \forall s,\, \pi \in \Pi$$
Remark: All actions equally costly.
6. SMDP Problem Description
5. Semi-Markov Decision Process: usual approach, transition times.
Agent is in state $s_t$ and takes action $\pi(s_t)$ at decision epoch $t$.
After an average of $N_t$ units of time, the system evolves to state $s_{t+1}$ and the agent observes $r_{t+1}$ with expectation $r(s_t, \pi(s_t))$.
In general, $N_t(s_t, a_t, s_{t+1})$.
Gain (of a policy at a state):
$$\rho^\pi(s) = \lim_{n \to \infty} \frac{E\left[\sum_{t=0}^{n-1} r(s_t, \pi(s_t)) \,\big|\, s_0 = s, \pi\right]}{E\left[\sum_{t=0}^{n-1} N_t \,\big|\, s_0 = s, \pi\right]}$$
Optimizing gain still maximizes average reward per action, but actions are no longer equally weighted. (Unless all $N_t = 1$)
7. SMDP Problem Description
6.a Semi-Markov Decision Process: explicit action costs.
Taking an action takes time, costs money, or consumes energy. (Or any combination thereof)
Either way, a real-valued cost $k_{t+1}$ not necessarily related to process rewards.
Cost can depend on $a$, $s$ and (less common in practice) $s'$.
Generally, actions have positive cost. We simply require all policies to have positive expected cost.
Wlog the magnitude of the smallest nonzero average action cost is forced to be unity:
$$|k(a,s)| \ge 1 \quad \forall\, k(a,s) \neq 0$$
8. SMDP Problem Description
6.b Semi-Markov Decision Process: explicit action costs.
Cost of a policy from a state:
$$c^\pi(s) = \lim_{n \to \infty} E\left[\sum_{t=0}^{n-1} k(s_t, \pi(s_t)) \,\Big|\, s_0 = s, \pi\right]$$
So $c^\pi(s) > 0 \quad \forall \pi \in \Pi,\, s$.
$N_t = k(s_t, \pi(s_t))$. Only their definition/interpretation changes.
Gain:
$$\rho^\pi(s) = \frac{v^\pi(s)/n}{c^\pi(s)/n}$$
9. SMDP Problem Description
7. Optimality of $\pi^*$:
$\pi^* \in \Pi$ with gain
$$\rho^{\pi^*}(s) = \lim_{n \to \infty} \frac{E\left[\sum_{t=0}^{n-1} r(s_t, \pi^*(s_t)) \,\big|\, s_0 = s, \pi^*\right]}{E\left[\sum_{t=0}^{n-1} k(s_t, \pi^*(s_t)) \,\big|\, s_0 = s, \pi^*\right]} = \frac{v^{\pi^*}(s)}{c^{\pi^*}(s)}$$
is optimal if
$$\rho^{\pi^*}(s) \ge \rho^\pi(s) \quad \forall s,\, \pi \in \Pi,$$
as it was in ARRL.
Notice that the optimal policy doesn't necessarily maximize $v^\pi$ or minimize $c^\pi$. It only optimizes their ratio.
10. SMDP Problem Description
8. Policies in ARRL and SMDPs are evaluated using the average-adjusted sum of rewards:
$$H^\pi(s) = \lim_{n \to \infty} E\left[\sum_{t=0}^{n-1} \big(r(s_t, \pi(s_t)) - \rho^\pi(s)\big) \,\Big|\, s_0 = s, \pi\right]$$
Puterman 1994, Abounadi et al. 2001, Ghavamzadeh & Mahadevan 2007
This signals the existence of bias-optimal policies that, while gain optimal, also maximize the transitory rewards received before entering recurrence.
We are interested in gain-optimal policies only.
(It is hard enough...)
11. SMDP Problem Description
9. The Unichain Property
A process is unichain if every policy has a single, unique recurrent class.
I.e. if, for every policy, all recurrent states communicate with each other.
All methods rely on the unichain property. (Because, if it holds:)
$\rho^\pi(s)$ is constant for all $s$: $\rho^\pi(s) = \rho^\pi$.
Gain and value expressions simplify. (See next)
However, deciding if a problem is unichain is NP-Hard. Tsitsiklis 2003
12. SMDP Problem Description
10. Unichain property under recurrent states. Feinberg & Yang, 2010
A state is recurrent if it belongs to a recurrent class of every policy.
A recurrent state can be found, or proven not to exist, in polynomial time.
If a recurrent state exists, determining whether the unichain property holds can be done in polynomial time.
(We are not going to actually do it, since it requires knowledge of the system dynamics, but good to know!)
Recurrent states seem useful. In fact, the existence of a recurrent state is more critical to our purposes than the unichain property.
Both will be required in principle for our methods/analysis, until their necessity is further qualified in the section Unichain Considerations below.
14. Generic Learning Algorithm
11. The relevant expressions under our assumptions simplify, losing dependence on $s_0$.
The following Bellman equation holds for the average-adjusted state value:
$$H^\pi(s) = r(s, \pi(s)) - k(s, \pi(s))\,\rho^\pi + E_\pi\left[H^\pi(s')\right] \tag{1}$$
Ghavamzadeh & Mahadevan 2007
Reinforcement Learning methods exploit Eq. (1), running the process and substituting:
state-action pair values for state values,
observed rewards and costs for expected ones,
an estimate for $\rho^\pi$,
its current estimate for $H^\pi(s')$.
15. Generic Learning Algorithm
12.
Algorithm 1 Generic SMDP solver
Initialize
repeat forever
  Act
  Do RL to find the value of the current $\pi$   (usually 1-step Q-learning)
  Update $\rho$
16. Generic Learning Algorithm
13.
Model-based state value update:
$$H^{t+1}(s_t) \leftarrow \max_a \left[ r(s_t, a) + E_a\, H^t(s_{t+1}) \right]$$
$E_a$ emphasizes that the expected value of the next state depends on the action chosen/taken.
Model-free state-action pair value update:
$$Q^{t+1}(s_t, a_t) \leftarrow (1 - \gamma_t)\, Q^t(s_t, a_t) + \gamma_t \left[ r_{t+1} - \rho_t c_{t+1} + \max_a Q^t(s_{t+1}, a) \right]$$
In ARRL, $c_t = 1 \ \forall t$.
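A rough code rendering of the model-free update above; this is only a sketch, and the caller supplying the observed reward, cost, next state and learning rate is an assumed interface.

```python
def average_adjusted_q_update(Q, s, a, r, c, s_next, rho, gamma_t):
    """One step of the model-free update:
    Q(s,a) <- (1 - gamma_t) Q(s,a) + gamma_t (r - rho*c + max_a' Q(s', a')),
    where gamma_t is the learning rate and c the observed action cost
    (c = 1 for every step recovers the ARRL case)."""
    target = r - rho * c + Q[s_next].max()
    Q[s, a] = (1.0 - gamma_t) * Q[s, a] + gamma_t * target
    return Q
```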
17. Generic Learning Algorithm
14.a Table of algorithms, ARRL. (Algorithm: gain update)
AAC (Jalali and Ferguson 1989): $\rho_{t+1} \leftarrow \dfrac{\sum_{i=0}^{t} r(s_i, \pi^i(s_i))}{t+1}$
R-Learning (Schwartz 1993): $\rho \leftarrow (1-\alpha)\rho_t + \alpha\left[ r_{t+1} + \max_a Q^t(s_{t+1}, a) - \max_a Q^t(s_t, a) \right]$
H-Learning (Tadepalli and Ok 1998): $\rho_{t+1} \leftarrow (1-\alpha_t)\rho_t + \alpha_t\left[ r_{t+1} - H^t(s_t) + H^t(s_{t+1}) \right]$, with $\alpha_{t+1} \leftarrow \dfrac{\alpha_t}{\alpha_t + 1}$
SSP Q-Learning (Abounadi et al. 2001): $\rho_{t+1} \leftarrow \rho_t + \alpha_t \min_a Q^t(\hat{s}, a)$
HAR (Ghavamzadeh and Mahadevan 2007): $\rho_{t+1} \leftarrow \dfrac{\sum_{i=0}^{t} r(s_i, \pi^i(s_i))}{t+1}$
18. Generic Learning Algorithm
14.b Table of algorithms, SMDPRL. (Algorithm: gain update)
SMART (Das et al. 1999) and MAX-Q (Ghavamzadeh and Mahadevan 2001): $\rho_{t+1} \leftarrow \dfrac{\sum_{i=0}^{t} r(s_i, \pi^i(s_i))}{\sum_{i=0}^{t} c(s_i, \pi^i(s_i))}$
19. SSP Q-Learning
15. Stochastic Shortest Path Q-Learning
Most interesting. ARRL.
If unichain and there exists a recurrent $\hat{s}$ (Assumption 2.1):
"SSP Q-learning is based on the observation that the average cost under any stationary policy is simply the ratio of expected total cost and expected time between two successive visits to the reference state [$\hat{s}$]"
Thus, they propose (after Bertsekas 1998) making the process episodic, splitting $\hat{s}$ into the (unique) initial and terminal states.
If the Assumption holds, termination has probability 1.
Only the value/cost of the initial state is important.
Optimal solution "can be shown to happen" when $H(\hat{s}) = 0$.
(See next section)
20. SSP Q-Learning
16. SSPQ $\rho$ update.
$$\rho_{t+1} \leftarrow \rho_t + \alpha_t \min_a Q^t(\hat{s}, a),$$
where
$$\sum_t \alpha_t = \infty; \qquad \sum_t \alpha_t^2 < \infty.$$
But it is hard to prove boundedness of $\{\rho_t\}$, so instead they suggest
$$\rho_{t+1} \leftarrow \Gamma\!\left( \rho_t + \alpha_t \min_a Q^t(\hat{s}, a) \right),$$
with $\Gamma(\cdot)$ a projection onto $[-K, K]$ and $\rho^* \in (-K, K)$.
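In code, the projected gain update might look like the following sketch. The reference-state action values, step size and bound K are supplied by the caller; this only illustrates the update rule, it is not a verified implementation of Abounadi et al.

```python
import numpy as np

def ssp_rho_update(rho, Q_ref, alpha_t, K):
    """SSP Q-learning gain update: rho <- Gamma(rho + alpha_t * min_a Q(s_hat, a)),
    where Q_ref holds the action values of the reference state s_hat and
    Gamma projects onto [-K, K] to keep the iterates bounded."""
    return float(np.clip(rho + alpha_t * np.min(Q_ref), -K, K))
```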
21. A Critique
17. Complexity.
Unknown.
While RL is PAC.
18. Convergence.
Not always guaranteed (e.g. R-Learning).
When proven, asymptotic: convergence to the optimal policy/value if all state-action pairs are visited infinitely often.
Usually proven assuming decaying learning rates, which make learning even slower.
22. A Critique
19. Convergence of $\rho$ updates.
"... while the second 'slow' iteration gradually guides [$\rho_t$] to the desired value." Abounadi et al. 2001
It is the slow one!
It must be so, for sufficient approximation of the current policy value for improvement.
Initially biased towards the (likely poor) observed returns at the start.
A long time must probably pass following the optimal policy for $\rho$ to converge to its actual value.
23. Our method
20.
Favours an understanding of the $-\rho$ term, either alone in ARRL or as a factor of costs in SMDPs, not so much as an approximation to average rewards but as a punishment for taking actions, which must be made "worth it" by the rewards. I.e. nudging.
Exploits the splitting of SSP Q-Learning, in order to focus on the value/cost of a single state, $\hat{s}$.
Thus, it also assumes the existence of a recurrent state, and that the unichain property holds. (For the time being)
Attempts to ensure accelerated convergence of the $\rho$ updates, in a context in which certain, efficient convergence can easily be introduced.
25. Fractional programming
21. So, "Bertsekas splitting" of $\hat{s}$ into initial $s_I$ and terminal $s_T$.
Then, from $s_I$:
Any policy $\pi \in \Pi$ has an expected return until termination $v^\pi(s_I)$, and an expected cost until termination $c^\pi(s_I)$.
The ARRL problem, then, becomes $\max_{\pi \in \Pi} \dfrac{v^\pi(s_I)}{c^\pi(s_I)}$.
Lemma
$$\operatorname*{argmax}_{\pi \in \Pi} \frac{v^\pi(s_I)}{c^\pi(s_I)} = \operatorname*{argmax}_{\pi \in \Pi} \left[ v^\pi(s_I) + \rho^*\left(-c^\pi(s_I)\right) \right]$$
for $\rho^*$ such that $\max_{\pi \in \Pi} \left[ v^\pi(s_I) + \rho^*\left(-c^\pi(s_I)\right) \right] = 0$.
26. Fractional programming
22. Implications.
Assume the gain $\rho^*$ is known.
Then the nonlinear SMDP problem reduces to RL.
Which is better studied, well understood, simpler, and for which sophisticated, efficient algorithms exist.
If we only use $(r - \rho^* c)(s, a, s')$.
Problem: $\rho^*$ is usually not known.
27. Nudging
23. Idea:
Separate reinforcement learning (leave it to the pros) from updating $\rho$.
Thus, value-learning becomes method-free. We can use any old RL method.
The gain update is actually the most critical step.
Punish too little, and the agent will not care about hurrying, only about collecting reward.
Punish too much, and the agent will only care about finishing already.
In that sense, $(r - \rho c)$ is like picking fruit inside a maze.
28. Nudging
24. The problem reduces to a sequence of RL problems.
For a sequence of (temporarily fixed) $\rho_k$.
Some of the methods already provide an indication of the sign of the $\rho$ updates.
We just don't hurry to update $\rho$ after taking a single action.
Plus the method comes armed with a termination condition:
As soon as $H^k(s_I) = 0$, then $\pi^k = \pi^*$.
29. Nudging
25.
Algorithm 2 Nudged SSP Learning
Initialize
repeat
  Set reward scheme to $(r - c\rho)$
  Solve by any RL method
  Update $\rho$ from the current $H^\pi(s_I)$
until $H^\pi(s_I) = 0$
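A minimal sketch of Algorithm 2 as an outer loop. The inner solver `solve_rl` and the update rule `update_rho` are assumed interfaces: any RL method and any gain update, such as the optimal-nudging rule developed later in the deck, can be plugged in.

```python
def nudged_ssp_learning(solve_rl, update_rho, rho0=0.0, tol=1e-6, max_iters=100):
    """Outer loop of Algorithm 2: repeatedly solve the RL problem with rewards
    (r - rho*c) using any RL method, then update rho from the average-adjusted
    value of the initial state H(s_I), until H(s_I) is (numerically) zero."""
    rho = rho0
    for _ in range(max_iters):
        H_sI = solve_rl(rho)           # solve with reward scheme (r - rho*c); returns H(s_I)
        if abs(H_sI) < tol:
            break                      # termination: the current policy is gain optimal
        rho = update_rho(rho, H_sI)    # e.g., the optimal-nudging update
    return rho
```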
30. $w$-$l$ space
26. D
We will propose a method for updating $\rho$ and show that it minimizes uncertainty between steps. For that, we will use a transformation that extends the work of our CIG paper. But first.
Let $D$ be a bound on the magnitude of the unnudged reward:
$$D \ge \limsup_{\pi \in \Pi} \{ H^\pi(s_I) \mid \rho = 0 \}$$
$$-D \le \liminf_{\pi \in \Pi} \{ H^\pi(s_I) \mid \rho = 0 \}$$
Observe that the interval $(-D, D)$ bounds $\rho^*$, but the upper bound is tight only in ARRL, if all of the $D$ reward is received in a single step from $s_I$.
31. $w$-$l$ space
27. All policies $\pi \in \Pi$, from (that is, at) $s_I$, have:
real expected value $|v^\pi(s_I)| \le D$,
positive cost $c^\pi(s_I) \ge 1$.
28.a The $w$-$l$ transformation:
$$w = \frac{D + v^\pi(s_I)}{2\, c^\pi(s_I)} \qquad l = \frac{D - v^\pi(s_I)}{2\, c^\pi(s_I)}$$
33. $w$-$l$ space
29. Properties:
$w, l \ge 0$
$w, l \le D$
$w + l = \dfrac{D}{c^\pi(s_I)} \le D$
$v^\pi(s_I) = D \iff l = 0$
$v^\pi(s_I) = -D \iff w = 0$
$\lim_{c^\pi(s_I) \to \infty} (w, l) = (0, 0)$
30. Inverse transformation:
$$v^\pi(s_I) = D\, \frac{w^\pi - l^\pi}{w^\pi + l^\pi} \qquad c^\pi(s_I) = D\, \frac{1}{w^\pi + l^\pi}$$
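A small numerical sketch of the transformation and its inverse (illustrative values only), useful to confirm that the mapping round-trips:

```python
def to_wl(v, c, D):
    """Map expected value v and expected cost c (from s_I) to w-l coordinates."""
    return (D + v) / (2.0 * c), (D - v) / (2.0 * c)

def from_wl(w, l, D):
    """Inverse transformation back to (value, cost)."""
    return D * (w - l) / (w + l), D / (w + l)

w, l = to_wl(v=3.0, c=2.0, D=10.0)
print(w, l)                # 3.25 1.75
print(from_wl(w, l, 10.0)) # (3.0, 2.0) up to rounding
```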
35. $w$-$l$ space
31. Value.
$$v^\pi(s_I) = D\, \frac{w^\pi - l^\pi}{w^\pi + l^\pi}$$
Level sets are lines.
On the $w$-axis, expected value $D$; on the $l$-axis, expected value $-D$; on $w = l$, expected value $0$.
Optimization is equivalent to fanning from $l = 0$.
Not convex, but it splits the space.
So optimizers are vertices of the convex hull of policies.
[Figure: value level sets in the $w$-$l$ plane, fanning from the $l = 0$ axis.]
36. $w$-$l$ space
32. Cost.
$$c^\pi(s_I) = D\, \frac{1}{w^\pi + l^\pi}$$
Level sets are lines with slope $-1$.
$w + l = D$: expected cost 1.
Cost decreases with distance to the origin.
Cost optimizers (both max and min) are also vertices.
[Figure: cost level sets (cost 1, 2, 4, 8) in the $w$-$l$ plane.]
37. $w$-$l$ space
33. The origin.
Policies of infinite expected cost.
They mean the problem is not unichain or $s_I$ is not recurrent.
And they are troublesome for optimizing value.
So, under our assumptions, the origin does not belong to the space.
39. Nudged value in the $w$-$l$ space
35. Nudged value, for some $\rho$.
$$\operatorname*{argmax}_{\pi \in \Pi} \left[ v^\pi(s_I) - \rho\, c^\pi(s_I) \right]
= \operatorname*{argmax}_{\pi \in \Pi} \left[ D\, \frac{w^\pi - l^\pi}{w^\pi + l^\pi} - \rho\, \frac{D}{w^\pi + l^\pi} \right]
= \operatorname*{argmax}_{\pi \in \Pi} D\, \frac{w^\pi - l^\pi - \rho}{w^\pi + l^\pi}$$
40. Nudged value in the $w$-$l$ space
36. Nudged value level sets.
(For a set $\hat{\rho}$ and all policies $\pi$ with a given $\hat{h}$)
$$l^\pi = \frac{D - \hat{h}}{D + \hat{h}}\, w^\pi - \frac{\hat{\rho}\, D}{D + \hat{h}}$$
Lines!
The slope depends only on $\hat{h}$ (i.e., not on $\hat{\rho}$).
41. Nudged value in the $w$-$l$ space
37. Pencil of lines.
For a set $\hat{\rho}$, any two $\hat{h}$ and $\hat{h}'$ level-set lines have intersection:
$$\left( \frac{\hat{\rho}}{2},\; -\frac{\hat{\rho}}{2} \right)$$
A pencil of lines with that vertex.
42. Nudged value in the $w$-$l$ space
38. Zero nudged value.
$$l^\pi = \frac{D - 0}{D + 0}\, w^\pi - \frac{\hat{\rho}\, D}{D + 0}$$
$$l^\pi = w^\pi - \hat{\rho}$$
Unity slope.
Negative values above, positive below.
[Figure: the zero-nudged-value line of unit slope through $(\hat{\rho}/2, -\hat{\rho}/2)$ in the $w$-$l$ plane.]
If the whole cloud is above $w = l$, some negative nudging is the optimizer. (Encouragement)
44. Nudged value in the $w$-$l$ space
40. Initial bounds on $\rho^*$:
$$-D \le \rho^* \le D$$
(Duh! but nice geometry)
45. Enclosing triangle
41. Definition.
$\triangle ABC$ such that:
$\triangle ABC \subset$ the $w$-$l$ space.
$(w^*, l^*) \in \triangle ABC$.
Slope of the $AB$ segment, unity.
$w_A \le w_B$
$w_A \le w_C$
42. Nomenclature.
[Figure: an enclosing triangle $ABC$ in the $w$-$l$ plane, with slopes labelled $m_\alpha$, $m_\beta$, $m_\gamma$.]
46. Enclosing Triangle
43. (New) bounds on $\rho$.
Def. The slope-$m_\zeta$ projection of a point $X(w_X, l_X)$ onto the $w = -l$ line:
$$X_\zeta = \frac{m_\zeta w_X - l_X}{m_\zeta + 1}$$
Bounds:
$$A_\alpha = B_\alpha \le \rho^* \le C_\alpha$$
$$w_A - l_A = w_B - l_B \le \rho^* \le w_C - l_C$$
44. So, collinearity (of A, B and C) implies optimality. (Even if there are multiple optima)
47. Right and left uncertainty
45. Iterating inside an enclosing triangle.
1. Set $\hat{\rho}$ to some value within the bounds ($w_A - l_A \le \hat{\rho} \le w_C - l_C$).
2. Solve the problem with rewards $(r - \hat{\rho} c)$.
46. Optimality.
If $h(s_I) = 0$:
Done!
The optimal policy found for the current problem solves the SMDP and the termination condition has been met.
48. Right and left uncertainty
47.a If $h(s_I) > 0$: right uncertainty.
[Figure: points $S$ and $T$ on the triangle and the right uncertainty $y_1$ in the $w$-$l$ plane.]
49. Right and left uncertainty
47.b Right uncertainty.
Derivation:
$$y_1 = S_\alpha - T_\alpha = \frac{1}{2}\left[ (1 - m_\beta) w_S - (1 - m_\gamma) w_T - (m_\gamma - m_\beta) w_C \right]$$
Maximization:
$$y_1 = \frac{-2s\sqrt{ab\,(\hat{\rho}/2 - C_\beta)(\hat{\rho}/2 - C_\gamma)} + a(\hat{\rho}/2 - C_\beta) + b(\hat{\rho}/2 - C_\gamma)}{c}$$
where
$$s = \operatorname{sign}(m_\beta - m_\gamma), \quad a = (1 - m_\gamma)(m_\beta + 1), \quad b = (1 - m_\beta)(m_\gamma + 1), \quad c = b - a = 2(m_\gamma - m_\beta)$$
50. Right and left uncertainty
48.a If $h(s_I) < 0$: left uncertainty.
[Figure: point $R$ on the triangle and the left uncertainty $y_2$ in the $w$-$l$ plane.]
51. Right and left uncertainty
48.b Left uncertainty.
It is maximal where expected. (When the value level set crosses $B$)
$$y_2 = R_\alpha - Q_\alpha$$
$$y_2 = (B_\alpha - B_\gamma)\, \frac{\hat{\rho}/2 - B_\alpha}{\hat{\rho}/2 - B_\gamma}$$
52. Right and left uncertainty
49. Fundamental lemma.
As $\hat{\rho}$ grows, the maximal right uncertainty is monotonically decreasing and the maximal left uncertainty is monotonically increasing, and both are non-negative with minimum 0.
53. Optimal nudging
50.
Find $\hat{\rho}$ (between the bounds, obviously) such that the maximum resulting uncertainty, either left or right, is minimized.
Since both are monotonic and have minimum 0, the min-max occurs when the maximum left and right uncertainties are equal.
Remark: bear in mind this is the worst case. It can terminate immediately.
$\hat{\rho}$ is a gain estimate, but it is neither biased towards observations (initial or otherwise) nor slowly updated.
Optimal nudging is "optimal" in the sense that, with this update, the maximum uncertainty range of the resulting $\rho$ values is minimal.
54. Optimal nudging
51. Enclosing triangle into enclosing triangle.
52. Strictly smaller (in both area and, importantly, resulting uncertainty).
55. Obtaining an initial enclosing triangle
53. Setting $\rho^{(0)} = 0$ and solving.
Maximizes reward irrespective of cost. (The usual RL problem)
Can be interpreted geometrically as fanning from the $w$ axis to find the policy whose $w, l$ coordinates subtend the smallest angle.
The resulting optimizer maps to a point somewhere along a line with intercept at the origin.
54. The optimum of the SMDP problem lies above, but not behind, that line.
Else, contradiction.
56. Obtaining an initial enclosing triangle
56. Either way, after iteration 0, uncertainty is reduced by at least half.
57. Conic intersection
57. The maximum right uncertainty is a conic!
$$\begin{pmatrix} r & y_1 & 1 \end{pmatrix}
\begin{pmatrix} c & -(b+a) & -C_\alpha c \\ -(b+a) & c & C_\beta a + C_\gamma b \\ -C_\alpha c & C_\beta a + C_\gamma b & C_\alpha^2 c \end{pmatrix}
\begin{pmatrix} r \\ y_1 \\ 1 \end{pmatrix} = 0$$
58. The maximum left uncertainty is a conic!
$$\begin{pmatrix} r & y_2 & 1 \end{pmatrix}
\begin{pmatrix} 0 & 1 & B_\gamma - C_\gamma \\ 1 & 0 & -B_\gamma \\ B_\gamma - C_\gamma & -B_\gamma & -2B_\alpha (B_\gamma - C_\gamma) \end{pmatrix}
\begin{pmatrix} r \\ y_2 \\ 1 \end{pmatrix} = 0$$
58. Conic intersection
59. Intersecting them is easy.
60. And cheap. (Requires in principle constant time and simple matrix operations)
61. So plug it in!
59. Termination Criteria
62.
We want to reduce uncertainty to $\varepsilon$.
Because it is a good idea. (Right?)
So there's your termination condition right there.
63. Alternatively, stop when $|h^{(k)}(s_I)| < \varepsilon$.
64. In any case, if the same policy remains optimal and the sign of its nudged value changes between iterations, stop: it is the optimal solution of the SMDP problem.
60. Finding D
65. A quick and dirty method:
1. Maximize cost (or episode length, if all costs equal 1).
2. Multiply by the largest unsigned reinforcement.
66. So, at most one more RL problem.
67. If $D$ is estimated too large: wider initial bounds and longer computation, but OK.
68. If $D$ is estimated too small (by other methods, of course): points fall outside the triangle in the $w$-$l$ space. (But where?)
61. Recurring state + unichain considerations
69. Feinberg and Yang: deciding whether the unichain condition holds can be done in polynomial time if a recurring state exists.
70. Existence of a recurring state is common in practice.
71. (Future work) It can maybe be induced using $\varepsilon$-MDPs. (Maybe)
72. At least one case in which no unichain is no problem: games.
Certainty of positive policies.
Non-positive chains.
73. Happens! (See experiments)
62. Complexity
74. Discounted RL is PAC (efficient).
75. In the problem size parameters ($|S|$, $|A|$) and $1/\gamma$.
76. Episodic undiscounted RL is also PAC. (Following similar arguments, but with slightly more intricate derivations)
77. So we call a PAC (efficient) method a number of times.
63. Complexity
78. Most worstest case foreverest: when choosing $\rho^{(k)}$ is not reducing uncertainty.
79. Reducing it by half is a better bound for our method.
80. ... and it is a tight bound...
81. ... in cases that are nearly optimal from the outset.
82. So, at worst, $\log \frac{1}{\varepsilon}$ calls to PAC: PAC!
64. Complexity
83. Whoops, we proved complexity! That's a first for SMDP (or ARRL, for that matter).
84. And we inherit convergence from the invoked RL, so there's also that.
65. Typically much faster
85. The worst case happens when we are "already there".
86. Otherwise, it depends, but it is certainly better.
87. Multi-iteration reduction in uncertainty is way better than 0.5 per iteration, because it accumulates geometrically.
88. Empirical complexity is better than the already very good upper bound.
66. Bibliography I
S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1):159-195, 1996.
Reinaldo Uribe, Fernando Lozano, Katsunari Shibata, and Charles Anderson. Discount and speed/execution tradeoffs in Markov decision process games. In Computational Intelligence and Games (CIG), 2011 IEEE Conference on, pages 79-86. IEEE, 2011.