This document provides an introduction to reinforcement learning. It begins with an overview of reinforcement learning and how it differs from supervised and unsupervised learning. It then discusses how to model reinforcement learning problems using Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). The document provides details on MDPs, including how to define the value function and solve MDPs using value iteration and policy iteration. It also discusses how to learn MDP models and use Q-learning, a model-free reinforcement learning method.
In real-world scenarios, decision making can be a very challenging task, even for modern computers. Generalized reinforcement learning (GRL) was developed to facilitate complex decision making in highly dynamic systems through flexible policy generalization mechanisms using kernel-based methods. GRL combines the use of sampling, kernel functions, stochastic processes, non-parametric regression, and functional clustering.
An Introduction to Reinforcement Learning - The Doors to AGI (Anirban Santara)
Reinforcement Learning (RL) is a genre of Machine Learning in which an agent learns to choose optimal actions in different states in order to reach its specified goal, solely by interacting with the environment through trial and error. Unlike supervised learning, the agent does not get examples of "correct" actions in given states as ground truth. Instead, it has to use feedback from the environment (which can be sparse and delayed) to improve its policy over time. The formulation of the RL problem closely resembles the way in which human beings learn to act in different situations. Hence it is often considered the gateway to achieving the goal of Artificial General Intelligence.
This talk introduces the audience to key theoretical concepts such as the formulation of the RL problem as a Markov Decision Process (MDP) and the solution of MDPs using dynamic programming and policy-gradient-based algorithms. State-of-the-art deep reinforcement learning algorithms will also be covered, along with a case study of the application of reinforcement learning in robotics.
We propose a distributed deep learning model to learn control policies directly from high-dimensional sensory input using reinforcement learning (RL). We adapt the DistBelief software framework to efficiently train the deep RL agents using the Apache Spark cluster computing framework.
Reinforcement Learning is a growing subset of Machine Learning and one of the most important frontiers of Artificial Intelligence. Its goal is to capture higher logic and use more adaptable algorithms than classical Machine Learning.
Formally, it denotes a set of algorithms that deal with sequential decision-making and are capable of making highly intelligent decisions depending on their local environment.
Reinforcement Learning problems can be described as an agent that has to make decisions in its environment in order to optimize a cumulative reward, and it is clear that this formalization applies to a great variety of tasks in many different fields.
In this talk, the main features of the most important Reinforcement Learning algorithms will be illustrated and explored in depth, with some concrete, explanatory examples.
Bio:
Marco Del Pra
Marco was born in Venice 41 years ago, has two master's degrees (Computer Science and Mathematics), and has two important publications in applied mathematics.
He has been working in Artificial Intelligence for 10 years, mainly as a freelancer. Among others, he worked for the European Commission's Joint Research Center, for Cuebiq, and as Data Science Lead for Microsoft's Artificial Intelligence projects in Italy.
This presentation contains an introduction to reinforcement learning, a comparison with other learning paradigms, an introduction to Q-Learning, and some applications of reinforcement learning in video games.
Exploration Strategies in Reinforcement Learning (Dongmin Lee)
I presented "Exploration Strategies in Reinforcement Learning" at AI Robotics KR.
- Exploration strategies in RL
1. Epsilon-greedy
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Information theoretic exploration (e.g., Entropy Regularization in RL)
Thank you.
Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
Maximum Entropy Reinforcement Learning (Stochastic Control) (Dongmin Lee)
I reviewed the following papers.
- T. Haarnoja, et al., “Reinforcement Learning with Deep Energy-Based Policies", ICML 2017
- T. Haarnoja, et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", ICML 2018
- T. Haarnoja, et al., “Soft Actor-Critic Algorithms and Applications", arXiv preprint 2018
Thank you.
Speaker: Donghyun Kwak (곽동현; PhD student at Seoul National University, currently at NAVER Clova)
An overview of reinforcement learning and recent deep-learning-based RL trends.
Talk video:
http://tv.naver.com/v/2024376
https://youtu.be/dw0sHzE1oAc
In some applications, the output of the system is a sequence of actions. In such cases a single action is not important on its own; consider game playing, where a single move by itself is not that important. As the agent acts on its environment, it receives some evaluation of its action (reinforcement), but it is not told which action is the correct one to achieve its goal.
Reinforcement Learning Guide For Beginners (gokulprasath06)
Web Optimization is a Reinforcement Learning problem. Q-Learning is introduced as a way to integrate AB Testing, Attribution, and Predictive Targeting.
To make Reinforcement Learning algorithms work in the real world, one has to get around (what Sutton calls) the "deadly triad": the combination of bootstrapping, function approximation, and off-policy evaluation. The first step here is to understand the Value Function Vector Space/Geometry and then make one's way into Gradient TD algorithms (a big breakthrough in overcoming the "deadly triad").
Introduction to Reinforcement Learning
Edward Balaban

4-5. What is Reinforcement Learning?
Supervised learning: learn a model from training data that maps inputs to outputs, then use it to generate outputs for future inputs.
Unsupervised learning: recognize patterns in input data.
Reinforcement learning (RL): provide the learning agent with a reward function and let it figure out the best strategy for obtaining large rewards.
RL has been used in such diverse applications as:
- Business strategy planning
- Aircraft control
- Optimal routing (data packets, vehicles, etc.)
- Robot motion control
Some of the material in these slides is borrowed from the Andrew Ng and Wheeler Ruml lectures on reinforcement learning.
6-7. How do we model for RL?
Modeling frameworks with increasing levels of uncertainty:
- State space models: no uncertainty
- Markov Decision Processes (MDPs): uncertainty in action effects
- Partially Observable Markov Decision Processes (POMDPs): uncertainty in action effects and in the current state
Other modeling frameworks exist, e.g. Predictive State Representations:
- Generalizations of POMDPs that were shown to have both a greater representational capacity than POMDPs and to yield representations that are at least as compact (Singh et al., 2004 and Even-Dar et al., 2005)
- Represent the state of a dynamic system by tracking occurrence probabilities of a set of future events (tests), conditioned on past events (histories)
- Rely solely on observable quantities (unlike POMDPs)
9-11. Markov Decision Process (MDP)
- States: $S = \{s_1, \ldots, s_{|S|}\}$
- Actions: $A = \{a_1, \ldots, a_{|A|}\}$
- Transition probabilities: $T(s, a, s') = \Pr(s' \mid s, a)$
- Rewards: $R: S \to \mathbb{R}$
- Policy: $\pi: S \to A$, with $\Pi$ the set of all policies

Value function:
$$V^\pi(s) = E\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s, \pi\right]$$

Bellman equation:
$$V^\pi(s) = R(s) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V^\pi(s')$$

Optimal value function:
$$V^*(s) = \max_\pi V^\pi(s)$$
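To make these definitions concrete, here is a minimal Python sketch (not from the slides): a hypothetical two-state MDP stored as arrays, with policy evaluation done by iterating the Bellman equation above to a fixed point.

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP: T[s, a, s'] = Pr(s' | s, a)
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([0.0, 1.0])   # reward depends on the state only, as on the slide
gamma = 0.9

def policy_evaluation(pi, T, R, gamma, tol=1e-8):
    """Iterate V(s) <- R(s) + gamma * sum_s' T(s, pi(s), s') V(s') to a fixed point."""
    n = len(R)
    V = np.zeros(n)
    while True:
        V_new = np.array([R[s] + gamma * T[s, pi[s]] @ V for s in range(n)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

pi = np.array([0, 1])  # a fixed deterministic policy
print(policy_evaluation(pi, T, R, gamma))
```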
12. Markov Decision Process (MDP), continued
Bellman equation for the optimal value function:
$$V^*(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s')$$
Optimal policy:
$$\pi^*(s) = \arg\max_{a \in A} \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s')$$
$$V^*(s) = V^{\pi^*}(s) \ge V^\pi(s) \quad \text{for any policy } \pi$$
13-14. Learning an MDP
Usually $S$, $A$, and $\gamma$ are known. The transition probabilities can be estimated from experience:
$$T(s, a, s') = \frac{\#\text{ times took action } a \text{ in state } s \text{ and got to } s'}{\#\text{ times took action } a \text{ in state } s}$$
Similarly, if $R$ is unknown, we can also pick our estimate of the expected immediate reward $R(s)$ in state $s$ to be the average reward observed in that state.
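A short sketch of these count-based estimates, under the assumption that experience is available as a log of (s, a, s', r) tuples; the log format and function name are illustrative, not from the slides.

```python
import numpy as np

def estimate_model(experience, n_states, n_actions):
    """Count-based estimates of T(s, a, s') and R(s), as on the slide.
    `experience` is an iterable of (s, a, s_next, r) tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros(n_states)
    visits = np.zeros(n_states)
    for s, a, s_next, r in experience:
        counts[s, a, s_next] += 1
        reward_sums[s] += r
        visits[s] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Use a uniform distribution where a (s, a) pair was never tried
    T_hat = np.divide(counts, totals,
                      out=np.full_like(counts, 1.0 / n_states),
                      where=totals > 0)
    # Average observed reward per state (zero where a state was never visited)
    R_hat = np.divide(reward_sums, visits,
                      out=np.zeros(n_states), where=visits > 0)
    return T_hat, R_hat
```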
15. Solving an MDP: Value Iteration
$$\forall s \in S,\; V(s) \leftarrow 0$$
Repeat until convergence:
$$\forall s \in S,\; V(s) \leftarrow R(s) + \max_{a \in A} \gamma \sum_{s' \in S} T(s, a, s')\, V(s')$$
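A direct translation of this loop into Python might look like the sketch below; it reuses the T, R, gamma array conventions from the earlier snippets and also returns the greedy policy with respect to the final V.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Value iteration: V(s) <- R(s) + max_a gamma * sum_s' T(s, a, s') V(s')."""
    n_states = len(R)
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = gamma * sum_s' T(s, a, s') * V(s')
        Q = gamma * np.einsum("sat,t->sa", T, V)
        V_new = R + Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # value function and greedy policy
        V = V_new
```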
16. Convergence
From the definition of the Bellman operator:
$$\|B(V_1) - B(V_2)\|_\infty = \max_{s \in S}\left| R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V_1(s') - R(s) - \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V_2(s') \right| \quad (1)$$
$$= \gamma \cdot \max_{s \in S}\left| \max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V_1(s') - \max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V_2(s') \right| \quad (2)$$
To go further, we need to understand whether the two maximization operations over the set of actions for $V_1$ and $V_2$ can be combined. To do that, let's use the following definitions:
$$f_1(a) = \sum_{s' \in S} P_{sa}(s')\, V_1(s') \quad (3)$$
$$f_2(a) = \sum_{s' \in S} P_{sa}(s')\, V_2(s') \quad (4)$$
17. Convergence, continued
In order to, for the moment, get rid of the max operators, let's also define $a_1^*$ as the action that maximizes $f_1$ and $a_2^*$ as the action that maximizes $f_2$. Then $\left|\max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V_1(s') - \max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V_2(s')\right|$ can be written as $|f_1(a_1^*) - f_2(a_2^*)|$.
Since $f_1(a_1^*) \ge f_1(a_2^*)$ and $f_2(a_2^*) \ge f_2(a_1^*)$ (by virtue of $a_1^*$ and $a_2^*$ maximizing $f_1$ and $f_2$, respectively), we can "unpack" the absolute value operator as follows:
$$f_1(a_1^*) - f_2(a_2^*) \le f_1(a_1^*) - f_2(a_1^*) \quad (5)$$
$$f_2(a_2^*) - f_1(a_1^*) \le f_2(a_2^*) - f_1(a_2^*) \quad (6)$$
Then it is also true that
$$f_1(a_1^*) - f_2(a_2^*) \le |f_1(a_1^*) - f_2(a_1^*)| \quad (7)$$
$$f_2(a_2^*) - f_1(a_1^*) \le |f_2(a_2^*) - f_1(a_2^*)| \quad (8)$$
And, finally, it should also be true for $\forall a$ that
$$f_1(a_1^*) - f_2(a_2^*) \le \max_a |f_1(a) - f_2(a)| \quad (9)$$
$$f_2(a_2^*) - f_1(a_1^*) \le \max_a |f_2(a) - f_1(a)| \quad (10)$$
Therefore we can conclude that
$$\left|\max_a f_1(a) - \max_a f_2(a)\right| \le \max_a |f_1(a) - f_2(a)| \quad (11)$$
18. Convergence, continued
Then Equation 2 can be rewritten as an inequality:
$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s \in S} \max_{a \in A} \left| \sum_{s' \in S} P_{sa}(s')\, V_1(s') - \sum_{s' \in S} P_{sa}(s')\, V_2(s') \right| \quad (12)$$
Simplifying further, we get:
$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s \in S} \max_{a \in A} \left| \sum_{s' \in S} P_{sa}(s') \left( V_1(s') - V_2(s') \right) \right| \quad (13)$$
By using the triangle inequality and the fact that $P_{sa}(s') \ge 0$, we can rewrite the above expression as
$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s \in S} \max_{a \in A} \sum_{s' \in S} P_{sa}(s') \left| V_1(s') - V_2(s') \right| \quad (14)$$
$\sum_{s' \in S} P_{sa}(s') \left| V_1(s') - V_2(s') \right|$ can be seen as the expectation of $\left| V_1(s') - V_2(s') \right|$. It is, therefore, no greater than the maximum value that $\left| V_1(s') - V_2(s') \right|$ can take. Thus the above inequality can be written as:
$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s \in S} \max_{a \in A} \max_{s' \in S} \left| V_1(s') - V_2(s') \right| \quad (15)$$
19. Convergence, continued
The remaining expression on the right can only be maximized with respect to $s'$, so we can simplify to
$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s' \in S} \left| V_1(s') - V_2(s') \right| \quad (16)$$
What we have on the right-hand side now is the definition of the infinity norm, therefore we finally obtain:
$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \|V_1 - V_2\|_\infty \quad (17)$$
We'll prove that the Bellman operator has at most one fixed point by contradiction. Let's assume that there are two distinct fixed points, $V_1$ and $V_2$. Since $B(V_1) = V_1$ and $B(V_2) = V_2$, the inequality obtained above becomes
$$\|V_1 - V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty \quad (18)$$
$$(1 - \gamma)\, \|V_1 - V_2\|_\infty \le 0 \quad (19)$$
Since $\gamma \in [0, 1)$, we have $1 - \gamma > 0$. An infinity norm of any variable is non-negative, so the only way for the above expression to be true is if $\|V_1 - V_2\|_\infty = 0$, and, consequently, if $V_1 = V_2$. Therefore we have proved that the Bellman operator has at most one fixed point.
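The contraction property (17) is also easy to check numerically; the following sketch (not from the slides) builds a random hypothetical MDP and verifies the inequality for a pair of arbitrary value functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random hypothetical MDP to check the contraction property numerically
n_s, n_a, gamma = 5, 3, 0.9
T = rng.random((n_s, n_a, n_s))
T /= T.sum(axis=2, keepdims=True)   # normalize rows into probability distributions
R = rng.random(n_s)

def bellman(V):
    """The Bellman operator B(V)(s) = R(s) + gamma * max_a sum_s' T(s,a,s') V(s')."""
    return R + gamma * np.einsum("sat,t->sa", T, V).max(axis=1)

V1, V2 = rng.random(n_s), rng.random(n_s)
lhs = np.max(np.abs(bellman(V1) - bellman(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(lhs <= rhs + 1e-12)  # True: ||B(V1) - B(V2)||_inf <= gamma * ||V1 - V2||_inf
```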
20. Using an MDP with value iteration
Repeat:
- Execute $\pi$ in the MDP for some number of trials.
- Using the accumulated experience in the MDP, update the estimates for $T(s, a, s')$ (and $R$, if applicable).
- Apply value iteration to get a new estimated value function $V$.
- Update $\pi$ to be the greedy policy with respect to $V$.
21. Solving an MDP: Policy Iteration
Initialize $\pi$ randomly.
Repeat until convergence:
- $V \leftarrow V^\pi$
- $\forall s \in S,\; \pi(s) \leftarrow \arg\max_{a \in A} \sum_{s' \in S} T(s, a, s')\, V(s')$
$V \leftarrow V^\pi$ can be done efficiently by solving Bellman's equations as a system of linear equations.
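A sketch of this procedure, with the evaluation step $V \leftarrow V^\pi$ solved as a linear system exactly as the slide suggests (array conventions as in the earlier snippets):

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Policy iteration; the evaluation step solves the Bellman equations
    (I - gamma * T_pi) V = R as a linear system."""
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        T_pi = T[np.arange(n_states), pi]         # T_pi[s, s'] = T(s, pi(s), s')
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        # Greedy improvement: argmax_a sum_s' T(s, a, s') V(s')
        pi_new = np.einsum("sat,t->sa", T, V).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```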
22-27. Solving (and learning) an MDP: Q-learning
Model-free reinforcement learning. Recall:
$$V(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, V(s')$$
Think of Q-learning as a regression!
Explore states: in state $s$, took action $a$, got reward $r$, ended up in state $s'$: $(s, a, s', r)$.
$$Q(s, a) \leftarrow Q(s, a) + \alpha \cdot (\text{error})$$
$$Q(s, a) \leftarrow Q(s, a) + \alpha \cdot (\text{sensed} - \text{predicted})$$
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( \left[ r + \gamma \max_{a'} Q(s', a') \right] - Q(s, a) \right)$$
Stochastic update with step size $\alpha$.
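A tabular Q-learning sketch built around this update rule; the env_reset/env_step interfaces are assumptions standing in for some environment, not code from the talk.

```python
import numpy as np

def q_learning(env_reset, env_step, n_states, n_actions,
               episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.
    Assumed interfaces: env_reset() -> s and env_step(s, a) -> (s', r, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env_reset(), False
        while not done:
            # Epsilon-greedy action selection
            a = rng.integers(n_actions) if rng.random() < epsilon else Q[s].argmax()
            s_next, r, done = env_step(s, a)
            # Stochastic update with step size alpha (the slide's final rule)
            target = r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```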
28-30. Continuous State MDP
A more realistic form of MDP:
- Needs a simulator
- Solving continuous-state MDPs:
  - LQR
  - Fitted Value Iteration
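As a rough illustration of fitted value iteration, the sketch below regresses V over a sample of continuous states instead of storing a table; the simulator, reward function, and choice of a linear regressor are all assumptions made for the sake of the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fitted_value_iteration(simulator, sample_states, actions, reward,
                           gamma=0.95, n_next=10, iters=50):
    """Fitted value iteration sketch: V is a regression model over sampled
    continuous states. simulator(s, a) is assumed to draw a next state s'
    from the (possibly stochastic) dynamics."""
    model = LinearRegression()
    model.fit(sample_states, np.zeros(len(sample_states)))  # V = 0 initially
    for _ in range(iters):
        targets = []
        for s in sample_states:
            # Monte Carlo estimate of E[V(s')] for each action via the simulator
            q_values = [
                np.mean([model.predict(simulator(s, a).reshape(1, -1))[0]
                         for _ in range(n_next)])
                for a in actions
            ]
            targets.append(reward(s) + gamma * max(q_values))
        model.fit(sample_states, np.array(targets))  # the supervised regression step
    return model
```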
31. An example MDP - the inverted pendulum
- A thin pole is connected via a free hinge to a cart.
- The cart can move laterally on a smooth table surface.
- Failure occurs if:
  - the angle of the pole deviates by more than a certain amount from the vertical position
  - the cart's position goes out of bounds
- The objective is to develop a controller to balance the pole.
- The only actions the controller can take are to accelerate the cart either left or right.
- The algorithm cannot use any knowledge of the dynamics of the underlying system.
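This setup matches the classic CartPole benchmark; as an illustration (assuming the gymnasium package is available), an agent interacts with it through observations, actions, and failure-terminated episodes like so:

```python
import gymnasium as gym  # assumption: the gymnasium package; CartPole-v1 matches this setup

env = gym.make("CartPole-v1")  # cart on a track, pole on a free hinge; actions: push left/right
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # a random policy; an RL agent would choose here
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated      # episode ends on pole-angle or position bounds

print(total_reward)
env.close()
```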
34. Partially Observable Markov Decision Process (POMDP)
- States: $S = \{s_1, \ldots, s_{|S|}\}$
- Actions: $A = \{a_1, \ldots, a_{|A|}\}$
- Transition probabilities: $T(s, a, s') = \Pr(s' \mid s, a)$
- Observations: $Z = \{z_1, \ldots, z_{|Z|}\}$
- Observation probabilities: $O(z, a, s') = \Pr(z \mid s', a)$
- Belief state (agent): $b = \{b(s_1), \ldots, b(s_{|S|})\}: S \to [0, 1]^{|S|}$, with $\sum_{i=1}^{|S|} b(s_i) = 1$
- Belief space: $B$, the set of all belief states (infinite)
- Initial belief: $b_0$
- Rewards: $R: S \to \mathbb{R}$
- Policy: $\pi: B \to A$, with $\Pi$ the set of all policies
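The belief state is updated by Bayes' rule after each action and observation; a minimal discrete-case sketch (the continuous version of this update appears later in the deck) might look like the following, where the T and O array layouts are illustrative assumptions.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Discrete Bayes filter: b'(s') is proportional to
    O(z, a, s') * sum_s T(s, a, s') b(s).
    Assumed layouts: T[s, a, s'] and O[z, a, s']."""
    predicted = b @ T[:, a, :]          # sum_s b(s) T(s, a, s')
    updated = O[z, a, :] * predicted    # weight by the observation likelihood
    return updated / updated.sum()      # normalize so the belief sums to 1
```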
35. Solving a POMDP
Solving a realistic POMDP exactly is often computationally intractable.
Approximate method families:
- Point-based methods
- Monte Carlo methods
- Generalization methods
36-38. Example: Prognostic Decision Making
System Degradation
- All aerospace systems experience degradation.
- Degradation can be use- or time-dependent.
- The operating environment is often a significant factor.
(Image: JAXA Hayabusa)
Faults
- Degradation can accelerate if a fault occurs.
- In a complex, multi-component system a fault can have cascading effects.
- In case of a fault, a quick mitigation decision is often required.
(Image: United Flight 232)
System Health Management (SHM)
- Recent designs, e.g. the S-92, have more SHM capabilities (in fault detection and diagnosis).
- Still, maintenance is predominantly done based on fixed schedules.
- In-flight emergencies are handled through the skill and ingenuity of the crew and ground control.
(Image: Sikorsky S-92)
39-41. How Can We Do Better?
In recent years progress has been made in using physics modeling and computational methods for:
- Fault detection,
- Fault magnitude estimation,
- Degradation trajectory prediction (prognostics).
Research on how to utilize prognostic health information is in the very early stages, however.
Prognostic Decision Making (PDM)
The process of selecting system actions informed by predictions of the future system health state.
PDM can help with the following, for example:
- Component life extension,
- Fault mitigation,
- Mission replanning,
- Crew decision support in emergencies,
- Condition-based maintenance,
- Asset allocation.
42. System
Described as a continuous-state, continuous-action POMDP:
- State space: $S \subseteq \mathbb{R}^n$
- Action space: $A \subseteq \mathbb{R}^m$
- Observations: $Z \subseteq \mathbb{R}^p$
- Transition function: $T(s, a, s') = pdf(s' \mid s, a): S \times A \times S \to [0, \infty)$
- Observation function: $O(z', a, s') = pdf(z' \mid s', a): S \times A \times Z \to [0, \infty)$
- Belief state: $b(s) = pdf(s)$
- Belief space: $B$, the set of all belief states
- Initial belief: $b_0$
- Belief update: $b_{az'}(s') \propto O(z', a, s') \int_S T(s, a, s')\, b(s)\, ds$
- Policy: $\pi(a, b) = pdf(a \mid b): A \times B \to [0, \infty)$, with $\Pi$ the set of all policies
- Costs: $C = \{c_1(s, a), \ldots, c_{|C|}(s, a)\}: S \times A \to \mathbb{R}^{|C|}$
- Rewards: $R(s, r) = pdf(r \mid s): S \times \mathbb{R} \to [0, \infty)$
- Objectives: $F = \{f_1(s), \ldots, f_{|F|}(s)\}: S \to \mathbb{R}^{|F|}$
- Constraints: $G = \{g_1(s), \ldots, g_{|G|}(s)\}: S \to \mathbb{B}^{|G|}$
43. System Degradation
Let $H = \{h_1, \ldots, h_{|H|}\}$ be the vector of system health parameters incorporated into the state.
- Fault: $G_{fault} \subseteq G$ defines significant deviations from the expected nominal behavior. A fault occurs if $\exists i: g_i(s) = \text{true},\ g_i \in G_{fault}$.
- Failure: $G_{failure} \subseteq G$ defines states where the system loses functional capability with respect to a health parameter $h \in H$.
- System failure: $F: S \to \mathbb{B}$, a boolean function indicating when the entire system is effectively non-functional ($F$ is defined via the $G_{failure}$ set).
- End of Life (EoL): $t_{EoL}$, the time at which $F(s) = \text{true}$.
- Remaining Useful Life: $RUL = t_{EoL} - t$.
[Figure: a health parameter $h$ decaying over time $t$, crossing the fault threshold at $t_{fault}$ and the failure threshold at EoL.]
47. Decision Making
The process of finding (or approximating) $\pi^*$, such that
$$\pi^* = \arg\max_{\pi \in \Pi} J^\pi(b_t)$$
48. Case Study: UAV Mission Replanning
Given:
- An initial mission route (not necessarily optimized) which includes waypoint parameter constraints (e.g. on airspeed or bank angle).
- Each waypoint is associated with a payoff value.
- A healthy vehicle is able to complete the entire route within the energy and component health constraints.
- Transition costs between a pair of waypoints are history-dependent.
- A fault occurs that makes it impossible to complete the mission before the End of Life (EoL).
Find:
A policy $\pi$ that maximizes mission payoff and extends the remaining useful life.
49. Reasoning Architecture
[Block diagram: Decision Maker, Vehicle Simulation (including prognostic models), Diagnoser, and Vehicle, connected by the input route and parameter constraints, candidate routes, health and energy cost estimates, observations, and the initial/current fault sets.]
- A Particle Filter is currently used as the decision-making algorithm.
- The Decision Maker picks ordered waypoint subsets and parameter values for candidate routes and proposed routes.
- The vehicle simulation is 6DOF, with prognostic models for battery and motor temperatures, as well as the battery state of charge.
- The fault mode currently implemented is increased motor friction.
- The fault leads to increased current consumption and motor/battery overheating.
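Since the architecture relies on a particle filter, here is a minimal sketch of one bootstrap particle-filter step for tracking a continuous belief state; the transition and observation models are assumed user-supplied callables, not code from the talk.

```python
import numpy as np

def particle_filter_step(particles, weights, a, z,
                         transition_sample, obs_likelihood, rng):
    """One bootstrap particle-filter step. Assumed interfaces:
    transition_sample(s, a) draws s' ~ T(.|s, a);
    obs_likelihood(z, a, s') evaluates O(z | s', a)."""
    # Propagate each particle through the (stochastic) transition model
    particles = np.array([transition_sample(s, a) for s in particles])
    # Reweight by the likelihood of the received observation
    weights = weights * np.array([obs_likelihood(z, a, s) for s in particles])
    weights /= weights.sum()
    # Resample to avoid weight degeneracy
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```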
50. Mission Replanning Simulation
52. Resources
- Sutton and Barto book: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/
- Intro to POMDPs: http://cs.brown.edu/research/ai/pomdp/tutorial/index.html
- Stanford Autonomous Helicopter project: http://heli.stanford.edu
- NASA Vehicle Health Management (Intelligent Systems Division): http://ti.arc.nasa.gov/tech/dash/pcoe/publications/
- E. Balaban and J. J. Alonso, "A Modeling Framework for Prognostic Decision Making and its Application to UAV Mission Planning", in Proceedings of the Annual Conference of the Prognostics and Health Management Society, 2013, pp. 1-12: https://c3.nasa.gov/dashlink/resources/881/