1. AlphaZero uses self-play reinforcement learning to train a neural network to evaluate board positions and select moves. It is trained off-line by playing games against itself and using the results to iteratively improve its network (a minimal sketch of such a self-play loop follows this summary).
2. During on-line play, AlphaZero uses Monte Carlo tree search guided by the neural network to select moves. It simulates many possible continuations of the game to a limited depth, using the network to approximate position values beyond that depth.
3. AlphaZero's success comes from skillfully combining known reinforcement learning techniques, such as self-play training, neural network function approximation, and Monte Carlo tree search, with powerful computational resources.
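To make item 1 concrete, here is a minimal, self-contained sketch of self-play training on a toy game (Nim: players alternately take 1 or 2 stones, and whoever takes the last stone wins). The value table, step size, and epsilon-greedy exploration are illustrative assumptions standing in for AlphaZero's deep network and its actual training procedure:

```python
# Sketch of self-play training on toy Nim; a value table stands in for
# AlphaZero's neural network (illustrative assumptions throughout).
import random

N = 21                       # initial pile size (arbitrary toy choice)
V = {}                       # V[pile] ~ value for the player to move

def value(pile):
    return V.get(pile, 0.0)  # unseen positions default to neutral

def choose_move(pile, eps=0.1):
    """Epsilon-greedy: mostly pick the move that leaves the opponent
    in the worst position (minimal value for the player to move)."""
    moves = [m for m in (1, 2) if m <= pile]
    if random.random() < eps:
        return random.choice(moves)
    return min(moves, key=lambda m: value(pile - m))

def self_play_game():
    """Play one game against ourselves; return (state, outcome) pairs,
    where the outcome is +1/-1 from the viewpoint of the player to move."""
    pile, states = N, []
    while pile > 0:
        states.append(pile)
        pile -= choose_move(pile)
    outcomes, z = [], 1.0    # the player who took the last stone won
    for _ in states:
        outcomes.append(z)
        z = -z
    outcomes.reverse()       # alternate signs backward from the last move
    return list(zip(states, outcomes))

# Off-line training loop: self-play results iteratively improve the table.
for _ in range(5000):
    for state, z in self_play_game():
        V[state] = value(state) + 0.05 * (z - value(state))

print({p: round(value(p), 2) for p in range(1, 10)})
```

After a few thousand self-play games, the table should assign negative values to the theoretically losing piles (multiples of 3), illustrating how self-generated results iteratively improve the evaluator.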
RLTopics_2021_Lect1.pdf
1. Topics in Reinforcement Learning:
Rollout and Approximate Policy Iteration
ASU, CSE 691, Spring 2021
Links to Class Notes, Videolectures, and Slides at
http://web.mit.edu/dimitrib/www/RLbook.html
Dimitri P. Bertsekas
dbertsek@asu.edu
Lecture 1
Course Overview
2. Outline
1 AlphaZero - Off-Line Training and On-Line Play
2 History, General Concepts
3 About this Course
4 Exact Dynamic Programming - Deterministic Problems
5 Examples: Finite-State/Discrete/Combinatorial DP Problems
6 Organizational Issues
3. AlphaGo (2016) and AlphaZero (2017)
AlphaZero (Google DeepMind)
Plays different!
Learned from scratch ... with 4 hours of training!
Plays much better than all chess programs
Same algorithm learned multiple games (Go, Shogi)
AlphaZero is not just playing better, it has discovered a new way to play!
The methodology is based on the principal ideas of this course:
Off-line training/policy iteration - Self learning
Approximations with value and policy neural networks (NNs)
Parallel computation
On-line play by multistep lookahead and cost function approximations
4. AlphaZero Off-Line Training by Policy Iteration Using Self-Generated Data
[Slide figure: block diagram of policy iteration with a feature-based parametric architecture. Policy evaluation: a neural network maps a position $i$ to a feature vector $F(i) = \big(F_1(i), \ldots, F_s(i)\big)$ and produces the cost approximation $\tilde{J}_\mu\big(F(i), r\big) \approx J_\mu(i)$, where $r = (r_1, \ldots, r_s)$ is a vector of weights; with a linear weighting of features, $\tilde{J}(i, v) = \sum_{\ell=1}^{s} F_\ell(i, v)\, r_\ell$, i.e., an approximation on the subspace $\{\Phi r \mid r \in \Re^s\}$ satisfying the projected equation $\Phi r = \Pi T_\mu(\Phi r)$. The network provides position "values" and move "probabilities". Policy improvement: the current policy $\mu$ generates an "improved" policy $\hat{\mu}$ via the lookahead minimization

$$\min_{u_k, \mu_{k+1}, \ldots, \mu_{k+\ell-1}} E\Big\{ g_k(x_k, u_k, w_k) + \sum_{m=k+1}^{k+\ell-1} g_m\big(x_m, \mu_m(x_m), w_m\big) + \tilde{J}_{k+\ell}(x_{k+\ell}) \Big\},$$

combined with Monte Carlo tree search, rollout with the base policy, and a terminal cost approximation; an aggregation variant (representative feature states, possibly with handcrafted features) is also indicated.]
The "current" player plays games that are used to "train" an "improved" player
At a given position, the "value" of the position and the "move probabilities" (the player) are approximated by deep neural networks (NNs)
Successive NNs are trained using self-generated data and a form of regression
A form of randomized policy improvement is used: Monte Carlo Tree Search (MCTS)
AlphaZero bears similarity to earlier works, e.g., TD-Gammon (Tesauro, 1992), but is more complicated because of the MCTS and the deep NN
The success of AlphaZero is due to a skillful implementation/integration of known ideas, and awesome computational power
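The points above describe one round of approximate policy iteration: evaluate the current player by regression on self-generated data, then improve it by lookahead. Below is a minimal sketch on a toy stochastic shortest-path chain; the linear feature architecture $\tilde{J}_\mu(i, r) = r' F(i)$ and plain least squares stand in for the deep NN and its training, and the MDP, feature map, and episode counts are all illustrative assumptions:

```python
# Sketch of approximate policy iteration with a linear feature
# architecture trained by regression on self-generated trajectories.
# The chain MDP and the feature map are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 30                                    # states 1..N; state 0 is terminal

def step(i, a):
    """Take a in {1, 2} steps toward 0 with prob 0.8 (else stay);
    stage cost is 1 per stage plus a small control cost."""
    j = max(i - a, 0) if rng.random() < 0.8 else i
    return j, 1.0 + 0.1 * a

def features(i):
    return np.array([1.0, i, i * i])      # hypothetical feature map F(i)

def evaluate(policy, episodes=400):
    """Policy evaluation: regress r' F(i) onto sampled costs-to-go
    (the 'form of regression' on the slide, via least squares)."""
    X, y = [], []
    for _ in range(episodes):
        i = int(rng.integers(1, N + 1))
        traj, costs = [], []
        while i > 0:
            traj.append(i)
            i, g = step(i, policy(i))
            costs.append(g)
        c = 0.0
        for s, g in zip(reversed(traj), reversed(costs)):
            c += g                        # cost-to-go from state s under mu
            X.append(features(s))
            y.append(c)
    r, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return r

def improve(r):
    """Policy improvement: one-step lookahead against J~(., r)."""
    J = lambda i: 0.0 if i == 0 else float(features(i) @ r)
    def mu(i):
        q = lambda a: 1.0 + 0.1 * a + 0.8 * J(max(i - a, 0)) + 0.2 * J(i)
        return min((1, 2), key=q)
    return mu

mu = lambda i: 1                          # base policy: always take one step
for _ in range(3):                        # a few policy iteration rounds
    r = evaluate(mu)
    mu = improve(r)
print("fitted weights r:", np.round(r, 2))
```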
5. AlphaZero/AlphaGo On-Line Play by Approximation in Value Space: Multistep Lookahead, Rollout, and Cost Approximation
[Slide figure: approximation in value space at the current state $x_k$: a selective-depth lookahead tree (expanded by Monte Carlo tree search) is followed by truncated rollout with the base policy and a terminal cost/position-evaluation approximation. A chess-style position evaluator is sketched: features such as material balance, mobility, and safety are extracted from the position and linearly weighted to produce a score. The underlying computation is the lookahead minimization

$$\min_{u_k, \mu_{k+1}, \ldots, \mu_{k+\ell-1}} E\Big\{ g_k(x_k, u_k, w_k) + \sum_{m=k+1}^{k+\ell-1} g_m\big(x_m, \mu_m(x_m), w_m\big) + \tilde{J}_{k+\ell}(x_{k+\ell}) \Big\},$$

whose one-step special case is $\min_{u_k} E\big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}(x_{k+1}) \big\}$; the same structure underlies rollout and model predictive control.]
Off-line training yields a “value network” and a “policy network” that provide a
position evaluation function and a default/base policy to play
On-line play improves the default/base policy by:
Searching forward for several moves
Simulating the base policy for some more moves
Approximating the effect of future moves by using the terminal position evaluation
Backgammon programs by Tesauro (mid 1990s) use a similar methodology
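A minimal sketch of this on-line play loop (one-step lookahead over candidate moves, truncated simulation of the base policy, terminal position evaluation), written in the course's min/cost convention. All interfaces here (legal_moves, step, base_policy, evaluate) are hypothetical stand-ins, not AlphaZero's actual code:

# Sketch: lookahead + m-step base-policy rollout + terminal evaluation.
# legal_moves(x): candidate moves; step(x, u) -> (next position, cost);
# base_policy: off-line trained policy network; evaluate: value network.

def rollout_value(x, m, legal_moves, step, base_policy, evaluate):
    """Simulate the base policy for m moves from x, then score the
    terminal position with the off-line trained evaluation."""
    total = 0.0
    for _ in range(m):
        if not legal_moves(x):          # game over: score the final position
            return total + evaluate(x)
        x, cost = step(x, base_policy(x))
        total += cost
    return total + evaluate(x)          # truncate and evaluate

def online_move(x, m, legal_moves, step, base_policy, evaluate):
    """Improve on the base policy: try each legal move and pick the one
    with the smallest rollout cost."""
    def score(u):
        x1, cost = step(x, u)
        return cost + rollout_value(x1, m, legal_moves, step,
                                    base_policy, evaluate)
    return min(legal_moves(x), key=score)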
6. Our Objectives in this Course
Develop the AlphaZero/AlphaGo methodology, but more generally for
(Approximate) Dynamic Programming, including Operations Research/planning
type problems
(Approximately) Optimal Control, including Model Predictive Control and Robotics
Multiagent versions of Dynamic Programming and Optimal Control
Challenging/intractable discrete and combinatorial optimization problems
Decision and control in a changing environment (adaptive control).
We will:
Discuss off-line training methods of approximate policy iteration and Q-learning to
construct “value networks" and “policy networks"
Discuss on-line control methods based on approximation in value space, one-step
or multistep lookahead, rollout, etc
Develop the underlying theory relying on intuition and some math analysis
Develop the methodology in a unified framework that allows dealing with
deterministic and stochastic, discrete and continuous problems
7. Evolution of Approximate DP/RL
[Figure: three strands that met in the late 80s-early 90s: Decision/Control/DP (principle of optimality, Markov decision problems, POMDP, policy iteration, value iteration), AI/RL (learning through data/experience; simulation and model-free methods; feature-based representations), and A*/Games/Heuristics, contributing complementary ideas]
Historical highlights
Exact DP, optimal control (Bellman, Shannon, and others 1950s ...)
AI/RL and Decision/Control/DP ideas meet (late 80s-early 90s)
First major successes: Backgammon programs (Tesauro, 1992, 1996)
Algorithmic progress, analysis, applications, first books (mid 90s ...)
Machine Learning, BIG Data, Robotics, Deep Neural Networks (mid 2000s ...)
AlphaGo and AlphaZero (DeepMind, 2016, 2017)
8. Approximate DP/RL Methodology is now Ambitious and Universal
Exact DP applies (in principle) to a very broad range of optimization problems
Deterministic ↔ Stochastic
Combinatorial optimization ↔ Optimal control w/ infinite state/control spaces
One decision maker ↔ Two-player games
... BUT it is plagued by the curse of dimensionality and the need for a math model
Approximate DP/RL overcomes the difficulties of exact DP by:
Approximation (use neural nets and other architectures to reduce dimension)
Simulation (use a computer model in place of a math model)
State of the art:
Broadly applicable methodology: Can address a very broad range of challenging
problems. Deterministic-stochastic-dynamic, discrete-continuous, games, etc
There are no methods that are guaranteed to work for all or even most problems
There are enough methods to try with a reasonable chance of success for most
types of optimization problems
Role of the theory: Guide the art, delineate the sound ideas
9. A Relevant Quotation from 25 Years Ago
From preface of Neuro-Dynamic Programming, Bertsekas and Tsitsiklis, 1996
A few years ago our curiosity was aroused by reports on new methods in reinforcement
learning, a field that was developed primarily within the artificial intelligence community,
starting a few decades ago. These methods were aiming to provide effective
suboptimal solutions to complex problems of planning and sequential decision making
under uncertainty, that for a long time were thought to be intractable.
Our first impression was that the new methods were ambitious, overly optimistic, and
lacked firm foundation. Yet there were claims of impressive successes and indications
of a solid core to the modern developments in reinforcement learning, suggesting that
the correct approach to their understanding was through dynamic programming.
Three years later, after a lot of study, analysis, and experimentation, we believe that our
initial impressions were largely correct. This is indeed an ambitious, often ad hoc,
methodology, but for reasons that we now understand much better, it does have the
potential of success with important and challenging problems.
This assessment still holds true!
10. References of this Course
This course is research-oriented. It aims:
To explore the state of the art of approximate DP/RL at a graduate level
To explore in depth some special research topics (rollout, policy iteration)
To provide the opportunity for you to explore research in the area
Main references:
Bertsekas, Reinforcement Learning and Optimal Control, Athena Scientific, 2019
Bertsekas, Rollout, Policy Iteration, and Distributed Reinforcement Learning,
Athena Scientific, 2020
Bertsekas: Class notes based on the above, and focused on our special RL topics.
Slides, papers, and videos from the 2019 ASU course; check my web site
Supplementary references
Exact DP: Bertsekas, Dynamic Programming and Optimal Control, Vol. I (2017),
Vol. II (2012) (also contains approximate DP material)
Bertsekas and Tsitsiklis, Neuro-Dynamic Programming, 1996
Sutton and Barto, Reinforcement Learning: An Introduction, 1998 (new edition 2018, on-line)
11. Terminology in RL/AI and DP/Control
RL uses Max/Value, DP uses Min/Cost
Reward of a stage = (Opposite of) Cost of a stage.
State value = (Opposite of) State cost.
Value (or state-value) function = (Opposite of) Cost function.
Controlled system terminology
Agent = Decision maker or controller.
Action = Decision or control.
Environment = Dynamic system.
Methods terminology
Learning = Solving a DP-related problem using simulation.
Self-learning (or self-play in the context of games) = Solving a DP problem using
simulation-based policy iteration.
Planning vs Learning distinction = Solving a DP problem with model-based vs
model-free simulation.
12. Notation in RL/AI and DP/Control
Reinforcement learning uses transition probability notation p(s, a, s′) (s, s′ are
states, a is action), which is standard in finite-state Markov Decision Problems (MDP)
Control theory uses the discrete-time system equation xk+1 = f(xk, uk, wk), which is
standard in continuous-spaces control problems
Operations research uses both notations [typically pij(u) for transition probabilities]
These two notational systems are mathematically equivalent but:
Transition probabilities are cumbersome for deterministic problems and continuous
spaces problems
System equations are cumbersome for finite-state MDP
We use both notational systems:
For the first 3/4 of the course we use system equations
For the last 1/4 of the course we use transition probabilities
13. A Key Idea: Sequential Decisions w/ Approximation in Value Space
[Figure: current state, decision and cost, next state, and the chain of future decisions/costs over the current stage and future stages]
Exact DP: Making optimal decisions in stages (deterministic state transitions)
On-line: At the current state, apply the decision that minimizes
    Current Stage Cost + J∗(Next State)
where J∗(Next State) is the optimal future cost (computed off-line).
This defines an optimal policy (an optimal control to apply at each state and stage)
Approximate DP: Use an approximate cost J̃ instead of J∗
On-line: At the current state, apply the decision that minimizes (perhaps approximately)
    Current Stage Cost + J̃(Next State)
This defines a suboptimal policy
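A minimal sketch of this one-step lookahead rule; the problem data f (next state), g (stage cost), U (feasible controls) and the approximation J_tilde are hypothetical placeholders:

# Sketch: approximation in value space with one-step lookahead.

def lookahead_decision(x, U, f, g, J_tilde):
    """At state x, apply the control minimizing
    current stage cost + approximate cost of the next state."""
    return min(U(x), key=lambda u: g(x, u) + J_tilde(f(x, u)))

# Toy usage: integer states/controls, with a crude cost-to-go guess.
U = lambda x: [-1, 0, 1]
f = lambda x, u: x + u
g = lambda x, u: x * x + u * u
J_tilde = lambda x: abs(x)
print(lookahead_decision(3, U, f, g, J_tilde))   # -> -1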
14. Major Approaches to Compute the Approximate Cost Function J̃
Problem approximation
Use as J̃ the optimal cost function of a related problem (computed by exact DP)
Rollout and model predictive control
Use as J̃ the cost function of some policy (computed somehow, perhaps according to
some simplified optimization process)
Use of neural networks and other feature-based architectures
They serve as function approximators (usually obtained through off-line training)
Use of simulation to generate data to “train” the architectures
Approximation architectures involve parameters that are “optimized” using data
Policy iteration/self-learning, repeated policy changes
Multiple policies are sequentially generated; each is used to provide the data to train
the next
15. Finite Horizon Deterministic Problem
[Figure: a finite horizon deterministic trajectory x0, x1, . . . , xN generated by controls uk, with stage costs gk(xk, uk)]
System
    xk+1 = fk(xk, uk), k = 0, 1, . . . , N − 1
where xk: State, uk: Control chosen from some set Uk(xk)
Cost function:
    gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk)
For given initial state x0, minimize over control sequences {u0, . . . , uN−1}
    J(x0; u0, . . . , uN−1) = gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk)
Optimal cost function
    J∗(x0) = min_{uk∈Uk(xk), k=0,...,N−1} J(x0; u0, . . . , uN−1)
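For concreteness, a minimal sketch of this formalism in code, with made-up two-stage problem data (f, g, g_N below are illustrative, not from the slides):

# Sketch: a toy finite horizon deterministic problem with N = 2.
from itertools import product

N = 2
f = lambda k, x, u: x + u                # xk+1 = fk(xk, uk)
g = lambda k, x, u: (x - 1) ** 2 + u     # stage cost gk(xk, uk)
g_N = lambda x: 2 * abs(x)               # terminal cost gN(xN)

def J(x0, controls):
    """Cost J(x0; u0, ..., u_{N-1}) of a given control sequence."""
    x, total = x0, 0.0
    for k, u in enumerate(controls):
        total += g(k, x, u)
        x = f(k, x, u)
    return total + g_N(x)

# Optimal cost by brute force over all control sequences, Uk(xk) = {0, 1}:
print(min(J(0, us) for us in product([0, 1], repeat=N)))   # -> 2.0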
16. Principle of Optimality: A Very Simple Idea
[Figure: optimal trajectory with control sequence {u∗0, . . . , u∗N−1}; the tail subproblem starts at x∗k at time k and runs through the future stages to the terminal cost gN(xN)]
Principle of Optimality
THE TAIL OF AN OPTIMAL SEQUENCE IS OPTIMAL FOR THE TAIL SUBPROBLEM
Let {u∗0, . . . , u∗N−1} be an optimal control sequence with corresponding state sequence
{x∗1, . . . , x∗N}. Consider the tail subproblem that starts at x∗k at time k and minimizes
over {uk, . . . , uN−1} the “cost-to-go” from k to N,
    gk(x∗k, uk) + Σ_{m=k+1}^{N−1} gm(xm, um) + gN(xN).
Then the tail optimal control sequence {u∗k, . . . , u∗N−1} is optimal for the tail subproblem.
17. DP Algorithm: Solves All Tail Subproblems Using the Principle of Optimality
Idea of the DP algorithm
Solve all the tail subproblems of a given time length using the solution of all the tail
subproblems of shorter time length
By the principle of optimality: To solve the tail subproblem that starts at xk
Consider every possible uk and solve the tail subproblem that starts at the next state
xk+1 = fk(xk, uk). This gives the “cost of uk”
Optimize over all possible uk
DP Algorithm: Produces the optimal costs J∗k(xk) of the xk-tail subproblems
Start with
    J∗N(xN) = gN(xN), for all xN,
and for k = 0, . . . , N − 1, let
    J∗k(xk) = min_{uk∈Uk(xk)} [gk(xk, uk) + J∗k+1(fk(xk, uk))], for all xk.
The optimal cost J∗(x0) is obtained at the last step: J∗0(x0) = J∗(x0).
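A minimal sketch of this backward recursion for finite state spaces; the problem data (states, U, f, g, g_N) are hypothetical placeholders:

# Sketch: exact DP backward recursion, J*_N -> J*_0, finite state spaces.

def dp_tables(N, states, U, f, g, g_N):
    """Return tables J with J[k][x] = optimal cost of the x-tail subproblem."""
    J = [dict() for _ in range(N + 1)]
    for x in states(N):
        J[N][x] = g_N(x)                         # J*_N(xN) = gN(xN)
    for k in range(N - 1, -1, -1):               # k = N-1, ..., 0
        for x in states(k):
            J[k][x] = min(g(k, x, u) + J[k + 1][f(k, x, u)]
                          for u in U(k, x))
    return J

# Toy usage: from x0 = 0, at each stage stay (u = 0) or move up (u = 1);
# each upward move costs 1, and we want to end near x = 2.
N = 3
states = lambda k: range(k + 1)                  # states reachable at stage k
U = lambda k, x: [0, 1]
f = lambda k, x, u: x + u
g = lambda k, x, u: u
g_N = lambda x: (2 - x) ** 2
J = dp_tables(N, states, U, f, g, g_N)
print(J[0][0])                                   # optimal cost: 2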
18. Construction of Optimal Control Sequence {u∗0, . . . , u∗N−1}
Start with
    u∗0 ∈ arg min_{u0∈U0(x0)} [g0(x0, u0) + J∗1(f0(x0, u0))],
and
    x∗1 = f0(x0, u∗0).
Sequentially, going forward, for k = 1, 2, . . . , N − 1, set
    u∗k ∈ arg min_{uk∈Uk(x∗k)} [gk(x∗k, uk) + J∗k+1(fk(x∗k, uk))],   x∗k+1 = fk(x∗k, u∗k).
Approximation in Value Space - Use Some J̃k in Place of J∗k
Start with
    ũ0 ∈ arg min_{u0∈U0(x0)} [g0(x0, u0) + J̃1(f0(x0, u0))],
and set
    x̃1 = f0(x0, ũ0).
Sequentially, going forward, for k = 1, 2, . . . , N − 1, set
    ũk ∈ arg min_{uk∈Uk(x̃k)} [gk(x̃k, uk) + J̃k+1(fk(x̃k, uk))],   x̃k+1 = fk(x̃k, ũk).
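A minimal sketch of this forward construction; it works equally with the exact cost-to-go functions J∗k or any approximations J̃k, passed in as the list J (here a crude, made-up J̃k ≡ 0):

# Sketch: forward construction of a control sequence from cost-to-go
# functions J[k] (exact J*_k, or approximations J~_k for a suboptimal policy).

def forward_pass(N, x0, U, f, g, J):
    """Go forward from x0; at each stage pick the control minimizing
    stage cost + J[k+1](next state)."""
    x, controls, states = x0, [], [x0]
    for k in range(N):
        u = min(U(k, x), key=lambda u: g(k, x, u) + J[k + 1](f(k, x, u)))
        x = f(k, x, u)
        controls.append(u)
        states.append(x)
    return controls, states

# Toy usage with J~_k == 0 (pure greedy one-step choices):
N = 2
U = lambda k, x: [-1, 0, 1]
f = lambda k, x, u: x + u
g = lambda k, x, u: x * x + u * u
J = [lambda x: 0.0] * (N + 1)
print(forward_pass(N, 3, U, f, g, J))            # -> ([0, 0], [3, 3, 3])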
19. Finite-State Problems: Shortest Path View
[Figure: layered state transition graph over stages 0, 1, . . . , N, from the initial state s to an artificial terminal node t, with terminal arcs whose cost equals the terminal cost]
Nodes correspond to states xk
Arcs correspond to state-control pairs (xk, uk)
An arc (xk, uk) has start and end nodes xk and xk+1 = fk(xk, uk)
An arc (xk, uk) has a cost gk(xk, uk). The cost to optimize is the sum of the arc
costs from the initial node s to the terminal node t.
The problem is equivalent to finding a minimum cost/shortest path from s to t.
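A minimal sketch of this equivalence on the toy chain problem used earlier, built as a layered graph and solved with networkx's standard shortest-path routines (the problem data are hypothetical):

# Sketch: a finite horizon deterministic DP recast as a shortest path problem.
# Nodes are (stage, state); an arc per (xk, uk) with weight gk(xk, uk);
# terminal arcs into an artificial node "t" carry the terminal cost.
import networkx as nx

N = 3
states = lambda k: range(k + 1)
U = lambda k, x: [0, 1]
f = lambda k, x, u: x + u
g = lambda k, x, u: u
g_N = lambda x: (2 - x) ** 2

G = nx.DiGraph()
for k in range(N):
    for x in states(k):
        for u in U(k, x):
            G.add_edge((k, x), (k + 1, f(k, x, u)), weight=g(k, x, u))
for x in states(N):
    G.add_edge((N, x), "t", weight=g_N(x))       # artificial terminal node

s = (0, 0)                                       # initial node
print(nx.shortest_path_length(G, s, "t", weight="weight"))   # optimal cost: 2
print(nx.shortest_path(G, s, "t", weight="weight"))          # optimal path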
20. Discrete-State Deterministic Scheduling Example
[Figure: decision tree of the scheduling example; from the initial state, partial schedules A, C, then AB, AC, CA, CD, then ABC, ACB, ACD, CAB, CAD, CDA, with cost data along the arcs]
Find optimal sequence of operations A, B, C, D (A must precede B and C must precede D)
DP Problem Formulation
States: Partial schedules; Controls: Stage 0, 1, and 2 decisions; Cost data shown
along the arcs
Recall the DP idea: Break down the problem into smaller pieces (tail subproblems)
Start from the last decision and go backwards
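A minimal sketch of this formulation; the arc costs in the figure are not recoverable from the text, so the cost function below is a made-up stand-in:

# Sketch: DP over partial schedules for operations A, B, C, D,
# with A before B and C before D. cost() is hypothetical; the real
# arc costs are the ones shown along the arcs in the figure.

OPS = "ABCD"
PREC = {"B": "A", "D": "C"}                  # operation -> required predecessor

def choices(schedule):
    """Operations that can legally extend a partial schedule."""
    return [op for op in OPS
            if op not in schedule and PREC.get(op, "") in schedule]

def cost(schedule, op):
    """Made-up arc cost of performing op after the given partial schedule."""
    return 1 + (len(schedule) + ord(op)) % 3

def tail_cost(schedule):
    """Optimal cost-to-go of the tail subproblem at a partial schedule."""
    if len(schedule) == len(OPS):
        return 0.0
    return min(cost(schedule, op) + tail_cost(schedule + op)
               for op in choices(schedule))

print(tail_cost(""))                         # optimal total cost from scratch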
21. DP Algorithm: Stage 2 Tail Subproblems
[Figure: the stage 2 tail subproblems of the scheduling example, with terminal costs shown in red]
Solve the stage 2 subproblems (using the terminal costs, shown in red).
At each state of stage 2, we record the optimal cost-to-go and the optimal decision.
22. DP Algorithm: Stage 1 Tail Subproblems
[Figure: the stage 1 tail subproblems of the same scheduling example. One stage 1 subproblem is highlighted; the optimal stage 2 costs-to-go from the previous slide are shown in purple.]
Solve the stage 1 subproblems (using the optimal costs of the stage 2 subproblems, shown in purple).
At each state of stage 1, we record the optimal cost-to-go and the optimal decision.