Solving Hidden-Semi-Markov-Mode Markov Decision
Problems
SUM 2014
Emmanuel Hadoux Aurélie Beynier Paul Weng
LIP6, UPMC (Paris 6)
September, the 17th 2014
E. Hadoux, A. Beynier, P. Weng HS3MDP September, the 17th 2014 1 / 28
Introduction Definitions
Sequential decision-making problems
Sequential decision-making = make decisions at consecutive timesteps
Markov Decision Process (MDP) ⟨S, A, T, R⟩:
S Set of states
A Set of actions
T Transition function over states (T : S × A → Pr(S))
R Reward function (R : S × A → ℝ)
Non-stationary ⇒ T and/or R change over time
Introduction Definitions
sailboat problem as an MDP
S Boat positions
A Sail orientations
T Position change
R 1 at the goal, 0
otherwise
Figure 1: sailboat problem [2]
Introduction Algorithms on MDPs
Algorithms on MDPs
T and/or R unknown:
Value or Policy iteration unusable
Reinforcement learning ⇒ No convergence guarantee with
non-stationarity
Existing models and algorithms
1 Introduction
2 Existing models and algorithms
3 HM-MDPs extension
4 Experimentations
5 Conclusion and perspectives
Existing models and algorithms HM-MDP
Hidden-Mode MDP (HM-MDP) [2]
Key idea
A non-stationary environment can be seen as a
composition of stationary environments.
HM-MDP
Stationary MDPs linked by a transition
function ⇒ ⟨M, C⟩, where ∀M_i ∈ M, M_i is
an MDP ⟨S, A, T_i, R_i⟩.
M Set of modes
C Transition function over modes (C : M → Pr(M))
The new mode is drawn after each decision.
Figure 2: An HM-MDP with 3 modes, 4 states and 1 action [2].
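A minimal sketch of one HM-MDP step, assuming dictionary-based tables. The names `sample`, `hm_mdp_step`, the nested-dict layout of `T`, `R` and `C`, and the toy two-mode instance are illustrative assumptions, not taken from [2]. The state moves according to the current mode's MDP, then the new mode is drawn from C.

```python
import random


def sample(dist, rng):
    """Draw a key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def hm_mdp_step(mode, state, action, T, R, C, rng):
    """One HM-MDP step: move with the current mode's MDP, then redraw the mode.

    T[m][(s, a)] -> {s': prob}, R[m][(s, a)] -> float, C[m] -> {m': prob}.
    The agent observes next_state and reward, never the mode.
    """
    next_state = sample(T[mode][(state, action)], rng)
    reward = R[mode][(state, action)]
    next_mode = sample(C[mode], rng)
    return next_mode, next_state, reward


# Hypothetical toy instance: 2 modes, 2 states, 1 action.
T = {
    0: {(0, "a"): {0: 0.9, 1: 0.1}, (1, "a"): {0: 0.9, 1: 0.1}},
    1: {(0, "a"): {0: 0.1, 1: 0.9}, (1, "a"): {0: 0.1, 1: 0.9}},
}
R = {m: {(0, "a"): 0.0, (1, "a"): 1.0} for m in (0, 1)}
C = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}

rng = random.Random(0)
mode, state = 0, 0
for _ in range(5):
    mode, state, reward = hm_mdp_step(mode, state, "a", T, R, C, rng)
```

Keeping one table per mode mirrors the definition above: each M_i is an ordinary MDP, and only C couples them.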
Existing models and algorithms Example
sailboat problem as an HM-MDP
M = {M_i} Wind directions
S Boat positions
A Sail orientations
T_i, ∀i Position change, according to the wind
R_i, ∀i 1 at the goal, 0 otherwise
C 0.5 same mode, 0.2 adjacent modes, 0.1 opposite mode
Figure 3: sailboat problem [2]
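The mode-transition function C of the sailboat example can be written out as a matrix. The sketch below assumes four wind directions arranged on a cycle; the slide does not state the number of modes, but four is the value for which 0.5 + 2 × 0.2 + 0.1 sums to 1. The function name `sailboat_mode_matrix` is hypothetical.

```python
def sailboat_mode_matrix(p_same=0.5, p_adjacent=0.2, p_opposite=0.1):
    """Build C[m][m'] for 4 wind directions indexed 0..3 around a compass."""
    n = 4  # assumption: 4 wind directions
    by_distance = (p_same, p_adjacent, p_opposite)
    C = [[0.0] * n for _ in range(n)]
    for m in range(n):
        for m_next in range(n):
            # circular distance between the two directions: 0, 1 or 2
            gap = min((m - m_next) % n, (m_next - m) % n)
            C[m][m_next] = by_distance[gap]
    return C


C = sailboat_mode_matrix()
for row in C:
    print(row)
```

Each row is a probability distribution over the next wind direction, so every row should sum to 1 up to floating-point rounding.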
Existing models and algorithms Reformulation into a POMDP
Reformulation into a POMDP
An HM-MDP can be reformulated into a partially observable MDP
(POMDP).
POMDP
States cannot be directly observed.
⇒ ⟨S, A, O, T, R, Q⟩
O Set of observations
Q Observation function (Q : S × A → Pr(O))
In the derived POMDP, O is equivalent to the set of states (S) of the
original HM-MDP.
Existing models and algorithms Solving an HM-MDP
Solving an HM-MDP
Exact solving of the HM-MDP [2]
More efficient than solving the derived POMDP
How it works
Inference of the current mode from the observation and the belief on the
previous mode:
\[ \mu'(m') \propto \sum_{m} C(m, m')\, T_m(s, a, s')\, \mu(m) \qquad (1) \]
However, we cannot solve big instances this way.
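The mode-belief update of Equation (1) can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation of [2]: the function name `update_mode_belief`, the dictionary layout of `T` and `C`, and the toy numbers are all hypothetical.

```python
def update_mode_belief(mu, s, a, s_next, T, C):
    """Return mu'(m') proportional to sum_m C[m][m'] * T[m][(s, a)][s'] * mu[m]."""
    modes = list(mu)
    unnormalized = {
        m_next: sum(
            C[m].get(m_next, 0.0) * T[m][(s, a)].get(s_next, 0.0) * mu[m]
            for m in modes
        )
        for m_next in modes
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observed transition has zero probability under the belief")
    return {m: p / total for m, p in unnormalized.items()}


# Hypothetical toy instance: 2 modes, 2 states, 1 action.
T = {
    0: {(0, "a"): {0: 0.9, 1: 0.1}},
    1: {(0, "a"): {0: 0.1, 1: 0.9}},
}
C = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}

belief = update_mode_belief({0: 0.5, 1: 0.5}, s=0, a="a", s_next=1, T=T, C=C)
print(belief)
```

With a uniform prior, observing the move to state 1 (likely under mode 1 only) shifts the belief to roughly 0.26 for mode 0 and 0.74 for mode 1.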
Existing models and algorithms POMCP
Partially Observable Monte-Carlo Planning (POMCP) [4]
POMCP solves POMDPs
It uses Monte-Carlo sampling to avoid the curse of dimensionality
It uses a black-box simulator before acting in the real environment
(online)
It converges towards the optimal policy under some conditions
It can solve instances unreachable with the other methods
How it works
1 It maintains particles to approximate the belief function
2 It samples those particles to get the best action
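Step 1 above (maintaining particles) can be sketched as a rejection-sampling update: simulate successors of sampled particles with the black-box simulator and keep those whose observation matches the one actually received. This follows the general description of particle belief updates in [4] but is not the authors' code; `particle_belief_update`, `toy_simulator`, `max_tries` and the toy dynamics are hypothetical. The tree search that selects the action (step 2) is omitted.

```python
import random


def particle_belief_update(particles, action, observation, simulator,
                           n_particles, rng, max_tries=100_000):
    """Keep simulated successor states whose observation matches the real one."""
    new_particles = []
    tries = 0
    while len(new_particles) < n_particles and tries < max_tries:
        tries += 1
        state = rng.choice(particles)
        next_state, obs, _reward = simulator(state, action, rng)
        if obs == observation:
            new_particles.append(next_state)
    return new_particles


def toy_simulator(state, action, rng):
    """Hypothetical black box: hidden mode in {0, 1}, observed position in {0, 1}."""
    mode, _position = state
    p_move_to_1 = 0.9 if mode == 1 else 0.1
    position = 1 if rng.random() < p_move_to_1 else 0
    if rng.random() < 0.2:  # the mode switches with probability 0.2
        mode = 1 - mode
    return (mode, position), position, float(position)


particles = [(0, 0)] * 50 + [(1, 0)] * 50
updated = particle_belief_update(particles, "a", 1, toy_simulator, 200,
                                 random.Random(0))
```

The `max_tries` cap is a design choice of this sketch: when the received observation is unlikely under every particle, the loop stops instead of running forever, which is the particle shortage that large state spaces make worse.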
HM-MDPs extension
HM-MDPs extension HS3MDP
Hidden Semi-Markov Mode MDP (HS3MDP)
Hypothesis
Modes do not change at each timestep.
⇒ h_i: the environment stays h_i timesteps in mode m_i
HS3MDP
We add a duration function H = P(h' | m, m', h)
At each step:
If h_i > 0, h_{i+1} = h_i − 1 and m_{i+1} = m_i
Else:
1 Draw m_{i+1} from C
2 Draw h_{i+1} from H
Solving an HS3MDP is similar to solving an HM-MDP.
Indeed, the two models are equivalent, but not equally efficient.
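The mode and duration dynamics described above can be sketched as follows. The function name `next_mode_and_duration` and the data layout are assumptions; in particular, since H is only consulted when the remaining duration is 0, the sketch indexes it by the pair (m, m') alone.

```python
import random


def next_mode_and_duration(mode, h, C, H, rng):
    """Stay and count down while h > 0; otherwise draw m' from C, then h' from H.

    C[m] -> {m': prob}; H[(m, m')] -> {h': prob} (assumed layout).
    """
    if h > 0:
        return mode, h - 1
    modes = list(C[mode])
    next_mode = rng.choices(modes, weights=[C[mode][m] for m in modes], k=1)[0]
    durations = list(H[(mode, next_mode)])
    weights = [H[(mode, next_mode)][d] for d in durations]
    next_h = rng.choices(durations, weights=weights, k=1)[0]
    return next_mode, next_h


# Hypothetical deterministic instance: the two modes alternate,
# mode 1 lasts 2 extra steps, mode 0 lasts 1 extra step.
C = {0: {1: 1.0}, 1: {0: 1.0}}
H = {(0, 1): {2: 1.0}, (1, 0): {1: 1.0}}

rng = random.Random(0)
mode, h = 0, 0
trajectory = []
for _ in range(6):
    mode, h = next_mode_and_duration(mode, h, C, H, rng)
    trajectory.append((mode, h))
print(trajectory)
```

Setting every duration to 0 makes the mode redrawn from C at each step, which is the HM-MDP behaviour.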
HM-MDPs extension Solving an HS3MDP with POMCP
Solving an HS3MDP with POMCP
Original method:
Lack of particles with large state spaces
Adding more particles implies doing more simulations
Our solution:
Replace particle drawing by drawing a belief state from µ(m, h)
Modification of Equation (1):
\[ \mu'(m', h') \propto \sum_{m, h} \mu(m, h)\, C(m, m')\, H(m, m', h, h')\, T_m(s, a, s') \qquad (2) \]
Update the belief state with Equation (2)
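A minimal sketch of the exact belief update of Equation (2) over pairs (m, h). It is an illustration, not the authors' implementation. One assumption is made explicit: the product C·H is read as the joint transition of mode and duration, so for h > 0 the sketch applies the deterministic countdown (same mode, h − 1) and only draws from C and H when h = 0. The function name and data layout are hypothetical.

```python
def update_mode_duration_belief(mu, s, a, s_next, T, C, H):
    """mu maps (m, h) to a probability; returns the normalized updated belief."""
    new_mu = {}
    for (m, h), p in mu.items():
        weight = p * T[m][(s, a)].get(s_next, 0.0)  # mu(m, h) * T_m(s, a, s')
        if weight == 0.0:
            continue
        if h > 0:
            # countdown: the mode is kept and the duration decreases by one
            key = (m, h - 1)
            new_mu[key] = new_mu.get(key, 0.0) + weight
        else:
            # duration elapsed: new mode from C, new duration from H
            for m_next, p_mode in C[m].items():
                for h_next, p_dur in H[(m, m_next)].items():
                    key = (m_next, h_next)
                    new_mu[key] = new_mu.get(key, 0.0) + weight * p_mode * p_dur
    total = sum(new_mu.values())
    if total == 0.0:
        raise ValueError("observed transition has zero probability under the belief")
    return {key: value / total for key, value in new_mu.items()}


# Hypothetical toy instance: 2 modes, durations in {0, 1}.
T = {
    0: {(0, "a"): {0: 0.9, 1: 0.1}},
    1: {(0, "a"): {0: 0.1, 1: 0.9}},
}
C = {0: {1: 1.0}, 1: {0: 1.0}}
H = {(0, 1): {0: 0.5, 1: 0.5}, (1, 0): {0: 0.5, 1: 0.5}}

new_belief = update_mode_duration_belief(
    {(0, 1): 0.5, (1, 0): 0.5}, s=0, a="a", s_next=1, T=T, C=C, H=H
)
print(new_belief)
```

Because the belief is a table over (m, h) rather than a bag of particles, its size is bounded by the number of modes times the maximum duration, independently of the number of simulations.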
Experimentations
Experimentations
Experimentations
Orig. Original POMCP on the derived POMDP
SA Structure adapted
SAER Structure adapted and exact representation
MO-SARSOP SARSOP on MO-MDP [3]
Finite-Grid Best algorithm of Cassandra’s POMDP-Toolbox
MO-IP [1] Incremental Pruning adapted for MO-MDP
Experimentations Sailboat
Results for sailboat
Sim.   Orig.   SA      SAER    MO-SARSOP
1      60      11.7%   6.7%    408.3%
2      63      30.2%   30.2%   384.1%
4      55      38.2%   54.5%   454.5%
8      70      8.6%    27.1%   335.7%
16     59      13.6%   88.1%   416.9%
32     66      28.8%   92.4%   362.1%
64     90      21.1%   45.6%   238.9%
128    94      53.2%   71.3%   224.5%
256    119     48.7%   76.5%   156.3%
512    159     31.4%   27.0%   91.8%
1024   177     20.9%   28.8%   72.3%
2048   206     13.6%   10.2%   48.1%
4096   226     12.4%   16.4%   35.0%
8192   227     20.7%   25.6%   34.4%
Experimentations Traffic
Traffic
8 states: Waiting sides × Light sides
2 actions: Switch the left/right light on
2 modes: Main incoming side
Given transitions and rewards
Figure 4: traffic problem [2]
Experimentations Traffic
Results for traffic
Sim.   Orig.   SA      SAER    Opt.
1      -3.42   0.0%    0.0%    38.5%
2      -2.86   3.0%    4.0%    26.5%
4      -2.80   8.1%    8.8%    25.0%
8      -2.68   6.0%    9.4%    21.7%
16     -2.60   8.0%    8.0%    19.2%
32     -2.45   5.3%    6.9%    14.3%
64     -2.47   10.0%   9.1%    14.9%
128    -2.34   4.3%    3.4%    10.4%
256    -2.41   8.5%    10.5%   12.7%
512    -2.32   5.6%    4.7%    9.3%
1024   -2.31   5.1%    7.0%    9.3%
2048   -2.38   9.0%    10.5%   11.8%
Table 2: Results for traffic. Opt. stands for Finite-Grid, MO-IP and MO-SARSOP.
Experimentations Elevators
Elevators
f floors
e elevators
2^f (f 2^f)^e states
3^e actions: going up/down, open the doors
3 modes: rush up/down/both
Figure 5: Elevator control problem [2]
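To give a sense of scale, the problem sizes can be computed for the two elevator settings reported in the result tables (f = 7, e = 1 and f = 4, e = 2). The sketch assumes the state count on the slide reads 2^f (f·2^f)^e and the action count 3^e; the superscripts are reconstructed from the extracted text, and the function name is hypothetical.

```python
def elevator_problem_size(f, e):
    """Return (number of states, number of actions) for f floors and e elevators."""
    n_states = 2**f * (f * 2**f) ** e  # assumed reading: 2^f (f 2^f)^e
    n_actions = 3**e                   # assumed reading: 3^e
    return n_states, n_actions


for f, e in [(7, 1), (4, 2)]:
    n_states, n_actions = elevator_problem_size(f, e)
    print(f"f={f}, e={e}: {n_states} states, {n_actions} actions")
```

Under this reading, f = 7 and e = 1 gives 114,688 states with 3 actions, and f = 4 and e = 2 gives 65,536 states with 9 actions.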
Experimentations Elevators
Results for elevators
Sim.   Orig.    SA     SAER
1      -10.56   0.0%   1.1%
2      -10.60   0.0%   0.0%
4      -10.50   2.2%   3.6%
8      -10.49   4.2%   3.9%
16     -10.44   5.2%   5.0%
32     -10.54   6.2%   6.2%
Table 3: Results for f = 7 and e = 1
Experimentations Elevators
Results for elevators
Sim.   Orig.   SA      SAER
1      -7.41   1.0%    0.4%
2      -7.35   0.3%    0.0%
4      -7.44   1.5%    1.3%
8      -7.35   0.4%    0.0%
16     -7.30   19.1%   17.2%
32     -7.25   22.1%   21.6%
64     -7.17   24.3%   24.3%
128    -7.22   27.0%   27.0%
Table 4: Results for f = 4 and e = 2
Experimentations Random environments
Random environments
Fixed number of states, modes and actions
Random transition and reward functions with conditions
Experimentations Random environments
Results for random environments
Sim.   Orig.   SA      SAER
1      0.41    0.0%    5.6%
2      0.41    4.9%    51.4%
4      0.42    11.5%   140.9%
8      0.44    30.9%   209.6%
16     0.48    34.6%   234.7%
32     0.58    46.0%   223.0%
64     0.77    53.1%   187.2%
128    1.08    45.7%   123.4%
256    1.52    33.5%   70.0%
512    1.98    19.6%   34.5%
1024   2.30    12.5%   17.3%
Table 5: Results with n_s = 50, n_a = 5 and n_m = 5
Experimentations Random environments
Results for random environments
Sim.   Orig.   SA      SAER
1      0.39    0.1%    8.9%
2      0.39    21.0%   57.5%
4      0.40    9.9%    149.0%
8      0.41    24.0%   224.6%
16     0.43    33.0%   261.3%
32     0.48    58.2%   275.8%
64     0.60    76.2%   248.7%
128    0.83    75.4%   184.5%
256    1.16    64.1%   115.9%
512    1.61    41.5%   61.5%
1024   2.05    2.2%    28.8%
Table 6: Results with n_s = 50, n_a = 5 and n_m = 10
Experimentations Random environments
Results for random environments
Sim.   Orig.   SA       SAER
1      0.39    0.8%     11.9%
2      0.40    2.6%     51.1%
4      0.40    2.7%     138.9%
8      0.41    11.8%    225.2%
16     0.41    22.3%    270.8%
32     0.45    42.9%    290.3%
64     0.51    77.5%    305.5%
128    0.63    102.2%   261.1%
256    0.85    102.7%   186.8%
512    1.23    73.3%    107.7%
1024   1.66    43.6%    55.3%
Table 7: Results with n_s = 50, n_a = 5 and n_m = 20
Conclusion and perspectives
Conclusion
In this work, we have seen:
How to efficiently represent a subset of sequential decision-making
problems in non-stationary environments (HM-MDP)
A generalization of this model with sojourn time (HS3MDP)
How to efficiently solve those problems on big instances by adapting
POMCP
Conclusion and perspectives
Perspectives
Several issues to explore:
Learn the model → HSMM learning or context detection
Adversarial case → bandits?
Extend to multi-agent problems
Conclusion and perspectives
References
[1] Mauricio Araya-López, Vincent Thomas, Olivier Buffet, and François Charpillet. A closer look at MOMDPs. In International Conference on Tools with Artificial Intelligence (ICTAI), 2010.
[2] Samuel Ping-Man Choi. Reinforcement learning in nonstationary environments. PhD thesis, Hong Kong University of Science and Technology, 2000.
[3] Sylvie C.W. Ong, Shao Wei Png, David Hsu, and Wee Sun Lee. POMDPs for robotic tasks with mixed observability. In Robotics: Science & Systems, 2009.
[4] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In NIPS, pages 2164–2172, 2010.
E. Hadoux, A. Beynier, P. Weng HS3MDP September, the 17th 2014 28 / 28

More Related Content

What's hot

Optimal Control System Design
Optimal Control System DesignOptimal Control System Design
Optimal Control System Design
M Reza Rahmati
 
09ของไหล
09ของไหล09ของไหล
09ของไหล
Doc Edu
 

What's hot (13)

2.5 pda-capwap - gray
2.5   pda-capwap - gray2.5   pda-capwap - gray
2.5 pda-capwap - gray
 
Hyperon and charm baryons masses from twisted mass Lattice QCD
Hyperon and charm baryons masses from twisted mass Lattice QCDHyperon and charm baryons masses from twisted mass Lattice QCD
Hyperon and charm baryons masses from twisted mass Lattice QCD
 
13 fixed wing fighter aircraft- flight performance - i
13 fixed wing fighter aircraft- flight performance - i13 fixed wing fighter aircraft- flight performance - i
13 fixed wing fighter aircraft- flight performance - i
 
Optimal Control System Design
Optimal Control System DesignOptimal Control System Design
Optimal Control System Design
 
Distributed solution of stochastic optimal control problem on GPUs
Distributed solution of stochastic optimal control problem on GPUsDistributed solution of stochastic optimal control problem on GPUs
Distributed solution of stochastic optimal control problem on GPUs
 
Building Compatible Bases on Graphs, Images, and Manifolds
Building Compatible Bases on Graphs, Images, and ManifoldsBuilding Compatible Bases on Graphs, Images, and Manifolds
Building Compatible Bases on Graphs, Images, and Manifolds
 
09ของไหล
09ของไหล09ของไหล
09ของไหล
 
GPR Probing of Smoothly Layered Subsurface Medium: 3D Analytical Model
GPR Probing of Smoothly Layered Subsurface Medium: 3D Analytical ModelGPR Probing of Smoothly Layered Subsurface Medium: 3D Analytical Model
GPR Probing of Smoothly Layered Subsurface Medium: 3D Analytical Model
 
Panacm 2015 paper
Panacm 2015 paperPanacm 2015 paper
Panacm 2015 paper
 
12 performance of an aircraft with parabolic polar
12 performance of an aircraft with parabolic polar12 performance of an aircraft with parabolic polar
12 performance of an aircraft with parabolic polar
 
Class lectures on Hydrology by Rabindra Ranjan Saha Lecture 3
Class lectures on Hydrology by Rabindra Ranjan Saha  Lecture 3Class lectures on Hydrology by Rabindra Ranjan Saha  Lecture 3
Class lectures on Hydrology by Rabindra Ranjan Saha Lecture 3
 
Intro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMIntro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVM
 
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
 

Viewers also liked

Cs221 lecture8-fall11
Cs221 lecture8-fall11Cs221 lecture8-fall11
Cs221 lecture8-fall11
darwinrlo
 
Methods for Sensor Based Farrowing Prediction and Floor-heat Regulation: The ...
Methods for Sensor Based Farrowing Prediction and Floor-heat Regulation: The ...Methods for Sensor Based Farrowing Prediction and Floor-heat Regulation: The ...
Methods for Sensor Based Farrowing Prediction and Floor-heat Regulation: The ...
Aparna Udupi
 
Behavior+tree+ai lite
Behavior+tree+ai liteBehavior+tree+ai lite
Behavior+tree+ai lite
勇浩 赖
 
Derya_Sezen_POMDP_thesis
Derya_Sezen_POMDP_thesisDerya_Sezen_POMDP_thesis
Derya_Sezen_POMDP_thesis
Derya SEZEN
 
Introduction to Reinforcement Learning
Introduction to Reinforcement LearningIntroduction to Reinforcement Learning
Introduction to Reinforcement Learning
Edward Balaban
 

Viewers also liked (20)

Semi markov process
Semi markov processSemi markov process
Semi markov process
 
Transformation electric & autonomous driving and solar energy
Transformation electric & autonomous driving and solar energyTransformation electric & autonomous driving and solar energy
Transformation electric & autonomous driving and solar energy
 
Cs221 lecture8-fall11
Cs221 lecture8-fall11Cs221 lecture8-fall11
Cs221 lecture8-fall11
 
Methods for Sensor Based Farrowing Prediction and Floor-heat Regulation: The ...
Methods for Sensor Based Farrowing Prediction and Floor-heat Regulation: The ...Methods for Sensor Based Farrowing Prediction and Floor-heat Regulation: The ...
Methods for Sensor Based Farrowing Prediction and Floor-heat Regulation: The ...
 
Behavior+tree+ai lite
Behavior+tree+ai liteBehavior+tree+ai lite
Behavior+tree+ai lite
 
Monte carlo tree search
Monte carlo tree searchMonte carlo tree search
Monte carlo tree search
 
2014 Fighting Game Artificial Intelligence Competition
2014 Fighting Game Artificial Intelligence Competition2014 Fighting Game Artificial Intelligence Competition
2014 Fighting Game Artificial Intelligence Competition
 
Derya_Sezen_POMDP_thesis
Derya_Sezen_POMDP_thesisDerya_Sezen_POMDP_thesis
Derya_Sezen_POMDP_thesis
 
Advances in Game AI
Advances in Game AIAdvances in Game AI
Advances in Game AI
 
POMDP Seminar Backup3
POMDP Seminar Backup3POMDP Seminar Backup3
POMDP Seminar Backup3
 
Partially observable Markov decision processes for spoken dialog systems
Partially observable Markov decision processes for spoken dialog systemsPartially observable Markov decision processes for spoken dialog systems
Partially observable Markov decision processes for spoken dialog systems
 
Designing States, Actions, and Rewards for Using POMDP in Session Search
Designing States, Actions, and Rewards for Using POMDP in Session SearchDesigning States, Actions, and Rewards for Using POMDP in Session Search
Designing States, Actions, and Rewards for Using POMDP in Session Search
 
Inverse Reinforcement On POMDP
Inverse Reinforcement On POMDPInverse Reinforcement On POMDP
Inverse Reinforcement On POMDP
 
Hierarchical Pomdp Planning And Execution
Hierarchical Pomdp Planning And ExecutionHierarchical Pomdp Planning And Execution
Hierarchical Pomdp Planning And Execution
 
Mcts ai
Mcts aiMcts ai
Mcts ai
 
Application of Monte Carlo Tree Search in a Fighting Game AI (GCCE 2016)
Application of Monte Carlo Tree Search in a Fighting Game AI (GCCE 2016)Application of Monte Carlo Tree Search in a Fighting Game AI (GCCE 2016)
Application of Monte Carlo Tree Search in a Fighting Game AI (GCCE 2016)
 
Introduction to Reinforcement Learning
Introduction to Reinforcement LearningIntroduction to Reinforcement Learning
Introduction to Reinforcement Learning
 
2016 Fighting Game Artificial Intelligence Competition
2016 Fighting Game Artificial Intelligence Competition2016 Fighting Game Artificial Intelligence Competition
2016 Fighting Game Artificial Intelligence Competition
 
Mercedes - Autonomous Driving - The S500 Intelligent drive
Mercedes - Autonomous Driving - The S500 Intelligent driveMercedes - Autonomous Driving - The S500 Intelligent drive
Mercedes - Autonomous Driving - The S500 Intelligent drive
 
Imitation Learning for Autonomous Driving in TORCS
Imitation Learning for Autonomous Driving in TORCSImitation Learning for Autonomous Driving in TORCS
Imitation Learning for Autonomous Driving in TORCS
 

Similar to Solving Hidden-Semi-Markov-Mode Markov Decision problems

Presentation for Cree Interview
Presentation for Cree InterviewPresentation for Cree Interview
Presentation for Cree Interview
dmtrombly
 
Secrets of supercomputing
Secrets of supercomputingSecrets of supercomputing
Secrets of supercomputing
fikrul islamy
 
Secrets of supercomputing
Secrets of supercomputingSecrets of supercomputing
Secrets of supercomputing
fikrul islamy
 

Similar to Solving Hidden-Semi-Markov-Mode Markov Decision problems (20)

Large strain computational solid dynamics: An upwind cell centred Finite Volu...
Large strain computational solid dynamics: An upwind cell centred Finite Volu...Large strain computational solid dynamics: An upwind cell centred Finite Volu...
Large strain computational solid dynamics: An upwind cell centred Finite Volu...
 
Smart Systems for Urban Water Demand Management
Smart Systems for Urban Water Demand ManagementSmart Systems for Urban Water Demand Management
Smart Systems for Urban Water Demand Management
 
Melles
MellesMelles
Melles
 
Impact of the Time Step in DEM Simulations on Granular Mixing Properties
Impact of the Time Step in DEM Simulations on Granular Mixing PropertiesImpact of the Time Step in DEM Simulations on Granular Mixing Properties
Impact of the Time Step in DEM Simulations on Granular Mixing Properties
 
Essentials of Chemical Reaction Engineering 1st Edition Fogler Solutions Manual
Essentials of Chemical Reaction Engineering 1st Edition Fogler Solutions ManualEssentials of Chemical Reaction Engineering 1st Edition Fogler Solutions Manual
Essentials of Chemical Reaction Engineering 1st Edition Fogler Solutions Manual
 
LAPDE ..PPT (STANDARD TYPES OF PDE).pptx
LAPDE ..PPT (STANDARD TYPES OF PDE).pptxLAPDE ..PPT (STANDARD TYPES OF PDE).pptx
LAPDE ..PPT (STANDARD TYPES OF PDE).pptx
 
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...
 
A Self-Tuned Simulated Annealing Algorithm using Hidden Markov Mode
A Self-Tuned Simulated Annealing Algorithm using Hidden Markov ModeA Self-Tuned Simulated Annealing Algorithm using Hidden Markov Mode
A Self-Tuned Simulated Annealing Algorithm using Hidden Markov Mode
 
Applications of Homotopy perturbation Method and Sumudu Transform for Solving...
Applications of Homotopy perturbation Method and Sumudu Transform for Solving...Applications of Homotopy perturbation Method and Sumudu Transform for Solving...
Applications of Homotopy perturbation Method and Sumudu Transform for Solving...
 
Dynamic Economic Dispatch Assessment Using Particle Swarm Optimization Technique
Dynamic Economic Dispatch Assessment Using Particle Swarm Optimization TechniqueDynamic Economic Dispatch Assessment Using Particle Swarm Optimization Technique
Dynamic Economic Dispatch Assessment Using Particle Swarm Optimization Technique
 
Presentation for Cree Interview
Presentation for Cree InterviewPresentation for Cree Interview
Presentation for Cree Interview
 
We P3 09
We P3 09We P3 09
We P3 09
 
Representation learning in limited-data settings
Representation learning in limited-data settingsRepresentation learning in limited-data settings
Representation learning in limited-data settings
 
EAGE_prsentation_Anderson.pptx
EAGE_prsentation_Anderson.pptxEAGE_prsentation_Anderson.pptx
EAGE_prsentation_Anderson.pptx
 
Secrets of supercomputing
Secrets of supercomputingSecrets of supercomputing
Secrets of supercomputing
 
Secrets of supercomputing
Secrets of supercomputingSecrets of supercomputing
Secrets of supercomputing
 
Gy3312241229
Gy3312241229Gy3312241229
Gy3312241229
 
Acceleration Schemes Of The Discrete Velocity Method Gaseous Flows In Rectan...
Acceleration Schemes Of The Discrete Velocity Method  Gaseous Flows In Rectan...Acceleration Schemes Of The Discrete Velocity Method  Gaseous Flows In Rectan...
Acceleration Schemes Of The Discrete Velocity Method Gaseous Flows In Rectan...
 
Smoothed Particle Hydrodynamics
Smoothed Particle HydrodynamicsSmoothed Particle Hydrodynamics
Smoothed Particle Hydrodynamics
 
A 1 D Breakup Model For
A 1 D Breakup Model ForA 1 D Breakup Model For
A 1 D Breakup Model For
 

Recently uploaded

Warming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptxWarming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptx
GlendelCaroz
 
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 

Recently uploaded (20)

Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
 
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
 
EU START PROJECT. START-Newsletter_Issue_4.pdf
EU START PROJECT. START-Newsletter_Issue_4.pdfEU START PROJECT. START-Newsletter_Issue_4.pdf
EU START PROJECT. START-Newsletter_Issue_4.pdf
 
Warming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptxWarming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptx
 
Taphonomy and Quality of the Fossil Record
Taphonomy and Quality of the  Fossil RecordTaphonomy and Quality of the  Fossil Record
Taphonomy and Quality of the Fossil Record
 
Film Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfFilm Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdf
 
Costs to heap leach gold ore tailings in Karamoja region of Uganda
Costs to heap leach gold ore tailings in Karamoja region of UgandaCosts to heap leach gold ore tailings in Karamoja region of Uganda
Costs to heap leach gold ore tailings in Karamoja region of Uganda
 
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxSaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
 
X-rays from a Central “Exhaust Vent” of the Galactic Center Chimney
X-rays from a Central “Exhaust Vent” of the Galactic Center ChimneyX-rays from a Central “Exhaust Vent” of the Galactic Center Chimney
X-rays from a Central “Exhaust Vent” of the Galactic Center Chimney
 
Efficient spin-up of Earth System Models usingsequence acceleration
Efficient spin-up of Earth System Models usingsequence accelerationEfficient spin-up of Earth System Models usingsequence acceleration
Efficient spin-up of Earth System Models usingsequence acceleration
 
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
 
ANITINUTRITION FACTOR GYLCOSIDES SAPONINS CYANODENS
ANITINUTRITION FACTOR GYLCOSIDES SAPONINS CYANODENSANITINUTRITION FACTOR GYLCOSIDES SAPONINS CYANODENS
ANITINUTRITION FACTOR GYLCOSIDES SAPONINS CYANODENS
 
GBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of AsepsisGBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of Asepsis
 
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdfFORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
 
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY // USES OF ANTIOBIOTICS TYPES OF ANTIB...
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY  // USES OF ANTIOBIOTICS TYPES OF ANTIB...ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY  // USES OF ANTIOBIOTICS TYPES OF ANTIB...
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY // USES OF ANTIOBIOTICS TYPES OF ANTIB...
 
Factor Causing low production and physiology of mamary Gland
Factor Causing low production and physiology of mamary GlandFactor Causing low production and physiology of mamary Gland
Factor Causing low production and physiology of mamary Gland
 
Introduction and significance of Symbiotic algae
Introduction and significance of  Symbiotic algaeIntroduction and significance of  Symbiotic algae
Introduction and significance of Symbiotic algae
 
RACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptxRACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptx
 
Fun for mover student's book- English book for teaching.pdf
Fun for mover student's book- English book for teaching.pdfFun for mover student's book- English book for teaching.pdf
Fun for mover student's book- English book for teaching.pdf
 
Vital Signs of Animals Presentation By Aftab Ahmed Rahimoon
Vital Signs of Animals Presentation By Aftab Ahmed RahimoonVital Signs of Animals Presentation By Aftab Ahmed Rahimoon
Vital Signs of Animals Presentation By Aftab Ahmed Rahimoon
 

Solving Hidden-Semi-Markov-Mode Markov Decision problems

  • 1. Solving Hidden-Semi-Markov-Mode Markov Decision Problems SUM 2014 Emmanuel Hadoux Aurélie Beynier Paul Weng LIP6, UPMC (Paris 6) September, the 17th 2014 E. Hadoux, A. Beynier, P. Weng HS3MDP September, the 17th 2014 1 / 28
  • 2. Introduction Definitions Sequential decision-making problems Sequential decision-making = make decisions at consecutive timesteps Markov Decision Process (MDP) (< S, A, T, R >): S Set of states A Set of actions T Transition function over states (T : S × A → Pr(S)) R Reward function (R : S × A → R) Non-stationary ⇒ T and/or R E. Hadoux, A. Beynier, P. Weng HS3MDP September, the 17th 2014 2 / 28
  • 3. Introduction Definitions sailboat problem as an MDP S Boat positions A Sail orientations T Position change R 1 at the goal, 0 otherwise Figure 1: sailboat problem [2] E. Hadoux, A. Beynier, P. Weng HS3MDP September, the 17th 2014 3 / 28
  • 4. Introduction Algorithms on MDPs Algorithms on MDPs T and/or R unknown: Value or Policy iteration unusable Reinforcement learning ⇒ No convergence guarantee with non-stationarity E. Hadoux, A. Beynier, P. Weng HS3MDP September, the 17th 2014 4 / 28
  • 5. Existing models and algorithms 1 Introduction 2 Existing models and algorithms 3 HM-MDPs extension 4 Experimentations 5 Conclusion and perspectives E. Hadoux, A. Beynier, P. Weng HS3MDP September, the 17th 2014 5 / 28
  • 6. Existing models and algorithms HM-MDP Hidden-Mode MDP (HM-MDP) [2] Key idea Non-stationary env. can be seen as a composition of stationary env. HM-MDP Stat. MDPs, linked by a transition function ⇒ M, C , ∀Mi ∈ M, Mi is an MDP S, A, Ti, Ri . M Set of modes C Transition function over modes (C : M → Pr(M)) The new mode is drawn after each decision. Figure 2: 3 modes, 4 states, 1 action HM-MDP [2]. E. Hadoux, A. Beynier, P. Weng HS3MDP September, the 17th 2014 6 / 28
  • 7. Existing models and algorithms Exemple sailboat problem as an HM-MDP M = {Mi} Wind directions S Boat positions A Sail orientations Ti, ∀i Position change, according to the wind Ri, ∀i 1 at the goal, 0 otherwise C 0.5 same mode, 0.2 adjacent modes, 0.1 opposite mode Figure 3: sailboat problem [2] E. Hadoux, A. Beynier, P. Weng HS3MDP September, the 17th 2014 7 / 28
  • 8. Existing models and algorithms Reformulation into a POMDP Reformulation into a POMDP An HM-MDP can be reformulated into a partially observable MDP (POMDP). POMDP States cannot be directly observed. ⇒< S, A, O, T , R, Q > O Set of observations Q Observation function (Q : S × A → Pr(O)) In the derived POMDP, O is equivalent to the set of states (S) of the original HM-MDP. E. Hadoux, A. Beynier, P. Weng HS3MDP September, the 17th 2014 8 / 28
  • 9. Existing models and algorithms Solving an HM-MDP Solving an HM-MDP Exact solving of the HM-MDP [2] More efficient than solving the derived POMDP How it works Inference of the current mode from the observation and the belief on the previous mode: µ (m ) ∝ m C(m, m )Tm(s, a, s )µ(m) (1) However, we cannot solve big instances this way. E. Hadoux, A. Beynier, P. Weng HS3MDP September, the 17th 2014 9 / 28
  • 11. Partially Observable Monte-Carlo Planning (POMCP) [4]. POMCP solves POMDPs. It uses Monte-Carlo sampling to avoid the curse of dimensionality. It uses a black-box simulator before acting in the real environment (online). It converges towards the optimal policy under some conditions. It can solve instances unreachable with the other methods. How it works: (1) it maintains particles to approximate the belief function; (2) it samples those particles to get the best action.
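The particle-based belief mentioned above can be sketched as a rejection-sampling update: draw a particle, push it through the black-box simulator, and keep the result only if the simulated observation matches the real one. This covers only the belief-maintenance half of POMCP, not the Monte-Carlo tree search that selects actions. The simulator signature `simulator(state, action, rng) -> (next_state, observation, reward)` and the toy simulator are assumptions made for the example.

```python
import random


def update_particles(particles, action, observation, simulator,
                     n_particles, rng, max_tries=100_000):
    """Rejection-sampling particle update (belief part only, no tree search)."""
    new_particles = []
    tries = 0
    while len(new_particles) < n_particles:
        if tries >= max_tries:
            raise RuntimeError("too few particles match the real observation")
        tries += 1
        state = rng.choice(particles)
        next_state, simulated_observation, _reward = simulator(state, action, rng)
        # Keep the successor only if the simulator reproduced the real observation.
        if simulated_observation == observation:
            new_particles.append(next_state)
    return new_particles


def noisy_mode_simulator(state, action, rng):
    """Toy black box: the hidden state never changes and is observed correctly 90% of the time."""
    observation = state if rng.random() < 0.9 else 1 - state
    return state, observation, 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    prior = [0] * 250 + [1] * 250
    posterior = update_particles(prior, 0, 1, noisy_mode_simulator, 500, rng)
    print(sum(posterior) / len(posterior))  # fraction of particles in state 1
```

The `max_tries` guard makes the weakness visible: when few particles are compatible with the observation, many simulations are needed to refill the belief.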
  • 12. Outline: 1. Introduction; 2. Existing models and algorithms; 3. HM-MDPs extension (current section); 4. Experiments; 5. Conclusion and perspectives.
  • 13. Hidden Semi-Markov Mode MDP (HS3MDP). Hypothesis: modes do not change at each timestep; $h_i$ is the number of timesteps the environment stays in mode $m_i$. An HS3MDP adds a duration function $H = P(h' \mid m, m', h)$. At each step: if $h_i > 0$, then $h_{i+1} = h_i - 1$ and $m_{i+1} = m_i$; otherwise, (1) draw $m_{i+1}$ from C, then (2) draw $h_{i+1}$ from H. Solving an HS3MDP is similar to solving an HM-MDP. Indeed, the two models are equivalent, but not equally efficient.
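The per-step rule for the mode and its remaining duration can be sketched as follows. The representation is an assumption made for the example: `C[m]` is a list of probabilities over next modes, and `draw_duration(m, m_next, rng)` is a hypothetical sampler standing in for the duration function H once the remaining duration has reached zero.

```python
import random


def step_mode(mode, remaining, C, draw_duration, rng):
    """One HS3MDP mode step: count down while h > 0, otherwise redraw mode and duration."""
    if remaining > 0:
        # The environment stays in the same mode; only the counter decreases.
        return mode, remaining - 1
    next_modes = list(range(len(C[mode])))
    next_mode = rng.choices(next_modes, weights=C[mode])[0]  # draw m_{i+1} from C
    next_remaining = draw_duration(mode, next_mode, rng)      # draw h_{i+1} from H
    return next_mode, next_remaining


if __name__ == "__main__":
    rng = random.Random(1)
    C = [[0.5, 0.5], [0.5, 0.5]]

    def uniform_duration(mode, next_mode, rng):
        # Illustrative duration distribution: 0 to 3 extra timesteps, uniformly.
        return rng.randint(0, 3)

    mode, remaining = 0, 2
    for _ in range(8):
        mode, remaining = step_mode(mode, remaining, C, uniform_duration, rng)
        print(mode, remaining)
```

Setting every drawn duration to zero makes the mode change at each decision, which recovers the HM-MDP behaviour described earlier.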
  • 18. Solving an HS3MDP with POMCP. Original method: with a big state space, POMCP lacks particles, and adding more particles implies doing more simulations. Our solution: replace particle drawing by drawing a belief state from $\mu(m, h)$. Equation (1) is modified into: $\mu'(m', h') \propto \sum_{m,h} \mu(m, h)\, C(m, m')\, H(m, m', h, h')\, T_m(s, a, s')$ (2). The belief state is then updated with Equation (2).
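A minimal sketch of the exact belief update of Equation (2) is given below. It transcribes the equation literally. The slides do not say how the case $h > 0$, where the mode is kept, is folded into C and H, so that encoding is left to the caller. The data layout is an assumption: `mu` is a dictionary keyed by `(m, h)`, `H` is a callable `H(m, m_next, h, h_next)`, `durations` lists the possible values of h, and `T[m][s][a][s_next]` is indexed as in a plain nested list.

```python
def update_mode_duration_belief(mu, C, H, T, s, a, s_next, durations):
    """Equation (2): mu'(m', h') is proportional to
    sum_{m,h} mu(m, h) * C(m, m') * H(m, m', h, h') * T_m(s, a, s')."""
    unnormalized = {}
    for m_next in range(len(C)):
        for h_next in durations:
            unnormalized[(m_next, h_next)] = sum(
                p * C[m][m_next] * H(m, m_next, h, h_next) * T[m][s][a][s_next]
                for (m, h), p in mu.items()
            )
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observed transition has zero probability under the belief")
    return {key: weight / total for key, weight in unnormalized.items()}


if __name__ == "__main__":
    # Toy instance: 2 modes, durations 0 or 1, 2 states, 1 action (illustrative values).
    C = [[0.8, 0.2], [0.2, 0.8]]
    T = [
        [[[0.1, 0.9]], [[0.0, 1.0]]],
        [[[0.9, 0.1]], [[0.0, 1.0]]],
    ]

    def uniform_H(m, m_next, h, h_next):
        return 0.5  # each of the two durations is equally likely

    mu = {(0, 0): 0.5, (1, 0): 0.5}
    print(update_mode_duration_belief(mu, C, uniform_H, T, 0, 0, 1, durations=[0, 1]))
```

Because the belief is stored exactly over (m, h) pairs, its size depends on the number of modes and durations rather than on the number of simulations.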
  • 19. Outline: 1. Introduction; 2. Existing models and algorithms; 3. HM-MDPs extension; 4. Experiments (current section); 5. Conclusion and perspectives.
  • 20. Experiments. Compared algorithms: Orig.: original POMCP on the derived POMDP. SA: structure adapted. SAER: structure adapted and exact representation. MO-SARSOP: SARSOP on MO-MDP [3]. Finite-Grid: best algorithm of Cassandra's POMDP-Toolbox. MO-IP: Incremental Pruning adapted for MO-MDP [1].
  • 21. Results for sailboat
      Sim.    Orig.   SA       SAER     MO-SARSOP
      1       60      11.7%    6.7%     408.3%
      2       63      30.2%    30.2%    384.1%
      4       55      38.2%    54.5%    454.5%
      8       70      8.6%     27.1%    335.7%
      16      59      13.6%    88.1%    416.9%
      32      66      28.8%    92.4%    362.1%
      64      90      21.1%    45.6%    238.9%
      128     94      53.2%    71.3%    224.5%
      256     119     48.7%    76.5%    156.3%
      512     159     31.4%    27.0%    91.8%
      1024    177     20.9%    28.8%    72.3%
      2048    206     13.6%    10.2%    48.1%
      4096    226     12.4%    16.4%    35.0%
      8192    227     20.7%    25.6%    34.4%
  • 22. Traffic. 8 states: waiting sides × light sides. 2 actions: switch the left/right light on. 2 modes: main incoming side. Transitions and rewards are given. Figure 4: traffic problem [2].
  • 23. Results for traffic
      Sim.    Orig.    SA       SAER     Opt.
      1       -3.42    0.0%     0.0%     38.5%
      2       -2.86    3.0%     4.0%     26.5%
      4       -2.80    8.1%     8.8%     25.0%
      8       -2.68    6.0%     9.4%     21.7%
      16      -2.60    8.0%     8.0%     19.2%
      32      -2.45    5.3%     6.9%     14.3%
      64      -2.47    10.0%    9.1%     14.9%
      128     -2.34    4.3%     3.4%     10.4%
      256     -2.41    8.5%     10.5%    12.7%
      512     -2.32    5.6%     4.7%     9.3%
      1024    -2.31    5.1%     7.0%     9.3%
      2048    -2.38    9.0%     10.5%    11.8%
    Table 2: Results for traffic. Opt. stands for Finite Grid, MO-IP and MO-SARSOP.
  • 24. Elevators. f floors, e elevators. $2^f (f\,2^f)^e$ states. $3^e$ actions: going up/down, open the doors. 3 modes: rush up/down/both. Figure 5: elevator control problem [2].
  • 25. Results for elevators
      Sim.    Orig.     SA      SAER
      1       -10.56    0.0%    1.1%
      2       -10.60    0.0%    0.0%
      4       -10.50    2.2%    3.6%
      8       -10.49    4.2%    3.9%
      16      -10.44    5.2%    5.0%
      32      -10.54    6.2%    6.2%
    Table 3: Results for f = 7 and e = 1.
  • 26. Results for elevators
      Sim.    Orig.    SA       SAER
      1       -7.41    1.0%     0.4%
      2       -7.35    0.3%     0.0%
      4       -7.44    1.5%     1.3%
      8       -7.35    0.4%     0.0%
      16      -7.30    19.1%    17.2%
      32      -7.25    22.1%    21.6%
      64      -7.17    24.3%    24.3%
      128     -7.22    27.0%    27.0%
    Table 4: Results for f = 4 and e = 2.
  • 27. Random environments. Fixed number of states, modes and actions. Random transition and reward functions with conditions.
  • 28. Results for random environments
      Sim.    Orig.    SA       SAER
      1       0.41     0.0%     5.6%
      2       0.41     4.9%     51.4%
      4       0.42     11.5%    140.9%
      8       0.44     30.9%    209.6%
      16      0.48     34.6%    234.7%
      32      0.58     46.0%    223.0%
      64      0.77     53.1%    187.2%
      128     1.08     45.7%    123.4%
      256     1.52     33.5%    70.0%
      512     1.98     19.6%    34.5%
      1024    2.30     12.5%    17.3%
    Table 5: Results with $n_s = 50$, $n_a = 5$ and $n_m = 5$.
  • 29. Results for random environments
      Sim.    Orig.    SA       SAER
      1       0.39     0.1%     8.9%
      2       0.39     21.0%    57.5%
      4       0.40     9.9%     149.0%
      8       0.41     24.0%    224.6%
      16      0.43     33.0%    261.3%
      32      0.48     58.2%    275.8%
      64      0.60     76.2%    248.7%
      128     0.83     75.4%    184.5%
      256     1.16     64.1%    115.9%
      512     1.61     41.5%    61.5%
      1024    2.05     2.2%     28.8%
    Table 6: Results with $n_s = 50$, $n_a = 5$ and $n_m = 10$.
  • 30. Results for random environments
      Sim.    Orig.    SA        SAER
      1       0.39     0.8%      11.9%
      2       0.40     2.6%      51.1%
      4       0.40     2.7%      138.9%
      8       0.41     11.8%     225.2%
      16      0.41     22.3%     270.8%
      32      0.45     42.9%     290.3%
      64      0.51     77.5%     305.5%
      128     0.63     102.2%    261.1%
      256     0.85     102.7%    186.8%
      512     1.23     73.3%     107.7%
      1024    1.66     43.6%     55.3%
    Table 7: Results with $n_s = 50$, $n_a = 5$ and $n_m = 20$.
  • 31. Conclusion. In this work, we have seen: how to efficiently represent a subset of sequential decision-making problems in non-stationary environments (HM-MDP); a generalization of this model with sojourn time (HS3MDP); how to efficiently solve those problems on big instances by adapting POMCP.
  • 32. Perspectives. Several issues to explore: learning the model (HSMM learning or context detection); the adversarial case (bandits?); extension to multi-agent problems.
  • 33. References
    [1] Mauricio Araya-López, Vincent Thomas, Olivier Buffet, and François Charpillet. A closer look at MOMDPs. In International Conference on Tools with Artificial Intelligence (ICTAI), 2010.
    [2] Samuel Ping-Man Choi. Reinforcement learning in nonstationary environments. PhD thesis, Hong Kong University of Science and Technology, 2000.
    [3] Sylvie C.W. Ong, Shao Wei Png, David Hsu, and Wee Sun Lee. POMDPs for robotic tasks with mixed observability. In Robotics: Science & Systems, 2009.
    [4] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In NIPS, pages 2164–2172, 2010.